cs.CV [Total: 49]
cs.CL [Total: 54]
eess.IV [Total: 1]
cs.LG [Total: 8]
cs.SE [Total: 1]
cs.RO [Total: 5]
eess.AS [Total: 1]
cs.AI [Total: 3]
cs.SD [Total: 2]

cs.CV [Back]

[1] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices cs.CV | cs.AI | eess.IVPDF

Aryan Mathur, Asaduddin Ahmed, Pushti Amit Vasoya, Simeon Kandan Sonar, Yasir Z

TL;DR: 该论文提出了一种可解释的图像真实性检测系统，结合轻量级卷积分类器和视觉语言模型，用于检测和定位AI生成图像中的伪影，并在低分辨率图像上实现高精度和快速推理。

Details

Motivation: 随着AI生成图像的逼真度提升，验证图像真实性变得更具挑战性。本文旨在开发一种能够在边缘设备上高效运行的可解释检测工具，以减少虚假信息的传播。

Result: 在对抗扰动增强的CiFAKE数据集上达到96.5%准确率，推理时间为175ms（8核CPU），适用于边缘设备部署。

Insight: 结合视觉与语言推理在低分辨率图像真实性检测中具有潜力，可扩展到法医、工业检测及社交媒体管理等领域。

Abstract: The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier (“Faster-Than-Lies”) with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.

[2] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting cs.CV | cs.AIPDF

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

TL;DR: CountFormer利用Transformer框架，通过自监督基础模型DINOv2学习视觉重复和结构关系，实现类无关对象计数，并在FSC-147数据集上表现优异。

Details

Motivation: 人类能够通过感知视觉重复和结构关系而非类别身份轻松计数，而现有模型难以处理复杂形状或密集场景的计数问题。

Result: 在FSC-147数据集上，性能媲美SOTA，且在结构复杂或密集场景中更准确。

Insight: 引入基础模型（如DINOv2）可使计数系统更接近人类的结构感知能力，迈向无需示例的通用计数范式。

Abstract: Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.

[3] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras cs.CV | cs.AIPDF

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon

TL;DR: 本研究提出了一种结合几何模型和深度学习的新方法，用于监测城市河流中的漂浮人为垃圾，重点解决了数据泄漏和负样本问题，并验证了模型的准确性和速度。

Details

Motivation: 随着漂浮人为垃圾对环境和人类活动的负面影响日益严重，需要一种低成本、自动化的监测方法。本研究通过固定摄像头和深度学习解决这一问题。

Result: 研究表明数据集构建协议（尤其是负样本和时间泄漏）的重要性，并提出了一种基于投影几何和回归修正的物体尺寸估算方法。

Insight: 通过低成本摄像头和自动化方法，可实现城市水域漂浮垃圾的稳健监测，为环境管理提供实用工具。

Abstract: The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for the monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.

[4] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features cs.CVPDF

Forouzan Fallah, Wenwen Li, Chia-Yu Hsu, Hyunho Lee, Yezhou Yang

TL;DR: RareFlow提出了一种基于物理感知的超分辨率框架，针对罕见地貌特征的多传感器数据，通过双重调节架构和不确定性量化实现高保真结果。

Details

Motivation: 传统超分辨率方法在罕见地貌的多传感器数据上表现不佳，易产生视觉合理但物理不准确的结果，亟需一种鲁棒的解决方案。

Result: 在地球物理专家盲评测中接近真实影像质量，FID指标提升近40%，显著优于现有方法。

Insight: RareFlow为数据稀缺领域的高保真合成提供了新范式，尤其在严重域偏移场景下表现突出。

Abstract: Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow’s core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model’s outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.

[5] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning cs.CV | cs.AI | cs.LGPDF

Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo

TL;DR: DynaStride提出了一种动态步长窗口化方法，结合多模态思维链（MMCoT），用于生成教学视频的多场景连贯描述，无需手动场景分割，显著提升了描述质量和时序一致性。

Details

Motivation: 教学视频的场景级描述需要同时理解视觉提示和时序结构，但现有方法生成的描述可能缺乏连贯性和质量，影响教育效果。DynaStride旨在填补这一空白。

Result: 在YouCookII数据集上，DynaStride在BLEU、METEOR等指标上优于VLLaMA3和GPT-4o，生成描述更连贯和信息丰富。

Insight: 动态步长窗口化与多模态思维链的结合能有效提升教学视频描述的时序连贯性和语义质量，为AI生成教学内容提供了新方向。

Abstract: Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video’s educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.

[6] TurboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis cs.CVPDF

Emily Kim, Julieta Martinez, Timur Bagautdinov, Jessica Hodgins

TL;DR: TurboPortrait3D提出了一种基于扩散模型的单步快速肖像新视角合成方法，解决了现有方法在3D肖像生成中的视觉伪影和身份保持问题，同时保持低延迟和高效率。

Details

Motivation: 现有图像到3D模型在肖像生成中存在视觉伪影、细节缺失和身份保持不足的问题，而扩散模型虽然生成高质量图像，但计算昂贵且缺乏3D一致性。

Result: TurboPortrait3D在质量和效率上均优于当前SOTA方法，定量和定性结果均表现优异。

Insight: 扩散模型可以通过特定训练策略高效增强3D生成任务的图像质量，同时保持3D一致性和低延迟。

Abstract: We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack of detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low-latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms current state-of-the-art for portrait novel-view synthesis, while being efficient in time.

[7] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors cs.CVPDF

Xirui Jin, Renbiao Jin, Boying Li, Danping Zou, Wenxian Yu

TL;DR: PlanarGS是一个基于3D高斯溅射（3DGS）的框架，专为室内场景重建设计，通过引入语言提示的平面先验（LP3）和几何先验，解决了在低纹理区域中3DGS几何模糊的问题，显著提升了重建质量。

Details

Motivation: 传统3DGS在低纹理区域（如室内场景）中优化时，仅依赖光度损失会导致几何模糊，无法恢复高保真的3D表面。因此，需要引入额外的先验信息来提升重建效果。

Result: 在标准室内基准测试中，PlanarGS显著优于现有方法，重建出准确且细节丰富的3D表面。

Insight: 视觉-语言模型可以作为高层先验有效引导低纹理区域的几何重建，结合几何先验能显著提升3DGS的鲁棒性和保真度。

Abstract: Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io

[8] Adaptive Training of INRs via Pruning and Densification cs.CVPDF

Diana Aldana, João Paulo Lima, Daniel Csillag, Daniel Perazzo, Haoan Feng

TL;DR: 论文提出了一种名为AIRe的自适应训练方法，通过剪枝和密集化优化隐式神经表示（INRs）的结构，实现了模型大小与重构质量的更好权衡。

Details

Motivation: 隐式神经表示（INRs）需要合理的输入频率选择和架构设计，但目前依赖启发式方法和繁琐的超参数优化。论文旨在通过自适应训练解决这一问题。

Result: 在图像和SDF上的实验表明，AIRe在减小模型规模的同时保持或提升了重构质量。

Insight: 自适应调整INRs结构可以有效平衡模型复杂度和性能，避免了传统启发式方法的局限性。

Abstract: Encoding input coordinates with sinusoidal functions into multilayer perceptrons (MLPs) has proven effective for implicit neural representations (INRs) of low-dimensional signals, enabling the modeling of high-frequency details. However, selecting appropriate input frequencies and architectures while managing parameter redundancy remains an open challenge, often addressed through heuristics and heavy hyperparameter optimization schemes. In this paper, we introduce AIRe ($\textbf{A}$daptive $\textbf{I}$mplicit neural $\textbf{Re}$presentation), an adaptive training scheme that refines the INR architecture over the course of optimization. Our method uses a neuron pruning mechanism to avoid redundancy and input frequency densification to improve representation capacity, leading to an improved trade-off between network size and reconstruction quality. For pruning, we first identify less-contributory neurons and apply a targeted weight decay to transfer their information to the remaining neurons, followed by structured pruning. Next, the densification stage adds input frequencies to spectrum regions where the signal underfits, expanding the representational basis. Through experiments on images and SDFs, we show that AIRe reduces model size while preserving, or even improving, reconstruction quality. Code and pretrained models will be released for public use.

[9] Neural USD: An object-centric framework for iterative editing and control cs.CV | cs.AIPDF

Alejandro Escontrela, Shrinu Kushagra, Sjoerd van Steenkiste, Yulia Rubanova, Aleksander Holynski

TL;DR: 论文提出Neural USD框架，解决生成模型中对象精确编辑的问题，通过层次化结构表示场景和对象，实现对象级的控制和解耦。

Details

Motivation: 当前生成模型在对象级别的精确和迭代编辑上存在不足，例如修改特定对象属性时会引发全局变化。

Result: 展示了Neural USD支持迭代和增量工作流的能力，设计选择验证了其有效性。

Insight: 层次化结构和解耦控制是实现对象级编辑的关键，为生成模型的实际应用提供了新思路。

Abstract: Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the “Neural Universal Scene Descriptor” or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .

[10] SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability cs.CV | cs.AI | cs.CRPDF

Peiyang Xu, Minzhou Pan, Zhaorun Chen, Shuang Yang, Chaowei Xiao

TL;DR: SafeVision 是一种高效的图像护栏系统，通过结合人类推理提升适应性和透明度，动态对齐安全策略并实现快速风险评估和解释。

Details

Motivation: 传统图像护栏模型依赖于预定义类别且缺乏语义推理，难以适应新威胁，需频繁重新训练。SafeVision 旨在解决这些问题。

Result: SafeVision 在 VisionHarm-T 和 VisionHarm-C 上分别超越 GPT-4o 8.6% 和 15.5%，速度提升 16 倍以上。

Insight: 结合人类推理和动态策略对齐是提升图像护栏适应性和透明度的关键。

Abstract: With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.

[11] Reasoning Visual Language Model for Chest X-Ray Analysis cs.CVPDF

Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat

TL;DR: 提出了一种结合视觉-语言模型的框架，通过链式推理（CoT）模仿放射科医生的诊断思维，提升胸部X光分析的透明度和可信度。

Details

Motivation: 现有视觉-语言模型在医学图像分析中缺乏透明推理过程，而临床医生需要可解释的中间步骤以验证结论。本文旨在填补这一空白。

Result: 在分布外测试中，模型在保持多标签分类竞争力的同时提高了可解释性；专家研究表明，完整的推理痕迹提升了信心并缩短报告时间。

Insight: 3B规模的模型证明了推理质量与预测质量同等重要，适用于需要透明AI的其他医学任务。

Abstract: Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.

[12] Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints cs.CVPDF

Kazutoshi Akita, Norimichi Ukita

TL;DR: 本文提出了一种联合预测多个傅里叶分量的方法，以提升任意尺度超分辨率的效率和质量控制能力。

Details

Motivation: 现有方法通过递归神经网络逐个预测傅里叶分量，导致性能下降和低效问题。本文旨在解决这一问题。

Result: 新方法在质量和效率上均优于现有方法。

Insight: 联合预测策略在超分辨率任务中具有潜力，可以扩展到其他类似任务。

Abstract: Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.

[13] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild cs.CVPDF

Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang

TL;DR: TeleEgo是一个用于评估第一人称AI助手在真实场景中多模态处理能力的基准测试，支持长时间、流式数据和全局时间对齐，提供了12个子任务和多种评估指标。

Details

Motivation: 现有基准测试通常孤立评估多模态能力，缺乏真实流式场景和长期任务支持，限制了AI助手在实际应用中的发展。

Result: TeleEgo包含3,291个人工验证的QA项目，支持多种问题格式，并在流式设置下严格评估。

Insight: TeleEgo为开发实用AI助手提供了更全面的评估框架，强调了长期记忆和实时响应的重要性。

Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics – Real-Time Accuracy and Memory Persistence Time – to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.

[14] AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization cs.CVPDF

Heethanjan Kanagalingam, Thenukan Pathmanathan, Mokeeshan Vathanakumar, Tharmakulasingam Mukunthan

TL;DR: 论文提出AdvBlur方法，通过在数据集中引入对抗性模糊图像并结合双损失函数框架，提升糖尿病视网膜病变分类的领域泛化能力，有效应对分布变化带来的挑战。

Details

Motivation: 糖尿病视网膜病变（DR）的早期准确检测至关重要，但现有深度学习方法因设备、人口统计和成像条件差异导致的分布变化而泛化能力不足。

Result: 实验表明，AdvBlur在多个外部数据集上表现优异，优于现有的领域泛化方法。

Insight: 对抗性模糊图像有助于模型学习更鲁棒的特征，双损失函数框架则有效平衡了分类准确性和领域泛化能力。

Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.

[15] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks cs.CV | cs.AI | cs.LGPDF

Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar

TL;DR: Mars-Bench是首个专注于火星科学任务的基准测试，包含20个数据集，涵盖分类、分割和检测任务，旨在推动火星专属基础模型的发展。

Details

Motivation: 火星科学领域缺乏标准化的基准测试和评估框架，限制了基础模型的应用与发展。

Result: 结果显示火星专属基础模型可能优于通用模型，支持进一步探索领域适应的预训练方法。

Insight: 火星科学领域的基础模型需要领域特定的预训练数据和方法，以实现更好的性能。

Abstract: Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.

[16] ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning cs.CV | cs.AIPDF

Xingyu Liu, Kun Ming Goh

TL;DR: ResNet通过引入残差连接解决了深度CNN训练中的梯度消失问题，使网络能训练数百层，提升性能和稳定性。

Details

Motivation: 训练极深度CNN时梯度消失问题限制了性能提升，ResNet旨在解决这一问题。

Result: 在CIFAR-10上，ResNet-18达到89.9%准确率，优于传统CNN的84.1%，且收敛更快更稳定。

Insight: 残差连接不仅缓解梯度消失，还能加速训练并提升深度网络的性能上限。

Abstract: Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.

[17] Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models cs.CV | cs.LGPDF

Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang

TL;DR: SNELLA是一种高效的参数微调方法，通过低秩矩阵和非线性核函数优化稀疏微调，减少了内存占用并提升了性能。

Details

Motivation: 当前的稀疏微调方法存在两阶段的局限性：一是任务相关权重的定位忽略了微调过程中的参数调整，二是内存占用高。

Result: 在分类、分割和生成任务上，SNELLA实现了SOTA性能，内存占用减少了31.1%-39.9%，Top-1准确率提升了1.8%。

Insight: 通过低秩分解和非线性核函数的结合，SNELLA在减少内存占用的同时提升了任务适应性，展示了稀疏微调的潜力。

Abstract: Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.

[18] Enhancing CLIP Robustness via Cross-Modality Alignment cs.CVPDF

Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang

TL;DR: 论文提出了一种名为COLA的方法，通过跨模态对齐增强CLIP对抗性扰动的鲁棒性。COLA基于最优传输理论，恢复图像和文本特征的全局对齐与局部结构一致性，显著提升对抗性攻击下的分类性能。

Details

Motivation: 现有的方法主要关注对抗性微调或提示优化，忽视了CLIP特征空间中图像和文本特征的错位问题，这种错位在对抗性扰动下加剧，导致性能下降。

Result: 在14个零样本分类基准测试中表现优异，ImageNet及其变体在PGD攻击下平均提升6.7%，同时保持干净样本的高精度。

Insight: 特征空间中的跨模态对齐是提升对抗性鲁棒性的关键，COLA的方法无需训练且兼容已有微调模型。

Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.

[19] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification cs.CVPDF

William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky

TL;DR: BOB提出了一种新的文本到图像(T2I)模型微调策略，通过提取类无关属性并显式控制生成过程，解决了低样本细粒度分类任务中合成数据生成的多样性和过拟合问题，显著提升了分类性能。

Details

Motivation: 传统的T2I模型在生成合成数据时容易过拟合或缺乏多样性，限制了其在低样本细粒度分类任务中的应用。BOB旨在通过类无关属性控制和边际化技术解决这些问题。

Result: BOB在多个数据集和T2I模型中表现出色，显著提升了细粒度分类的准确率。例如，在Aircraft数据集上，BOB比DataDream高出7.4%。在18个实验设置中，BOB超越了现有方法。

Insight: 1. 显式建模类无关属性有助于平衡生成数据的多样性和类别区分度。2. 边际化技术是减少过拟合和提升生成质量的有效手段。3. 合成数据在低样本任务中具有显著潜力。

Abstract: Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model’s generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.

[20] OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation cs.CVPDF

Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh

TL;DR: OmniText提出了一种无需训练的通用方法，用于可控的文本图像编辑任务，解决了现有文本修复方法在文本去除、风格控制和文本生成重复等方面的局限性。

Details

Motivation: 现有基于扩散模型的文本修复方法在文本图像编辑中存在三个主要问题：无法去除文本、缺乏对文本风格的控制、易生成重复字母。OmniText旨在解决这些问题。

Result: OmniText在多种文本图像编辑任务中表现优异，超越现有文本修复方法，并与专用方法性能相当。

Insight: 自注意力和交叉注意力机制在文本图像编辑任务中的独特作用为通用方法的开发提供了新方向。

Abstract: Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model’s tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.

[21] Enhancing Pre-trained Representation Classifiability can Boost its Interpretability cs.CV | cs.LGPDF

Shufan Shen, Zhaobo Qi, Junshu Sun, Qingming Huang, Qi Tian

TL;DR: 本文研究发现预训练视觉模型的表示在分类性能和可解释性之间存在正相关关系，并提出了一种新的指标（IIS）量化这种关系，同时还展示了如何通过最大化可解释性来进一步提升分类性能。

Details

Motivation: 预训练视觉模型广泛应用于下游任务，但对表示的可解释性需求日益增加，但目前尚不清楚分类性能和可解释性能否同时提高。

Result: 实验表明，分类性能更高的表示具有更高的可解释性；通过最大化可解释性可以进一步提升分类性能。

Insight: 分类性能和可解释性可以统一优化，这为预训练视觉模型的改进提供了新的方向。

Abstract: The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.

[22] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations cs.CVPDF

Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang

TL;DR: UHKD提出了一种基于频域表示的异构知识蒸馏统一框架，通过傅里叶变换和特征对齐模块，解决异构模型间的语义差异问题，提升了蒸馏效果。

Details

Motivation: 异构模型在知识蒸馏中因结构和语义差异导致性能下降，现有方法主要针对同构模型，限制了中间特征的利用。

Result: 在CIFAR-100和ImageNet-1K上分别实现了5.59%和0.83%的性能提升。

Insight: 频域表示能有效缓解异构模型的语义差异，联合优化中间特征和对数空间能进一步提升蒸馏效果。

Abstract: Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge

[23] DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery cs.CVPDF

Zan Wang, Siyu Chen, Luya Mo, Xinfeng Gao, Yuxin Shen

TL;DR: DogMo是一个大规模多视角RGB-D数据集，专注于从图像中恢复犬类运动，填补了现有数据集的不足，并提出了一种基于SMAL模型的三阶段优化方法。

Details

Motivation: 现有犬类运动数据集缺乏多视角和真实3D数据，且规模和多样性有限，DogMo旨在解决这些问题，推动犬类运动恢复的研究。

Result: DogMo为犬类运动恢复提供了系统评估基准，同时提出的方法显著提升了运动恢复的准确性。

Insight: 该研究结合了计算机视觉、图形学和动物行为建模，为未来跨领域研究提供了新方向。

Abstract: We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.

[24] Compositional Image Synthesis with Inference-Time Scaling cs.CV | cs.AIPDF

Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

TL;DR: 本文提出了一种无需训练的框架，通过结合对象中心方法和自优化技术，提升文本到图像模型的合成能力，特别是在对象数量、属性和空间关系上的准确性。

Details

Motivation: 现有的文本到图像模型在合成复杂场景时存在组合性问题，例如对象数量、属性和空间关系的准确性不足。本文旨在解决这一问题，同时保持生成图像的美观性。

Result: 实验表明，相比现有文本到图像模型，该框架在场景与提示对齐方面表现更优，且在保持美观性的同时提升了合成准确性。

Insight: 本文的创新点在于将显式布局与自优化相结合，展示了推理时优化对提升组合性的有效性。

Abstract: Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code are available at https://github.com/gcl-inha/ReFocus.

[25] VC4VG: Optimizing Video Captions for Text-to-Video Generation cs.CV | cs.AI | cs.CLPDF

Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng

TL;DR: 论文提出VC4VG框架，针对文本生成视频（T2V）优化视频标注，设计多维度标注方法论，并构建VC4VG-Bench评测基准，实验证明高质量标注提升视频生成效果。

Details

Motivation: 文本生成视频依赖高质量视频-文本对，但目前缺乏针对T2V训练的标注优化策略，影响模型生成效果。

Result: 实验显示优化标注质量显著提升视频生成性能。

Insight: 高质量标注对T2V模型至关重要，多维度评测可有效指导标注优化。

Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.

[26] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning cs.CV | cs.AIPDF

Aodi Wu, Xubo Luo

TL;DR: 该技术报告提出了一种基于任务特定提示和空间推理的视觉语言模型（VLM）增强框架，用于自动驾驶场景理解，通过在IROS 2025 RoboSense挑战赛中实现高精度表现，证明了其有效性。

Details

Motivation: 自动驾驶需要复杂的多任务场景理解能力，现有VLM在任务干扰和空间推理方面存在不足，亟需一种系统性解决方案。

Result: 在Phase-1（清洁数据）和Phase-2（损坏数据）上分别取得70.87%和72.85%的平均准确率，表现优于基准方法。

Insight: 结构化提示和空间推理显著提升了VLM在自动驾驶任务中的性能，证明了任务特定设计和空间信息嵌入的重要性。

Abstract: This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.

[27] Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 cs.CVPDF

Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu

TL;DR: 论文分析了SAM2在对抗样本攻击下的脆弱性，提出了UAP-SAM2方法，通过双语义偏差实现跨提示通用攻击，显著优于现有方法。

Details

Motivation: SAM2作为SAM的升级版，在视频分割中表现出强大的泛化能力，但其鲁棒性未经检验。论文旨在填补这一空白，研究SAM2的脆弱性及现有攻击方法的局限性。

Result: 在六个数据集上的实验显示，UAP-SAM2显著优于现有攻击方法。

Insight: SAM2的架构差异导致了新的攻击挑战，而UAP-SAM2通过双语义偏差有效解决了这些问题，证明了其在实际应用中的攻击潜力。

Abstract: Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.

[28] CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation cs.CVPDF

Anshul Kaushal, Kunal Jangid, Vinod K. Kurmi

TL;DR: CLFSeg是一种基于模糊逻辑的医学图像分割方法，通过结合卷积层和模糊逻辑处理边界不确定性和噪声，显著提升了分割性能。

Details

Motivation: 传统CNN模型在医学图像分割中存在泛化能力和鲁棒性不足的问题，尤其是在处理边界不确定性和噪声时表现不佳。CLFSeg旨在解决这些问题，提升分割的准确性和效率。

Result: 在CVC-ColonDB等四个公开数据集上表现优异，超越了现有SOTA方法，同时确保了计算效率。

Insight: 模糊逻辑在医学图像分割中有效处理边界不确定性，FC模块的引入提升了模型对噪声和模糊区域的鲁棒性。

Abstract: Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have represented limited generalizability, robustness, and inability to handle uncertainty, which affects the segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework that aggregates the Fuzzy-Convolutional (FC) module leveraging convolutional layers and fuzzy logic. This module enhances the segmentation performance by identifying local and global features while minimizing the uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. In order to handle class imbalance problem while focusing on the areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/

[29] MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration cs.CVPDF

Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park

TL;DR: MC-SJD是一种无损并行解码框架，通过耦合方法提升Speculative Jacobi Decoding (SJD)的效率，显著加速自回归视觉生成，图像生成速度提升4.2倍，视频生成速度提升13.3倍。

Details

Motivation: 自回归（AR）建模在视觉生成中表现出色，但因逐令牌生成的缓慢推理速度限制了实际应用，MC-SJD旨在解决这一问题。

Result: 在图像和视频生成中分别实现了4.2倍和13.3倍的加速，且输出质量无损失。

Insight: 通过简单的算法调整（如耦合方法），可以显著提升自回归生成模型的效率，为实际部署提供可能。

Abstract: While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.

[30] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs cs.CVPDF

Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He

TL;DR: SCOPE是一种新型的视觉token修剪策略，通过联合建模显著性（saliency）和覆盖度（coverage），提升了多模态大语言模型（MLLMs）的效率，同时保持了语义完整性。

Details

Motivation: 现有的视觉token修剪方法仅基于注意力分数选择最显著的token，导致语义不完整。SCOPE旨在解决这一问题，通过综合考虑显著性和覆盖度来选择token。

Result: 实验表明，SCOPE在多个视觉语言理解基准测试中优于现有方法，显著提升了模型效率。

Insight: SCOPE的成功表明，在token修剪中同时考虑显著性和覆盖度是提升语义完整性和模型效率的关键。

Abstract: Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called \textbf{S}aliency-\textbf{C}overage \textbf{O}riented token \textbf{P}runing for \textbf{E}fficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at \href{https://github.com/kinredon/SCOPE}{https://github.com/kinredon/SCOPE}.

[31] Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation cs.CVPDF

Waseem Shariff, Timothy Hanley, Maciej Stec, Hossein Javidnia, Peter Corcoran

TL;DR: 论文提出了首个基于事件相机的小眼动数据集，支持认知计算中的小眼动动态研究，通过Spiking-VGG系列模型实现了约90%的分类准确率。

Details

Motivation: 传统眼动分析方法成本高、可扩展性和时间分辨率有限，而基于事件的传感提供了高速、低延迟的替代方案，适合捕捉细微时空变化。

Result: 模型平均准确率约90%，能够独立于事件数量或持续时间对小眼动进行分类。

Insight: 展示了脉冲神经网络在精细运动识别中的潜力，为基于事件的视觉研究提供了基准。

Abstract: Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .

[32] UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation cs.CV | cs.LGPDF

Jiyu Guo, Shuo Yang, Yiming Huang, Yancheng Long, Xiaobo Xia

TL;DR: UtilGen提出了一种面向任务效用的数据增强框架，通过下游任务反馈自适应优化数据生成过程，显著提升了任务性能。

Details

Motivation: 现有的生成数据增强方法通常忽略任务特定需求，导致生成的合成数据对下游任务的实际效用有限。为了解决这一问题，UtilGen专注于任务效用，通过反馈优化数据生成。

Result: 在八个基准数据集上，UtilGen平均准确率提升3.87%，生成了更具任务相关性和影响力的合成数据。

Insight: 从视觉特征中心转向任务效用中心的数据增强范式更具实际价值，任务反馈驱动的方法能够有效提升生成数据的下游任务性能。

Abstract: Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes – such as fidelity and diversity – to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies – such as prompt embeddings and initial noise – at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.

[33] Training-free Source Attribution of AI-generated Images via Resynthesis cs.CV | cs.AIPDF

Pietro Bongini, Valentina Molinari, Andrea Costanzo, Benedetta Tondi, Mauro Barni

TL;DR: 提出了一种基于图像重合成的免训练单样本源归属方法，通过在特征空间中比较候选生成模型的重合成结果与原始图像的相似性，实现高效归属。同时发布了一个新的合成图像归属数据集，验证了该方法的优越性。

Details

Motivation: 在数据稀缺条件下（如少样本或零样本），现有方法难以有效实现合成图像的源归属。本文旨在解决这一挑战，提出免训练的高效方法。

Result: 在少样本条件下，该方法优于现有技术；新数据集被证明具有挑战性，适合未来方法评测。

Insight: 重合成方法在少样本或无标签数据条件下具有潜力，为合成图像溯源问题提供了新思路。

Abstract: Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.

[34] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model cs.CV | cs.AI | cs.CLPDF

Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin

TL;DR: ViPER提出了一种新颖的两阶段任务，通过自我批判和自我预测实现视觉语言模型的自演进，显著提升了细粒度视觉感知能力。

Details

Motivation: 视觉语言模型在细粒度视觉感知方面的能力有限，现有方法（如监督微调和强化微调）存在数据和能力平衡的挑战。

Result: Qwen-Viper系列在多项基准测试中平均提升1.7%，细粒度感知任务最高提升6.0%。

Insight: ViPER证明了生成与理解之间的互惠关系，为开发更自主的视觉语言模型提供了突破。

Abstract: The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.

[35] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning cs.CV | cs.AIPDF

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

TL;DR: 本文探讨了如何通过提示学习（prompt learning）优化CLIP模型在少量样本遥感图像场景分类中的表现，几种代表性方法在基准测试中表现优异。

Details

Motivation: 遥感图像场景分类受限于标注数据的稀缺性和高成本，而CLIP等模型的直接应用因领域差异和语义适配问题效果不佳，提示学习被视为轻量高效的适应策略。

Result: 在多个遥感数据集的实验中，提示学习方法一致优于基线，自我约束提示方法在跨域任务中表现最鲁棒。

Insight: 提示学习是弥补遥感图像领域差距的高效方案，为未来研究提供了重要基础。

Abstract: Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.

[36] Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation cs.CVPDF

Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Vi Vu, Bach X. Nguyen

TL;DR: 该论文提出了一种新颖的切换双学生框架（Switching Dual-Student Framework）和损失感知指数移动平均策略（Loss-Aware Exponential Moving Average），以解决半监督医学图像分割中师生网络间知识传递不可靠的问题。

Details

Motivation: 师生框架在半监督医学图像分割中表现优异，但师生网络之间的强相关性和不可靠的知识传递限制了学习效果。

Result: 在3D医学图像分割数据集上，该方法显著优于现有半监督方法，提升了有限监督下的分割精度。

Insight: 动态选择和协作机制可以有效提升半监督学习中知识传递的可靠性，为医学图像分割提供了一种高效的解决方案。

Abstract: Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.

[37] Decoupling What to Count and Where to See for Referring Expression Counting cs.CVPDF

Yuda Zou, Zijian Zhang, Yongchao Xu

TL;DR: 该论文提出W2-Net框架，通过双查询机制和新型匹配策略解决Referring Expression Counting（REC）任务中如何计数和从哪里观察的问题，显著提升了性能和定位精度。

Details

Motivation: REC任务通常关注类级别的特征而忽略属性信息，因为标注点通常位于类代表性位置（如头部），导致模型忽视其他视觉区域的属性信息（如腿部的“行走”）。

Result: 在REC-8K数据集上，W2-Net将计数误差降低22.5%（验证集）和18.0%（测试集），定位F1分数提升7%和8%，显著优于现有方法。

Insight: 分离“计数什么”和“从哪里看”可有效利用属性信息，SSM策略进一步提升了子类间的区分能力，适用于复杂REC任务。

Abstract: Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for “walking”). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into “what to count” and “where to see” via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.

[38] A Hybrid Approach for Visual Multi-Object Tracking cs.CV | cs.ROPDF

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

TL;DR: 论文提出了一种混合多目标视觉跟踪方法，结合随机和确定性机制解决非线性动态和身份一致性等问题，实验表现优于现有方法。

Details

Motivation: 解决多目标跟踪中非线性动态、噪声和非高斯分布等挑战，同时在复杂场景中保持目标身份的一致性。

Result: 实验结果表明，所提出的方法在性能上优于现有的先进跟踪器。

Insight: 混合随机和确定性机制在多目标跟踪任务中具有显著优势，尤其是在复杂动态和遮挡场景下。

Abstract: This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

[39] Rethinking Visual Intelligence: Insights from Video Pretraining cs.CV | cs.AI | 68T07, 68T45, 68T20 | I.2.10; I.4.8; I.5.1; I.2.6PDF

Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi

TL;DR: 这篇论文探讨了视频预训练（Video Diffusion Models, VDMs）在视觉智能领域的潜力，通过实验表明其在组合理解、数据效率和通用问题解决方面优于语言模型（LLMs）。

Details

Motivation: 语言模型的成功未能在视觉领域复现，现有视觉模型在组合理解、数据效率和通用任务适应能力上仍有不足。作者希望通过视频预训练模型（VDMs）解决这一问题。

Result: 实验结果显示VDMs在数据效率和任务适应性上显著优于LLMs，表明视频预训练为视觉基础模型的发展提供了有力支持。

Insight: 视频数据中的时空动态性为模型提供了更强的结构和动态性归纳偏置，可能是实现更通用视觉智能的关键因素。

Abstract: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

[40] A Critical Study towards the Detection of Parkinsons Disease using ML Technologies cs.CVPDF

Vivek Chetia, Abdul Taher Khan, Rahish Gogoi, David Kapsian Khual, Purnendu Bikash

TL;DR: 该论文提出了一种基于深度学习的方法，用于分类和检测茶叶病害（如红锈病、Helopeltis和红蜘蛛螨），并评估了SSD MobileNet V2和Faster R-CNN ResNet50 V1两种模型，后者表现更好。同时，还使用Mask R-CNN实现了病害区域的分割。

Details

Motivation: 茶叶病害严重影响产量和质量，传统检测方法效率低且主观性强。因此，作者希望通过深度学习技术实现自动化检测和病害区域的量化分析。

Result: Faster R-CNN ResNet50 V1的表现优于SSD MobileNet V2（mAP分别为25%和20.9%）；Mask R-CNN成功实现了病害区域的分割和量化。

Insight: 1. Faster R-CNN在复杂目标检测任务中可能更有效；
2. 分割技术可以扩展应用于病害严重程度的评估。

Abstract: The proposed solution is Deep Learning Technique that will be able classify three types of tea leaves diseases from which two diseases are caused by the pests and one due to pathogens (infectious organisms) and environmental conditions and also show the area damaged by a disease in leaves. Namely Red Rust, Helopeltis and Red spider mite respectively. In this paper we have evaluated two models namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for the object detection. The SSD MobileNet V2 gave precision of 0.209 for IOU range of 0.50:0.95 with recall of 0.02 on IOU 0.50:0.95 and final mAP of 20.9%. While Faster R-CNN ResNet50 V1 has precision of 0.252 on IOU range of 0.50:0.95 and recall of 0.044 on IOU of 0.50:0.95 with a mAP of 25%, which is better than SSD. Also used Mask R-CNN for Object Instance Segmentation where we have implemented our custom method to calculate the damaged diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.

[41] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras cs.CVPDF

Charles Javerliat, Pierre Raimbaud, Guillaume Lavoué

TL;DR: Kineo提出了一种无需标定的多视角运动捕捉方法，通过稀疏RGB相机实现高效且精确的运动重建。

Details

Motivation: 现有方法依赖精确的相机标定，限制了非专家用户和野外捕获的实用性。虽然已有无需标定的方法，但其计算成本高且精度低。

Result: 在EgoHumans和Human3.6M数据集上表现出色，相比现有方法，相机平移误差降低83-85%，角度误差降低86-92%，W-MPJPE降低83-91%。效率高，处理速度快于视频时长。

Insight: Kineo通过高效的全局优化和置信驱动策略，显著提升了无需标定方法的精度和计算效率，为实际应用提供了可行的解决方案。

Abstract: Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.

[42] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs cs.CV | cs.CLPDF

Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia

TL;DR: Latent Sketchpad为MLLMs引入了一个内部视觉画板，通过在推理过程中交替生成视觉潜变量和文本内容，提升了模型的视觉规划和想象能力。实验表明，该方法在不影响推理性能的同时，增强了模型的可解释性和泛化能力。

Details

Motivation: 现有MLLMs在视觉理解和生成任务中表现优异，但在需要视觉规划和想象的复杂场景中存在局限，类似于人类通过草图辅助思考的需求。

Result: 在MazePlanning数据集上验证了Latent Sketchpad的有效性，在不同MLLMs（如Gemma3和Qwen2.5-VL）上表现优于或与基础模型相当。

Insight: 通过引入视觉画板，MLLMs不仅提升了视觉规划和想象能力，还增强了与人类交互的可解释性，为更丰富的应用场景提供了可能。

Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model’s textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.

[43] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents cs.CVPDF

Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie

TL;DR: OSWorld-MCP是首个全面评估计算机使用代理工具调用能力的基准，强调GUI操作与工具调用的结合，并展示了工具调用对任务成功率的提升。

Details

Motivation: 以往评估多模态代理的研究主要关注GUI交互能力，忽视了工具调用（如MCP协议）的重要性。这种忽略导致对代理能力的评估不公平，因此需要一个新的基准来全面评估工具调用和GUI操作的结合能力。

Result: 实验结果显示，MCP工具显著提高了任务成功率（如OpenAI o3从8.3%提升到20.4%），但最强模型的工具调用率仅为36.3%，表明仍有改进空间。

Insight: 1. 工具调用能力是多模态代理的重要组成部分，但与纯GUI交互能力相比仍有明显不足；2. OSWorld-MCP为评估复杂、工具辅助环境下的代理能力提供了新的标准。

Abstract: With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates, Only 36.3%, indicating room for improvement and highlighting the benchmark’s challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.

[44] Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT cs.CV | I.4.5; I.5PDF

Xu Jiang, Huiying Pan, Ligen Shi, Jianing Sun, Wenfeng Xu

TL;DR: 该论文提出了一种基于深度学习的散射伪影校正方法，结合物理先验知识，利用高斯径向基函数建模点散射函数，并将其嵌入到Kolmogorov-Arnold网络中，以提高CBCT图像质量。

Details

Motivation: CBCT在数据采集过程中容易受到散射影响，导致图像CT值偏差和组织对比度降低，影响诊断准确性。因此，需要一种能够结合物理特性的高效校正方法。

Result: 通过合成和真实扫描实验验证，该方法能有效校正重建图像中的散射伪影，并在定量指标上优于现有方法。

Insight: 结合物理先验知识和深度学习模型（如KAN）可以提高散射校正的精度和效率，这对提升CBCT图像质量具有实际意义。

Abstract: Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.

[45] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips cs.CVPDF

Mia Kan, Yilin Liu, Niloy Mitra

TL;DR: SAGE提出了一种结构感知的生成视频过渡方法，通过结合结构引导和生成合成，实现了多样片段间的平滑、语义一致过渡。

Details

Motivation: 传统视频过渡方法（如线性混合）在多样片段间过渡时可能引入伪影或破坏时间一致性，无法处理大时间间隔或语义差异较大的情况。

Result: SAGE在定量指标和用户研究中优于现有方法（如FILM、TVG等），尤其在多样片段间过渡表现优异。

Insight: 结合结构信息与生成模型在视频过渡任务中具有重要意义，为未来相关研究提供了新思路。

Abstract: Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.

[46] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? cs.CV | cs.AI | cs.LG | q-bio.NCPDF

Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

TL;DR: 研究发现，预训练的Vision Transformers（ViTs）能够自然涌现出物体绑定的能力，这种能力有助于下游任务，并在自监督模型中表现更强。

Details

Motivation: 探讨预训练的ViTs是否能够自然涌现出物体绑定的能力，因为这在人类认知中非常重要。

Result: IsSameObject的准确率超过90%，且在自监督模型中表现更强；该信号还主动引导注意力机制。

Insight: 物体绑定能力并非ViT架构的固有特性，而是通过特定的预训练目标习得的，这对连接主义系统的理解提出了新的视角。

Abstract: Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.

[47] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance cs.CVPDF

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen

TL;DR: 论文提出了ProMoE，一种针对扩散变换器（DiTs）的两步路由器架构，通过显式路由指导促进专家专业化，提升了视觉MoE的性能。

Details

Motivation: 现有MoE在语言模型中表现优异，但在视觉任务（如DiTs）中表现不佳。主要原因是视觉令牌的空间冗余和功能异质性，阻碍了专家的专业化。

Result: ProMoE在ImageNet上实现了state-of-the-art性能，适用于Rectified Flow和DDPM训练目标。

Insight: 视觉MoE需要显式语义指导，路由机制的设计至关重要。条件路由和原型路由的结合，可以显著提升专家专业化和模型性能。

Abstract: Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.

[48] Uniform Discrete Diffusion with Metric Path for Video Generation cs.CVPDF

Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo

TL;DR: 论文提出了URSA框架，通过线性化度量路径和分辨率依赖的时间步移位机制，实现了高效的离散视频生成，性能接近连续扩散方法。

Details

Motivation: 离散视频生成方法因误差积累和长上下文不一致性问题落后于连续空间方法，URSA旨在弥合这一差距。

Result: 在视频和图像生成基准测试中，URSA表现优于现有离散方法，性能接近连续扩散方法。

Insight: 离散方法通过设计高效的全局优化机制，可以接近连续方法的性能，同时降低计算成本。

Abstract: Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

[49] Generative View Stitching cs.CV | cs.LGPDF

Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

TL;DR: 论文提出了Generative View Stitching (GVS)方法，通过并行采样序列解决相机引导视频生成中未来条件缺失导致的问题，实现了稳定、无碰撞且帧间一致的视频生成。

Details

Motivation: 自回归视频扩散模型无法利用未来条件指导当前生成，导致相机轨迹引导的视频生成中易出现场景碰撞和崩溃。

Result: 在预定义相机路径下实现了稳定、无碰撞且帧间一致的视频生成，包括解决不可能阶梯场景。

Insight: GVS展示了扩散缝合在视频生成中的潜力，无需专用模型训练即可提升生成质量。

Abstract: Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv"ard’s Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.

cs.CL [Back]

[50] Evaluating Long-Term Memory for Long-Context Question Answering cs.CLPDF

Alessandra Terranova, Björn Ross, Alexandra Birch

TL;DR: 论文系统评估了长上下文对话任务中不同类型的记忆方法，提出LoCoMo基准，发现记忆增强方法显著减少token使用，同时保持准确性，并分析了不同模型能力下最优记忆架构。

Details

Motivation: 为了提升大型语言模型的对话连续性和经验学习能力，需要有效的记忆系统，但目前不清楚哪种记忆类型最适合长上下文任务。

Result: 记忆增强方法减少了90%以上的token使用，同时保持竞争力；小模型受益于RAG，强推理模型则更适合情景记忆和复杂语义记忆。

Insight: 记忆架构复杂度应与模型能力匹配；情景记忆能帮助模型识别自身知识局限。

Abstract: In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning model gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.

[51] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs’ Cultural Processing of Figurative Language cs.CLPDF

Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab

TL;DR: 论文评估了大型语言模型（LLM）在处理反映文化背景的比喻语言时的表现，发现模型在阿拉伯语和文化相关的表达中表现较差，尤其是在实用任务和内涵理解上。发布了首个埃及阿拉伯语习语数据集Kinayat。

Details

Motivation: 当前LLM在处理文化相关的比喻语言时表现不佳，尤其是在实用性和文化内涵理解上存在明显差距。研究旨在填补这一空白并为未来研究提供工具。

Result: 模型在阿拉伯语谚语上的准确率比英语低4.29%，埃及习语上更低10.28%。实用任务准确率比理解任务低14.07%，上下文补充可提升10.66%。内涵理解最高仅85.58%的人类标注一致性。

Insight: 比喻语言是评估文化推理能力的有效工具，LLM虽能理解比喻意义，但在实用性和文化适当性上仍有挑战。

Abstract: We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.

[52] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning cs.CL | cs.AIPDF

Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, Dan Roth

TL;DR: OraPlan-SQL是一个双语NL2SQL推理框架，通过两步代理（Planner和SQL代理）和反馈引导的元提示策略，显著提升了复杂推理任务的性能，并在Archer评测中排名第一。

Details

Motivation: 为解决复杂推理任务（如算术、常识和假设推理）的双语NL2SQL问题，传统方法因多代理协调开销大而效率低下。

Result: 在Archer评测中，执行准确率（EX）达55.0%（英文）和56.7%（中文），SQL有效性（VA）超99%。

Insight: 单一代理结合反馈优化和多候选策略可在复杂任务中实现高效性和可靠性，同时避免多代理协调的复杂性。

Abstract: We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: Planner agent that generates stepwise natural language plans, and SQL agent that converts these plans into executable SQL. Since SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner’s system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.

[53] Language Models for Longitudinal Clinical Prediction cs.CLPDF

Tananun Songdechakraiwut, Michael Lutz

TL;DR: 论文提出了一种轻量级框架，利用冻结的大型语言模型分析纵向临床数据，通过整合患者历史和上下文信息生成准确预测，无需模型微调。

Details

Motivation: 动机在于利用预训练语言模型的潜力，解决临床纵向数据分析中的挑战，尤其是小样本条件下的预测性能。

Result: 在神经心理学评估任务中，即使在训练数据有限的情况下，也能实现准确可靠的预测性能，尤其适用于早期阿尔茨海默病的监测。

Insight: 研究表明，预训练语言模型在纵向临床数据分析中具有潜力，尤其在资源受限的场景下表现优异。

Abstract: We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer’s monitoring.

[54] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages cs.CLPDF

Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani

TL;DR: 该论文提出了AfriMTEB和AfriE5，分别是针对非洲语言的多任务文本嵌入基准和改进的嵌入模型，填补了非洲语言在文本嵌入任务中的空白。

Details

Motivation: 非洲语言在多语言文本嵌入任务中代表性不足，现有任务多来自翻译基准，缺乏多样性。

Result: AfriE5在性能上优于Gemini-Embeddings和mE5，达到新SOTA。

Insight: 任务的多样性和语言的广泛覆盖对提升模型的泛化能力至关重要。

Abstract: Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.

[55] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward cs.CL | cs.AIPDF

Hao An, Yang Xu

TL;DR: 该论文提出了一种基于细粒度语义置信度奖励的强化学习框架（FiSCoRe），通过语义聚类和多候选答案筛选，指导大语言模型（LLM）更准确地拒绝回答超出其知识范围的问题，从而减少幻觉现象。

Details

Motivation: 现有方法通常依赖粗粒度信号（如整体置信度）指导LLMs拒绝回答不确定问题，但这可能导致模型对自身知识边界认识不精确。为此，需要一种更精细的置信度评估方法。

Result: 方法在领域内和分布外基准测试中显著提升了LLMs的可靠性。

Insight: 细粒度的语义置信度评估能更精确地识别模型的知识边界，从而提升拒绝回答的准确性。

Abstract: Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model’s own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on $\textbf{\underline{Fi}ne-grained \underline{S}emantic \underline{Co}nfidence \underline{Re}ward (\Ours)}$, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.

[56] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs cs.CL | cs.AIPDF

Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren

TL;DR: 论文提出了SpecKD，一种动态的、基于令牌的知识蒸馏方法，通过选择性损失应用提升学生模型性能。

Details

Motivation: 传统知识蒸馏方法对所有令牌统一应用损失，可能导致学生模型学习教师的不确定预测，从而引入噪声。

Result: 在各种文本生成任务中，SpecKD显著优于基线方法，实现了最先进的结果。

Insight: 选择性损失应用可稳定训练并提升学生模型能力，尤其在教师模型远强于学生时效果显著。

Abstract: Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher’s confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher’s uncertain or high-entropy predictions, which may ultimately harm student performance-especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the “propose-and-verify” paradigm of speculative decoding. At each step, the student’s token proposal is verified against the teacher’s distribution; the distillation loss is selectively applied only to “accepted” tokens, while “rejected” tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.

[57] Success and Cost Elicit Convention Formation for Efficient Communication cs.CLPDF

Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried

TL;DR: 该论文研究如何通过模拟对话游戏训练多模态模型形成高效的沟通惯例，减少消息长度并提高成功率，同时揭示成功和成本是形成惯例的必要条件。

Details

Motivation: 人类通过共享对话上下文逐渐形成高效沟通的惯例，但如何让多模态模型也学会这种惯例以实现高效沟通是一个开放性问题。

Result: 模型在与人类互动中消息长度减少41%，成功率提升15%，且人类响应速度更快。实验证明仅基于成功或成本无法形成惯例。

Insight: 成功和成本是形成高效沟通惯例的必要条件，缺一不可；模拟游戏是一种有效训练模型形成惯例的方法。

Abstract: Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.

[58] Pie: A Programmable Serving System for Emerging LLM Applications cs.CLPDF

In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong

TL;DR: Pie是一个可编程的LLM服务系统，通过分解传统的生成循环为细粒度服务处理器，支持用户自定义的inferlets程序，从而实现灵活高效的LLM应用优化。

Details

Motivation: 现有LLM服务系统基于单一代币生成循环，难以适应新兴LLM应用的多样化推理策略和代理工作流程。

Result: 在标准任务上性能接近SOTA（延迟开销3-12%），在代理工作流上显著提升性能（延迟和吞吐量提高1.3x-3.4x）。

Insight: 通过用户自定义程序实现应用特定优化，可以显著提升LLM服务系统的灵活性和效率。

Abstract: Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O-entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.

[59] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation cs.CLPDF

Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu

TL;DR: 论文提出了一种新的分类法和基准测试HalloMTBench，用于揭示多语言大语言模型（LLMs）在翻译中的幻觉问题。

Details

Motivation: 现有机器翻译基准测试无法有效暴露多语言LLMs的幻觉问题，因此需要新的诊断工具。

Result: 评估17个LLMs后发现幻觉触发因素，如模型规模、源长度敏感性、语言偏差和RL放大的语言混合问题。

Insight: HalloMTBench为诊断LLMs翻译问题提供了前瞻性测试平台，揭示了独特的失败模式。

Abstract: Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinize these candidates with an ensemble of LLM judges, and expert validation. In this way, we curate 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct ``hallucination triggers’’ – unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available in https://huggingface.co/collections/AIDC-AI/marco-mt.

[60] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures cs.CLPDF

Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar

TL;DR: Global PIQA是一个覆盖100多种语言和文化的评估基准，由全球65个国家的335名研究人员手工构建，用于评估大型语言模型（LLMs）的物理常识推理能力。

Details

Motivation: 目前缺乏覆盖多语言和多文化的评估基准，Global PIQA旨在填补这一空白，并提供对LLMs在不同文化和语言环境下表现的综合评估。

Result: 发现LLMs在低资源语言中表现较差（准确率差距高达37%），开源模型普遍逊于商业模型。

Insight: Global PIQA揭示了LLMs在日常知识方面的不足，尤其是在低资源语言和文化中，为未来研究提供了重要方向。

Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

[61] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents cs.CLPDF

Vivek Kalyan, Martin Andrews

TL;DR: 该论文探讨了利用强化学习训练大型语言模型代理在长视野多轮任务中的性能提升，结果表明RL训练的模型在复杂任务中优于前沿模型。

Details

Motivation: 现有基于提示的方法在复杂任务中表现良好，但研究者希望通过强化学习进一步挖掘模型的潜力，尤其是在长视野多轮任务中。

Result: RL训练的模型在法律文档搜索任务中取得了85%的准确率，显著优于前沿模型的78%。

Insight: 长视野多轮任务中，强化学习能显著提升模型性能，且允许更多交互轮次能进一步提升效果。

Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.

[62] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean cs.CL | cs.AIPDF

Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim

TL;DR: Ko-MuSR是首个专注于评估韩语长篇叙事中多步软推理能力的基准，旨在减少数据污染。通过测试四种大型语言模型，发现多语言模型在韩语任务中表现更优，提示策略的提升使其接近人类水平。

Details

Motivation: 当前韩语自然语言处理（NLP）缺乏评估长篇叙事和多步软推理能力的基准。Ko-MuSR填补了这一空白，并支持韩语NLP的系统性发展。

Result: 多语言模型优于韩语专用模型；提示策略显著提升准确率，接近人类表现。

Insight: 跨语言能力在多步推理任务中具有优势，提示设计对性能提升至关重要。

Abstract: We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models – two multilingual and two Korean-specialized – show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.

Aaron Scott, Maike Züfle, Jan Niehues

TL;DR: 论文介绍了MuSaG，首个德语多模态讽刺检测数据集，包含文本、音频和视频模态的完整标注，并对比了开源和商业模型的性能，发现模型在文本模态表现最佳，凸显多模态模型的不足。

Details

Motivation: 讽刺检测在社交媒体和内容审核中至关重要，但现有研究多集中于单模态（文本），而忽视多模态线索。MuSaG填补了德语多模态讽刺数据集的空白。

Result: 人类在对话中主要依赖音频线索，而模型在文本模态表现最佳，表明当前多模态模型存在不足。

Insight: 多模态模型需进一步优化以适应真实场景，MuSaG为未来研究和模型对齐提供了重要资源。

Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.

[64] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability cs.CLPDF

Iván Martínez-Murillo, Paloma Moreda, Elena Lloret

TL;DR: 论文探讨了外部知识对自然语言生成（NLG）可解释性的影响，通过扩展CommonGen数据集创建KITGI基准，并使用T5-Large模型比较不同知识条件下的生成效果。结果显示，完整知识显著提升了生成的正确性。

Details

Motivation: 研究动机在于探索外部知识如何影响NLG的可解释性和生成质量，尤其是常识生成任务中知识的关键作用。

Result: 结果显示，完整知识条件下的生成正确率高达91%，而过滤知识后降至6%。

Insight: 研究强调了知识增强的NLG系统设计的重要性，并呼吁开发超越表面指标的评价框架。

Abstract: This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering reduced performance drastically to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.

[65] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? cs.CLPDF

Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich

TL;DR: 该论文研究了大型语言模型（LLMs）生成的解释是否能忠实反映其预测驱动因素。通过实验表明，few-shot示例的数量和质量、提示设计以及指令调优阶段都会显著影响模型的忠实性。

Details

Motivation: 在医疗等敏感领域中，LLMs生成的解释如果不够忠实，可能掩盖关键临床线索或依赖虚假捷径，从而影响临床医生的信任和决策安全性。因此，研究如何提升LLMs解释的忠实性至关重要。

Result: 结果表明：1）few-shot示例的数量和质量显著影响忠实性；2）提示设计对忠实性敏感；3）指令调优阶段在MedQA上提升了忠实性。

Insight: 论文为在敏感领域中提升LLMs的解释可信度和可解释性提供了实用策略，强调了few-shot设计和指令调优的重要性。

Abstract: Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

[66] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations cs.CLPDF

Ahmad Ghannam, Naif Alharthi, Faris Alasmary, Kholood Al Tabash, Shouq Sadah

TL;DR: 论文提出了一种多模态方法CATT-Whisper，结合文本和语音信息恢复阿拉伯方言句子中的变音符号，通过两种融合策略和随机语音输入去活化提升模型鲁棒性。

Details

Motivation: 阿拉伯方言中的变音符号恢复（DR）任务复杂且重要，传统基于文本的方法效果有限，引入语音模态可以补充信息。

Result: 开发集上WER为0.25，CER为0.9；测试集上WER为0.55，CER为0.13。

Insight: 多模态信息融合显著提升变音符号恢复任务性能，语音输入的去活化设计增强了模型的适应性。

Abstract: In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.

[67] From Memorization to Reasoning in the Spectrum of Loss Curvature cs.CL | cs.LGPDF

Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis

TL;DR: 该论文通过损失曲面曲率分析，揭示了Transformer模型中记忆化的表现形式，并提出了一种基于权重编辑的方法，能够有效抑制无关记忆数据，同时保持模型性能。

Details

Motivation: 研究Transformer模型中记忆化的表现形式，探索一种无需显式标签即可区分记忆化数据的方法，并提出一种更有效的去记忆化编辑技术。

Result: 权重编辑技术能够有效减少无关记忆数据，同时在语言模型任务中保持低困惑度。数学和事实检索任务性能显著下降，但开放书事实检索和逻辑推理能力未受影响。

Insight: 记忆化与非记忆化数据在权重空间中的表达可通过曲率区分；特定任务依赖的权重方向可能高度专业化且与记忆化无关。

Abstract: We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data’s activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.

[68] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent’s Internal Emergent Symbolic Representation? cs.CL | cs.ROPDF

Ziqi Ma, Sao Mai Nguyen, Philippe Xu

TL;DR: 本研究探讨了大型语言模型（LLM）是否能够将人类自然语言指令翻译为分层强化学习过程中涌现的内部符号表示，发现其性能对分区粒度和任务复杂性敏感。

Details

Motivation: 研究动机是探索LLMs在翻译人类指令为强化学习智能体内部符号表示方面的能力，以促进智能体的规划与泛化能力。

Result: 结果显示LLM具有一定的翻译能力，但性能受分区粒度和任务复杂性的显著影响。

Insight: 当前LLM在表示对齐方面存在不足，需进一步研究语言与智能体内部表示的鲁棒对齐方法。

Abstract: Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs – GPT, Claude, Deepseek and Grok – across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.

[69] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference cs.CLPDF

Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze

TL;DR: 论文提出了一种自动生成高质量NLI问题变体的方法MERGE，通过替换开放类词汇保留原问题的推理逻辑，测试模型的泛化能力。结果显示模型在这些变体上表现下降4-20%。

Details

Motivation: 现有的NLI模型在泛化能力上表现不足，但手动或自动生成高质量测试集成本高昂且困难。本文旨在通过最小化修改生成高质量的变体测试集，评估模型的真实泛化能力。

Result: 实验表明，NLI模型在变体问题上的性能下降4-20%，揭示了其泛化能力的不足。

Insight: 模型的泛化能力不仅依赖于表面词汇，还与替换词汇的类别、概率和合理性密切相关。

Abstract: In recent years, many generalization benchmarks have shown language models’ lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models’ predictions across reasoning-preserving variants of the original problem. Our results show that NLI models’ perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models’ performance.

[70] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards cs.CLPDF

Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren

TL;DR: 论文提出了Lookahead Tree-Based Rollouts（LATR）方法，通过分支和前瞻模拟增强轨迹多样性，显著提升强化学习中的策略学习和任务表现。

Details

Motivation: 当前强化学习在验证奖励（RLVR）中，轨迹多样性不足限制了策略学习效果，尤其是同质化轨迹导致信号减弱。

Result: 相比随机采样，LATR平均加速策略学习131%，最终pass@1性能提升4.2%。

Insight: 显式增强轨迹多样性可有效解决同质化问题，提升强化学习算法的性能和效率。

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.

[71] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning cs.CL | cs.AIPDF

Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang

TL;DR: Critique-RL提出了一种无需强监督的两阶段强化学习方法，通过优化评论语言模型的判别性和帮助性，提升复杂推理任务的性能。

Details

Motivation: 现有方法通常依赖强监督标注评论数据，限制了评论语言模型的开发和性能提升。

Result: 实验显示Critique-RL在多任务和多模型上性能显著提升，例如Qwen2.5-7B模型在域内和域外任务上分别提高了9.02%和5.70%。

Insight: 直接奖励信号对提升判别性至关重要，而间接奖励则能有效增强帮助性，两者结合通过两阶段优化能达到更好效果。

Abstract: Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor’s outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic’s helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.

[72] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants cs.CL | cs.AI | 68T50 | F.2.2; I.2.7PDF

Hunzalah Hassan Bhatti, Firoj Alam

TL;DR: 论文提出了一个开放的阿拉伯文化问答基准，包含方言变体，将标准阿拉伯语多选题转化为开放式问题，并评估了多种LLM的表现。研究发现LLM在方言和文化问题上表现不佳，开放式问题更具挑战性，且CoT能提升正确率但结果不统一。

Details

Motivation: 大型语言模型在多语言和文化背景下的问答表现不均，尤其是阿拉伯语方言和文化相关内容。研究旨在填补这一空白，提供更全面的评测基准。

Result: 1. LLM在方言和文化问题上表现较差；2. 阿拉伯中心模型在多选题表现好，但开放式问题较差；3. CoT提升正确率但结果不一致。

Insight: 方言和文化内容是LLM的薄弱环节，开放式问题更具挑战性；CoT能辅助模型推理，但需进一步优化其效果。

Abstract: Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.

[73] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability cs.CL | cs.AIPDF

Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma

TL;DR: LongWeave是一种长文本生成基准，通过Constraint-Verifier Evaluation (CoV-Eval)方法平衡现实世界相关性和可验证性评估，覆盖七种任务并支持自定义输入/输出长度。评估表明，即使最优模型在面对复杂性和长度时仍存在挑战。

Details

Motivation: 当前长文本生成的基准要么难以验证，要么忽略现实复杂性，亟需一种既能反映真实世界需求又可客观评估的方法。

Result: 评测23个LLM显示，模型在复杂性和长文本生成中表现显著受限。

Insight: 长文本生成的挑战不仅在于长度，还包括满足复杂现实约束的能力，设计可验证且现实的基准是关键。

Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce \textbf{LongWeave}, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.

[74] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models cs.CLPDF

Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu

TL;DR: SynthWorlds提出了一种框架，通过构造平行的真实映射世界和合成映射世界，分别评估语言模型的推理能力和知识记忆能力，解决了现有方法难以区分二者的局限性。

Details

Motivation: 当前评估语言模型推理能力的任务通常受其参数化知识的影响，导致性能反映的是知识记忆而非真正的推理能力。现有方法无法干净地区分这两种能力。

Result: 实验显示，语言模型在知识记忆和推理之间存在明显的知识优势差距，知识和检索机制的加入减少了但未消除这一差距。

Insight: SynthWorlds为评估语言模型的推理和记忆能力提供了精确可控的环境，揭示了知识和推理分离的可能性及其改进方向。

Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.

[75] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space cs.CL | cs.CVPDF

Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko

TL;DR: 论文提出了SPARTA方法，通过黑盒对抗性转述在文本自编码器潜在空间中评估推理分割的鲁棒性，揭示了现有模型在此类攻击下的脆弱性。

Details

Motivation: 现有研究主要关注图像输入的扰动，但文本输入的语义等效转述在真实应用中同样重要，如何通过转述攻击评估模型鲁棒性是一个未被充分探索的问题。

Result: SPARTA在ReasonSeg和LLMSeg-40k数据集上比现有方法高出2倍的成功率，揭示了现有推理分割模型在对抗转述下的脆弱性。

Insight: 即使语义和语法约束严格，现有推理分割模型仍易受对抗转述攻击，凸显了文本输入鲁棒性研究的重要性。

Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

[76] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices cs.CL | cs.AIPDF

Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić

TL;DR: 该论文提出了一种针对多语言或非英语使用场景的新分类法，并建议了一套最佳实践和质量标准，以促进欧洲语言评估基准的协调发展。

Details

Motivation: 由于大型语言模型（LLMs）在非英语语言中的评估仍是一个未充分开发的领域，需要更系统和协调的方法来填补这一空白。

Result: 论文推荐了几项措施，包括提高评估方法的语言和文化敏感性，以推动欧洲语言基准测试的协调发展。

Insight: 在非英语环境中评估LLM需要更注重语言和文化的多样性，通过系统化的方法可以提高基准测试的质量和代表性。

Abstract: While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.

[77] Iterative Critique-Refine Framework for Enhancing LLM Personalization cs.CL | cs.AI | cs.IRPDF

Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt

TL;DR: PerFine提出了一种无需训练的迭代批判-精炼框架，通过基于配置文件的反馈增强个性化文本生成，显著提升了生成内容的个性化质量。

Details

Motivation: 现有基于检索的个性化生成方法（如LaMP和PGraphRAG）在生成内容时容易出现风格或主题偏移，缺乏有效的反馈机制。PerFine旨在解决这一问题。

Result: 在Yelp、Goodreads和Amazon数据集上，PerFine比PGraphRAG提升了7-13%的个性化质量（GEval指标），且在3-5次迭代中表现稳定。

Insight: 基于配置文件的反馈机制为个性化文本生成提供了无需训练的高效解决方案，且具有模型无关性和可扩展性。

Abstract: Personalized text generation requires models not only to produce coherent text but also to align with a target user’s style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.

[78] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems cs.CL | cs.AIPDF

Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar, Mingming Liu

TL;DR: 这篇文章探讨了如何通过检索增强生成（RAG）、推理增强和智能代理系统来减轻大语言模型（LLM）的幻觉问题，提出了基于知识型和逻辑型的幻觉分类，并分析了各方法的协同潜力。

Details

Motivation: 幻觉问题是LLM可靠部署的主要障碍，尤其是在实际应用中。现有方法如RAG和推理增强虽有效，但其协同机制和深层潜力尚未系统研究。

Result: 文章通过实际应用和评测展示了RAG与推理增强的协同效应对缓解幻觉的有效性，并提供了统一的框架支持。

Insight: RAG和推理增强的结合不仅能缓解幻觉，还能平衡创造性和可靠性，为LLM的实际部署提供了新思路。

Abstract: Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.

[79] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks cs.CLPDF

Frederik Broy, Maike Züfle, Jan Niehues

TL;DR: 论文介绍了Talk2Ref数据集，用于从科学讲座中预测相关文献引用（RPT任务），并提出了一个双编码器架构以提高预测性能。结果表明，基于Talk2Ref的微调显著提升了性能。

Details

Motivation: 科学讲座是传播研究的重要途径，但自动识别讲座中引用的相关文献具有挑战性，对研究者和学生有价值。

Result: 微调后的模型显著提高了引用预测性能，证明了Talk2Ref的有效性。

Insight: 1. 科学讲座内容对引用预测任务具有挑战性；2. 领域适应训练是关键；3. Talk2Ref为语义表示学习提供了新途径。

Abstract: Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk’s corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.

[80] A word association network methodology for evaluating implicit biases in LLMs compared to humans cs.CL | cs.AIPDF

Katherine Abramski, Giulio Rossetti, Massimo Stella

TL;DR: 本文提出了一种基于词关联网络的新方法，用于评估LLM中的隐式偏见，并通过模拟语义启动直接比较LLM与人类的偏见。

Details

Motivation: LLM在社会中的广泛应用使其内在的隐式偏见成为重要问题，但目前缺乏有效的评估方法。

Result: 研究发现LLM与人类在某些偏见上存在一致和分歧，揭示了使用LLM的潜在风险。

Insight: 该方法为系统评估和比较LLM与人类的偏见提供了框架，有助于推动透明和负责任的语言技术发展。

Abstract: As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.

[81] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration? cs.CLPDF

Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu

TL;DR: 论文CritiCal探讨了自然语言评价如何帮助大语言模型（LLM）提升不确定性或置信度校准。通过提出Self-Critique和CritiCal两种方法，研究显示CritiCal在复杂推理任务中显著优于基线方法，甚至超越GPT-4o。

Details

Motivation: LLM的置信度校准对其在高风险领域的应用至关重要，但现有方法难以捕捉推理所需的准确置信度评估。

Result: CritiCal在开放和多项选择任务中表现优异，超越Self-Critique和其他基线方法，甚至在泛化任务中优于GPT-4o。

Insight: 自然语言评价是优化LLM置信度校准的有效工具，能够提升模型的可靠性和泛化能力。

Abstract: Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM’s reliability.

[82] Levée d’ambiguïtés par grammaires locales cs.CLPDF

Eric G. C. Laporte

TL;DR: 该论文提出了一种针对词性标注中零静默率目标的本地语法消歧方法，强调了多转换器交互验证的重要性，并指出语法规则需经过严格测试以避免潜在错误。

Details

Motivation: 词性标注中的歧义消解是自然语言处理的核心挑战之一，现有系统可能在唯一正确解无法找到时不丢弃任何可能解，但这种方法可能导致复杂问题。论文旨在解决如何在零静默率目标下实现更可靠的本地语法消歧。

Result: 研究表明，单独验证每条转换器路径不足以确保消歧效果，必须综合考量多路径交互。同时，零静默率目标下的语法规则需通过严格测试以避免潜在错误。

Insight: 消歧规则的设计需结合实际文本标注结果，避免仅依赖语法直觉。多转换器的协同作用比单独使用更复杂，需系统验证以确保消歧可靠性。

Abstract: Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein’s INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.

[83] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written cs.CLPDF

Venkata S Govindarajan, Laura Biester

TL;DR: 该论文研究了一种特殊的幽默类型——‘糟糕幽默’，并通过Bulwer-Lytton小说比赛的句子构建了新语料库。标准幽默检测模型在该语料上表现不佳，且这类句子融合了多种文学手法。

Details

Motivation: 研究旨在填补计算语言学中对‘糟糕幽默’的理解空白，探索其在文本中的表现形式及其独特性。

Result: 标准模型效果不佳；LLM虽能模仿形式，但过度使用某些手法，且形容词-名词组合过于新颖。

Insight: ‘糟糕幽默’是一个复杂的语言现象，需要更细致的模型设计；LLM在模仿风格时可能产生偏差。

Abstract: Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand “bad” humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton

[84] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts cs.CLPDF

Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin

TL;DR: 提出了Open Korean Historical Corpus，一个覆盖1300年、18百万文档的公开历史语料库，用于定量分析韩语的语言演变，并为大语言模型提供预训练资源。

Details

Motivation: 韩语的历史演变因其缺乏可访问的历史语料库而无法在自然语言处理（NLP）中进行深入研究。本文旨在填补这一空白。

Result: 发现了Idu使用的演变、Hanja到Hangul的快速转变，以及北韩词汇的显著差异对现代分词器的高OOV率影响。

Insight: 该语料库不仅支持历史语言学的研究，还可用于提升大语言模型对古韩文和现代韩语的理解能力。

Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.

[85] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization cs.CLPDF

Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song

TL;DR: ReForm提出了一种反射式自动形式化方法，通过迭代生成和语义一致性评估提升自然语言数学问题转换为机器可验证形式化语句的准确性，显著优于基线方法。

Details

Motivation: 现有大型语言模型（LLM）在自动形式化任务中缺乏语义意图保留能力，无法像人类专家一样通过反思和迭代修正错误。

Result: 在4个基准测试中平均提升17.2个百分点，ConsistencyCheck基准揭示人类专家语义错误率高达38.5%。

Insight: 自动形式化任务中语义一致性至关重要，但其难度较高，需要通过动态反思和优化机制提升准确性。

Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem’s semantic intent. This limitation arises from the LLM approaches’ treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

[86] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way cs.CLPDF

Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi

TL;DR: 这篇论文提出了一种扩散式大语言模型（dLLM）的自适应生成长度方法dLLM-Var，通过训练模型预测[EOS]标记解决了传统dLLM固定生成长度的问题。该方法既能保持高效并行性，又能灵活控制生成长度。

Details

Motivation: 现有扩散式大语言模型（dLLMs）需要预先设定固定生成长度，导致效率低下且灵活性不足，限制了实际应用的潜力。

Result: 实验表明，dLLM-Var比传统dLLM推理快30.1倍，比自回归模型（如Qwen和Llama）快2.4倍，同时在精度和速度上均有提升。

Insight: 通过动态生成长度设计，dLLMs不仅提升了学术价值，还为实际应用（如高效文本生成）提供了可行方案。

Abstract: Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.

[87] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs cs.CLPDF

Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

TL;DR: 论文提出了一种动态分层稀疏注意力（DHSA）方法，解决了长上下文LLM在资源受限环境中的扩展性问题，通过动态预测注意力稀疏性，显著降低了计算和内存开销。

Details

Motivation: 现有静态稀疏注意力方法（如滑动窗口或全局标记）无法适应内容相关的注意力变化，而动态方法依赖预定义模板或启发式机制，限制了通用性和准确性。

Result: 在Gemma2和LongBench上的实验表明，DHSA在精度上与密集注意力相当，预填充延迟降低20-60%，峰值内存使用减少35%，且相对基线准确率提升6-18%。

Insight: DHSA通过数据驱动的动态分段和归一化聚合，实现了效率和精度的平衡，为长上下文设备端LLM提供了高效且通用的解决方案。

Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilizes the sparsity of attention to reduce the cost of attention, but poorly adapts to the content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.

[88] “Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue cs.CLPDF

Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel

TL;DR: 论文提出了一种多模态模型，结合语言和韵律特征检测荷兰对话中的修复请求，结果表明韵律特征显著提升了性能。

Details

Motivation: 为了解决对话系统中无法识别用户修复请求导致的对话中断问题，研究旨在通过多模态方法提升修复请求的检测能力。

Result: 实验结果表明，韵律特征显著提升了修复请求检测的性能，揭示了多模态特征的互补性。

Insight: 韵律特征是检测修复请求的重要线索，未来可探索视觉信息和多语言/跨语境数据以提升模型的泛化能力。

Abstract: Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.

[89] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning cs.CLPDF

Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun

TL;DR: 论文提出了一种名为OpenReward的工具增强奖励模型，用于评估知识密集型和长任务，通过调用外部工具收集证据，显著优于现有奖励模型，并在下游LLM对齐任务中表现出色。

Details

Motivation: 现有奖励模型在知识密集型和长任务中表现不佳，因其难以依赖外部证据进行正确性评估，需要一种能系统化地利用外部工具的新方法来提升评估可靠性。

Result: OpenRM在多个数据集和基准测试中显著优于现有奖励模型，并能提升下游LLM对齐任务的性能。

Insight: 工具增强的奖励模型可以有效解决长任务和知识密集型任务的评估挑战，为LLM对齐提供了新的技术路径。

Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model’s internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.

[90] Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia cs.CL | q-bio.NCPDF

Hugo Rydel-Johnston, Alex Kafkas

TL;DR: 这篇论文通过大规模自然阅读数据集和眼动追踪技术，量化了单词长度、频率和可预测性对阅读障碍（dyslexia）的影响。研究发现这些特征对阅读时间的影响在正常读者和阅读障碍读者中均显著，而阅读障碍读者的敏感性更强。预测性对缩小阅读障碍与控制组差距的作用最大。

Details

Motivation: 研究动机在于明确阅读障碍患者在自然阅读中面临的额外时间成本的具体来源，以及这些成本如何受单词特征（如长度、频率和可预测性）的影响，从而为干预措施和计算模型提供依据。

Result: 结果显示，所有三个特征均显著影响阅读时间，阅读障碍读者对这些特征的敏感性更强。预测性是缩小阅读障碍与控制组差距的最主要因素，其次是单词长度和频率。

Insight: 研究结果支持了阅读障碍理论中关于语言工作记忆和语音编码需求增加的假设。未来研究可以进一步探索词汇复杂性和副中央凹预览效益对剩余差距的解释。

Abstract: We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.

[91] Optimizing Retrieval for RAG via Reinforced Contrastive Learning cs.CL | cs.IRPDF

Jiawei Zhou, Lei Chen

TL;DR: 论文提出了一种名为R3的检索框架，通过强化对比学习动态优化RAG系统中的检索性能，无需依赖标注或合成数据。

Details

Motivation: 随着RAG技术的普及，传统的信息检索方法难以直接定义或标注相关性目标，需要一种新的方法来动态优化检索性能。

Result: 在多种任务中，R3的RAG性能提升5.2%，超过当前最优检索器4.9%，同时效率高，仅需4颗GPU和一天完成训练。

Insight: 强化对比学习可有效优化检索器在RAG中的性能，且避免了传统方法的标注依赖问题。

Abstract: As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever’s self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.

[92] Evolving Diagnostic Agents in a Virtual Clinical Environment cs.CLPDF

Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao

TL;DR: 本文提出了一个通过强化学习训练大语言模型（LLM）作为诊断代理的框架，能够在虚拟临床环境中进行交互式诊断训练，显著优于现有模型。

Details

Motivation: 传统的指令微调模型基于静态案例摘要训练，缺乏动态诊断能力，而本文旨在通过交互式探索和反馈学习更有效的诊断策略。

Result: DiagAgent在诊断准确率和检查推荐命中率上显著优于DeepSeek-v3、GPT-4o等10种顶尖模型及两种提示工程代理。

Insight: 交互式临床环境训练能够赋予模型动态诊断能力，而非被动训练所能达到的效果。

Abstract: In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.

[93] MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation cs.CLPDF

Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska

TL;DR: MQM重新标注是一种协作评估机器翻译质量的技术，通过两阶段的MQM标注过程提升评估质量，减少遗漏错误。

Details

Motivation: 随着机器翻译模型质量的提升，传统的评估方法因噪声问题已无法准确衡量质量改进。因此，研究者提出MQM重新标注技术，旨在通过协作标注减少评估中的遗漏错误。

Result: 实验表明，重新标注行为符合预期目标，显著提升了标注质量，尤其是在发现首次标注中遗漏的错误方面表现突出。

Insight: 协作标注机制可以有效减少评估噪声，特别适用于高质量翻译模型的评估场景。

Abstract: Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.

[94] InteractComp: Evaluating Search Agents With Ambiguous Queries cs.CL | cs.AIPDF

Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren

TL;DR: InteractComp是一个新的基准测试，旨在评估搜索代理是否能识别查询的模糊性并通过交互解决。研究发现现有模型在此任务上表现不佳，揭示了系统性过度自信问题。

Details

Motivation: 现实中的用户查询往往是模糊或不完整的，而现有搜索代理缺乏交互机制，无法有效解决这一问题。现有的基准测试也无法评估这种能力。

Result: 最好的模型在模糊查询下的准确率仅为13.73%，而在完整上下文下的准确率为71.50%。强制交互能显著提升性能。

Insight: 模型的交互能力在过去15个月中停滞不前，而搜索性能提升了七倍，揭示了这一领域的盲点。

Abstract: Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.

[95] Dissecting Role Cognition in Medical LLMs via Neuronal Ablation cs.CL | cs.AIPDF

Xun Liang, Huayi Lai, Hanyu Wang, Wentao Zhang, Linfeng Zhang

TL;DR: 该论文研究了医学大型语言模型（LLM）中角色提示（Prompt-Based Role Playing, PBRP）对模型推理能力的影响，发现角色提示仅改变表面语言风格，未显著提升医学推理能力。

Details

Motivation: 探讨角色提示是否能在医学LLM中引发角色特定的认知过程，抑或仅影响语言风格。

Result: 角色提示未显著增强医学推理能力，仅影响表面语言特征，核心决策机制在不同角色中保持一致。

Insight: 当前PBRP方法无法模拟真实医学实践中的认知复杂性，需开发能模拟真实认知过程的模型。

Abstract: Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework(RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation.We have released the related code in the following repository:https: //github.com/IAAR-Shanghai/RolePlay_LLMDoctor

[96] SPICE: Self-Play In Corpus Environments Improves Reasoning cs.CLPDF

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao

TL;DR: SPICE是一个通过对抗性自博弈在大型语料库环境中提升推理能力的强化学习框架，Challenger和Reasoner角色交互生成的挑战性任务推动了模型的持续改进。

Details

Motivation: 现有基于无ground的自博弈方法在任务生成和改进方面存在局限性，而通过与语料库的交互可以提供更丰富和持续的外部信号。

Result: 在多个模型族上的实验显示，SPICE在数学和通用推理基准上分别取得了8.9%和9.8%的提升。

Insight: 语料库的grounding是SPICE能够持续生成并解决挑战性任务的关键，推动了模型的自我改进。

Abstract: Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner’s capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.

[97] Repurposing Synthetic Data for Fine-grained Search Agent Supervision cs.CL | cs.AIPDF

Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang

TL;DR: 本文提出了一种名为E-GRPO的新方法，通过利用实体信息改进LLM搜索代理的训练，显著提升了性能。

Details

Motivation: 当前训练LLM搜索代理的方法（如GRPO）忽略了丰富的实体信息，仅依赖稀疏的结果奖励，导致无法区分部分正确的“近似错误”样本，浪费了学习信号。

Result: 在多个QA和研究基准测试中，E-GRPO显著优于GRPO基线，且推理效率更高，工具调用次数更少。

Insight: 实体信息与最终答案准确性高度相关，利用这些信息可以显著提升代理的学习效率和性能。

Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative “near-miss” samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent’s reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these “near-misses”. Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.

[98] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis cs.CLPDF

Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang

TL;DR: 论文提出了基于教育理论‘最近发展区’（ZPD）的数据合成方法AgentFrontier，通过自动化流水线生成高质量多学科数据，提升LLM代理在能力边界任务上的表现。

Details

Motivation: 当前LLM代理在能力边界任务上的表现受限，需要更有效的数据合成方法以提高复杂推理能力。

Result: AgentFrontier-30B-A3B在Humanity’s Last Exam等基准上达到SOTA性能，超越部分领先专有代理。

Insight: ZPD指导的数据合成为提升LLM代理能力提供了可扩展且有效的途径。

Abstract: Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM’s ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity’s Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.

[99] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking cs.CLPDF

Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu

TL;DR: WebLeaper是一个基于大语言模型（LLM）的信息搜索（IS）代理框架，旨在通过提高搜索效率和有效性来解决当前IS代理在训练任务中目标实体稀疏的问题。

Details

Motivation: 当前基于LLM的信息搜索代理存在搜索效率低的问题，限制了整体性能，原因在于训练任务中目标实体稀疏，无法高效学习和泛化搜索行为。

Result: 在五个基准测试（BrowserComp、GAIA、xbench-DeepSearch、WideSearch、Seal-0）中，WebLeaper在效率和有效性上均显著优于现有基线方法。

Insight: 通过提高任务的覆盖范围和生成高效轨迹，可以有效优化信息搜索代理的性能，尤其是在实体稀疏的情况下。

Abstract: Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.

[100] ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking cs.CL | cs.AIPDF

Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao

TL;DR: ParallelMuse提出了一种两阶段范式，通过功能性指定部分展开和压缩推理聚合，提升了信息搜索代理的效率和性能。

Details

Motivation: 传统的并行思考在信息搜索中存在效率不足和长时推理轨迹整合困难的问题，限制了深度信息搜索代理的能力。

Result: 在多组实验基准上实现了最高62%的性能提升，同时减少了10-30%的探索性token消耗。

Insight: 通过功能分区和冗余压缩，ParallelMuse成功解决了长时推理中的效率和信息整合问题，为深度信息搜索代理提供了新思路。

Abstract: Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10–30% reduction in exploratory token consumption.

[101] Tongyi DeepResearch Technical Report cs.CL | cs.AI | cs.IR | cs.LG | cs.MAPDF

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang

TL;DR: 论文介绍了Tongyi DeepResearch，这是一个专为长期、深度信息检索任务设计的代理大语言模型，通过端到端训练框架实现自主深度研究能力，并在多任务基准测试中达到最优性能。

Details

Motivation: 现有的语言模型在处理长期、复杂的信息检索和研究任务时存在局限性，需要一种能够自主推理和检索信息的专业代理模型。

Result: Tongyi DeepResearch在多个代理深度研究基准测试中达到最优性能，模型参数量为305亿，每token激活33亿参数。

Insight: 通过自动化数据合成和定制化环境设计，可以在无需人工标注的情况下高效训练大规模代理语言模型。

Abstract: We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity’s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

[102] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents cs.CL | cs.AIPDF

Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou

TL;DR: 论文提出了Agent Data Protocol（ADP），一种轻量级表示语言，用于统一异构格式的智能体数据集，从而提高大规模监督微调的效果。通过将13个现有数据集转换为ADP格式，实验显示性能平均提升20%。

Details

Motivation: 当前智能体训练数据的多样性导致数据格式碎片化，阻碍了大规模监督微调的研究进展。

Result: 模型在编码、浏览、工具使用等任务上平均提升20%，接近或达到SOTA性能。

Insight: ADP降低了智能体训练的标准化和可复现性门槛，推动了大规模监督微调的发展。

Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an “interlingua” between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

[103] ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games? cs.CL | cs.AI | cs.HC | cs.SEPDF

Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng

TL;DR: ComboBench评估大型语言模型（LLMs）在虚拟现实（VR）游戏中将语义动作转换为设备操控序列的能力，发现顶级模型如Gemini-1.5-Pro表现良好，但在程序推理和空间理解上仍落后于人类。

Details

Motivation: VR游戏需要玩家将高层次语义动作转换为设备操控序列，目前LLMs是否具备这种能力尚不明确。

Result: Gemini-1.5-Pro表现最佳，但在程序推理和空间理解上仍落后于人类；Few-shot示例显著提升性能。

Insight: LLMs在VR任务中的表现受交互复杂度影响，Few-shot学习是提升性能的有效途径。

Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs’ capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs’ VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.

eess.IV [Back]

[104] MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images eess.IV | cs.CVPDF

Ovi Sarkar, Md Shafiuzzaman, Md. Faysal Ahamed, Golam Mahmud, Muhammad E. H. Chowdhury

TL;DR: 该论文提出了MSRANetV2，一种基于ResNet50V2的卷积神经网络，结合残差注意力机制和SE块，用于结直肠组织图像的多分类，具有高性能和可解释性。

Details

Motivation: 结直肠癌诊断传统方法存在主观性和耗时问题，数字病理学与深度学习结合可以提升诊断精确度和效率。

Result: 在两个数据集上表现优异，平均精度、召回率、F1分数、AUC和测试准确率均接近99%。

Insight: 多尺度特征融合和注意力机制显著提升了分类性能，Grad-CAM增强了模型的可解释性。

Abstract: Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model on a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 plus-minus 0.0151, 0.9900 plus-minus 0.0151, 0.9900 plus-minus 0.0145, 0.9999 plus-minus 0.00006, and 0.9905 plus-minus 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 plus-minus 0.0091, 0.9900 plus-minus 0.0071, 0.9900 plus-minus 0.0071, 0.9997 plus-minus 0.00016, and 0.9902 plus-minus 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.

cs.LG [Back]

[105] An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis cs.LG | cs.AI | cs.CLPDF

Phuong Q. Dao, Mark Roantree, Vuong M. Ngo

TL;DR: 论文提出了一种新的多模态情感分析方法，结合BERT和ViT的早期融合策略，并使用对比学习增强跨模态对齐，性能优于现有方法。

Details

Motivation: 多模态情感分析（MSA）需要更有效地结合文本和图像信息，以提高情感理解的准确性和鲁棒性。

Result: 在TumEmo和MVSA-Single数据集上，DTCN分别达到78.4%/78.3%和76.6%/75.9%的准确率和F1分数，优于基线。

Insight: 早期融合和对比学习在多模态任务中能显著提升性能，尤其在跨模态表征对齐方面具有潜力。

Abstract: Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities typically text and images offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders BERT for textual input and ViT for visual input through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model’s capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature learning. Empirical results on two widely used MSA benchmarks MVSA-Single and TumEmo demonstrate the effectiveness of our approach. DTCN achieves best accuracy (78.4%) and F1-score (78.3%) on TumEmo, and delivers competitive performance on MVSA-Single, with 76.6% accuracy and 75.9% F1-score. These improvements highlight the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.

Shuang Geng, Wenli Zhang, Jiaheng Xie, Rui Wang, Sudha Ram

TL;DR: 该论文提出了一种闭环的大型语言模型（LLM）-知识图框架，通过社交媒体内容同时进行抑郁检测和医学知识扩展。

Details

Motivation: 现有研究利用医学知识提高抑郁预测准确性，但忽略了通过预测过程扩展知识的机会。本文旨在整合预测和知识扩展，实现动态的医学知识演化。

Result: 实验表明，该框架不仅提高了预测准确性，还发现了与抑郁相关的临床有意义的新症状、共病和社会触发因素。

Insight: 研究展示了预测和学习过程的相互强化，为动态风险监测提供了方法论和理论基础。

Abstract: Social media user-generated content (UGC) provides real-time, self-reported indicators of mental health conditions such as depression, offering a valuable source for predictive analytics. While prior studies integrate medical knowledge to improve prediction accuracy, they overlook the opportunity to simultaneously expand such knowledge through predictive processes. We develop a Closed-Loop Large Language Model (LLM)-Knowledge Graph framework that integrates prediction and knowledge expansion in an iterative learning cycle. In the knowledge-aware depression detection phase, the LLM jointly performs depression detection and entity extraction, while the knowledge graph represents and weights these entities to refine prediction performance. In the knowledge refinement and expansion phase, new entities, relationships, and entity types extracted by the LLM are incorporated into the knowledge graph under expert supervision, enabling continual knowledge evolution. Using large-scale UGC, the framework enhances both predictive accuracy and medical understanding. Expert evaluations confirmed the discovery of clinically meaningful symptoms, comorbidities, and social triggers complementary to existing literature. We conceptualize and operationalize prediction-through-learning and learning-through-prediction as mutually reinforcing processes, advancing both methodological and theoretical understanding in predictive analytics. The framework demonstrates the co-evolution of computational models and domain knowledge, offering a foundation for adaptive, data-driven knowledge systems applicable to other dynamic risk monitoring contexts.

[107] What do vision-language models see in the context? Investigating multimodal in-context learning cs.LG | cs.CVPDF

Gabriel O. dos Santos, Esther Colombini, Sandra Avila

TL;DR: 该论文系统研究了视觉语言模型（VLMs）的上下文学习能力，揭示了训练数据和模型架构对性能的影响，并发现当前VLMs在视觉与文本信息整合上存在局限。

Details

Motivation: 上下文学习在大型语言模型中已被广泛研究，但在视觉语言模型中的应用和效果尚不明确，研究旨在填补这一空白。

Result: 训练于图像-文本交织数据能提升ICL性能，但视觉信息整合不足；指令调优虽改进指令遵循，却减少对上下文示例的依赖。

Insight: 当前VLMs主要依赖文本信息且视觉整合能力有限，改进多模态ICL需要更有效的跨模态交互机制。

Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.

[108] NUM2EVENT: Interpretable Event Reasoning from Numerical time-series cs.LG | cs.AI | cs.CLPDF

Ninghui Feng, Yiyan Qi

TL;DR: 论文提出了NUM2EVENT任务，旨在从数值时间序列中推断可解释的结构化事件，并通过提出的框架实现了优于LLM基线的事件级性能。

Details

Motivation: 现有方法主要关注数值时间序列的预测或趋势描述，而忽视了驱动数值变化的潜在事件及其推理过程，因此需要填补这一研究空白。

Result: 在多个数据集上，方法在事件级准确率和召回率上显著优于LLM基线。

Insight: 论文为数值推理和语义理解的结合提供了新方向，使LLM能够直接从数值动态中解释和预测事件。

Abstract: Large language models (LLMs) have recently demonstrated impressive multimodal reasoning capabilities, yet their understanding of purely numerical time-series signals remains limited. Existing approaches mainly focus on forecasting or trend description, without uncovering the latent events that drive numerical changes or explaining the reasoning process behind them. In this work, we introduce the task of number-to-event reasoning and decoding, which aims to infer interpretable structured events from numerical inputs, even when current text is unavailable. To address the data scarcity and semantic alignment challenges, we propose a reasoning-aware framework that integrates an agent-guided event extractor (AGE), a marked multivariate Hawkes-based synthetic generator (EveDTS), and a two-stage fine-tuning pipeline combining a time-series encoder with a structured decoder. Our model explicitly reasons over numerical changes, generates intermediate explanations, and outputs structured event hypotheses. Experiments on multi-domain datasets show that our method substantially outperforms strong LLM baselines in event-level precision and recall. These results suggest a new direction for bridging quantitative reasoning and semantic understanding, enabling LLMs to explain and predict events directly from numerical dynamics.

[109] Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation cs.LG | cs.AI | cs.CLPDF

Thaweerath Phisannupawong, Joshua Julian Damanik, Han-Lim Choi

TL;DR: 该论文提出了一种基于大型语言模型的多模态航班延误预测方法，通过融合飞机轨迹表示和文本航空信息，实现高效的延误预测。

Details

Motivation: 航班延误已成为空中交通管理的关键问题，影响了网络整体性能。现有方法未能充分利用多模态信息（如轨迹数据和文本信息）来提升预测精度。

Result: 实验结果表明，该方法能实现亚分钟级的预测误差，具有较高的实用性和可扩展性。

Insight: 研究表明，将轨迹数据与语言模态结合可以有效提升延误预测性能，同时支持实时更新，适用于实际应用。

Abstract: Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. Experimental results show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory information, enhances delay prediction. Moreover, the approach shows practicality and scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.

[110] MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection cs.LG | cs.CLPDF

Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg

TL;DR: 论文提出MUStReason基准，用于评估视频语言模型在多模态讽刺检测中的语用推理能力，并引入PragCoT框架以提高模型对隐含意图的捕捉能力。

Details

Motivation: 讽刺检测需要跨模态的复杂推理，而当前多模态模型在此任务上表现不佳，因此需要一个专门的基准来诊断模型的局限性。

Result: MUStReason提供了定量和定性的评估结果，PragCoT框架显著提升了模型在讽刺检测任务中的表现。

Insight: 讽刺检测的成功依赖于对隐含意图的捕捉，而当前模型需进一步优化跨模态推理能力，PragCoT为此提供了一种有效方法。

Abstract: Sarcasm is a specific type of irony which involves discerning what is said from what is meant. Detecting sarcasm depends not only on the literal content of an utterance but also on non-verbal cues such as speaker’s tonality, facial expressions and conversational context. However, current multimodal models struggle with complex tasks like sarcasm detection, which require identifying relevant cues across modalities and pragmatically reasoning over them to infer the speaker’s intention. To explore these limitations in VideoLMs, we introduce MUStReason, a diagnostic benchmark enriched with annotations of modality-specific relevant cues and underlying reasoning steps to identify sarcastic intent. In addition to benchmarking sarcasm classification performance in VideoLMs, using MUStReason we quantitatively and qualitatively evaluate the generated reasoning by disentangling the problem into perception and reasoning, we propose PragCoT, a framework that steers VideoLMs to focus on implied intentions over literal meaning, a property core to detecting sarcasm.

[111] GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA cs.LG | cs.CLPDF

Zhichao Wang

TL;DR: GIFT是一种新颖的强化学习框架，通过最小化隐式和显式奖励模型的差异对齐LLMs，结合了GRPO、DPO和UNA的核心思想。

Details

Motivation: 现有方法（如PPO、GRPO）直接最大化累积奖励，而GIFT通过隐式与显式奖励的对齐，解决了隐式奖励难以有效利用的问题。

Result: GIFT在数学基准测试中表现优异，训练效率高，收敛快且泛化能力强。

Insight: 归一化隐式和显式奖励避免了非凸优化问题，提升了模型的稳定性和可微性。

Abstract: I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.

[112] GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research cs.LG | cs.CLPDF

Xinqi Li, Yiqun Liu, Shan Jiang, Enrong Zheng, Huaijin Zheng

TL;DR: GraphNet是一个包含2.7K个真实计算图的数据集，支持张量编译器研究，并提出了Speedup Score S(t)和Error-aware Speedup Score ES(t)两种评测指标。

Details

Motivation: 现有张量编译器研究缺乏大规模、多样化的真实计算图数据集和可靠的评测指标。

Result: 在CV和NLP任务上评测了CINN和TorchInductor，展示了GraphNet的实用性。

Insight: GraphNet为张量编译器优化提供了标准化评测基准，助力开发者发现性能瓶颈。

Abstract: We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .

cs.SE [Back]

[113] Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation cs.SE | cs.CLPDF

Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu

TL;DR: 论文提出了PRDBench，一种基于代理驱动标注的评估方法，用于自动化测试代码代理的能力。该方法解决了现有评估方法的高标注成本和刚性评价标准的问题。

Details

Motivation: 现有代码代理评估方法面临标注成本高和评价标准单一的问题，限制了评估的多样性和复杂性。

Result: 实验证明PRDBench能有效评估代码代理和评估代理的能力，为标注和评估提供了可扩展框架。

Insight: 代理驱动的标注和Agent-as-a-Judge范式为复杂任务评估提供了新思路，可推广至其他领域。

Abstract: Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.

cs.RO [Back]

Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu

TL;DR: RoboOmni是一个基于多模态大语言模型（MLLMs）的机器人操作框架，能够通过语音、环境声音和视觉信号推断用户意图，无需依赖显式指令。

Details

Motivation: 当前机器人操作主要依赖显式指令，而真实场景中人类意图通常隐含在多模态信号中，需要机器人主动推断。

Result: RoboOmni在仿真和真实场景中均优于基于文本和ASR的基线方法，成功率和推理速度显著提升。

Insight: 机器人需要融合多模态信号（如声音和视觉）来主动理解意图，而非仅依赖显式指令，这对未来人机交互至关重要。

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.

[115] ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring cs.RO | cs.CVPDF

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen

TL;DR: ZTRS提出了一个零模仿学习的端到端自动驾驶框架，通过轨迹评分直接从传感器输入学习规划，避免了模仿学习的局限性，并利用离线强化学习实现了高性能。

Details

Motivation: 传统端到端自动驾驶主要依赖模仿学习（IL），但IL受限于专家演示质量和实际部署时的协变量偏移。强化学习（RL）虽然在高维传感器数据上的潜力未被充分挖掘。

Result: ZTRS在Navhard上达到SOTA性能，在HUGSIM上超越基于IL的基线方法。

Insight: 该方法展示了直接从传感器输入学习规划的可行性，避免了模仿学习的局限性，同时通过RL训练的鲁棒性解决了高维输入的挑战。

Abstract: End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.

[116] DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation cs.RO | cs.AI | cs.CVPDF

Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li

TL;DR: DynaRend 是一种表示学习框架，通过掩码重构和未来预测学习 3D 感知的动态特征，显著提升机器人操作的泛化能力。

Details

Motivation: 现有的自监督表示学习方法多聚焦于 2D 视觉或无结构的视频预测，未能联合学习几何、语义和动态信息，限制了机器人操作的泛化性。

Result: 在 RLBench、Colosseum 和真实世界实验中，DynaRend 显著提高了策略成功率和环境扰动下的泛化能力。

Insight: 联合学习几何、语义和动态信息的三平面特征表示可以有效提升机器人操作的泛化性和实际应用能力。

Abstract: Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.

Mingyu Jeong, Eunsung Kim, Sehun Park, Andrew Jaeyong Choi

TL;DR: NVSim是一个自动化构建大型可导航室内模拟器的框架，仅需常见图像序列即可生成，解决了传统3D扫描的成本和扩展性问题。

Details

Motivation: 传统3D扫描方法成本高且难以扩展，而机器人导航数据中稀疏观测的地面常导致视觉伪影问题，需要新的解决方案。

Result: 展示了系统从真实数据生成有效大规模导航图的能力。

Insight: 通过直接分析渲染视图构建拓扑图，避免了传统网格方法的复杂性，提升了可扩展性和实用性。

Abstract: We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system’s ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is avilable at https://youtu.be/tTiIQt6nXC8

[118] GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization cs.RO | cs.CVPDF

Nicolai Steinke, Daniel Goehring

TL;DR: GroundLoc提出了一种基于LiDAR的大规模室外定位方法，通过BEV投影和关键点提取技术，在多个数据集和传感器上表现优异，支持实时运行且存储需求低。

Details

Motivation: 传统的LiDAR定位方法在大规模环境中面临计算复杂和存储需求高的挑战，GroundLoc旨在提供一种高效且实用的解决方案。

Result: 在SemanticKITTI和HeLiPR数据集上表现优异，ATE低于50cm，且满足实时运行需求。

Insight: 通过BEV投影和轻量化存储设计，GroundLoc展示了LiDAR定位在大规模场景中的高效性和实用性。

Abstract: In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird’s-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at https://github.com/dcmlr/groundloc.

eess.AS [Back]

[119] Listening without Looking: Modality Bias in Audio-Visual Captioning eess.AS | cs.CV | eess.IVPDF

Yuchi Ishikawa, Toranosuke Manabe, Tatsuya Komatsu, Yoshimitsu Aoki

TL;DR: 论文探讨了音频-视觉字幕生成模型中的模态偏见问题，发现现有模型LAVCap对音频流的依赖远高于视觉流。通过模态鲁棒性测试和新的数据集AudioVisualCaps，研究表明模型在新数据上训练后可减少模态偏见。

Details

Motivation: 现有音频-视觉字幕模型虽在多模态融合上取得进展，但其对音频和视觉模态的互补性和鲁棒性尚不明确，尤其在某一模态受损时的表现缺乏研究。

Result: LAVCap在AudioVisualCaps上训练后表现出更低的模态偏见，证明了新数据集的有效性。

Insight: 当前的音频-视觉字幕模型可能过度依赖单一模态（如音频），而通过精心设计的数据集可以有效缓解这一问题。

Abstract: Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.

cs.AI [Back]

[120] Why Foundation Models in Pathology Are Failing cs.AI | cs.CVPDF

Hamid R. Tizhoosh

TL;DR: 这篇短文讨论了病理学中基础模型(FMs)的失败原因，指出其在癌症诊断等领域表现不佳，并提出这是由于通用AI基础模型假设与人类组织的复杂性不匹配所致。

Details

Motivation: 基础模型在非医学领域取得了巨大成功，但在病理学中的应用却表现不佳。作者希望通过分析这些失败的原因，推动领域内模型的重新设计。

Result: 研究发现当前模型诊断准确性低、鲁棒性差、计算量大，且存在安全隐患。

Insight: 病理学基础模型的失败反映了通用AI假设与组织学复杂性之间的不匹配，需设计更符合病理学特性的新方法。

Abstract: In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.

[121] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows cs.AI | cs.CL | cs.CV | cs.HCPDF

Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu

TL;DR: OS-Sentinel是一种混合安全检测框架，结合形式化验证和基于VLM的上下文判断，显著提升了移动GUI代理的安全性。

Details

Motivation: 尽管基于视觉语言模型（VLM）的移动代理展现了类似人类的能力，但其潜在的不安全操作（如系统破坏和隐私泄露）引发严重担忧，现有研究在这一领域探索不足。

Result: 实验表明，OS-Sentinel在多指标上比现有方法提升了10%-30%。

Insight: 研究表明，结合形式化方法和上下文感知的VLM可以显著提升移动代理的安全性，为开发更可靠的自主代理提供了关键洞见。

Abstract: Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.

[122] Latent Chain-of-Thought for Visual Reasoning cs.AI | cs.CLPDF

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat

TL;DR: 论文提出了一种基于隐式思维链（Latent CoT）的可解释性和可靠性强的视觉推理方法，通过变分推断和多样性强化学习克服了传统方法的局限性。

Details

Motivation: 当前大型视觉语言模型（LVLMs）的推理方法（如SFT、PPO、GRPO）在未见任务上泛化能力有限，且依赖于有偏的奖励模型。

Result: 在七个推理基准上提升了LVLMs的性能，增强了模型的泛化能力和可解释性。

Insight: 隐式思维链和贝叶斯推断的结合能够更好地解决推理任务的多样性和泛化问题。

Abstract: Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.

cs.SD [Back]

[123] Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes cs.SD | cs.CV | eess.AS | eess.IVPDF

Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof, Bastian Sigrist, Philipp Fürnstahl

TL;DR: 该论文提出了一种新颖的框架，通过整合3D声学信息和视觉数据，生成手术场景的4D音视频表示，用于空间映射手术动作。

Details

Motivation: 现有的手术场景理解方法主要依赖视觉数据或端到端学习，限制了细粒度上下文建模。该工作旨在通过整合3D声学信息，增强手术场景的表征能力。

Result: 实验表明，该方法能够成功在3D空间中定位手术声学事件，并与视觉场景元素关联，实现了多模态数据的准确空间定位和鲁棒融合。

Insight: 该工作展示了声学信息在手术场景理解中的潜力，为未来智能手术系统提供了多模态数据融合的范例。

Abstract: Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.

[124] STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence cs.SD | cs.CL | eess.ASPDF

Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan

TL;DR: STAR-Bench是一个新的音频基准测试，专注于评估模型在时间和3D空间中对声音动态的推理能力（音频4D智能），揭示了现有模型在细粒度感知推理上的缺陷。

Details

Motivation: 现有音频基准主要测试可从文本标题恢复的语义，缺乏对细粒度感知推理的评估。STAR-Bench填补了这一空白。

Result: 评测19个模型显示，人类与模型在时空推理上差距显著（时间-31.5%，空间-35.2%），开源模型全面落后。

Insight: 封闭源模型受限于细粒度感知，开源模型在感知、知识和推理上均落后，为未来模型开发提供了方向。

Abstract: Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.

Table of Contents

cs.CV [Back]

[1] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices cs.CV | cs.AI | eess.IVPDF

[2] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting cs.CV | cs.AIPDF

[3] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras cs.CV | cs.AIPDF

[4] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features cs.CVPDF

[5] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning cs.CV | cs.AI | cs.LGPDF

[6] TurboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis cs.CVPDF

[7] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors cs.CVPDF

[8] Adaptive Training of INRs via Pruning and Densification cs.CVPDF

[9] Neural USD: An object-centric framework for iterative editing and control cs.CV | cs.AIPDF

[10] SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability cs.CV | cs.AI | cs.CRPDF

[11] Reasoning Visual Language Model for Chest X-Ray Analysis cs.CVPDF

[12] Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints cs.CVPDF

[13] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild cs.CVPDF

[14] AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization cs.CVPDF

[15] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks cs.CV | cs.AI | cs.LGPDF

[16] ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning cs.CV | cs.AIPDF

[17] Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models cs.CV | cs.LGPDF

[18] Enhancing CLIP Robustness via Cross-Modality Alignment cs.CVPDF

[19] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification cs.CVPDF

[20] OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation cs.CVPDF

[21] Enhancing Pre-trained Representation Classifiability can Boost its Interpretability cs.CV | cs.LGPDF

[22] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations cs.CVPDF

[23] DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery cs.CVPDF

[24] Compositional Image Synthesis with Inference-Time Scaling cs.CV | cs.AIPDF

[25] VC4VG: Optimizing Video Captions for Text-to-Video Generation cs.CV | cs.AI | cs.CLPDF

[26] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning cs.CV | cs.AIPDF

[27] Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 cs.CVPDF

[28] CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation cs.CVPDF

[29] MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration cs.CVPDF

[30] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs cs.CVPDF

[31] Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation cs.CVPDF

[32] UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation cs.CV | cs.LGPDF

[33] Training-free Source Attribution of AI-generated Images via Resynthesis cs.CV | cs.AIPDF

[34] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model cs.CV | cs.AI | cs.CLPDF

[35] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning cs.CV | cs.AIPDF

[36] Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation cs.CVPDF

[37] Decoupling What to Count and Where to See for Referring Expression Counting cs.CVPDF

[38] A Hybrid Approach for Visual Multi-Object Tracking cs.CV | cs.ROPDF

[39] Rethinking Visual Intelligence: Insights from Video Pretraining cs.CV | cs.AI | 68T07, 68T45, 68T20 | I.2.10; I.4.8; I.5.1; I.2.6PDF

[40] A Critical Study towards the Detection of Parkinsons Disease using ML Technologies cs.CVPDF

[41] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras cs.CVPDF

[42] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs cs.CV | cs.CLPDF

[43] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents cs.CVPDF

[44] Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT cs.CV | I.4.5; I.5PDF

[45] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips cs.CVPDF

[46] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? cs.CV | cs.AI | cs.LG | q-bio.NCPDF

[47] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance cs.CVPDF

[48] Uniform Discrete Diffusion with Metric Path for Video Generation cs.CVPDF

[49] Generative View Stitching cs.CV | cs.LGPDF

cs.CL [Back]

[50] Evaluating Long-Term Memory for Long-Context Question Answering cs.CLPDF

[51] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs’ Cultural Processing of Figurative Language cs.CLPDF

[52] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning cs.CL | cs.AIPDF

[53] Language Models for Longitudinal Clinical Prediction cs.CLPDF

[54] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages cs.CLPDF

[55] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward cs.CL | cs.AIPDF

[56] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs cs.CL | cs.AIPDF

[57] Success and Cost Elicit Convention Formation for Efficient Communication cs.CLPDF

[58] Pie: A Programmable Serving System for Emerging LLM Applications cs.CLPDF

[59] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation cs.CLPDF

[60] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures cs.CLPDF

[61] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents cs.CLPDF

[62] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean cs.CL | cs.AIPDF

[63] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations cs.CL | cs.AIPDF

[64] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability cs.CLPDF

[65] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? cs.CLPDF

[66] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations cs.CLPDF

[67] From Memorization to Reasoning in the Spectrum of Loss Curvature cs.CL | cs.LGPDF

[68] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent’s Internal Emergent Symbolic Representation? cs.CL | cs.ROPDF

[69] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference cs.CLPDF

[70] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards cs.CLPDF

[71] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning cs.CL | cs.AIPDF

[72] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants cs.CL | cs.AI | 68T50 | F.2.2; I.2.7PDF

[73] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability cs.CL | cs.AIPDF

[74] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models cs.CLPDF

[75] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space cs.CL | cs.CVPDF

[76] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices cs.CL | cs.AIPDF

[77] Iterative Critique-Refine Framework for Enhancing LLM Personalization cs.CL | cs.AI | cs.IRPDF