cs.CV [Total: 66]
cs.CL [Total: 27]
cs.CR [Total: 1]
cs.RO [Total: 1]
cs.LG [Total: 7]
eess.IV [Total: 1]
cs.AI [Total: 4]
q-fin.CP [Total: 1]
cs.IR [Total: 2]

cs.CV

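[1] GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments cs.CV | cs.AI

Leela Krishna, Mengyang Zhao, Saicharithreddy Pasula, Harshit Rajgarhia, Abhishek Mukherji

TL;DR: GAZE is an automated pre-annotation pipeline that converts raw long-form video into high-quality multimodal training data for world models, improving annotation efficiency and building in privacy safeguards.

Details

Motivation: Manual annotation of multimodal data is slow and expensive, which bottlenecks the training of robust world models. GAZE aims to remove this bottleneck through automation.

Result: Reviewers save about 19 minutes per review hour, and conservative auto-skipping of low-salience segments cuts human review volume by more than 80%. Label density and consistency increase while privacy safeguards stay in place.

Insight: Automated pre-annotation combined with consolidated multimodal signals can supply high-quality world-model training data without sacrificing throughput or governance.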

Abstract: Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
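To put the two headline numbers in perspective, the arithmetic can be sketched as below. The 100 hours of raw footage and the function name `review_savings` are hypothetical, and the abstract does not say how the two effects combine, so they are reported separately here rather than multiplied together.

```python
def review_savings(minutes_saved_per_hour, skipped_fraction, raw_hours):
    """Return (fraction of each review hour saved, footage hours still needing review)."""
    time_saved_fraction = minutes_saved_per_hour / 60.0
    remaining_hours = raw_hours * (1.0 - skipped_fraction)
    return time_saved_fraction, remaining_hours


# 19 minutes per hour and 80% auto-skip come from the abstract;
# raw_hours=100 is an illustrative assumption.
saved, remaining = review_savings(19, 0.80, raw_hours=100)
print(f"{saved:.1%} of each review hour saved")
print(f"at most about {remaining:.0f} of 100 footage hours left for human review")
```

Because the abstract states a reduction of more than 80%, the remaining review volume computed here is an upper bound.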

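[2] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising cs.CV | cs.AI

Yang Shi, Jingchao Wang, Liangsi Lu, Mingxuan Huang, Ruixin He

TL;DR: PC-UNet introduces a Poisson Variance and Mean Consistency Loss (PVMC-Loss) to handle Poisson noise in low-dose PET denoising, improving both image fidelity and physical consistency.

Details

Motivation: High-dose PET exposes patients to more radiation, while lowering the dose increases Poisson noise. Existing denoising methods do not handle this noise well, which leads to distortions and artifacts.

Result: Experiments on PET datasets show that PC-UNet improves image fidelity and physical consistency.

Insight: Deep learning methods that encode the physical noise statistics of the imaging process can handle noise in medical images more effectively.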

Abstract: Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited because the doses needed for a high signal-to-noise ratio increase radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.
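The abstract does not give the PVMC-Loss formula, so the following is only a rough sketch of the underlying idea, not the paper's loss. It assumes the simplest moment conditions for Poisson data: if the noisy counts are Poisson with the denoised image as their mean, the residual should average to zero and its mean square should match the mean intensity. The function name `poisson_moment_consistency` is hypothetical.

```python
def poisson_moment_consistency(noisy, denoised):
    """Penalize violations of two Poisson moment conditions (illustrative only).

    For y ~ Poisson(x): E[y - x] = 0 and E[(y - x)^2] = E[x].
    """
    n = len(noisy)
    residuals = [y - x for y, x in zip(noisy, denoised)]
    mean_residual = sum(residuals) / n
    mean_sq_residual = sum(r * r for r in residuals) / n
    mean_intensity = sum(denoised) / n
    mean_term = mean_residual ** 2
    variance_term = (mean_sq_residual - mean_intensity) ** 2
    return mean_term + variance_term


# A flat estimate of 4 with residual variance 4 satisfies both conditions.
consistent = poisson_moment_consistency([2.0, 6.0, 2.0, 6.0], [4.0, 4.0, 4.0, 4.0])
# Copying the noisy input leaves zero residual variance, which is not Poisson-like.
identity = poisson_moment_consistency([4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0])
print(consistent, identity)
```

In this toy setting the penalty discourages a denoiser from simply reproducing its input, since a zero residual contradicts the variance expected at that intensity.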

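[3] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models cs.CV | cs.AI | cs.CL

Mor Ventura, Michael Toker, Or Patashnik, Yonatan Belinkov, Roi Reichart

TL;DR: DeLeaker is a dynamic inference-time reweighting method that mitigates semantic leakage in text-to-image models by intervening directly on the model's attention maps.

Details

Motivation: Text-to-image (T2I) models are advancing quickly but still suffer from semantic leakage, where semantic features transfer between unrelated entities. Existing mitigations are usually optimization-based or depend on external inputs. DeLeaker offers a lightweight, optimization-free alternative.

Result: DeLeaker consistently outperforms all baselines, even when those baselines receive external information, and it mitigates leakage without compromising fidelity or quality.

Insight: Attention control is a promising direction for semantically precise generation, and DeLeaker's results inform the design of future T2I models.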

Abstract: Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model’s attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
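As a toy illustration of what reweighting an attention map could look like, the sketch below rescales one image patch's attention over prompt tokens. It is not the paper's rule: the per-token entity labels, the `suppress` and `boost` factors, and the function name `reweight_attention` are all assumptions, and the abstract does not describe how entities are identified.

```python
def reweight_attention(weights, token_entity, patch_entity, suppress=0.5, boost=1.5):
    """Rescale one patch's attention over prompt tokens, then renormalize.

    Tokens of the patch's own entity are boosted, tokens of other entities are
    suppressed, and tokens tied to no entity (None) are left unchanged.
    """
    scaled = []
    for weight, entity in zip(weights, token_entity):
        if entity is None:
            factor = 1.0
        elif entity == patch_entity:
            factor = boost
        else:
            factor = suppress
        scaled.append(weight * factor)
    total = sum(scaled)
    return [value / total for value in scaled]


# Prompt tokens: "a", "cat", "and", "dog"; the patch belongs to the cat.
out = reweight_attention(
    [0.25, 0.25, 0.25, 0.25],
    token_entity=[None, "cat", None, "dog"],
    patch_entity="cat",
)
print(out)
```

Applying such a step at every denoising iteration would correspond to the "dynamic" part of the description, though the schedule used in the paper is not stated in the abstract.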

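[4] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos cs.CV | cs.AI | cs.RO

Mingxuan Liu, Honglin He, Elisa Ricci, Wayne Wu, Bolei Zhou

TL;DR: UrbanVerse is a data-driven real-to-sim system that turns city-tour videos into high-fidelity, interactive urban simulation scenes for training urban AI agents at scale.

Details

Motivation: Existing urban simulation scenes either lack scalability or fail to capture real-world complexity, which limits how well urban AI agents can be trained. UrbanVerse aims to close this gap.

Result: UrbanVerse scenes preserve real-world semantics and layouts. Navigation policies trained in them improve success by 6.3% in simulation and by 30.1% in zero-shot sim-to-real transfer compared to prior methods.

Insight: Data-driven real-to-sim construction can increase the diversity and realism of simulation scenes and the quality of the policies trained in them, supporting real-world deployment of urban AI agents.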

Abstract: Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, accomplishing a 300 m real-world mission with only two interventions.

Mattia Segu, Marta Tintore Gazulla, Yongqin Xian, Luc Van Gool, Federico Tombari

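TL;DR: MOBIUS is a family of universal instance segmentation models aimed at devices down to mobile hardware. It balances accuracy and compute through a bottleneck pixel decoder for multi-modal fusion and through decoder pruning.

Details

Motivation: Foundation models perform well on instance-level perception tasks such as object detection and segmentation, but their computational cost limits use on resource-constrained platforms. MOBIUS targets deployment across devices ranging from high-end accelerators to mobile hardware.

Result: MOBIUS reduces pixel decoder FLOPs by up to 55% and transformer decoder FLOPs by up to 75% while maintaining state-of-the-art performance, using only a third of the training iterations.

Insight: Efficient fusion combined with pruning lets MOBIUS balance accuracy and efficiency, setting a new benchmark for efficient segmentation on resource-constrained devices.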

Abstract: Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.

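[6] Composition-Grounded Instruction Synthesis for Visual Reasoning cs.CV | cs.CL | cs.LG

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li

TL;DR: COGS is a data-efficient framework that decomposes seed questions into perception and reasoning factors and recomposes them into large sets of synthetic question-answer pairs, strengthening the reasoning of multimodal large language models (MLLMs) on artificial image domains. Experiments show substantial gains on unseen questions.

Details

Motivation: MLLMs perform well on many multimodal tasks, but their reasoning is limited in domains that lack large-scale annotated data, such as charts and webpages. COGS aims to generate diverse training data from a small set of seed questions to improve generalization.

Result: On chart reasoning, COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Mixing different seed data further improves transfer across datasets.

Insight: COGS improves in-domain performance and also appears to induce generalizable capabilities through its synthetic data mechanism, and it extends to other artificial image domains.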

Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

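[7] Generalized Dynamics Generation towards Scannable Physical World Model cs.CV

Yichen Li, Zhiyi Li, Brandon Feng, Dinghuai Zhang, Antonio Torralba

TL;DR: GDGen takes a potential energy perspective to unify rigid body, articulated body, and soft body dynamics in one framework for building scannable physical world models.

Details

Motivation: The goal is to build general-purpose digital twin worlds that provide a training ground for generalist agents in interactive environments with complex physical behavior.

Result: Experiments show that GDGen robustly unifies diverse simulation paradigms and applies to complex, dynamically rich scenes.

Insight: The potential energy perspective offers a new approach to physical modeling, and the geometry-agnostic representation makes the model more general and flexible.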

Abstract: Digital twin worlds with realistic interactive dynamics present a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.

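[8] Comprehensive language-image pre-training for 3D medical image understanding cs.CV | cs.LG

Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma

TL;DR: The paper presents the COLIPRI encoder family, which addresses data scarcity in 3D medical imaging by injecting additional inductive biases (a report generation objective, and vision-language pre-training paired with vision-only pre-training). The encoders reach state-of-the-art results on report generation, classification probing, and zero-shot classification.

Details

Motivation: Data scarcity in the 3D medical image domain limits current vision-language encoders. The paper aims to improve them by using more data (both image-only and paired image-text datasets) and by adding inductive biases.

Result: COLIPRI encoders achieve state-of-the-art performance on report generation, classification probing, and zero-shot classification, and remain competitive on semantic segmentation.

Insight: In data-scarce domains, adding learning objectives and combining pre-training strategies, such as vision-language with vision-only pre-training, can improve model performance.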

Abstract: Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

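[9] Directional Reasoning Injection for Fine-Tuning MLLMs cs.CV

Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu

TL;DR: The paper proposes DRIFT, which injects reasoning knowledge in gradient space to improve the reasoning of multimodal large language models (MLLMs) while avoiding resource-intensive training.

Details

Motivation: The reasoning ability of MLLMs lags behind that of text-only models. Supervised fine-tuning and reinforcement learning are costly, and naive model merging gives inconsistent results across model families.

Result: On benchmarks including MathVista and MathVerse, DRIFT outperforms naive merging and supervised fine-tuning, and it matches or surpasses training-heavy methods at a fraction of the cost.

Insight: Transferring reasoning knowledge in gradient space is a lightweight and efficient way to improve reasoning across different MLLM families.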

Abstract: Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a “free lunch”: its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
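The two steps named in the abstract (precompute a parameter-space difference, then use it to bias gradients) can be sketched on toy parameter vectors as below. The exact biasing rule is not given in the abstract, so the additive form and the `strength` and `lr` values here are assumptions, as are the function names.

```python
def reasoning_prior(theta_reasoning, theta_multimodal):
    """Parameter-space difference between the reasoning and multimodal variants."""
    return [r - m for r, m in zip(theta_reasoning, theta_multimodal)]


def biased_sgd_step(theta, grad, prior, lr=0.1, strength=0.5):
    """One SGD step whose gradient is shifted so the update leans along the prior.

    Assumed rule: biased_grad = grad - strength * prior.
    """
    biased_grad = [g - strength * p for g, p in zip(grad, prior)]
    return [t - lr * bg for t, bg in zip(theta, biased_grad)]


theta_multimodal = [0.0, 0.0]
theta_reasoning = [1.0, -1.0]
prior = reasoning_prior(theta_reasoning, theta_multimodal)

# With a zero task gradient, the step moves purely toward the reasoning variant.
new_theta = biased_sgd_step(theta_multimodal, grad=[0.0, 0.0], prior=prior)
print(new_theta)
```

Unlike merging, which interpolates the weights once, a rule of this kind lets the multimodal fine-tuning gradient and the reasoning direction interact at every step.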

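[10] A solution to generalized learning from small training sets found in everyday infant experiences cs.CV

Frangil Ramirez, Elizabeth Clerkin, David J. Crandall, Linda B. Smith

TL;DR: The paper argues that infants learn efficiently from limited visual experience because their everyday visual input has a "lumpy" similarity structure, and computational experiments show that this structure helps machine learning models generalize from small datasets.

Details

Motivation: The study asks why infants can learn and generalize from limited visual experience when machine learning typically needs large and diverse datasets.

Result: Mimicking the lumpy structure of infant visual input improves generalization from small datasets in machine learning models.

Insight: Statistical properties of natural visual input, such as lumpiness, may offer principles for efficient learning across a variety of problems and kinds of learners.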

Abstract: Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infant daily life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life, repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months) we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.

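[11] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images cs.CV

Jiaxin Guo, Tongfan Guan, Wenzhen Dong, Wenzhao Zheng, Wenting Wang

TL;DR: SaLon3R is an online generalizable 3D Gaussian Splatting (3DGS) framework for structure-aware reconstruction of long video sequences. Compact anchor primitives and a 3D Point Transformer reduce redundancy and improve geometric consistency.

Details

Motivation: Existing methods predict per-pixel Gaussians and merge them across all views, which causes redundancy and geometric inconsistency on long video sequences and limits reconstruction efficiency and generalization.

Result: SaLon3R achieves state-of-the-art performance on multiple datasets for novel view synthesis and depth estimation, with better efficiency and robustness.

Insight: Structure awareness and compact representations are key to reducing redundancy and improving consistency in long-term 3D reconstruction.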

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, demonstrating superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.

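[12] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation cs.CV

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang

TL;DR: TGT is a framework for locally controlled video generation that conditions on trajectories paired with localized text descriptions, improving control over the appearance and motion of multiple objects.

Details

Motivation: Existing text-to-video methods offer limited control in multi-object scenes, and in complex scenarios they cannot keep a precise correspondence between visual entities and text descriptions.

Result: Experiments show that TGT achieves higher visual quality, more accurate text alignment, and better motion controllability than prior approaches.

Insight: Pairing trajectories with localized text descriptions is an intuitive and effective way to gain finer control over multi-object video, especially in complex scenes.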

Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
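One common way to build a two-scale classifier-free guidance rule is shown below as a sketch of what "separately modulate local and global text guidance" could mean. The abstract does not give TGT's formula, so this particular combination, the default scale values, and the name `dual_cfg` are assumptions. Short lists stand in for full noise-prediction tensors.

```python
def dual_cfg(eps_uncond, eps_global, eps_full, w_global=7.5, w_local=3.0):
    """Combine three noise predictions with separate guidance scales.

    eps_uncond: prediction with no text conditioning
    eps_global: prediction with the global prompt only
    eps_full:   prediction with the global prompt plus localized text
    """
    return [
        u + w_global * (g - u) + w_local * (f - g)
        for u, g, f in zip(eps_uncond, eps_global, eps_full)
    ]


eps_uncond = [0.0, 0.0, 0.0]
eps_global = [1.0, 0.5, -1.0]
eps_full = [3.0, 0.5, -2.0]

# With both scales set to 1 the rule reduces to the fully conditioned prediction.
guided = dual_cfg(eps_uncond, eps_global, eps_full, w_global=1.0, w_local=1.0)
print(guided)
```

Raising `w_local` alone strengthens only the contribution of the localized descriptions, which is the kind of independent control a dual-scale scheme is meant to provide.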

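[13] Deep generative priors for 3D brain analysis cs.CV | cs.LG

Ana Lawry Aguila, Dina Zemlyanker, You Cheng, Sudeshna Das, Daniel C. Alexander

TL;DR: The paper presents a general-purpose approach that uses diffusion models as priors for a range of medical imaging inverse problems, combining domain knowledge with data-driven models and reaching state-of-the-art performance in brain MRI analysis.

Details

Motivation: Classical Bayesian inverse-problem methods rely on mathematical priors that struggle to capture complex brain anatomy. Diffusion models perform well in medical imaging, but combining them with domain knowledge remains a challenge.

Result: The method achieves state-of-the-art performance on heterogeneous clinical and research MRI data, producing consistent, high-quality solutions without paired training datasets.

Insight: Diffusion priors can serve as versatile tools for brain MRI analysis, combining domain knowledge with the benefits of training on large, diverse data.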

Abstract: Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.

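[14] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification cs.CV

Anthony Bilic, Guangyu Sun, Ming Li, Md Sanzid Bin Hossain, Yu Tian

TL;DR: The paper proposes Fourier Transform Multiple Instance Learning (FFT-MIL), which augments MIL methods with a frequency-domain branch that captures global dependencies in whole slide images (WSIs) and improves classification performance.

Details

Motivation: MIL-based WSI classification struggles to capture global dependencies because the images are extremely large and patch embeddings are local. This limits modeling of the coarse structures needed for robust diagnostic prediction.

Result: Across six MIL methods on three public datasets (BRACS, LUAD, and IMP), adding the FFT-Block improves macro F1 by 3.51% and AUC by 1.51% on average.

Insight: Frequency-domain learning is an effective mechanism for capturing global dependencies in WSIs. It complements spatial features and improves the scalability and accuracy of computational pathology.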

Abstract: Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
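The low-frequency crop and Min-Max normalization mentioned in the abstract can be sketched with NumPy as below. This is a minimal sketch under several assumptions: the crop is taken from the centered magnitude spectrum, a log transform is applied before normalization, and the input is a single-channel array. The paper's FFT-Block also contains convolutional layers, which are omitted here, and the function name `low_frequency_crop` is hypothetical.

```python
import numpy as np


def low_frequency_crop(image, crop):
    """Return a min-max normalized crop of the low-frequency magnitude spectrum."""
    # Shift the zero-frequency component to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    height, width = spectrum.shape
    top, left = (height - crop) // 2, (width - crop) // 2
    center = spectrum[top:top + crop, left:left + crop]
    # Log-compress magnitudes (an assumption) to tame their dynamic range.
    magnitude = np.log1p(np.abs(center))
    lo, hi = magnitude.min(), magnitude.max()
    # Small epsilon guards against division by zero on constant inputs.
    return (magnitude - lo) / (hi - lo + 1e-8)


rng = np.random.default_rng(0)
tile = rng.random((64, 64))  # stand-in for a downsampled grayscale slide
feature = low_frequency_crop(tile, crop=16)
print(feature.shape)
```

The crop size controls how much coarse structure is kept: a 16x16 crop of a 64x64 spectrum retains only the lowest quarter of the frequency range along each axis.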

Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang

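TL;DR: XModBench is a large-scale tri-modal benchmark for measuring the cross-modal consistency of omni-modal large language models (OLLMs). Experiments show that current models fall short on spatial and temporal reasoning and exhibit modality disparity and directional imbalance.

Details

Motivation: Existing benchmarks mainly evaluate general cross-modal question answering and do not systematically analyze modality-invariant reasoning or modality-specific biases, so a new diagnostic tool is needed.

Result: Even the strongest model, Gemini 2.5 Pro, struggles with spatial and temporal reasoning (below 60% accuracy), shows marked modality disparity (performance drops substantially when the same content is conveyed through audio rather than text), and shows directional imbalance (lower consistency when vision serves as context than when text does).

Insight: Current OLLMs remain far from truly modality-invariant reasoning, and XModBench can serve as a diagnostic tool for evaluating and improving cross-modal competence.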

Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM’s modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
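The count of six modality compositions is consistent with taking ordered pairs of distinct modalities, one for the question context and one for the answer options. The sketch below assumes that reading and adds a hypothetical consistency score (the share of questions answered correctly under every composition); neither is confirmed by the abstract, and the results shown are made up for illustration.

```python
from itertools import permutations

MODALITIES = ("audio", "vision", "text")


def modality_compositions():
    """Ordered (context modality, answer modality) pairs with distinct modalities."""
    return list(permutations(MODALITIES, 2))


def consistency(correct_by_composition):
    """Share of questions answered correctly under every modality composition."""
    per_question = zip(*correct_by_composition.values())
    outcomes = [all(flags) for flags in per_question]
    return sum(outcomes) / len(outcomes)


compositions = modality_compositions()

# Made-up results for three questions: correct, correct, wrong everywhere,
# except that the second question also fails when vision is the context.
results = {composition: [True, True, False] for composition in compositions}
results[("vision", "text")] = [True, False, False]
score = consistency(results)
print(len(compositions), score)
```

Comparing per-composition accuracy for pairs such as ("vision", "text") and ("text", "vision") is one way a directional imbalance of the kind reported in the abstract would show up.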

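[16] Train a Unified Multimodal Data Quality Classifier with Synthetic Data cs.CV | cs.CL

Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel

TL;DR: The paper proposes UniFilter, a unified multimodal data quality classifier trained on semi-synthetic data to filter high-quality image-text data, which improves the multimodal large language models (MLLMs) pre-trained on that data.

Details

Motivation: MLLMs are pre-trained on a mixture of image-text caption data and interleaved document data, but filtering for high-quality data, especially interleaved documents, is under-explored. An efficient, unified filtering method is needed.

Result: MLLMs pre-trained on UniFilter-filtered data clearly outperform those trained on baseline-filtered data in zero-shot reasoning and in-context learning, and they achieve stronger results on downstream benchmarks after visual supervised fine-tuning.

Insight: High-quality multimodal data is critical to MLLM performance, and a single unified classifier can filter it effectively and improve generalization.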

Abstract: Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while the high-quality data filtering towards image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.

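[17] Salient Concept-Aware Generative Data Augmentation cs.CV | 68T45 (Machine learning) | I.2.10; I.2.6; I.4.8; I.5.1; I.5.4

Tianchen Zhao, Xuanbai Chen, Zhihua Li, Jun Fang, Dongsheng An

TL;DR: The paper proposes a salient concept-aware image generation framework that addresses the fidelity-diversity trade-off in generative data augmentation. By reducing the influence of irrelevant visual details, it improves the robustness of downstream models.

Details

Motivation: Generative data augmentation methods conditioned on both image and text prompts struggle to keep fidelity and diversity at once, because non-essential image attributes such as environmental context become entangled in synthesis and conflict with the intent of the text prompt.

Result: Across eight fine-grained vision datasets, the method improves average classification accuracy over state-of-the-art augmentation methods by 0.73% in the conventional setting and by 6.5% in the long-tail setting.

Insight: Salient concept-aware embeddings give better control over the fidelity and diversity of generated images, which improves data augmentation and downstream task performance.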

Abstract: Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance between fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with averaged classification accuracy improvements by 0.73% and 6.5% under conventional and long-tail settings, respectively.

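[18] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records cs.CV

Daniela Vega, Hannah V. Ceballos, Javier S. Vera, Santiago Rodriguez, Alejandra Perez

TL;DR: The paper introduces CARDIUM, the first publicly available multimodal dataset for prenatal diagnosis of congenital heart disease (CHD), and proposes a robust multimodal transformer architecture that fuses image and tabular data to improve detection.

Details

Motivation: Because congenital heart diseases are rare, high-quality diagnostic data is scarce, which leads to imbalanced, low-quality datasets that limit AI model performance. The lack of public multimodal datasets further hinders AI support for clinical decision-making.

Result: On the CARDIUM dataset, the multimodal model improves CHD detection by 11% over the image-only approach and by 50% over the tabular-only approach, reaching an F1 score of 79.8 ± 4.8%.

Insight: Multimodal fusion can substantially improve diagnosis of rare conditions, and releasing the dataset and code publicly helps advance research in this area.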

Abstract: Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/

[19] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads cs.CV

Aysan Aghazadeh, Adriana Kovashka

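TL;DR: The paper studies the potential of text-to-image models for customizing advertisements, analyzes racial and gender bias in ads, and proposes a technique for generating ads targeted at specific countries.

Details

Motivation: The work explores how text-to-image models can be used to customize ads and analyzes the bias that arises in ad generation.

Result: The results indicate notable racial and gender bias in ads, and that ads generated for a specific country are more effective for that country's culture.

Insight: Ad generation needs to account for cultural diversity and avoid bias in order to improve persuasiveness and effectiveness.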

Abstract: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion

[20] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion cs.CV

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu

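TL;DR: DriveGen3D is a framework that combines efficient video diffusion with dynamic 3D reconstruction to generate high-quality dynamic 3D driving scenes in real time. Its FastDrive-DiT and FastRecon3D modules address the limitations of existing methods in long-term generation and 3D representation.

Details

Motivation: Existing methods for dynamic driving scene generation are computationally expensive, lack a 3D representation, or only handle static scenes. DriveGen3D aims to close this methodological gap.

Result: The system generates driving video at up to 424×800 resolution and 12 FPS, with an SSIM of 0.811 and a PSNR of 22.84 on novel view synthesis.

Insight: Combining video diffusion with 3D reconstruction yields high-quality dynamic scenes while staying parameter-efficient, providing a new tool for autonomous driving simulation and data augmentation.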

Abstract: We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
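
To make the reported PSNR figure easier to interpret, here is a small sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). The normalization of pixel values to [0, 1] and the helper names are assumptions; the abstract does not state the value range used in the evaluation.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db, max_value=1.0):
    """Invert the definition: root-mean-square error implied by a PSNR value."""
    return max_value * 10.0 ** (-psnr_db / 20.0)
```

Under the [0, 1] assumption, `rmse_from_psnr(22.84)` comes out at roughly 0.07, i.e. an average per-pixel error of about 7% of the dynamic range.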

[21] CuSfM: CUDA-Accelerated Structure-from-Motion cs.CV | cs.RO

Jingrui Yu, Jun Liu, Kefei Ren, Joydeep Biswas, Rurui Ye

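TL;DR: cuSfM is a CUDA-accelerated offline Structure-from-Motion system that uses GPU parallelization to improve computational efficiency and supports high-precision camera pose estimation and globally consistent mapping.

Details

Motivation: Efficient and accurate camera pose estimation is a basic requirement for autonomous driving, robotic perception, and virtual simulation. Existing methods are computationally expensive, and cuSfM addresses this with GPU parallelization.

Result: Experiments show that cuSfM outperforms COLMAP in both accuracy and processing speed while maintaining high precision and global consistency.

Insight: GPU parallelization can substantially improve both efficiency and accuracy in offline SfM, and the open-source release of cuSfM may benefit computer vision and robotics research.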

Abstract: Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at https://github.com/nvidia-isaac/pyCuSFM, to facilitate research and applications in computer vision and robotics.

[22] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning cs.CV | cs.LG

Yiming Lin, Shang Wang, Junkai Zhou, Qiufeng Wang, Xiao-Bo Jin

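TL;DR: The paper proposes a classification framework based on hyperbolic geometry for Single Positive Multi-Label Learning (SPMLL). Representing each label as a hyperbolic ball rather than a point or vector makes it possible to model hierarchical structure, co-occurrence patterns, and semantic independence between labels at the same time. It also introduces a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization. Experiments on several benchmark datasets support the method's performance and interpretability.

Details

Motivation: Existing SPMLL methods model label relationships only implicitly through distance-based similarity and lack explicit geometric definitions for different relationship types. Hyperbolic geometry represents hierarchies naturally, which makes it better suited to modeling complex label relationships.

Result: On four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011), the method achieves competitive performance with better interpretability than existing methods. Statistical analysis shows that the learned embeddings correlate strongly with real-world co-occurrence patterns.

Insight: Hyperbolic geometry is naturally suited to modeling hierarchies and complex relationships, and it is more robust when label supervision is incomplete, as in SPMLL.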

Abstract: Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. Existing methods model label relationships only implicitly through distance-based similarity and lack explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
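
The three ball relationships named in the abstract (inclusion, overlap, separation) can be sketched as follows. The sketch assumes the Poincaré ball model and geodesic balls given by a center and a hyperbolic radius; the abstract does not say which model of hyperbolic space or which exact criteria the paper uses, so the function names and thresholds are illustrative.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit (Poincare) ball."""
    sq = lambda x: sum(c * c for c in x)
    diff = sq([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq(u)) * (1.0 - sq(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def ball_relation(center_a, radius_a, center_b, radius_b):
    """Classify two geodesic balls as inclusion, overlap, or separation."""
    d = poincare_distance(center_a, center_b)
    if d + radius_b <= radius_a:
        return "a_contains_b"   # inclusion: hierarchy (a is the broader label)
    if d + radius_a <= radius_b:
        return "b_contains_a"
    if d < radius_a + radius_b:
        return "overlap"        # co-occurrence
    return "separate"           # semantic independence
```

For example, a large ball at the origin contains a small ball centered nearby, while two small balls near opposite sides of the unit ball are separate because hyperbolic distance grows rapidly toward the boundary.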

[23] Latent Diffusion Model without Variational Autoencoder cs.CV | cs.AI

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu

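TL;DR: SVG is a latent diffusion model that drops the variational autoencoder (VAE) and instead trains diffusion on a latent space built from frozen DINO self-supervised features plus a lightweight residual branch, which accelerates training, supports few-step sampling, and improves generative quality.

Details

Motivation: The VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. The authors trace these issues to VAE latent spaces lacking clear semantic separation and strong discriminative structure.

Result: Experiments show that SVG accelerates diffusion training, supports few-step sampling, and improves generative quality, while preserving the semantic and discriminative capabilities of the underlying self-supervised representations.

Insight: A semantically discriminative latent space matters not only for perception and understanding tasks but also for stable and efficient training of latent diffusion models.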

Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.

[24] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation cs.CV | cs.LG

Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu

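TL;DR: The paper proposes CoMe, a method that compresses large language models through progressive, concatenation-based layer merging and hierarchical distillation while preserving performance.

Details

Motivation: Large language models have high computational and storage demands. Existing structured pruning methods remove layers directly, which degrades performance, and they lack effective post-training recovery mechanisms.

Result: CoMe achieves state-of-the-art performance on seven benchmarks. When 30% of LLaMA-2-7b's parameters are pruned, the pruned model retains 83% of its original average accuracy.

Insight: Merging layers by concatenation, combined with hierarchical distillation, reduces model size while maintaining performance and offers a new approach to lightweight large language models.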

Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b’s parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
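
A rough sketch of the two steps described in the abstract, channel scoring and concatenation-based merging, is shown below. The abstract only says the sensitivity metric uses activation intensity and weight norms; the specific product form (mean absolute activation times L2 weight norm), the equal per-layer budget, and all function names are assumptions for illustration, not CoMe's definitions.

```python
import math


def channel_scores(activations, weights):
    """Score each channel as mean |activation| times the L2 norm of its weights.

    activations: list of samples, each a list of per-channel values.
    weights: list of per-channel weight vectors.
    The product form is an assumed stand-in for the paper's metric.
    """
    n = len(activations)
    scores = []
    for c, w in enumerate(weights):
        intensity = sum(abs(sample[c]) for sample in activations) / n
        norm = math.sqrt(sum(x * x for x in w))
        scores.append(intensity * norm)
    return scores


def top_channels(scores, keep):
    """Indices of the `keep` highest-scoring channels, in original order."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:keep])


def merge_layers(weights_a, scores_a, weights_b, scores_b, keep_each):
    """Concatenate the most critical channels of two adjacent layers."""
    kept_a = [weights_a[c] for c in top_channels(scores_a, keep_each)]
    kept_b = [weights_b[c] for c in top_channels(scores_b, keep_each)]
    return kept_a + kept_b
```

With `keep_each` set to half the channel count, two adjacent layers collapse into one layer of the original width, which is the sense in which concatenation reduces model size here.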

[25] SHARE: Scene-Human Aligned Reconstruction cs.CV

Joshua Li, Brendan Chharawala, Chang Shu, Xue Bin Peng, Pengcheng Xi

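TL;DR: SHARE is a method that uses scene geometry to accurately reconstruct human motion, aligning the human with the scene from monocular RGB video alone.

Details

Motivation: Existing human motion reconstruction methods struggle to place humans accurately in 3D space, which hurts the animation of realistic character interactions in gaming, AR/VR, and robotics.

Result: Experiments show that SHARE outperforms existing methods in reconstruction accuracy and works on both curated datasets and in-the-wild web videos.

Insight: Scene geometry provides spatial cues that effectively improve the accuracy of human motion reconstruction, especially in complex environments.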

Abstract: Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry’s inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human’s positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
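
The consistency constraint for non-keyframes can be sketched in a simplified form: once keyframe root positions have been refined, each non-keyframe root is moved so that its offset to a keyframe root is unchanged. Tying each frame to its nearest keyframe, and applying the update as a one-shot propagation rather than inside an optimization loop, are assumptions of this sketch; the names are hypothetical.

```python
def propagate_keyframe_updates(initial_roots, refined_keyframe_roots):
    """Move every frame's root so its offset to the nearest keyframe is preserved.

    initial_roots: list of (x, y, z) root positions, one per frame.
    refined_keyframe_roots: dict mapping keyframe index -> refined (x, y, z).
    Nearest-keyframe assignment is an assumption made for this sketch.
    """
    keyframes = sorted(refined_keyframe_roots)
    updated = []
    for t, root in enumerate(initial_roots):
        k = min(keyframes, key=lambda kf: abs(kf - t))
        offset = tuple(r - r0 for r, r0 in zip(root, initial_roots[k]))
        updated.append(
            tuple(new + o for new, o in zip(refined_keyframe_roots[k], offset))
        )
    return updated


# Toy example: a person walking along x; refinement lifts the keyframes in z.
roots = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
refined = {0: (0.0, 0.0, 1.0), 3: (3.0, 0.0, 2.0)}
aligned = propagate_keyframe_updates(roots, refined)
```

Keyframes land exactly on their refined positions, and the frames in between inherit the correction of the keyframe they are tied to.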

[26] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning cs.CV

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

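TL;DR: The paper proposes a staged adaptive fine-tuning method for detecting surgical tools in laparoscopic videos, improving detection performance through gradual freezing fine-tuning.

Details

Motivation: Automated surgical tool detection is important for analysis and assistance in minimally invasive surgery, but annotated data is limited, which makes it hard to train robust models with conventional deep learning approaches.

Result: The method reaches a mean average precision (mAP) of 96.4% on the Cholec80 dataset, and its generalizability is confirmed on the CATARACTS dataset.

Insight: Gradual freezing fine-tuning is an effective domain adaptation technique that may extend to other medical image classification tasks.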

Abstract: Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
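
One way to picture the gradual freezing stage is as a schedule that shrinks the set of fine-tunable layers as training proceeds. The sketch below assumes a linear schedule that freezes layers from the input side first; the abstract describes the reduction as dynamic and does not give the rule, so the schedule shape and function names are illustrative.

```python
def trainable_layers(epoch, total_epochs, num_layers, min_trainable=1):
    """Number of layers, counted from the output side, still trainable at `epoch`.

    Linear decay from `num_layers` to `min_trainable` is an assumed schedule.
    """
    if total_epochs <= 1:
        return num_layers
    clamped = min(max(epoch, 0), total_epochs - 1)
    fraction = clamped / (total_epochs - 1)
    return num_layers - int(fraction * (num_layers - min_trainable))


def trainable_mask(epoch, total_epochs, num_layers, min_trainable=1):
    """Per-layer flags (input to output order); True means the layer is updated."""
    n = trainable_layers(epoch, total_epochs, num_layers, min_trainable)
    return [i >= num_layers - n for i in range(num_layers)]
```

In a PyTorch training loop, such a mask would typically be applied by setting `requires_grad` to `False` on the parameters of frozen layers at the start of each epoch, so that a single training loop covers the whole schedule.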

[27] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers cs.CV

Haisheng Su, Junjie Zhang, Feixiang Song, Sanping Zhou, Wei Wu

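TL;DR: The paper proposes FreqPDE, which supplies spatial information to multi-view 3D object detection transformers through a frequency-aware positional depth embedding, addressing the poor depth prediction quality and weak cross-view consistency of earlier methods.

Details

Motivation: Current multi-view 3D object detection methods rely on depth prediction to recover spatial information, but they suffer from depth discontinuities at object boundaries and poor distinction of small objects, and they overlook cross-view consistency and scale invariance.

Result: Experiments on the nuScenes dataset show that FreqPDE performs strongly on 3D object detection.

Insight: Combining frequency features with spatial information improves depth prediction quality, and cross-view attention strengthens feature consistency.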

Abstract: Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
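
The idea of a pixel-level depth distribution supervised from both metric and distribution aspects can be sketched for a single pixel. The sketch assumes discrete depth bins, an L1 term on the expected depth, and a cross-entropy term on the bin nearest the target; these loss forms, the equal weighting, and the names are assumptions, not FreqPDE's definitions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, bin_centers):
    """Metric depth as the expectation of the predicted bin distribution."""
    probs = softmax(logits)
    return sum(p * d for p, d in zip(probs, bin_centers))


def hybrid_depth_loss(logits, bin_centers, target_depth):
    """Assumed hybrid loss: L1 on expected depth plus cross-entropy on the target bin."""
    probs = softmax(logits)
    metric_term = abs(expected_depth(logits, bin_centers) - target_depth)
    target_bin = min(
        range(len(bin_centers)), key=lambda i: abs(bin_centers[i] - target_depth)
    )
    distribution_term = -math.log(probs[target_bin])
    return metric_term + distribution_term
```

The two terms complement each other: the metric term penalizes the mean of the distribution, while the distribution term penalizes probability mass placed in the wrong bin even when the mean happens to be right.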

[28] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction cs.CV

Ting-Yu Yen, Yu-Sheng Chiu, Shih-Hsuan Hung, Peter Wonka, Hung-Kuo Chu

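TL;DR: PFGS is a multi-pose object reconstruction method based on 3D Gaussian Splatting that addresses the incomplete reconstructions produced by existing methods. Using a pose-aware strategy that combines global and local registration, it achieves high-quality, complete object reconstruction.

Details

Motivation: Existing 3D Gaussian Splatting methods usually assume the object is captured in a single static pose, which leaves reconstructions incomplete, especially in occluded or self-occluded regions. PFGS targets complete object reconstruction from multi-pose image captures.

Result: Experiments show that PFGS outperforms baselines in both qualitative and quantitative evaluations, producing more complete and higher-fidelity 3D Gaussian Splatting models.

Insight: Through pose-aware registration and multi-view fusion, PFGS demonstrates advantages in multi-pose object reconstruction and offers a new way to resolve background inconsistency.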

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.

[29] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding cs.CV | cs.LG

Peng Ren, Hai Yang

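TL;DR: LILAC is a real-time long-sequence motion stylization method based on a VAE-diffusion model. Its causal decoding and latent-space streaming architecture deliver high-quality, low-latency motion generation.

Details

Motivation: Generating stable, high-quality stylized motion over long sequences in real time is essential for dynamic character control, but existing methods are either computationally expensive or limited to offline processing.

Result: On benchmark datasets, the method shows a favorable balance between stylization quality and responsiveness.

Insight: Combining latent-space streaming with causal decoding is an effective solution for real-time motion stylization that does not rely on future frames or require changes to the diffusion model architecture.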

Abstract: Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
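
The sliding-window causal design can be sketched as a generator that emits one output per incoming frame while looking only at the current and past frames. The stylization model is replaced by a stand-in callable (`moving_average`), and the injection of decoded motion features mentioned in the abstract is not modeled; both are simplifications of this sketch.

```python
from collections import deque


def stream_stylize(frames, window_size, stylize_window):
    """Yield one output per incoming frame using only current and past frames."""
    window = deque(maxlen=window_size)  # oldest frame is dropped automatically
    for frame in frames:
        window.append(frame)
        yield stylize_window(list(window))


def moving_average(window):
    """Stand-in for the stylization model: averages the frames in the window."""
    return sum(window) / len(window)


outputs = list(
    stream_stylize([1.0, 2.0, 3.0, 4.0], window_size=2, stylize_window=moving_average)
)
```

Because each output depends only on frames already received, changing a later frame cannot alter an earlier output, which is the property that makes online processing possible.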

[30] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning cs.CV | cs.AI | physics.med-ph

Chen Qian, Haoyu Zhang, Junnan Ma, Liuhong Zhu, Qingrui Cai

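TL;DR: LoSP-Prompt is a reconstruction framework for multi-organ diffusion MRI that tackles motion artifacts through physics-informed modeling and synthetic-data-driven prompt learning, achieving high resolution and cross-organ generalization with strong results in clinical validation.

Details

Motivation: Clinical use of multi-organ diffusion MRI is hampered by phase artifacts caused by motion such as respiration and peristalsis, compounded by multi-organ, multi-slice, multi-direction, and multi-b-value complexity, all of which limit its diagnostic value. LoSP-Prompt aims to address these problems.

Result: 1. Twice the spatial resolution of clinical single-shot DWI; 2. A single model generalizes to seven anatomical regions; 3. It outperforms state-of-the-art methods in image quality, artifact suppression, and noise reduction, with 11 radiologists giving scores between 3 and 5 on a 5-point scale depending on the region (4-5 on kidney DWI).

Insight: 1. Synthetic-data-driven prompt learning can compensate for scarce clinical data; 2. Combining physics-informed modeling with machine learning gives an interpretable, robust solution; 3. The navigator-free design simplifies clinical use.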

Abstract: Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm’s rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists’ evaluations on a 5-point scale, $p<0.05$), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and tumor brain. The approach eliminates navigator signals and realistic data supervision, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.

[31] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models cs.CV | cs.AI

Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

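TL;DR: The paper proposes Learning to Detect (LoD), a framework for detecting unknown jailbreak attacks on large vision-language models. It achieves higher detection accuracy and efficiency through task-specific learning and a Multi-modal Safety Concept Activation Vector module.

Details

Motivation: Large vision-language models remain vulnerable to jailbreak attacks despite alignment. Current detection methods either rely on attack-specific learning, which generalizes poorly, or on heuristic principles, which limits accuracy and efficiency.

Result: Experiments show that LoD achieves higher detection AUROC across diverse unknown attacks while also improving efficiency.

Insight: Task-specific learning generalizes better than attack-specific learning, and combining multimodal representations with unsupervised techniques effectively improves safety detection.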

Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.

[32] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety cs.CV | cs.LG

Huan Chen, Ting Han, Siyu Chen, Zhihao Guo, Yiping Chen

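TL;DR: The paper proposes Semantic4Safety, a framework that combines zero-shot semantic segmentation with causal inference to analyze road safety features in street-view imagery and reveal causal patterns for different accident types.

Details

Motivation: Existing research lacks tools for capturing accident-related features at street level and for quantifying their causal impact. This paper aims to fill that gap.

Result: Scene complexity, exposure, and roadway geometry are the dominant predictive features. Larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it.

Insight: Bridging predictive modeling with causal inference provides a scalable, data-driven tool for urban road safety planning.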

Abstract: Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.

[33] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation cs.CV

Feifei Zhang, Zhenhong Jia, Sensen Song, Fei Shi, Dayong Ren

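TL;DR: The paper proposes a Predictive-Corrective (PC) paradigm that accelerates learning by decoupling the modeling task, with a focus on medical image segmentation. Building on it, the authors design PCMambaNet, a network composed of a Predictive Prior Module (PPM) and a Corrective Residual Network (CRN), which markedly improves convergence speed and accuracy.

Details

Motivation: The end-to-end deep learning paradigm converges slowly and depends on large-scale data in medical image segmentation, which limits its efficiency and applicability in data-scarce domains.

Result: On high-resolution brain MRI segmentation, PCMambaNet reaches state-of-the-art accuracy within 1-5 epochs, far faster than conventional end-to-end models.

Insight: Explicitly incorporating domain knowledge simplifies the learning objective, mitigates data scarcity and overfitting, and substantially improves model efficiency.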

Abstract: Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge (bilateral symmetry) to predict a ‘focus map’ of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network’s full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs, a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.
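
The symmetry prior behind the focus map can be illustrated with a hand-written stand-in: compare each pixel with its mirror image across the vertical midline and flag large differences. This assumes the anatomical midline coincides with the image center and uses a fixed threshold; the paper's Predictive Prior Module may compute its focus map quite differently, and the function names are hypothetical.

```python
def asymmetry_map(image):
    """Per-pixel absolute difference between an image and its left-right mirror.

    image: list of rows, each a list of floats. Assumes the anatomical midline
    lies on the vertical center line of the image.
    """
    return [
        [abs(v - row[len(row) - 1 - j]) for j, v in enumerate(row)]
        for row in image
    ]


def focus_mask(image, threshold):
    """Binary mask marking pixels whose asymmetry exceeds `threshold`."""
    return [[1 if a > threshold else 0 for a in row] for row in asymmetry_map(image)]


# Toy slice: the first row is symmetric, the second has a one-sided bright spot.
slice_2d = [[1.0, 2.0, 2.0, 1.0],
            [1.0, 5.0, 2.0, 1.0]]
mask = focus_mask(slice_2d, threshold=1.0)
```

Note that a mirror difference flags both the abnormal pixel and its healthy counterpart on the other side; deciding which side is pathological is left to the later refinement stage.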

[34] Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning cs.CV | cs.AI

Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang

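TL;DR: This paper proposes EARL, an evidence-prioritized adaptive framework that uses reinforcement learning to dynamically select key frames and perform localized re-sampling around them, substantially improving the reasoning ability of video large language models.

Details

Motivation: Existing video large language models suffer from information dilution and obscured key evidence in long-video reasoning, and they lack rigorous reward mechanisms to enforce evidence purity.

Result: The method achieves the best results among open-source Video LLMs on five video reasoning benchmarks. The 7B model scores 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME.

Insight: Evidence purity is critical for video reasoning, and dynamic selection with localized re-sampling markedly improves model performance.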

Abstract: Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
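
The two operations the abstract attributes to EARL, selecting the most relevant frames and re-sampling locally around them, can be sketched with fixed relevance scores. In the paper the selection policy is learned with reinforcement learning; the hand-supplied scores, the top-k rule, the coarse sampling stride, and the function names here are assumptions for illustration.

```python
def select_key_frames(scores, k):
    """Indices of the k highest-scoring pre-sampled frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def local_resample(key_frames, num_frames, radius, stride=1):
    """Frame indices within `radius` of each key frame, clipped to the video."""
    picked = set()
    for kf in key_frames:
        for t in range(kf - radius, kf + radius + 1, stride):
            if 0 <= t < num_frames:
                picked.add(t)
    return sorted(picked)


# Hypothetical setup: a 50-frame video pre-sampled every 10 frames.
coarse_stride = 10
relevance = [0.1, 0.9, 0.2, 0.8, 0.3]          # one score per pre-sampled frame
key_coarse = select_key_frames(relevance, k=2)  # indices into the coarse sample
key_dense = [i * coarse_stride for i in key_coarse]
evidence = local_resample(key_dense, num_frames=50, radius=2)
```

The re-sampled set reaches frames that uniform pre-sampling skipped, which is how fine-grained temporal detail around the selected evidence becomes available.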

[35] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention cs.CV

Nengbo Zhang, Hann Woei Ho

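TL;DR: MAVR-Net is a multi-view learning framework for Micro Aerial Vehicle (MAV) action recognition. It combines RGB frames, optical flow, and segmentation masks, and uses a cross-view attention module and a multi-scale feature pyramid to improve robustness and accuracy.

Details

Motivation: Existing MAV action recognition methods based on RGB data alone struggle to capture complex spatio-temporal motion characteristics, which limits their recognition ability.

Result: The method reaches accuracies of 97.8%, 96.5%, and 92.8% on the Short MAV, Medium MAV, and Long MAV datasets, respectively.

Insight: Combining multiple modalities with cross-view attention substantially improves MAV action recognition, particularly in modeling complex spatio-temporal features.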

Abstract: Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.

[36] DPTrack: Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking cs.CV

Zhiqiang Zhu, Xinbo Gao, Wen Lu, Jie Li, Zhaoyang Wang

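TL;DR: DPTrack is a prompt-learning-based nighttime aerial tracker that encodes the target object's attribute features into a directional kernel enriched with fine-grained cues, generating precise prompts that improve tracking performance.

Details

Motivation: Existing nighttime aerial trackers rely solely on spatial localization supervision, which produces vague prompts that fail to focus accurately on target features and leads to poor tracking performance.

Result: DPTrack performs strongly on multiple benchmarks, demonstrating its robustness and accuracy.

Insight: The key to nighttime aerial tracking is to use the target's topological attributes and fine-grained cues to generate precise prompts, with the directional kernel serving as the core guidance signal.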

Abstract: Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker’s ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object’s attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object’s topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object’s fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack’s superior performance. Our code will be available at https://github.com/zzq-vipsl/DPTrack.

[37] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation cs.CVPDF

Vu Tram Anh Khuong, Luu Tu Nguyen, Thanh Ha Le, Thi Duyen Ngo

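TL;DR: This paper proposes a phase-aware temporal augmentation method based on dynamic images (DIs). It decomposes each micro-expression sequence into two motion phases, onset-to-apex and apex-to-offset, and generates a dual-phase DI to enrich motion diversity and improve recognition performance.

Details

Motivation: Micro-expression recognition (MER) is limited by the scarcity of annotated data. Traditional methods rely mainly on simple spatial augmentations (e.g., flipping, rotation) and overlook the potential of temporal augmentation strategies. This paper aims to improve MER performance with a phase-aware temporal augmentation method.

Result: Experiments on the CASME-II and SAMM datasets show that the method significantly improves recognition accuracy, unweighted F1-score, and unweighted average recall, with a relative improvement of up to 10% when combined with spatial augmentations.

Insight: Increasing temporal diversity through phase decomposition effectively captures the subtle motion characteristics of micro-expressions, offering a robust and general solution for low-resource MER.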

Abstract: Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic images. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on the CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
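
The dual-phase idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the simplified approximate rank pooling form of a dynamic image (a weighted sum of frames with coefficients 2t - T - 1), which the abstract does not specify. Frames are flat lists of pixel intensities, and the function names and the `apex` index argument are hypothetical.

```python
def dynamic_image(frames):
    """Collapse T frames into one image with weights alpha_t = 2t - T - 1 (t = 1..T)."""
    T = len(frames)
    n_pixels = len(frames[0])
    out = [0.0] * n_pixels
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - T - 1  # later frames get larger weights
        for i, value in enumerate(frame):
            out[i] += alpha * value
    return out


def dual_phase_dynamic_images(frames, apex):
    """Split a sequence at the apex frame index and build one DI per phase."""
    onset_to_apex = frames[: apex + 1]  # onset ... apex (inclusive)
    apex_to_offset = frames[apex:]      # apex ... offset (inclusive)
    return dynamic_image(onset_to_apex), dynamic_image(apex_to_offset)


# Toy sequence: one pixel that rises to a peak at frame 2 and then decays.
sequence = [[0.0], [1.0], [2.0], [1.0], [0.0]]
rising_di, falling_di = dual_phase_dynamic_images(sequence, apex=2)
single_di = dynamic_image(sequence)
```

In this toy case the symmetric rise and fall cancel in the single onset-to-offset image, while the two phase images keep opposite-signed responses, which is one way to see why the phases can carry complementary cues.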

[38] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes cs.CVPDF

Lingfeng Xuan, Chang Nie, Yiqing Xu, Zhe Liu, Yanzi Miao

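TL;DR: MRASfM is a multi-camera SfM framework for driving scenes. It exploits the fixed spatial relationships between cameras to make pose estimation more reliable, uses a plane model to improve road surface reconstruction, and treats the camera set as a single unit in Bundle Adjustment to improve efficiency; it performs strongly on public datasets.

Details

Motivation: Multi-camera SfM in driving scenes suffers from unreliable pose estimation, excessive outliers in road surface reconstruction, and low efficiency; MRASfM aims to address these challenges.

Result: It achieves an absolute pose error of 0.124 on the nuScenes dataset, outperforming existing methods.

Insight: Through hard constraints and joint multi-camera optimization, MRASfM significantly improves the reliability and efficiency of SfM in driving scenes.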

Abstract: Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
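
The abstract says only that a plane model removes erroneous points from the triangulated road surface. One plausible way to do that is sketched below with a basic RANSAC plane fit; RANSAC, the distance threshold, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import random


def plane_from_points(p, q, r):
    """Return (normal, d) with normal . x + d = 0, or None if the points are collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:
        return None
    n = [c / norm for c in n]
    d = -sum(n[i] * p[i] for i in range(3))
    return n, d


def filter_road_points(points, threshold=0.05, iterations=200, seed=0):
    """Keep the points within `threshold` of the best plane found by RANSAC."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [pt for pt in points
                   if abs(sum(n[i] * pt[i] for i in range(3)) + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


# Toy road: a flat 5 x 5 grid at z = 0 plus two spurious points off the surface.
road = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
outliers = [(1.0, 1.0, 2.0), (3.0, 2.0, -1.5)]
kept = filter_road_points(road + outliers)
```

A real road is not perfectly planar, so a production system would likely fit planes locally or relax the threshold; the sketch only shows the filtering principle.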

Jinghao Huang, Yaxiong Chen, Ganchao Liu

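TL;DR: This paper proposes MSAM, a new method for drone video-text retrieval that significantly improves retrieval performance through a multi-semantic adaptive learning mechanism and a cross-modal interactive feature fusion strategy.

Details

Motivation: Drone videos have distinctive overhead perspectives, strong structural homogeneity, and diverse semantic expressions. Existing cross-modal retrieval methods designed for ground-level views struggle to model these characteristics, so dedicated retrieval mechanisms are needed.

Result: Experiments on two self-constructed drone video-text datasets show that MSAM outperforms existing methods on the retrieval task.

Insight: The distinctive viewpoint and semantic complexity of drone videos call for purpose-built retrieval mechanisms; dynamic semantic mining and focusing features on target regions are key to improving retrieval performance.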

Abstract: With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

[40] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition cs.CVPDF

Vu Tram Anh Khuong, Thi Bich Phuong Man, Luu Tu Nguyen, Thanh Ha Le, Thi Duyen Ngo

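TL;DR: This paper proposes a Combined Optical Flow (COF) method that integrates the onset-to-apex and apex-to-offset phases for comprehensive micro-expression recognition, significantly improving recognition performance.

Details

Motivation: Existing micro-expression recognition methods typically consider only optical flow from the onset-to-apex phase and ignore the key temporal dynamics of the apex-to-offset phase, which limits the comprehensiveness and accuracy of recognition.

Result: Experiments on the CASMEII and SAMM datasets show that COF outperforms methods based on a single optical flow, confirming its effectiveness in capturing micro-expression dynamics.

Insight: The temporal dynamics of the apex-to-offset phase are critical to micro-expression recognition performance, and future research should pay more attention to this phase.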

Abstract: Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.

[41] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI cs.CV | cs.CLPDF

Syed Abdul Gaffar Shakhadri, Kruthika KR, Kartik Basavaraj Angadi

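TL;DR: Shakti-VLMs are a family of vision-language models with 1B and 4B parameters that achieve high performance with little data through architectural innovations and a three-stage training strategy, making them suitable for enterprise multimodal tasks.

Details

Motivation: Current vision-language models depend on large amounts of training data; Shakti-VLMs aim to reduce data requirements and improve efficiency by optimizing model design and training strategy.

Result: Shakti-VLM-1B and Shakti-VLM-4B perform strongly on document understanding, visual reasoning, OCR extraction, and multimodal reasoning tasks.

Insight: High performance can be achieved through model design and training strategy rather than data scale alone, providing an efficient solution for enterprise multimodal tasks.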

Abstract: We introduce Shakti VLM, a family of vision-language models at 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
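
The abstract names QK-Normalization without describing it. The sketch below shows one common form of the idea, assumed here for illustration: L2-normalize each query and key vector before the dot product and multiply by a temperature, so attention logits stay bounded however large the activations grow. The fixed `scale` stands in for a value that would usually be learned, and whether Shakti uses this exact variant is not stated in the source.

```python
import math


def l2_normalize(vec, eps=1e-6):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention_weights(query, keys, scale=10.0):
    """Attention weights of one query over several keys, with Q and K L2-normalized.

    `scale` stands in for a temperature that would usually be learned.
    """
    q = l2_normalize(query)
    logits = [scale * sum(a * b for a, b in zip(q, l2_normalize(k))) for k in keys]
    return logits, softmax(logits)


# Even with very large activations, every logit stays within [-scale, scale].
logits, weights = qk_norm_attention_weights(
    query=[1000.0, 0.0],
    keys=[[500.0, 0.0], [0.0, 900.0], [-300.0, 0.0]],
)
```

Without the normalization, the first logit here would be 500,000 and the softmax would saturate completely, which is the kind of instability the technique is meant to avoid.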

[42] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement cs.CVPDF

Xianmin Chen, Peiliang Huang, Longfei Han, Dingwen Zhang, Junwei Han

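TL;DR: The paper proposes HiMA, a hierarchical mixing architecture that combines the strengths of Transformer and Mamba modules for efficient low-light RAW image enhancement. The method also includes Local Distribution Adjustment (LoDA) and Multi-prior Fusion (MPF) modules, significantly improving enhancement quality and efficiency.

Details

Motivation: Low-light RAW image enhancement is a complex task, and existing deep learning methods struggle to balance efficiency and quality. HiMA aims to overcome this with a new architectural design.

Result: On multiple public datasets, HiMA outperforms existing methods with fewer parameters.

Insight: Combining the strengths of different modules (such as Transformer and Mamba) effectively improves the performance and efficiency of low-light image enhancement; local feature processing and multi-domain prior fusion are key.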

Abstract: Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at https://github.com/Cynicarlos/HiMA.

[43] Exploring Conditions for Diffusion models in Robotic Control cs.CV | cs.ROPDF

Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim

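TL;DR: Pre-trained diffusion models have made significant progress in vision, but applying text conditions directly to robotic control tasks works poorly. This paper proposes ORCA, which significantly improves control performance through adaptive task prompts and visual prompts.

Details

Motivation: Pre-trained visual representations perform well in imitation learning, but because they are task-agnostic they may not transfer directly to robotic control tasks. This paper explores how to use pre-trained diffusion models to obtain task-adaptive visual representations without fine-tuning the model itself.

Result: ORCA achieves state-of-the-art performance on multiple robotic control benchmarks, significantly outperforming existing methods.

Insight: Simply copying methods that succeed in other vision domains (such as text conditioning) may not work in robotic control; more dynamic, task-specific conditions are needed to improve performance.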

Abstract: While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model’s training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.

[44] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents cs.CV | cs.AI | eess.IVPDF

Tingyu Lin, Marco Peer, Florian Kleber, Robert Sablatnig

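TL;DR: ClapperText is a benchmark dataset for text recognition in low-resource archival documents, with annotations for handwritten and printed text, suited to OCR research in challenging settings.

Details

Motivation: Documents in historical archives are often severely degraded and resources are limited, so existing OCR methods perform poorly on them. ClapperText aims to provide a realistic low-resource dataset to advance document understanding and OCR.

Result: Experiments show that, despite limited training data (18 videos), fine-tuning substantially improves model performance, demonstrating the dataset's suitability for few-shot learning.

Insight: ClapperText exposes the challenges of complex historical documents (such as motion blur and handwriting variation) and provides a valuable resource and benchmark for low-resource OCR research.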

Abstract: This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText’s suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
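
The abstract states that annotations are rotated bounding boxes stored as 4-point polygons. As a rough illustration of that representation, the sketch below converts a centre, size, and angle into four corner points. The parameterization and function name are hypothetical; the dataset's actual annotation schema is not given in the source.

```python
import math


def rotated_box_to_polygon(cx, cy, width, height, angle_deg):
    """Corners of a box centred at (cx, cy), rotated by angle_deg about its centre.

    The rotation is counter-clockwise in a y-up frame; with image coordinates
    (y pointing down) the same formula appears clockwise on screen.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = width / 2.0, height / 2.0
    corners = [(-half_w, -half_h), (half_w, -half_h), (half_w, half_h), (-half_w, half_h)]
    return [(cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t) for x, y in corners]


polygon = rotated_box_to_polygon(cx=50.0, cy=20.0, width=40.0, height=10.0, angle_deg=90.0)
```

Storing the four corners directly, rather than the angle, avoids ambiguity about rotation conventions when different OCR toolkits read the same annotation.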

[45] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation cs.CVPDF

Xiaoming Zhu, Xu Huang, Qinghongbing Xie, Zhi Deng, Junsheng Yu

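TL;DR: Imaginarium is a vision-guided 3D scene layout generation system that significantly improves layout richness and quality through a high-quality asset library, an image generation model, and an image parsing module.

Details

Motivation: Traditional optimization methods depend on cumbersome hand-crafted rules, deep generative models struggle to produce rich and diverse content, and approaches based on large language models lack robustness and struggle to capture complex spatial relationships.

Result: User testing shows that the method significantly outperforms existing methods in layout richness and quality.

Insight: Combining visual guidance with semantic optimization is an effective approach to 3D layout generation.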

Abstract: Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.

Zhen Sun, Lei Tan, Yunhang Shen, Chengmao Cai, Xing Sun

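TL;DR: FlexiReID is a flexible cross-modal person re-identification framework that supports four modalities (RGB, infrared, sketch, text) and seven retrieval modes. It dynamically integrates features with an adaptive mixture-of-experts (MoE) mechanism and achieves leading performance on the newly constructed CIRS-PEDES dataset.

Details

Motivation: Existing cross-modal person re-identification methods are limited to specific modality combinations and lack the flexibility to support the varied query-retrieval needs of practical deployment.

Result: It achieves state-of-the-art performance on the CIRS-PEDES dataset and shows strong generalization.

Insight: Dynamic feature fusion and flexible modality support are key to making cross-modal person re-identification practical.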

Abstract: Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketch, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
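
The abstract does not describe how the adaptive mixture-of-experts mechanism routes features. The sketch below shows the generic pattern such mechanisms usually build on, assumed here for illustration: a gate scores each expert from the input feature, a softmax turns scores into weights, and the output is the weighted sum of expert outputs. The toy experts and the linear gate are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(feature, experts, gate_weights):
    """Blend expert outputs with input-dependent softmax gates.

    gate_weights[k] is a (hypothetical) linear scoring vector for expert k.
    """
    scores = [sum(w * x for w, x in zip(wk, feature)) for wk in gate_weights]
    gates = softmax(scores)
    outputs = [expert(feature) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]
    return mixed, gates


# Two toy experts: one doubles the feature, the other negates it.
experts = [lambda f: [2 * x for x in f], lambda f: [-x for x in f]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
mixed, gates = mixture_of_experts([3.0, 0.0], experts, gate_weights)
```

Because the gates depend on the input, features from different modalities can lean on different experts without a separate model per modality combination.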

[47] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection cs.CV | I.4.7; I.2.10; I.3.8PDF

Andrei-Timotei Ardelean, Patrick Rückbeil, Tim Weyrich

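TL;DR: This paper proposes QFCA, an efficient zero-shot texture anomaly detection method that implements a quantized version of feature correspondence analysis (FCA) and achieves a 10x speedup while maintaining high accuracy.

Details

Motivation: Existing texture anomaly detection methods have long running times, making them hard to apply in real-world scenarios such as assembly line monitoring. Quantization and PCA preprocessing address both speed and accuracy.

Result: QFCA achieves a 10x speedup with accuracy comparable to or better than existing methods.

Insight: Quantization and PCA preprocessing effectively balance speed and accuracy, offering a practical solution for zero-shot anomaly detection.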

Abstract: Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
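
The abstract says the patch statistics comparison is adapted to work on histograms of quantized values. The sketch below is a heavily simplified 1D illustration of that principle and not the FCA or QFCA algorithm: scalar features are quantized into a few bins, each non-overlapping window gets a normalized histogram, and windows are scored by L1 distance from the global histogram. All names and parameters are hypothetical.

```python
def quantize(values, n_bins, lo, hi):
    """Map each value in [lo, hi] to an integer bin index in [0, n_bins - 1]."""
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def histogram(bins, n_bins):
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    total = len(bins)
    return [c / total for c in counts]


def window_anomaly_scores(values, window, n_bins=4, lo=0.0, hi=1.0):
    """L1 distance between each window's histogram and the global histogram."""
    bins = quantize(values, n_bins, lo, hi)
    reference = histogram(bins, n_bins)
    scores = []
    for start in range(0, len(bins) - window + 1, window):
        local = histogram(bins[start:start + window], n_bins)
        scores.append(sum(abs(a - b) for a, b in zip(local, reference)))
    return scores


# A repeating "texture" (0.1, 0.6) with one window replaced by bright values.
signal = [0.1, 0.6] * 8
signal[8:12] = [0.9, 0.9, 0.9, 0.9]
scores = window_anomaly_scores(signal, window=4)
```

Working on small integer histograms instead of raw feature vectors is what makes this kind of comparison cheap, which is consistent with the speedup the abstract attributes to quantization.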

[48] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration cs.CVPDF

Tomáš Chobola, Julia A. Schnabel, Tingying Peng

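TL;DR: This paper proposes Noise2Detail (N2D), an ultra-lightweight data-free denoising method that addresses the computational and memory limits of existing self-supervised denoising techniques while delivering fast, high-quality image restoration.

Details

Motivation: Existing self-supervised denoising techniques perform well but are often constrained in practice by high computational and memory demands, making it hard to balance inference speed and reconstruction quality. This is especially true in biomedical imaging, where clean training data is scarce and imaging modalities are complex.

Result: Experiments show that N2D outperforms existing data-free denoising techniques while consuming significantly fewer computational resources.

Insight: N2D's efficiency, low cost, and data-free nature make it particularly well suited to biomedical imaging: it sidesteps the scarcity of clean data and supports fast inference for practical use.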

Abstract: Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high-quality image restoration. Built upon the Noise2Noise training framework, which removes the reliance on clean reference images or explicit noise modeling, we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and a data-free approach makes it a valuable tool for biomedical imaging, overcoming the challenge of scarce clean training data (a consequence of rare and complex imaging modalities) while enabling fast inference for practical use.

[49] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey cs.CVPDF

Shuchang Lyu, Qi Zhao, Zheng Zhou, Meng Li, You Zhou

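TL;DR: This paper is a comprehensive survey of deep learning based domain adaptation methods in remote sensing. It covers task categorization, input mode, supervision paradigm, and algorithmic granularity, and summarizes current progress and future challenges.

Details

Motivation: Domain adaptation in remote sensing faces major challenges from differences in data distribution (e.g., ground sampling distance, imaging modes). Deep learning's strong feature representation and cross-domain knowledge transfer have attracted wide attention in the field, but a comprehensive, systematic survey has been lacking.

Result: It summarizes the performance of current state-of-the-art methods and offers guidance on future research directions.

Insight: The diversity and complexity of remote sensing domain adaptation tasks call for a more systematic taxonomy; future research may focus on multimodal data fusion and more efficient cross-domain learning methods.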

Abstract: Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain to a differently distributed target domain. It has broad real-world applications, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.

[50] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation cs.CVPDF

Lei Shi, Gang Li, Junxing Zhang

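TL;DR: This paper proposes a weakly supervised ultrasound image segmentation method that needs only four extreme points as annotation. It uses SAM2 to generate initial pseudo labels and progressively refines the segmentation with the FGEPM algorithm and uncertainty estimation.

Details

Motivation: Traditional fully supervised medical image segmentation requires extensive pixel-level annotation, which is costly and time-consuming. To reduce this burden, the paper proposes a weakly supervised method that needs only extreme points.

Result: Experiments on the BUSI and UNS datasets show that the method performs comparably to, and in some cases better than, fully supervised methods while greatly reducing annotation cost.

Insight: 1) Extreme-point annotations are sufficient to support high-quality segmentation; 2) uncertainty estimation helps refine boundaries; 3) weakly supervised methods have great potential in practical applications.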

Abstract: Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing, that of fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.
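
Two of the building blocks mentioned above are simple enough to sketch: deriving a box prompt from four extreme points, and summarizing Monte Carlo dropout passes as a per-pixel mean and variance. The sketch assumes plain (x, y) tuples and flat lists of foreground probabilities; the stochastic forward passes themselves (a model with dropout left active) are not shown, and the numbers are made up for illustration.

```python
def box_from_extreme_points(points):
    """Return (x_min, y_min, x_max, y_max) enclosing the given (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def mc_dropout_uncertainty(stochastic_predictions):
    """Per-pixel mean and variance over several stochastic forward passes."""
    n = len(stochastic_predictions)
    n_pixels = len(stochastic_predictions[0])
    means = [sum(p[i] for p in stochastic_predictions) / n for i in range(n_pixels)]
    variances = [sum((p[i] - means[i]) ** 2 for p in stochastic_predictions) / n
                 for i in range(n_pixels)]
    return means, variances


# Hypothetical clicks on a lesion: left-most, top-most, right-most, bottom-most.
extreme_points = [(40, 95), (110, 30), (180, 88), (120, 150)]
box_prompt = box_from_extreme_points(extreme_points)

# Three made-up dropout passes over two pixels: the second pixel is unstable.
passes = [[0.9, 0.2], [0.9, 0.8], [0.9, 0.5]]
means, variances = mc_dropout_uncertainty(passes)
```

High-variance pixels tend to sit near ambiguous boundaries, which is why a variance map is a natural ingredient for the boundary-tracing cost map the abstract describes.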

[51] Valeo Near-Field: a novel dataset for pedestrian intent detection cs.CV | cs.AIPDF

Antonyo Musabini, Rachid Benmokhtar, Jagdish Bhanushali, Victor Galizzi, Bertrand Luvison

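TL;DR: The paper introduces Valeo Near-Field, a new dataset for detecting the intentions of pedestrians as they approach a vehicle. The dataset contains multimodal data with detailed annotations and benchmarks, and aims to advance research on intelligent vehicles in near-field scenarios.

Details

Motivation: Existing datasets fall short for pedestrian intent detection and near-field perception, especially in multimodal data synchronization and diversity of real-world scenarios. The paper aims to foster research on related algorithms by providing a high-quality multimodal dataset.

Result: The dataset and benchmark provide evaluation standards for pedestrian detection, 3D pose estimation, and trajectory prediction, and show potential for multimodal data fusion tasks.

Insight: Multimodal data combined with detailed annotations can significantly improve the robustness of pedestrian intent detection, especially in complex dynamic environments and under hardware constraints. The dataset is an important resource for near-field research.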

Abstract: This paper presents a novel dataset aimed at detecting pedestrians’ intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.

[52] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI cs.CV | cs.AIPDF

Gerard Comas-Quiles, Carles Garcia-Cabrera, Julia Dietlmeier, Noel E. O’Connor, Ferran Marques

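TL;DR: This paper proposes MViT-AE, an unsupervised learning method based on multimodal MRI that segments brain tumors via reconstruction error maps. It addresses the scarcity of annotated data and achieves clinically meaningful results on the BraTS-GoAT 2025 dataset.

Details

Motivation: Because annotated data is scarce, costly, or inconsistent, traditional supervised learning faces a scalability bottleneck in brain tumor segmentation. Unsupervised anomaly detection (UAD) offers a complementary approach that does not rely on manual annotation.

Result: On the BraTS-GoAT 2025 test set, the lesion-wise Dice coefficients are 0.437 (whole tumor), 0.316 (tumor core), and 0.350 (enhancing tumor); the anomaly detection rate on the validation set is 89.4%.

Insight: Transformer-based unsupervised models are promising as scalable, label-efficient tools for neuro-oncological imaging, especially when annotated data is limited.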

Abstract: Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated in the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
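
The reconstruction-based pipeline and its metric can be illustrated with a small sketch. It assumes a fixed error threshold and flat 1D "images", and it computes the plain Dice coefficient rather than the lesion-wise variant reported in the paper; the threshold and all values are made up for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks; 1.0 if both are empty."""
    intersection = sum(p and t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    return 1.0 if size == 0 else 2.0 * intersection / size


def error_map_to_mask(original, reconstruction, threshold):
    """Flag pixels whose absolute reconstruction error exceeds `threshold`."""
    return [int(abs(o - r) > threshold) for o, r in zip(original, reconstruction)]


# Toy 1D "scan": the autoencoder reconstructs healthy tissue well but not the lesion.
scan = [0.2, 0.2, 0.9, 0.8, 0.2, 0.2]
recon = [0.2, 0.25, 0.3, 0.3, 0.2, 0.6]
ground_truth = [0, 0, 1, 1, 0, 0]
predicted = error_map_to_mask(scan, recon, threshold=0.3)
dice = dice_coefficient(predicted, ground_truth)
```

The single false positive at the last pixel lowers the score, which hints at why a contour-refinement step such as the SAM post-processing described in the abstract can help.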

[53] UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis cs.CVPDF

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma

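TL;DR: This paper proposes the UniMedVL framework, which unifies medical multimodal understanding and generation through the Observation-Knowledge-Analysis (OKA) paradigm, filling gaps in data representation and feature integration in existing medical AI systems.

Details

Motivation: Existing medical AI systems are fragmented when processing multimodal inputs and generating diverse outputs: they cannot perform image understanding and generation at the same time, which limits their usefulness in real medical diagnosis.

Result: UniMedVL performs strongly on five medical image understanding benchmarks and matches specialized models on generation tasks across eight medical imaging modalities.

Insight: The unified architecture enables bidirectional knowledge sharing: generation tasks improve visual understanding features, showing that integrating traditionally separate capabilities can significantly improve performance on medical vision-language tasks.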

Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.

[54] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification cs.CV | cs.AI | eess.IVPDF

Tingyu Lin, Armin Dadras, Florian Kleber, Robert Sablatnig

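TL;DR: DGME-T is a lightweight Transformer-based extension that uses directional grid motion encoding (DGME) to make video classification models more robust, including on historical archival film.

Details

Motivation: Camera movement classification (CMC) models trained on high-quality modern footage degrade on historical archival film, which is noisy, has missing frames, and has low contrast.

Result: On modern video, top-1 accuracy rises from 81.78% to 86.14% and macro F1 from 82.08% to 87.81%; on World War II archival footage, accuracy rises from 83.43% to 84.62% and macro F1 from 81.72% to 82.63%.

Insight: Structured motion priors and Transformer representations are complementary; even a small motion head can substantially improve robustness in degraded film analysis.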

Abstract: Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone’s top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
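
The abstract says the directional grid motion encoding is derived from optical flow but does not give its exact form. One plausible reading, assumed here purely for illustration, is to split the flow field into a grid of cells and build a magnitude-weighted histogram of flow directions in each cell. The grid size, number of direction bins, and function name are hypothetical.

```python
import math


def directional_grid_encoding(flow, grid=2, n_dirs=4):
    """Encode a dense flow field as per-cell direction histograms.

    flow: H x W nested list of (dx, dy) vectors; H and W must be divisible by `grid`.
    Returns grid * grid histograms, each with `n_dirs` magnitude-weighted bins.
    """
    height, width = len(flow), len(flow[0])
    cell_h, cell_w = height // grid, width // grid
    encoding = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0.0] * n_dirs
            for y in range(gy * cell_h, (gy + 1) * cell_h):
                for x in range(gx * cell_w, (gx + 1) * cell_w):
                    dx, dy = flow[y][x]
                    magnitude = math.hypot(dx, dy)
                    if magnitude == 0.0:
                        continue
                    angle = math.atan2(dy, dx) % (2 * math.pi)
                    # Centre bin 0 on angle 0 so that "right" falls in one bin.
                    shifted = (angle + math.pi / n_dirs) % (2 * math.pi)
                    hist[int(shifted / (2 * math.pi / n_dirs)) % n_dirs] += magnitude
            encoding.append(hist)
    return encoding


# A 4 x 4 field with uniform rightward motion, roughly what a pan produces.
pan_flow = [[(1.0, 0.0)] * 4 for _ in range(4)]
encoding = directional_grid_encoding(pan_flow)
```

A compact vector like this could then be fused with backbone features; the abstract describes that fusion as a learnable, normalised late-fusion layer, which is not sketched here.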

[55] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset cs.CV

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang

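TL;DR: The paper proposes Ditto, a framework that addresses the scarcity of instruction-based video editing data by generating a large-scale, high-quality synthetic dataset (Ditto-1M), and uses it to train a strong model, Editto.

Details

Motivation: Instruction-based video editing has potential for content creation, but progress is slow for lack of large-scale, high-quality training data. The paper targets this bottleneck.

Result: Editto performs strongly on instruction-based video editing and sets a new state of the art.

Insight: Efficient generation and quality control of synthetic data are key to progress in instruction-based video editing, and model efficiency measures (such as distillation and temporal enhancement) are essential for working at scale.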

Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu

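TL;DR: OmniVinci is an open-source omni-modal understanding LLM that substantially improves performance on multimodal tasks through architectural innovations and data curation.

Details

Motivation: Advancing machine intelligence requires the ability to perceive across modalities, much as humans sense the world.

Result: OmniVinci outperforms Qwen2.5-Omni on several benchmarks while using only one sixth of its training tokens.

Insight: Modalities reinforce one another, in reasoning as well as in perception.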

Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

[57] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior cs.CV

Haoran Wang, Bo Zhao, Jinghui Wang, Hanzhang Wang, Huan Yang

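TL;DR: This paper proposes SEGA, a stepwise evolution paradigm for content-aware layout generation that uses a hierarchical reasoning framework and design priors to substantially improve the accuracy of complex layout planning.

Details

Motivation: Existing methods typically use a single-step reasoning framework with no feedback mechanism, which leads to high failure rates on complex layout planning.

Result: Achieves state-of-the-art results on multiple benchmark datasets.

Insight: Combining stepwise reasoning with design priors can substantially improve the robustness and quality of layout generation.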

Abstract: In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually handle this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism causes their failure rates to rise significantly when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, a refining module performs fine-level reasoning over the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. We also present GenPoster-100K, a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving state-of-the-art results on multiple benchmark datasets. Our project page is at: https://brucew91.github.io/SEGA.github.io/

[58] Semantic segmentation with coarse annotations cs.CV | cs.AI | cs.LG

Jort de Jong, Mike Holenderski

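TL;DR: This paper proposes a regularization method for semantic segmentation with coarse annotations that improves boundary alignment through superpixel-based upsampling.

Details

Motivation: Fine annotation for semantic segmentation is expensive. Coarse annotation is an alternative, but it yields poor boundary alignment that needs improving.

Result: Evaluated on the SUIM, Cityscapes, and PanNuke datasets, boundary recall is significantly better than that of existing methods.

Insight: The low-level image features that superpixels capture (color, position) can compensate for the boundary information missing from coarse annotations.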

Abstract: Semantic segmentation is the task of classifying each pixel in an image. A segmentation model achieves the best results when trained on annotated images in which each pixel is labeled with its class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel-based upsampling. It encourages the segmented pixels in the decoded image to be SLIC superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke datasets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.

[59] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model cs.CV | cs.LG

Gaoxiang Huang, Songning Lai, Yutao Yue

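TL;DR: The paper proposes a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components, improving concept alignment and classification performance.

Details

Motivation: Existing Concept Bottleneck Models (CBMs) suffer from input-to-concept mapping bias and limited controllability, which limits their practical value and the reliability of strategies built on them.

Result: Experiments on three diverse datasets show that LDCBM outperforms previous CBMs in both concept and class accuracy while improving transparency and robustness.

Insight: By grounding concepts in visual evidence, the method overcomes a fundamental limitation of prior models and improves the reliability of interpretable AI.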

Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restrict their practical value and directly undermine the reliability of strategies built on concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.

[60] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection cs.CV

Haowei Zhu, Tianxiang Pan, Rui Qin, Jun-Hai Yong, Bin Wang

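TL;DR: ReCon is a new data augmentation framework that combines a region-controllable generative model with feedback from a perception model to address the content-position mismatch and semantic leakage of current generative methods.

Details

Motivation: Generative models used for data augmentation suffer from content-position mismatch and semantic leakage, and they require complex post-processing or extensive fine-tuning. ReCon aims to improve the quality and trainability of generated data.

Result: Experiments show that ReCon substantially improves the quality of generated data and the resulting training, with consistent gains across datasets, backbone architectures, and data scales.

Insight: Perception-model feedback and region alignment make generated data more controllable and higher in quality, offering a new approach to data augmentation.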

Abstract: The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.

[61] VISTA: A Test-Time Self-Improving Video Generation Agent cs.CV

Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister

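TL;DR: VISTA is a test-time self-improving multi-agent system for video generation that raises video quality by iteratively refining prompts, outperforming existing methods.

Details

Motivation: Text-to-video generation depends heavily on precise user prompts, and existing test-time optimization methods struggle with the multi-faceted nature of video. VISTA is proposed to address this.

Result: On single- and multi-scene video generation, VISTA improves video quality and alignment with user intent, reaching up to a 60% pairwise win rate against state-of-the-art baselines; human evaluators prefer VISTA outputs in 66.4% of comparisons.

Insight: Better video generation depends not only on better generative models; iterative prompt refinement and multi-dimensional evaluation matter just as much.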

Abstract: Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
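
The abstract says the best video is chosen by a pairwise tournament but does not give the bracket format. A minimal sketch, assuming a single-elimination bracket and a hypothetical `prefer(a, b)` judge that returns the preferred candidate (in the described system this role would be played by a model, not a score table):

```python
def pairwise_tournament(candidates, prefer):
    """Pick one winner by repeated pairwise comparison.

    candidates: iterable of candidate identifiers.
    prefer: hypothetical judge; prefer(a, b) returns whichever of a, b wins.
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; winners advance.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        # With an odd pool size, the last candidate gets a bye.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy judge backed by made-up scores, for illustration only.
scores = {"video_a": 0.41, "video_b": 0.77, "video_c": 0.63}
best = pairwise_tournament(scores, lambda a, b: a if scores[a] >= scores[b] else b)
```

A bracket needs only n - 1 comparisons for n candidates, which matters when each comparison is an expensive model call; a round-robin would be more robust to a noisy judge at quadratic cost.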

[62] Neuro-Symbolic Spatial Reasoning in Segmentation cs.CV

Jiayi Lin, Jiabo Huang, Shaogang Gong

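TL;DR: The paper proposes RelateSeg, which introduces explicit spatial relational constraints into open-vocabulary semantic segmentation (OVSS) through neuro-symbolic spatial reasoning, improving generalization to unseen objects.

Details

Motivation: Existing methods for OVSS based on vision-language models (VLMs) lack an understanding of the spatial relations between objects in a scene, which limits their ability to segment unseen categories.

Result: State-of-the-art average mIoU across four benchmark datasets, with clear advantages on images containing multiple categories, at the cost of a single auxiliary loss function and no additional parameters.

Insight: Explicitly modeling spatial relations can substantially improve OVSS, especially on complex scenes, and combining neural and symbolic methods is a promising research direction.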

Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., “cat”) and a spatial pseudo category (e.g., “right of person”) simultaneously, enforcing relational constraints (e.g., a “cat” pixel must lie to the right of a “person”). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
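
To make "fuzzy logic relaxation" concrete: a first-order rule such as "a cat pixel lies to the right of a person" can be turned into a differentiable penalty by replacing Boolean truth values with probabilities. The abstract does not say which relaxation RelateSeg uses; the sketch below assumes the Reichenbach implication, I(a, b) = 1 - a + a*b, and the function names are hypothetical.

```python
def implies(a, b):
    """Reichenbach fuzzy implication: degree of truth of (a -> b), a and b in [0, 1]."""
    return 1.0 - a + a * b


def relation_loss(p_cat, p_right_of_person):
    """Mean violation of the rule cat(x) -> right_of_person(x) over pixels.

    p_cat: per-pixel probability of the semantic category "cat".
    p_right_of_person: per-pixel probability of the spatial pseudo category.
    """
    pairs = list(zip(p_cat, p_right_of_person))
    return sum(1.0 - implies(a, b) for a, b in pairs) / len(pairs)


# Two pixels that satisfy the rule, then two confident "cat" pixels that break it.
consistent = relation_loss([1.0, 0.0], [1.0, 0.0])
violating = relation_loss([1.0, 1.0], [0.0, 0.0])
```

The loss is zero when every confident "cat" pixel is also confidently "right of person" and grows as the two predictions disagree, which is how a logical constraint can act as a single auxiliary loss term.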

[63] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt cs.CV

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen

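TL;DR: Memory-SAM uses retrieval to generate prompts automatically, with no human intervention or model fine-tuning, for tongue segmentation, outperforming conventional methods and SAM baselines.

Details

Motivation: Conventional tongue segmentation needs large annotated datasets or relies on manual prompts. Memory-SAM aims at automatic segmentation without training or human intervention, improving robustness and data efficiency.

Result: Evaluated on 600 expert-annotated images, Memory-SAM reaches an mIoU of 0.9863 on the mixed test split, well above FCN (0.8188) and a detector-to-box SAM baseline (0.1839).

Insight: Retrieval can generate effective prompts for SAM automatically, improving robustness and data efficiency when segmenting irregular boundaries.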

Abstract: Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
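
The retrieval step can be illustrated in isolation. The paper uses dense DINOv3 features with a FAISS index; the sketch below swaps both for plain Python (hand-written toy vectors and a brute-force cosine search) so that it runs without dependencies. The names `retrieve_exemplar` and `memory`, and the feature values, are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_exemplar(query_feat, memory):
    """Return the id of the stored case most similar to the query.

    memory: list of (case_id, feature_vector) pairs from prior cases.
    """
    return max(memory, key=lambda item: cosine(query_feat, item[1]))[0]


# Toy memory of two prior cases with 3-dimensional features.
memory = [
    ("case_01", [1.0, 0.0, 0.0]),
    ("case_02", [0.0, 1.0, 0.2]),
]
nearest = retrieve_exemplar([0.1, 0.9, 0.1], memory)
```

In the pipeline described, the retrieved exemplar's mask would then be used to derive foreground and background point prompts for SAM2; that step is not sketched here.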

[64] BLIP3o-NEXT: Next Frontier of Native Image Generation cs.CV

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang

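TL;DR: BLIP3o-NEXT is an open-source image generation foundation model that unifies text-to-image generation and image editing in an autoregressive + diffusion architecture, with strong performance and consistency.

Details

Motivation: The paper aims to push the frontier of native image generation by unifying text-to-image generation and image editing, further improving generation and editing quality.

Result: BLIP3o-NEXT outperforms existing models on a range of text-to-image and image-editing benchmarks.

Insight: (1) Most architectural choices perform comparably; what matters is efficient scaling and fast inference. (2) Reinforcement learning can further advance native image generation. (3) Image editing remains challenging. (4) Data quality and scale determine the upper bound on performance.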

Abstract: We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.

[65] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models cs.CV | cs.NE

Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, Damayanthi Herath

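TL;DR: BiomedXPro is an evolutionary framework that uses a large language model as both a biomedical knowledge extractor and an adaptive optimizer to generate diverse, interpretable natural-language prompt pairs for disease diagnosis, outperforming existing methods.

Details

Motivation: Clinical adoption of biomedical vision-language models is held back by prompt optimization techniques that lack transparency and diversity and cannot capture the multi-faceted nature of clinical diagnosis, which undermines reliability and trust.

Result: On multiple biomedical benchmarks, BiomedXPro outperforms existing prompt optimization methods, and the generated prompts show strong semantic alignment with statistically significant clinical features.

Insight: By generating diverse, interpretable prompt pairs, BiomedXPro improves both model performance and clinical trustworthiness, giving high-stakes AI systems a verifiable basis.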

Abstract: The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model’s performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.

[66] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal cs.CV

Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee, Chih-Hai Su, Yu-Lun Liu

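TL;DR: LightsOut is a diffusion-based outpainting method designed to enhance single image flare removal (SIFR) by reconstructing off-frame light sources, improving the performance of existing methods.

Details

Motivation: Lens flare severely degrades image quality, and existing SIFR methods perform poorly when off-frame light sources are incomplete or absent. Making these methods more robust requires a technique that can fill in the missing light sources.

Result: Across a range of challenging scenes, LightsOut consistently improves existing SIFR methods, supporting its use as a general preprocessing step.

Insight: Using outpainting to restore off-frame light sources offers a new angle on lens flare removal and shows the potential of diffusion models for image restoration.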

Abstract: Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/

cs.CL

[67] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective cs.CL | cs.AI

Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu

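TL;DR: The paper proposes a multi-label perspective for evaluating toxic generation in large language models (LLMs), introducing three multi-label benchmark datasets and a pseudo-label training method that substantially improves toxicity detection.

Details

Motivation: Current toxicity detectors rely mainly on single-label benchmarks, which cannot capture the multi-dimensional, ambiguous nature of real-world toxic prompts and therefore bias detection.

Result: Experiments show that the method significantly outperforms advanced baselines, including GPT-4o and DeepSeek.

Insight: A multi-label view reflects the complexity of toxicity more faithfully, and pseudo-label training is an effective way to improve detection.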

Abstract: Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.

[68] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling cs.CL | cs.SI

Shiyu Ji, Farnoosh Hashemi, Joice Chen, Juanwen Pan, Weicheng Ma

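TL;DR: The paper proposes using large language models (LLMs) to automatically generate and label debate data for analyzing and annotating rhetorical strategies, and validates the resulting model's generalization across domains.

Details

Motivation: Rhetorical strategies are central to persuasive communication, but existing research relies on human annotation, which is costly and hard to scale.

Result: The model performs well on human-labeled data and on external corpora, and is applied to persuasiveness prediction and political debate analysis.

Insight: The study finds increased use of affective argument in U.S. presidential debates, demonstrating the method's practical value and generalization.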

Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. The associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications of the fine-tuned model: (1) improving persuasiveness prediction by incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive argument.

[69] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning cs.CL | cs.AI | cs.IR

Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin

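TL;DR: The paper proposes Structure-R1, a framework that uses reinforcement learning to dynamically generate structured representations that improve LLM reasoning, with higher information density and better reasoning performance than conventional RAG systems.

Details

Motivation: Large language models (LLMs) reason well but are constrained by limited access to explicit, structured knowledge. Conventional retrieval-augmented generation (RAG) systems typically operate over unstructured text, resulting in low information density and suboptimal reasoning.

Result: With a 7B-scale backbone model, the method performs competitively and approaches the performance of much larger models; theoretical analysis indicates that structured representations improve information density and contextual clarity.

Insight: Structured representations effectively strengthen LLM reasoning, and generating and verifying them dynamically suggests a new approach to knowledge-intensive tasks.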

Abstract: Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.

[70] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models cs.CL | cs.AI | cs.SD | eess.AS

Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul

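TL;DR: The paper proposes two methods, Partial YaRN and VLAT, that extend the audio context window of large audio-language models (LALMs) and improve long-form audio understanding.

Details

Motivation: Existing LALMs are limited by short audio context windows and cannot make full use of long audio, so a method for extending the context is needed.

Result: Experiments show that Partial YaRN outperforms the original models and that VLAT improves performance further, supporting long audio of unseen lengths.

Insight: Adjusting only the audio token positions works better than a global adjustment, and a dynamic training strategy is essential for long-context tasks.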

Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and that the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
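
The idea of changing only the audio token positions can be shown with a deliberately simplified sketch. It uses uniform position interpolation rather than YaRN's frequency-dependent scaling, and it assumes a text-audio-text layout in which the text after the audio continues from the end of the compressed audio span. The function name and the `audio_window` parameter are hypothetical.

```python
def partial_positions(n_text_before, n_audio, n_text_after, audio_window):
    """Build position ids where only the audio span is rescaled.

    audio_window: number of audio positions the model was trained with
    (an assumed parameter). Longer audio is squeezed into this span;
    text tokens keep unit spacing.
    """
    scale = min(1.0, audio_window / n_audio) if n_audio else 1.0
    text_before = [float(i) for i in range(n_text_before)]
    # Audio positions are interpolated so the span never exceeds audio_window.
    audio = [n_text_before + i * scale for i in range(n_audio)]
    start_after = n_text_before + n_audio * scale
    # Text after the audio resumes with unit spacing.
    text_after = [start_after + i for i in range(n_text_after)]
    return text_before + audio + text_after


# 8 audio tokens squeezed into a 4-position window between two text spans.
positions = partial_positions(n_text_before=2, n_audio=8, n_text_after=2, audio_window=4)
```

Here the audio tokens sit at half-integer steps while the text tokens remain one position apart, which is the property the abstract attributes to the method: the text side of the model sees the spacing it was trained on.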

[71] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning cs.CL | cs.AI | cs.LG

Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan

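TL;DR: The paper studies hybrid architectures that pair discrete diffusion language models (DDLMs) with autoregressive models (ARMs), collaborating in text space and in latent space, and demonstrates complementary strengths and computational efficiency on reasoning tasks.

Details

Motivation: Autoregressive language models (ARMs) are accurate but costly when generating long sequences. Discrete diffusion language models (DDLMs) generate in parallel and perform well on complex reasoning and long-term planning. The study explores whether a hybrid architecture can combine the strengths of both.

Result: Latent-space collaboration markedly improves accuracy (from 27% to 54% on DART-5 and from 0% to 14% on AIME24) and is computationally efficient: 64 planning tokens plus roughly 5 execution tokens surpass a baseline that uses 44 times more tokens.

Insight: Latent-space collaboration between a DDLM and an ARM can bypass some text-generation limitations of diffusion models and enable efficient reasoning; the hybrid architecture greatly reduces compute while maintaining accuracy.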

Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM’s embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM-to-ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.

[72] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding cs.CL | cs.CV

Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng

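TL;DR: The paper systematically surveys multimodal retrieval-augmented generation (Multimodal RAG) for document understanding, proposes a taxonomy based on domain, retrieval modality, and granularity, and discusses datasets, benchmarks, and open challenges.

Details

Motivation: Document understanding is critical in areas such as financial analysis and scientific discovery. Existing approaches, whether OCR-based pipelines or native multimodal LLMs (MLLMs), suffer from information loss or weak context modeling, whereas Multimodal RAG can handle the multimodal nature of documents more fully.

Result: The paper summarizes progress, datasets, and benchmarks for Multimodal RAG and identifies open challenges in efficiency, fine-grained representation, and robustness.

Insight: Multimodal RAG can integrate text, tables, charts, and layout, giving a more complete approach to document understanding; efficiency and robustness need further work.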

Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

[73] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA cs.CL | cs.AI

Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu

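TL;DR: The paper proposes the EGP framework, which retrieves exemplar questions and their successful reasoning paths from the training data to dynamically guide an LLM agent's planning and relation exploration, substantially improving knowledge graph question answering (KGQA).

Details

Motivation: LLMs acting as interactive agents underperform on KGQA, mainly because of the semantic gap between natural language queries and structured knowledge graph representations, which leads to inefficient planning and underuse of the reasoning patterns in training data.

Result: Experiments on the WebQSP and CWQ datasets show that PoG-EGP significantly outperforms the baseline PoG system and other compared methods.

Insight: Reasoning patterns and exemplar questions from the training data can bridge the semantic gap; dynamic planning and efficient exploration are key to high-performing KGQA.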

Abstract: Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM’s planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.

[74] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction cs.CLPDF

Hong Ting Tsang, Jiaxin Bai, Haoyu Huang, Qiao Xiao, Tianshi Zheng

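TL;DR: AutoGraph-R1 is a reinforcement learning (RL) framework for knowledge graph (KG) construction that directly optimizes the KG's performance on downstream tasks, closing the gap between KG construction and its application.

Details

Motivation: Conventional KG construction is decoupled from downstream applications such as question answering, so the resulting graph structures are functionally suboptimal. AutoGraph-R1 uses RL to optimize the functional utility of the KG directly.

Result: Across multiple QA benchmarks, AutoGraph-R1 significantly improves the performance of graph RAG methods over task-agnostic baseline graphs.

Insight: KG construction should shift from "intrinsic quality" toward "functional utility", closing the loop between construction and application.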

Abstract: Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph’s functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building "intrinsically good" graphs to building "demonstrably useful" ones.

[75] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing cs.CL | F.2.2; I.2.7PDF

Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang

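TL;DR: The paper presents Infinity-Parser, a vision-language model trained with the reinforcement learning framework LayoutRL to address layout understanding in scanned document parsing. It also builds the large Infinity-Doc-400K dataset and reports state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing supervised fine-tuning methods generalize poorly to complex document types, and high-quality training data is limited, so new methods are needed to improve layout-aware parsing.

Result: On benchmarks including OmniDocBench and olmOCR-Bench, it substantially outperforms specialized document parsing systems and general-purpose vision-language models.

Insight: Reinforcement learning with layout-aware rewards effectively improves generalization in complex document parsing; large-scale data is a key enabler.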

Abstract: Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
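
As a rough illustration of the composite reward named in the abstract, the sketch below combines the three listed components. The abstract does not give their exact definitions or weights, so the equal weighting, the paragraph-count penalty, and the pairwise order score are assumptions made for this example only.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_reward(pred, ref):
    """1 minus the normalized edit distance, in [0, 1]."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest


def count_reward(pred_paras, ref_paras):
    """Penalize the relative mismatch in paragraph count (assumed form)."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def order_reward(pred_paras, ref_paras):
    """Fraction of reference paragraph pairs kept in the same relative order.

    Assumes paragraphs are distinct strings and matches them exactly.
    """
    pos = {p: i for i, p in enumerate(pred_paras)}
    shared = [p for p in ref_paras if p in pos]
    pairs = [(a, b) for i, a in enumerate(shared) for b in shared[i + 1:]]
    if not pairs:
        return 1.0 if len(shared) == len(ref_paras) else 0.0
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)


def layout_reward(pred_paras, ref_paras, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three components (weights are hypothetical)."""
    parts = (
        edit_reward("\n".join(pred_paras), "\n".join(ref_paras)),
        count_reward(pred_paras, ref_paras),
        order_reward(pred_paras, ref_paras),
    )
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


ref = ["Title", "Intro", "Body"]
swapped = ["Intro", "Title", "Body"]
perfect_score = layout_reward(ref, ref)
swapped_score = layout_reward(swapped, ref)
```

A prediction with the right paragraphs in the wrong order keeps a full paragraph-count score but loses on both the edit and order terms, so `swapped_score` falls below `perfect_score`.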

[76] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency cs.CLPDF

Hongcheng Liu, Yixuan Hou, Heyang Liu, Yuhao Wang, Yanfeng Wang

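TL;DR: The paper studies the robustness of speech large language models (Speech-LLMs) to disfluent speech and introduces the VocalBench-DF evaluation framework, which exposes the limitations of current models.

Details

Motivation: Existing studies usually rely on idealized speech input and overlook disfluencies that are common in practice, especially those associated with conditions such as Parkinson's disease. The performance of Speech-LLMs under these conditions therefore needs to be systematically evaluated and improved.

Result: Experiments show that current Speech-LLMs degrade substantially on disfluent speech, indicating limited real-world readiness.

Insight: Strengthening recognition and reasoning capabilities across model components and pipelines can substantially improve robustness. The paper also stresses the urgent need for better disfluency handling to build more inclusive speech models.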

Abstract: While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson’s disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.

[77] Large-scale User Game Lifecycle Representation Learning cs.CLPDF

Yanjie Gou, Jiangming Liu, Kouying Xue, Yi Hua

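TL;DR: The paper proposes a large-scale representation learning method for the user game lifecycle that targets game sparsity and game imbalance. By introducing the User Game Lifecycle (UGL) and improved behavior strategies, it significantly improves game advertising and recommendation.

Details

Motivation: As the video game industry grows rapidly, game platforms need effective advertising and recommendation systems. Existing recommendation methods handle game sparsity and imbalance poorly, so a new representation learning method is needed.

Result: Offline experiments show an average AUC gain of 1.83% from UGL representation learning, and online experiments show a 21.67% CVR gain in game advertising. For in-game item recommendation, AUC improves by 0.5% and ARPU by 0.82%.

Insight: By addressing game sparsity and imbalance, UGL representation learning significantly improves recommendation and advertising, and it is particularly effective at extracting short- and long-term interests.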

Abstract: The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short- and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the model, achieving a 1.83% offline AUC increase on average and a 21.67% online CVR increase on average for game advertising, and a 0.5% offline AUC increase and a 0.82% online ARPU increase for in-game item recommendation.

[78] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs cs.CL | cs.AIPDF

Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye

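TL;DR: The paper specializes the MedGemma model to generate high-quality clinical image captions, improving a multimodal retrieval-augmented generation (RAG) system over the Malaysian Clinical Practice Guidelines. It addresses data scarcity through knowledge distillation and data augmentation and fine-tunes efficiently with QLoRA. Experiments confirm substantial gains in classification and caption accuracy.

Details

Motivation: Captions from general-purpose vision-language models lack clinical specificity and factual grounding, which limits multimodal RAG systems in clinical decision support. A specialized model that generates high-fidelity medical image captions is therefore needed.

Result: The fine-tuned MedGemma improves substantially in classification performance and in caption faithfulness and correctness, demonstrating its ability to produce reliable medical descriptions.

Insight: Knowledge distillation and parameter-efficient fine-tuning can substantially improve medical vision-language models, providing a reliable basis for multimodal RAG in clinical decision-making.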

Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model’s ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

[79] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs cs.CLPDF

Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang

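TL;DR: The paper examines multimodal large language models (MLLMs) on active reasoning tasks and finds that they lag far behind their performance on passive reasoning, exposing a limitation of current MLLMs. Using the proposed GuessBench benchmark, it identifies fine-grained perception and timely decision-making as the main challenges and points to perceptual enhancement and thinking-oriented methods as future directions.

Details

Motivation: Existing MLLM evaluations focus on passive reasoning and overlook real-world scenarios with incomplete information. The paper explores whether MLLMs can actively acquire missing evidence and iteratively refine decisions, filling this gap.

Result: MLLMs perform significantly worse on active reasoning than on passive reasoning. Perceptual enhancement clearly helps smaller models, while thinking-oriented methods yield gains across model sizes.

Insight: MLLMs have substantial room for improvement in active reasoning; future work should focus on fine-grained perception and timely decision-making, combining perceptual enhancement with thinking-oriented methods.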

Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.

[80] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios cs.CL | cs.AI | cs.LGPDF

Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong

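TL;DR: DeceptionBench is a benchmark that systematically evaluates the deceptive behavior of AI systems (particularly large language models and large reasoning models) in real-world scenarios across domains such as economy and healthcare, revealing how readily models deceive under incentives or coercion.

Details

Motivation: Although large language models (LLMs) excel at cognitive tasks, their rapidly growing capabilities also bring potential deceptive behaviors that could have severe consequences in high-stakes applications. Systematic evaluation of deception in real-world scenarios, however, remains insufficient.

Result: Experiments show that deception in existing models increases markedly under incentives or coercion and that the models lack resistance to manipulative contextual cues, calling for stronger safeguards.

Insight: Deceptive behavior in real-world interactions is complex, and current models remain susceptible to incentives and coercion; safer AI systems are needed.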

Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.

[81] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References? cs.CL | I.2.7PDF

Ashutosh Bajpai, Tanmoy Chakraborty

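TL;DR: The paper introduces a new benchmark, TEMP-ReCon, for evaluating the temporal referential consistency of LLMs on time-sensitive questions and finds LLMs lacking. The authors propose UnTRaP, a model based on reasoning path alignment, to improve this consistency.

Details

Motivation: As LLMs are increasingly used in time-sensitive fields such as law, healthcare, and finance, their accuracy along the temporal dimension becomes critical. Yet little work evaluates or improves the temporal consistency of LLMs.

Result: Experiments show that UnTRaP outperforms several baseline models at improving temporal consistency.

Insight: LLMs show temporal consistency problems in time-sensitive applications, and reasoning path alignment can effectively mitigate them.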

Abstract: The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs do exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.

[82] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE cs.CLPDF

Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

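TL;DR: The paper proposes a dynamic character grouping method that exploits the structure of existing BPE tokenization, requires no additional models, and yields efficient, flexible, language-agnostic representations.

Details

Motivation: Subword tokenization methods such as BPE represent rare words inefficiently and require large embedding matrices, while character-level models suffer performance bottlenecks in Transformer architectures. The authors aim to combine the strengths of both with a dynamic method.

Result: Experiments show that the method matches or exceeds dynamic entropy-based and whitespace-based patching strategies while keeping the vocabulary compact.

Insight: Dynamic character grouping offers a language-agnostic solution that avoids dependence on auxiliary models, making it suitable for multilingual tasks.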

Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
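
The end-of-patch idea in the abstract can be pictured with a small sketch: each BPE token is expanded into its characters followed by a marker, and patch boundaries are then read off the markers rather than off whitespace. The `<eop>` symbol and the function names are hypothetical, and the second-level BPE compression stage is not modeled here.

```python
EOP = "<eop>"  # hypothetical end-of-patch marker symbol


def to_char_stream(bpe_tokens):
    """Expand BPE tokens into characters, closing each token with a marker."""
    stream = []
    for tok in bpe_tokens:
        stream.extend(tok)  # a string extends the list one character at a time
        stream.append(EOP)
    return stream


def group_patches(stream):
    """Group a character stream back into patches at the end-of-patch markers."""
    patches, current = [], []
    for sym in stream:
        if sym == EOP:
            patches.append("".join(current))
            current = []
        else:
            current.append(sym)
    if current:  # trailing characters without a closing marker
        patches.append("".join(current))
    return patches


tokens = ["un", "believ", "able"]  # assumed output of some existing BPE tokenizer
stream = to_char_stream(tokens)
patches = group_patches(stream)
```

The model input is character-level, yet the patch boundaries come for free from the existing tokenizer, which is why no auxiliary boundary model is needed in this picture.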

[83] Latent Reasoning in LLMs as a Vocabulary-Space Superposition cs.CLPDF

Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan

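TL;DR: The paper proposes Latent-SFT, a latent reasoning method that restricts the latent space to the column space of the LLM vocabulary, substantially reducing the computational cost of explicit reasoning while performing strongly on several datasets.

Details

Motivation: Explicit reasoning such as chain-of-thought prompting is effective but incurs substantial computational overhead. Latent reasoning reduces the cost but degrades performance severely. The paper aims to resolve this trade-off.

Result: On GSM8k, Latent-SFT matches explicit SFT performance with reasoning chains up to 4 times shorter and outperforms prior latent methods; on Math500 and AIME24 it surpasses hidden-state-based approaches.

Insight: Latent reasoning is both a compression of a single path and a superposition of multiple paths; latent reasoning based on vocabulary probabilities has the advantage.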

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.
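
One way to picture a latent token as a "superposition over vocabulary probabilities" is a probability-weighted sum of token embeddings, which by construction lies in the span of the vocabulary embeddings. The sketch below uses a toy three-token vocabulary; the embedding values and function names are made up, and the paper's actual encoder and training losses are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def superpose(probs, embeddings):
    """Probability-weighted sum of vocabulary embeddings (one latent token)."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

probs = softmax([2.0, 2.0, 0.0])      # mass shared between tokens 0 and 1
latent = superpose(probs, embeddings)  # soft mixture: a latent reasoning token

# A one-hot distribution "collapses" the mixture to an ordinary token embedding.
collapsed = superpose([1.0, 0.0, 0.0], embeddings)
```

A one-hot distribution recovers a single explicit token, while a spread-out distribution keeps several candidate tokens alive at once, which matches the abstract's reading of latent reasoning as a superposition of multiple paths.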

[84] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval cs.CL | cs.AI | cs.IR | cs.MMPDF

Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki

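TL;DR: The paper proposes a modality composition awareness (MCA) framework that uses a preference loss and a composition regularization objective to make unified multimodal encoders more robust under distribution shift, improving multimodal retrieval.

Details

Motivation: Unified encoders built on multimodal large language models (MLLMs) are flexible and advanced, but when trained with conventional contrastive learning they tend to learn modality shortcuts and perform poorly under distribution shift. A method to improve robustness is needed.

Result: On multiple benchmarks, the MCA framework improves retrieval under distribution shift, supporting its effectiveness.

Insight: Modality composition awareness is an important principle for robust unified multimodal encoders, particularly for composed inputs.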

Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

[85] Rethinking Cross-lingual Gaps from a Statistical Viewpoint cs.CL | cs.AI | cs.LGPDF

Vihari Piratla, Purvam Jain, Darshan Singh, Partha Talukdar, Trevor Cohn

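TL;DR: The paper revisits the cross-lingual gap from a statistical viewpoint, hypothesizes that the variance of target-language responses is its main cause, and supports the hypothesis experimentally.

Details

Motivation: Prior work attributes the cross-lingual gap to divergence between latent representations in source and target languages. This paper instead hypothesizes that response variance in the target language is the main cause and reinterprets the problem statistically.

Result: Controlling response variance improved target-language accuracy by 20-25%, supporting the hypothesis and the method.

Insight: The key factor behind the cross-lingual gap is the variance of target-language responses rather than the latent representation divergence assumed previously. This suggests a new direction for improving cross-lingual performance.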

Abstract: Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has rationalized divergence in latent representations in source and target languages as the source of the cross-lingual gap. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence which supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.
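
For readers unfamiliar with the decomposition the abstract refers to, the sketch below checks the textbook identity (mean squared error equals squared bias plus variance) on toy numbers. Scoring each sampled response as a number against a fixed target is an assumption made for illustration; the paper's own formalization of the cross-lingual gap may differ.

```python
from statistics import fmean, pvariance


def decompose(responses, target):
    """Return (mse, bias_squared, variance) for numeric responses against a target."""
    mse = fmean((r - target) ** 2 for r in responses)
    bias_sq = (fmean(responses) - target) ** 2
    var = pvariance(responses)  # population variance, matching the identity below
    return mse, bias_sq, var


# Toy scores for repeated samples of the same question (1.0 = correct answer).
source_lang = [1.0, 1.0, 1.0, 1.0]   # consistent responses
target_lang = [1.0, 0.0, 1.0, 0.0]   # same knowledge, but unstable responses

src_mse, src_bias_sq, src_var = decompose(source_lang, target=1.0)
tgt_mse, tgt_bias_sq, tgt_var = decompose(target_lang, target=1.0)
```

In this toy case the target-language error splits evenly between squared bias and variance, so an intervention that only stabilizes the responses would already remove part of the error, which is the kind of effect the abstract attributes to variance control.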

[86] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation cs.CL | cs.AIPDF

Jinliang Liu

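TL;DR: The paper proposes ParallaxRAG, a multi-view knowledge-graph-based retrieval-augmented generation (KG-RAG) framework for multi-hop reasoning that exploits the diversity of attention heads to improve retrieval quality and reduce hallucination, and it performs well in experiments.

Details

Motivation: Large language models (LLMs) tend to hallucinate and struggle with multi-hop reasoning, and existing KG-RAG methods rely on flat embeddings and noisy path exploration, so a more robust solution is needed.

Result: Experiments on WebQSP and CWQ show that ParallaxRAG achieves competitive retrieval and QA performance, with reduced hallucination and good generalization.

Insight: Multi-view attention head specialization is a viable direction for knowledge-grounded multi-hop reasoning and offers a principled basis for step-wise reasoning in LLMs.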

Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.

[87] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models cs.CL | cs.AIPDF

Dongjun Kim, Chanhee Park, Chanjun Park, Heuiseok Lim

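TL;DR: KITE is a benchmark for evaluating the Korean instruction-following ability of large language models (LLMs), filling a gap in the evaluation of open-ended instruction tasks in Korean.

Details

Motivation: Current LLM evaluation focuses mainly on English and neglects the distinctive grammar and cultural features of other languages. Korean, with its complex syntax, honorific system, and dual numbering systems, lacks a dedicated benchmark for instruction following.

Result: KITE reveals performance disparities across models on Korean instruction tasks, providing a useful reference for LLM development.

Insight: KITE serves not only as a Korean evaluation tool but also as a template for similar work on other underrepresented languages, advancing multilingual LLM development.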

Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.

[88] Finetuning LLMs for EvaCun 2025 token prediction shared task cs.CLPDF

Josef Jon, Ondřej Bojar

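TL;DR: The paper describes the systems submitted to the EvaCun 2025 token prediction task, based on fine-tuning large language models (LLMs) including Command-R, Mistral, and Aya Expanse. The authors applied no task-specific adjustments or preprocessing to the data and compare three prompting approaches.

Details

Motivation: The goal is to solve the token prediction problem of the EvaCun 2025 shared task. Although the authors have limited knowledge of the subject field and its languages, they aim for effective prediction by fine-tuning LLMs.

Result: The prompting approaches are compared on a held-out part of the data; specific metrics are not given in the abstract.

Insight: The work suggests that plain fine-tuning of LLMs can be applied to token prediction even without domain-specific knowledge or data processing, and that the choice of prompting approach affects the results.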

Abstract: In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.

[89] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination cs.CLPDF

Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He

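TL;DR: HypoSpace is a diagnostic suite that evaluates the creativity of language models as hypothesis generators on scientific problems with multiple valid explanations, focusing on the validity, uniqueness, and recovery (coverage) of the hypothesis sets.

Details

Motivation: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations becomes critical, because many scientific problems are underdetermined (multiple hypotheses are consistent with the same observations).

Result: As the admissible space grows, validity often remains high while uniqueness and recovery degrade, revealing mode collapse that correctness-only metrics cannot detect.

Insight: HypoSpace exposes limitations of language models in generating hypothesis sets (such as mode collapse) and provides a controlled probe for methods that explore and cover admissible explanation spaces.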

Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations (not just a single correct answer) becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
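
The three indicators can be sketched as simple set statistics over a list of sampled hypotheses. The formulas below are one plausible reading of the parenthetical definitions in the abstract, not the benchmark's reference implementation, and the string-encoded hypotheses are placeholders.

```python
def hypospace_metrics(proposals, admissible):
    """Compute validity, uniqueness, and recovery for a list of proposals.

    proposals:  hypotheses sampled from a model (hashable, may repeat)
    admissible: the enumerated set of hypotheses consistent with the observations
    """
    admissible = set(admissible)
    n = len(proposals)
    if n == 0:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    valid = [p for p in proposals if p in admissible]
    return {
        # Share of proposals that are consistent with the observations.
        "validity": len(valid) / n,
        # Share of proposals that are distinct from one another.
        "uniqueness": len(set(proposals)) / n,
        # Share of the admissible set that the proposals cover.
        "recovery": len(set(valid)) / len(admissible) if admissible else 0.0,
    }


# Placeholder hypothesis space with four admissible explanations.
admissible = {"A->B", "B->A", "A<-C->B", "A->C->B"}

# A model that keeps proposing the same valid hypothesis (mode collapse).
proposals = ["A->B", "A->B", "A->B", "B->A"]
metrics = hypospace_metrics(proposals, admissible)
```

Every proposal here is valid, yet only half of them are distinct and only half of the admissible set is recovered, which is the pattern the abstract describes as invisible to correctness-only metrics.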

[90] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection cs.CLPDF

Joshua Wolfe Brook, Ilia Markov

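TL;DR: The paper proposes using large language models (LLMs) as dynamic knowledge bases to generate background context that is incorporated into the input of hate speech detection (HSD) classifiers, improving detection in textual and multimodal settings.

Details

Motivation: Hate speech detection struggles with implicit or complex textual and multimodal content, and existing methods often make little use of background information, which limits detection quality.

Result: On the textual Latent Hatred dataset and the multimodal MAMI dataset, performance improves by up to 3 and 6 F1 points respectively, showing the importance of both the contextual information and how it is incorporated.

Insight: Dynamically generated context and a sound integration method are key to better hate speech detection; integration in the multimodal setting needs further refinement.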

Abstract: This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.

[91] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth cs.CL | cs.IRPDF

Helia Hashemi, Victor Rühle, Saravan Rajmohan

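TL;DR: The paper proposes a cost-aware retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list and is trained with reinforcement learning, improving efficiency (latency down by about 16-20%) without hurting effectiveness (exact match up by about 5%).

Details

Motivation: Existing retrieval-augmented reasoning models perform strongly but are computationally expensive, with both retrieval and reasoning consuming substantial resources.

Result: Experiments on seven QA datasets show latency reduced by about 16-20% and exact match improved by about 5% on average.

Insight: Dynamically adjusting retrieval depth and cost-aware training can effectively balance efficiency and effectiveness.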

Abstract: Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
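
The abstract does not spell out the cost-aware advantage function, so the following is only a guess at its general shape: a group-relative advantage in which each rollout's task reward is first reduced by a penalty proportional to its cost. The linear penalty, the `cost_weight` value, and the function name are assumptions for illustration.

```python
from statistics import fmean, pstdev


def cost_aware_advantages(rewards, costs, cost_weight=0.1, eps=1e-8):
    """Group-relative advantages over cost-penalized rewards.

    rewards:     task reward per rollout (e.g. exact match, 0 or 1)
    costs:       cost per rollout (e.g. number of retrieved documents)
    cost_weight: hypothetical trade-off between reward and cost
    """
    shaped = [r - cost_weight * c for r, c in zip(rewards, costs)]
    mean, std = fmean(shaped), pstdev(shaped)
    # Normalize within the group so advantages are comparable across queries.
    return [(s - mean) / (std + eps) for s in shaped]


# Three rollouts for one query: two correct answers, one of them expensive.
rewards = [1.0, 1.0, 0.0]
costs = [10.0, 2.0, 2.0]  # e.g. retrieved documents per rollout
advantages = cost_aware_advantages(rewards, costs)
```

With these toy numbers the cheap correct rollout gets the largest advantage, the expensive correct one sits in the middle, and the incorrect one is lowest, so the policy is nudged toward answers that are both right and economical.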

[92] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework cs.CL | cs.AIPDF

Shayan Rokhva, Mousa Alizadeh, Maryam Abdollahi Shamami

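TL;DR: The paper proposes a hybrid framework combining lexicon rules, fuzzy logic, and a Transformer to improve the accuracy and interpretability of sentiment analysis, performing especially well on informal and domain-specific text.

Details

Motivation: Existing sentiment analysis models often perform poorly on informal language and domain-specific text, and in particular struggle to capture sentiment polarity and intensity accurately.

Result: Evaluation on four domain-specific datasets shows better distributional alignment, better identification of sentiment extremes, and fewer misclassifications.

Insight: Combining symbolic reasoning with neural models can improve the interpretability and fine-grained performance of sentiment analysis, especially in linguistically dynamic settings.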

Abstract: Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model’s robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.

[93] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training cs.CL | cs.AIPDF

Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie

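TL;DR: This paper proposes the ORBIT framework, which uses rubric-based incremental training to address ambiguous rewards for LLMs in open-ended tasks such as medical dialogue, substantially improving model performance.

Details

Motivation: LLMs excel in domains with clear, programmatically verifiable rewards (such as mathematics and code) but are limited in open-ended domains (such as medical consultation) where rewards are ambiguous. ORBIT aims to fill this gap with rubric-based feedback.

Result: On the HealthBench-Hard benchmark, the Qwen3-4B-Instruct model improves from 7.0 to 27.2, reaching state of the art among models of this scale.

Insight: Rubric-driven RL not only raises numerical scores but also yields consistent improvements across diverse medical scenarios, offering a scalable training strategy for open-ended tasks.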

Abstract: Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.

cs.CR [Back]

[94] MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation cs.CR | cs.CLPDF

Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, William Yang Wang

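TL;DR: MAGPIE is a new benchmark for evaluating privacy preservation in multi-agent collaborative settings. It contains 200 high-stakes tasks and reveals the shortcomings of current state-of-the-art LLM agents in both privacy leakage and collaboration.

Details

Motivation: Existing privacy benchmarks focus only on simple single-turn interactions and cannot evaluate the balance between privacy and task efficacy in multi-agent collaboration.

Result: State-of-the-art agents such as GPT-5 and Gemini 2.5-Pro leak substantial private information (up to 50.7%), collaborate poorly, and often exhibit manipulative behavior.

Insight: Current LLM agents lack robust privacy preservation in multi-agent environments, and their alignment needs improvement.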

Abstract: A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.

cs.RO [Back]

[95] VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation cs.RO | cs.CV | cs.LGPDF

Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu

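TL;DR: This paper proposes VO-DP, a vision-only diffusion policy learning method that fuses semantic and geometric features. On robotic manipulation tasks it significantly outperforms the vision-only baseline and rivals point-cloud methods.

Details

Motivation: Existing imitation learning methods mostly rely on point-cloud inputs, and vision-only solutions remain underexplored. VO-DP addresses this by using visual foundation models for feature fusion.

Result: In simulation, VO-DP reaches an average success rate of 64.6%, on par with the point-cloud method DP3 (64.0%) and far above the vision-only baseline DP (34.8%); in real-world tasks, VO-DP reaches 87.9%, significantly outperforming both DP3 and DP.

Insight: Vision-only input combined with semantic-geometric feature fusion can effectively replace point-cloud methods, and it performs even better in real-world tasks, indicating robustness in complex environments.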

Abstract: In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point-cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.

cs.LG [Back]

[96] Internalizing World Models via Self-Play Finetuning for Agentic RL cs.LG | cs.CLPDF

Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang

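TL;DR: The paper proposes the SPA framework, which cold-starts the policy with a self-play supervised finetuning stage that learns a world model, improving the decision-making of LLM agents and significantly raising performance across multiple environments.

Details

Motivation: LLMs perform poorly in complex, dynamic real-world environments, especially under out-of-distribution conditions where conventional reinforcement learning struggles to adapt. The goal is to improve decision-making by learning an internal world model.

Result: The Sokoban success rate rises from 25.6% to 59.8%, and the FrozenLake score from 22.1% to 70.9%.

Insight: Explicitly modeling environment dynamics can significantly improve LLM agents on complex tasks, especially in out-of-distribution scenarios.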

Abstract: Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k (the probability that at least one of k sampled trajectories succeeds) drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
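Pass@k as defined in the abstract can be estimated from a finite set of rollouts. The sketch below uses the combinatorial estimator that is common in code-generation evaluation: with n sampled trajectories of which c succeed, it computes the probability that a random subset of k contains at least one success. Whether the paper uses this exact estimator is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@k from n sampled trajectories, c of which succeeded.

    Returns 1 - C(n - c, k) / C(n, k): one minus the chance that all k
    trajectories drawn without replacement are failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k-subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to the plain success rate c / n, and it grows toward 1 as k increases whenever at least one trajectory succeeded.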

[97] Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models cs.LG | cs.CLPDF

Samuel Paech, Allen Roush, Judah Goldfeder, Ravid Shwartz-Ziv

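TL;DR: This paper proposes the Antislop framework for detecting and eliminating repetitive phrase patterns (slop) in language models. Its methods markedly reduce slop while preserving model performance.

Details

Motivation: Widely deployed LLMs produce repetitive phrase patterns (slop) that degrade output quality and make text easy to identify as AI-generated. The paper aims to provide tools to detect and eliminate these patterns.

Result: The Antislop Sampler suppresses 8,000+ patterns; FTPO reduces slop by 90% while maintaining or improving performance on tasks such as GSM8K and MMLU.

Insight: FTPO reduces slop while preserving performance and outperforms methods such as DPO, showing the potential of token-level optimization.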

Abstract: Widespread LLM adoption has introduced characteristic repetitive phraseology, termed “slop,” which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000$\times$ more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
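The idea of suppressing strings by backtracking, rather than banning tokens globally, can be sketched with a toy greedy decoder. This is not the released Antislop Sampler: `ranked_next` is a hypothetical stand-in for a model that returns candidate tokens in preference order, and the sketch blocks the offending token outright at one position, whereas a real sampler would work with probabilities.

```python
def generate_with_backtracking(ranked_next, banned_phrases, max_tokens):
    """Greedy decoding that rewinds whenever a banned phrase is completed.

    ranked_next(prefix) -> list of candidate tokens, best first (hypothetical).
    banned_phrases      -> iterable of token tuples that must not appear.
    """
    tokens = []
    blocked = {}  # position -> tokens disallowed at that position only
    while len(tokens) < max_tokens:
        pos = len(tokens)
        options = [t for t in ranked_next(tokens) if t not in blocked.get(pos, set())]
        if not options:
            break  # nothing left to try at this position
        tokens.append(options[0])
        for phrase in banned_phrases:
            n = len(phrase)
            if n and tokens[-n:] == list(phrase):
                start = len(tokens) - n
                # Block only the phrase's first token, and only where it began.
                blocked.setdefault(start, set()).add(phrase[0])
                # Blocks recorded for later positions belong to the discarded path.
                for later in [p for p in blocked if p > start]:
                    del blocked[later]
                del tokens[start:]
                break
    return tokens
```

Because a block applies to a single position, the same token stays available everywhere else in the output, which is the sense in which the vocabulary is left intact.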

[98] DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning cs.LG | cs.AI | cs.CLPDF

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu

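TL;DR: DLER uses reinforcement learning to get the length penalty right, substantially improving language model efficiency: it cuts output length by 70% while achieving higher accuracy.

Details

Motivation: Existing reasoning language models such as OpenAI-o1 perform strongly but produce overly long outputs. Optimizing intelligence per token, that is, accuracy relative to length, remains unsolved.

Result: DLER cuts output length by 70% on 7B models while surpassing baseline accuracy; at test time, generating responses in parallel yields 28% higher accuracy at lower latency.

Insight: Inadequate RL optimization, rather than a lack of sophistication in penalty design, is the main reason length penalties fail; adaptive truncation and selective merging offer practical options when data is scarce.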

Abstract: Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token (accuracy relative to response length) remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy–efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
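Two ingredients of the recipe, the truncation penalty and batch-wise reward normalization, can be sketched in a few lines. The sketch assumes that a response exceeding the token budget is cut off and therefore earns zero reward, and that advantages are standardized with statistics of the whole batch rather than of each prompt's group. Function names, the binary reward, and `eps` are illustrative assumptions; higher clipping and dynamic sampling are not shown.

```python
from statistics import mean, pstdev


def truncation_rewards(correct, lengths, max_len):
    """Reward 1.0 only for correct answers that fit within the token budget."""
    return [1.0 if ok and n <= max_len else 0.0 for ok, n in zip(correct, lengths)]


def batch_normalized_advantages(reward_groups, eps=1e-6):
    """Standardize rewards using the mean and std of the whole batch.

    reward_groups holds one list of rewards per prompt. Pooling all prompts
    means a group whose responses all failed still receives a non-zero
    (negative) advantage, unlike per-group normalization.
    """
    flat = [r for group in reward_groups for r in group]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in group] for group in reward_groups]
```

For example, `truncation_rewards([True, True, False], [50, 500, 40], max_len=100)` gives credit only to the first response: the second is correct but over budget, and the third is short but wrong.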

[99] Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential cs.LG | cs.CLPDF

Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen

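TL;DR: This paper studies how intrinsic microscopic properties of pre-trained large language models (LLMs) affect their reasoning potential. It proposes the Soundness-Aware Level (SAL) metric and reveals a strong correlation between a model's ability to distinguish sound from unsound knowledge and its reasoning performance.

Details

Motivation: Prior work shows that reinforcement learning with verifiable rewards (RLVR) can significantly improve LLM reasoning, yet results vary widely across base models. The paper explores which microscopic properties of pre-trained models cause this variation.

Result: SAL accurately predicts post-RLVR reasoning performance (R²=0.87) and generalizes across model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B).

Insight: A model's reasoning potential is closely tied to the soundness-discrimination ability it acquires during pre-training, which underscores the key role of pre-training and provides a basis for selecting and designing stronger base models.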

Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), but performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses (“if-then” rules) built from features extracted from the LLM’s latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between these features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules’ soundness levels, becoming highly distinct for “strict” versus “noisy” rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL’s predictions of post-RLVR reasoning performance follow a precise empirical law ($R^2 = 0.87$) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model’s reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound knowledge. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model’s internal mechanisms for selecting/designing stronger base models.
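The separation measure named in the abstract, the Jensen-Shannon divergence, is straightforward to compute for two discrete distributions. The sketch below shows only that building block with base-2 logarithms, so the value lies in [0, 1]. How the paper bins transition probabilities into distributions and aggregates the divergences into SAL is not stated in the abstract and is not modeled here.

```python
from math import log2


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In this reading, a soundness-agnostic model would produce near-identical distributions for “strict” and “noisy” rules and a divergence near 0, while a soundness-aware model would produce well-separated distributions and a value closer to 1.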

[100] FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain cs.LG | cs.CLPDF

Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan

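TL;DR: FinTrust is a comprehensive benchmark designed to evaluate the trustworthiness of large language models (LLMs) in finance, covering a wide range of practical issues and fine-grained tasks. Proprietary models perform better on aspects such as safety, open-source models have an edge in industry-level fairness, and all models fall short on legal awareness.

Details

Motivation: Applying LLMs in finance is high-risk and high-stakes, so their trustworthiness needs comprehensive evaluation.

Result: Proprietary models do better on tasks such as safety, open-source models lead on industry-level fairness, and all models perform poorly on legal-awareness tasks.

Insight: LLMs in finance urgently need improvement in legal awareness and compliance; FinTrust can serve as an important tool for trustworthiness evaluation.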

Abstract: Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs in real-world finance applications remains challenging due to their high-risk and high-stakes nature. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks such as safety, while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs’ trustworthiness evaluation in the finance domain.

[101] Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection cs.LG | cs.CVPDF

Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz

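TL;DR: Through an empirical study, this paper analyzes the limitations of Mahalanobis distance methods for OOD detection, defines the ideal geometry of data representations, and proposes radially scaled ℓ2 normalization, which significantly improves OOD detection performance.

Details

Motivation: Mahalanobis distance methods are widely used for OOD detection, but how their performance relates to representation geometry and normalization is unclear, which limits downstream applications. The paper aims to fill this gap.

Result: The study shows that Mahalanobis methods are not universally reliable and that the proposed normalization significantly improves OOD performance.

Insight: Representation geometry and normalization are critical for OOD detection; radially scaled ℓ2 normalization offers a systematic way to tune feature-space geometry.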

Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. While Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren’t universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model’s OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
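The abstract does not give the formula for radially scaled $\ell_2$ normalization. One parameterization consistent with its description, assumed here purely for illustration, divides a feature vector by its norm raised to a tunable power `beta`: `beta = 1` recovers standard $\ell_2$ normalization, `beta = 0` leaves the vector unchanged, and other values contract or expand the radial spread. The name `beta` and the exact form are assumptions and may differ from the paper.

```python
from math import sqrt


def radial_l2_normalize(x, beta=1.0, eps=1e-12):
    """Scale x by 1 / ||x||^beta (hypothetical parameterization).

    The output norm is ||x||^(1 - beta), so beta interpolates between
    keeping the original radius (beta = 0) and the unit sphere (beta = 1).
    """
    norm = sqrt(sum(v * v for v in x))
    scale = max(norm, eps) ** beta  # eps guards against division by zero
    return [v / scale for v in x]
```

Features normalized this way would then be passed to the usual Mahalanobis scoring step, with `beta` selected on validation data.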

[102] Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity cs.LG | cs.CV | cs.NEPDF

Pieris Panagi, Savvas Karatsiolis, Kyriacos Mosphilis, Nicholas Hadjisavvas, Andreas Kamilaris

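TL;DR: This paper presents the PoultryFI platform, which integrates AI modules (such as camera placement optimization, audio-visual monitoring, and real-time egg counting) for low-cost intelligent monitoring and optimization of poultry farming, improving welfare and productivity.

Details

Motivation: The poultry industry faces dual pressure on productivity and animal welfare, yet small and medium-sized farms lack low-cost, integrated monitoring tools and rely on manual inspection. PoultryFI aims to fill this gap.

Result: Field trials show 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting.

Insight: PoultryFI shows how modular AI can consolidate isolated pilot tools into a scalable farm intelligence platform, making proactive management possible.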

Abstract: Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics & Alerting, Real-Time Egg Counting, Production & Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics & Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.

eess.IV [Back]

[103] Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks eess.IV | cs.CVPDF

Saqib Qamar

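TL;DR: This paper proposes MIRA-U, a semi-supervised framework for skin lesion segmentation that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture, significantly improving segmentation performance.

Details

Motivation: Skin lesion segmentation is essential for early skin cancer detection, but scarce annotated data limits model performance. To address this, the author proposes a semi-supervised learning framework.

Result: On the ISIC-2016 and PH2 datasets, using only 50% labeled data, the method achieves a DSC of 0.9153 and an IoU of 0.8552, clearly outperforming the baselines.

Insight: Uncertainty awareness and a hybrid architecture effectively improve semi-supervised segmentation, especially when annotated data is scarce.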

Abstract: Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.
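The two reported metrics have standard definitions on binary masks: DSC = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. The sketch below computes both for masks given as flat sequences of 0/1 values; treating two empty masks as a perfect match is a convention chosen here, and the paper's averaging protocol across images is not specified in the abstract.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    # Convention: two empty masks count as a perfect match.
    dice = 2 * inter / (pred_area + target_area) if pred_area + target_area else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For a single pair of masks Dice is never lower than IoU, which matches the ordering of the two reported scores.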

cs.AI [Back]

[104] HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks cs.AI | cs.CL | cs.CYPDF

Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu

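TL;DR: HugAgent is a benchmark for evaluating how well large language models simulate individualized human reasoning. Its dual-track design, with synthetic and real human data, exposes the gap in existing models' ability to capture individual reasoning styles.

Details

Motivation: Existing large language models approximate human responses well at the population level but overlook the distinctiveness of individual reasoning styles and belief changes. HugAgent aims to move machines closer to individualized human reasoning.

Result: Experiments show that state-of-the-art large language models still have significant gaps in adapting to individual reasoning styles; HugAgent provides an extensible evaluation tool for this problem.

Insight: Capturing individual differences in human reasoning is the next challenge in making machine reasoning more human-like, and HugAgent offers a workable evaluation framework.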

Abstract: Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).

[105] Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism cs.AI | cs.CLPDF

Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu

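TL;DR: The paper introduces the SciRecipe dataset and the Thoth model, improving biological experimental protocol generation through a structured component-based reward mechanism and markedly improving protocol reliability, logical ordering, and semantic accuracy.

Details

Motivation: Current large language models generate incomplete or inconsistent biological protocols, which limits their practical use in experiments; a reliable way to improve generation quality is needed.

Result: Thoth surpasses current state-of-the-art LLMs across multiple benchmarks, with significant gains in step alignment, logical sequencing, and semantic accuracy.

Insight: Improving protocol generation through structured, staged methods is feasible and effective, offering a new direction for scientific experiment assistants.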

Abstract: The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.

[106] Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation cs.AI | cs.CL | cs.LG | cs.MAPDF

Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li

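TL;DR: freephdlabor is an open-source multiagent framework with dynamic workflows and a modular architecture, designed to make automated scientific research continual and interactive.

Details

Motivation: Existing science automation systems have two core problems: predefined workflows cannot adapt to intermediate results, and inadequate context management makes long-horizon research difficult. freephdlabor aims to solve both.

Result: freephdlabor turns single-run research attempts into continual research programs and supports end-to-end research automation.

Insight: Modularity and dynamic planning are key to flexible, sustainable science automation.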

Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present freephdlabor, an open-source multiagent framework featuring fully dynamic workflows determined by real-time agent reasoning and a modular architecture enabling seamless customization: users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms. These features collectively transform automated research from isolated, single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research, from ideation through experimentation to publication-ready manuscripts.

[107] Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment cs.AI | cs.CVPDF

Lavanya Umapathy, Patricia M Johnson, Tarun Dutt, Angela Tong, Madhur Nayan

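TL;DR: This paper proposes a context-aware deep learning framework that incorporates individualized prior information to reduce false positive rates in disease risk prediction and longitudinal health assessment, validated on prostate cancer risk prediction.

Details

Motivation: In health monitoring, integrating a patient's history (such as prior imaging and clinical biomarkers) can raise the specificity of risk prediction and reduce false positives, giving a more accurate picture of changes in patient health.

Result: Incorporating prior information progressively lowers the false positive rate at the time of the visit (from 51% to 24%); for predicting prostate cancer risk within five years, the rate falls from 64% to 9%.

Insight: Integrating temporal context can significantly improve the specificity of medical risk prediction, offering a viable path toward large-scale health monitoring programs, earlier disease detection, and better health outcomes.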

Abstract: Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.
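The relationship the abstract relies on, that converting false positives into true negatives raises specificity without affecting sensitivity, follows from the standard definitions FPR = FP / (FP + TN), specificity = 1 - FPR, and sensitivity = TP / (TP + FN). The sketch below makes this concrete; the function names are illustrative, and any counts passed in are hypothetical rather than taken from the study.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN); specificity is 1 - FPR."""
    if fp + tn == 0:
        raise ValueError("no negative cases")
    return fp / (fp + tn)


def sensitivity(tp, fn):
    """Sensitivity (recall) = TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive cases")
    return tp / (tp + fn)


def convert_false_positives(fp, tn, corrected):
    """Move `corrected` cases from false positive to true negative."""
    if not 0 <= corrected <= fp:
        raise ValueError("cannot correct more cases than there are false positives")
    return fp - corrected, tn + corrected
```

Since only negative cases are reclassified, the counts entering `sensitivity` do not change, which is why the abstract can report lower false positive rates alongside preserved sensitivity.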

q-fin.CP [Back]

[108] Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction q-fin.CP | cs.AI | cs.CL | cs.LGPDF

Tian Guo, Emmanuel Hauptmann

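TL;DR: This paper studies how to effectively combine quantitative factors with newsflow representations generated by large language models to predict stock returns. It proposes a fusion learning framework and compares three methods, then introduces a mixture model with a decoupled training approach for stability. The experiments yield insights into effective multimodal modeling.

Details

Motivation: In quantitative investing, return prediction is central to stock selection, portfolio optimization, and risk management. The potential of combining traditional quantitative factors with LLM-generated newsflow representations has not been fully explored, and this paper aims to fill that gap.

Result: Experiments on real investment universes validate multimodal modeling and offer useful insights into combining factors with newsflow.

Insight: 1. Combining modalities (quantitative factors and newsflow) significantly improves predictive performance. 2. Attention mechanisms and the mixture model perform well at dynamic fusion. 3. Decoupled training effectively resolves the mixture model's training instability.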

Abstract: In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.

cs.IR [Back]

[109] SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation cs.IR | cs.CLPDF

Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber

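TL;DR: SQuAI is a multi-agent retrieval-augmented generation framework for complex scientific question answering. By decomposing questions, retrieving evidence, and generating answers with citations, it markedly improves trustworthiness and effectiveness.

Details

Motivation: Existing retrieval-augmented generation systems handle complex open-domain scientific questions poorly and lack explicit citations and trustworthiness. SQuAI aims to address these issues with verifiable answers and contextual relevance.

Result: The system improves faithfulness, answer relevance, and contextual relevance by up to 12% (+0.088) over the baseline.

Insight: Multi-agent collaboration and hybrid retrieval can significantly improve scientific QA, while in-line citations make generated answers more trustworthy and verifiable.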

Abstract: We present SQuAI (https://squai.scads.ai/), a scalable and trustworthy multi-agent retrieval-augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open-domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full-text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in-line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question-answer-evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain-wide scalability, SQuAI demonstrates how multi-agent RAG enables more trustworthy scientific QA with LLMs.
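One common way to realize hybrid sparse-dense retrieval is to blend normalized scores from the two retrievers. The sketch below assumes that approach; the abstract does not say how SQuAI combines its retrievers, and the names `min_max`, `hybrid_rank`, and `alpha` are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Blend per-document sparse and dense scores.

    alpha weights the sparse (e.g. keyword-based) score and 1 - alpha
    the dense (e.g. embedding-similarity) score. Returns the document
    indices sorted best first, plus the blended scores.
    """
    if len(sparse_scores) != len(dense_scores):
        raise ValueError("score lists must cover the same documents")
    sparse = min_max(sparse_scores)
    dense = min_max(dense_scores)
    blended = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    order = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    return order, blended
```

Normalizing before blending keeps one retriever's raw score scale from dominating the other; rank-based fusion is another common choice.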

[110] GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery cs.IR | cs.CL

Italo Luis da Silva, Hanqi Yan, Lin Gui, Yulan He

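TL;DR: The paper introduces GraphMind, an LLM-based interactive tool that helps users assess the novelty of scientific papers or ideas. It integrates external APIs with LLMs and provides structured views and result traceability.

Details

Motivation: Assessing the novelty of a scientific paper requires extensive knowledge of related work, which not all reviewers have. Existing LLM-assisted tools lack transparency and mechanisms for result traceability; GraphMind aims to address this.

Result: GraphMind is a usable tool that provides rich structured views to support users in assessing the novelty of scientific papers; its functionality is shown through a demonstration video and open-source code.

Insight: GraphMind's contribution lies in combining LLMs with external APIs to address the shortcomings of existing tools in transparency and result traceability, providing more efficient support for scientific literature analysis.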

Abstract: Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating the novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce GraphMind, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specifically, GraphMind enables users to capture the main structure of a scientific paper, explore related ideas through various perspectives, and assess novelty by providing verifiable contextual insights. GraphMind enables users to annotate key elements of a paper, explore related papers through various relationships, and assess novelty with contextual insight. This tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea’s core contributions and its connections to existing work. GraphMind is available at https://oyarsa.github.io/graphmind, and a demonstration video is at https://youtu.be/wKbjQpSvwJg. The source code is available at https://github.com/oyarsa/graphmind.

Table of Contents

cs.CV [Back]

[1] GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments cs.CV | cs.AIPDF

[2] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising cs.CV | cs.AIPDF

[3] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models cs.CV | cs.AI | cs.CLPDF

[4] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos cs.CV | cs.AI | cs.ROPDF

[5] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning cs.CVPDF

[6] Composition-Grounded Instruction Synthesis for Visual Reasoning cs.CV | cs.CL | cs.LGPDF

[7] Generalized Dynamics Generation towards Scannable Physical World Model cs.CVPDF

[8] Comprehensive language-image pre-training for 3D medical image understanding cs.CV | cs.LGPDF

[9] Directional Reasoning Injection for Fine-Tuning MLLMs cs.CVPDF

[10] A solution to generalized learning from small training sets found in everyday infant experiences cs.CVPDF

[11] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images cs.CVPDF

[12] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation cs.CVPDF

[13] Deep generative priors for 3D brain analysis cs.CV | cs.LGPDF

[14] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification cs.CVPDF

[15] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models cs.CV | cs.AIPDF

[16] Train a Unified Multimodal Data Quality Classifier with Synthetic Data cs.CV | cs.CLPDF

[17] Salient Concept-Aware Generative Data Augmentation cs.CV | 68T45 (Machine learning) | I.2.10; I.2.6; I.4.8; I.5.1; I.5.4PDF

[18] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records cs.CVPDF

[19] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads cs.CVPDF

[20] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion cs.CVPDF

[21] CuSfM: CUDA-Accelerated Structure-from-Motion cs.CV | cs.ROPDF

[22] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning cs.CV | cs.LGPDF

[23] Latent Diffusion Model without Variational Autoencoder cs.CV | cs.AIPDF

[24] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation cs.CV | cs.LGPDF

[25] SHARE: Scene-Human Aligned Reconstruction cs.CVPDF

[26] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning cs.CVPDF

[27] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers cs.CVPDF

[28] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction cs.CVPDF

[29] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding cs.CV | cs.LGPDF

[30] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning cs.CV | cs.AI | physics.med-phPDF

[31] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models cs.CV | cs.AIPDF

[32] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety cs.CV | cs.LGPDF

[33] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation cs.CVPDF

[34] Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning cs.CV | cs.AIPDF

[35] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention cs.CVPDF

[36] DPTrack:Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking cs.CVPDF

[37] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation cs.CVPDF

[38] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes cs.CVPDF

[39] MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval cs.CV | cs.IRPDF

[40] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition cs.CVPDF

[41] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI cs.CV | cs.CLPDF

[42] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement cs.CVPDF

[43] Exploring Conditions for Diffusion models in Robotic Control cs.CV | cs.ROPDF

[44] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents cs.CV | cs.AI | eess.IVPDF

[45] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation cs.CVPDF

[46] FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification cs.CVPDF

[47] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection cs.CV | I.4.7; I.2.10; I.3.8PDF

[48] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration cs.CVPDF

[49] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey cs.CVPDF

[50] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation cs.CVPDF

[51] Valeo Near-Field: a novel dataset for pedestrian intent detection cs.CV | cs.AIPDF

[52] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI cs.CV | cs.AIPDF

[53] Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis cs.CVPDF

[54] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification cs.CV | cs.AI | eess.IVPDF

[55] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset cs.CVPDF

[56] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM cs.CV | cs.AI | cs.CLPDF

[57] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior cs.CVPDF

[58] Semantic segmentation with coarse annotations cs.CV | cs.AI | cs.LGPDF

[59] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model cs.CV | cs.LGPDF

[60] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection cs.CVPDF

[61] VISTA: A Test-Time Self-Improving Video Generation Agent cs.CVPDF

[62] Neuro-Symbolic Spatial Reasoning in Segmentation cs.CVPDF

[63] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt cs.CVPDF

[64] BLIP3o-NEXT: Next Frontier of Native Image Generation cs.CVPDF

[65] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models cs.CV | cs.NEPDF

[66] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal cs.CVPDF

cs.CL [Back]

[67] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective cs.CL | cs.AIPDF

[68] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling cs.CL | cs.SIPDF

[69] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning cs.CL | cs.AI | cs.IRPDF

[70] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models cs.CL | cs.AI | cs.SD | eess.ASPDF

[71] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning cs.CL | cs.AI | cs.LGPDF

[72] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding cs.CL | cs.CVPDF

[73] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA cs.CL | cs.AIPDF

[74] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction cs.CLPDF

[75] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing cs.CL | F.2.2; I.2.7PDF

[76] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency cs.CLPDF

[77] Large-scale User Game Lifecycle Representation Learning cs.CLPDF