cs.CV

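[1] Transformed Multi-view 3D Shape Features with Contrastive Learning cs.CV

Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti

TL;DR: This paper pairs Vision Transformers (ViTs) with contrastive learning objectives to learn multi-view 3D shape representations, reaching about 90.6% accuracy on ModelNet10 with a supervised contrastive loss.

Details

Motivation: Computer vision methods struggle to recognize 3D objects from 2D images; they often need extensive labeled data and rely on CNNs, which may overlook crucial shape relationships. The paper aims to overcome these limits by combining ViTs with contrastive learning.

Result: Supervised contrastive losses reach about 90.6% accuracy on ModelNet10, which the authors present as evidence that ViTs pair well with contrastive objectives.

Insight: ViTs capture global shape semantics while contrastive optimization refines local discriminative features; together they reduce the need for labeled data and address the limitations of CNNs.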

Abstract: This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs’ ability to understand overall shapes and contrastive learning’s effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
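
The abstract does not spell out the loss, so the following is only a minimal sketch of a standard supervised contrastive loss, assuming each embedding comes from one rendered view of a shape and that views sharing a class label are treated as positives. The function name `supcon_loss` and the temperature value are illustrative assumptions, not taken from the paper.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of embeddings (lists of floats).

    Assumption: embeddings are non-zero; they are L2-normalised here.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

With this formulation the loss is small when views of the same class sit close together and large when same-class views are far apart, which is the behaviour the supervised contrastive objective is meant to encourage.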


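[2] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking cs.CV

Martha Teiko Teye, Ori Maoz, Matthias Rottmann

TL;DR: FutrTrack is a camera-LiDAR fusion framework for 3D multi-object tracking that adds a transformer-based smoother and a fusion-driven tracker on top of existing 3D detectors, improving tracking with multi-sensor features.

Details

Motivation: Single-sensor tracking methods fall short in multi-object tracking, particularly under occlusion and viewpoint changes. FutrTrack fuses multimodal camera and LiDAR features to make tracking more robust and accurate.

Result: Evaluated on nuScenes and KITTI; reaches an aMOTA of 74.7 on the nuScenes test set, reducing identity switches while maintaining competitive accuracy.

Insight: Multimodal feature fusion (camera-LiDAR in particular) substantially benefits transformer-based trackers, letting them compete with other neural-network-based methods even with limited data and no pretraining.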

Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird’s-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we refine sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.


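[3] Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos cs.CV

Lorenzo Arboit, Dennis N. Schneider, Britty Baby, Vinkle Srivastav, Pietro Mascagni

TL;DR: Endoshare is a source-available tool for standardizing and de-identifying endoscopic surgical videos. It addresses video management and privacy concerns, was refined through iterative user feedback, and was rated highly for usability and usefulness.

Details

Motivation: Surgical video is valuable for training and research, but heterogeneous recording formats and privacy concerns limit its wider use. Endoshare aims to provide a standardized, privacy-preserving solution.

Result: Surveyed surgeons reported high perceived usefulness (5.07/7) and ease of use (5.15/7); processing time varied with processing mode, video duration, and machine computational power.

Insight: By combining source availability with privacy-by-design, Endoshare offers a transparent, practical pipeline for surgical video management, though compliance certification and broader interoperability validation are still needed.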

Abstract: Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p <= 0.001), and machine computational power (p = 0.041). Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at https://camma-public.github.io/Endoshare/


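[4] Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency cs.CV

Hao Yu, Haoyu Chen, Yan Jiang, Wei Peng, Zhaodong Sun

TL;DR: The paper proposes Attentive Convolution (ATConv), a convolution operator that combines the expressivity of self-attention (SA) with the efficiency of conventional convolution (Conv). It identifies two principles behind SA's advantage over Conv, adaptive routing and lateral inhibition, and builds on them an efficient CNN family, AttNet, that performs well across several tasks.

Details

Motivation: Self-attention is highly expressive on vision tasks, but its quadratic complexity limits practical use. Conventional convolution is efficient but less expressive. The paper aims to identify the core strengths of SA and build them into the convolution operator to get both efficiency and expressivity.

Result: 1. With only 3×3 kernels, ATConv outperforms various SA mechanisms. 2. AttNet reaches 84.4% ImageNet-1K Top-1 accuracy with 27M parameters. 3. Replacing SA with ATConv in the diffusion model SiT-XL/2 lowers ImageNet FID by 0.15 with faster sampling.

Insight: 1. SA's advantage comes from content-dependent routing and competition among weights rather than from receptive-field design alone. 2. Efficiency and expressivity can be combined through principled design, as ATConv illustrates.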

Abstract: Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) Adaptive routing: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) Lateral inhibition: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose Attentive Convolution (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only 3×3 kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain 84.4% ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed 3×3 ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.


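[5] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback cs.CV | cs.AI

Jiho Park, Sieun Choi, Jaeyoon Seo, Jihie Kim

TL;DR: StableSketcher is a framework that enhances diffusion models with visual question answering feedback to generate pixel-based hand-drawn sketches that are more faithful to the prompt.

Details

Motivation: Diffusion models have improved image quality considerably but still struggle to synthesize abstract expressions such as hand-drawn sketches.

Result: Experiments show that StableSketcher outperforms the Stable Diffusion baseline, producing sketches with better stylistic fidelity and prompt alignment.

Insight: VQA-based feedback can serve as an effective signal for improving semantic alignment in generation tasks, and high-quality annotated datasets are crucial for generating abstract expressions.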

Abstract: Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.


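[6] BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models cs.CV | cs.CL | cs.LG

Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson

TL;DR: This work explores synthetic descriptive captions as an additional source of supervision for biological multimodal foundation models. Training BIOCAP on generated captions improves species classification and text-image retrieval.

Details

Motivation: Biological multimodal foundation models typically rely on labels alone as supervision and overlook descriptive captions. The work aims to close this gap by using synthetic captions to help models capture biological traits.

Result: BIOCAP achieves strong performance in species classification and text-image retrieval, supporting the value of descriptive captions.

Insight: Descriptive captions help the model align with the latent structure of biological traits and suppress spurious correlations, offering a new supervision signal for biological multimodal models.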

Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.


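[7] Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects cs.CV

Prithvi Raj Singh, Raju Gottumukkala, Anthony S. Maida, Alan B. Barhorst, Vijaya Gopu

TL;DR: The paper presents a system that combines deep-learning-based detection with physics-based tracking to detect and track fast-moving small objects in 3D in real time.

Details

Motivation: Detecting and tracking fast-moving small objects remains underexplored, and existing methods perform poorly in challenging scenarios such as occlusions and rapid direction changes.

Result: On a custom racquetball dataset, the system achieves up to 70% lower Average Displacement Error than Kalman-filter-based trackers.

Insight: Combining physics-based models with deep learning improves real-time 3D tracking in challenging scenes, particularly for fast-moving small objects.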

Abstract: While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for object detection and tracking of fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematics motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows that our system surpasses Kalman-filter-based trackers, with up to 70% lower Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
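
To illustrate how kinematic equations can bridge missed detections and reject outliers, here is a rough sketch under stated assumptions: constant gravitational acceleration along a y-up axis, no drag or bounces, and a fixed distance gate. The function names, the `max_jump` threshold, and the gating rule are hypothetical and are not taken from the paper.

```python
GRAVITY = (0.0, -9.81, 0.0)  # m/s^2; assumes the y axis points up


def predict(position, velocity, dt):
    """Advance position and velocity by dt seconds under constant acceleration."""
    new_pos = tuple(p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(position, velocity, GRAVITY))
    new_vel = tuple(v + a * dt for v, a in zip(velocity, GRAVITY))
    return new_pos, new_vel


def step(position, velocity, detection, dt, max_jump=0.5):
    """Return the next (position, velocity) estimate.

    A plausible detection is trusted; a missing or implausible one is
    replaced by the physics prediction.
    """
    pred_pos, pred_vel = predict(position, velocity, dt)
    if detection is None:
        return pred_pos, pred_vel  # missed detection: coast on physics
    dist = sum((d - p) ** 2 for d, p in zip(detection, pred_pos)) ** 0.5
    if dist > max_jump:
        return pred_pos, pred_vel  # outlier: too far from the prediction
    # Finite-difference velocity from the accepted detection.
    est_vel = tuple((d - p) / dt for d, p in zip(detection, position))
    return detection, est_vel
```

A fixed gate like this would reject genuine rapid direction changes such as wall bounces, so a practical tracker would need an explicit collision model or an adaptive threshold.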


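[8] Revisiting Logit Distributions for Reliable Out-of-Distribution Detection cs.CV

Jiachen Liang, Ruibing Hou, Minyang Hu, Hong Chang, Shiguang Shan

TL;DR: The paper proposes LogitGap, a post-hoc OOD detection method that exploits the relationship between the maximum logit and the remaining logits to better separate ID from OOD samples, and refines it by focusing on a more compact subset of logits.

Details

Motivation: Existing post-hoc methods underexploit the rich information in the model's logit space, which hurts the reliability of OOD detection.

Result: LogitGap achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks, on both vision-language and vision-only models.

Insight: Exploring the logit space, and in particular the relationship between the maximum logit and the remaining logits, is key to improving OOD detection.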

Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model’s logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at https://github.com/GIT-LJc/LogitGap.
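
The abstract does not give the scoring formula, so the sketch below only illustrates the general idea of a gap-style post-hoc score: the maximum logit minus the mean of the remaining logits, optionally restricted to the next `top_k` largest. Both the formula and the `top_k` parameter are assumptions made for illustration; the paper's actual score and its training-free logit selection strategy may differ.

```python
def logit_gap_score(logits, top_k=None):
    """Gap between the largest logit and the mean of the remaining ones.

    top_k, if given, restricts the comparison to the next top_k largest
    logits. Higher scores suggest a more in-distribution-like input.
    """
    ordered = sorted(logits, reverse=True)
    rest = ordered[1:] if top_k is None else ordered[1:1 + top_k]
    if not rest:
        raise ValueError("need at least two logits to compute a gap")
    return ordered[0] - sum(rest) / len(rest)
```

Under this sketch a confidently classified input such as `[10, 1, 0, -1]` scores 10.0, while a flat logit vector such as `[2, 2, 2]` scores 0.0; a threshold on the score would then separate ID from OOD inputs.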


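[9] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding cs.CV

Penghao Wang, Yiyang He, Xin Lv, Yukai Zhou, Lan Xu

TL;DR: PartNeXt is a next-generation dataset of over 23,000 high-quality textured 3D models with fine-grained, hierarchical part annotations across 50 categories. It is intended to advance 3D part understanding and is benchmarked on part segmentation and 3D part-centric question answering.

Details

Motivation: Existing 3D part understanding datasets such as PartNet rely on untextured geometry and expert-dependent annotation, which limits their scalability and usability. PartNeXt aims to fill this gap with high-quality textured models and multi-task evaluation.

Result: On class-agnostic part segmentation, state-of-the-art methods struggle with fine-grained and leaf-level parts; the 3D part-centric question answering benchmark reveals significant gaps in open-vocabulary part grounding; training Point-SAM on PartNeXt yields substantial gains over PartNet.

Insight: High-quality textured data and fine-grained annotation are key to better 3D part understanding, and multi-task evaluation points to directions for future research.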

Abstract: Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset’s superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.


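[10] Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists cs.CV

Eduardo R. Corral-Soto, Yang Liu, Yuan Ren, Bai Dongfeng, Liu Bingbing

TL;DR: The paper proposes a monocular method for estimating the 8D pose of articulated bicycles and cyclists, extending conventional 6D pose estimation with the rotations of the handlebars and pedals.

Details

Motivation: In autonomous driving, cyclist pose estimation is safety-critical. Conventional 6D pose methods cannot capture the articulation of a bicycle (steering and pedal rotation), which directly affects the estimated travel direction and intention prediction.

Result: The method shows promising results on the 8D pose parameters, with scores competitive with state-of-the-art category-level 6D pose estimators.

Insight: Pose estimation for articulated objects should account for their moving parts (here the handlebars and pedals), which yields a more fine-grained and useful pose description.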

Abstract: In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.
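
As a small illustration of what the two extra parameters add, the 8D state can be pictured as a 6D rigid pose plus a steering angle and a pedal angle. The sketch below is an assumption-laden toy: the field names, the Euler-angle parameterisation, and the rule that travel heading equals body yaw plus steering angle (an upright bicycle with a vertical steering axis) are all illustrative and not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class BicyclePose8D:
    # 3D translation of the bicycle body
    x: float
    y: float
    z: float
    # 3D rotation of the bicycle body, in radians
    roll: float
    pitch: float
    yaw: float
    # Articulation angles relative to the body frame, in radians
    steering: float
    pedal: float

    def travel_heading(self):
        """Approximate intended heading in the ground plane.

        Assumes an upright bicycle, so heading is body yaw plus steering
        angle, wrapped to the interval (-pi, pi].
        """
        angle = self.yaw + self.steering
        return math.atan2(math.sin(angle), math.cos(angle))
```

The point of the example is that a 6D pose alone would report only `yaw`, which can differ from the intended travel direction whenever the handlebars are turned.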


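[11] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning cs.CV

Xudong Yan, Songhe Feng

TL;DR: The paper proposes TOMCAT, which accumulates comprehensive knowledge from the textual and visual modalities at test time to update multimodal prototypes, addressing distribution shift in compositional zero-shot learning.

Details

Motivation: Existing CZSL methods degrade at test time because of distribution shift in the label space, caused by unseen attribute-object compositions, so a method that adapts to the shift during testing is needed.

Result: State-of-the-art performance on four benchmark datasets under both closed-world and open-world settings.

Insight: Accumulating knowledge at test time mitigates distribution shift, and aligning textual and visual prototypes improves robustness.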

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
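
The two mechanisms named in the abstract, a priority queue of high-confidence images and a weighted prototype update, can be sketched roughly as follows. This is an illustrative simplification: the class and function names are hypothetical, the update weight is passed in as a plain number rather than computed adaptively, and the update is a simple convex combination with the mean of the stored features.

```python
import heapq
import itertools


class ConfidenceQueue:
    """Bounded queue that keeps the `capacity` highest-confidence features."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap keyed on confidence
        self._counter = itertools.count()  # tie-breaker; features are never compared

    def push(self, confidence, feature):
        entry = (confidence, next(self._counter), feature)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif confidence > self._heap[0][0]:
            # Evict the current lowest-confidence entry.
            heapq.heapreplace(self._heap, entry)

    def features(self):
        return [feature for _, _, feature in self._heap]


def update_prototype(prototype, features, weight):
    """Move the prototype toward the mean of the stored features.

    weight in [0, 1] controls the degree of adjustment.
    """
    if not features:
        return list(prototype)
    mean = [sum(f[d] for f in features) / len(features)
            for d in range(len(prototype))]
    return [(1 - weight) * p + weight * m for p, m in zip(prototype, mean)]
```

Keeping the queue as a min-heap makes the lowest-confidence entry cheap to find and replace, so the stored set stays small while drifting toward the most reliable test-time samples.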


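[12] PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching cs.CV | cs.AI

Yun Wang, Junjie Hu, Qiaole Dong, Yongjian Zhang, Yanwei Fu

TL;DR: PPMStereo is a dynamic stereo matching method whose Pick-and-Play Memory (PPM) module models long-range spatio-temporal consistency efficiently, easing the trade-off between computational cost and long-range dependency modeling.

Details

Motivation: Temporal consistency in dynamic stereo matching is critical for applications such as augmented reality. Prior methods face a trade-off between computational efficiency and performance when modeling long-term consistency.

Result: Achieves state-of-the-art 0.62/1.11 TEPE on Sintel clean/final at lower computational cost.

Insight: The two-stage pick-and-play memory construction balances the efficiency and quality of long-range spatio-temporal modeling.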

Abstract: Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic Stereo matching, dubbed as PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3% & 9.02% improvements over BiDAStereo) with fewer computational costs. Codes are available at https://github.com/cocowy1/PPMStereo.
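
The two-stage idea can be pictured with a toy example: a 'pick' step that keeps the k memory entries most similar to the current query, and a 'play' step that softmax-weights the picked entries before aggregating them. This is only a sketch under assumptions: per-frame features are plain non-zero vectors, relevance is cosine similarity, and the function names and temperature are hypothetical. The actual module operates on learned feature maps.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick(query, memory, k):
    """Return (score, index) for the k memory entries most similar to query."""
    scored = sorted(((cosine(query, m), i) for i, m in enumerate(memory)),
                    reverse=True)
    return scored[:k]


def play(memory, picked, temperature=1.0):
    """Aggregate the picked entries using softmax weights over their scores."""
    top = max(score for score, _ in picked)
    exps = [math.exp((score - top) / temperature) for score, _ in picked]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    return [sum(w * memory[i][d] for w, (_, i) in zip(weights, picked))
            for d in range(dim)]
```

Picking first keeps the aggregation cost tied to k rather than to the full history length, which is the efficiency argument the abstract makes for a compact memory buffer.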


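[13] Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories cs.CV

Aaron Appelle, Jerome P. Lynch

TL;DR: The paper proposes a rigorous protocol for evaluating video generation models (T2V and I2V) as simulators of multi-person pedestrian trajectories, revealing both how plausible their multi-agent behavior is and where it fails.

Details

Motivation: Large-scale video generation models show high visual realism across diverse contexts, but the plausibility of their dynamics in scenes with multiple interacting pedestrians has not been verified, so an evaluation method is needed.

Result: Leading models have learned surprisingly effective priors for plausible multi-agent behavior, but failure modes such as people merging or disappearing remain.

Insight: Video generation models show promise as simulators of multi-pedestrian scenes, but their dynamic plausibility still needs work, especially under complex interactions and varying crowd densities.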

Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird’s-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.


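[14] SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization cs.CV

Xinyi Hu, Yuran Wang, Yue Li, Wenxuan Liu, Zheng Wang

TL;DR: The paper proposes SPAN, a network that recasts suspicious intention localization from discrete classification to continuous regression in order to capture how suspicion evolves over time.

Details

Motivation: Existing discrete classification methods cannot capture the continuous, evolving nature of suspicious intentions, which limits early intervention and explainability.

Result: On the HAI dataset, SPAN reduces MSE by 19.8% and improves average mAP by 1.78%, with a 2.74% mAP gain in low-frequency cases.

Insight: Modeling suspicion continuously allows subtle behavioral changes to be detected earlier and improves the system's explainability and practical utility.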

Abstract: Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
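
The abstract likens suspicion to a temporal point process with long-term dependencies and cumulative effects but does not state the score formula. As a loose analogy only, the sketch below uses a Hawkes-style intensity in which each past suspicious action adds a contribution that decays exponentially over time. The function name, the exponential kernel, and the `base` and `decay` parameters are assumptions made for illustration, not the paper's definition.

```python
import math


def suspicion_score(t, events, base=0.0, decay=0.1):
    """Continuous suspicion score at time t.

    events is a list of (event_time, coefficient) pairs; only events at or
    before t contribute, each decaying exponentially with elapsed time.
    """
    return base + sum(coeff * math.exp(-decay * (t - event_time))
                      for event_time, coeff in events
                      if event_time <= t)
```

In this picture the per-event coefficients play the role that the abstract assigns to Suspicion Coefficient Modulation: actions with larger coefficients raise the score more, and closely spaced actions accumulate before they decay.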


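[15] RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan

TL;DR: RAPO++ is a cross-stage prompt optimization framework that improves text-to-video generation through data alignment and test-time scaling, without modifying the generative model. It has three stages: retrieval-augmented prompt optimization, feedback-driven iterative refinement, and LLM fine-tuning. Experiments show large gains over existing methods on multiple benchmarks.

Details

Motivation: User-provided prompts are often short, unstructured, and misaligned with the training data, which limits what diffusion-based T2V models can generate. RAPO++ addresses this by optimizing the prompt.

Result: Across five state-of-the-art T2V models and five benchmarks, RAPO++ significantly outperforms existing methods in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.

Insight: RAPO++ is model-agnostic, cost-efficient, and scalable: it improves generation quality by optimizing prompts rather than the model, which the authors present as a new standard for prompt optimization in T2V generation.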

Abstract: Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.


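[16] Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection cs.CV

Talha Ilyas, Duong Nhu, Allison Thomas, Arie Levin, Lim Wei Yap

TL;DR: To address subjectivity and limited accuracy in fetal movement detection, the paper proposes CURL, a self-supervised contrastive learning framework that uses a dual-contrastive loss and a task-specific sampling strategy to detect fetal movement in ultrasound video.

Details

Motivation: Existing fetal movement detection methods, such as maternal perception and cardiotocography (CTG), are subjective and of limited accuracy, so an objective, reliable automated method is needed.

Result: On an in-house dataset of 92 subjects, each with a 30-minute ultrasound session, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%.

Insight: Self-supervised contrastive learning can extract useful fetal movement representations, offering an objective basis for prenatal monitoring and clinical decision-making.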

Abstract: Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.


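[17] EditInfinity: Image Editing with Binary-Quantized Generative Models cs.CV

Jiahuan Wang, Yuxin Chen, Jun Yu, Guangming Lu, Wenjie Pei

TL;DR: EditInfinity is an image editing method built on a binary-quantized generative model; a precise inversion mechanism and a holistic smoothing strategy improve text-driven image editing.

Details

Motivation: Diffusion-based image editing suffers from approximation errors during inversion, which limits editing performance. The authors therefore explore parameter-efficient adaptation of VQ-based generative models, whose exact intermediate quantized representations are attainable, to achieve more precise image inversion.

Result: On the PIE-Bench benchmark, EditInfinity outperforms state-of-the-art diffusion-based baselines across "add", "change", and "delete" operations.

Insight: The exact intermediate representations of binary-quantized generative models provide more effective supervision for inversion, avoiding the approximation errors of diffusion-model inversion.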

Abstract: Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across “add”, “change”, and “delete” editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.


[18] Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context cs.CV | cs.AI | cs.CL

Ge Zheng, Jiaye Qian, Jiajin Tang, Sibei Yang

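TL;DR: This paper studies why Large Vision-Language Models (LVLMs) hallucinate more in longer responses. It argues that the increase is not caused by length alone but by the heavier reliance on context in longer responses, and proposes an “induce-detect-suppress” framework that significantly mitigates hallucination.

Details

Motivation: LVLMs hallucinate more in longer responses. This is usually attributed to uncertainty accumulating with length, but the authors ask whether a deeper mechanism is at work.

Result: The method achieves significant improvements on all benchmarks, validating the framework.

Insight: Hallucination is rooted in context reliance rather than length, a finding that opens a new direction for research on hallucination in LVLMs.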

Abstract: Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel “induce-detect-suppress” framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs’ longer responses.


[19] Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding cs.CV | cs.LG

Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, Sangyoun Lee

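TL;DR: DualGround is a dual-branch architecture that improves video temporal grounding by separating global and local semantics. Modeling phrase-level and sentence-level semantics separately yields finer-grained temporal alignment.

Details

Motivation: Existing VTG models ignore the different semantic roles of text tokens and rely too heavily on sentence-level global semantics, so they fail to use word-level signals for fine-grained temporal alignment.

Result: On the QVHighlights and Charades-STA benchmarks, DualGround achieves state-of-the-art results on both Moment Retrieval and Highlight Detection.

Insight: Modeling text tokens separately according to their semantic roles lets the model capture both global and local semantics, improving the precision and robustness of video temporal grounding.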

Abstract: Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across the QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.


[20] Calibrating Multimodal Consensus for Emotion Recognition cs.CV | cs.CL | cs.LG | cs.MM

Guowei Zhong, Junjie Li, Huaiyu Zhu, Ruohong Huan, Yun Pan

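TL;DR: This paper proposes Calibrated Multimodal Consensus (CMC), a model that addresses semantic inconsistency and text-modality dominance in multimodal emotion recognition. With a pseudo-label generation module and a parameter-free fusion module, CMC reaches a better consensus during multimodal finetuning and performs strongly on several datasets.

Details

Motivation: Multimodal Emotion Recognition (MER) suffers from cross-modal semantic inconsistency and text-modality dominance, both of which hurt recognition accuracy. The paper aims to address these problems.

Result: On CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, CMC performs on par with or better than the best existing methods, with notable advantages in scenarios with semantic inconsistencies.

Insight: Pseudo labels combined with a multimodal consensus mechanism reduce inconsistency between modalities and improve the robustness and accuracy of emotion recognition.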

Abstract: In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.


[21] Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals cs.CV

Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim

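TL;DR: This paper proposes a real-time currency detection system that combines a YOLOv8 nano model with SE blocks and helps visually impaired users identify currency through voice feedback, reaching an accuracy of 97.73%.

Details

Motivation: Visually impaired people depend on others when handling money in daily life. Smartphones and machine learning can give them a way to do this independently.

Result: The model reaches 97.73% accuracy, 95.23% recall, a 95.85% F1-score, and a mAP50(B) of 97.21%.

Insight: Combining a lightweight model with SE blocks enables efficient detection, and voice feedback makes the solution practical and easy to use for visually impaired people.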

Abstract: Technologies like smartphones have become essential in our daily lives and are now accessible to everyone, including visually impaired individuals. With smartphone cameras, image capturing and processing have become more convenient. Using smartphones and machine learning, the lives of visually impaired people can be made a little easier. Daily tasks such as handling money without relying on someone else can be troublesome for them. For that purpose, this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model achieves an accuracy of 97.73%, a recall of 95.23%, an F1-score of 95.85%, and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21%. Voice feedback after detection helps visually impaired users identify the currency. This paper aims to create a practical and efficient currency detection system that empowers visually impaired individuals to handle money independently.


[22] GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection cs.CV | cs.MM

Guangyu Dai, Dong Chen, Siliang Tang, Yueting Zhuang

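TL;DR: GMFVAD proposes a finer-grained multi-modal feature approach to improve video anomaly detection (VAD). By combining the main content of each video snippet with text features, it reduces redundant information and improves detection performance.

Details

Motivation: Existing video anomaly detection methods mostly rely on coarse-grained multi-modal fusion, which leaves redundant information unhandled and degrades detection performance. The paper aims to solve this with finer-grained feature extraction and multi-modal fusion.

Result: The method achieves state-of-the-art performance on four major datasets, and ablation experiments confirm that the gain comes from reducing redundant information.

Insight: Fine-grained multi-modal feature fusion reduces redundant information and thereby improves video anomaly detection, and introducing text features further strengthens the visual features.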

Abstract: Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to determine whether video snippets contain abnormalities. Recently, some works have attempted to introduce multi-modal information, such as text features, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features and reduce the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate finer-grained multi-modal features from each video snippet that summarize its main content, and we introduce text features based on the captions of the original video to further enhance the visual features of the highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four main datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.


[23] Causal Debiasing for Visual Commonsense Reasoning cs.CV | cs.MM

Jiayi Zou, Gengyun Jia, Bing-Kun Bao

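TL;DR: This paper proposes a debiasing method for Visual Commonsense Reasoning (VCR). It analyzes co-occurrence and statistical biases in the dataset, introduces the VCR-OOD datasets to evaluate generalization, and applies backdoor adjustment to remove prediction shortcuts.

Details

Motivation: Existing VCR methods overlook dataset bias and lack debiasing strategies, so models may rely on statistical correlations in the data rather than genuine causal relationships.

Result: Experiments demonstrate the effectiveness of the debiasing method across different datasets.

Insight: Debiasing improves generalization and keeps the model from relying on statistical correlations in the data, bringing it closer to genuine causal reasoning.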

Abstract: Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
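
The backdoor adjustment named in the abstract can be illustrated on a toy discrete problem. The sketch below applies the standard formula P(Y | do(X)) = Σ_z P(Y | X, z) P(z). It is not the authors' implementation: the confounder values, the probabilities, and the function name `backdoor_adjust` are hypothetical and chosen only for illustration.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """Return P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


# Hypothetical confounder Z, e.g. whether an answer word is common or rare.
p_z = {"common": 0.8, "rare": 0.2}

# Hypothetical conditional probabilities P(Y=1 | X=x, Z=z).
p_y_given_xz = {
    (1, "common"): 0.9,
    (1, "rare"): 0.5,
    (0, "common"): 0.6,
    (0, "rare"): 0.2,
}

p_do_1 = backdoor_adjust(p_y_given_xz, p_z, 1)
p_do_0 = backdoor_adjust(p_y_given_xz, p_z, 0)
# Causal effect of X on Y after averaging over the confounder.
effect = p_do_1 - p_do_0
```

Averaging over P(z) instead of P(z | x) is what blocks the path through the confounder, which is the sense in which the adjustment removes a prediction shortcut.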


[24] Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition cs.CV

Haodong Yang, Zhongling Huang, Shaojie Guo, Zhe Zhang, Gong Cheng

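TL;DR: The paper proposes the Knowledge-Informed Neural Network (KINN), a lightweight network that addresses the conflict between generalization, interpretability, and efficiency in complex-valued SAR image recognition under data scarcity and domain shift. KINN uses a three-stage architecture of physics-guided compression, feature aggregation, and semantic compression, combined with electromagnetic scattering priors, to substantially improve recognition performance.

Details

Motivation: Conventional data-driven models struggle to balance generalization, interpretability, and efficiency in CV-SAR image recognition, while the rich electromagnetic scattering features in CV-SAR data remain underused. The paper aims to resolve this trilemma by introducing physical priors.

Result: On five SAR benchmarks, KINN (0.7M / 0.95M parameters) performs strongly in data-scarce and domain-shift scenarios, generalizes well, and is interpretable, setting the state of the art in parameter-efficient recognition.

Insight: A lightweight design that incorporates physical priors can resolve this multi-objective trade-off, and KINN offers a new path toward trustworthy AI in SAR image analysis.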

Abstract: Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel “compression-aggregation-compression” architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.


[25] DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering cs.CV | cs.MM

Jiayi Zou, Chaofan Chen, Bing-Kun Bao, Changsheng Xu

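TL;DR: DMC³ proposes a dual-modal counterfactual contrastive construction framework that addresses the challenges of multi-event understanding and hand-object interaction recognition in egocentric video question answering.

Details

Motivation: Existing Egocentric VideoQA methods ignore the unique challenges of the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. DMC³ aims to address them.

Result: The method reaches state-of-the-art results of 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D.

Insight: Contrastive optimization with counterfactual samples improves the model's understanding of multiple events and interactions.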

Abstract: Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for the textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply a contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, reaching state-of-the-art performance on both benchmarks.
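
The contrastive objective described in the abstract, pulling original features toward positive samples and pushing them away from negatives, can be sketched with an InfoNCE-style loss. The abstract does not state the exact loss form, so InfoNCE, the cosine similarity, the temperature value, and the toy feature vectors below are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is close to the positive
    and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy features (hypothetical): the anchor is the original sample.
anchor = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, 1.0]]

loss_aligned = info_nce(anchor, [1.0, 0.1], negatives)      # positive near anchor
loss_misaligned = info_nce(anchor, [-1.0, 0.1], negatives)  # positive far from anchor
```

Minimizing this loss raises the similarity to the positive relative to the negatives, which matches the minimize/maximize distance behaviour the abstract describes.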


[26] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning cs.CV | cs.AI

Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong

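TL;DR: The paper proposes a new paradigm called Instruction-as-Reasoning, which improves GUI grounding by dynamically analyzing diverse instructions. The method combines supervised fine-tuning and reinforcement learning and substantially improves results on several benchmarks.

Details

Motivation: Existing GUI grounding methods treat instructions as a static proxy and overlook how instruction diversity and quality affect performance. The study finds a 23.3% flaw rate in the instructions of existing datasets and shows that exploiting instruction diversity at inference time yields up to a 76% relative performance improvement.

Result: UI-Ins-32B achieves the best results on UI-I2E-Bench (87.3%), ScreenSpot-Pro (57.0%), and MMBench-GUI L2 (84.9%). UI-Ins-7B achieves a 74.1% success rate on AndroidWorld.

Insight: 1. Dynamically exploiting diverse instructions significantly improves grounding performance. 2. The method mitigates policy collapse in the SFT+RL framework. 3. The method shows strong agentic potential.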

Abstract: GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.


[27] Breakdance Video classification in the age of Generative AI cs.CV | cs.AI | cs.LG

Sauptik Dhar, Naveen Ramakrishnan, Michelle Munson

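TL;DR: The paper examines how well modern video foundation models (both encoders and decoders) apply to breakdance video classification. It finds that encoder models still outperform state-of-the-art video language models, and it offers guidance on choosing an encoder model along with an analysis of a finetuned decoder model.

Details

Motivation: Existing large vision-language models are mostly applied to popular sports such as soccer and basketball, while the niche but popular sport of breakdance has received little study. This paper aims to fill that gap.

Result: Encoder models outperform video language models on breakdance video classification, showing their advantage in this specific domain.

Insight: 1. Encoder models retain an advantage on classification tasks in specific domains such as breakdance. 2. Finetuning decoder models calls for more detailed analysis. 3. Research on niche sports helps broaden the application of foundation models.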

Abstract: Large Vision Language Models have recently seen extensive application in several sports use cases. Most of these works target a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a niche but hugely popular dance sport: breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.


[28] A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization cs.CV | cs.AI

LinFeng Li, Jian Zhao, Zepeng Yang, Yuhang Song, Bojun Lin

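TL;DR: The paper proposes a parameter-efficient Mixture-of-Experts (MoE) framework for handling heterogeneous viewpoints in cross-modal geo-localization. With domain-aligned preprocessing and a two-stage training strategy, it took first place in the competition.

Details

Motivation: Cross-modal geo-localization suffers from severe inter-platform heterogeneity and a gap between the training and test domains, which calls for an efficient framework to address both.

Result: The system ranks first on the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.

Insight: Domain alignment combined with a mixture of experts mitigates heterogeneity in cross-modal tasks and improves localization accuracy.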

Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
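
The abstract states that the three platform experts' scores are fused at inference but does not say how. A minimal sketch, assuming a (possibly weighted) average of per-candidate similarity scores followed by an argmax, is shown below. The function name `fuse_scores`, the weights, and the score values are hypothetical.

```python
def fuse_scores(expert_scores, weights=None):
    """Combine per-candidate scores from several experts by weighted average.

    expert_scores maps an expert name to a list of scores, one per candidate
    image; all lists are assumed to have the same length and order.
    """
    names = list(expert_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # plain average by default
    total = sum(weights[name] for name in names)
    num_candidates = len(expert_scores[names[0]])
    return [
        sum(weights[name] * expert_scores[name][i] for name in names) / total
        for i in range(num_candidates)
    ]


# Hypothetical similarity scores of one text query against three candidates.
scores = {
    "satellite": [0.2, 0.7, 0.1],
    "drone": [0.3, 0.6, 0.4],
    "ground": [0.1, 0.5, 0.9],
}

fused = fuse_scores(scores)
# Index of the retrieved image after fusion.
best = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the ground expert alone would pick candidate 2, while the fused ranking picks candidate 1, which shows how fusion can override a single expert's preference.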


[29] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models cs.CV

Zelin Peng, Zhengqin Xu, Qingyang Liu, Xiaokang Yang, Wei Shen

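TL;DR: HyperET proposes an efficient way to train multi-modal large language models (MLLMs) in hyperbolic space. By dynamically adjusting the hyperbolic radius, it aligns the visual and textual modalities at multiple granularity levels and greatly reduces the computational resources required.

Details

Motivation: Training MLLMs requires enormous computational resources, largely because vision encoders such as CLIP and SAM lack alignment with language at multiple granularity levels. HyperET aims to solve this with hyperbolic space.

Result: Experiments show that HyperET clearly improves existing pre-trained and fine-tuned models across multiple MLLM benchmarks while adding less than 1% extra parameters.

Insight: Hyperbolic space provides a natural framework for multi-granularity cross-modal alignment, and the dynamic parametrization strategy balances flexibility with efficiency, suggesting a new approach to efficient MLLM training.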

Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves both existing pre-training and fine-tuning MLLMs with less than 1% additional parameters.
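
To make the idea of adjusting the hyperbolic radius with a matrix more concrete, the sketch below implements the standard Möbius matrix-vector multiplication on the Poincaré ball of curvature -c, as defined in the hyperbolic neural networks literature. Whether HyperET uses exactly this form is an assumption, and the diagonal matrix and input point are made-up values.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def mobius_matvec(matrix, x, c=1.0):
    """Mobius matrix-vector product on the Poincare ball (curvature -c).

    Requires sqrt(c) * ||x|| < 1, i.e. x lies strictly inside the ball.
    """
    mx = [sum(m * xj for m, xj in zip(row, x)) for row in matrix]
    x_norm, mx_norm = norm(x), norm(mx)
    if x_norm == 0.0 or mx_norm == 0.0:
        return [0.0] * len(mx)
    sqrt_c = math.sqrt(c)
    # tanh keeps the result inside the ball regardless of the matrix scale.
    radius = math.tanh(mx_norm / x_norm * math.atanh(sqrt_c * x_norm)) / sqrt_c
    return [radius * v / mx_norm for v in mx]


x = [0.3, 0.4]                      # point at Euclidean radius 0.5
scale2 = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical diagonal scaling matrix
y = mobius_matvec(scale2, x)
```

A diagonal scaling matrix with equal entries moves the point along its ray: here the Euclidean radius grows from 0.5 to 0.8 while staying inside the unit ball, which is one way a learnable matrix can adjust the hyperbolic radius.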


[30] Positional Encoding Field cs.CV

Yunpeng Bai, Haoxiang Li, Qixing Huang

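TL;DR: This paper studies the independence of patch tokens in Diffusion Transformers (DiTs) and proposes the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field for better geometric modeling and control.

Details

Motivation: The study finds that patch tokens in DiTs are surprisingly independent: even when positional encodings (PEs) are perturbed, DiTs still produce globally coherent outputs. This indicates that spatial coherence is governed mainly by PEs, so the authors propose PE-Field to better model 3D space.

Result: Experiments show that the PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.

Insight: Spatial coherence in DiTs depends mainly on positional encodings rather than on dependencies between patch tokens, which suggests that improving positional encodings may be key to advancing visual generation.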

Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.


[31] Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment cs.CV | cs.AI

Saif Ur Rehman Khan, Muhammad Nabeel Asim, Sebastian Vollmer, Andreas Dengel

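TL;DR: This paper proposes FuzzyDistillViT-MobileNet, a new approach to lung cancer classification that uses dynamic fuzzy logic-driven knowledge distillation (KD) to handle uncertainty and complexity in diagnosis.

Details

Motivation: Conventional static knowledge distillation uses fixed weights and cannot adapt to the varying uncertainty across regions of lung cancer images. To address this, the authors propose dynamically adjusting the distillation weight.

Result: The model performs strongly on two datasets: LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy).

Insight: Combining dynamic weight adjustment with fuzzy logic substantially improves the model's ability to adapt to uncertain regions, and image fusion helps make feature extraction more effective.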

Abstract: This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model's ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student's generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation. To address computational efficiency, a Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
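
The contrast between static and dynamic distillation weights can be sketched as follows. This is only an illustration of the general idea: the ramp-shaped membership function, its thresholds, the temperature, and the way the weight mixes the distillation and cross-entropy terms are all assumptions, not details given in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuzzy_weight(confidence, low=0.5, high=0.9):
    """Hypothetical ramp membership: 0 below `low`, 1 above `high`."""
    if confidence <= low:
        return 0.0
    if confidence >= high:
        return 1.0
    return (confidence - low) / (high - low)


def kd_loss(student_logits, teacher_logits, label, temperature=2.0):
    """Distillation loss whose weight depends on teacher confidence."""
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft))
    ce = -math.log(softmax(student_logits)[label])
    alpha = fuzzy_weight(max(softmax(teacher_logits)))  # dynamic weight
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce, alpha


student = [1.0, 0.5, 0.2]
loss_conf, alpha_conf = kd_loss(student, [5.0, 0.0, 0.0], label=0)  # confident teacher
loss_amb, alpha_amb = kd_loss(student, [0.1, 0.0, 0.0], label=0)    # ambiguous teacher
```

With a confident teacher the loss is dominated by the distillation term, and with an ambiguous teacher it falls back to the ground-truth cross-entropy, which mirrors the focus on high-confidence regions described in the abstract.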


[32] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence cs.CV

Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou

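TL;DR: Conan is a framework for video reasoning that produces multi-step reasoning grounded in visual evidence, trained with multi-stage progressive learning and reinforcement learning.

Details

Motivation: Existing video reasoning methods either rely on text-only chains that lead to ungrounded conclusions or use frame retrieval that localizes evidence inaccurately, so a new method is needed that combines visual evidence with multi-step reasoning.

Result: On six multi-step reasoning benchmarks, Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by more than 10% in accuracy on average, achieving state-of-the-art performance.

Insight: Conan shows strong scalability and robustness on long-video understanding tasks, indicating that the approach generalizes to more complex scenarios.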

Abstract: Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.


[33] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning cs.CV | cs.AI

Xiaohan Lan, Fanfan Liu, Haibo Qiu, Siqi Yang, Delian Ruan

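TL;DR: This paper proposes Metis-HOME, a framework that uses a Hybrid Optimized Mixture-of-Experts (MoE) architecture to resolve the trade-off between complex reasoning and general capability in multimodal reasoning models.

Details

Motivation: Current multimodal large reasoning models apply computationally expensive reasoning even to simple queries, which is inefficient, and their focus on specialized reasoning weakens their general capabilities.

Result: Metis-HOME substantially improves complex reasoning while also improving the model's general capabilities, reversing the degradation seen in existing reasoning models.

Insight: Dynamic routing and per-branch optimization can balance efficient multimodal reasoning with general capability, offering a new paradigm for building powerful and versatile MLLMs.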

Abstract: Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a “Hybrid Thinking” paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model’s general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.
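
The routing step described in the abstract can be pictured with a small linear router that scores each query and dispatches it to one of two branches. This is a toy sketch, not the Metis-HOME router: the two query features, the weights, and the bias are hypothetical stand-ins for parameters that would be learned in practice.

```python
import math

BRANCHES = ("thinking", "non_thinking")


def route(features, weights, bias):
    """Pick a branch via a linear layer followed by a softmax.

    features: query features (hypothetical, e.g. [needs_math, length]).
    weights:  one weight vector per branch; bias: one bias per branch.
    """
    logits = [
        sum(w * f for w, f in zip(branch_w, features)) + b
        for branch_w, b in zip(weights, bias)
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = max(range(len(probs)), key=probs.__getitem__)
    return BRANCHES[index], probs


# Hypothetical router parameters.
weights = [[2.0, 1.0], [-2.0, -1.0]]
bias = [-0.5, 0.5]

hard_branch, hard_probs = route([1.0, 0.8], weights, bias)  # e.g. a math problem
easy_branch, easy_probs = route([0.0, 0.1], weights, bias)  # e.g. a short OCR query
```

Only the selected branch needs to run for a given query, which is where the efficiency gain for simple queries would come from under this design.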


[34] Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis cs.CV | cs.AI

Lixiong Qin, Yang Zhang, Mei Wang, Jiani Hu, Weihong Deng

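TL;DR: This paper proposes the Fake-in-Facext (FiFa) framework, which improves the reliability and explainability of DeepFake analysis through fine-grained facial region partitioning and a new Artifact-Grounding Explanation (AGE) task, and designs FiFa-MLLM, a unified multi-task learning architecture.

Details

Motivation: Current Multimodal Large Language Models (MLLMs) fall short in fine-grained DeepFake analysis: data annotations are unreliable, explanations are not tied to visual evidence, and queries about arbitrary facial regions are unsupported. FiFa aims to solve these problems.

Result: FiFa-MLLM outperforms baselines on the AGE task and achieves state-of-the-art performance on existing XDFA datasets.

Insight: Through fine-grained facial region partitioning and task design, FiFa shows how combining vision and language tasks can make DeepFake analysis more reliable and explainable.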

Abstract: The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.


[35] Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image cs.CV | cs.LG

Guillermo Carbajal, Andrés Almansa, Pablo Musé

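TL;DR: Proposes Blur2Seq, a deep learning framework that jointly estimates the latent sharp image and the camera motion trajectory from a single motion-blurred image. A differentiable blur creation module and end-to-end training yield strong deblurring and trajectory estimation.

Details

Motivation: Motion blur caused by camera shake, especially under large or rotational movements, is a challenge for image restoration. Existing end-to-end methods perform poorly under severe or spatially variant blur, so a more effective and interpretable approach is needed.

Result: Achieves state-of-the-art performance on synthetic and real datasets, particularly for severe or spatially variant blur.

Insight: Trajectory estimation makes the cause of the blur interpretable and also supports reconstructing the sequence of sharp images that corresponds to the blurry image.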

Abstract: Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
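
The reblur consistency idea can be illustrated with a toy 1D sketch. It assumes the blurry observation is the average of the sharp signal warped along a trajectory, and it uses integer circular shifts as a hypothetical stand-in for the per-pose warps of the Projective Motion Blur Model. The names `shift`, `reblur`, and `reblur_loss` are illustrative and are not taken from the Blur2Seq code.

```python
def shift(signal, offset):
    """Circularly shift a 1D signal (stand-in for a per-pose image warp)."""
    n = len(signal)
    return [signal[(i - offset) % n] for i in range(n)]


def reblur(sharp, trajectory):
    """Average the sharp signal warped by every pose on the trajectory."""
    frames = [shift(sharp, offset) for offset in trajectory]
    return [sum(frame[i] for frame in frames) / len(frames)
            for i in range(len(sharp))]


def reblur_loss(blurry, sharp, trajectory):
    """Mean squared error between the observation and the re-blurred estimate."""
    synthetic = reblur(sharp, trajectory)
    return sum((b - s) ** 2 for b, s in zip(blurry, synthetic)) / len(blurry)


sharp = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
blurry = reblur(sharp, [0, 1, 2])              # simulated observation
good = reblur_loss(blurry, sharp, [0, 1, 2])   # trajectory matches the blur
bad = reblur_loss(blurry, sharp, [0])          # trajectory ignores the motion
```

In this toy setting the loss vanishes only when the candidate trajectory reproduces the observed blur, which is the kind of signal a post-inference trajectory refinement could minimize.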


[36] Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation cs.CV | cs.ROPDF

Marziyeh Bamdad, Hans-Peter Hutter, Alireza Darvishy

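TL;DR: The paper proposes SELM-SLAM3, a deep learning-based visual SLAM framework that integrates SuperPoint and LightGlue to improve feature extraction and matching, and outperforms conventional ORB-SLAM3 and other state-of-the-art RGB-D SLAM systems under challenging conditions such as low texture and motion blur.

Details

Motivation: Despite progress in SLAM, robustness under challenging conditions such as low texture, motion blur, or difficult lighting remains an open problem. These conditions are especially common in assistive visual navigation, where they degrade localization accuracy and tracking stability and thus reduce navigation reliability and safety.

Result: Experiments show that SELM-SLAM3 performs well in challenging scenarios such as low-texture scenes and fast motion, outperforming ORB-SLAM3 by an average of 87.84% and state-of-the-art RGB-D SLAM systems by 36.77%.

Insight: Deep learning can substantially improve the robustness of visual SLAM in complex environments, particularly in feature extraction and matching. This suggests that combining deep learning with conventional SLAM is an effective route to reliable navigation systems.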

Abstract: Despite advancements in SLAM technologies, robust operation under challenging conditions such as low-texture, motion-blur, or challenging lighting remains an open challenge. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.


[37] From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging cs.CV | I.4.3; I.4.8; I.2.10PDF

Fuchen Li, Yansong Du, Wenbo Cheng, Xiaoxia Zhou, Sen Yin

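TL;DR: The paper proposes ACamera-Net, a lightweight, scene-adaptive camera parameter adjustment network that predicts optimal exposure and white balance directly from RAW inputs, addressing the unstable image quality of consumer-grade cameras under complex illumination.

Details

Motivation: Consumer-grade cameras often perform poorly under complex illumination (low light, high dynamic range, backlighting, and so on), which degrades image quality and hurts downstream vision tasks.

Result: Experiments show that ACamera-Net significantly improves image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines.

Insight: Learning camera parameter adjustment directly from RAW data is an efficient and general approach that can markedly improve image quality and downstream task performance.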

Abstract: Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.


[38] EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence cs.CV | cs.ROPDF

Ding Zou, Feifan Wang, Mengyu Ge, Siyuan Fan, Zongbing Zhang

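TL;DR: EmbodiedBrain is a new vision-language foundation model that targets the limitations of current LLM- and MLLM-based embodied agents in task planning, reaching strong performance through a new training methodology and evaluation system.

Details

Motivation: Current large language models and multimodal LLMs for embodied tasks suffer from a gap between model design and agent requirements, a trade-off between real-time latency and performance, and the limitations of offline evaluation, all of which hinder progress toward AGI.

Result: Experiments show that EmbodiedBrain performs strongly across all metrics, setting a new SOTA for embodied foundation models.

Insight: A new training methodology and evaluation system can effectively improve the task planning and execution abilities of embodied agents; the open-sourced data and models are an important resource for the embodied intelligence field.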

Abstract: The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.


[39] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence cs.CV | cs.AI | cs.MMPDF

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang

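TL;DR: Open-o3 Video is a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, enabling joint temporal tracking and spatial localization across dynamic scenes. With carefully constructed datasets and training strategies, the model achieves state-of-the-art performance on multiple video understanding benchmarks.

Details

Motivation: Existing video reasoning models usually generate only textual reasoning traces and cannot indicate when and where key evidence appears. Models such as OpenAI-o3 introduced evidence-centered reasoning for images, but extending this to video is harder because it requires joint temporal tracking and spatial localization across dynamic scenes.

Result: On the V-STAR benchmark, mAM improves by 14.4% and mLGM by 24.2%; the model also performs well on VideoMME, WorldSense, VideoMMMU, and other benchmarks. The reasoning traces additionally provide useful signals for test-time scaling.

Insight: 1. Unified spatio-temporal supervision is crucial for video reasoning; 2. reinforcement learning with multiple rewards can effectively balance accuracy, temporal, and spatial objectives; 3. explicit reasoning traces not only improve performance but also directly support the model's trustworthiness.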

Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
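
A reward that jointly encourages answer accuracy, temporal alignment, and spatial precision could look roughly like the sketch below. The abstract does not give the actual reward definitions, so the use of interval IoU and box IoU, the weighted sum, the weights, and all function names here are assumptions made for illustration.

```python
def interval_iou(a, b):
    """IoU of two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combined_reward(correct, pred_span, gt_span, pred_box, gt_box,
                    weights=(1.0, 0.5, 0.5)):
    """Weighted sum of answer, temporal, and spatial terms (weights are hypothetical)."""
    w_answer, w_time, w_space = weights
    return (w_answer * float(correct)
            + w_time * interval_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))


perfect = combined_reward(True, (0, 10), (0, 10), (0, 0, 2, 2), (0, 0, 2, 2))
partial = combined_reward(True, (0, 10), (5, 15), (0, 0, 2, 2), (1, 1, 3, 3))
```

A correct answer with poorly localized evidence scores lower than a correct answer with well-localized evidence, which is the behavior such a multi-term reward is meant to produce.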


[40] GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models cs.CVPDF

Muhammad Atif Butt, Alexandra Gomez-Villa, Tao Wu, Javier Vazquez-Corral, Joost Van De Weijer

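TL;DR: GenColorBench is the first benchmark to systematically evaluate the color control of text-to-image generation models. It contains 44K color-focused prompts covering more than 400 colors and reveals the capabilities and limitations of mainstream models in color generation.

Details

Motivation: Current text-to-image models have weak color control, and systematic evaluation is lacking. Color is central to human vision and communication and is critical in many applications.

Result: The evaluation shows marked differences in color generation ability across models, and reveals how well models understand different color conventions as well as their failure modes.

Insight: GenColorBench will guide improvements in the color precision of models and advance text-to-image generation.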

Abstract: Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models’ true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.


[41] Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation cs.CV | cs.AIPDF

Ziyu Ye, Chen Ju, Chaofan Ma, Xiaoyun Zhang

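TL;DR: The paper proposes an unsupervised domain adaptation framework based on similarity-based prototypes for cross-modality segmentation. It learns class-wise prototypes with a similarity constraint, addresses the class-missing problem in domain adaptation, and further improves performance through contrastive learning.

Details

Motivation: Deep models perform well on their training data but degrade sharply on unseen data. To address domain shift while avoiding costly annotation of unseen domains, the paper proposes an unsupervised domain adaptation method.

Result: Experiments show that the method outperforms other state-of-the-art methods on cross-modality segmentation.

Insight: Combining class-wise prototypes with contrastive learning can effectively mitigate the class-missing problem in domain adaptation and improve generalization to unseen domains.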

Abstract: Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. Specifically, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem, enables contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.
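
The notion of class-wise prototypes in an embedding space can be sketched as follows. This is a minimal illustration that assumes a prototype is the mean embedding of all pixels with a given label and that similarity is measured with the cosine; the abstract does not specify either choice, and the function names are hypothetical.

```python
import math


def class_prototypes(embeddings, labels):
    """Return {label: mean embedding} over all pixels carrying that label."""
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_class(vector, prototypes):
    """Assign an embedding to the class whose prototype is most similar."""
    return max(prototypes, key=lambda label: cosine(vector, prototypes[label]))


pixels = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
protos = class_prototypes(pixels, labels)
predicted = nearest_class([0.7, 0.3], protos)
```

A similarity constraint of the kind the abstract mentions would then push each embedding toward its own class prototype and away from the others; storing prototypes from earlier images in a dictionary keeps a prototype available for classes that are absent from the current image.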


[42] OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects cs.CV | cs.AI | I.4.5; I.2.6PDF

Mark He Huang, Lin Geng Foo, Christian Theobalt, Ying Sun, De Wen Soh

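TL;DR: OnlineSplatter is a novel online feed-forward framework that generates high-quality 3D Gaussian representations directly from RGB frames, without camera poses, depth priors, or bundle optimization, for online 3D reconstruction of free-moving objects.

Details

Motivation: 3D reconstruction of free-moving objects from monocular video is challenging, especially without reliable pose or depth cues and under arbitrary object motion. Existing methods need additional pose or depth information, which limits their applicability.

Result: On real-world datasets, OnlineSplatter significantly outperforms existing pose-free reconstruction baselines and keeps improving with more observations while maintaining constant memory and runtime.

Insight: A spatially guided memory design and a sparsification mechanism can handle the complex motion of free-moving objects while balancing reconstruction quality and computational efficiency.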

Abstract: Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.


[43] SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding cs.CVPDF

Yuan Sheng, Yanbin Hao, Chenxu Li, Shuo Wang, Xiangnan He

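TL;DR: The paper proposes SeViCES, a framework that improves long video understanding through semantic-visual consensus evidence selection. It is training-free and model-agnostic, and is built on two modules: SVCFS (Semantic-Visual Consensus Frame Selection) and ACR (Answer Consensus Refinement).

Details

Motivation: Long videos have complex, temporally scattered content that existing methods struggle to handle efficiently. Video large language models (Video-LLMs) can process fairly long videos, but on truly long sequences they are computationally expensive and reason inconsistently. A method is needed that selects the most informative frames and provides complete, query-relevant context.

Result: On long video understanding benchmarks, SeViCES outperforms current state-of-the-art methods in both accuracy and robustness.

Insight: Consensus-based selection of semantic and visual evidence is key to improving Video-LLM reasoning; multimodal fusion and answer-space constraints can significantly improve the reliability of long video understanding.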

Abstract: Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.


[44] Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges cs.CV | cs.AIPDF

Zhenhuan Zhou, Jingbo Zhu, Yuchen Zhang, Xiaohang Guan, Peng Wang

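TL;DR: This systematic review summarizes deep learning applications in dental image analysis, covering 260 studies and examining public datasets and algorithms in detail. It introduces the basic concepts of dental imaging, dataset characteristics, and deep learning techniques and their use across tasks, and analyzes current challenges and future directions.

Details

Motivation: Dental image analysis faces challenges such as low contrast, metallic artifacts, and variations in projection angle, and manual interpretation is time-consuming and inconsistent. AI-based automated analysis, deep learning in particular, has become a solution thanks to its strong feature extraction capabilities.

Result: The review shows the broad use of deep learning in dental image analysis, including diagnosis and treatment planning tasks, and identifies the limitations of existing methods and room for improvement.

Insight: Deep learning has great potential in dental image analysis, but challenges such as data scarcity, annotation subjectivity, and model generalization remain; future work may bring in more multimodal data and self-supervised learning.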

Abstract: Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians’ expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research: datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain. Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.


[45] Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging cs.CVPDF

Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang

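TL;DR: BTB3D proposes a new causal convolutional encoder-decoder architecture whose improved 3D token generation significantly boosts vision-language modeling for 3D medical imaging.

Details

Motivation: Current vision-language models for 3D medical imaging struggle with high-resolution, long-sequence volumes. The main problems are that contrastive pretraining yields vision encoders misaligned with clinical language, and that slice-wise tokenization blurs fine anatomical structures.

Result: For report generation, BTB3D improves BLEU scores and raises clinical F1 by 40%; for text-to-CT synthesis, it reduces FID by 75% and halves FVD.

Insight: Precise three-dimensional tokenization, rather than larger language backbones alone, is key to scalable vision-language modeling in 3D medical imaging.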

Abstract: Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D


[46] UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset cs.CVPDF

Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan

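TL;DR: The paper introduces UltraHR-100K, a large-scale high-quality UHR dataset, together with a frequency-aware post-training method, significantly improving the detail quality and overall fidelity of ultra-high-resolution (UHR) text-to-image generation.

Details

Motivation: UHR text-to-image generation lacks a large-scale high-quality dataset, and existing methods fall short in synthesizing fine-grained detail in UHR scenarios.

Result: On the UltraHR-eval4K benchmark, the method significantly improves the detail quality and overall fidelity of UHR image generation.

Insight: High-quality datasets and optimization targeted at high-frequency detail are essential for improving UHR image generation.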

Abstract: Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.
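
To make the idea of softly weighting frequency components concrete, here is a toy 1D sketch. The abstract only states that SWFR uses the DFT to softly constrain frequency components; the linear weight `1 + sharpness * frequency`, the squared-error form, and the names `dft` and `soft_frequency_loss` are assumptions for illustration, not the paper's formulation.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def soft_frequency_loss(pred, target, sharpness=1.0):
    """Spectral squared error with weights that grow smoothly with frequency."""
    n = len(pred)
    spec_pred, spec_target = dft(pred), dft(target)
    total = 0.0
    for k in range(n):
        frequency = min(k, n - k) / (n / 2)    # normalized to [0, 1]
        weight = 1.0 + sharpness * frequency   # assumed soft weighting
        total += weight * abs(spec_pred[k] - spec_target[k]) ** 2
    return total / n


n = 8
flat = [0.0] * n
# Two error signals with equal energy but different frequency content.
low = [math.cos(2 * math.pi * t / n) for t in range(n)]
high = [(-1) ** t / math.sqrt(2) for t in range(n)]
low_loss = soft_frequency_loss(low, flat)
high_loss = soft_frequency_loss(high, flat)
```

With equal energy in both error signals, the high-frequency error receives the larger penalty, so a model trained with such a term is nudged toward preserving fine detail.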


[47] HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification cs.CVPDF

Debojyoti Ghosh, Adrijit Goswami

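TL;DR: HybridSOMSpikeNet combines convolutional feature extraction, differentiable self-organization, and a spiking neural network for efficient and accurate waste classification, reaching 97.39% test accuracy and outperforming existing methods.

Details

Motivation: Waste misclassification leads to environmental pollution and wasted resources; an intelligent, efficient classification method is needed to support sustainable development and the circular economy.

Result: Achieves 97.39% test accuracy on a ten-class waste dataset, outperforming existing models, with a lightweight computational profile suited to real-world deployment.

Insight: Beyond its technical contributions, the framework directly supports the Sustainable Development Goals (SDG 11 and SDG 12) by automating waste classification to raise recycling efficiency and reduce environmental pollution.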

Abstract: Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing. The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.


[48] Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward cs.CVPDF

Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu

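TL;DR: The paper systematically analyzes the challenges multimodal large language models (MLLMs) face in visual reasoning and proposes an agent-based architecture that adds lightweight visual modules to refine reasoning chains, significantly outperforming baseline models on multiple tasks.

Details

Motivation: MLLMs that combine visual and textual reasoning use chain-of-thought (CoT) prompting to solve complex visual tasks, yet still suffer from visual hallucinations and over-reliance on textual priors. The authors aim to address these problems through systematic diagnosis and an improved architecture.

Result: Experiments show that the system significantly outperforms a 7B-parameter baseline on tasks such as MMMU and MathVista, and even matches or surpasses much larger models.

Insight: The core insight is that future visual reasoning models need to integrate more specialized tools for analyzing visual content, and that pairing lightweight visual modules with an LLM can effectively improve reasoning and reduce hallucinations.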

Abstract: Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.


[49] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models cs.CVPDF

Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang

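TL;DR: MixKV proposes a novel KV cache compression method that combines importance with diversity to relieve the memory bottleneck in large vision-language models.

Details

Motivation: When large vision-language models process multi-modal sequences, KV cache growth creates a memory bottleneck that limits deployment scalability. Existing methods focus only on importance and overlook the semantic redundancy patterns in multi-modal KV caches.

Result: Experiments show that under extreme compression (budget=64) MixKV improves baseline methods by 5.1% on average, with especially large gains on GUI grounding tasks (8.0% and 9.0%), and has a similar effect on LLMs.

Insight: KV cache compression should consider not only importance but also diversity in order to cover the semantic distribution; MixKV offers a new direction for efficient deployment of multi-modal models.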

Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
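
The general idea of mixing importance with diversity when choosing which KV pairs to keep can be sketched as below. The abstract does not give MixKV's scoring rule, so this sketch makes several assumptions: diversity is taken as one minus the cosine similarity between a key and the mean key, the two scores are blended linearly with a fixed `mix` weight, and the function names are hypothetical. In the paper the balance is adapted per attention head, which this toy version leaves as a manual parameter.

```python
import math


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mix_select(keys, importance, budget, mix):
    """Keep `budget` entries ranked by (1 - mix) * importance + mix * diversity."""
    dim = len(keys[0])
    mean_key = [sum(key[i] for key in keys) / len(keys) for i in range(dim)]
    diversity = [1.0 - cosine(key, mean_key) for key in keys]
    scores = [(1.0 - mix) * imp + mix * div
              for imp, div in zip(importance, diversity)]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # kept positions, in original order


keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
importance = [0.9, 0.8, 0.7, 0.1]
importance_only = mix_select(keys, importance, budget=2, mix=0.0)
mixed = mix_select(keys, importance, budget=2, mix=0.8)
```

Ranking by importance alone keeps two near-identical keys, while the mixed score retains the one dissimilar key, illustrating how a diversity term can preserve semantic coverage in a redundant head.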


[50] AutoScape: Geometry-Consistent Long-Horizon Scene Generation cs.CVPDF

Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su

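TL;DR: AutoScape is a framework for long-horizon driving scene generation: an RGB-D diffusion model iteratively generates geometrically consistent keyframes, and a video diffusion model interpolates between them to produce coherent video frames.

Details

Motivation: To address geometric consistency and realism in long-horizon driving scene generation.

Result: Generates driving videos of over 20 seconds, improving FID and FVD by 48.6% and 43.0%, respectively.

Insight: Geometric consistency is key to long-horizon scene generation, and joint RGB-D modeling significantly improves generation quality.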

Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene’s appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.


[51] ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology cs.CVPDF

Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Diana Mechtcheriakova, Amirreza Mahbod

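TL;DR: This paper proposes ACS-SegNet, an attention-based dual-encoder model that combines the strengths of CNNs and ViTs for semantic segmentation of histopathology images and outperforms existing methods on public datasets.

Details

Motivation: Automated analysis of histopathology images is important for computer-aided diagnosis. Existing deep learning methods perform well, but combining CNNs and ViTs has not been fully explored, leaving room to improve segmentation performance.

Result: Achieves μIoU/μDice scores of 76.79%/86.87% on GCPS and 64.93%/76.60% on PUMA, outperforming existing benchmark methods.

Insight: Fusing CNN and ViT features lets their strengths complement each other, and attention plays a key role in integrating multi-scale features, offering a new direction for pathology image segmentation.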

Abstract: Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
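
For readers unfamiliar with the two reported metrics, the sketch below computes IoU and Dice from pooled counts. It assumes the μ prefix denotes micro-averaging, that is, true positives, false positives, and false negatives are summed over all classes before the ratios are taken; the abstract does not define the prefix, and the paper's exact averaging protocol may differ. The function name is hypothetical.

```python
def micro_iou_dice(pred, target, num_classes):
    """Return (IoU, Dice) from TP/FP/FN counts pooled over all classes."""
    tp = fp = fn = 0
    for cls in range(num_classes):
        for p, t in zip(pred, target):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1   # predicted cls, but the true label differs
            elif t == cls:
                fn += 1   # true label is cls, but it was missed
    iou_denominator = tp + fp + fn
    dice_denominator = 2 * tp + fp + fn
    iou = tp / iou_denominator if iou_denominator else 0.0
    dice = 2 * tp / dice_denominator if dice_denominator else 0.0
    return iou, dice


# Flattened label maps for six pixels and three tissue classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
iou, dice = micro_iou_dice(pred, target, num_classes=3)
```

Dice counts the overlap twice in both numerator and denominator, so for the same pooled counts it is never lower than IoU.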


[52] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion cs.CVPDF

Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski

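TL;DR: DyPE is a training-free dynamic position extrapolation method that extends pre-trained diffusion transformers to ultra-high-resolution generation with no additional sampling cost.

Details

Motivation: Current diffusion transformer models are computationally expensive for ultra-high-resolution image generation. DyPE addresses this by using the spectral progression of the diffusion process to adjust positional encodings dynamically.

Result: DyPE performs well on multiple benchmarks, generates images at resolutions such as 16 million pixels, and markedly improves fidelity at ultra-high resolution.

Insight: The low- and high-frequency behavior of the diffusion process can be exploited to dynamically improve generation, extending resolution without additional training.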

Abstract: Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism’s quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model’s positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.


[53] AlphaFlow: Understanding and Improving MeanFlow Models cs.CV | cs.LGPDF

Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov

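TL;DR: AlphaFlow proposes a unified framework that resolves the optimization conflict between the terms of the MeanFlow objective. By adjusting the objective over the course of training, it achieves better convergence and performance, with new state-of-the-art results on ImageNet.

Details

Motivation: MeanFlow is powerful for few-step generative modeling, but the terms of its objective have conflicting gradients, which slows convergence. The authors analyze the decomposition of the MeanFlow objective to improve optimization.

Result: On ImageNet-1K 256x256, the AlphaFlow-XL/2+ model achieves state-of-the-art FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).

Insight: Dynamically adjusting the weighting of objectives (e.g., with a curriculum) can resolve conflicts between them and improve the efficiency and performance of generative models.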

Abstract: MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).


[54] CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image cs.CVPDF

Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma

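TL;DR: This paper proposes Cupid, a method that accurately infers an object's camera pose, 3D shape, and texture from a single 2D image. Cupid casts 3D reconstruction as conditional sampling from a learned distribution of 3D objects and jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework.

Details

Motivation: Existing single-view 3D reconstruction methods often struggle to optimize pose and shape jointly and have difficulty producing high-fidelity 3D results. This paper addresses these issues with a unified generative framework.

Result: Cupid outperforms existing methods by over 3 dB in PSNR and reduces Chamfer Distance by over 10%, matches monocular estimators on pose accuracy, and surpasses baseline 3D generative models in visual fidelity.

Insight: Jointly modeling the distribution of pose and shape, combined with a two-stage refinement strategy, markedly improves the accuracy and quality of single-view 3D reconstruction.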

Abstract: This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.


[55] Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature cs.CV | eess.SPPDF

Lei Cheng, Siyang Cao

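TL;DR: This paper presents a multi-object tracking (MOT) framework that fuses radar and camera data, using online calibration and common features to improve tracking efficiency and reduce manual intervention.

Details

Motivation: Radar is often underused and treated as a supplementary sensor, even though it provides accurate target depth information. This paper aims to exploit radar fully and fuse it with camera data to improve tracking performance.

Result: Experiments show that the framework streamlines the sensor mapping process and improves tracking precision in both controlled environments and real traffic scenarios.

Insight: This is the first study, to the authors' knowledge, of radar-camera common features and their use in online calibration, offering a new direction for multi-sensor fusion.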

Abstract: This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role–despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system–our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT


[56] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model cs.CVPDF

Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng

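TL;DR: ARGenSeg proposes a new image segmentation paradigm based on autoregressive image generation: a multimodal large language model (MLLM) outputs visual tokens that a VQ-VAE decodes into dense masks, markedly improving segmentation accuracy and inference speed.

Details

Motivation: Existing methods that integrate image segmentation into MLLMs typically rely on discrete representations or task-specific decoders, which makes it hard to capture fine-grained visual details. ARGenSeg addresses this through image generation.

Result: The method surpasses prior state-of-the-art approaches on multiple segmentation datasets while markedly improving inference speed.

Insight: Recasting segmentation as a generation task captures fine-grained visual information more naturally and draws on the MLLM's strong understanding ability.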

Abstract: We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.


[57] Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers cs.CV | cs.LGPDF

Dean L Slack, G Thomas Hudson, Thomas Winterbottom, Noura Al Moubayed

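TL;DR: This paper proposes a pure-transformer autoregressive video prediction model focused on causal modeling of dynamic physical simulations, performing spatiotemporal reasoning in pixel space with a simple end-to-end approach.

Details

Motivation: Inspired by the performance and scalability of autoregressive large language models (LLMs), the authors extend the transformer architecture to video prediction, in particular causal modeling of physical dynamics. Existing video generation methods fall short here, so the authors try to address the problem with a simple design.

Result: The method extends the time horizon of physically accurate predictions by up to 50% while remaining comparable on video quality metrics. Interpretability experiments further show that the model generalizes to estimating out-of-distribution simulation parameters.

Insight: Pure transformer architectures show promise for video prediction: a simple design with pixel-space representations supports efficient spatiotemporal modeling and reasoning about physical dynamics. This provides a platform for future attention-based spatiotemporal video modeling.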

Abstract: Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.


[58] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation cs.CV | cs.AI | cs.CLPDF

Yuhan Liu, Lianhui Qin, Shengjie Wang

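TL;DR: The paper proposes Speculative Verdict (SV), a training-free framework that combines multiple lightweight experts with a large model for efficient multimodal reasoning, with strong results on information-intensive tasks.

Details

Motivation: Large vision-language models struggle on information-intensive images (e.g., dense text interleaved with graphics), where they have difficulty localizing key cues precisely and performing multi-hop reasoning.

Result: SV performs well on challenging benchmarks such as InfographicVQA, and compared with large proprietary models or training pipelines it both corrects errors and saves cost.

Insight: By synthesizing partially accurate reasoning paths, SV achieves error correction and efficient inference, offering a new direction for dense multimodal tasks.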

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict


[59] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives cs.CVPDF

Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang

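TL;DR: HoloCine is a model that generates coherent multi-shot long video narratives. It addresses the lack of global consistency in current text-to-video models, using novel attention mechanisms to generate narratives efficiently and consistently.

Details

Motivation: Current text-to-video models excel at generating isolated clips but fall short on coherent multi-shot narratives, which are the essence of storytelling. HoloCine aims to bridge this "narrative gap".

Result: HoloCine sets a new state of the art in narrative coherence and shows a persistent memory for characters and scenes as well as an intuitive grasp of cinematic techniques.

Insight: HoloCine marks a major shift from clip synthesis toward automated filmmaking and offers a feasible path to end-to-end cinematic creation.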

Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this “narrative gap” with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.


cs.CL [Back]

[60] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics cs.CL | cs.AI | G.1.10; G.4; I.2.6; I.2.7PDF

Xincheng Liu

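TL;DR: This paper evaluates the pedagogical soundness and usability of high-school physics lesson plans generated by five leading large language models (such as GPT-5 and Claude) under three prompt frameworks (TAG, RACE, COSTAR). Model choice mainly affects readability, whereas the prompt framework has a greater effect on content accuracy. The best configuration combines a readability-optimized model with the RACE framework.

Details

Motivation: The study examines how different AI models and prompt frameworks perform when generating lesson plans, especially in pedagogical quality and practicality, to give educators a reliable basis for choosing AI tools.

Result: (1) DeepSeek produced the most readable lesson plan; (2) the RACE framework yielded the highest content accuracy; (3) learning objectives clustered at the lower-order cognitive levels (Remember and Understand).

Insight: Model design governs readability, while the prompt framework governs instructional reliability. The best setup pairs a highly readable model with a structured prompt framework such as RACE.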

Abstract: This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected the factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom’s taxonomy. There were limited higher-order verbs in the learning objectives extracted. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
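
The FKGL values quoted in the abstract refer to the Flesch-Kincaid Grade Level, which is a simple function of average sentence length and average syllables per word. The sketch below shows the standard formula; the regular-expression tokenizer and the vowel-group syllable counter are rough assumptions made for illustration, and the abstract does not say which tool the study used, so scores from this sketch will not match its numbers exactly.

```python
import re


def fkgl_from_counts(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Estimate FKGL for a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fkgl_from_counts(len(words), len(sentences), syllables)


score = fkgl("The cat sat. The dog ran.")
```

Because the grade level grows with both sentence length and word length, a lesson plan at FKGL 8.64 reads at roughly a middle-school level, while one near 19.89 is far denser than typical high-school material.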


[61] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model cs.CLPDF

Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang

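TL;DR: ReDiff addresses error cascades in vision-language diffusion models with an active error-correction framework, markedly improving the coherence and factual accuracy of generated content.

Details

Motivation: Discrete diffusion models are promising for vision-language tasks, but a discrepancy between training and inference causes error cascades that degrade generation quality.

Result: ReDiff markedly improves the coherence and factual accuracy of generated content, and its parallel generation is more efficient than traditional denoising methods.

Insight: An active correction mechanism can break the error cascade and improve the stability and quality of generation.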

Abstract: Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert’s corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at https://rediff-hku.github.io/.


[62] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention cs.CL | cs.AI | 68T40 | I.2.11PDF

J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard

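TL;DR: Stream uses sparse attention to enable efficient mechanistic interpretability analysis of long-context LLMs, addressing the large memory requirements of traditional methods.

Details

Motivation: As LLM context lengths grow, traditional mechanistic interpretability methods become hard to scale because their memory requirements grow quadratically.

Result: On the RULER benchmark, Stream preserves critical retrieval paths while pruning 90-96% of interactions, markedly reducing memory requirements.

Insight: Stream makes long-context mechanistic interpretability feasible on consumer GPUs, helping democratize chain-of-thought monitoring.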

Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model’s next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
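
To make the idea of keeping only the top-$k$ key blocks per query concrete, here is a deliberately naive sketch that selects blocks from a dense score matrix. It is not Stream's hierarchical, binary-search-style algorithm and does not have its near-linear cost; the function name `topk_key_blocks`, the max-pooling of scores within a block, and the toy scores are all assumptions made for illustration, and causal masking is ignored.

```python
def topk_key_blocks(scores, block_size, k):
    """For each query row, return the sorted indices of the k highest-scoring key blocks.

    `scores` is a list of rows, one per query, holding attention scores over keys.
    A block's score is taken to be the maximum score inside it (an assumption).
    """
    kept = []
    for row in scores:
        blocks = [row[start:start + block_size] for start in range(0, len(row), block_size)]
        block_scores = [max(block) for block in blocks]
        ranked = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)
        kept.append(sorted(ranked[:k]))
    return kept


# Two queries, six keys, blocks of two keys each, keep two blocks per query.
scores = [
    [0.1, 0.2, 0.9, 0.1, 0.3, 0.3],
    [0.5, 0.1, 0.0, 0.0, 0.2, 0.6],
]
mask_blocks = topk_key_blocks(scores, block_size=2, k=2)
```

With three blocks per query and two kept, a third of the query-key interactions are discarded in this toy case; the pruning rates reported in the abstract correspond to keeping a far smaller fraction of blocks over much longer contexts.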


[63] Automated HIV Screening on Dutch EHR with Large Language Models cs.CLPDF

Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart–Jongbloed, Carlijn Jordans

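TL;DR: The paper proposes a new approach that uses a large language model (LLM) to analyze unstructured text in electronic health records (EHRs) for automated HIV screening, and reports high accuracy with a low false negative rate on clinical data from Erasmus University Medical Center in the Netherlands.

Details

Motivation: Efficient screening and early diagnosis of HIV are critical for reducing transmission. Large-scale laboratory testing is not feasible, but the widespread adoption of EHRs creates a new opportunity to use unstructured text such as clinical notes.

Result: Experiments show that the approach achieves high accuracy and a low false negative rate on clinical data.

Insight: The study shows the potential of LLMs for processing unstructured medical data and offers a new technical route to HIV screening.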

Abstract: Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient’s eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.


[64] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities cs.CL | cs.AIPDF

Nishant Balepur, Dang Nguyen, Dayeon Ki

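TL;DR: The paper proposes a game-based evaluation, built on the card game Dixit, to assess the capabilities of multimodal large language models (MLMs) holistically, overcoming the limitations of static or subjective evaluation.

Details

Motivation: MLMs are usually evaluated with static benchmarks or subjective comparisons, which cannot assess capabilities jointly, are subjective and costly, and let models exploit superficial shortcuts. The authors therefore use Dixit to provide a more holistic, objective, and engaging evaluation framework.

Result: Dixit win-rate rankings are perfectly correlated with rankings on popular MLM benchmarks. Games between human and MLM players also reveal differences in MLM reasoning and areas for improvement.

Insight: Games can be an effective tool for evaluating multimodal language models: they provide objective metrics and expose limitations in models' strategies and reasoning.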

Abstract: Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks – which cannot jointly assess MLM capabilities in a single task – or rely on human or model pairwise comparisons – which is highly subjective, expensive, and allows models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, are governed by fixed, objective rules, and make evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.


[65] Large Language Model enabled Mathematical Modeling cs.CL | cs.AIPDF

Guoyun Zhang

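TL;DR: This paper explores how combining large language models (LLMs) with optimization modeling can improve decision-making in operations research (OR). It examines how the DeepSeek-R1 model can bridge the expertise gap in traditional modeling through natural language understanding and code generation.

Details

Motivation: Traditional optimization methods (such as linear programming and mixed-integer programming) depend heavily on domain expertise, which limits the ability of non-experts to build models. The paper proposes using LLMs to lower this barrier and make OR more applicable in practice.

Result: DeepSeek-R1 performs well on benchmarks such as LiveCodeBench and Math-500, but its effectiveness in applied OR scenarios remains to be validated. The paper provides a systematic evaluation and mitigation methods.

Insight: Although LLMs such as GPT-4 perform strongly on natural language processing and reasoning tasks, their high cost and hallucinations limit practical use. DeepSeek-R1, as a cost-efficient alternative, may address these problems through targeted mitigation.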

Abstract: The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.


[66] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation cs.CL | cs.AI | cs.LGPDF

Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka

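TL;DR: This paper proposes a memory-augmented framework that uses semantic and episodic memory together with LLM-generated critiques to improve agent learning without parameter updates, yielding substantial performance gains.

Details

Motivation: Conventional fine-tuning is costly, inflexible, and opaque, while retrieval-augmented approaches that rely only on labels have limited performance. The paper explores a more flexible and efficient alternative.

Result: Across diverse tasks, incorporating critiques improves accuracy by up to 24.8 percent over retrieval-based baselines that rely only on labels.

Insight: Open-source and OpenAI models behave differently on fact-oriented versus preference-based data, and memory strategies and model characteristics jointly shape learning dynamics.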

Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques (capturing specific past experiences) and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.


[67] LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation cs.CL | cs.AI | cs.LGPDF

Le Ren, Xiangjian Zeng, Qingqiang Wu, Ruoxuan Liang

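TL;DR: LyriCAR is a curriculum reinforcement learning framework for lyric translation that uses difficulty-aware curriculum design and an adaptive strategy to improve translation quality and reduce training steps.

Details

Motivation: Existing lyric translation methods rely on hand-crafted rules and sentence-level modeling, which makes paragraph-level musical-linguistic patterns hard to handle and limits generalization.

Result: LyriCAR achieves state-of-the-art results on the EN-ZH lyric translation task and reduces training steps by nearly 40% while maintaining strong performance.

Insight: Curriculum design allocates training resources efficiently, accelerates convergence, and improves the model's handling of complex constraints.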

Abstract: Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at https://github.com/rle27/LyriCAR.


[68] LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation cs.CL | cs.AIPDF

Xin Lian, Kenneth D. Forbus

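TL;DR: The paper proposes a hybrid approach that combines large language models (LLMs) with a symbolic natural language understanding (NLU) system to interpret causal statements more reliably.

Details

Motivation: LLMs rely on probabilistic inference and are prone to hallucinated facts and inconsistent output structure. Symbolic NLU systems are interpretable but limited in coverage and hard to maintain.

Result: The hybrid approach works significantly better than the symbolic-only pipeline at extracting and interpreting quantities and causal laws from commonsense science texts.

Insight: Combining the breadth of LLMs with the reliability of symbolic NLU is a promising route to better results on natural language understanding tasks.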

Abstract: Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic & semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations to hopefully get the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.


[69] A Fundamental Algorithm for Dependency Parsing (With Corrections) cs.CLPDF

Michael A. Covington

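TL;DR: This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. It processes one word at a time and attaches each word as soon as it can be attached.

Details

Motivation: The aim is a more natural and efficient dependency parsing method that mirrors properties claimed for the parser in the human brain.

Result: The algorithm's worst-case complexity is $O(n^3)$, as for phrase-structure parsing, but in human language the worst case occurs only for small $n$.

Insight: By mirroring how humans are thought to parse, the paper argues that word-by-word dependency parsing is both feasible and efficient in practice.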

Abstract: This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is $O(n^3)$, but in human language, the worst case occurs only for small $n$.
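
The word-at-a-time strategy described in the abstract can be sketched as follows. This is a simplified illustration in the spirit of the algorithm, not the paper's exact procedure: the grammar is reduced to a hypothetical `can_link(head_tag, dependent_tag)` predicate over part-of-speech tags, each word gets at most one head, cycles are rejected, and no projectivity constraint is enforced.

```python
def is_ancestor(heads, node, candidate):
    """Return True if `candidate` is `node` itself or lies on node's chain of heads."""
    while node is not None:
        if node == candidate:
            return True
        node = heads[node]
    return False


def parse(tags, can_link):
    """Return heads[i] = index of the head of word i, or None if word i is unattached.

    Each new word j is compared with every earlier word i, nearest first, and is
    linked as soon as the grammar allows it.
    """
    heads = [None] * len(tags)
    for j in range(len(tags)):
        for i in range(j - 1, -1, -1):
            # Try the earlier word as a dependent of the new word.
            if heads[i] is None and can_link(tags[j], tags[i]) and not is_ancestor(heads, j, i):
                heads[i] = j
            # Otherwise try the earlier word as the head of the new word.
            elif heads[j] is None and can_link(tags[i], tags[j]) and not is_ancestor(heads, i, j):
                heads[j] = i
    return heads


# Toy grammar (an assumption): nouns govern determiners, verbs govern nouns.
RULES = {("NOUN", "DET"), ("VERB", "NOUN")}
heads = parse(["DET", "NOUN", "VERB"], lambda head, dep: (head, dep) in RULES)
```

For the tag sequence of "the dog barks", the determiner attaches to the noun when the noun arrives, and the noun attaches to the verb when the verb arrives, leaving the verb as the root. The pairwise comparison of each word with all earlier words is what drives the polynomial worst case.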


[70] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs cs.CL | cs.AIPDF

Yunpeng Xiao, Carl Yang, Mark Mai, Xiao Hu, Kai Shu

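TL;DR: This paper proposes a unifying paradigm that goes beyond MedQA for evaluating and improving large language models (LLMs) in real-world clinical decision-making, emphasizing the complexity of clinical backgrounds and questions.

Details

Motivation: Existing medical question-answering datasets such as MedQA are oversimplified and do not reflect the complexity of real clinical decision-making, so a more comprehensive evaluation paradigm is needed.

Result: By extending evaluation to metrics such as efficiency and explainability, the paper highlights the potential strengths and limitations of LLMs in clinical decision-making.

Insight: The complexity of real clinical settings calls for more comprehensive evaluation and improvement methods that go beyond traditional question-answering accuracy.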

Abstract: Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.


[71] Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training cs.CL | cs.AI | 68T50, 68T07, 68U35PDF

Alexandra Apostolopoulou, Konstantinos Kanaris, Athanasios Koursaris, Dimitris Tsakalidis, George Domalis

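TL;DR: To address bottlenecks in Greek natural language processing, this paper builds high-quality corpora and pre-trains a diverse set of architectures to produce Greek Embedding Models (GEMs), which substantially improve performance on Greek tasks, particularly in the legal domain.

Details

Motivation: Greek is a morphologically rich but moderately resourced language. Existing NLP research is fragmented and relies on early transformer architectures, and in the high-value legal domain in particular it lacks the ability to process long documents.

Result: GEM-RoBERTa and GEM-ConvBERT significantly outperform baseline models on downstream tasks.

Insight: High-quality corpora and architectural diversity are key to improving NLP performance for moderately resourced languages.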

Abstract: The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for the Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures that have not previously been applied to the Greek language, such as ELECTRA, ConvBERT and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. Extensive experiments on downstream tasks establish the effectiveness of the proposed approach and show that the GEM-RoBERTa and GEM-ConvBERT models significantly outperform existing baselines.


[72] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training cs.CLPDF

Mehrdad Ghassabi, Sadra Hakim, Hamidreza Baradaran Kashani, Pedram Rostami

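TL;DR: RLAIF and DPO are used to improve the reasoning ability of a small Persian language model, which then outperforms a predecessor trained on a much larger dataset on medical question answering.

Details

Motivation: In underrepresented languages such as Persian, the reasoning ability of small language models is critical for specialized applications such as medical question answering. The study explores efficient training methods that reduce dependence on large-scale data.

Result: The trained model, which used a dataset of only 2M preferred-answer tokens and 2.5M rejected-answer tokens, outperforms gaokerena-V (trained on about 57M tokens) on Persian medical reasoning.

Insight: Reasoning-focused training (e.g., CoT and DPO) is critical for small language models and yields efficient gains when data is scarce.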

Abstract: Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
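
As a rough illustration of the DPO objective mentioned in the abstract, the sketch below computes the standard DPO loss for a single preferred/rejected answer pair from sequence log-probabilities. The function name, the example numbers, and `beta=0.1` are assumptions made for illustration; the abstract does not state the authors' hyperparameters or implementation.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    All inputs are log-probabilities of a full response under either the
    trainable policy or the frozen reference model. `beta` is an assumed value.
    """
    # Implicit reward of each answer: how much the policy has moved away
    # from the reference model on that answer.
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (preferred_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Hypothetical log-probabilities for one CoT answer pair.
    print(dpo_loss(-42.0, -55.0, -45.0, -50.0))
```

A lower loss means the policy favors the preferred answer over the rejected one more strongly than the reference model does.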


[73] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity cs.CL | cs.AIPDF

Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei

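TL;DR: This paper proposes CreativityPrism, a holistic framework for evaluating the creativity of large language models (LLMs). It decomposes creativity into three dimensions (quality, novelty, and diversity) and evaluates across multiple tasks, domains, and metrics. An evaluation of 17 state-of-the-art LLMs finds a notable gap between proprietary and open-source models; performance is highly correlated across tasks within a domain but only weakly correlated across domains.

Details

Motivation: Existing creativity evaluation methods are fragmented and inconsistent, with no unified framework for measuring LLM creativity. The authors propose CreativityPrism as a holistic evaluation tool for analyzing LLM creativity more systematically.

Result: Experiments show a notable gap between proprietary and open-source models. Performance is highly correlated across tasks within the same domain and less so across domains. Quality and diversity are strongly correlated, whereas novelty correlates only weakly with either.

Insight: LLM creativity is domain- and dimension-specific: evaluation on a single task or dimension cannot capture a model's overall creativity, which underscores the need for holistic, multi-dimensional, multi-domain evaluation.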

Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations (models that perform well on one often excel on the other), whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.


[74] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning cs.CL | cs.AIPDF

Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng

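TL;DR: ARTER is a structured pipeline that combines adaptive routing with selective reasoning. Through candidate generation, context-based scoring, and targeted LLM reasoning, it improves both the performance and the efficiency of entity linking.

Details

Motivation: Traditional entity linking relies on large annotated datasets and extensive fine-tuning, while recent few-shot methods reduce training requirements but incur high, inefficient inference costs.

Result: On standard benchmarks, ARTER outperforms ReFinED by up to 4.47%, with an average gain of 2.53% on 5 of 6 datasets, and is twice as efficient in LLM tokens as pipelines that apply LLM reasoning to every mention.

Insight: An adaptive strategy that combines embedding-based and LLM-based signals can improve task performance, while dynamic routing keeps computational cost down.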

Abstract: Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
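
The easy/hard routing described in the abstract can be sketched as follows. This is only an illustration of the general idea: the two signals (`embedding_margin`, `llm_confidence`), the thresholds, and the `cheap_linker` and `llm_reasoner` callables are hypothetical placeholders, not ARTER's actual signals or components.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    embedding_margin: float  # hypothetical: score gap between the top two candidates
    llm_confidence: float    # hypothetical: LLM-based signal in [0, 1]


def route(mention, margin_threshold=0.2, confidence_threshold=0.7):
    """Label a mention 'easy' only when both (assumed) signals are confident."""
    is_easy = (mention.embedding_margin >= margin_threshold
               and mention.llm_confidence >= confidence_threshold)
    return "easy" if is_easy else "hard"


def link_all(mentions, cheap_linker, llm_reasoner):
    """Send easy mentions to the cheap linker and hard ones to LLM reasoning."""
    results = {}
    for mention in mentions:
        handler = cheap_linker if route(mention) == "easy" else llm_reasoner
        results[mention.text] = handler(mention)
    return results


if __name__ == "__main__":
    mentions = [Mention("Paris", 0.5, 0.9), Mention("Jordan", 0.05, 0.4)]
    # Stub handlers stand in for a real entity linker and a real LLM call.
    print(link_all(mentions, lambda m: "cheap linker", lambda m: "LLM reasoning"))
```

Only the mentions routed as hard incur LLM tokens, which is the source of the efficiency gain the abstract reports.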


[75] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation cs.CLPDF

Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li

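TL;DR: BoundRL is an efficient method for segmenting structured text. It uses reinforcement learning to generate segment boundaries, optimizing for semantic alignment and document reconstruction, which sharply reduces inference cost and hallucination.

Details

Motivation: As structured texts such as technical reports and generative AI prompts grow more complex, conventional sentence- or paragraph-level segmentation cannot effectively handle content that includes tables, code snippets, and placeholders. This motivates a more efficient segmentation method.

Result: Experiments show that BoundRL enables a small model (1.7B parameters) to outperform few-shot prompting of much larger models. RLVR yields significant improvements over supervised fine-tuning, and intermediate candidates further improve performance and generalization.

Insight: Generating only boundaries rather than full content allows efficient segmentation with less hallucination; reward design is critical to reinforcement learning performance; perturbing generated sequences improves generalization.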

Abstract: As structured texts become increasingly complex across diverse domains – from technical reports to generative AI prompts – the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL’s effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
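
The reconstruction step described in the abstract (generate only segment starts, then locate them in the original text) might look roughly like the sketch below. It assumes the model output has already been parsed into `(label, start_text)` pairs in document order; that output format, the function name, and the example prompt are illustrative assumptions rather than the paper's actual interface.

```python
def reconstruct_segments(document, boundaries):
    """Rebuild full segments from predicted segment starts.

    `boundaries` is a list of (label, start_text) pairs in document order.
    Each segment runs from its start to the next located start (or to the end
    of the document), so segment text is copied from the source, never generated.
    """
    starts = []
    cursor = 0
    for label, start_text in boundaries:
        position = document.find(start_text, cursor)
        if position == -1:
            continue  # a start that cannot be located is skipped
        starts.append((position, label))
        cursor = position + len(start_text)

    segments = []
    for index, (position, label) in enumerate(starts):
        end = starts[index + 1][0] if index + 1 < len(starts) else len(document)
        segments.append((label, document[position:end]))
    # Note: any text before the first located start is not assigned to a segment.
    return segments


if __name__ == "__main__":
    prompt = "You are a helper.\nRules: be brief.\nInput: {query}"
    predicted = [("role", "You are"), ("rules", "Rules:"), ("input", "Input:")]
    for label, text in reconstruct_segments(prompt, predicted):
        print(label, repr(text))
```

Because every segment is a slice of the original document, the output length the model must generate grows with the number of segments rather than the length of the text.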


[76] DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking cs.CLPDF

Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li

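TL;DR: DeepWideSearch is a new benchmark that evaluates whether information-seeking agents can combine deep reasoning with wide-scale information collection. Experiments show that current state-of-the-art agents perform poorly, with failures attributed to lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow.

Details

Motivation: Existing search agents cannot perform deep reasoning and wide-scale information collection at the same time, a critical deficiency for real-world applications such as market analysis.

Result: Even state-of-the-art agents achieve an average success rate of only 2.39%, confirming the difficulty of the task.

Insight: 1. Combining deep reasoning with wide-scale information collection is a major challenge for information-seeking agents. 2. Current agent architectures have significant weaknesses in multi-hop retrieval and large-context handling.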

Abstract: Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, with each item requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.


[77] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding cs.CL | cs.AIPDF

Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu

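TL;DR: This paper proposes Mixture-of-Minds, a multi-agent framework that decomposes table understanding into three roles (planning, coding, and answering) and combines code execution with a reinforcement learning (RL) based self-improvement training framework, yielding substantial gains in table understanding.

Details

Motivation: Current approaches to table understanding are limited: fine-tuned LLMs are prone to arithmetic errors and hallucination, while tool-based methods lack semantic understanding and rely on rigid schemas. An approach that integrates robust reasoning with reliable table processing is needed.

Result: Reaches 62.13% on TableBench, surpassing existing methods such as OpenAI-o4-mini-high.

Insight: Dividing work among collaborating agents and training them with RL improves the robustness and precision of table understanding while avoiding the limitations of any single approach.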

Abstract: Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning, yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.


[78] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models cs.CL | cs.AIPDF

Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum

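TL;DR: Using five tasks, this paper probes the spatial reasoning ability of large language models (LLMs) over textual input. Models perform moderately well on small instances, but performance drops sharply as task complexity grows, pointing to a lack of robust spatial representations.

Details

Motivation: To study how LLMs perform on spatial reasoning tasks and expose the gap between their linguistic and geometric reasoning abilities.

Result: LLMs achieve moderate success on small, simple instances, but accuracy falls by 42.7% on average, and by as much as 84%, as scale increases. Every test that began above 50% accuracy lost at least 48%.

Insight: LLM spatial reasoning is weak and degrades markedly with task complexity, suggesting that the underlying architectures lack robust spatial representations. This points the way for future benchmarks that integrate language and geometry.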

Abstract: This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.
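
To make the task family concrete, the sketch below generates a text grid and its ground-truth answer for a quadrant-identification item at a chosen grid size. The grid rendering, the quadrant naming, and the even-size assumption are hypothetical; the abstract does not specify the authors' exact prompt format.

```python
def quadrant(row, col, size):
    """Ground-truth quadrant of a cell in a size x size grid (size assumed even)."""
    half = size // 2
    vertical = "top" if row < half else "bottom"
    horizontal = "left" if col < half else "right"
    return f"{vertical}-{horizontal}"


def render_grid(size, row, col):
    """Render the grid as text, marking the target cell with 'X'."""
    lines = []
    for r in range(size):
        cells = ("X" if (r, c) == (row, col) else "." for c in range(size))
        lines.append(" ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    # Scaling the grid dimension is how difficulty would be increased.
    for size in (4, 8):
        print(render_grid(size, size - 1, 1))
        print("answer:", quadrant(size - 1, 1, size))
```

The same pair of functions can be reused at larger sizes to produce progressively harder items with automatically checkable answers.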


[79] Context-level Language Modeling by Learning Predictive Context Embeddings cs.CL | cs.AIPDF

Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen

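TL;DR: This paper proposes ContextLM, a framework that augments standard next-token prediction (NTP) with a next-context prediction objective so that the model learns representations of multi-token contexts, improving its ability to capture semantics and long-range contextual relationships.

Details

Motivation: Standard NTP is limited in capturing higher-level semantic structure and long-range contextual relationships. ContextLM addresses this by predicting multi-token contexts.

Result: Experiments on the GPT2 and Pythia model families show that ContextLM consistently improves perplexity and downstream task performance with minimal computational overhead.

Insight: Next-context prediction offers a scalable and efficient path to stronger language models, improving long-range coherence and attention allocation.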

Abstract: Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model’s capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.


[80] Citation Failure: Definition, Analysis and Efficient Mitigation cs.CLPDF

Jan Buchmann, Iryna Gurevych

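TL;DR: This paper introduces the concept of "citation failure" and distinguishes it from "response failure". It analyzes when citation failure occurs using the CITECONTROL benchmark and proposes the CITENTION framework to improve LLM citation quality efficiently; experiments demonstrate its effectiveness.

Details

Motivation: Citations in LLM-based RAG systems are meant to simplify response verification, but they suffer from citation failure: the model gives a helpful response yet fails to cite complete evidence. The paper analyzes this phenomenon and seeks an efficient remedy, rather than conflating it with response failure.

Result: Citation failures increase as the relation between response and evidence becomes more complex. CITENTION substantially improves citation on the CITECONTROL benchmark and in transfer settings.

Insight: 1. Citation failure and response failure should be handled separately. 2. The complexity of the relation between response and evidence is a key factor in citation failure. 3. A framework that combines multiple citation methods can improve citation quality.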

Abstract: Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.


[81] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering cs.CLPDF

Lei Tang, Wei Zhou, Mohsen Mesgar

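TL;DR: This paper presents the first systematic study of process reward models (PRMs) for table question answering (TQA). PRMs that combine textual and code verification help with answer selection but generalize poorly out of domain, exposing limitations of current PRMs.

Details

Motivation: PRMs perform well on complex reasoning tasks such as mathematics, but their use on semi-structured data such as tables has not been explored. TQA poses distinctive challenges, including abundant irrelevant information and loosely connected reasoning steps, that warrant study.

Result: PRMs can aid answer selection in TQA but generalize poorly to out-of-domain data. Step-level verification performance correlates only weakly with answer accuracy, possibly because of weak step dependencies and loose causal links.

Insight: Current PRMs are limited on TQA; more robust, process-aware verifiers are needed to handle semi-structured data.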

Abstract: Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.


[82] Teaching Language Models to Reason with Tools cs.CL | cs.AIPDF

Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao

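TL;DR: This paper proposes CoRT, a framework that uses Hint-Engineering, a data synthesis strategy, to teach large reasoning models to use code interpreters (CIs) effectively, optimizing their interaction with external tools and improving mathematical reasoning accuracy and efficiency.

Details

Motivation: Large reasoning models struggle with complex mathematical operations, and directly integrating external tools such as code interpreters creates a conflict between the model's internal reasoning and the external deterministic knowledge. CoRT is designed to optimize the interaction between model and tool.

Result: Across five mathematical reasoning datasets, CoRT yields absolute improvements of 4% and 8% for the 32B and 1.5B models, respectively, and reduces token usage by about 30% and 50%, respectively.

Insight: 1. Cooperation between external tools and the model's internal reasoning is key. 2. Hint-Engineering is an effective strategy for optimizing this interaction. 3. CoRT offers a new way to combine probabilistic and deterministic reasoning.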

Abstract: Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model’s internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT’s effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: https://github.com/ChengpengLi1003/CoRT.


[83] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models cs.CL | cs.AIPDF

Matteo Silvestri, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

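TL;DR: This study examines whether large language models (LLMs) are contaminated with prior knowledge of widely used tabular datasets such as Adult Income and Titanic. Experiments show clear contamination effects when datasets contain strong semantic cues (e.g., meaningful column names or interpretable value categories); once those cues are removed or randomized, performance declines sharply to near-random levels.

Details

Motivation: Current LLM evaluations often overlook dataset contamination, especially in tabular reasoning tasks. The study asks whether LLMs rely on memorization of public datasets rather than genuine generalization.

Result: LLMs perform well on datasets with strong semantic cues, but performance falls to near-random levels once the cues are removed.

Insight: LLMs' apparent tabular reasoning ability may partly reflect memorized dataset semantics rather than genuine generalization, so more rigorous evaluation design is needed.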

Abstract: Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs’ apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
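
One way to remove the semantic cues discussed in the abstract is to replace column names and categorical values with opaque identifiers before prompting the model. The sketch below shows this idea on a list of row dictionaries; the function name, the aliasing scheme, and the toy rows are assumptions for illustration and not the authors' probing protocol.

```python
import random


def strip_semantic_cues(rows, seed=0):
    """Rename columns to neutral ids and map string values to opaque codes.

    `rows` is a non-empty list of dicts sharing the same keys. Numeric values
    are kept as-is; only names and string categories are anonymized.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    rng.shuffle(columns)  # the alias index should not reveal the original order
    column_alias = {name: f"col_{index}" for index, name in enumerate(columns)}

    value_alias = {}  # per-column mapping from category to opaque code
    anonymized = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if isinstance(value, str):
                codes = value_alias.setdefault(name, {})
                value = codes.setdefault(value, f"v{len(codes)}")
            new_row[column_alias[name]] = value
        anonymized.append(new_row)
    return anonymized, column_alias


if __name__ == "__main__":
    # Toy rows with made-up values, loosely shaped like a passenger table.
    rows = [
        {"sex": "male", "age": 22},
        {"sex": "female", "age": 38},
        {"sex": "male", "age": 26},
    ]
    anonymized, mapping = strip_semantic_cues(rows)
    print(mapping)
    print(anonymized)
```

Comparing model accuracy on the original and anonymized versions of the same rows gives a simple check for reliance on memorized names and categories.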


[84] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning) cs.CLPDF

Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza

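TL;DR: This paper investigates whether pre-training exclusively on dialogue data yields formally and functionally apt small language models. The models underperform on most standard BabyLM benchmarks but excel at dialogue continuation prediction, and DPO fine-tuning further improves performance.

Details

Motivation: To examine whether pre-training on dialogue data alone can produce capable small language models, and how different fine-tuning strategies affect their generations.

Result: The pre-trained models underperform on most standard BabyLM benchmarks but excel at dialogue continuation; DPO fine-tuning further improves performance on the dialogue benchmark.

Insight: Dialogue data suffices for strong performance on specific tasks, but generic fine-tuning methods such as PPO may not suit every task, whereas DPO fine-tuning is better suited to improving dialogue ability.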

Abstract: We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce “more communicative” text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.


[85] The Impact of Negated Text on Hallucination with Large Language Models cs.CL | cs.AIPDF

Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim

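TL;DR: This paper examines hallucination in large language models (LLMs) on negated text, finds that LLMs struggle to detect hallucinations in negated text, and constructs the NegHalu dataset to verify this.

Details

Motivation: LLMs perform well in natural language processing, but how negated text affects their hallucination behavior remains unclear. The study aims to fill this gap.

Result: LLMs detect hallucinations poorly in negated text and often produce logically inconsistent or unfaithful judgments.

Insight: Negated text raises the risk of hallucination in LLMs, and the effect may be tied to internal processing mechanisms; further work on model design is needed.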

Abstract: Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.


Son T. Luu, Trung Vo, Hiep Nguyen, Khanh Quoc Tran, Kiet Van Nguyen

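TL;DR: This paper introduces the VLSP 2025 MLQA-TSR shared task, a multimodal legal question answering task focused on Vietnamese traffic sign regulation, with two subtasks: multimodal legal retrieval and multimodal question answering.

Details

Motivation: To advance research on Vietnamese multimodal legal text processing and provide a benchmark dataset for the multimodal legal domain, with a focus on traffic sign regulation.

Result: The best results are an F2 score of 64.55% on multimodal legal retrieval and an accuracy of 86.30% on multimodal question answering.

Insight: The work provides a benchmark for building and evaluating intelligent systems in the multimodal legal domain, particularly for Vietnamese traffic sign regulation.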

Abstract: This paper presents VLSP 2025 MLQA-TSR, the shared task on multimodal legal question answering on traffic sign regulation at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
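
For readers unfamiliar with the retrieval metric, the sketch below computes the F-beta score for one query, with `beta=2` giving the F2 score, which weights recall more heavily than precision. The function name and the article identifiers are made up for illustration, and how the shared task averages scores across queries is not stated in the abstract.

```python
def f_beta(retrieved, relevant, beta=2.0):
    """F-beta score for a single query; beta=2 gives F2 (recall-weighted)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Hypothetical article ids: four retrieved, two of them relevant.
    retrieved = ["art_1", "art_2", "art_3", "art_4"]
    relevant = ["art_1", "art_2"]
    print(round(f_beta(retrieved, relevant), 4))
```

With precision 0.5 and recall 1.0, as in this example, F2 is noticeably higher than the balanced F1 would be, which reflects the metric's preference for not missing relevant legal articles.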


[87] NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew cs.CLPDF

Shaltiel Shmidman, Avi Shmidman, Moshe Koppel

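TL;DR: This paper introduces NeoDictaBERT and NeoDictaBERT-bilingual, BERT-style models built on the NeoBERT architecture with a focus on Hebrew text, which perform strongly on Hebrew benchmarks.

Details

Motivation: To overcome the performance limits of BERT models on Hebrew tasks and take advantage of newer transformer architectures, the authors propose improved BERT models dedicated to Hebrew.

Result: The models outperform existing ones on almost all Hebrew benchmarks, and NeoDictaBERT-bilingual is especially strong on retrieval tasks.

Insight: Newer transformer architectures can markedly improve BERT-style models on language-specific tasks, especially for relatively low-resource languages such as Hebrew.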

Abstract: Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.


[88] LM-mixup: Text Data Augmentation via Language Model based Mixup cs.CLPDF

Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu

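TL;DR: LM-Mixup is a language-model-based data augmentation method that distills low-quality instruction data into high-quality data, significantly improving the efficiency and performance of instruction-tuned LLMs.

Details

Motivation: High-quality instruction data is scarce, while low-quality data is often discarded, causing information loss. Existing augmentation methods work poorly on low-quality data, and their evaluation is poorly defined.

Result: LLMs fine-tuned on the distilled data, only about 3% of the full dataset, surpass full-dataset training and are competitive with state-of-the-art high-quality data selection methods.

Insight: Low-quality data is a valuable resource when properly distilled and augmented, and can significantly improve LLM efficiency and performance.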

Abstract: Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, which is trained by first performing supervised fine-tuning on MIXTURE and then optimizing with reinforcement learning. This process uses three complementary reward signals (quality, semantic alignment, and format compliance) via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
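
The abstract names three reward signals optimized with GRPO. A minimal sketch of the group-relative advantage step is given below, assuming the three signals are combined by a weighted sum; the weights, function names, and sample rewards are hypothetical, and the paper's actual reward shaping may differ.

```python
from statistics import mean, pstdev


def combined_reward(quality, alignment, format_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three signals (the weighting is an assumption)."""
    w_quality, w_alignment, w_format = weights
    return (w_quality * quality
            + w_alignment * alignment
            + w_format * (1.0 if format_ok else 0.0))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


if __name__ == "__main__":
    # Four hypothetical distillations sampled for the same input cluster.
    group = [
        combined_reward(0.9, 0.8, True),
        combined_reward(0.4, 0.7, True),
        combined_reward(0.6, 0.2, False),
        combined_reward(0.7, 0.9, True),
    ]
    print(group_relative_advantages(group))
```

Because advantages are computed relative to the group mean, samples are rewarded for being better than their siblings, and no separate value network is needed.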


[89] ARC-Encoder: learning compressed text representations for large language models cs.CL | cs.AIPDF

Hippolyte Pilchen, Edouard Grave, Patrick Pérez

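TL;DR: ARC-Encoder is a new text compression encoder that compresses context into continuous representations which replace token embeddings in decoder LLMs, improving computational efficiency without fine-tuning the target model or modifying its architecture.

Details

Motivation: Techniques such as retrieval-augmented generation and chain-of-thought reasoning lead to longer contexts and higher inference costs, while the most effective compression methods require fine-tuning or architectural changes that can degrade the model's general abilities. This paper explores an alternative.

Result: Achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference, and can be adapted to multiple decoder LLMs.

Insight: ARC-Encoder is a flexible, efficient, portable encoder that compresses context without modifying the target model and generalizes across different LLMs.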

Abstract: Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x \in \{4, 8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release training code at https://github.com/kyutai-labs/ARC-Encoder ; the fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
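To make the "$x$-times fewer representations" idea concrete, the toy sketch below shrinks a sequence of 16 embedding vectors to 4 by averaging each run of $x = 4$ consecutive vectors. Average pooling is only an assumed stand-in chosen to show the length reduction; ARC-Encoder is a learned encoder, and the function name and data here are hypothetical.

```python
def pool_embeddings(embeddings, x):
    """Average each run of x consecutive vectors.

    This is a stand-in for a learned compressor: it only demonstrates
    that the output sequence is x times shorter than the input.
    """
    pooled = []
    for start in range(0, len(embeddings), x):
        chunk = embeddings[start:start + x]
        dim = len(chunk[0])
        pooled.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


# 16 made-up two-dimensional "token embeddings".
tokens = [[float(i), float(-i)] for i in range(16)]
compressed = pool_embeddings(tokens, 4)
```

The decoder would then attend over 4 vectors instead of 16, which is where the inference savings would come from in this simplified picture.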


[90] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts cs.CL | cs.AIPDF

Sangmitra Madhusudan, Kaige Chen, Ali Emami

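TL;DR: The paper introduces CenterBench, a dataset that tests whether language models rely on syntactic structure or on semantic pattern matching when interpreting complex nested sentences. It finds that models abandon structural analysis in favor of semantic associations as sentence complexity increases.

Details

Motivation: Existing methods cannot distinguish whether a language model is performing syntactic analysis or simple semantic pattern matching, which limits assessment of its true language understanding.

Result: The performance gap between plausible and implausible sentences widens with complexity (median gaps of up to 26.8 percentage points), indicating that models rely on semantic associations rather than structural analysis. Humans, by contrast, show more variable semantic effects.

Insight: Semantic plausibility is not always beneficial and can backfire, particularly on tasks that require causal reasoning. Model reasoning remains limited by semantic shortcuts, overthinking, and similar issues.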

Abstract: When language models correctly parse “The cat that the dog chased meowed,” are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like “The cat [that the dog chased] meowed”) where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
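The plausibility gap described above can be sketched as a simple computation: for each nesting depth, subtract accuracy on implausible sentences from accuracy on plausible ones, then take the median across models. The accuracies below are invented for illustration and are not results from the paper; the data layout and function name are likewise assumptions.

```python
from statistics import median

# Hypothetical accuracies in percent, keyed by nesting depth, stored as
# (plausible, implausible) pairs. These numbers are NOT from the paper.
results = {
    "model_a": {1: (95.0, 93.0), 2: (90.0, 80.0), 3: (85.0, 60.0)},
    "model_b": {1: (92.0, 91.0), 2: (88.0, 75.0), 3: (80.0, 58.0)},
}


def median_gap_by_depth(results):
    """Median plausible-minus-implausible gap across models, per depth."""
    depths = sorted({d for per_model in results.values() for d in per_model})
    gaps = {}
    for depth in depths:
        per_model_gaps = [
            per_model[depth][0] - per_model[depth][1]
            for per_model in results.values()
        ]
        gaps[depth] = median(per_model_gaps)
    return gaps


gaps = median_gap_by_depth(results)
```

A gap that grows with depth, as in this toy data, is the pattern the abstract interprets as a shift from structural analysis to semantic association.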


[91] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning cs.CL | cs.AIPDF

Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia

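TL;DR: GlobalRAG is a reinforcement learning framework that targets global planning and evidence execution in multi-hop question answering. Through question decomposition and dedicated reward mechanisms, it significantly improves performance.

Details

Motivation: Multi-hop question answering suffers from insufficient global planning and unfaithful use of evidence, which limits the effectiveness of reinforcement learning in retrieval-augmented generation (RAG).

Result: Using only 8k training examples, GlobalRAG achieves an average improvement of 14.2% in both EM and F1 on multi-hop question answering tasks.

Insight: Reinforcement learning has substantial potential in multi-hop question answering, particularly for optimizing global planning and evidence execution.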

Abstract: Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) the absence of global planning to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
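One way to picture a progressive weight annealing strategy is a schedule that shifts the training signal from process-oriented rewards toward the outcome reward as training proceeds. The sketch below assumes a linear schedule from 1.0 to 0.0 and a convex combination of the two rewards; the schedule shape, its direction, and all names are assumptions, since the abstract does not specify them.

```python
def process_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the process-reward weight (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def total_reward(process_reward, outcome_reward, step, total_steps):
    """Blend process-oriented and outcome-based rewards at a given step."""
    w = process_weight(step, total_steps)
    return w * process_reward + (1.0 - w) * outcome_reward


# Early in training the process reward dominates; late, the outcome reward.
early = total_reward(0.8, 0.2, step=0, total_steps=100)
middle = total_reward(0.8, 0.2, step=50, total_steps=100)
late = total_reward(0.8, 0.2, step=100, total_steps=100)
```

Here `process_reward` would stand for something like the planning-quality and subgoal-completion signals, and `outcome_reward` for final-answer correctness.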


[92] Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search cs.CLPDF

Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou

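TL;DR: The paper proposes a Multi-Agent Cognitive Decision Framework (MACDF) that addresses the limitations of e-commerce search by shifting from the traditional retrieval-ranking paradigm to proactive decision support, significantly improving recommendation accuracy and user satisfaction on complex queries.

Details

Motivation: The traditional retrieval-ranking paradigm suffers from semantic gaps, high decision costs, and a lack of professional shopping guidance, and it cannot meet the multi-stage cognitive decision needs of users on e-commerce platforms.

Result: Offline and online tests show that MACDF significantly improves recommendation accuracy and user satisfaction, especially for complex queries (e.g., negation, multiple constraints, or reasoning demands).

Insight: Multi-agent cognitive systems have the potential to redefine e-commerce search; future work could explore their application in other scenarios.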

Abstract: The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF’s significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.


[93] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks cs.CL | cs.AIPDF

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

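TL;DR: The paper studies the fairness of ChatGPT when coding communication data from collaborative tasks and finds no significant bias across gender and racial groups.

Details

Motivation: Large-scale assessment of collaboration and communication depends on manually coding communication data. ChatGPT can perform this coding but might be biased against certain demographic groups, so its fairness needs to be verified.

Result: ChatGPT-based coding shows no significant differences across demographic groups, supporting its fairness.

Insight: Validating the fairness of ChatGPT supports its broader use in educational and collaboration assessment.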

Abstract: Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.


[94] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model cs.CL | cs.AIPDF

Haoyu Wang, Sihang Jiang, Yuyan Chen, Yitong Wang, Yanghua Xiao

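TL;DR: The paper examines whether large language models (LLMs) possess human-like curiosity-driven learning, using a framework based on the human curiosity questionnaire 5DCR to evaluate LLMs across dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity.

Details

Motivation: To investigate whether LLMs are capable of curiosity-driven learning and how this capability affects their reasoning and active learning abilities.

Result: LLMs show a stronger thirst for knowledge than humans but are more conservative in uncertain environments; curious behaviors are also confirmed to enhance reasoning and active learning abilities.

Insight: LLMs have the potential to exhibit human-like curiosity, which provides experimental support for future research on LLM learning capabilities and innovation.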

Abstract: Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess a capability for curiosity-driven learning akin to that of humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model’s reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.


[95] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI cs.CL | cs.AIPDF

Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully

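TL;DR: Large Reasoning Models (LRMs) use English as a lingua franca in multilingual reasoning, which improves accuracy on complex tasks but can cause failures through translation errors.

Details

Motivation: To explore how LRMs reason in multilingual settings, focusing on their tendency to default to English for non-English questions and the resulting risk of translation errors.

Result: English reasoning traces exhibit more cognitive behaviors and yield higher accuracy, with the gap widening as tasks become more complex; however, translation steps can introduce errors.

Insight: Reliance on English in multilingual reasoning can lead to translation-related failure modes, so language generality must be balanced against accuracy.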

Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM’s reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode, getting “Lost in Translation,” where translation steps lead to errors that would have been avoided by reasoning in the question’s language.


[96] CantoNLU: A benchmark for Cantonese natural language understanding cs.CLPDF

Junghyun Min, York Hay Ng, Sophia Chan, Helena Shunhua Zhao, En-Shiun Annie Lee

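TL;DR: The paper introduces CantoNLU, a benchmark for Cantonese natural language understanding that spans seven tasks, and compares the performance of different models.

Details

Motivation: Although Cantonese has many speakers, it remains under-resourced due to policy and diglossia. The lack of evaluation frameworks limits progress in Cantonese natural language processing.

Result: Cantonese-adapted models perform best overall, monolingual models do better on syntactic tasks, and Mandarin models remain competitive in certain settings.

Insight: When Cantonese domain data is scarce, direct transfer may be sufficient; continual pre-training is effective for improving performance across tasks.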

Abstract: Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.


[97] Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost cs.CL | cs.AIPDF

Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang

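TL;DR: The paper presents the first systematic analysis of large reasoning models (LRMs) as evaluators of machine translation quality and proposes a method for calibrating LRM thinking that significantly improves evaluation performance.

Details

Motivation: Although large reasoning models perform well on complex tasks, their potential for machine translation evaluation has not been fully explored.

Result: On the WMT24 Metrics benchmarks, the method reduces thinking budgets by roughly 35x and improves evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B gains 8.7 correlation points).

Insight: Calibrating the thinking process of LRMs can significantly improve both their efficiency and their performance in machine translation evaluation, showing their potential for fine-grained automatic evaluation.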

Abstract: Recent advancements in large reasoning models (LRMs) have introduced an intermediate “thinking” process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to “overthink” simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.


[98] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text cs.CL | cs.AI | I.2.7PDF

Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, Vanessa Murdock

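TL;DR: The paper builds a dataset grounded in a real-world application to evaluate large language models (LLMs) on Responsible AI dimensions (such as fairness, quality, and safety), addressing a gap in existing evaluation methods.

Details

Motivation: Existing evaluation methods typically focus on high-level tasks (such as text generation) and overlook Responsible AI dimensions (such as fairness) in specific applications. For example, a protected attribute may be highly relevant in one application but less relevant in another.

Result: The dataset is shown to help identify quality, veracity, safety, and fairness gaps in LLMs, providing a concrete resource for the research community.

Insight: Evaluating LLMs requires attention to the specific application and to multiple Responsible AI dimensions, not only to performance on generic tasks.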

Abstract: Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.


cs.IR [Back]

[99] Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures cs.IR | cs.CL | cs.CV | cs.LGPDF

Rahul Raja, Arpita Vats

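TL;DR: This survey reviews recent advances in multimedia-aware question answering systems, focusing on retrieval methods and cross-modal reasoning architectures, including techniques for aligning vision, language, and audio modalities.

Details

Motivation: With the rapid growth of multimedia content, traditional text-based question answering systems are no longer sufficient; retrieval and reasoning over multimodal data must be integrated to improve system performance.

Result: Summarizes the performance of current multimedia-aware question answering systems and identifies key challenges such as cross-modal alignment and latency-accuracy tradeoffs.

Insight: Future research directions include improving cross-modal semantic alignment, optimizing the balance between latency and accuracy, and building more robust, context-aware multimedia question answering systems.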

Abstract: Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.


[100] Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts cs.IR | cs.CVPDF

Drew B. Thomas

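TL;DR: The paper proposes a new method that combines LLMs and RAG for large-scale classification of early modern religious woodcut images, significantly improving classification precision.

Details

Motivation: Traditional image-based and keyword-based classification methods perform poorly on early modern religious images, so an automated method that combines visual and textual information is needed.

Result: Achieves 87% and 92% precision at five and four levels of classification, respectively, significantly outperforming traditional methods.

Insight: Demonstrates the potential of LLMs and RAG in art history and digital humanities, offering a powerful tool for large-scale analysis of visual archives.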

Abstract: This paper presents a novel methodology for classifying early modern religious images by using Large Language Models (LLMs) and vector databases in combination with Retrieval-Augmented Generation (RAG). The approach leverages the full-page context of book illustrations from the Holy Roman Empire, allowing the LLM to generate detailed descriptions that incorporate both visual and textual elements. These descriptions are then matched to relevant Iconclass codes through a hybrid vector search. This method achieves 87% and 92% precision at five and four levels of classification, significantly outperforming traditional image and keyword-based searches. By employing full-page descriptions and RAG, the system enhances classification accuracy, offering a powerful tool for large-scale analysis of early modern visual archives. This interdisciplinary approach demonstrates the growing potential of LLMs and RAG in advancing research within art history and digital humanities.


cs.PL [Back]

[101] Prompt Decorators: A Declarative and Composable Syntax for Reasoning, Formatting, and Control in LLMs cs.PL | cs.AI | cs.CL | cs.HCPDF

Mostapha Kalami Heris

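TL;DR: The paper proposes Prompt Decorators, a declarative, composable syntax that controls LLM behavior through compact control tokens (such as +++Reasoning and +++Tone), improving transparency, reusability, and interpretability.

Details

Motivation: Existing prompt engineering relies on verbose natural-language instructions and lacks consistency, modularity, and interpretability, which limits control over how LLMs reason and express outputs.

Result: Use cases show improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains.

Insight: Prompt Decorators offer a reusable, interpretable interface design paradigm for LLMs, with implications for the scalability and behavioral consistency of AI systems.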

Abstract: Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic="Systems Thinking"). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive & Generative and Expressive & Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
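The decorator tokens quoted in the abstract suggest a simple surface syntax: `+++Name` optionally followed by `key=value` arguments in parentheses. The sketch below is a minimal parser for that surface form only, written from the three examples in the abstract; the regular expression, the function name, and the handling of arguments are assumptions, and the paper's actual grammar, scoping model, and processing pipeline may differ.

```python
import re

# Matches "+++Name" with an optional "(key=value, ...)" argument list.
DECORATOR_RE = re.compile(r"\+\+\+(\w+)(?:\(([^)]*)\))?")


def parse_decorators(prompt):
    """Split a prompt into (decorators, task_text).

    Simplification: arguments are split on commas, so quoted values
    that themselves contain commas are not supported here.
    """
    decorators = []
    for name, raw_args in DECORATOR_RE.findall(prompt):
        args = {}
        for pair in filter(None, (p.strip() for p in raw_args.split(","))):
            key, _, value = pair.partition("=")
            args[key.strip()] = value.strip().strip('"')
        decorators.append((name, args))
    task = DECORATOR_RE.sub("", prompt).strip()
    return decorators, task


decorators, task = parse_decorators(
    '+++Reasoning +++Tone(style=formal) '
    '+++Import(topic="Systems Thinking") Explain feedback loops.'
)
```

Separating the decorator list from the task text mirrors the stated goal of decoupling task intent from execution behavior.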


cs.HC [Back]

[102] Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations cs.HC | cs.AI | cs.CLPDF

Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni

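TL;DR: Empathic Prompting is a novel multimodal human-AI interaction framework that enriches LLM conversations with non-verbal context (such as facial expressions), improving conversational fluidity and emotional alignment.

Details

Motivation: Traditional multimodal interaction requires explicit user control, while emotional signals are often hard to capture in textual exchanges, especially in domains such as healthcare or education.

Result: A preliminary evaluation (N=5) shows that the system consistently integrates non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity.

Insight: The framework shows the potential of implicitly enriching LLM conversations with non-verbal signals in emotionally sensitive domains such as healthcare and education.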

Abstract: We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users’ emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational smoothness and alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users’ emotional signals are critical yet often opaque in verbal exchanges.


cs.RO [Back]

[103] Kinaema: a recurrent sequence model for memory and pose in motion cs.RO | cs.CV | I.2.10PDF

Mert Bulent Sariyildiz, Philippe Weinzaepfel, Guillaume Bono, Gianluca Monaci, Christian Wolf

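TL;DR: Kinaema is a recurrent sequence model that lets robots in continuous operation use previously observed information for self-localization and efficient navigation.

Details

Motivation: To address how robots navigating large scenes can exploit past observations, particularly in continuous operation, by efficiently integrating a stream of visual observations and localizing accurately.

Result: Kinaema performs well in large-scene navigation, localizing and navigating to goals efficiently, and is more computationally efficient than classical transformers with attention over an observation history.

Insight: A recurrent transformer framework shows strong history integration and computational efficiency in robot navigation tasks.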

Abstract: One key aspect of spatially aware robots is the ability to “find their bearings”, i.e., to correctly situate themselves in previously seen spaces. In this work, we focus on this particular scenario of continuous robotics operations, where information observed before an actual episode start is exploited to optimize efficiency. We introduce a new model, Kinaema, and agent, capable of integrating a stream of visual observations while moving in a potentially large scene, and upon request, processing a query image and predicting the relative position of the shown space with respect to its current position. Our model does not explicitly store an observation history and therefore does not have hard constraints on context length. It maintains an implicit latent memory, which is updated by a transformer in a recurrent way, compressing the history of sensor readings into a compact representation. We evaluate the impact of this model in a new downstream task we call “Mem-Nav”. We show that our large-capacity recurrent model maintains a useful representation of the scene, navigates to goals observed before the actual episode start, and is computationally efficient, in particular compared to classical transformers with attention over an observation history.


[104] Dino-Diffusion Modular Designs Bridge the Cross-Domain Gap in Autonomous Parking cs.RO | cs.CVPDF

Zixuan Wu, Hengyuan Zhang, Ting-Hsuan Chen, Yuliang Guo, David Paz

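TL;DR: The paper proposes Dino-Diffusion Parking (DDP), a domain-agnostic autonomous parking system that combines visual foundation models with diffusion-based planning to achieve generalized perception and robust motion planning under distribution shifts. Experiments show a parking success rate above 90% across a range of adversarial scenarios, along with promising sim-to-real transfer.

Details

Motivation: Current end-to-end autonomous parking methods perform well in-domain but lack robustness under domain shifts (such as weather and lighting changes). The authors aim for a domain-agnostic solution that requires no additional data.

Result: The parking success rate exceeds 90% across the tested distribution-shift scenarios. Tests in a 3D Gaussian splatting environment indicate promising sim-to-real transfer.

Insight: 1. Combining visual foundation models with diffusion-based planning strengthens cross-domain robustness; 2. Zero-shot transfer demonstrates the generalization ability of the method; 3. Sim-to-real transfer shows potential for practical deployment.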

Abstract: Parking is a critical pillar of driving safety. While recent end-to-end (E2E) approaches have achieved promising in-domain results, robustness under domain shifts (e.g., weather and lighting changes) remains a key challenge. Rather than relying on additional data, in this paper, we propose Dino-Diffusion Parking (DDP), a domain-agnostic autonomous parking pipeline that integrates visual foundation models with diffusion-based planning to enable generalized perception and robust motion planning under distribution shifts. We train our pipeline in CARLA under regular settings and transfer it to more adversarial settings in a zero-shot fashion. Our model consistently achieves a parking success rate above 90% across all tested out-of-distribution (OOD) scenarios, with ablation studies confirming that both the network architecture and algorithmic design significantly enhance cross-domain performance over existing baselines. Furthermore, testing in a 3D Gaussian splatting (3DGS) environment reconstructed from a real-world parking lot demonstrates promising sim-to-real transfer.


[105] GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation cs.RO | cs.AI | cs.CVPDF

Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji

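TL;DR: GSWorld is a robotic manipulation simulator that combines 3D Gaussian Splatting with physics engines, supporting closed-loop policy development and sim2real training without real robots.

Details

Motivation: To develop a simulation environment that efficiently supports the development and training of robotic manipulation policies, lowering cost and improving performance.

Result: Builds a database containing 3 robot embodiments and more than 40 objects, and demonstrates several sim2real applications such as zero-shot policy learning and benchmarking.

Insight: Combining closed-loop simulation with high-quality rendering can significantly improve the efficiency of developing and training robotic manipulation policies.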

Abstract: This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates “closing the loop” of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: https://3dgsworld.github.io/.


cs.AI [Back]

[106] Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory cs.AI | cs.CL | cs.LGPDF

Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding

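TL;DR: The paper proposes Branch-and-Browse, a web exploration framework that uses tree-structured reasoning and action memory to achieve efficient and controllable web exploration, significantly improving task success rate and execution efficiency.

Details

Motivation: Existing autonomous web agent methods handle multi-step reasoning and backtracking poorly and are computationally costly. The authors propose the Branch-and-Browse framework to address these problems.

Result: On the WebArena benchmark, the framework achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods.

Insight: Tree-structured reasoning and action memory are effective means of improving the efficiency and controllability of web agents.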

Abstract: Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.


[107] AI PB: A Grounded Generative Agent for Personalized Investment Insights cs.AI | cs.CE | cs.CLPDF

Daewoo Park, Suho Park, Inseok Hong, Hanwool Lee, Junkyu Park

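TL;DR: AI PB is a generative agent for retail finance that proactively generates user-specific investment insights through deterministic routing, hybrid retrieval, and a multi-stage recommendation mechanism, delivering trustworthy AI output in a high-stakes financial setting.

Details

Motivation: Traditional reactive chatbots cannot meet the compliance and personalization requirements of high-stakes finance, so an agent that proactively generates trustworthy insights is needed.

Result: The system operates under Korean financial regulations, and human QA and system metrics demonstrate its effectiveness in generating trustworthy investment insights.

Insight: In high-stakes domains, explicit routing and layered safety measures are key to trustworthy generative AI.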

Abstract: We present AI PB, a production-scale generative agent deployed in real retail finance. Unlike reactive chatbots that answer queries passively, AI PB proactively generates grounded, compliant, and user-specific investment insights. It integrates (i) a component-based orchestration layer that deterministically routes between internal and external LLMs based on data sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and a finance-domain embedding model, and (iii) a multi-stage recommendation mechanism combining rule heuristics, sequential behavioral modeling, and contextual bandits. Operating fully on-premises under Korean financial regulations, the system employs Docker Swarm and vLLM across 24 NVIDIA H100 GPUs. Through human QA and system metrics, we demonstrate that grounded generation with explicit routing and layered safety can deliver trustworthy AI insights in high-stakes finance.


[108] What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation cs.AI | cs.CLPDF

Heejin Do, Jaehui Hwang, Dongyoon Han, Seong Joon Oh, Sangdoo Yun

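TL;DR: The paper proposes a finer-grained way to evaluate the reasoning quality of large language models (LLMs), decomposing reasoning into two dimensions, relevance and coherence, and introducing causal stepwise evaluation (CaSE) to avoid hindsight bias. Validation shows that the method not only assesses reasoning quality more accurately but also improves task performance when used to curate training data.

Details

Motivation: Existing LLM evaluation focuses mainly on final-answer correctness and overlooks the quality of the reasoning process. This coarse-grained signal limits model improvement.

Result: CaSE agrees with human annotations when assessing reasoning quality, and curating training data with it improves final task performance.

Insight: Fine-grained reasoning evaluation helps analyze and debug LLMs and can directly improve model performance through data curation, showing practical value beyond simple correctness checks.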

Abstract: Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures if a step is grounded in the problem; coherence measures if it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.
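The key constraint in causal stepwise evaluation is that each step is judged with access only to the problem and the steps before it. The sketch below shows that control flow with a toy judge based on word overlap; the function names and the overlap heuristic are hypothetical stand-ins, since a real judge would more plausibly be an LLM or a human rater, and the abstract does not describe the scoring procedure in detail.

```python
def case_evaluate(problem, steps, judge):
    """Score each step using only the problem and the preceding steps."""
    scores = []
    for i, step in enumerate(steps):
        context = steps[:i]  # later steps and the final answer stay hidden
        scores.append(judge(problem, context, step))
    return scores


def toy_judge(problem, context, step):
    """Hypothetical judge: word overlap as a crude proxy for both aspects."""
    step_words = set(step.lower().split())
    # Relevance: does the step share any word with the problem statement?
    relevance = bool(step_words & set(problem.lower().split()))
    # Coherence: does the step share any word with the previous step?
    coherence = not context or bool(
        step_words & set(context[-1].lower().split())
    )
    return {
        "relevance": relevance,
        "coherence": coherence,
        "context_len": len(context),
    }


problem = "Tom has 3 apples and buys 2 more apples"
steps = [
    "Tom starts with 3 apples",
    "Adding 2 apples gives 5 apples",
    "Bananas are yellow",
]
scores = case_evaluate(problem, steps, toy_judge)
```

Because the judge never sees what comes after a step, a step cannot be rated favorably merely because the chain eventually reached the right answer.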


[109] Real Deep Research for AI, Robotics and Beyond cs.AI | cs.CL | cs.CV | cs.LGPDF

Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding

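TL;DR: To address the information overload caused by the rapid growth of research in AI and robotics, this paper proposes the Real Deep Research (RDR) framework, which systematically analyzes research trends and cross-domain opportunities and offers starting points for new research.

Details

Motivation: AI and robotics now produce more than 10,000 papers a year. Fast-evolving trends and interdisciplinary demands make it hard for researchers to keep up with the latest progress.

Result: The appendix provides extensive analysis results, showing RDR applied across the analyzed topics.

Insight: The framework can help researchers track emerging trends and cross-disciplinary opportunities more efficiently, supporting innovation in AI and other fields.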

Abstract: With the rapid growth of research in AI and robotics, now producing over 10,000 papers annually, it has become increasingly difficult for researchers to stay up to date. Fast-evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one’s expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross-domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR), a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.


cs.LG [Back]

[110] Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values cs.LG | cs.CLPDF

Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi

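TL;DR: The paper proposes RLEV, which incorporates human-defined value signals directly into the reward function so that LLM training better matches human priorities. It outperforms correctness-only baselines across multiple RL algorithms and model scales.

Details

Motivation: Existing RLVR methods train models with binary correctness rewards alone and ignore differences in task importance, so models respond to high-value and low-value tasks without distinguishing between them. RLEV aims to address this with explicit human value signals.

Result: RLEV outperforms the baselines across multiple RL algorithms and model scales. Models improve value-weighted accuracy and also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones.

Insight: Explicit human value signals can effectively guide optimization and make models more sensitive to task priority. RLEV remains robust even under noisy value signals (such as difficulty-based labels), indicating that value alignment is a practical path.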

Abstract: We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
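
Illustrative sketch (not from the paper): the abstract describes folding a human-defined value into the reward and reports value-weighted accuracy, but gives neither formula. The snippet assumes the simplest form, reward equal to the prompt's value when the answer is correct and zero otherwise, and accuracy weighted by value. Both function names and both formulas are assumptions; the paper's actual reward may be scaled or clipped differently.

```python
from typing import Sequence


def value_weighted_reward(correct: bool, value: float) -> float:
    """Assumed reward: the prompt's value if the answer is correct, else zero."""
    return value if correct else 0.0


def value_weighted_accuracy(corrects: Sequence[bool], values: Sequence[float]) -> float:
    """Share of total available value that was actually earned."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    earned = sum(v for c, v in zip(corrects, values) if c)
    return earned / total
```

With values 1, 2 and 5 and only the middle prompt answered incorrectly, plain accuracy is 2/3 while value-weighted accuracy is 6/8, which shows how a binary correctness reward and a value-aware objective can rank the same policy differently.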


[111] BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation cs.LG | cs.CL | q-bio.BMPDF

Liang Ye, Shengqin Chen, Jiazhu Dai

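TL;DR: This paper proposes BadGraph, a backdoor attack on latent diffusion models for text-guided graph generation. It poisons training data with textual triggers to covertly induce attacker-specified subgraphs.

Details

Motivation: As graph generation advances rapidly, its security problems, especially backdoor vulnerabilities, are becoming more prominent. Prior work has focused on backdoor attacks against image diffusion and unconditional graph generation, while attacks on text-guided graph generation have received little attention.

Result: On four benchmark datasets, a poisoning rate below 10% achieves a 50% attack success rate and a 24% poisoning rate achieves over 80%, with negligible performance impact on benign samples.

Insight: The study exposes security risks in latent diffusion models for text-guided graph generation and underscores the need for defenses against such attacks in applications such as drug discovery.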

Abstract: The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided, graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate below 10% can achieve a 50% attack success rate, while 24% suffices for a success rate over 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models for text-guided graph generation, highlight the serious risks in applications of these models, such as drug discovery, and underscore the need for robust defenses against backdoor attacks in such diffusion models.


[112] Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples cs.LG | cs.AI | cs.CL | cs.CVPDF

Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus

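TL;DR: The paper proposes an efficient LLM adaptation method that needs only a single gradient step on 100 samples. By inspecting a carefully chosen subset of matrices and using singular-value gradients, it sharply reduces search time while improving downstream accuracy.

Details

Motivation: LASER can improve an LLM's downstream accuracy without gradient-based fine-tuning, but its exhaustive per-matrix search is inefficient and unsuited to rapid deployment. This paper aims to remove that overhead with a more efficient adaptation algorithm.

Result: Experiments show accuracy gains of up to 24.6 percentage points with substantially reduced search time; adaptation requires only 100 samples and a single gradient step.

Insight: The key finding is that adaptation to downstream tasks is dominated by prompting style rather than dataset size, so a small number of samples suffices for efficient adaptation.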

Abstract: Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM weight matrices can boost downstream accuracy – without any gradient-based fine-tuning. Yet LASER’s exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) only a small, carefully chosen subset of matrices needs to be inspected – eliminating the layer-by-layer sweep, (ii) the gradient of each matrix’s singular values pinpoints which matrices merit reduction, (iii) increasing the factorization search space by allowing matrix rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) evaluating on just 100 samples rather than the full training data – both for computing the indicative gradients and for measuring the final accuracy – suffices to further reduce the search time; we explain this by the fact that adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets – entirely without fine-tuning.
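
Illustrative sketch (not from the paper): two building blocks mentioned in the abstract can be written in a few lines of NumPy. `truncate_rank` drops the components with the smallest singular values, which is the generic rank-reduction step. `singular_value_grads` uses the standard identity that, for W = U diag(s) Vᵀ with distinct singular values, the derivative of a loss with respect to s_i is u_iᵀ G v_i, where G is the loss gradient with respect to W. The function names are hypothetical, and how the paper turns these gradients into a selection score is not stated in the abstract.

```python
import numpy as np


def truncate_rank(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W: keep the k largest singular components."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k, :]


def singular_value_grads(W: np.ndarray, G: np.ndarray) -> np.ndarray:
    """dL/ds_i = u_i^T G v_i, given G = dL/dW (assumes distinct singular values)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    # Sum over rows j and columns k of G for each singular index i.
    return np.einsum("ji,jk,ik->i", U, G, Vt)
```

A quick sanity check on the identity: passing `G = W` returns the singular values themselves, since u_iᵀ W v_i = s_i.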


[113] Synthetic Data for Robust Runway Detection cs.LG | cs.CVPDF

Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Fabrice Jimenez, Thomas Oberlin

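TL;DR: The paper studies runway detection for autonomous landing systems and proposes generating synthetic images with a commercial flight simulator to complement a small set of annotated real images. By controlling image generation and how real and synthetic data are combined, it improves the accuracy and robustness of standard object detection models.

Details

Motivation: In critical applications such as autonomous landing systems, training data must cover all possible conditions, including rare scenarios, but collecting and labeling real data is expensive. Synthetic data offers a cheap yet reliable alternative, provided the distribution shift between synthetic and real data is addressed.

Result: Experiments show that standard object detection models trained on combined synthetic and real data achieve accurate predictions and remain robust under unseen conditions such as nighttime images.

Insight: Synthetic data has potential in critical applications, but it needs to be combined with domain adaptation to address distribution shift and improve performance in real-world scenarios.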

Abstract: Deep vision models are now mature enough to be integrated in industrial and possibly critical applications such as autonomous navigation. Yet the data collection and labeling needed to train such models require too much effort and cost for a single company or product. This drawback is more significant in critical applications, where training data must include all possible conditions including rare scenarios. From this perspective, generating synthetic images is an appealing solution, since it allows cheap yet reliable coverage of all the conditions and environments, if the impact of the synthetic-to-real distribution shift is mitigated. In this article, we consider the case of runway detection, which is a critical part of autonomous landing systems developed by aircraft manufacturers. We propose an image generation approach based on a commercial flight simulator that complements a few annotated real images. By controlling the image generation and the integration of real and synthetic data, we show that standard object detection models can achieve accurate prediction. We also evaluate their robustness with respect to adverse conditions, in our case nighttime images, that were not represented in the real data, and show the benefit of using a customized domain adaptation strategy.


[114] MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs cs.LG | cs.CVPDF

Jan Sobotka, Luca Baroni, Ján Antolík

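TL;DR: MEIcoder is a decoding method built on neuron-specific most exciting inputs (MEIs), combined with a structural similarity index measure loss and adversarial training, that achieves high-performance decoding of visual stimuli on small datasets.

Details

Motivation: Because high-throughput neural recordings are hard to obtain in primates and humans, deep learning decoding techniques face data scarcity. MEIcoder aims to address this by using MEIs to improve decoding performance.

Result: Experiments show that MEIcoder can reconstruct high-fidelity natural images from as few as 1,000-2,500 neurons and fewer than 1,000 training samples, outperforming existing methods.

Insight: MEIs are the main driver of the performance gains, and the method offers a practical solution for neuroscience and neuroengineering applications.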

Abstract: Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and fewer than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in the early visual system and provide practical insights for neuroscience and neuroengineering applications.


quant-ph [Back]

[115] Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems quant-ph | cs.AI | cs.CL | math-ph | math.MPPDF

Xi He, Sirui Lu, Bei Zeng

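TL;DR: The paper proposes a multi-agent, human-in-the-loop workflow for co-designing quantum codes with prescribed transversal diagonal gates. The approach combines systematic enumeration with exact analytical reconstruction, offering a new route to quantum code design.

Details

Motivation: Quantum code design often has to satisfy specific transversal diagonal gate conditions, which traditional methods struggle to achieve efficiently. The paper aims to address this through collaborating agents in a multi-agent system.

Result: Systematic sweeps produce tables of codes realizing attainable cyclic logical groups; for example, for K=3 a logical group of order 16 is obtained on 6 qubits. A new ((6,4,2)) code is also exhibited.

Insight: 1. The multi-agent collaborative workflow markedly improves the efficiency and scalability of quantum code design. 2. The SSLP framework supports modular analysis and exact reconstruction, offering a new approach to complex quantum problems. 3. The human-in-the-loop design adds flexibility and reliability.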

Abstract: We present a multi-agent, human-in-the-loop workflow that co-designs quantum codes with prescribed transversal diagonal gates. It builds on the Subset-Sum Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL) equalities via small LPs. The workflow is powered by GPT-5 and implemented within TeXRA (https://texra.ai), a multi-agent research assistant platform that supports an iterative tool-use loop agent and a derivation-then-edit workflow reasoning agent. We work in a LaTeX-Python environment where agents reason, edit documents, execute code, and synchronize their work to Git/Overleaf. Within this workspace, three roles collaborate: a Synthesis Agent formulates the problem; a Search Agent sweeps/screens candidates and exactifies numerics into rationals; and an Audit Agent independently checks all KL equalities and the induced logical action. As a first step we focus on distance $d=2$ with nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits, systematic sweeps yield certificate-backed tables cataloging attainable cyclic logical groups, all realized by new codes; e.g., for $K=3$ we obtain order $16$ at $n=6$. From verified instances, the Synthesis Agent abstracts recurring structures into closed-form families and proves they satisfy the KL equalities for all parameters. It further demonstrates that SSLP accommodates residue degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal controlled-phase $\mathrm{diag}(1,1,1,i)$. Overall, the workflow recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction. It yields reproducible code constructions, supports targeted extensions to larger $K$ and higher distances, and leads toward data-driven classification.


stat.AP [Back]

[116] AI Pose Analysis and Kinematic Profiling of Range-of-Motion Variations in Resistance Training stat.AP | cs.CVPDF

Adam Diamant

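TL;DR: The paper develops an AI-based pose estimation pipeline to quantify movement kinematics in resistance training. Comparing lengthened partial (pROM) and full range-of-motion (fROM) training, it finds that pROM repetitions have a smaller ROM and shorter durations, especially in the eccentric phase. Participant-level differences are the main source of variability.

Details

Motivation: Traditional resistance training research relies on subjective assessment or limited measurement tools and lacks precise kinematic quantification. This study uses AI pose analysis to supply more accurate data for comparing pROM and fROM training.

Result: pROM repetitions have a smaller ROM and shorter durations (especially in the eccentric phase), and participant-level differences are the main source of variability. The %ROM metric is relatively consistent across exercises.

Insight: AI pose analysis gives resistance training research a new quantitative tool, reveals how pROM and fROM differ, and can support more individualized training prescription.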

Abstract: This study develops an AI-based pose estimation pipeline to enable precise quantification of movement kinematics in resistance training. Using video data from Wolf et al. (2025), which compared lengthened partial (pROM) and full range-of-motion (fROM) training across eight upper-body exercises in 26 participants, 280 recordings were processed to extract frame-level joint-angle trajectories. After filtering and smoothing, per-set metrics were derived, including range of motion (ROM), tempo, and concentric/eccentric phase durations. A random-effects meta-analytic model was applied to account for within-participant and between-exercise variability. Results show that pROM repetitions were performed with a smaller ROM and shorter overall durations, particularly during the eccentric phase of movement. Variance analyses revealed that participant-level differences, rather than exercise-specific factors, were the primary driver of variation, although there is substantial evidence of heterogeneous treatment effects. We then introduce a novel metric, %ROM, which is the proportion of full ROM achieved during pROM, and demonstrate that this definition of lengthened partials remains relatively consistent across exercises. Overall, these findings suggest that lengthened partials differ from full ROM training not only in ROM, but also in execution dynamics and consistency, highlighting the potential of AI-based methods for advancing research and improving resistance training prescription.
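
Illustrative sketch (not from the paper): the abstract defines %ROM as the proportion of full ROM achieved during pROM. Assuming ROM is measured as the spread between the largest and smallest joint angle in a trajectory, the metric reduces to a ratio of two spreads. The function names, the angle units (degrees), and the choice to compare one partial trajectory against one full trajectory are assumptions; the study's filtering, smoothing, and per-set aggregation are not reproduced here.

```python
from typing import Sequence


def range_of_motion(angles: Sequence[float]) -> float:
    """ROM as max minus min joint angle over a trajectory (degrees assumed)."""
    return max(angles) - min(angles)


def percent_rom(partial_angles: Sequence[float], full_angles: Sequence[float]) -> float:
    """%ROM: partial-range ROM as a percentage of full-range ROM."""
    full = range_of_motion(full_angles)
    if full == 0:
        raise ValueError("full-range trajectory has zero range of motion")
    return 100.0 * range_of_motion(partial_angles) / full
```

For example, a partial repetition spanning 90 to 150 degrees against a full repetition spanning 30 to 150 degrees covers 60 of 120 degrees, so `percent_rom` returns 50.0.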