cs.CV [Total: 47]
cs.CL [Total: 53]
cs.LG [Total: 6]
cs.CR [Total: 1]
cs.RO [Total: 2]
cs.AI [Total: 4]
eess.AS [Total: 1]
eess.IV [Total: 1]
cs.MA [Total: 1]
cs.SE [Total: 1]

cs.CV

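[1] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts cs.CV | cs.AI | cs.RO

Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim

TL;DR: This paper presents a two-phase vision-language QA system that answers high-level perception, prediction, and planning questions in autonomous driving scenes. Combining a large multimodal model, historical frames, and metadata-augmented prompts improves answer accuracy.

Details

Motivation: High-level vision-language QA for autonomous driving involves complex perception, prediction, and planning questions, and existing methods leave room for improvement on these tasks. The paper aims to improve the accuracy and robustness of the QA system through engineered prompts and context augmentation.

Result: On a driving QA benchmark, Phase-1 with few-shot prompting outperforms the zero-shot baseline (65.1% vs. 62.61%); self-consistency raises accuracy to 66.85%; Phase-2 reaches 67.37% overall accuracy, and the system maintains 96% accuracy under severe visual corruption.

Insight: Carefully engineered prompts and contextual grounding can substantially improve pretrained vision-language models on autonomous driving tasks. The system's robustness under corruption suggests it is practical in complex scenes.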

Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
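
The self-consistency ensemble mentioned in this abstract can be illustrated with a plain majority vote over the final answers of several sampled reasoning chains. This is a minimal sketch, not the authors' implementation: the function name `self_consistent_answer`, the answer normalization, and the sample answers are all assumptions made for illustration.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its share of the votes.

    `sampled_answers` holds the final answer extracted from each sampled
    reasoning chain (hypothetical input; the extraction step is not shown).
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Normalize so that "b", "B " and "B" count as the same vote.
    normalized = [a.strip().upper() for a in sampled_answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical answers from five sampled chains for one multiple-choice question.
samples = ["B", "b", "C", "B ", "A"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

The agreement ratio is a cheap confidence signal: a low value flags questions where the sampled chains disagree.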

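[2] $Δ$t-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction cs.CV | cs.AI

Zhengbo Zhou, Dooman Arefan, Margarita Zuley, Shandong Wu

TL;DR: The paper proposes $Δ$t-Mamba3D, a new state-space model for spatio-temporal modeling of mammogram sequences acquired at irregular time intervals, improving breast cancer risk prediction.

Details

Motivation: Existing methods fail to fully exploit spatio-temporal information in sequences of high-resolution medical images. They either collapse spatial information into vectors or rely on spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps, which limits predictive performance.

Result: On breast cancer risk prediction, the model outperforms established recurrent, transformer, and state-space variants, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores.

Insight: The results indicate that explicitly modeling inter-exam time intervals and multi-scale spatio-temporal context matters for longitudinal medical image analysis, and that a computationally efficient design makes long sequences tractable.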

Abstract: Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware $\Delta$t-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
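
To make "integrating the true time difference between exams into the state transitions" concrete, the sketch below applies the textbook zero-order-hold discretization of a scalar linear state-space model, using the actual gap between exams as the step size. It is only an illustration under stated assumptions: the scalar parameters `a` and `b`, the function names, and the example gaps are hypothetical, and the paper's selective scanning mechanism and 3D fusion module are not reproduced here.

```python
import math


def zoh_step(h, x, dt, a=-0.5, b=1.0):
    """One update h' = a_bar * h + b_bar * x of a scalar state-space model.

    a_bar and b_bar come from zero-order-hold discretization of
    dh/dt = a * h + b * x over a step of length dt (a must be nonzero).
    """
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar * h + b_bar * x


def scan(inputs, gaps):
    """Run the recurrence over a sequence with per-step time gaps."""
    h, states = 0.0, []
    for x, dt in zip(inputs, gaps):
        h = zoh_step(h, x, dt)
        states.append(h)
    return states


# Hypothetical per-exam features and inter-exam gaps in years.
states = scan([1.0, 1.0, 1.0], [1.0, 0.5, 3.0])
print(states)
```

With a negative `a`, a longer gap shrinks `a_bar`, so older exams contribute less to the current state than recent ones.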

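[3] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models cs.CV

Aritra Bhowmik, Denis Korzhenkov, Cees G. M. Snoek, Amirhossein Habibian, Mohsen Ghafoorian

TL;DR: The paper proposes MoAlign, a motion-centric representation alignment method that learns a disentangled motion subspace from a pretrained video encoder and aligns diffusion model features to it, improving the motion coherence and physical plausibility of text-to-video generation.

Details

Motivation: Existing text-to-video diffusion models often produce motion that lacks coherence and physical plausibility because they insufficiently understand video motion dynamics. Prior work addresses this by aligning with video encoder features, but those features entangle appearance and motion dynamics, which limits the gains.

Result: Experiments on VideoPhy, VideoPhy2, VBench, and VBench-2.0 show improved physical commonsense in generated videos while preserving adherence to text prompts. A user study supports these findings.

Insight: Disentangling motion dynamics is key to better video generation; feature alignment can efficiently transfer a pretrained encoder's motion knowledge into a generative model.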

Abstract: Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models’ insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.

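[4] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions cs.CV | cs.AI | cs.CL

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis

TL;DR: PoSh is a new evaluation metric for image descriptions that uses scene graphs to guide LLMs-as-a-Judge and produces scores grounded in fine-grained errors. DOCENT is a new dataset used to validate PoSh and to serve as a benchmark for detailed image description.

Details

Motivation: Existing image description metrics (e.g., CIDEr, SPICE) were designed for short texts and struggle to capture attribute and relation errors in long descriptions. A more sensitive evaluation method is needed.

Result: On DOCENT, PoSh correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman ρ) and can serve as a reward function that improves model performance.

Insight: Foundation models still struggle with images that have rich scene dynamics; DOCENT offers a demanding new task for gauging VLM progress.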

Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.

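[5] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning cs.CV

Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang

TL;DR: UniHPR is a unified pose representation learning method that aligns embeddings of images, 2D poses, and 3D poses through singular value-based contrastive learning, and performs well on 2D/3D pose estimation.

Details

Motivation: The correlations among modalities (images, 2D keypoints, 3D skeletons, etc.) have rarely been studied systematically. UniHPR aims to fill this gap with a unified, better-performing pose representation.

Result: The method achieves MPJPE 49.9mm on Human3.6M and PA-MPJPE 51.6mm on 3DPW, with a pose retrieval error of 9.24mm MPJPE.

Insight: Singular value-based contrastive learning is an effective tool for cross-modal alignment, and a unified pose representation can improve downstream task performance.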

Abstract: In recent years, there has been a growing interest in developing effective alignment pipelines to generate unified representations from different modalities for multi-modal fusion and generation. As an important component of Human-Centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, Object tracking, etc. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and lots of other modalities. Yet, there are limited instances where the correlation among all of those representations has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics: MPJPE 49.9mm on the Human3.6M dataset and PA-MPJPE 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations in Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
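
For readers unfamiliar with the reported metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below uses standard-library Python only; the function name and the toy coordinates are illustrative assumptions. PA-MPJPE additionally aligns the prediction to the ground truth with a Procrustes fit before measuring, which is not shown.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between corresponding 3D joints.

    Both arguments are equal-length sequences of (x, y, z) tuples in the
    same unit (millimetres in the numbers reported above).
    """
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and the same length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


# Toy two-joint skeleton: one joint is exact, the other is off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, gt)
print(error)
```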

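[6] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing cs.CV | 68U10, 68T07, 68T45 | I.4.6; I.2.10; I.5.4; J.3

Eyad Gad, Seif Soliman, M. Saeed Darweesh

TL;DR: This paper proposes an attention-based 3D U-Net architecture, combined with digital image processing techniques, to improve brain tumor segmentation. The authors report that the method outperforms related studies on the BraTS 2020 dataset.

Details

Motivation: Standard U-Net models struggle with irregular tumor shapes and ambiguous boundaries in brain tumor segmentation, and training on high-resolution MRI data demands substantial compute and suffers from class imbalance.

Result: On BraTS 2020, the model achieves a Dice coefficient of 0.975, specificity of 0.988, and sensitivity of 0.995.

Insight: Combining attention mechanisms with image processing techniques can improve the accuracy and robustness of brain tumor segmentation, supporting more reliable clinical diagnosis.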

Abstract: In the realm of medical diagnostics, rapid advancements in Artificial Intelligence (AI) have yielded remarkable improvements in brain tumor segmentation. Encoder-Decoder architectures, such as U-Net, have played a transformative role by effectively extracting meaningful representations in 3D brain tumor segmentation from Magnetic resonance imaging (MRI) scans. However, standard U-Net models encounter challenges in accurately delineating tumor regions, especially when dealing with irregular shapes and ambiguous boundaries. Additionally, training robust segmentation models on high-resolution MRI data, such as the BraTS datasets, necessitates high computational resources and often faces challenges associated with class imbalance. This study proposes the integration of the attention mechanism into the 3D U-Net model, enabling the model to capture intricate details and prioritize informative regions during the segmentation process. Additionally, a tumor detection algorithm based on digital image processing techniques is utilized to address the issue of imbalanced training data and mitigate bias. This study aims to enhance the performance of brain tumor segmentation, ultimately improving the reliability of diagnosis. The proposed model is thoroughly evaluated on the BraTS 2020 dataset using various performance metrics to accomplish this goal. The obtained results indicate that the model outperformed related studies, exhibiting a Dice coefficient of 0.975, specificity of 0.988, and sensitivity of 0.995, indicating the efficacy of the proposed model in improving brain tumor segmentation and offering valuable insights for reliable diagnosis in clinical settings.
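
The three reported metrics follow from the confusion counts of a binary segmentation mask. The sketch below computes them on flattened toy masks; the function names and the example masks are assumptions for illustration and say nothing about how the authors evaluated their model.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two equal-length binary (0/1) masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Return (dice, sensitivity, specificity); assumes nonzero denominators."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)   # share of tumor voxels that were found
    specificity = tn / (tn + fp)   # share of background voxels kept clean
    return dice, sensitivity, specificity


# Toy flattened masks: 1 = tumor voxel, 0 = background.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 1, 1, 0, 0, 0, 0, 0]
dice, sensitivity, specificity = segmentation_metrics(pred, truth)
print(dice, sensitivity, specificity)
```

Because background voxels dominate MRI volumes, specificity is usually close to 1 regardless of segmentation quality, which is why Dice is typically the headline number.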

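[7] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx cs.CV | cs.AI | 68U10, 68T07, 68T45, 92C55 | I.4.6; I.2.10; I.5.4; J.3

Eyad Gad, Mustafa Abou Khatwa, Mustafa A. Elattar, Sahar Selim

TL;DR: This paper proposes a breast cancer segmentation method that combines a modified U-Net with attention mechanisms and the FedProx method, targeting both accuracy and privacy when training on non-IID (non-independent and identically distributed) medical data.

Details

Motivation: Breast cancer is a leading cause of death among women, so early detection and accurate diagnosis are essential. Ultrasound imaging is reliable and cost-effective, but the sensitivity of medical data makes it hard to develop AI models that are both accurate and privacy-preserving.

Result: The global model reaches 96% accuracy, supporting the method's effectiveness in improving segmentation accuracy while preserving privacy.

Insight: FedProx is a promising approach for training accurate machine learning models on non-IID local medical datasets.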

Abstract: Breast cancer is a leading cause of death among women worldwide, emphasizing the need for early detection and accurate diagnosis. Ultrasound imaging, a reliable and cost-effective tool, is used for this purpose; however, the sensitive nature of medical data makes it challenging to develop accurate and private artificial intelligence models. Federated Learning is a promising solution, as it enables distributed machine learning on sensitive medical data while preserving patient privacy. However, training on non-Independent and non-Identically Distributed (non-IID) local datasets can impact the accuracy and generalization of the trained model, which is crucial for accurate tumour boundary delineation in breast cancer (BC) segmentation. This study aims to tackle this challenge by applying the Federated Proximal (FedProx) method to non-IID Ultrasonic Breast Cancer Imaging datasets. Moreover, we focus on enhancing tumour segmentation accuracy by incorporating a modified U-Net model with attention mechanisms. Our approach resulted in a global model with 96% accuracy, demonstrating the effectiveness of our method in enhancing tumour segmentation accuracy while preserving patient privacy. Our findings suggest that FedProx has the potential to be a promising approach for training precise machine learning models on non-IID local medical datasets.
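
FedProx differs from plain federated averaging by adding a proximal term (mu/2)·||w − w_global||² to each client's local objective, which discourages local weights from drifting far from the global model on non-IID data. The sketch below shows a single local gradient step on that objective with list-based weights. The function name, learning rate, `mu` value, and toy numbers are assumptions; the paper's U-Net training loop is not reproduced.

```python
def fedprox_local_step(w, w_global, grad_loss, lr=0.1, mu=0.01):
    """One gradient step on loss(w) + (mu / 2) * ||w - w_global||^2.

    `grad_loss` is the gradient of the client's own loss at `w`; the
    proximal term contributes mu * (w - w_global) to the total gradient.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad_loss)
    ]


# Toy case: zero task gradient, so only the proximal pull acts.
local = [1.0, -2.0]
global_model = [0.0, 0.0]
updated = fedprox_local_step(local, global_model, [0.0, 0.0], lr=0.1, mu=1.0)
print(updated)
```

Setting `mu` to 0 recovers an ordinary local SGD step, so `mu` directly controls how strongly each client is tied to the global model.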

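[8] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning cs.CV | cs.AI | cs.LG

Yunzhe Wang, Soham Hans, Volkan Ustun

TL;DR: The paper introduces the X-Ego-CS dataset and Cross-Ego Contrastive Learning (CECL), which use synchronized first-person video streams to build team-level tactical situational awareness.

Details

Motivation: Existing work on modeling team interactions mostly relies on third-person views and overlooks synchronous, egocentric multi-agent learning.

Result: CECL improves performance on a teammate and opponent location prediction task, supporting its effectiveness.

Insight: Gameplay understanding is a useful testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming.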

Abstract: Human team tactics emerge from each player’s individual perspective and their ability to anticipate, interpret, and adapt to teammates’ intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players’ first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates’ egocentric visual streams to foster team-level tactical situational awareness from an individual’s perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent’s ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. Code and dataset are available at https://github.com/HATS-ICT/x-ego.

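[9] FootFormer: Estimating Stability from Visual Input cs.CV

Keaton Kraiger, Jingjing Li, Skanda Bharadwaj, Jesse Scott, Robert T. Collins

TL;DR: FootFormer is a cross-modality method that predicts human motion dynamics directly from visual input and performs statistically significantly better than, or on par with, existing methods on multiple datasets.

Details

Motivation: Existing methods typically produce only one or two motion dynamics measures (such as foot pressure distribution or center of mass). FootFormer aims to fill this gap by jointly predicting several measures from visual input.

Result: On multiple datasets, FootFormer is significantly better than or equivalent to existing methods, and it achieves SOTA performance on the stability-predictive components (CoP, CoM, BoS).

Insight: Cross-modality learning with joint prediction can effectively link visual input to motion dynamics measures, improving both the accuracy and the coverage of the predictions.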

Abstract: We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.

Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen

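TL;DR: PruneHal reduces hallucinations in multi-modal large language models through adaptive KV cache pruning, requiring no additional training and adding almost no inference cost.

Details

Motivation: Existing methods usually mitigate hallucinations through additional training or by injecting external or internal information at inference time, both of which add computational cost. The authors observe that hallucinations are associated with insufficient attention to visual tokens and propose a more efficient solution.

Result: Across several mainstream MLLMs and benchmarks, PruneHal delivers robust, strong results, supporting its efficiency and effectiveness.

Insight: A root cause of hallucination may be attention dispersed across redundant visual tokens; by pruning, PruneHal directly improves how attention is allocated, which offers a new angle on the problem.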

Abstract: While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model’s attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model’s focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.

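[11] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning cs.CV

Takehiro Aoshima, Yusuke Shinohara, Park Byeongseon

TL;DR: This paper proposes a new metric, Video Consistency Distance (VCD), and uses it within a reward-based fine-tuning framework to improve temporal consistency in image-to-video generation, validating it on multiple datasets.

Details

Motivation: Conventional reward functions mainly target the overall quality and consistency of the generated video, and temporal consistency suffers when they are applied to image-to-video generation. The authors propose the VCD metric to address this.

Result: Experiments show that models fine-tuned with VCD achieve significantly better temporal consistency on multiple datasets without degrading other performance.

Insight: Frequency-domain analysis may be an effective way to improve video temporal consistency, and reward-based fine-tuning can optimize a model without real-world video datasets.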

Abstract: Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.

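[12] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks cs.CV | cs.AI

Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo

TL;DR: The paper proposes Dream4Drive, a synthetic data generation framework that improves downstream perception tasks in autonomous driving and markedly strengthens perception of corner cases.

Details

Motivation: Existing driving world models focus on generation quality and controllability metrics and overlook evaluation on downstream perception tasks, which is crucial for autonomous driving performance.

Result: Dream4Drive is highly flexible in generating multi-view corner cases and improves perception models under various numbers of training epochs.

Insight: Synthetic data generation should be tied closely to downstream task needs rather than to generation quality alone; 3D assets and multi-view editing are key to effective data augmentation.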

Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: https://wm-research.github.io/Dream4Drive/

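[13] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting cs.CV

In-Hwan Jin, Hyeongju Mun, Joonsoo Kim, Kugjin Yun, Kyeongbo Kong

TL;DR: MoE-GS is a new framework for dynamic Gaussian splatting that uses a Mixture of Experts (MoE) to improve the quality and consistency of dynamic scene reconstruction, and addresses the added compute cost through efficient rendering and a distillation strategy.

Details

Motivation: Existing dynamic Gaussian splatting methods perform inconsistently across scenes and do not generalize well. MoE-GS aims to combine the strengths of multiple specialized experts to handle the varied challenges of dynamic scenes.

Result: On the N3V and Technicolor datasets, MoE-GS consistently outperforms existing methods, with improved efficiency.

Insight: 1. A Mixture of Experts can substantially improve dynamic scene reconstruction quality
2. Efficient rendering and distillation are key to balancing model capacity against real-time performance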

Abstract: Recent advances in dynamic scene reconstruction have significantly benefited from 3D Gaussian Splatting, yet existing methods show inconsistent performance across diverse scenes, indicating no single approach effectively handles all dynamic challenges. To overcome these limitations, we propose Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS), a unified framework integrating multiple specialized experts via a novel Volume-aware Pixel Router. Our router adaptively blends expert outputs by projecting volumetric Gaussian-level weights into pixel space through differentiable weight splatting, ensuring spatially and temporally coherent results. Although MoE-GS improves rendering quality, the increased model capacity and reduced FPS are inherent to the MoE architecture. To mitigate this, we explore two complementary directions: (1) single-pass multi-expert rendering and gate-aware Gaussian pruning, which improve efficiency within the MoE framework, and (2) a distillation strategy that transfers MoE performance to individual experts, enabling lightweight deployment without architectural changes. To the best of our knowledge, MoE-GS is the first approach incorporating Mixture-of-Experts techniques into dynamic Gaussian splatting. Extensive experiments on the N3V and Technicolor datasets demonstrate that MoE-GS consistently outperforms state-of-the-art methods with improved efficiency. Video demonstrations are available at https://anonymous.4open.science/w/MoE-GS-68BA/.

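[14] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion cs.CV

Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang

TL;DR: SFGFusion is a new 3D object detection method that fuses camera and 4D imaging radar data. It uses surface fitting to strengthen spatial representation and cross-modal interaction, improving depth prediction and densifying the point cloud.

Details

Motivation: 4D imaging radar offers low cost, long-range detection, and accurate velocity measurement, but its sparse point clouds and low resolution limit geometric representation of objects and hinder multi-modal fusion.

Result: The method performs strongly on the TJ4DRadSet and VoD benchmarks, effectively fusing camera and 4D radar features.

Insight: Surface fitting provides an explicit geometric constraint for multi-modal fusion, making better use of sparse point cloud data and improving detection accuracy.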

Abstract: 3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird’s-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and view-of-delft (VoD) object detection benchmarks.

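[15] Advances in 4D Representation: Geometry, Motion, and Interaction cs.CV

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji

TL;DR: This survey covers 4D representations across geometry, motion, and interaction, with an emphasis on how to select and customize a representation for a given task, and discusses the gaps in current datasets and future directions.

Details

Motivation: 4D generation and reconstruction is a fast-evolving subfield of computer graphics, but existing surveys mostly enumerate techniques and rarely analyze the properties and challenges of the field from the perspective of representations.

Result: The survey summarizes both mainstream and under-explored 4D representations, identifies dataset shortages as a constraint on the field, and proposes directions for improvement.

Insight: The choice of 4D representation should follow the needs of the task; LLMs and VFMs hold considerable potential for 4D applications but have limitations to overcome; richer datasets are key to moving the field forward.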

Abstract: We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/

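[16] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges cs.CV

Konstantinos Bacharidis, Antonis A. Argyros

TL;DR: This paper reviews advances and challenges in vision-based mistake analysis for procedural activities. It examines how computer vision can detect and predict mistakes in structured tasks and surveys existing datasets, evaluation metrics, and state-of-the-art methods.

Details

Motivation: Mistake analysis in procedural activities has important applications in industrial automation, physical rehabilitation, education, and human-robot collaboration. Detecting and predicting mistakes visually can make task execution safer and more efficient.

Result: The paper summarizes the performance and limitations of existing methods and identifies open problems, such as distinguishing permissible variations from true mistakes.

Insight: Directions such as neuro-symbolic reasoning and counterfactual state modeling may further improve the accuracy and applicability of mistake detection.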

Abstract: Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.

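[17] Unified Reinforcement and Imitation Learning for Vision-Language Models cs.CV

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu

TL;DR: This paper proposes Unified Reinforcement and Imitation Learning (RIL), an algorithm for training efficient, lightweight vision-language models (VLMs). By combining the strengths of both paradigms, it lets small models imitate large ones while improving their generative ability, reaching performance close to or above leading closed-source VLMs.

Details

Motivation: VLMs are large and hard to deploy in resource-constrained environments, so an efficient way to train lightweight yet high-performing VLMs is needed.

Result: Experiments show that RIL narrows the performance gap with state-of-the-art open- and closed-source VLMs and surpasses them in several instances.

Insight: By unifying reinforcement and imitation learning, a lightweight model can both imitate large models and keep improving through reinforcement signals, which makes it more adaptable.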

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.

[18] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP cs.CVPDF

Ying Dai, Wei Yu Chen

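TL;DR: This paper proposes a training-free framework for open-vocabulary image segmentation and recognition that combines unsupervised segmentation with EfficientNetB0 and open-vocabulary object recognition with CLIP, achieving efficient and flexible cross-modal alignment through a two-stage pipeline.

Details

Motivation: Open-vocabulary vision tasks such as segmentation and recognition must handle unseen categories, and conventional methods depend on large amounts of annotated data and training. The goal is a training-free framework that exploits pretrained models to reduce reliance on annotation.

Result: The framework reaches state-of-the-art performance (Hungarian mIoU, precision, recall, and F1-score) on benchmarks including COCO, ADE20K, and PASCAL VOC, demonstrating its effectiveness and generalization ability.

Insight: 1. Combining pretrained models can substantially reduce dependence on annotated data; 2. SVD and hierarchical clustering offer a flexible solution for unsupervised segmentation; 3. Cross-modal alignment is key to open-set recognition.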

Abstract: This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
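
The final recognition step in the abstract above (a softmax over similarities between a segment embedding and the text embeddings) can be illustrated with a minimal sketch. The toy 3-dimensional embeddings, the `temperature` value, and the function names are assumptions made for illustration; the SVD projection into a shared space is omitted, and this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed hyperparameter."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def recognize(segment_embedding, text_embeddings, labels, temperature=0.1):
    """Return the most probable label and the full probability vector."""
    sims = [cosine(segment_embedding, t) for t in text_embeddings]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical toy data: two categories plus the generic fallback prompt.
labels = ["cat", "dog", "something else"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label, probs = recognize([0.9, 0.1, 0.2], text_embeddings, labels)
```

The "something else" entry competes in the same softmax as the named categories, so a segment that matches no category well can fall back to it.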

[19] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents cs.CVPDF

Kai Shi, Jun Yang, Ni Yang, Binqiang Pan, Qingsong Xie

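TL;DR: The paper proposes DaMo, a data mixture optimizer that tunes the composition of training data when fine-tuning multimodal large language models (MLLMs), improving the multitask capability of mobile phone agents (MPAs).

Details

Motivation: Existing methods struggle to determine the optimal data composition for multitask supervised fine-tuning (SFT), which limits MLLM performance in multitask settings.

Result: DaMo improves performance by 3.38% on PhoneAgentBench, by 2.57% on average across other benchmarks (such as BFCL-v3), and by 12.47% on the single BFCL-v3 task.

Insight: DaMo scales well, applies to different model architectures, and markedly improves multitask learning.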

Abstract: Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer), a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R^2=0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo’s superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% over other methods. Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git

[20] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes cs.CVPDF

Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan

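TL;DR: MV-RoboBench is a new benchmark for evaluating the multi-view spatial reasoning of vision-language models (VLMs) in robotic scenes. Results show that current models perform far below human level on multi-view tasks.

Details

Motivation: Existing VLM evaluations focus mainly on single-view tasks, whereas robotic scenes usually require multi-view information to resolve occlusion and depth ambiguity, so the multi-view reasoning ability of VLMs needs to be evaluated.

Result: Current VLMs fall far short of human performance on multi-view tasks, and spatial intelligence is positively correlated with robotic task execution in multi-view scenarios.

Insight: Performance on existing single-view benchmarks does not transfer directly to multi-view robotic tasks, which underlines the importance of multi-view reasoning.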

Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.

[21] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion cs.CVPDF

Yuki Mori, Kazuma Kano, Yusuke Asai, Shin Katayama, Kenta Urano

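TL;DR: The paper proposes a method for tracking workers in a logistics warehouse with 19 wide-angle cameras. Aligning coordinates based on foot positions reduces the effect of image distortion and improves tracking accuracy by more than 20%.

Details

Motivation: With the spread of e-commerce, the need to improve logistics warehouse efficiency is growing. Digital twin technology requires accurate tracking of worker positions, but a single camera has a limited field of view and wide-angle cameras introduce image distortion.

Result: Experiments show that the method improves worker tracking accuracy by more than 20%; its effectiveness is further validated through a comparison of ways to use appearance features.

Insight: Distortion in wide-angle camera images can be mitigated by aligning on local features such as foot positions, which matters for tracking in multi-camera systems.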

Abstract: With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has a limited field of view, so sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion, and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
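
One way to picture floor-based alignment is a planar homography that maps an image pixel to floor coordinates. Such a mapping is only valid for points that lie on the floor plane, which is one reason a foot position is a natural reference point. The sketch below assumes a known 3x3 homography per camera and an already detected foot pixel; the matrix values and names are hypothetical, and the abstract does not state that the authors use this exact formulation.

```python
def apply_homography(H, u, v):
    """Map image pixel (u, v) to floor-plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    return x / w, y / w

# Hypothetical calibration: 1 pixel = 0.01 m, plus a per-camera floor offset.
H_camera = [
    [0.01, 0.0, 2.0],
    [0.0, 0.01, -1.0],
    [0.0, 0.0, 1.0],
]

# Assumed foot pixel of a detected worker in this camera's image.
floor_x, floor_y = apply_homography(H_camera, 300, 400)
```

With one homography per camera, foot points seen by different cameras land in the same floor coordinate frame and can be compared directly.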

[22] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis cs.CV | cs.MMPDF

Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu

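TL;DR: The paper proposes PICK, a multi-step framework that uses multimodal large language models (MLLMs) for drawing-based psychoanalysis, in particular the HTP test, and improves the accuracy of psychological analysis through hierarchical analysis and knowledge injection.

Details

Motivation: MLLMs perform well on multimodal perception tasks but are rarely applied to subjective, emotionally rich domains such as psychological analysis. PICK aims to fill this gap with a structured approach that improves MLLM performance in psychological analysis.

Result: Experiments show that PICK significantly improves the capability of MLLMs in psychological analysis, and extensions to emotion understanding tasks confirm its generality.

Insight: Through a structured approach and knowledge injection, PICK successfully applies MLLMs to a subjective domain, showing their potential in specialized fields such as psychology.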

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.

[23] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation cs.CVPDF

Zhuoyang Xie, Yibo Zhao, Hui Huang, Riwei Wang, Zan Gao

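TL;DR: PRGCN is a novel graph memory network that addresses depth ambiguity in 3D human pose estimation through cross-sequence pattern reuse. Combining a graph memory bank with a dual-stream hybrid architecture, it sets a new state of the art.

Details

Motivation: Existing video-based methods process each sequence in isolation and fail to exploit the structural regularities and repetitive motion patterns that recur across sequences.

Result: PRGCN achieves the best results on the Human3.6M and MPI-INF-3DHP benchmarks, with MPJPE of 37.1mm and 13.4mm, respectively.

Insight: Cross-sequence pattern reuse is a key mechanism for advancing the field, which should shift from per-sequence optimization toward cumulative knowledge learning.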

Abstract: Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
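
For readers unfamiliar with the reported metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below is a generic illustration with made-up joints; benchmark protocols typically also align the root joint before measuring, which is omitted here.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints (same units as the input)."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Hypothetical two-joint pose, coordinates in millimetres.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mpjpe(pred, gt)
```

In this toy case one joint is exact and the other is 5 mm off, so the mean error is 2.5 mm.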

[24] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts cs.CVPDF

Chen Li, Huiying Xu, Changxin Gao, Zeyu Wang, Yun Liu

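TL;DR: This paper proposes Cauvis, a method that uses Causal Visual Prompts to address spurious correlations in single-source domain generalized object detection (SDGOD), significantly improving generalization to unseen target domains.

Details

Motivation: Because of the limitations of data augmentation, current SDGOD methods are prone to spurious correlations, leading models to rely too heavily on superficial features (such as color) rather than essential ones (such as object contours).

Result: On SDGOD datasets, Cauvis improves on existing domain generalization methods by 15.9-31.4% and is more robust in complex interference environments.

Insight: Causal visual prompts and high-frequency feature extraction can effectively disentangle spurious features from essential ones, significantly improving domain generalization.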

Abstract: Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model’s over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.

[25] CARES: Context-Aware Resolution Selector for VLMs cs.CV | cs.AI | cs.LGPDF

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

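TL;DR: CARES is a lightweight preprocessing module that predicts the minimal sufficient resolution for an image-query pair, reducing the compute overhead of large vision-language models while preserving task performance.

Details

Motivation: Large vision-language models usually process images at native or high resolution, which significantly increases compute and latency even when a low-resolution image would suffice. CARES aims to address this.

Result: Across five multimodal benchmarks, CARES significantly reduces compute overhead while preserving task performance.

Insight: Dynamic resolution selection can greatly improve computational efficiency without hurting model performance, which is especially useful in resource-constrained settings.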

Abstract: Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This often inflates visual tokens to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM’s response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
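
The abstract does not say how a discrete classifier yields a continuous resolution. One plausible reading, sketched here purely as an assumption, is a probability-weighted average of the candidate resolutions. The candidate values and the function name are hypothetical and do not come from the paper.

```python
def expected_resolution(probabilities, resolutions):
    """Probability-weighted average of candidate resolutions (assumed interpolation rule)."""
    if len(probabilities) != len(resolutions) or not resolutions:
        raise ValueError("need one probability per candidate resolution")
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must have positive mass")
    weighted = sum(p * r for p, r in zip(probabilities, resolutions))
    return weighted / total

# Hypothetical candidate set and classifier output for one image-query pair.
candidate_resolutions = [224, 448, 896]
class_probabilities = [0.5, 0.5, 0.0]
resolution = expected_resolution(class_probabilities, candidate_resolutions)
```

A classifier that is split evenly between the two smallest candidates would then request a resolution halfway between them rather than rounding up to the larger one.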

[26] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis cs.CVPDF

Qing Mao, Tianxin Huang, Yu Zhu, Jinqiu Sun, Yanning Zhang

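TL;DR: PoseCrafter proposes Hybrid Video Generation (HVG), which combines video interpolation with a pose-conditioned novel view synthesis model to generate clear intermediate frames, and uses a Feature Matching Selector (FMS) to improve pose estimation performance.

Details

Motivation: Existing pose estimation methods for sparsely overlapping image pairs perform poorly with small or no overlap; the generated intermediate frames are blurry and the selection strategies are inefficient.

Result: On multiple datasets, PoseCrafter significantly improves pose estimation performance, especially with small or no overlap.

Insight: Combining the strengths of generative models with a targeted selection strategy can effectively address the challenges of extreme pose estimation.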

Abstract: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To address these issues, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model. We also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter clearly improves pose estimation performance, especially on examples with small or no overlap.

[27] [De|Re]constructing VLMs’ Reasoning in Counting cs.CV | cs.CLPDF

Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

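TL;DR: The paper studies the reasoning ability of vision-language models (VLMs) on counting tasks. It finds that they are highly sensitive to the number, type, and spatial arrangement of objects and to distractors, and that errors stem mainly from the mapping of the last-layer representation into the output space. Fine-tuning only the output layer improves accuracy by up to 21%.

Details

Motivation: Although VLMs perform well across many tasks, they remain limited in visual reasoning such as counting. The paper aims to analyze in depth why VLMs fail and to propose a targeted improvement.

Result: Experiments show that VLMs are highly sensitive to object properties and distractors. Fine-tuning the output layer improves counting accuracy by up to 21%, and the improvement is confirmed on real-world datasets.

Insight: The visual reasoning ability of VLMs can be improved significantly by targeted fine-tuning of the output layer, without large-scale changes to the model. This suggests a possible direction for improving other visual reasoning tasks.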

Abstract: Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.

[28] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models cs.CVPDF

Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano

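TL;DR: This paper examines how prompt complexity in text-to-image (T2I) models affects the quality, diversity, and consistency of generated data. Through experiments and theoretical analysis, the authors propose an evaluation framework, reveal the relationship between prompt complexity and the utility of generated data, and analyze the effect of different inference-time interventions.

Details

Motivation: T2I models can generate abundant synthetic data, but its utility depends on prompt complexity, and this effect has not been studied systematically.

Result: Experiments show that increasing prompt complexity lowers conditional diversity and prompt consistency but reduces the distribution shift between synthetic and real data. In addition, prompt expansion performs best in image diversity and aesthetics.

Insight: 1) Prompt complexity is a key factor in the utility of T2I generations; 2) Inference-time interventions can increase the diversity of generated data to some extent but may move outside the support of real data; 3) Prompt expansion performs best because it uses a pre-trained language model as a likelihood estimator.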

Abstract: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.

[29] A Matter of Time: Revealing the Structure of Time in Vision-Language Models cs.CV | cs.AI | cs.IR | cs.MMPDF

Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer

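TL;DR: This paper studies the temporal awareness of vision-language models (VLMs), introduces the TIME10k benchmark dataset, and reveals that temporal information forms a low-dimensional, non-linear structure in the VLM embedding space. Building on this, the authors propose an explicit "timeline" representation for temporal reasoning tasks.

Details

Motivation: To investigate whether VLMs can position visual content on a timeline, which would extend their range of applications and functionality.

Result: The proposed timeline approaches match or outperform a prompt-based baseline on temporal reasoning tasks while being computationally efficient.

Insight: Temporal information is structured in the VLM embedding space, and the progression of time can be represented by a low-dimensional, non-linear mapping.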

Abstract: Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.

[30] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking cs.CVPDF

Yao Deng, Xian Zhong, Wenxuan Liu, Zhaofei Yu, Jingling Yuan

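TL;DR: The paper proposes HAD (Hierarchical Asymmetric Distillation), a multi-modal knowledge distillation framework that addresses the spatio-temporal asymmetry between RGB cameras and event cameras to improve object tracking.

Details

Motivation: RGB cameras and event cameras each have strengths (high spatial resolution for RGB cameras, high temporal resolution for event cameras), but the spatio-temporal asymmetry of their imaging mechanisms hinders effective multi-modal integration.

Result: Experiments show that HAD outperforms existing methods across a range of scenarios, and ablation studies confirm the effectiveness and necessity of each module.

Insight: Explicitly modeling the asymmetry makes it possible to combine the complementary strengths of the modalities and improve tracking robustness.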

Abstract: RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network’s computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.

[31] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection cs.CV | cs.CRPDF

Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan

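TL;DR: The paper proposes α-Cloak, a no-box adversarial attack on video object detection that operates through the alpha channel of RGBA videos and requires no access to model internals.

Details

Motivation: As object detection models are increasingly deployed in physical systems such as autonomous vehicles (AVs) and surveillance platforms, securing them against adversarial attacks becomes essential. Existing adversarial attack research focuses mainly on images, while no-box attacks in the video domain remain underexplored.

Result: α-Cloak achieves a 100% attack success rate on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash).

Insight: The paper reveals a previously unexplored alpha-channel vulnerability in video-based perception systems and stresses the urgency of defenses that account for the alpha channel in adversarial settings.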

Abstract: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
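
The general discrepancy that an alpha-channel attack can exploit is that a player composites RGBA pixels over a background, while a pipeline that discards alpha reads the raw RGB values. The sketch below shows only standard "over" compositing on a single pixel; it does not reproduce the paper's fusion algorithm, and the pixel values are made up.

```python
def composite_over(rgba, background):
    """Standard 'over' compositing of one RGBA pixel onto an opaque RGB background."""
    r, g, b, alpha = rgba
    return tuple(alpha * c + (1.0 - alpha) * bg for c, bg in zip((r, g, b), background))

def drop_alpha(rgba):
    """What a pipeline sees if it ignores the alpha channel entirely."""
    return tuple(rgba[:3])

# Hypothetical pixel: saturated red stored in RGB, but almost fully transparent.
pixel = (1.0, 0.0, 0.0, 0.1)
white_background = (1.0, 1.0, 1.0)

seen_by_viewer = composite_over(pixel, white_background)
seen_without_alpha = drop_alpha(pixel)
```

Here the composited pixel is a faint pink close to the white background, while the alpha-dropping reader sees pure red, so the two consumers of the same file disagree about its content.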

[32] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction cs.CVPDF

Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li

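TL;DR: VGD is a novel feed-forward, end-to-end learning framework that learns geometry explicitly and adds a Gaussian head branch to improve the semantic quality of novel views, significantly outperforming existing methods on nuScenes.

Details

Motivation: The core challenge in surround-view driving scene reconstruction is improving novel view quality while preserving generalization. Because the views overlap very little, existing methods struggle to ensure geometric consistency and reconstruction quality.

Result: On nuScenes, VGD significantly outperforms existing methods in both objective metrics and subjective quality.

Insight: Learning geometry explicitly and combining it with Gaussian parameter prediction effectively improves the quality and consistency of multi-view reconstruction.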

Abstract: Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference, but faces the core challenge of ensuring generalization while improving novel view quality. Because surround views have minimal overlap, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the improvement of semantic quality in novel views. In this paper, we introduce Visual Gaussian Driving (VGD), a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD’s scalability and high-fidelity surround-view reconstruction.

Francisco Mena, Dino Ienco, Cassio F. Dantas, Roberto Interdonato, Andreas Dengel

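TL;DR: The paper proposes a multi-modal co-learning framework that improves single-modality models for Earth observation tasks, particularly when the modalities available at training and inference differ.

Details

Motivation: Earth observation involves large volumes of data from diverse modalities, but in practice the same sensor modalities may not be available at both training and inference. Conventional methods are usually designed for a specific task or modality and lack generality.

Result: On four Earth observation benchmarks, the method outperforms state-of-the-art approaches from machine learning, computer vision, and Earth observation on both classification and regression tasks.

Insight: Multi-modal co-learning can exploit the diverse modalities available at training time to improve single-modality inference, especially in real-world scenarios where modalities are mismatched.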

Abstract: Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.

[34] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation cs.CVPDF

Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee

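TL;DR: This paper proposes DecAF, a training-free cross-modal method for video reasoning segmentation that refines raw attention maps with decomposed attention fusion and uses SAM2 prompting to produce fine-grained segmentation masks.

Details

Motivation: MLLMs perform well at video understanding, but their raw attention maps are noisy and poorly aligned with target regions, making them hard to use directly for localization.

Result: On referring and reasoning VOS benchmarks, DecAF outperforms training-free methods and performs comparably to training-based methods.

Insight: The attention of MLLMs can be applied directly to video segmentation without training, removing a limitation of conventional methods.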

Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.

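[35] CBDiff: Conditional Bernoulli Diffusion Models for Image Forgery Localization cs.CV

Zhou Lei, Pan Gang, Wang Jiahao, Sun Di

TL;DR: This paper proposes CBDiff, a conditional Bernoulli diffusion model for image forgery localization that generates multiple diverse localization maps to make predictions more credible and reliable.

Details

Motivation: Existing methods generate a single deterministic localization map, which lacks the precision and reliability that high-stakes applications require. This paper aims to address that gap.

Result: Experiments on eight public datasets show that CBDiff significantly outperforms existing state-of-the-art methods.

Insight: By generating diverse predictions, CBDiff captures the uncertainty of the forgery distribution more fully, which suits deployment in high-stakes settings.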

Abstract: Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight public benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.

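[36] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography cs.CV | cs.AI

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes

TL;DR: This paper presents XBench, the first systematic benchmark for evaluating the cross-modal interpretability of vision-language models on chest X-rays. It reveals the limitations of current models on small or diffuse lesions and stresses the importance of clinically reliable grounding.

Details

Motivation: Vision-language models perform well in medical image understanding, but their grounding ability (how well textual concepts align with visual evidence) remains underexplored. In medicine, reliable grounding is essential for interpretability and clinical adoption.

Result: 1. All model variants localize large, well-defined pathologies reasonably well, but performance degrades substantially on small or diffuse lesions.
2. Models pretrained on chest X-ray datasets show better grounding.
3. A model's recognition ability and grounding ability are strongly correlated.

Insight: Current vision-language models recognize pathologies well but still fall short of clinically reliable grounding, which underlines the need for targeted interpretability benchmarks before deployment in medical practice.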

Abstract: Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; and (3) the overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention

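[37] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom cs.CV

Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou

TL;DR: MedReason-R1 is a medical vision-language model designed for CT diagnosis. It combines reinforcement learning with a local zoom-in strategy to produce an explicit diagnostic reasoning process and significantly improve diagnostic performance on medical images.

Details

Motivation: General-purpose large vision-language models (VLMs) describe natural images well but underperform in the medical domain, mainly because large-scale, high-quality medical imaging datasets are lacking and the coarse-to-fine diagnostic process is neglected.

Result: MedReason-R1 outperforms general-purpose and medical VLMs on CT disease diagnosis while retaining generalization.

Insight: 1. Medical diagnosis needs both global localization and disease-specific details; 2. reinforcement learning can improve diagnostic reasoning without manual annotations; 3. high-quality specialized datasets are key to the success of medical VLMs.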

Abstract: General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model’s diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1

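[38] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction cs.CV | cs.AI | cs.CL | cs.RO

Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu

TL;DR: This paper proposes Policy World Model (PWM), a new driving paradigm that unifies world modeling and trajectory planning in a single architecture and improves planning through an action-free future state forecasting scheme.

Details

Motivation: Existing driving world models are mostly used for world simulation and are decoupled from trajectory planning. Recent work tries to unify world modeling and planning, but how world-modeling knowledge can benefit planning still needs exploration.

Result: Experiments show that PWM, using only front camera input, matches or exceeds state-of-the-art methods that rely on multi-view and multi-modal inputs.

Insight: Collaborative state-action prediction and action-free future state forecasting let world-modeling knowledge be applied to planning more effectively, improving the reliability and performance of autonomous driving systems.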

Abstract: Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.

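[39] I Spy With My Model’s Eye: Visual Search as a Behavioural Test for MLLMs cs.CV | cs.AI

John Burden, Jonathan Prunty, Ben Slater, Matthieu Tehenan, Greg Davis

TL;DR: This paper adapts classic visual search paradigms from cognitive psychology to evaluate the visual processing of multimodal large language models (MLLMs), and finds that they show a human-like pop-out effect and scene priors.

Details

Motivation: MLLMs perform well on vision-language tasks, but their visual processing remains opaque. Existing black-box evaluations measure task accuracy and reveal little about the underlying mechanisms.

Result: Experiments show that MLLMs exhibit a human-like pop-out effect in single-feature search but have capacity limits in multi-feature search. There is also evidence that the models incorporate scene priors such as lighting direction.

Insight: Visual search can serve as a diagnostic tool for the perceptual capabilities of MLLMs. It reveals similarities between model perception and human cognition and offers a new perspective for understanding and improving MLLMs.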

Abstract: Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms – originally developed to study human perception – to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
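
The pop-out criterion in this abstract (detection that does not depend on distractor set size) can be read as a near-zero slope of accuracy against set size. The following is a minimal sketch of that reading, not the authors' analysis: the function name `search_slope` and all numbers are hypothetical and chosen only for illustration.

```python
def search_slope(set_sizes, accuracies):
    """Least-squares slope of accuracy against distractor set size."""
    n = len(set_sizes)
    if n != len(accuracies) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, accuracies))
    return sxy / sxx


# Hypothetical accuracies for illustration only; these are not results from the paper.
sizes = [4, 8, 16, 32]
popout_like = [0.98, 0.97, 0.98, 0.97]        # roughly flat: set size barely matters
conjunctive_like = [0.95, 0.85, 0.70, 0.50]   # falls as distractors are added

print("single-feature slope:", search_slope(sizes, popout_like))
print("conjunctive slope:", search_slope(sizes, conjunctive_like))
```

A slope close to zero is consistent with pop-out, while a clearly negative slope is consistent with a capacity limit. Where to draw the threshold between the two is a design choice this sketch leaves open.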

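[40] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation cs.CV

Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer

TL;DR: This paper proposes CST, a framework that preserves fine curvilinear structures (such as microvasculature) in unpaired medical image translation by adding structure-consistency supervision, improving translation accuracy and diagnostic reliability.

Details

Motivation: Existing unpaired image translation methods often distort fine curvilinear structures such as microvasculature, which harms diagnosis and quantitative analysis. This matters most in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning.

Result: CST improves translation fidelity and achieves state-of-the-art performance while markedly improving the preservation of curvilinear structures.

Insight: CST offers a new approach to geometric integrity in medical image translation and suits structure-sensitive applications.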

Abstract: Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.

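[41] Explainable Face Presentation Attack Detection via Ensemble-CAM cs.CV

Rashik Shadman, M G Sarwar Murshed, Faraz Hussain

TL;DR: This paper proposes Ensemble-CAM, a technique that provides visual explanations for deep learning-based face presentation attack detection (PAD) systems to make them more transparent and trustworthy.

Details

Motivation: Existing deep learning PAD systems are effective, but their decisions are opaque and hard to explain. Visual explanations are needed to show which regions drive a system's decision.

Result: Ensemble-CAM makes face PAD systems more transparent, lets users see the basis of the model's decisions, and increases trust in the system.

Insight: Visual explanation techniques apply beyond PAD systems and can extend to other deep learning applications that need transparent decisions.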

Abstract: Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.

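[42] LyTimeT: Towards Robust and Interpretable State-Variable Discovery cs.CV

Kuai Yu, Crystal Su, Xiang Liu, Judah Goldfeder, Mingyuan Shao

TL;DR: LyTimeT is a two-phase framework that extracts the true variables of dynamical systems from high-dimensional video, learning robust and interpretable latent representations through spatio-temporal attention and stability constraints.

Details

Motivation: Extracting the true variables of a dynamical system from high-dimensional video is hard because of visual distractions such as background motion, occlusions, and texture changes. A robust and interpretable method is needed.

Result: On synthetic and real-world dynamical systems, LyTimeT outperforms baselines on mutual information and mean squared error, and it is invariant to background perturbations.

Insight: Combining spatio-temporal attention with stability constraints improves prediction accuracy and also makes the model more physically interpretable.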

Abstract: Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.

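[43] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation cs.CV

Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang

TL;DR: OmniMotion-X is a versatile multimodal framework for whole-body human motion generation. It uses an autoregressive diffusion transformer and supports combinations of tasks such as text-to-motion and music-to-dance. Reference motion and a new training strategy improve the consistency and quality of the generated content.

Details

Motivation: Existing motion generation methods are often limited to a single task or modality and are prone to conflicts in multimodal settings. OmniMotion-X aims to provide a unified framework that addresses consistency and flexibility in multimodal tasks.

Result: Experiments show that OmniMotion-X outperforms existing methods across multiple tasks and generates long, coherent, and controllable motions.

Insight: Reference signals and a progressive training strategy can resolve conflicts between modalities and enable high-quality motion generation.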

Abstract: This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.

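[44] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models cs.CV

Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun

TL;DR: CPL-NC is a lightweight test-time adaptation framework for vision-language models. A dynamically adjusted class-aware prototype cache and a negative contrastive learning mechanism address prototype degradation and class confusion, significantly improving generalization under distribution shift.

Details

Motivation: Existing test-time adaptation methods handle long-tailed distributions and confusion between semantically similar classes poorly. CPL-NC addresses both through dynamic prototype adjustment and negative contrastive learning.

Result: On 15 benchmarks, CPL-NC outperforms existing TTA methods with both ResNet-50 and ViT-B/16 backbones.

Insight: Dynamic prototype adjustment and negative contrastive learning can address long-tailed distributions and class confusion, and asymmetric optimization makes test-time adaptation more flexible.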

Abstract: Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.

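[45] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing cs.CV | cs.CL | cs.LG

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang

TL;DR: Pico-Banana-400K is a large-scale dataset of 400K images for research on text-guided image editing. It provides diverse edit pairs under systematic quality and diversity control and includes three specialized subsets that support research on complex editing scenarios.

Details

Motivation: Multimodal models have made notable progress in text-guided image editing, but the lack of large-scale, high-quality, open datasets built from real images limits research. Pico-Banana-400K aims to fill this gap.

Result: Pico-Banana-400K provides a solid foundation for training and evaluating the next generation of text-guided image editing models and supports research on complex editing scenarios.

Insight: Large-scale, high-quality datasets are key to advancing text-guided image editing research, and fine-grained quality control together with diverse editing scenarios can further improve model adaptability.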

Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.

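[46] olmOCR 2: Unit Test Rewards for Document OCR cs.CV | cs.CL

Jake Poznanski, Luca Soldaini, Kyle Lo

TL;DR: olmOCR 2 is an OCR system built on a 7B vision-language model (VLM) trained with reinforcement learning and unit-test rewards. It performs especially well on math formula conversion, table parsing, and multi-column layouts.

Details

Motivation: Existing OCR systems hit performance bottlenecks on complex document layouts such as math formulas, tables, and multi-column text, and need a more efficient and verifiable training method.

Result: The system reaches state-of-the-art performance on olmOCR-Bench, with the largest improvements in math formula conversion, table parsing, and multi-column layouts.

Insight: Reinforcement learning driven by synthetic data and unit tests can improve performance on complex document OCR tasks, and the openly released model and tools support sharing of the technology.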

Abstract: We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
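
The abstract describes rewards built from binary unit tests. One way to picture this is a reward equal to the fraction of tests an OCR output passes. The sketch below assumes that aggregation; the abstract does not say how the tests are combined, and the helpers `contains_text` and `appears_before` are hypothetical examples of checks, not the actual olmOCR-Bench tests.

```python
from typing import Callable, Sequence

# A unit test takes the OCR output text and returns pass (True) or fail (False).
UnitTest = Callable[[str], bool]


def unit_test_reward(ocr_output: str, tests: Sequence[UnitTest]) -> float:
    """Fraction of binary unit tests passed (assumed aggregation, range 0.0 to 1.0)."""
    if not tests:
        raise ValueError("at least one unit test is required")
    passed = sum(1 for test in tests if test(ocr_output))
    return passed / len(tests)


def contains_text(snippet: str) -> UnitTest:
    """Hypothetical check: the snippet must appear somewhere in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Hypothetical check: reading order, `first` must precede `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


page_text = "Introduction\nWe study OCR.\nResults\nAccuracy improves."
page_tests = [
    contains_text("We study OCR."),
    appears_before("Introduction", "Results"),
    contains_text("$E = mc^2$"),  # fails: the formula is missing from the output
]
# Two of the three checks pass for this output.
print(unit_test_reward(page_text, page_tests))
```

Because each check is binary and derived from known ground truth, a reward of this kind can be verified automatically, which is the property the RLVR setup in the abstract relies on.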

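[47] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking cs.CV

Ilona Demler, Saumya Chauhan, Georgia Gkioxari

TL;DR: This paper introduces ITTO, a new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Its real-world scenes with complex motion and occlusion patterns expose the weaknesses of existing trackers.

Details

Motivation: Current benchmarks lack the motion complexity and occlusion patterns of real-world scenes, which limits how well tracking algorithms transfer to practice. ITTO aims to fill this gap and drive the development of more robust tracking algorithms.

Result: Experiments show that existing trackers struggle with complex motion and occlusion, particularly in re-identifying points after occlusion.

Insight: Existing tracking methods fall well short in real-world dynamic scenes, and new modeling approaches tailored to real-world dynamics are needed.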

Abstract: We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes – factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.

cs.CL [Back]

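[48] Small Language Models Offer Significant Potential for Science Community cs.CL | cs.AI

Jian Zhang

TL;DR: This paper explores the potential of small language models (MiniLMs) for geoscience literature retrieval and proposes an efficient, low-cost, and precise information retrieval framework as an alternative to large language models (LLMs).

Details

Motivation: LLMs are increasingly used in scientific research, but they raise concerns about information bias and computational cost. The author aims to offer a more efficient, lower-cost alternative based on small language models, focused on precise information retrieval in geoscience.

Result: MiniLMs efficiently extract expert-verified information from the literature and do especially well on quantitative findings. Sentiment analysis and topic clustering also reveal research trends and points of controversy in geoscience.

Insight: Small language models have clear advantages in resource-constrained scientific fields: they lower computational cost and provide more precise domain-specific retrieval.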

Abstract: Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published during years 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4 that often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrievals, trend analyses, contradiction analyses, and educational purposes.

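[49] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti cs.CL | cs.CY

Mangsura Kabir Oni, Tabia Tanzin Prama

TL;DR: This paper studies Transformer-based models for low-resource translation from Standard Bengali to Sylheti. It finds that fine-tuned models outperform zero-shot LLMs, which underlines the importance of task-specific adaptation.

Details

Motivation: Machine translation has advanced considerably for high-resource languages, but low-resource varieties such as Sylheti remain underexplored, and effective methods for them are needed.

Result: mBART-50 achieves the highest translation adequacy, MarianMT shows the strongest character-level fidelity, and fine-tuned models significantly outperform LLMs.

Insight: Task-specific fine-tuning is essential for low-resource translation and supports the development of inclusive language technologies.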

Abstract: Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.

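[50] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code cs.CL | cs.AI | cs.IR | cs.LG

Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu

TL;DR: DuoLens is a framework for detecting machine-generated multilingual text and source code. By fine-tuning small language models (SLMs), it significantly outperforms existing methods in both compute cost and accuracy.

Details

Motivation: Current machine-generated content detectors, mostly zero-shot methods such as Fast DetectGPT or GPTZero, either incur high computational cost or lack accuracy, and often trade one for the other.

Result: AUROC reaches 0.97-0.99 and macro-F1 reaches 0.89-0.94, with latency reduced by 8-12× and peak VRAM reduced by 3-5×. Under adversarial transformations, performance retains ≥92% of clean AUROC.

Insight: Small language models can outperform large language models on a specific task while greatly reducing compute overhead, which offers an efficient solution for machine-generated content detection.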

Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC of 0.97 to 0.99 and macro-F1 of 0.89 to 0.94 while reducing latency by 8-12× and peak VRAM by 3-5× at 512-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains ≥ 92% of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
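
As a small worked example of the robustness figure, the sketch below converts "retains at least 92% of clean AUROC" into absolute AUROC floors for the reported clean range. It assumes retention means the ratio of shifted AUROC to clean AUROC, which is one reading of the abstract; the function name `min_shifted_auroc` is hypothetical.

```python
def min_shifted_auroc(clean_auroc: float, retention: float) -> float:
    """Lowest shifted AUROC consistent with keeping `retention` of the clean AUROC.

    Assumes retention is defined as shifted_auroc / clean_auroc.
    """
    if not 0.0 < clean_auroc <= 1.0:
        raise ValueError("clean_auroc must be in (0, 1]")
    if not 0.0 <= retention <= 1.0:
        raise ValueError("retention must be in [0, 1]")
    return clean_auroc * retention


# Clean AUROC endpoints and the 92% retention figure are taken from the abstract.
for clean in (0.97, 0.99):
    floor = min_shifted_auroc(clean, 0.92)
    print(f"clean AUROC {clean:.2f} -> shifted AUROC >= {floor:.3f}")
```

Under this reading, the reported retention corresponds to a shifted AUROC of roughly 0.89 to 0.91 or higher.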

Wangjiaxuan Xin, Shuhua Yin, Shi Chen, Yaorong Ge

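TL;DR: This paper proposes TM-Rephrase, a framework that uses large language models (LLMs) to rephrase short social media texts into more standardized language before topic modeling. Experiments show that it significantly improves topic coherence, uniqueness, and diversity.

Details

Motivation: The brevity and noise of short social media texts such as tweets hinder topic modeling and produce topics that are hard to interpret. The paper aims to address this through text rephrasing.

Result: TM-Rephrase improves topic coherence, uniqueness, and diversity and reduces redundancy, with the largest gains for the LDA algorithm.

Insight: Standardizing short social media texts can markedly improve the usefulness of topic modeling, with broad implications for social media analysis in areas such as public health crises.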

Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general rephrasing and colloquial-to-formal rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes to a model-agnostic approach to enhancing topic modeling in public health related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.

Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu

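TL;DR: MMAO-Bench is a new high-quality multimodal benchmark that aims to evaluate the compositional law between uni-modal and omni-modal capabilities.

Details

Motivation: Multimodal large models are moving from uni-modal understanding toward omni-modal unification of vision, audio, and language, but the relationship between uni-modal and omni-modal capability is unclear. Comprehensive evaluation is needed to drive progress in model intelligence.

Result: Experiments show that omni-modal capability acts as a bottleneck for weak models and produces a synergistic gain for strong models.

Insight: Gains in omni-modal capability depend on stronger uni-modal capability, and the pattern of influence differs with the model's capability level.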

Abstract: Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, which requires comprehensive evaluation to drive the evolution of omni model intelligence. In this work, we propose a novel, high-quality and diverse omni model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show the compositional law between cross-modal and uni-modal performance: the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.

Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song

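TL;DR: The paper studies the social reasoning ability of large language models (LLMs) on English and Korean dialogues. Using the SCRIPTS dataset, it finds that models perform better on English than on Korean dialogues and exhibit notable social biases and reasoning failures.

Details

Motivation: As LLMs are used more widely in human-AI interaction, their social reasoning ability becomes critical. The paper evaluates how well LLMs infer interpersonal relationships, particularly across languages and cultural contexts.

Result: Proprietary LLMs reach roughly 75-80% on the English dataset and drop to 58-69% on Korean; 10-25% of responses select relationships labeled Unlikely; social biases are amplified in some cases.

Insight: Current LLMs have limited social reasoning ability, especially in cross-lingual and cross-cultural settings, which underscores the need for more socially aware language models.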

Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models’ social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs’ social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.

[54] Re:Member: Emotional Question Generation from Personal Memories cs.CL | cs.HC

Zackary Rackauckas, Nobuaki Minematsu, Julia Hirschberg

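TL;DR: Re:Member is an emotional question generation system grounded in personal memories. It combines users' personal videos with emotionally expressive spoken questions to make second language learning more interactive and emotionally engaging.

Details

Motivation: Traditional second language learning tools lack emotional interaction and personalized content. Re:Member addresses this gap by using personal memories and affective design to improve the learning experience.

Result: The system generates emotionally rich questions that are consistent with the visual context and is designed to encourage affective recall and conversational engagement.

Insight: Affective and personalized content plays an important role in educational technology and can improve learner engagement and learning outcomes.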

Abstract: We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.

[55] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models cs.CL | cs.LG | eess.SP | stat.ML

Valentin Noël

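TL;DR: The paper proposes a graph signal processing framework for detecting hallucinations in large language models. It models attention in transformer layers as dynamic graphs and defines diagnostics through spectral analysis; experiments indicate that these diagnostics separate factual statements from hallucinations.

Details

Motivation: LLMs are prone to hallucination (non-factual content) when generating text, and distinguishing factual reasoning from hallucination remains challenging. The paper uses spectral analysis to expose patterns specific to hallucinations and builds a detection framework on them.

Result: Factual statements show an energy distribution that converges to low frequencies, while hallucinations show distinct spectral signatures. A detector based on spectral features reaches 88.75% accuracy, compared with 75% for perplexity-based baselines.

Insight: Spectral geometry may capture the reasoning patterns and error behaviors of language models and could serve as a new framework for hallucination detection.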

Abstract: Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent “energy mountain” behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
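
Of the diagnostics listed in the abstract, Dirichlet energy is the simplest to illustrate. The sketch below uses the standard graph definition, 0.5 * sum_ij W_ij * ||x_i - x_j||^2, on a toy attention matrix. Symmetrizing the attention weights and skipping any normalization are assumptions made here for illustration; the paper's exact construction may differ, and the numbers are invented.

```python
def dirichlet_energy(attention, embeddings):
    """Return 0.5 * sum_ij W_ij * ||x_i - x_j||^2 with W the symmetrized attention."""
    n = len(attention)
    energy = 0.0
    for i in range(n):
        for j in range(n):
            # Symmetrize so attention defines an undirected graph (an assumption).
            w = 0.5 * (attention[i][j] + attention[j][i])
            dist_sq = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            energy += 0.5 * w * dist_sq
    return energy


# Toy 3-token example with invented numbers.
attention = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
smooth = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # identical embeddings
rough = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]   # connected tokens disagree

print(dirichlet_energy(attention, smooth))
print(dirichlet_energy(attention, rough))
```

A signal that varies little across strongly connected tokens has low energy; one that flips between connected tokens has high energy, which is the sense in which the abstract speaks of low-frequency versus high-frequency behavior.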

[56] Training-Free Spectral Fingerprints of Voice Processing in Transformers cs.CL | cs.LG | eess.SP | stat.ML

Valentin Noël

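TL;DR: The paper uses spectral analysis to identify computational fingerprints of different transformer architectures across languages. It finds that architecture and training emphasis leave detectable traces in attention graphs, and that these traces correlate strongly with behavioral differences.

Details

Motivation: The study asks how different transformer architectures carry out the same linguistic computation through different connectivity patterns, and whether those patterns can be detected with spectral analysis.

Result: Phi-3-Mini shows a pronounced early-layer disruption specific to English, Qwen2.5-7B shows small shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B shows systematic but muted responses; these signatures correlate strongly with behavioral differences.

Insight: Training emphasis and architectural design leave detectable traces in a model's attention structure, and these traces can serve as a diagnostic for revealing model biases.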

Abstract: Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model-imprinted “computational fingerprints” detectable through spectral analysis. Using graph signal processing on attention-induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2–5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English-specific early-layer disruption ($\overline{\Delta\lambda_2}_{[2,5]} \approx -0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r=-0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations. Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
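
For readers unfamiliar with the quantities in this abstract, the block below spells out the textbook definition of the Fiedler value and one plausible reading of the averaged shift. The symmetrization of the attention matrix, the sign convention (passive minus active), and the plain mean over layers 2 to 5 are assumptions for illustration, not details confirmed by the abstract.

```latex
% Symmetrized attention graph at layer \ell (symmetrization is an assumption)
W^{(\ell)} = \tfrac{1}{2}\left(A^{(\ell)} + A^{(\ell)\top}\right), \qquad
D^{(\ell)}_{ii} = \sum_j W^{(\ell)}_{ij}, \qquad
L^{(\ell)} = D^{(\ell)} - W^{(\ell)}

% Laplacian eigenvalues; \lambda_2 is the algebraic connectivity (Fiedler value)
0 = \lambda_1^{(\ell)} \le \lambda_2^{(\ell)} \le \dots \le \lambda_n^{(\ell)}

% Shift under voice alternation (sign convention assumed), averaged over the early window
\Delta\lambda_2^{(\ell)} = \lambda_2^{(\ell)}(\text{passive}) - \lambda_2^{(\ell)}(\text{active}), \qquad
\overline{\Delta\lambda_2}_{[2,5]} = \frac{1}{4} \sum_{\ell=2}^{5} \Delta\lambda_2^{(\ell)}
```

Under this reading, a strongly negative average means the token graph becomes markedly less connected in early layers when the sentence voice changes.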

[57] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges cs.CL

Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li

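TL;DR: This survey reviews the state of Tibetan AI research, covering data resources, NLP tasks, machine translation, speech recognition, and large language models. It identifies challenges such as data sparsity, orthographic variation, and the lack of unified evaluation metrics, and discusses the potential of cross-lingual transfer and multimodal learning.

Details

Motivation: Tibetan is one of the major low-resource languages in Asia, with distinctive linguistic and sociocultural characteristics, but AI research on it is held back by a lack of accessible data resources, standardized benchmarks, and dedicated tools. The survey aims to fill this gap and advance Tibetan AI research.

Result: The survey summarizes the current state of Tibetan AI research and highlights data sparsity and the absence of standardized evaluation.

Insight: Cross-lingual transfer and multimodal learning are promising ways to address data scarcity in Tibetan AI, alongside community-driven resource creation.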

Abstract: Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.

[58] “You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations cs.CL

Dingjie Fu, Dianxing Shi

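TL;DR: The paper examines whether large language models (LLMs) can pass the hiring evaluations used by technology companies and finds that none of the evaluated LLMs pass.

Details

Motivation: Technology companies must screen large numbers of engineering applicants efficiently, and standardized hiring evaluations are a key step. Because LLMs perform well on coding and reasoning tasks, the authors ask whether LLMs can pass such an evaluation.

Result: None of the evaluated LLMs passed the hiring evaluation; their answers were significantly inconsistent with the company's reference solutions.

Insight: At their current level, LLMs cannot yet fully take the place of human engineers in hiring evaluations, which points to limitations in practical use.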

Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fail to pass the hiring evaluation.

[59] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG cs.CL

Jihwan Bang, Juntae Lee, Seunghan Yang, Sungha Choi

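TL;DR: TSSS is an efficient multi-hop RAG framework that uses template-based reasoning and a retriever-based terminator to cut redundant token generation and improve inference efficiency.

Details

Motivation: Existing multi-hop RAG methods are inefficient: they regenerate predictable tokens at every step and rely on stochastic stopping, which wastes compute and makes termination unstable.

Result: TSSS achieves state-of-the-art accuracy on HotpotQA, 2WikiMultiHop, and MuSiQue, with competitive efficiency among RAG-CoT approaches.

Insight: Structured reasoning and deterministic termination can substantially improve the efficiency of multi-hop RAG, which suits resource-constrained settings such as on-device inference.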

Abstract: Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
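
The abstract does not say how the retriever-based terminator decides that sub-queries have "collapsed into repetition". As a minimal sketch under an assumed criterion, the code below stops as soon as most documents retrieved for a new sub-query were already retrieved by earlier hops. The function names, the threshold, and the document IDs are all hypothetical.

```python
def repeat_ratio(new_ids, seen_ids):
    """Fraction of newly retrieved IDs already seen (1.0 if nothing was retrieved)."""
    new = set(new_ids)
    return len(new & seen_ids) / len(new) if new else 1.0


def run_hops(retrievals, repeat_threshold=0.8):
    """Consume per-hop retrieval results; stop deterministically on repetition.

    retrievals: list of doc-ID lists, one per sub-query (stand-in for a real retriever).
    Returns the deduplicated evidence list and the number of hops actually used.
    """
    seen = set()
    evidence = []
    hops_used = 0
    for doc_ids in retrievals:
        if repeat_ratio(doc_ids, seen) >= repeat_threshold:
            break  # the new sub-query adds almost nothing, so halt
        hops_used += 1
        for doc_id in doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                evidence.append(doc_id)
    return evidence, hops_used


# Invented retrieval trace: the third hop repeats the second one.
hops = [["d1", "d2"], ["d3", "d4"], ["d3", "d4"], ["d9"]]
print(run_hops(hops))
```

Because the stop decision depends only on retrieval results, it involves no sampling, which is the property the abstract contrasts with stochastic stopping.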

[60] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA cs.CL | cs.AI

Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar

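TL;DR: The paper introduces evolveQA, a new benchmark for evaluating how large language models (LLMs) handle knowledge that evolves over time. The benchmark is built from three real-world, time-stamped corpora and reveals marked performance drops when LLMs face dynamic knowledge.

Details

Motivation: Existing studies mostly evaluate how LLMs handle temporal knowledge conflicts using structured knowledge bases such as Wikidata, but they focus on widely covered popular entities and cannot fairly evaluate LLMs with different knowledge cut-off dates.

Result: Across 12 open and closed-source LLMs, performance on evolveQA drops by up to 31% compared with static knowledge questions.

Insight: LLMs handle dynamic knowledge poorly, which suggests future work should focus more on adapting to and updating temporal knowledge.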

Abstract: LLMs often fail to handle temporal knowledge conflicts, that is, contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
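
The idea of gold answers tailored to a model's knowledge cut-off can be illustrated with a small lookup: given a dated history of one fact, the expected answer is the latest value in effect on or before the cut-off. This is a sketch of the general idea only; the function name and the fact history are invented, and the benchmark's actual construction is not described at this level in the abstract.

```python
from datetime import date


def gold_answer(timeline, cutoff):
    """Return the latest value whose effective date is on or before `cutoff`, else None."""
    valid = [(d, v) for d, v in timeline if d <= cutoff]
    return max(valid, key=lambda item: item[0])[1] if valid else None


# Hypothetical history of a single fact (invented values, not from the benchmark).
service_limit = [
    (date(2022, 3, 1), "20"),
    (date(2023, 6, 15), "50"),
    (date(2024, 9, 1), "100"),
]

print(gold_answer(service_limit, date(2023, 1, 1)))    # model with an early cut-off
print(gold_answer(service_limit, date(2024, 12, 31)))  # model with a late cut-off
print(gold_answer(service_limit, date(2021, 1, 1)))    # fact not yet known
```

Two models with different cut-off dates are then scored against different gold answers for the same question, which is what makes the comparison fair.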

[61] Interpretable Question Answering with Knowledge Graphs cs.CL | cs.AI | cs.LG

Kartikeya Aneja, Manasvi Srivastava, Subhayan Das, Nagender Aneja

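TL;DR: The paper proposes a knowledge graph question answering system that does not rely on retrieval-augmented generation (RAG) with large language models (LLMs); instead, a small paraphraser model generates answers from knowledge graph retrieval results. The system performs well on the CRAG benchmark.

Details

Motivation: Existing QA systems usually depend on LLMs and RAG, which can lack interpretability and are computationally expensive. The paper explores a knowledge-graph-based alternative that improves transparency and efficiency.

Result: On the CRAG benchmark, accuracies are 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.

Insight: Knowledge graph retrieval can stand in for LLM-based generation in QA, and the small paraphraser model shows the potential of lightweight solutions.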

Abstract: This paper presents a question answering system that operates exclusively on knowledge graph retrieval, without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.

[62] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems cs.CL

Zhaoyi Joey Hou, Tanya Shourya, Yingfan Wang, Shamik Roy, Vinayshekhar Bannihatti Kumar

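TL;DR: The paper presents the TRACE benchmark and the SCOPE framework for systematically evaluating diverse error patterns in tool-augmented dialogue systems, addressing the failure of existing evaluation methods to capture critical errors in multi-turn dialogues.

Details

Motivation: Existing evaluation methods focus on user satisfaction or tool-calling ability, but in multi-turn tool-augmented dialogues an agent can misinterpret tool results while still appearing satisfactory to the user, so critical errors go unnoticed.

Result: Experiments show that SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.

Insight: Evaluation of tool-augmented dialogue systems should look for latent errors across multi-turn interactions rather than rely on user satisfaction alone.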

Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.

[63] DiSRouter: Distributed Self-Routing for LLM Selections cs.CL

Hang Zheng, Hongshen Xu, Yongkai Lin, Shuai Fan, Lu Chen

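TL;DR: The paper proposes DiSRouter, a distributed self-routing paradigm that addresses the flexibility and performance limits of LLM selection.

Details

Motivation: Existing LLM selection methods rely on an external centralized router, which is inflexible and limited in performance because it cannot fully understand the knowledge boundaries of different LLMs.

Result: DiSRouter significantly outperforms existing routing methods across scenarios, effectively distinguishes easy from hard queries, and generalizes well to out-of-domain tasks.

Insight: Leveraging an LLM's intrinsic self-awareness is more effective than external assessment, which opens a path to more modular and efficient multi-agent systems.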

Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, that is, its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM’s self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM’s intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
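
The routing idea can be sketched as a chain in which each agent decides for itself whether to answer or pass the query on. Everything below is a toy stand-in: the `Agent` class, the cheapest-first ordering, the threshold, and the length-based confidence functions are assumptions for illustration, whereas the paper obtains self-assessment through its Self-Awareness Training pipeline and describes a network rather than a fixed chain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Agent:
    name: str
    cost: float
    # Hypothetical stand-in for trained self-assessment: query -> confidence in [0, 1].
    self_confidence: Callable[[str], float]
    threshold: float = 0.5

    def wants_to_answer(self, query: str) -> bool:
        return self.self_confidence(query) >= self.threshold


def route(query: str, agents: List[Agent]) -> Optional[Agent]:
    """Pass the query along the chain; the first agent that judges itself competent answers.

    The last agent acts as a fallback and always answers.
    """
    for agent in agents[:-1]:
        if agent.wants_to_answer(query):
            return agent
    return agents[-1] if agents else None


# Toy confidence functions keyed on query length, purely for illustration.
small = Agent("small", cost=1.0,
              self_confidence=lambda q: 0.9 if len(q.split()) <= 5 else 0.2)
large = Agent("large", cost=10.0, self_confidence=lambda q: 0.95)

easy = "What is 2 + 2?"
hard = "Prove that every bounded monotone sequence of real numbers converges."
print(route(easy, [small, large]).name)
print(route(hard, [small, large]).name)
```

No central router appears in this sketch: adding or removing an agent only changes the list, which reflects the flexibility argument made in the abstract.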

[64] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets cs.CL

Ziwei Wang, Jiayuan Su, Mengyu Zhou, Huaxing Zeng, Mengni Jia

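TL;DR: SheetBrain is a neuro-symbolic dual-workflow agent framework for accurate reasoning over spreadsheets, supporting both question answering and manipulation tasks.

Details

Motivation: Large language models (LLMs) struggle to capture the structure of complex spreadsheets and to ensure correct reasoning, so more effective tooling is needed.

Result: SheetBrain significantly improves accuracy on public benchmarks and on the new SheetBench.

Insight: A neuro-symbolic framework is well suited to complex tabular tasks, and the validation module is key to reliable reasoning.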

Abstract: Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet - including sheet summary and query-based problem insight to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
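
The three-module design (understand, execute, validate with re-execution) amounts to a small control loop. The sketch below shows only that loop; the three callables are toy stand-ins written for this example (the validator even recomputes the answer directly), not SheetBrain's actual modules or API.

```python
def solve(question, table, understand, execute, validate, max_attempts=3):
    """Run understand -> execute -> validate, re-executing with feedback on failure."""
    overview = understand(table, question)
    feedback = None
    answer = None
    for attempt in range(1, max_attempts + 1):
        answer = execute(table, question, overview, feedback)
        ok, feedback = validate(table, question, answer)
        if ok:
            return answer, attempt
    return answer, max_attempts


# --- Toy stand-ins for the three modules (hypothetical) ---
table = [{"region": "north", "sales": 10}, {"region": "south", "sales": 32}]


def understand(table, question):
    return {"columns": sorted(table[0]), "rows": len(table)}


def execute(table, question, overview, feedback):
    # The first try deliberately skips the last row to mimic a reasoning slip.
    rows = table if feedback else table[:-1]
    return sum(row["sales"] for row in rows)


def validate(table, question, answer):
    ok = answer == sum(row["sales"] for row in table)
    return ok, (None if ok else f"use all {len(table)} rows")


result = solve("What are total sales?", table, understand, execute, validate)
print(result)
```

The second attempt succeeds because the validator's feedback is passed back into execution, which is the re-execution behavior the abstract describes.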

[65] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization cs.CL

Yuto Tomikawa, Masaki Uto

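TL;DR: The paper proposes a difficulty-controllable multiple-choice question generation method based on a large language model and direct preference optimization (DPO), targeting two limitations of conventional methods: they cannot directly generate multiple-choice questions, and their difficulty control is not accurate enough.

Details

Motivation: In education, difficulty-controllable question generation is a key tool for adaptive learning, but existing methods cannot directly generate multiple-choice questions and control difficulty with limited accuracy.

Result: The generated multiple-choice questions show better difficulty control and are suited to educational settings.

Insight: DPO can improve a model's accuracy on difficulty control, offering a new direction for applying LLMs in education.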

Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.

[66] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools cs.CL

Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue

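TL;DR: The paper introduces TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with a variety of real-world services. It shows the potential of advanced reasoning models for tool discovery, but also finds that navigating and combining tools in complex environments remains a challenge.

Details

Motivation: Current general-purpose agents mostly interact with the environment through web browsers, yet task-specific tool sets are easier to develop and maintain. The authors explore how practical tool-calling agents are on real-world tasks.

Result: With tool retrieval, an advanced model such as GPT-5 performs close to how it performs with ground-truth tools, whereas smaller models cannot take full advantage of the available tools; navigating and combining tools in complex environments remains difficult.

Insight: Current models still need better reasoning and retrieval in complex environments; task-specific tool sets offer a new direction for improving agent performance.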

Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5’s performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.

[67] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation cs.CL

Fan Xu, Huixuan Zhang, Zhenliang Zhang, Jiahao Wang, Xiaojun Wan

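TL;DR: JointCQ is a joint claim-and-query generation framework that addresses context loss and low query specificity in LLM hallucination detection; experiments on open-domain QA show it outperforms previous methods.

Details

Motivation: LLMs are prone to hallucination, that is, generating content that appears factual but is unreliable. Current detection pipelines perform poorly in the claim extraction and query generation stages, which degrades overall detection.

Result: JointCQ outperforms existing methods on multiple open-domain QA hallucination detection benchmarks.

Insight: Jointly generating claims and queries detects hallucinations more effectively and supports more transparent and trustworthy LLM systems.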

Abstract: Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (https://github.com/pku0xff/JointCQ), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.

[68] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints cs.CL

Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi

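TL;DR: KORE uses knowledge-oriented augmentations and constraints to improve knowledge injection in large multimodal models while avoiding catastrophic forgetting, so that new knowledge is learned accurately and old knowledge is retained.

Details

Motivation: Large multimodal models encode extensive knowledge during pre-training, but that knowledge is static and cannot be updated promptly, which makes continual acquisition of new knowledge difficult. Existing methods struggle both to learn new knowledge and to avoid catastrophic forgetting.

Result: On several models, including LLaVA-v1.5 and Qwen2.5-VL, KORE achieves superior new knowledge injection and effectively mitigates catastrophic forgetting.

Insight: Knowledge injection must attend both to learning new knowledge accurately and to retaining old knowledge; structured knowledge conversion and interference-minimizing fine-tuning provide an effective route to both.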

Abstract: Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM’s linear layer activations and initializes the adapter by projecting the original weights into the matrix’s null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
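
The retention argument rests on a standard linear-algebra fact: an update confined to the null space of the activation covariance leaves outputs on previous activations unchanged. The derivation below is a generic sketch of that fact; the symbols X, C, P, and M are introduced here for illustration, and the exact way KORE builds its adapter from the projected weights is not specified in the abstract.

```latex
% Covariance of a linear layer's input activations on data representing previous knowledge
C = \frac{1}{N} X X^{\top}, \qquad X \in \mathbb{R}^{d \times N}

% Eigendecomposition; U_0 collects eigenvectors with (near-)zero eigenvalues
C = U \Lambda U^{\top}, \qquad U = [\, U_{+} \;\; U_{0} \,], \qquad C\, U_{0} \approx 0

% Projector onto the (approximate) null space of C
P = U_{0} U_{0}^{\top}

% Any update of the form \Delta W = M P leaves previous activations almost unchanged
(W + \Delta W)\, x = W x + M P x \approx W x \quad \text{for } x \in \operatorname{col}(X)
```

Taking M = W gives the projected weights W P mentioned in the abstract as the adapter initialization, under the reading assumed here.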

[69] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization cs.CL | cs.AI

Junjie Song, Yiwen Liu, Dapeng Li, Yin Sun, Shukun Fu

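TL;DR: The paper proposes a multi-objective reinforcement learning framework for text summarization based on hypervolume optimization (HVO), which dynamically adjusts the weighting of different objectives to produce more balanced summaries.

Details

Motivation: Summarization must optimize consistency, coherence, relevance, and fluency at the same time, which is challenging. Although LLM-based reinforcement learning has made notable progress, few studies address the multi-objective problem.

Result: On several representative datasets, HVO outperforms GRPO, and a 7B foundation model enhanced by HVO performs comparably to GPT-4 while generating shorter summaries.

Insight: Dynamically balancing multiple objectives is an effective way to obtain high-quality summaries, and hypervolume optimization offers a new approach to multi-objective reinforcement learning.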

Abstract: Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model’s optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
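
To make the hypervolume notion concrete, the sketch below computes the standard hypervolume indicator for two objectives that are both maximized, relative to a reference point. It illustrates only the indicator itself; how HVO turns it into reward adjustments is not described in the abstract, and the function name and score values are invented for this example.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` (both objectives maximized) and bounded below by `ref`."""
    # Keep only points strictly better than the reference in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)  # sweep from largest x to smallest
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the horizontal strip this point contributes above earlier points.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Two summaries with the same score sum (4.0) on two objectives, invented for illustration.
balanced = [(2.0, 2.0)]
lopsided = [(3.5, 0.5)]
print(hypervolume_2d(balanced), hypervolume_2d(lopsided))

# A small Pareto front of three trade-off points.
print(hypervolume_2d([(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]))
```

The balanced point covers more area than the lopsided one even though their score sums match, which suggests why a hypervolume-based signal favors summaries that do well on every objective at once.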

[70] Slot Filling as a Reasoning Task for SpeechLLMs cs.CL

Kadri Hacioglu, Manjunath K E, Andreas Stolcke

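TL;DR: The paper proposes integrating reasoning into speech large language models (speechLLMs) for end-to-end slot filling. It decomposes the task with a chain-of-thought framework, builds a reasoning dataset, and applies supervised fine-tuning. Experiments show that adding reasoning steps improves performance, but a reasoning text LLM developed mainly for math, logic, and coding may be a poor foundation for a reasoning speechLLM. Hybrid speechLLMs perform best.

Details

Motivation: Inspired by recent reasoning LLMs, the authors aim to improve speechLLM performance on slot filling by adding reasoning capability, bringing speech and language models closer together.

Result: Adding reasoning steps improves performance, but text LLMs built for certain domains may be ill-suited to speech tasks. Hybrid speechLLMs outperform those fine-tuned in a single mode.

Insight: SpeechLLM tasks may need task-specific reasoning ability, and general-purpose reasoning does not necessarily transfer. A hybrid design adapts more flexibly to the task.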

Abstract: We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.

[71] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection cs.CL | cs.CY

Ewelina Gajewska, Arda Derbent, Jaroslaw A Chudziak, Katarzyna Budzynska

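TL;DR: The paper studies how infusing large language models (LLMs) with annotator personas (Persona-LLMs) affects their sensitivity to hate speech, particularly the biases linked to shared or differing identities between annotators and targets. Experiments use Google's Gemini and OpenAI's GPT-4.1-mini with two persona-prompting methods: shallow persona prompting and deeply contextualised persona development based on retrieval-augmented generation (RAG), which incorporates richer persona profiles. The paper analyses how in-group and out-group annotator personas affect detection performance and fairness.

Details

Motivation: Automated hate speech detection systems can be biased when handling different social groups. Drawing on psychological insights about group identity, the paper explores whether persona infusion in LLMs can reduce this bias and make detection fairer.

Result: Persona infusion reduces model bias to some extent, with better performance in in-group annotator settings. It also has limitations; for example, bias may not be fully removed in out-group annotator settings.

Insight: Combining social psychology with NLP techniques can make automated hate speech detection fairer, but the effect of persona infusion is limited by identity compatibility and data diversity.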

Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google’s Gemini and OpenAI’s GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models’ detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.

[72] Modeling Turn-Taking with Semantically Informed Gestures cs.CL

Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg

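TL;DR: The paper proposes modeling conversational turn-taking with semantic gestures; by extending a dataset and integrating multimodal information, it confirms that gestures play a complementary role in turn-taking prediction.

Details

Motivation: Humans manage turn-taking through multimodal cues such as speech, gestures, and gaze. Existing work focuses on linguistic and acoustic features and overlooks the complementary role of gestures; the paper aims to fill this gap.

Result: Adding semantic gestures yields consistent gains over baselines, confirming the complementary role of gestures in multimodal turn-taking prediction.

Insight: Semantic gestures carry complementary information about turn-taking, and multimodal integration improves prediction accuracy.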

Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.

[73] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models cs.CL | cs.AI

Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim

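TL;DR: M3-SLU is a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding, with a focus on the challenge of speaker-attributed reasoning.

Details

Motivation: Current MLLMs perform well at speech and text comprehension but still struggle to identify "who said what and when" in natural conversations. M3-SLU aims to fill this gap.

Result: Experiments show that models can capture what was said but often fail to identify who said it, revealing a gap in speaker-aware dialogue understanding.

Insight: M3-SLU provides a challenging benchmark for advancing research on speaker-aware multimodal understanding.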

Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.

[74] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation cs.CL | cs.AI

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell

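TL;DR: AgenticMath proposes a multi-agent pipeline for generating high-quality math question-answer pairs. It filters seed questions, rephrases them in diverse ways, augments answers for logical soundness, and runs a final evaluation, improving LLM performance on mathematical reasoning tasks.

Details

Motivation: Current methods for building high-quality datasets to improve LLM reasoning suffer from low-quality answers and limited information richness, so a more efficient approach is needed.

Result: Experiments show that LLMs fine-tuned on only 30-60K samples match or outperform baselines trained on much larger datasets (e.g., 400K or 2.3M samples) on mathematical reasoning benchmarks.

Insight: Targeted, high-quality data generation is a more efficient way to improve the mathematical reasoning of small models than large-scale, low-quality data.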

Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the best pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.

[75] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts cs.CL

Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang

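TL;DR: LoongRL proposes a data-driven reinforcement learning method for improving advanced long-context reasoning in large models. Its core contribution is KeyChain, which turns short multi-hop QA tasks into high-difficulty long-context tasks; RL training on this data induces a plan-retrieve-reason-recheck reasoning pattern.

Details

Motivation: Long-context reasoning is essential for large language models, but reinforcement learning has so far been applied mainly to short-context reasoning. The advanced reasoning patterns and training data needed for long contexts remain underexplored.

Result: Models trained with LoongRL at a 16K context length generalize to 128K tasks, with absolute accuracy gains of +23.5% on Qwen2.5-7B and +21.1% on Qwen2.5-14B.

Insight: Reinforcement learning can effectively induce advanced reasoning patterns suited to long contexts, and high-difficulty training data is key.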

Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing “Aha” moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
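
The KeyChain idea of hiding the true question behind a chain of UUID keys can be illustrated with a small sketch. This is not the authors' implementation: the sentence templates, the single decoy chain, and the helper names `build_keychain_prompt` and `trace_chain` are all assumptions made for illustration.

```python
import random
import re
import uuid


def build_keychain_prompt(question, distractor_question, documents, hops=3, seed=0):
    """Hide `question` behind a chain of UUID keys scattered among documents.

    Hypothetical format: each link is one sentence naming the next key.
    """
    rng = random.Random(seed)

    def new_key():
        # Seeded UUIDs keep the example reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    def make_chain(final_question):
        keys = [new_key() for _ in range(hops)]
        links = [f"Key {a} points to key {b}." for a, b in zip(keys, keys[1:])]
        links.append(f"Key {keys[-1]} points to the question: {final_question}")
        return keys[0], links

    start_key, true_links = make_chain(question)
    _, decoy_links = make_chain(distractor_question)  # a chain the solver must ignore

    # Shuffle links and documents together so the chain is not contiguous.
    pieces = list(documents) + true_links + decoy_links
    rng.shuffle(pieces)
    instruction = (
        f"Start from key {start_key}, follow the chain, "
        "and answer the question it leads to."
    )
    return "\n".join(pieces + [instruction]), start_key


def trace_chain(prompt, start_key):
    """Follow the chain from `start_key` and return the question it ends at."""
    hop, final = {}, {}
    for line in prompt.splitlines():
        m = re.fullmatch(r"Key (\S+) points to key (\S+)\.", line)
        if m:
            hop[m.group(1)] = m.group(2)
            continue
        m = re.fullmatch(r"Key (\S+) points to the question: (.+)", line)
        if m:
            final[m.group(1)] = m.group(2)
    key = start_key
    while key in hop:
        key = hop[key]
    return final[key]
```

`trace_chain` plays the role of a reference solver: a model facing the prompt has to perform the same hops before it can even see which question to answer, which is what makes the task harder than the original short QA item.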

[76] The Massive Legal Embedding Benchmark (MLEB) cs.CL | cs.AI | cs.IR

Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec

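TL;DR: The paper presents the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB comprises ten expert-annotated datasets spanning multiple jurisdictions, document types, and task types.

Details

Motivation: To fill domain and jurisdictional gaps in open-source legal information retrieval data and to support progress in legal information retrieval.

Result: The authors release the MLEB benchmark together with its code, results, and data, giving legal information retrieval research a broad testbed.

Insight: Datasets that cover multiple jurisdictions and task types matter for evaluating legal retrieval, and they open new directions for natural language processing research in the legal domain.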

Abstract: We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.

[77] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs cs.CL | cs.LG

Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang

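TL;DR: MoE-Prism uses model-system co-design to turn conventional Mixture-of-Experts models into elastic services, offering more fine-grained operating points and improving throughput and latency.

Details

Motivation: Existing Mixture-of-Experts models route among a few monolithic experts, which yields only coarse-grained operating points. This makes it hard to adapt to diverse Service Level Objectives (SLOs) and leads to resource over-provisioning.

Result: Evaluated on three MoE models, MoE-Prism provides over 4 times more distinct, stable operating points, improves throughput by up to 19.9%, and reduces latency by up to 10.36%.

Insight: Model-system co-design enables high-quality elastic services that adapt flexibly to different SLOs and resource constraints.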

Abstract: Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a “quality cliff”, offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained “sub-experts.” This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources. MoE-Prism provides the critical “control knob” to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
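
The top-k routing mechanism that the abstract identifies as the source of coarse operating points can be sketched as follows. This is a generic illustration of one common top-k gating variant (softmax over the selected experts only), not MoE-Prism's code; the function name `top_k_route` and the example logits are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:k]
    # Softmax restricted to the selected experts, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for four experts; only two are activated.
routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Because `k` is an integer and each expert is activated as a whole, the cost of a forward pass can only change in expert-sized steps, which is the granularity the paper sets out to refine.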

[78] Sign Language Translation with Sentence Embedding Supervision cs.CL

Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet

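TL;DR: The paper proposes a novel sign language translation approach that uses sentence embeddings of the target sentences as the supervision signal. It needs no gloss annotations and significantly improves gloss-free translation performance.

Details

Motivation: Conventional sign language translation systems rely on gloss annotations, which are rarely available at scale and differ widely from dataset to dataset. The paper aims to develop a gloss-free translation method that addresses both data scarcity and annotation inconsistency.

Result: On PHOENIX-2014T (German) and How2Sign (American), the method significantly outperforms other gloss-free approaches and narrows the gap to gloss-dependent systems.

Insight: 1. Sentence embeddings can serve as an effective supervision signal for sign language translation; 2. Multilingual embeddings can further improve translation quality; 3. The approach opens a new path for acquiring sign language translation training data.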

Abstract: State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end2end manner or by involving an intermediate step. Unfortunately, gloss labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
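
One way to picture sentence-embedding supervision is as a loss that pulls an embedding predicted from the sign video toward the embedding of the target sentence. The abstract does not state the loss used, so the cosine-distance objective and the function names below are assumptions for illustration only.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def embedding_supervision_loss(predicted, targets):
    """Mean cosine distance between predicted and target sentence embeddings."""
    pairs = list(zip(predicted, targets))
    return sum(cosine_distance(p, t) for p, t in pairs) / len(pairs)


# Toy 3-dimensional embeddings: the first pair is aligned, the second is orthogonal.
predicted = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
targets = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(embedding_supervision_loss(predicted, targets))
```

The target embeddings would come from a sentence encoder applied to the reference translations, which is why this kind of supervision needs raw text only and no manual gloss annotation.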

[79] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision cs.CL

Yasser Hamidullah, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

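TL;DR: The paper proposes SONAR-SLT, which supervises sign language translation (SLT) with language-agnostic multimodal embeddings and supports direct multilingual translation. A coupled augmentation method addresses data scarcity, and experiments show gains over supervision with text-only sentence embeddings.

Details

Motivation: SLT is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. To address this, the paper explores supervision with language-agnostic multimodal embeddings.

Result: The method yields consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings.

Insight: Language-agnostic multimodal supervision combined with coupled augmentation is an effective way to improve the scalability and robustness of SLT.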

Abstract: Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.

[80] Spatio-temporal Sign Language Representation and Translation cs.CL | cs.CV

Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet

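TL;DR: This paper addresses sign language translation with a single model that learns spatio-temporal feature representations and translation jointly. The system performs acceptably on the development set but poorly on the test set.

Details

Motivation: Standard sign language translation approaches use a generic sequence-to-sequence architecture and often do not benefit from temporal features. The paper proposes a single model that learns spatio-temporal representations and translation together in order to improve performance.

Result: The best system reaches 5±1 BLEU on the development set, but performance drops sharply to 0.11±0.06 BLEU on the test set.

Insight: The spatio-temporal representation works reasonably well on the development set, but the drop on the test set may point to a generalization problem or to a large mismatch between the test and training data.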

Abstract: This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test dropped to $0.11\pm0.06$ BLEU points.

[81] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models cs.CL

Kailin Jiang, Ning Jiang, Yuchen Ren, Yuchen Li, Yifan Gao

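TL;DR: The paper proposes MINED, a benchmark for evaluating how well Large Multimodal Models (LMMs) understand time-sensitive knowledge, and explores the feasibility of updating that knowledge through knowledge editing methods.

Details

Motivation: Existing LMMs struggle to understand time-sensitive knowledge, and current benchmarks are static and cannot comprehensively measure this ability.

Result: Gemini-2.5-Pro performs best on MINED (average CEM score of 63.07), while open-source LMMs perform worse. Models do best on organization knowledge and worst on sport knowledge. Knowledge editing methods are effective in single-edit scenarios.

Insight: 1. LMMs still need to improve their understanding of time-sensitive knowledge; 2. Knowledge editing offers a feasible path for dynamically updating the knowledge in LMMs.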

Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.

[82] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos cs.CL | cs.AI | cs.LG

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang

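TL;DR: VideoAgentTrek proposes a pipeline that automatically mines GUI interaction data from publicly available videos without manual annotation. Its Video2Action module extracts precise temporal boundaries and action content, which improves the performance of computer-use agents.

Details

Motivation: Training computer-use agents requires large amounts of labeled data, but manual annotation is expensive. Publicly available videos (such as YouTube tutorials) contain a great deal of implicit GUI interaction information but lack explicit labels.

Result: On OSWorld-Verified, the task success rate rises from 9.3% to 15.8% (a 70% relative improvement); on AgentNetBench, step accuracy rises from 64.1% to 69.3%.

Insight: Internet videos can serve as a source of high-quality supervision and offer a scalable data solution for agent training, underlining the potential of unlabeled data for computer-use agents.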

Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
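
The "70% relative improvement" follows directly from the two success rates quoted above. A short check, using only the numbers in the abstract (the helper name `relative_improvement` is ours):

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


# OSWorld-Verified task success rate, in percent (SFT-only baseline vs. full method).
osworld = relative_improvement(9.3, 15.8)
# AgentNetBench step accuracy, in percent.
agentnet = relative_improvement(64.1, 69.3)

print(f"OSWorld-Verified: +{15.8 - 9.3:.1f} points absolute, {osworld:.1%} relative")
print(f"AgentNetBench: +{69.3 - 64.1:.1f} points absolute, {agentnet:.1%} relative")
```

The OSWorld-Verified gain is 6.5 percentage points in absolute terms, which is about 70% of the 9.3% baseline. The same calculation puts the AgentNetBench gain at roughly 8% relative, a figure the abstract does not quote.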

[83] What is the Best Sequence Length for BABYLM? cs.CL

Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery

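TL;DR: The paper studies how sequence length affects pretraining in the BabyLM Challenge and finds that the optimal length depends on task and architecture: shorter sequences suit grammatical tasks, while longer sequences suit morphological analogy tasks.

Details

Motivation: Transformer language models typically use a fixed-length context window, yet many BabyLM Challenge submissions use much shorter sequence lengths. The study aims to determine the best sequence length for BabyLM pretraining.

Result: Short sequences (512 tokens) are sufficient for grammatical tasks, while long sequences (2048 tokens) benefit morphological analogy tasks. The best length depends on the task and the model architecture.

Insight: Sequence length should be chosen according to the task and model architecture; a single fixed length may not fit every scenario.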

Abstract: Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.

[84] Lookahead Routing for Large Language Models cs.CL

Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan

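TL;DR: The paper proposes Lookahead, a framework that improves LLM routing decisions by predicting potential model outputs. It avoids the limitations of input-only classification and achieves an average performance gain of 7.7% over the state of the art.

Details

Motivation: Existing LLM routing methods classify based solely on the input query and ignore the information in potential outputs, which leads to poor routing decisions on complex queries.

Result: An average performance gain of 7.7% across seven public benchmarks.

Insight: Predicting potential outputs can noticeably improve routing quality, especially for complex or ambiguous queries.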

Abstract: Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that “foresees” potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at https://github.com/huangcb01/lookahead-routing.

[85] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment cs.CL | eess.AS

Maureen de Seyssel, Eeshan Gunesh Dhekane

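TL;DR: The paper proposes a unified taxonomy to address the fragmented evaluation of speech foundation models. It classifies existing evaluations along three orthogonal axes (the evaluation aspect, the required model capabilities, and the task or protocol requirements) and provides a framework for choosing suitable evaluation methods.

Details

Motivation: Evaluation of speech foundation models is fragmented across tasks and model types and lacks a unified standard. Different models excel at different aspects of speech processing and therefore need tailored evaluation protocols.

Result: The taxonomy systematically classifies a broad set of evaluation methods and identifies under-covered areas such as prosody, interaction, and reasoning.

Insight: The taxonomy guides the selection and design of evaluations and also reveals priorities for future benchmark design, helping make speech model evaluation more unified and comprehensive.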

Abstract: Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.

[86] Conditions for Catastrophic Forgetting in Multilingual Translation cs.CL

Danni Liu, Jan Niehues

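TL;DR: The paper examines the conditions under which fine-tuning multilingual foundation models triggers catastrophic forgetting. It finds that the relative scale between model and data size is a primary factor, and that a model's instruction-following ability matters more than its architecture.

Details

Motivation: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, but the literature lacks a systematic study of the conditions under which forgetting occurs.

Result: 1. The relative scale between model and data size is a primary determinant of forgetting; 2. Instruction-following ability matters more than architecture; 3. Cross-lingual alignment can mitigate forgetting.

Insight: Instruction-following ability and cross-lingual alignment are key to retaining multilingual knowledge, while parameter-efficient fine-tuning offers no clear advantage over full fine-tuning.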

Abstract: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely-documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model’s instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.

[87] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark cs.CL | cs.AI | cs.CV | cs.DL

Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson

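TL;DR: This paper introduces a new task, extracting Latin fragments from mixed-language historical books, and evaluates large foundation models on a multimodal dataset. The results show that contemporary models can detect Latin reliably.

Details

Motivation: Historical books often mix several languages, in particular Latin with other languages, which makes automatic extraction of Latin challenging. The performance of existing models on this task had not been systematically evaluated.

Result: Experiments show that current models can reliably detect Latin in multilingual historical books, providing a benchmark for research in this area.

Insight: Large foundation models perform well on this multimodal, multilingual task, but further work is needed to handle the complex layouts and language mixing found in historical books.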

Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models’ capabilities and limits for this task.

[88] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent cs.CL | cs.AI

Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma

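TL;DR: The paper proposes Style Attack Disguise (SAD), an adversarial attack based on font styles that exploits the gap between how humans and NLP models perceive stylized text. Experiments confirm its effectiveness on tasks such as sentiment classification and machine translation and show its potential threat to multimodal tasks.

Details

Motivation: With the growth of social media, users widely use stylistic fonts and font-like emoji to express individuality, but NLP models may process these characters as unrelated tokens, which degrades model performance. The study exploits this human-model perception gap to design an adversarial attack.

Result: Experiments show that SAD successfully disrupts traditional models, large language models (LLMs), and commercial services on sentiment classification and machine translation. SAD also poses potential threats to multimodal tasks such as text-to-image and text-to-speech generation.

Insight: Stylized text is visually friendly to humans but can be a weak point for NLP models. This human-model perception gap opens a new research direction for adversarial attacks.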

Abstract: With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD’s strong attack performance. We also show SAD’s potential threats to multimodal tasks including text-to-image and text-to-speech generation.

[89] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation cs.CL

Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov

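TL;DR: LlavaCode compresses code into compact representations, which greatly shortens the context needed for retrieval-augmented code generation, improves generation quality, and lowers latency.

Details

Motivation: Retrieval-augmented generation works well for code completion, but long contexts slow down inference and hurt the experience in interactive settings such as IDEs.

Result: Experiments show that compressed context significantly improves the EM and ES metrics while reducing Time-to-First-Token (TTFT) by 20-38%.

Insight: Compact code representations are an effective way to address the latency of retrieval-augmented generation.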

Abstract: Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by a code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we can significantly increase the EM and ES metrics of a coding model with a negligible latency increase. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.

[90] Unraveling Emotions with Pre-Trained Models cs.CL | cs.AI

Alejandro Pajón-Sanmartín, Francisco De Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro

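TL;DR: The paper compares fine-tuned pre-trained models with general-purpose LLMs for emotion detection and stresses the importance of structured prompt design and emotion grouping. Experiments show that a fine-tuned model attains metrics above 70% on emotion recognition.

Details

Motivation: Emotion recognition in open text faces challenges such as contextual ambiguity and linguistic variability, and applying generalist models directly has limited effect. The paper therefore studies the effectiveness of fine-tuning and prompt engineering.

Result: A fine-tuned pre-trained model attains emotion recognition metrics above 70%, while LLMs need structured prompts and emotion grouping to improve their performance.

Insight: Structured prompts and emotion grouping are key to improving LLM performance on emotion analysis, and fine-tuned models perform better at emotion recognition in open text.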

Abstract: Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.

[91] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference cs.CL

Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi

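TL;DR: DiffAdapt is a lightweight framework that selects a reasoning strategy for each question based on its difficulty and reasoning trace entropy, reducing the number of tokens a large language model (LLM) uses at inference time while maintaining or improving accuracy.

Details

Motivation: Current reasoning LLMs solve problems well but are inefficient because they generate long reasoning traces. The study finds that token-probability entropy is high on easy problems, which indicates "overthinking" and calls for a more efficient inference strategy.

Result: Across five models and eight benchmarks, DiffAdapt achieves comparable or improved accuracy while reducing token usage by up to 22.4%.

Insight: Models overthink easy problems, and adapting the inference strategy to difficulty is an effective way to make LLM reasoning more efficient.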

Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on its difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM but instead trains a small probe that classifies the LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
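
The strategy-selection step can be sketched as a small linear probe over the final hidden state that picks one of three fixed inference configurations. The sketch below is an illustration only: the prompts, temperatures, token limits, probe weights, and names (`Strategy`, `select_strategy`) are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    prompt: str
    temperature: float
    max_tokens: int


# Hypothetical settings; the abstract only says each strategy fixes these three fields.
STRATEGIES = {
    "easy": Strategy("Answer directly and briefly.", 0.0, 512),
    "normal": Strategy("Think step by step, then answer.", 0.6, 2048),
    "hard": Strategy("Reason carefully and verify each step.", 0.8, 8192),
}
LABELS = ("easy", "normal", "hard")


def probe_logits(hidden_state, weights, bias):
    """Linear probe: one logit per difficulty label."""
    return [
        sum(w * h for w, h in zip(row, hidden_state)) + b
        for row, b in zip(weights, bias)
    ]


def select_strategy(hidden_state, weights, bias):
    """Pick the inference strategy whose probe logit is highest."""
    logits = probe_logits(hidden_state, weights, bias)
    label = LABELS[max(range(len(logits)), key=logits.__getitem__)]
    return label, STRATEGIES[label]


# Toy probe with identity weights over a 3-dimensional "hidden state".
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
label, strategy = select_strategy([0.1, 0.2, 0.9], weights, bias)
print(label, strategy.max_tokens)
```

Only the probe would be trained in this setup. The base model stays frozen, which is what keeps the adaptation inexpensive.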

[92] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation cs.CL | I.2.6; C.2.4; C.3

Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe

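TL;DR: CoSense-LLM is an edge-first framework that turns multimodal sensor streams into semantic tokens and coordinates with large language models while meeting latency, energy, bandwidth, and privacy constraints.

Details

Motivation: In large model deployments, semantic understanding, privacy protection, and low latency pull against one another. CoSense-LLM aims to unify these goals in a single edge-first design, particularly for interference-prone environments.

Result: In home, office, and clinic deployments, the system achieves sub-second latency, reduces bandwidth consumption, and improves factual consistency through local retrieval.

Insight: An edge-first design can treat semantics, privacy, and low latency as joint optimization goals, which suits resource-constrained, interference-prone environments.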

Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
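
A cost- and uncertainty-aware routing policy of the kind PromptRouter is described as implementing might look like the sketch below. The abstract names the three routes but not the decision rule, so the thresholds, the cost-budget check, and the function name `choose_route` are assumptions for illustration.

```python
from enum import Enum


class Route(Enum):
    EDGE_ONLY = "edge-only generation"
    EDGE_RAG = "edge plus retrieval"
    CLOUD = "compact cloud escalation"


def choose_route(uncertainty, cloud_cost, cost_budget, low=0.2, high=0.6):
    """Pick a route from a calibrated uncertainty in [0, 1] and a cost budget.

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    if uncertainty < low:
        # Confident enough to answer on the device.
        return Route.EDGE_ONLY
    if uncertainty < high or cloud_cost > cost_budget:
        # Moderately unsure, or escalation is too expensive: ground locally.
        return Route.EDGE_RAG
    # Highly uncertain and affordable: escalate a compact request to the cloud.
    return Route.CLOUD


for u, cost in [(0.1, 1.0), (0.4, 1.0), (0.9, 10.0), (0.9, 1.0)]:
    print(u, cost, choose_route(u, cost, cost_budget=5.0).value)
```

Keeping the cloud path as the last resort matches the abstract's stated preference for local, retrieval-grounded responses, which is what reduces inter-tier token and bandwidth costs.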

[93] Are Large Language Models Sensitive to the Motives Behind Communication? cs.CL | cs.AI | cs.LGPDF

Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths

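TL;DR: This paper studies whether large language models (LLMs) can account for the motives behind human communication. It finds that LLMs can rationally discount biased information to some extent but do so less well in realistic settings, and that a steering intervention improves their sensitivity.

Details

Motivation: Human communication is motivated, and LLMs need to understand these motives to operate effectively in the real world. Whether LLMs have this capacity has been unclear.

Result: In controlled experiments, LLM behavior is close to rational models of learning from motivated testimony, but it tracks those models far less closely on real sponsored adverts; a simple steering intervention substantially improves the correspondence.

Insight: LLMs have a basic sensitivity to motives, but need further improvement to generalize to complex real-world settings.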

Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans’ intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source – for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs’ behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents’ information ecosystems. In these settings, we find that LLMs’ inferences do not track the rational models’ predictions nearly as closely – partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.

[94] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings cs.CL | cs.AIPDF

Cesar Gonzalez-Gutierrez, Dirk Hovy

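TL;DR: This paper empirically studies how prompting affects the quality of pre-trained language models' internal representations. It finds that prompt relevance and representation quality do not consistently go together, which challenges a common assumption.

Details

Motivation: Understanding how prompts affect the internal representations of pre-trained language models, especially in zero-shot tasks, helps reveal how models solve tasks through prompting.

Result: Prompting affects representation quality, but the effect does not consistently correlate with the relevance of the prompt to the target task; more relevant prompts do not always yield better representations.

Insight: The mechanism by which prompts act may be more complex than a simple relevance assumption; other potential factors, such as prompt diversity or the model's internal attention mechanisms, need further study.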

Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.

[95] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration cs.CL | cs.AI | cs.LGPDF

Xichen Zhang, Sitong Wu, Haoru Tan, Shaozuo Yu, Yinghao Zhu

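TL;DR: This paper proposes the SmartSwitch inference framework to address the performance bottleneck caused by "underthinking" in large language models on complex reasoning tasks. The framework monitors the reasoning process, detects underthinking, and steers the model toward deeper exploration, significantly improving performance.

Details

Motivation: Large language models perform well on complex reasoning tasks, but the accompanying underthinking problem (frequently switching thoughts without exploring them in depth) limits performance and token efficiency. This paper aims to address that problem.

Result: On mathematical reasoning benchmarks, SmartSwitch significantly improves the performance of models of different sizes.

Insight: Intervening on underthinking is key to improving LLM reasoning, and SmartSwitch offers a simple and effective solution.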

Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of “underthinking”, where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model’s reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a “deepening prompt” to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.

[96] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders cs.CL | cs.AI | cs.LGPDF

Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao

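TL;DR: AdaSPEC improves speculative decoding efficiency through selective knowledge distillation, filtering out difficult-to-fit tokens to raise the token acceptance rate.

Details

Motivation: Speculative decoding (SD) depends on alignment between a small draft model and a large target model. Conventional knowledge distillation minimizes KL divergence over all tokens, which is misaligned with SD's objective of maximizing the token acceptance rate and leads to suboptimal performance.

Result: Across arithmetic reasoning, instruction following, coding, and summarization, with 31M/1.4B and 350M/2.7B parameter configurations, AdaSPEC raises token acceptance rates by up to 15% and outperforms DistillSpec.

Insight: Selective knowledge distillation fits SD's objective better, avoids the bottleneck caused by the draft model's limited capacity, and preserves generation quality.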

Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model’s knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
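The selective filtering idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not state how the reference model scores difficulty, so the rule used here (KL divergence between the target and reference distributions at each position) is an assumption, as are the names `kl_divergence`, `selective_kd_loss`, and `keep_ratio`.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def selective_kd_loss(target_probs, draft_probs, ref_probs, keep_ratio=0.5):
    """Average KL(target || draft) over the positions the reference model fits best."""
    # Assumption: a position is "hard" when the reference model is far from the target.
    difficulty = [kl_divergence(t, r) for t, r in zip(target_probs, ref_probs)]
    n_keep = max(1, int(len(difficulty) * keep_ratio))
    # Keep the n_keep easiest positions; the rest are excluded from the loss.
    keep = sorted(range(len(difficulty)), key=difficulty.__getitem__)[:n_keep]
    per_token = [kl_divergence(target_probs[i], draft_probs[i]) for i in keep]
    return sum(per_token) / len(per_token), sorted(keep)

# Toy example: four token positions, vocabulary of size two.
target = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
ref = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
draft = [[0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
loss, kept = selective_kd_loss(target, draft, ref, keep_ratio=0.5)
```

In this toy input the reference model matches the target exactly at positions 0 and 2, so only those two positions contribute to the distillation loss.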

[97] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging cs.CLPDF

Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru

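TL;DR: This paper studies model merging as an alternative way to adapt to code-mixed NLP tasks. Combining a multilingual base model with continued pre-training on unlabeled code-mixed text significantly improves classification performance.

Details

Motivation: Code-mixed NLP tasks face uneven resource availability and context-understanding challenges when handling multilingual input, and conventional approaches such as full fine-tuning or continued pre-training struggle to use unlabeled data efficiently.

Result: Merged models outperform conventional approaches on English-Hindi and English-Spanish classification, with F1 gains of 2-5 points; they also do better on cross-lingual transfer.

Insight: Model merging uses unlabeled data more efficiently and suits low-resource settings; zero-/few-shot prompting with large LLMs lags behind fine-tuned approaches on code-mixed tasks.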

Abstract: We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge the checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach on sentence classification tasks (sentiment and hate speech) in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2–5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
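Step (ii) of the recipe above can be illustrated with the simplest task-vector style merge, in which the difference between the adapted checkpoint and the base model is scaled and added back to the base. This is a sketch under assumptions: the abstract names TV/TIES variants but gives no formula, and the function name `merge_task_vector`, the scaling factor `alpha`, and the toy parameter dictionaries are hypothetical.

```python
def merge_task_vector(base, adapted, alpha=0.5):
    """Return base + alpha * (adapted - base) for every parameter tensor."""
    merged = {}
    for name, base_w in base.items():
        # Task vector: what continued pre-training changed relative to the base.
        delta = [a - b for a, b in zip(adapted[name], base_w)]
        merged[name] = [b + alpha * d for b, d in zip(base_w, delta)]
    return merged

# Toy checkpoints: one flattened parameter tensor each.
base = {"layer.weight": [1.0, 2.0, 3.0]}
cpt = {"layer.weight": [2.0, 2.0, 1.0]}
merged = merge_task_vector(base, cpt, alpha=0.5)
```

With `alpha=0` the result is the base model and with `alpha=1` it is the CPT checkpoint; the merged checkpoint would then be fine-tuned on labeled data as in step (iii).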

[98] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers cs.CL | cs.IRPDF

Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang

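TL;DR: This paper proposes ToolDreamer, a framework that improves tool retrieval by using LLM-generated hypothetical tool descriptions (TDs), addressing the mismatch between user requests and the language of TDs.

Details

Motivation: Existing retrieval models rank tools by the similarity between a user query and a tool description (TD), but user requests are often poorly aligned with the language of TDs, which leads to suboptimal retrieval.

Result: Experiments show that ToolDreamer improves retriever performance both with and without training, demonstrating its flexibility.

Insight: Offloading part of the reasoning burden to the retriever lets an LLM handle large tool sets without overloading its context window.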

Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD’s. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.

[99] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning cs.CL | cs.AI | cs.LGPDF

Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu

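TL;DR: Scaf-GRPO is a progressive training framework that injects tiered hints when a model's learning has plateaued, helping LLMs solve problems beyond their current capabilities and significantly improving performance on mathematical reasoning tasks.

Details

Motivation: When LLMs face problems far beyond their current capabilities, existing reinforcement learning methods hit a "learning cliff": the learning gradient vanishes and progress stalls.

Result: On the AIME24 benchmark, Scaf-GRPO raises the pass@1 score of Qwen2.5-Math-7B by a relative 44.3% over a vanilla GRPO baseline.

Insight: Through tiered guidance, Scaf-GRPO offers an effective way to overcome the learning cliff and extends the frontier of autonomous reasoning.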

Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the “learning cliff” phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model’s independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO’s effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model’s ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
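The collapse of the advantage calculation described above is easy to reproduce. The sketch below uses the commonly described GRPO standardization (reward minus group mean, divided by group standard deviation); the function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Learning cliff: every rollout fails, so every advantage is exactly zero
# and the prompt contributes nothing to the policy gradient.
cliff = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

A hint that lets even one rollout in the group succeed restores a non-zero spread of rewards, which is the situation the tiered hints are meant to create.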

[100] Hubble: a Model Suite to Advance the Study of LLM Memorization cs.CL | cs.LGPDF

Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu

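TL;DR: Hubble is an open-source suite of LLMs for studying LLM memorization. Through standard and perturbed model variants, it explores the risks of memorizing sensitive data and strategies for mitigating them.

Details

Motivation: To study memorization in large language models (LLMs), particularly how sensitive data is memorized and forgotten during training, and to provide empirical support for mitigating privacy risks.

Result: Memorization of sensitive data depends on its frequency relative to the size of the training corpus and on when it appears in training: relatively more frequent data is memorized more readily, while data inserted early and not seen again can be forgotten.

Insight: Hubble provides a benchmark tool for memorization research and a new testbed for privacy-related work such as membership inference and machine unlearning.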

Abstract: We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models – standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens – establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.

cs.LG [Back]

[101] BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping cs.LG | cs.AI | cs.CLPDF

Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen

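TL;DR: BAPO uses balanced policy optimization with adaptive clipping to address the sharp decline in policy entropy and the unstable optimization seen in off-policy reinforcement learning, significantly improving training efficiency and model performance.

Details

Motivation: Off-policy reinforcement learning (RL) improves sample efficiency when training large language models (LLMs), but policy entropy declines quickly and optimization becomes unstable or even collapses. BAPO aims to address these problems.

Result: On the AIME 2024 and AIME 2025 benchmarks, the 7B BAPO model surpasses open-source counterparts and the 32B model outperforms leading proprietary systems, with fast and stable training.

Insight: The fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, pushing the policy toward over-exploitation. BAPO addresses this with adaptive clipping.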

Abstract: Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Applying RL in off-policy settings, where stale data from past policies are used for training, improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios, including sample replay and partial rollout, BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
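The abstract does not give the rule BAPO uses to adjust its clipping bounds, so the sketch below only shows the ingredient that such a rule would act on: a PPO-style clipped surrogate whose lower and upper bounds are set separately. The function name `clipped_surrogate` and the parameters `c_low` and `c_high` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, c_low, c_high):
    """PPO-style per-token objective with separate lower and upper clip bounds."""
    clipped_ratio = min(max(ratio, c_low), c_high)
    # Pessimistic choice between the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

# A positive-advantage token whose probability ratio has grown to 1.5:
# with a fixed upper bound of 1.2 it is clipped and yields no further gradient.
fixed = clipped_surrogate(1.5, 1.0, c_low=0.8, c_high=1.2)

# Raising only the upper bound lets the same token keep contributing.
widened = clipped_surrogate(1.5, 1.0, c_low=0.8, c_high=1.6)

# A negative-advantage token whose ratio fell to 0.5 is held at the lower bound.
negative = clipped_surrogate(0.5, -1.0, c_low=0.8, c_high=1.2)
```

Moving the two bounds independently changes how much positive-advantage and negative-advantage tokens contribute to the update, which is the balance the abstract says BAPO adjusts dynamically.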

[102] NeuroAda: Activating Each Neuron’s Potential for Parameter-Efficient Fine-Tuning cs.LG | cs.AI | cs.CLPDF

Zhi Zhang, Yixian Shen, Congfeng Cao, Ekaterina Shutova

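TL;DR: NeuroAda is a novel parameter-efficient fine-tuning (PEFT) method that selects important parameters and introduces bypass connections for them, enabling fine-grained fine-tuning while keeping memory efficiency high; it performs strongly on 23+ tasks.

Details

Motivation: Existing PEFT methods trade off representational capacity against memory efficiency; NeuroAda aims to reconcile this trade-off.

Result: State-of-the-art performance on 23+ tasks with as little as ≤0.02% trainable parameters and up to 60% lower CUDA memory usage.

Insight: Combining selective parameter adaptation with bypass connections balances performance and resource consumption, offering a new direction for PEFT.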

Abstract: Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq 0.02\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
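The select-then-bypass idea can be sketched for a single neuron. This is an illustration under assumptions rather than the released code: the abstract does not say how important connections are chosen, so selection by weight magnitude is assumed here, and the names `select_top_k` and `BypassNeuron` are hypothetical.

```python
def select_top_k(weight_row, k):
    """Indices of the k largest-magnitude input connections (assumed criterion)."""
    order = sorted(range(len(weight_row)), key=lambda i: abs(weight_row[i]), reverse=True)
    return sorted(order[:k])

class BypassNeuron:
    """One output neuron with frozen weights and k trainable bypass deltas."""

    def __init__(self, weights, k):
        self.frozen = list(weights)          # original parameters, never updated
        self.index = select_top_k(weights, k)
        self.delta = [0.0] * k               # the only trainable values

    def forward(self, x):
        out = sum(w * xi for w, xi in zip(self.frozen, x))
        # Bypass path: a small correction on the selected connections only.
        out += sum(d * x[i] for d, i in zip(self.delta, self.index))
        return out

neuron = BypassNeuron([0.1, -2.0, 0.5, 1.5], k=2)
before = neuron.forward([1.0, 1.0, 1.0, 1.0])
neuron.delta[0] = 0.5                        # stand-in for one training update
after = neuron.forward([1.0, 1.0, 1.0, 1.0])
```

Because the deltas start at zero, the adapted neuron initially reproduces the frozen model, and only `k` values per neuron need gradients and optimizer state.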

[103] FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation cs.LG | cs.CVPDF

Chirag Padubidri, Pranesh Velmurugan, Andreas Lanitis, Andreas Kamilaris

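TL;DR: The paper improves the accuracy of frog species distribution models using deep learning and data imputation. Data balancing substantially reduces counting error, the multimodal ensemble outperforms individual models, and fusing image and tabular data raises classification accuracy to 84.9%.

Details

Motivation: Traditional species distribution monitoring has incomplete coverage, and sparse or missing data limits model performance. Deep learning and data preprocessing are used to close these gaps and improve the predictive accuracy of ecological models.

Result: MAE falls from 189 to 29; the multimodal model reaches 84.9% classification accuracy with an AUC of 0.90 and generalizes well.

Insight: Multimodal learning and data preprocessing are crucial for ecological modeling with sparse or incomplete data, enabling more precise and scalable biodiversity monitoring.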

Abstract: Monitoring species distribution is vital for conservation efforts, enabling the assessment of environmental impacts and the development of effective preservation strategies. Traditional data collection methods, including citizen science, offer valuable insights but remain limited in coverage and completeness. Species Distribution Modelling (SDM) helps address these gaps by using occurrence data and environmental variables to predict species presence across large regions. In this study, we enhance SDM accuracy for frogs (Anura) by applying deep learning and data imputation techniques using data from the “EY - 2022 Biodiversity Challenge.” Our experiments show that data balancing significantly improved model performance, reducing the Mean Absolute Error (MAE) from 189 to 29 in frog counting tasks. Feature selection identified key environmental factors influencing occurrence, optimizing inputs while maintaining predictive accuracy. The multimodal ensemble model, integrating land cover, NDVI, and other environmental inputs, outperformed individual models and showed robust generalization across unseen regions. The fusion of image and tabular data improved both frog counting and habitat classification, achieving 84.9% accuracy with an AUC of 0.90. This study highlights the potential of multimodal learning and data preprocessing techniques such as balancing and imputation to improve predictive ecological modeling when data are sparse or incomplete, contributing to more precise and scalable biodiversity monitoring.

[104] Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning cs.LG | cs.AI | cs.CLPDF

Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang

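TL;DR: This technical report presents the Ring-linear model series, which uses a hybrid architecture of linear attention and softmax attention to significantly reduce the computational cost of long-context inference, and tunes the ratio between the two for efficient training and inference.

Details

Motivation: To address the high compute and I/O overhead of long-context inference with an efficient hybrid attention architecture.

Result: Inference cost falls to 1/10 of that of a 32-billion-parameter dense model, training efficiency improves by 50%, and the models maintain SOTA performance on multiple challenging reasoning benchmarks.

Insight: Hybrid attention is efficient and low-cost for long-context tasks, and operator-library optimization is critical to training and inference performance.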

Abstract: In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.

Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang

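TL;DR: This paper systematically surveys caching methods for diffusion models, presenting Diffusion Caching as a training-free, architecture-agnostic, and efficient inference paradigm that cuts cost by reusing redundant computation in the diffusion process.

Details

Motivation: Diffusion models have become central to generative AI thanks to their generation quality and controllability, but their multi-step iterations and complex backbone networks cause high computational overhead and latency, which limits real-time applications. Existing acceleration techniques suffer from limited applicability, high training costs, or quality degradation.

Result: Diffusion Caching reduces computational overhead, applies across diverse tasks, and offers an efficient inference framework for multimodal and interactive applications.

Insight: Caching methods have evolved from static reuse to dynamic prediction, gaining flexibility and generality and pointing toward real-time, efficient generative AI.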

Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of Efficient Generative Intelligence.
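The "static reuse" end of the spectrum described above can be sketched as a fixed refresh schedule: an expensive block is recomputed every few denoising steps and its cached output is reused in between. The function name `run_with_static_cache`, the `refresh_every` parameter, and the stand-in block are hypothetical and do not correspond to any specific method in the survey.

```python
def run_with_static_cache(num_steps, expensive_block, refresh_every=2):
    """Recompute the block every `refresh_every` steps and reuse the cache otherwise."""
    cached = None
    calls = 0
    outputs = []
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = expensive_block(step)   # full computation on refresh steps
            calls += 1
        outputs.append(cached)               # reused feature on the other steps
    return outputs, calls

# Stand-in for a costly backbone block: its output just encodes the step index.
outputs, calls = run_with_static_cache(6, lambda step: step * 10, refresh_every=2)
```

Here six steps need only three block evaluations; a dynamic-prediction scheme would replace the fixed schedule with a per-step decision or a forecast of the feature.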

[106] Blackbox Model Provenance via Palimpsestic Membership Inference cs.LG | cs.CLPDF

Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts

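TL;DR: This paper studies how to show, by querying a blackbox model or by observing its text, that it derives from a particular trained model. It proposes statistical tests based on training data order and validates them on language models of various sizes.

Details

Motivation: To solve the blackbox model provenance problem, that is, proving that a blackbox model is derived from a specific trained model. This matters for protecting model copyright and tracing accountability.

Result: In the query setting, the method reaches p-values on the order of at most 1e-8 in all but six cases; in the observational setting, the second approach distinguishes the model's origin from as little as a few hundred tokens.

Insight: The ordering of a language model's training data can serve as effective evidence for model provenance, even when only a small amount of text is available.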

Abstract: Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice’s model to produce text. Can Alice prove that Bob is using her model, either by querying Bob’s derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem (in which the null hypothesis is that Bob’s model or text is independent of Alice’s randomized training run) and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice’s model using test statistics that capture correlation between Bob’s model or text and the ordering of training examples in Alice’s training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice’s training data. In the query setting, we directly estimate (via prompting) the likelihood Bob’s model gives to Alice’s training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model’s training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob’s text overlapping with spans of Alice’s training examples and 2) the likelihood of Bob’s text with respect to different versions of Alice’s model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob’s text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
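The query-setting test can be sketched as a permutation test on the rank correlation between training position and the log-likelihood a model assigns to each example. This is a minimal sketch under assumptions: the abstract does not name the exact test statistic, so Spearman correlation is assumed, the data below is synthetic, and all function names are hypothetical.

```python
import random

def rank(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def permutation_p_value(train_positions, log_likelihoods, n_perm=2000, seed=0):
    """One-sided p-value under the null that training order is a random shuffle."""
    rng = random.Random(seed)
    observed = spearman(train_positions, log_likelihoods)
    shuffled = list(train_positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(shuffled, log_likelihoods) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data: later training examples get slightly higher log-likelihood.
noise = random.Random(1)
positions = list(range(30))
log_liks = [-5.0 + 0.05 * t + noise.uniform(-0.1, 0.1) for t in positions]
rho, p_value = permutation_p_value(positions, log_liks)
```

With 2000 permutations the smallest reportable p-value is 1/2001, so p-values as small as those quoted in the abstract would need far more permutations or an analytic null distribution.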

cs.CR [Back]

[107] From See to Shield: ML-Assisted Fine-Grained Access Control for Visual Data cs.CR | cs.CV | cs.LGPDF

Mete Harun Akcay, Buse Gul Atli, Siddharth Prakash Rao, Alexandros Bakas

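TL;DR: This paper proposes a machine-learning-assisted fine-grained access control system for protecting sensitive information in visual data. It combines automated detection, encryption, and policy management modules and demonstrates efficiency and scalability.

Details

Motivation: As the volume of stored data grows, identifying and protecting sensitive information in large repositories becomes challenging, especially when data is shared with users in multiple roles. A solution that selectively protects sensitive regions is needed.

Result: The system detects privacy-sensitive objects effectively (macro-averaged F1 up 5%, mean Average Precision up 10%), and average policy-enforced decryption time is under 1 second per image.

Insight: A hybrid encryption scheme and modular design balance efficiency and security and suit fine-grained access control for large-scale visual data.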

Abstract: As the volume of stored data continues to grow, identifying and protecting sensitive information within large repositories becomes increasingly challenging, especially when shared with multiple users with different roles and permissions. This work presents a system architecture for trusted data sharing with policy-driven access control, enabling selective protection of sensitive regions while maintaining scalability. The proposed architecture integrates four core modules that combine automated detection of sensitive regions, post-correction, key management, and access control. Sensitive regions are secured using a hybrid scheme that employs symmetric encryption for efficiency and Attribute-Based Encryption for policy enforcement. The system supports efficient key distribution and isolates key storage to strengthen overall security. To demonstrate its applicability, we evaluate the system on visual datasets, where Privacy-Sensitive Objects in images are automatically detected, reassessed, and selectively encrypted prior to sharing in a data repository. Experimental results show that our system provides effective PSO detection, increases macro-averaged F1 score (5%) and mean Average Precision (10%), and maintains an average policy-enforced decryption time of less than 1 second per image. These results demonstrate the effectiveness, efficiency and scalability of our proposed solution for fine-grained access control.

cs.RO [Back]

[108] $\nabla$-SDF: Learning Euclidean Signed Distance Functions Online with Gradient-Augmented Octree Interpolation and Neural Residual cs.RO | cs.AI | cs.CVPDF

Zhirui Dai, Qihao Qian, Tianxing Fan, Nikolay Atanasov

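TL;DR: This paper proposes $\nabla$-SDF, a hybrid method that combines gradient-augmented octree interpolation with a neural residual to learn Euclidean signed distance functions (SDFs) online. It outperforms the state of the art in accuracy and efficiency.

Details

Motivation: Existing SDF estimation methods are limited for online, large-scale reconstruction: methods based on discrete volumetric structures compromise the continuity and differentiability of the SDF, while neural network methods are less efficient and prone to catastrophic forgetting and memory limitations. A solution that is both efficient and accurate is needed.

Result: Experiments show that $\nabla$-SDF outperforms the state of the art in accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.

Insight: By combining the strengths of explicit and implicit methods, $\nabla$-SDF shows promise for large-scale, non-truncated SDF reconstruction.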

Abstract: Estimation of signed distance functions (SDFs) from point cloud data has been shown to benefit many robot autonomy capabilities, including localization, mapping, motion planning, and control. Methods that support online and large-scale SDF reconstruction tend to rely on discrete volumetric data structures, which affect the continuity and differentiability of the SDF estimates. Recently, using implicit features, neural network methods have demonstrated high-fidelity and differentiable SDF reconstruction but they tend to be less efficient, can experience catastrophic forgetting and memory limitations in large environments, and are often restricted to truncated SDFs. This work proposes $\nabla$-SDF, a hybrid method that combines an explicit prior obtained from gradient-augmented octree interpolation with an implicit neural residual. Our method achieves non-truncated (Euclidean) SDF reconstruction with computational and memory efficiency comparable to volumetric methods and differentiability and accuracy comparable to neural network methods. Extensive experiments demonstrate that $\nabla$-SDF outperforms the state of the art in terms of accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.

[109] GigaBrain-0: A World Model-Powered Vision-Language-Action Model cs.RO | cs.CVPDF

GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang

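TL;DR: GigaBrain-0 is a vision-language-action (VLA) model powered by world models. It reduces reliance on real robot data by generating diverse data, improves policy robustness through RGBD input modeling and CoT supervision, and significantly improves cross-task generalization.

Details

Motivation: Training VLA models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect and limits scalability and generalization.

Result: Strong generalization across tasks, especially under variations in texture, color, object placement, and camera viewpoint.

Insight: Generated data can compensate for scarce real data, and combining multimodal input with a reasoning mechanism is key to improving VLA performance.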

Abstract: Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

cs.AI

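[110] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist cs.AI | cs.CL

Sohyeon Jeon, Hyung-Chul Lee

TL;DR: Using behavioral and metacognitive analysis, this paper studies how two large language models (LLMs) assess clinical trial reports against the CONSORT standard under three prompt conditions. It reveals differences in the models' reasoning style and expression of uncertainty, and highlights their limitations for medical AI development.

Motivation: Although LLMs are expanding rapidly in healthcare, their cognitive and reasoning abilities when assessing clinical trial reports against CONSORT standards remain unclear. The study aims to fill this gap and support the development of more reliable and explainable medical AI.

Result: Model behavior differed clearly across CONSORT items, and prompt type significantly affected reasoning style and the expression of uncertainty.

Insight: Current LLMs have limitations in clinical compliance automation. Building more reliable medical AI requires a deeper understanding of the models' cognitive adaptations and strategic behavior.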

Abstract: Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting according to CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached various CONSORT items, and prompt types shaped response patterns, including shifts in reasoning style, explicit uncertainty, and alternative interpretations. Our results highlight the current limitations of these systems in clinical compliance automation and underscore the importance of understanding their cognitive adaptations and strategic behavior in developing more explainable and reliable medical AI.

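[111] The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models cs.AI | cs.CL

Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao

TL;DR: This paper studies Mode Selection in reasoning models as a harder variant of the Early Exit problem, in which the decision is made with zero-step thinking to reduce computational overhead. It finds that existing methods perform poorly when information is limited.

Motivation: Reasoning models excel at mathematical and logical reasoning tasks, but their step-by-step thinking can incur excessive computational overhead. Mode Selection and Early Exit both aim to reduce this overhead, but Mode Selection is more challenging because the decision must be made before reasoning begins.

Result: Prompt-based methods perform poorly because of their limited classification ability, while methods that use internal information perform better but lack stability.

Insight: Mode Selection remains challenging when information is limited. Existing methods do not solve it effectively and need further improvement.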

Abstract: Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) and Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.

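[112] Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning cs.AI | cs.CV | cs.RO

Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi

TL;DR: The paper proposes Memo, a transformer-based architecture and training method for memory-intensive, long-horizon tasks in reinforcement learning. By introducing periodic summarization tokens, Memo creates and retrieves memories during training, improves compute and storage efficiency, and performs better in long-context inference.

Motivation: Existing transformer models face context limits and computational inefficiency on long-horizon memory tasks, whereas humans compress and use memories efficiently. The paper aims to address these problems with a more efficient approach to memory management.

Result: On a gridworld meta-RL benchmark and a photo-realistic indoor navigation task, Memo outperforms the baseline models and remains robust in long-context inference and streaming settings.

Insight: Dynamic memory management can effectively address the long-horizon memory problem of transformers while preserving compute and storage efficiency.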

Abstract: To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
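
As a rough illustration of what "interleaving periodic summarization tokens with the inputs" could look like at the sequence level, the sketch below inserts a placeholder token after every fixed-length segment of observations. The token name `<SUM>`, the segment length, and the function name are assumptions made for this example; the abstract does not specify them, and the sketch does not cover how the model reads or writes memory at those positions.

```python
from typing import List, Sequence

SUMMARY_TOKEN = "<SUM>"  # hypothetical placeholder, not a token named in the paper


def interleave_summary_tokens(
    inputs: Sequence[str],
    segment_length: int,
    summary_token: str = SUMMARY_TOKEN,
) -> List[str]:
    """Insert a summary token after every `segment_length` input tokens."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    out: List[str] = []
    for position, token in enumerate(inputs, start=1):
        out.append(token)
        # Close each complete segment with a summarization slot.
        if position % segment_length == 0:
            out.append(summary_token)
    return out


sequence = interleave_summary_tokens(["o1", "o2", "o3", "o4", "o5"], segment_length=2)
# sequence == ["o1", "o2", "<SUM>", "o3", "o4", "<SUM>", "o5"]
```

In a streaming setting of the kind the abstract mentions, one could imagine keeping only the summary positions of older segments once the raw inputs are truncated; that retention policy is likewise an assumption here.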

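[113] HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application cs.AI | cs.CL | cs.MA

Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang

TL;DR: HSCodeComp is a benchmark for deep search agents that evaluates hierarchical rule application, particularly where rules have vague boundaries and implicit logical relationships. Experiments show that existing agents perform far below human experts.

Motivation: Current agent benchmarks overlook agents' ability to handle complex rules, such as tariff rules, which are critical in real-world applications.

Result: The best agent achieves only 46.8% 10-digit accuracy, far below the 95.0% achieved by human experts.

Insight: Hierarchical rule application poses a significant challenge for agents, and existing test-time scaling methods fail to improve performance further.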

Abstract: Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals, and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: the best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. In addition, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
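
Because the codes are hierarchical, one way to see where predictions go wrong is to score them at several prefix lengths as well as at the full 10 digits. The sketch below assumes the common HS layout (2-digit chapter, 4-digit heading, 6-digit subheading, then national extensions to 8 and 10 digits). Only the 10-digit figure corresponds to the metric reported in the abstract; the other levels, the function name, and the sample codes are illustrative assumptions.

```python
from typing import Dict, Iterable, Sequence, Tuple

PREFIX_LEVELS = (2, 4, 6, 8, 10)  # assumed hierarchy levels; 10 = full code


def prefix_accuracy(
    pairs: Iterable[Tuple[str, str]],
    levels: Sequence[int] = PREFIX_LEVELS,
) -> Dict[int, float]:
    """Fraction of (predicted, gold) codes that agree on the first n digits."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (predicted, gold) pair is required")
    return {
        n: sum(pred[:n] == gold[:n] for pred, gold in pairs) / len(pairs)
        for n in levels
    }


# Hypothetical predictions against a single made-up gold code.
examples = [
    ("8517130000", "8517130000"),  # exact match
    ("8517140000", "8517130000"),  # right heading, wrong subheading
    ("8471300000", "8517130000"),  # wrong heading
    ("8517130010", "8517130000"),  # wrong only in the last two digits
]
scores = prefix_accuracy(examples)
# scores == {2: 0.75, 4: 0.75, 6: 0.5, 8: 0.5, 10: 0.25}
```

A breakdown like this separates coarse category errors from errors in the fine-grained national digits, which a single 10-digit accuracy number cannot do.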

eess.AS

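[114] StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction eess.AS | cs.AI | cs.CL

Qianheng Xu

TL;DR: The paper proposes StutterZero and StutterFormer, two end-to-end speech conversion models that directly convert stuttered speech into fluent speech while jointly predicting the transcription, markedly improving transcription accuracy and semantic similarity.

Motivation: More than 70 million people worldwide stutter, yet existing automatic speech systems rely on staged processing or separate modules, which often leads to inaccurate transcription or distortion.

Result: Compared with Whisper-Medium, StutterZero reduces word error rate (WER) by 24% and improves BERTScore by 31%; StutterFormer reduces WER by 28% and improves BERTScore by 34%.

Insight: End-to-end models can handle speech conversion and transcription jointly, avoiding the distortion introduced by multi-stage pipelines and opening new directions for inclusive human-computer interaction and speech therapy.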

Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
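
For readers unfamiliar with the headline metric, the following is a minimal sketch of word error rate: the word-level edit distance (substitutions, insertions, deletions) between a reference and a hypothesis transcript, divided by the number of reference words. The abstract does not say whether its percentage decreases are relative or absolute, so the `relative_decrease` helper shows only one possible reading; the function names and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_decrease(baseline: float, improved: float) -> float:
    """Relative reduction of a metric, e.g. 0.25 for a 25% relative drop."""
    return (baseline - improved) / baseline


# A transcript with two repeated words scores 2 errors over 5 reference words.
stuttered = word_error_rate("i want to go home", "i i want to to go home")
fluent = word_error_rate("i want to go home", "i want to go home")
```

Here `stuttered` evaluates to 0.4 and `fluent` to 0.0, which shows how repetitions that survive into a transcript raise WER directly.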

eess.IV

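[115] Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8 eess.IV | cs.CV | q-bio.QM

Banan Alnemri, Arwa Basbrain

TL;DR: The paper proposes a YOLOv8-based automated pipeline for segmenting neurons and analyzing their morphology in fluorescence microscopy images, with segmentation accuracy above 97%, substantially reducing the need for manual annotation.

Motivation: Accurate segmentation and measurement are key to neuron morphology analysis in neuroscience and biomedical imaging, but traditional methods depend on manual work that is time-consuming and subjective, so automated solutions are needed.

Result: Segmentation accuracy exceeds 97%, and the overall accuracy of the morphological measurements is 75.32%, supporting the reliability and effectiveness of the method for neuron morphology analysis.

Insight: YOLOv8 performs well on biomedical image segmentation. An automated pipeline can significantly improve research efficiency and offers a scalable tool for neuroscience and cell imaging.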

Abstract: Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.

cs.MA

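[116] ColorAgent: Building A Robust, Personalized, and Interactive OS Agent cs.MA | cs.AI | cs.CL

Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang

TL;DR: ColorAgent is a personalized operating system (OS) agent that achieves robust, long-horizon interaction with the environment through reinforcement learning and a multi-agent framework, and performs well at user intent recognition and proactive interaction.

Motivation: With advances in hardware, software, and large language models, human-computer interaction is shifting from the command line to AI agents. Building an OS agent that executes user instructions and faithfully follows user needs is becoming feasible.

Result: On the AndroidWorld and AndroidLab benchmarks, ColorAgent achieves success rates of 77.2% and 50.7%, respectively, setting a new state of the art.

Insight: Current benchmarks are insufficient for comprehensively evaluating OS agents. Future work should explore evaluation paradigms, agent collaboration, and security.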

Abstract: With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model’s capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose directions for further exploration in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at https://github.com/MadeAgents/mobile-use.

cs.SE

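[117] Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1 cs.SE | cs.AI | cs.CL

Qianli Ma, Siyu Wang, Yilin Chen, Yinhao Tang, Yixiang Yang

TL;DR: The paper proposes AutoPage, a hierarchical multi-agent system that converts academic papers into dynamic webpages efficiently and at low cost. A hierarchical collaborative pipeline and verification mechanisms address the challenges of automated webpage generation.

Motivation: Researchers face manual, repetitive work when building dynamic webpages to present their research. Existing automation tools cannot handle dynamic, interactive webpages.

Result: AutoPage generates high-quality, visually appealing webpages in under 15 minutes for less than $0.1.

Insight: The challenges of webpage generation can be addressed through hierarchical collaboration and verification, with the system designed as a collaborative assistant for humans and AI rather than a standalone tool.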

Abstract: In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce AutoPage, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated “Checker” agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author’s vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct PageBench, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than $0.1. Code and dataset will be released at https://mqleet.github.io/AutoPage_ProjectPage/.

Table of Contents

cs.CV

[1] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts cs.CV | cs.AI | cs.RO

[2] $Δ$t-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction cs.CV | cs.AI

[3] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models cs.CV

[4] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions cs.CV | cs.AI | cs.CL

[5] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning cs.CV

[6] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing cs.CV | 68U10, 68T07, 68T45 | I.4.6; I.2.10; I.5.4; J.3

[7] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx cs.CV | cs.AI | 68U10, 68T07, 68T45, 92C55 | I.4.6; I.2.10; I.5.4; J.3

[8] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning cs.CV | cs.AI | cs.LG

[9] FootFormer: Estimating Stability from Visual Input cs.CV

[10] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning cs.CV | cs.AI

[11] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning cs.CV

[12] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks cs.CV | cs.AI

[13] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting cs.CV

[14] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion cs.CV

[15] Advances in 4D Representation: Geometry, Motion, and Interaction cs.CV

[16] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges cs.CV

[17] Unified Reinforcement and Imitation Learning for Vision-Language Models cs.CV

[18] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP cs.CV

[19] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents cs.CV

[20] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes cs.CV

[21] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion cs.CV

[22] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis cs.CV | cs.MM

[23] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation cs.CV

[24] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts cs.CV

[25] CARES: Context-Aware Resolution Selector for VLMs cs.CV | cs.AI | cs.LG

[26] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis cs.CV

[27] [De|Re]constructing VLMs’ Reasoning in Counting cs.CV | cs.CL

[28] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models cs.CV

[29] A Matter of Time: Revealing the Structure of Time in Vision-Language Models cs.CV | cs.AI | cs.IR | cs.MM

[30] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking cs.CV

[31] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection cs.CV | cs.CR

[32] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction cs.CV

[33] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration cs.CV | cs.AI | cs.LG

[34] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation cs.CV

[35] CBDiff: Conditional Bernoulli Diffusion Models for Image Forgery Localization cs.CV

[36] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography cs.CV | cs.AI

[37] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom cs.CV

[38] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction cs.CV | cs.AI | cs.CL | cs.RO

[39] I Spy With My Model’s Eye: Visual Search as a Behavioural Test for MLLMs cs.CV | cs.AI

[40] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation cs.CV

[41] Explainable Face Presentation Attack Detection via Ensemble-CAM cs.CV

[42] LyTimeT: Towards Robust and Interpretable State-Variable Discovery cs.CV

[43] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation cs.CV

[44] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models cs.CV

[45] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing cs.CV | cs.CL | cs.LG

[46] olmOCR 2: Unit Test Rewards for Document OCR cs.CV | cs.CL

[47] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking cs.CV

cs.CL

[48] Small Language Models Offer Significant Potential for Science Community cs.CL | cs.AI

[49] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti cs.CL | cs.CY

[50] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code cs.CL | cs.AI | cs.IR | cs.LG

[51] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets cs.CL | cs.AI

[52] MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels cs.CL | cs.AI | I.2.7

[53] Are they lovers or friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues cs.CL

[54] Re:Member: Emotional Question Generation from Personal Memories cs.CL | cs.HC

[55] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models cs.CL | cs.LG | eess.SP | stat.ML

[56] Training-Free Spectral Fingerprints of Voice Processing in Transformers cs.CL | cs.LG | eess.SP | stat.ML

[57] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges cs.CL

[58] “You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations cs.CL

[59] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG cs.CL

[60] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA cs.CL | cs.AI

[61] Interpretable Question Answering with Knowledge Graphs cs.CL | cs.AI | cs.LG

[62] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems cs.CL

[63] DiSRouter: Distributed Self-Routing for LLM Selections cs.CL

[64] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets cs.CL

[65] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization cs.CL

[66] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools cs.CL

[67] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation cs.CL

[68] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints cs.CL

[69] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization cs.CL | cs.AI

[70] Slot Filling as a Reasoning Task for SpeechLLMs cs.CL

[71] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection cs.CL | cs.CY

[72] Modeling Turn-Taking with Semantically Informed Gestures cs.CL

[73] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models cs.CL | cs.AI

[74] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation cs.CL | cs.AI

[75] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts cs.CL

[76] The Massive Legal Embedding Benchmark (MLEB) cs.CL | cs.AI | cs.IR

[77] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs cs.CL | cs.LG