cs.CV [Total: 67]
cs.CL [Total: 19]
cs.CY [Total: 1]
cs.LG [Total: 4]
cs.AR [Total: 1]
cs.RO [Total: 5]
cs.IR [Total: 1]
eess.SY [Total: 1]
cs.GR [Total: 1]
eess.IV [Total: 8]
cs.AI [Total: 4]

cs.CV

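[1] YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection cs.CV | eess.IV

Zhebin Jin, Ligang Dong

TL;DR: YOLO11-CR is a lightweight convolution-and-attention framework for real-time fatigue driving detection. It combines local CNN features with global Transformer context to improve detection accuracy and localization.

Details

Motivation: Traditional fatigue detection methods (based on physiological signals or vehicle dynamics) are intrusive and hardware-dependent, while vision-based methods struggle with small and occluded objects, so an efficient and robust solution is needed.

Result: On the DSM dataset the model reaches 87.17% precision, 83.86% recall, 88.09% mAP@50, and 55.93% mAP@50-95, outperforming the baseline models.

Insight: YOLO11-CR offers an efficient, non-intrusive solution for fatigue driving detection that is suited to practical deployment; temporal modeling and multi-modal data could further improve performance.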

Abstract: Driver fatigue detection is of paramount importance for intelligent transportation systems due to its critical role in mitigating road traffic accidents. While physiological and vehicle dynamics-based methods offer accuracy, they are often intrusive, hardware-dependent, and lack robustness in real-world environments. Vision-based techniques provide a non-intrusive and scalable alternative, but still face challenges such as poor detection of small or occluded objects and limited multi-scale feature modeling. To address these issues, this paper proposes YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue detection. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM), which integrates local CNN features with global Transformer-based context to enhance feature expressiveness; and the Rectangular Calibration Module (RCM), which captures horizontal and vertical contextual information to improve spatial localization, particularly for profile faces and small objects like mobile phones. Experiments on the DSM dataset demonstrated that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models significantly. Ablation studies further validate the effectiveness of the CAFM and RCM modules in improving both sensitivity and localization accuracy. These results demonstrate that YOLO11-CR offers a practical and high-performing solution for in-vehicle fatigue monitoring, with strong potential for real-world deployment and future enhancements involving temporal modeling, multi-modal data integration, and embedded optimization.
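
The abstract reports precision and recall but no single combined score. As a reading aid, the sketch below computes the F1 score (the harmonic mean of the two) from the reported figures. F1 is not reported by the authors; it is derived here only from the two numbers in the abstract, and the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
# The resulting F1 is a derived value, not a figure from the paper.
f1 = f1_score(87.17, 83.86)
print(f"F1 = {f1:.2f}%")
```

Because the harmonic mean is dominated by the smaller input, the derived score sits slightly closer to the recall than to the precision.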

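[2] MIRAGE: Towards AI-Generated Image Detection in the Wild cs.CV | cs.AI

Cheng Xia, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu

TL;DR: The paper introduces Mirage, a challenging benchmark for detecting AI-generated images (AIGI) in real-world settings, and Mirage-R1, a detection model. By combining heuristic and analytic reasoning, Mirage-R1 clearly outperforms existing methods.

Details

Motivation: As generative AI advances, the flood of AI-generated images threatens information security and public trust. Existing detectors work in laboratory settings but generalize poorly to complex, varied real-world scenarios (for example, images that have been edited or that come from different models).

Result: Experiments show that Mirage-R1 leads the best existing methods by 5% on Mirage and by 10% on the public benchmark.

Insight: Detecting AIGI in the wild requires models with reasoning ability and adaptability; training on multi-source data and using dynamic inference strategies are key.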

Abstract: The spread of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.

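[3] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model cs.CV

Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

TL;DR: The paper proposes DianJin-OCR-R1, a framework built on a large vision-language model (VLM) that interleaves reasoning with tool calls to improve OCR performance and reduce hallucinations. The model recognizes the content itself, calls expert tools, and then reasons again; experiments show it outperforms non-reasoning models and expert OCR models.

Details

Motivation: Current large vision-language models (LVLMs) hallucinate on OCR tasks and perform worse than expert models trained for specific tasks. The paper proposes a new framework combining reasoning and tool calling to address both problems.

Result: On ReST and OmniDocBench, DianJin-OCR-R1 models outperform non-reasoning models and expert OCR models.

Insight: VLMs that combine reasoning with tool calling can substantially improve OCR performance, and bringing in external expert models reduces hallucinations, suggesting a new approach for multimodal tasks.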

Abstract: Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in the input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image using its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally looks at the image again and rethinks its reasoning process to provide the final recognized content. Since the architectures of expert models are tailored for specific OCR tasks, making them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method.

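[4] Exploration of Deep Learning Based Recognition for Urdu Text cs.CV

Sumaiya Fazal, Sheeraz Ahmed

TL;DR: The paper presents a convolutional neural network (CNN) based optical character recognition system for Urdu. Using automatic feature learning and a hierarchical neural network, it addresses the recognition difficulties caused by Urdu's complex geometric and morphological structure and reaches 99% accuracy on component classification.

Details

Motivation: Urdu is a cursive script with complex geometric and morphological structure and is context-sensitive, so traditional segmentation-based recognition has a high error rate. A more effective recognition method is needed.

Result: Experiments on an Urdu text dataset show 99% accuracy on the component classification task.

Insight: The keys to Urdu recognition are effective feature learning and hierarchical processing; combining a CNN with a hierarchical neural network is a feasible solution.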

Abstract: Urdu is a cursive script language and has similarities with Arabic and many other South Asian languages. Urdu is difficult to classify due to its complex geometrical and morphological structure. Character classification can be processed further if the segmentation technique is efficient, but due to context sensitivity in Urdu, segmentation-based recognition often results in a high error rate. Our proposed approach to an Urdu optical character recognition system is component-based classification relying on an automatic feature learning technique, the convolutional neural network (CNN). The CNN is trained and tested on an Urdu text dataset, which is generated through a permutation process of three characters, followed by discarding unnecessary images by applying a connected component technique in order to obtain ligatures only. A hierarchical neural network is implemented with two levels to deal with three degrees of character permutations and component classification. Our model successfully achieved 99% accuracy for component classification.

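[5] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis cs.CV | cs.AI

Sirshapan Mitra, Yogesh S. Rawat

TL;DR: GaitCrafter is a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. The diffusion model generates high-quality, controllable, privacy-preserving gait data that improves gait recognition performance.

Details

Motivation: Gait recognition is limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples, and privacy must be protected. GaitCrafter aims to address these problems with synthetic data.

Result: Synthetic data improves gait recognition performance, especially under challenging conditions; the generated novel identities show unique, consistent gait patterns.

Insight: Diffusion models are suited to generating high-quality, privacy-preserving gait data, and interpolating identity embeddings yields novel identities that broaden dataset diversity.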

Abstract: Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
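
The abstract mentions generating novel identities by interpolating identity embeddings. A minimal sketch of that idea follows, assuming plain linear interpolation between two embedding vectors followed by L2 normalization. The abstract does not specify the interpolation scheme or whether embeddings are normalized, so both choices, along with the names `lerp_identity` and `l2_normalize` and the toy vectors, are assumptions made for illustration.

```python
import math


def lerp_identity(a, b, t):
    """Linearly interpolate two identity embeddings (t=0 gives a, t=1 gives b)."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def l2_normalize(v):
    """Scale a vector to unit length; assumed here, not stated in the abstract."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


# Toy 3-dimensional embeddings standing in for two real subjects (hypothetical values).
id_a = [1.0, 0.0, 0.0]
id_b = [0.0, 1.0, 0.0]

# A midpoint embedding that would condition generation of a synthetic identity.
novel = l2_normalize(lerp_identity(id_a, id_b, 0.5))
print(novel)
```

Varying `t`, or choosing different pairs of source identities, would give a family of synthetic identities that do not correspond to any real subject.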

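[6] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving cs.CV

Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang

TL;DR: Prune2Drive is a plug-and-play visual token pruning framework for accelerating vision-language models (VLMs) in autonomous driving. Its diversity-aware token selection and view-adaptive pruning controller substantially improve computational efficiency.

Details

Motivation: VLMs in autonomous driving carry a heavy computational burden from processing high-resolution multi-view images, which increases latency and memory consumption and hinders practical deployment.

Result: When retaining 10% of the visual tokens, the prefilling phase is 6.4× faster, FLOPs fall to 13.4% of the original, and task performance drops by only 3%.

Insight: Selecting which visual tokens to keep without relying on attention scores can effectively balance computational efficiency and model performance, and is well suited to multi-view settings.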

Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
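
The token selection mechanism is described as inspired by farthest point sampling (FPS). The sketch below shows textbook greedy FPS over token feature vectors, using Euclidean distance. It is not Prune2Drive's implementation: the distance metric, the starting token, and the function name `farthest_point_sampling` are assumptions, and the view-adaptive pruning controller is not modeled.

```python
import math


def farthest_point_sampling(tokens, k, start=0):
    """Greedy FPS: return indices of k tokens that are spread out in feature space."""
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    selected = [start]
    # Distance from every token to its nearest already-selected token.
    min_dist = [math.dist(t, tokens[start]) for t in tokens]
    while len(selected) < k:
        # Pick the token farthest from the current selection.
        nxt = max(range(len(tokens)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, t in enumerate(tokens):
            d = math.dist(t, tokens[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Five toy 2-D "token features" forming three clusters (hypothetical values).
tokens = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
kept = farthest_point_sampling(tokens, k=3)
print(kept)
```

On the toy data, the three kept tokens come from three different clusters, which is the coverage property the abstract contrasts with selection based solely on attention scores.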

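[7] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples cs.CV | cs.LG

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya

TL;DR: DAASH is a differentiable meta-attack framework that composes multiple Lp-norm-constrained attacks to generate effective, visually stealthy adversarial examples. It dynamically adjusts the weight of each base attack across multiple stages and jointly optimizes classification loss and visual distortion, clearly outperforming existing methods.

Details

Motivation: Existing adversarial example methods are usually based on Lp-norm constraints, but such examples are visually inconsistent with natural images. Although recent work has begun to study perceptually aligned adversarial examples, it remains unclear how insights from Lp-constrained attacks can be used to improve them.

Result: On CIFAR-10, CIFAR-100, and ImageNet, DAASH significantly outperforms state-of-the-art methods (such as AdvAD) in attack success rate and visual quality metrics (SSIM, LPIPS, FID), and generalizes well.

Insight: Dynamically composing Lp-constrained attacks yields adversarial examples that are closer to natural images while remaining highly effective, providing a practical baseline for evaluating model robustness.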

Abstract: Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of approximately 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
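
The abstract describes each stage as aggregating candidate adversarial examples from several base attacks with learned, adaptive weights. One plausible reading is a convex combination whose weights come from a softmax over learnable logits, sketched below. The softmax parameterization, the function names, and the toy values are assumptions; the abstract does not say how the weights are parameterized, and the meta-loss that would train them is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_candidates(candidates, logits):
    """Weighted sum of candidate examples, one candidate per base attack."""
    weights = softmax(logits)
    n = len(candidates[0])
    return [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(n)]


# Two toy "images" of two pixels each, produced by two hypothetical base attacks.
candidates = [[0.0, 1.0], [1.0, 0.0]]
# Equal logits give equal weights; training would adjust these per stage.
stage_output = aggregate_candidates(candidates, logits=[0.0, 0.0])
print(stage_output)
```

In a multi-stage setup of the kind the abstract describes, `stage_output` would be passed as the starting point to the base attacks of the next stage.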

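[8] Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery cs.CV

Pegah Varghaei, Kiran Abraham-Aggarwal, Manoj T. Abraham, Arun Ross

TL;DR: The paper proposes a scalable, interpretable computer-vision framework that quantifies aesthetic outcomes of facial plastic surgery from frontal photographs, and validates it on a large dataset.

Details

Motivation: Assessment of aesthetic outcomes in facial plastic surgery usually relies on subjective judgment and lacks objective standards. The paper aims to provide a quantifiable, reproducible assessment method based on computer vision to support surgical planning and post-operative evaluation.

Result: Among rhinoplasty patients, 96.2% improved on at least one nasal measurement; in the broader frontal-view cohort, 71.3% showed significant improvement in facial symmetry or perceived age. The identity match rate is at least 99.5%.

Insight: Computer vision can standardize the assessment of aesthetic outcomes, reduce subjective variation, and provide data to support clinical decisions.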

Abstract: We introduce a scalable, interpretable computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using frontal photographs. Our pipeline leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis. To perform this study, we first assemble the largest curated dataset of paired pre- and post-operative facial images to date, encompassing 7,160 photographs from 1,259 patients. This dataset includes a dedicated rhinoplasty-only subset consisting of 732 images from 366 patients, 96.2% of whom showed improvement in at least one of the three nasal measurements with statistically significant group-level change. Among these patients, the greatest statistically significant improvements (p < 0.001) occurred in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). Among the broader frontal-view cohort, comprising 989 rigorously filtered subjects, 71.3% exhibited significant enhancements in global facial symmetry or perceived age (p < 0.01). Importantly, our analysis shows that patient identity remains consistent post-operatively, with True Match Rates of 99.5% and 99.6% at a False Match Rate of 0.01% for the rhinoplasty-specific and general patient cohorts, respectively. Additionally, we analyze inter-practitioner variability in improvement rates. By providing reproducible, quantitative benchmarks and a novel dataset, our pipeline facilitates data-driven surgical planning, patient counseling, and objective outcome evaluation across practices.

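[9] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies cs.CV

Yiting Wang, Ziwei Wang, Jiachen Zhong, Di Zhu, Weiyi Li

TL;DR: The paper studies small language models (SLMs) on medical imaging classification, focusing on how different prompt strategies affect performance, and finds that carefully designed prompts can substantially improve SLM accuracy.

Details

Motivation: Large language models (LLMs) face high computational cost, limited accessibility, and data privacy concerns in healthcare settings, so studying how SLMs perform in resource-constrained environments is important.

Result: Experiments show that certain SLMs paired with carefully designed prompts achieve competitive accuracy.

Insight: Prompt engineering can substantially improve SLM performance in healthcare without requiring deep AI expertise from end users.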

Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.

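[10] Mitigating Easy Option Bias in Multiple-Choice Question Answering cs.CV | cs.AI | cs.MM

Hao Zhang, Chen Li, Basura Fernando

TL;DR: The paper identifies an "easy option bias" in multiple-choice visual question answering (VQA) benchmarks and proposes GroundAttack, a toolkit that generates visually plausible hard negative options to remove it, enabling a more realistic evaluation of vision-language models' question answering ability.

Details

Motivation: The authors find that some multiple-choice VQA benchmarks suffer from Easy-Options Bias (EOB): vision-language models (VLMs) can infer the correct answer from the visual input and the options alone, without the question, which distorts evaluation.

Result: On the new EOB-free annotations, existing VLMs score close to random when given only vision and options (V+O), and their accuracy also drops markedly with vision, question, and options (V+Q+O), indicating that the questions are more challenging.

Insight: The study exposes a latent bias in existing multiple-choice VQA benchmarks and stresses that benchmark design must balance options and questions so that models cannot achieve spuriously high scores through shortcut learning.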

Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, NExT-QA, the STAR benchmark, and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs approach random accuracy under (V+O) settings and drop to non-saturated accuracy under (V+Q+O) settings, providing a more realistic evaluation of VLMs’ QA ability. Code and new annotations will be released soon.
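
The shortcut described in the abstract, answering from vision-option similarity alone, can be illustrated with a toy example. The sketch below picks the option whose feature vector has the highest cosine similarity to the image feature and never looks at the question. The feature vectors are made-up numbers and the function names are illustrative; real experiments would use embeddings from a vision-language encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pick_option_without_question(image_feat, option_feats):
    """Return the index of the option most similar to the image (V+O only)."""
    sims = [cosine(image_feat, o) for o in option_feats]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical features: option 0 is the correct answer and is visually grounded,
# while the negatives are unrelated to the image content.
image_feat = [1.0, 0.2]
option_feats = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
choice = pick_option_without_question(image_feat, option_feats)
print(choice)
```

If the negatives were as visually plausible as the correct answer, which is what GroundAttack is described as producing, their similarities would be close to that of option 0 and this shortcut would no longer single out the answer.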

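[11] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference cs.CV | cs.AI | cs.CL | eess.IV

Yunxiang Yang, Ningning Xu, Jidong J. Yang

TL;DR: The paper proposes a structured prompting and multi-agent knowledge distillation framework for traffic video understanding and risk inference. Two vision-language models (VLMs), GPT-4o and o3-mini, generate high-quality pseudo-annotations that are used to train VISTA, a compact 3B-parameter model. VISTA performs strongly on multiple metrics and is suited to edge deployment.

Details

Motivation: Traditional methods struggle to scale and generalize in complex, dynamic real-world traffic scenes; intelligent transportation systems and autonomous driving need better comprehensive scene understanding and risk inference.

Result: VISTA performs strongly on BLEU-4, METEOR, and other metrics, approaching the large models, and is suited to real-time deployment on edge devices.

Insight: With knowledge distillation and structured multi-agent supervision, a lightweight VLM can acquire complex reasoning capabilities, addressing the deployment bottleneck of traditional methods.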

Abstract: Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.

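[12] EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis cs.CV

Shuai Tan, Bin Ji

TL;DR: EDTalk++ is a new full-disentanglement framework for controllable talking head generation that allows independent control of mouth shape, head pose, eye movement, and emotional expression, with video or audio input.

Details

Motivation: Existing methods fall short in disentangling multiple facial motions and in supporting diverse input modalities; the decoupling space of facial features needs deeper exploration to enable independent control and shared visual priors.

Result: Experiments demonstrate the effectiveness of EDTalk++ in disentangled control and multi-modal input support.

Insight: Orthogonal bases and an efficient training strategy make it possible to disentangle control of facial features and to share visual priors across input modalities.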

Abstract: Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs, both aspects often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk++.
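
The abstract states that each latent space is characterized by learnable bases whose linear combinations define motions, and that orthogonality is enforced among the bases. The sketch below illustrates both ideas with a simple penalty, the squared Frobenius norm of the Gram matrix minus the identity. This penalty is a common way to encourage orthonormal bases and is an assumption here; the abstract does not say how EDTalk++ enforces orthogonality, and the names and toy values are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(bases):
    """Squared Frobenius norm of (B B^T - I); zero when the bases are orthonormal."""
    total = 0.0
    for i, bi in enumerate(bases):
        for j, bj in enumerate(bases):
            target = 1.0 if i == j else 0.0
            total += (dot(bi, bj) - target) ** 2
    return total


def compose_motion(bases, coeffs):
    """Linear combination of bases: the latent code for one specific motion."""
    dim = len(bases[0])
    return [sum(c * b[k] for c, b in zip(coeffs, bases)) for k in range(dim)]


# Two toy bases in a 3-D latent space (hypothetical values).
bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
penalty = orthogonality_penalty(bases)
motion = compose_motion(bases, coeffs=[0.5, -2.0])
print(penalty, motion)
```

With orthonormal bases, each coefficient can be changed without affecting the contribution of the others, which matches the independence property the abstract asks of each space.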

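[13] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding cs.CV

Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao

TL;DR: The paper revisits the token technology of multimodal large language models (MLLMs) through the lens of classical visual coding, proposes a unified formulation that enables systematic comparison of the two, and discusses bidirectional insights and future research directions.

Details

Motivation: MLLM token technology and classical visual coding share a core objective: maximizing information fidelity while minimizing computational cost. Revisiting MLLM token technology through the mature principles of visual coding can improve its efficiency and robustness.

Result: The paper gives the first comprehensive, systematic comparison of MLLM token technology and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs.

Insight: Principles of classical visual coding can markedly improve the efficiency of MLLM token techniques, and MLLM paradigms can in turn inform the design of next-generation semantic visual codecs.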

Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share a core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed field of visual coding. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; and (3) outline promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token technology and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.

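[14] Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs cs.CV | cs.LG

Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clement Larose, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz

TL;DR: The paper compares Vision Transformers (ViTs) with CNNs on kidney stone image classification and finds that ViT captures complex, long-range image features better than the conventional ResNet50, especially under variable imaging conditions.

Details

Motivation: Kidney stone image classification is critical for personalized treatment and recurrence prevention. Conventional CNNs are limited in capturing long-range dependencies, so more effective models are worth exploring.

Result: On the most complex subset (section patches from endoscopic images), ViT reaches 95.2% accuracy and 95.1% F1-score, well above ResNet50 (64.5% and 59.3%). On the mixed-view subset of CCD-camera images, ViT also leads with 87.1% accuracy versus 78.4% for ResNet50.

Insight: ViTs are better at capturing long-range dependencies and complex features, which suits medical image analysis, particularly under variable imaging conditions.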

Abstract: Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.

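[15] STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models cs.CV | cs.AI

Tinh-Anh Nguyen-Nhu, Triet Dao Hoang Minh, Dat To-Thanh, Phuc Le-Gia, Tuan Vo-Lan

TL;DR: STER-VLM is an efficient vision-language model framework that improves the semantic richness and accuracy of traffic scene analysis by decomposing spatial and temporal information, optimizing frame selection, and using reference-driven understanding.

Details

Motivation: Current vision-language models for traffic analysis demand substantial computational resources and struggle with fine-grained spatio-temporal information.

Result: The framework performs well on the WTS and BDD datasets and achieves a test score of 55.655 in the AI City Challenge 2025 Track 2, validating its effectiveness.

Insight: By handling spatial and temporal information with a divide-and-conquer strategy, STER-VLM balances resource efficiency and accuracy and is applicable to real-world traffic analysis.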

Abstract: Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, (3) reference-driven understanding for capturing fine-grained motion and dynamic context, and (4) curated visual/textual prompt techniques. Experimental results on the WTS and BDD datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.

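[16] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding cs.CV

Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Yuxin Cheng, Chen Zhang

TL;DR: MINR encodes multiple images efficiently by sharing intermediate layers and adding a projection layer, saving up to 60% of parameters while maintaining performance and scaling to 100 images.

Details

Motivation: Conventional implicit neural representations (INRs) train a separate MLP for each image, which is inefficient in computation and storage.

Result: On image reconstruction and super-resolution tasks, MINR saves up to 60% of parameters and maintains an average PSNR of 34 dB when handling 100 images.

Insight: The weight distributions of intermediate INR layers are highly similar, which supports sharing them, and the projection layer effectively preserves each image's unique features.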

Abstract: Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically, a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, sharing specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% parameters while maintaining comparable performance. Particularly, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones proves the robustness of the proposed MINR.
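
The abstract reports reconstruction quality as peak signal-to-noise ratio (PSNR). For readers unfamiliar with the metric, the sketch below computes PSNR from the mean squared error using the standard definition, 10 * log10(MAX^2 / MSE). The pixel values are made up for illustration and assume intensities scaled to [0, 1]; they are not data from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "image" and a reconstruction that is off by 0.02 at every pixel.
reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.02, 0.52, 0.98, 0.27]
value = psnr(reference, reconstruction)
print(f"PSNR = {value:.2f} dB")
```

Since PSNR is logarithmic in the error, halving the per-pixel error raises the score by about 6 dB.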

Fuyang Liu, Jilin Mei, Fangyuan Mao, Chen Min, Yan Xing

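TL;DR: CORENet is a cross-modal 4D radar denoising network supervised by LiDAR, designed to improve perception on 4D radar point clouds in autonomous driving. It uses LiDAR data only during training and relies entirely on radar data at inference.

Details

Motivation: 4D radar point clouds are sparse and noisy, which makes practical perception tasks challenging. LiDAR excels at precise perception, but its performance degrades in adverse weather. A denoising method that exploits LiDAR's strengths while depending only on radar data is therefore needed.

Result: On the heavily noisy Dual-Radar dataset, CORENet markedly improves detection robustness and outperforms existing mainstream methods.

Insight: Cross-modal supervision is an effective way to denoise radar point clouds: high-quality LiDAR data can serve as a strong supervision signal during training without requiring multi-modal input at inference.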

Abstract: 4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches.

[18] Multi-view Clustering via Bi-level Decoupling and Consistency Learning cs.CV | cs.LGPDF

Shihao Dong, Yuhui Zheng, Huiying Xu, Xinzhong Zhu

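TL;DR: This paper proposes BDCL, a new multi-view clustering framework that uses bi-level decoupling and consistency learning to improve feature representations, enhancing inter-cluster discriminability and intra-cluster compactness.

Details

Motivation: Multi-view clustering can improve clustering performance by exploiting information across views, but existing methods usually overlook cluster-oriented representation learning.

Result: Experiments on five benchmark datasets show that BDCL outperforms existing state-of-the-art methods.

Insight: Explicitly decoupling the feature and cluster spaces, and using consistency learning to tighten intra-cluster structure, can effectively improve multi-view clustering performance.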

Abstract: Multi-view clustering has been shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features; however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore effective representations for multi-view data and to enhance the inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through a reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of the feature space and the cluster space. 3) The consistency learning module treats the different views of a sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is available at https://github.com/LouisDong95/BDCL.

[19] AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes cs.CV | eess.IVPDF

Tianyi Xu, Fan Zhang, Boxin Shi, Tianfan Xue, Yujin Wang

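TL;DR: This paper proposes AdaptiveAE, a reinforcement learning-based method that optimizes the choice of shutter speed and ISO combinations in dynamic scenes to maximize HDR reconstruction quality.

Details

Motivation: Conventional HDR imaging techniques overlook the complex interaction between shutter speed and ISO, as well as the effect of motion blur in dynamic scenes, which limits HDR quality.

Result: Experiments on multiple datasets show that AdaptiveAE achieves state-of-the-art HDR reconstruction quality.

Insight: The paper's novelty lies in combining reinforcement learning with conventional HDR techniques and simulating realistic dynamic-scene conditions to arrive at a better exposure strategy.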

Abstract: Mainstream high dynamic range imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is crucial for achieving high-quality HDR, as high ISO values introduce significant noise, while long shutter speeds can lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and exposure histograms. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, and find a better exposure schedule than traditional solutions. Experimental results across multiple datasets demonstrate that it achieves state-of-the-art performance.

[20] Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models cs.CVPDF

Seungheon Baek, Jinhyuk Yun

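TL;DR: This paper proposes a method for transferring singles-trained models to doubles badminton analysis, addressing the data-availability and multi-person tracking challenges of doubles research.

Details

Motivation: Doubles matches are more common than singles in international tournaments, but because of difficulties with data and multi-person tracking, prior research has focused mainly on singles. This paper aims to fill that gap.

Result: The study shows that pose-based shot recognition can be extended to doubles badminton, laying a foundation for doubles data analysis.

Insight: This work demonstrates the potential of model transfer and points the way toward datasets and analysis frameworks for doubles badminton.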

Abstract: Badminton is known as one of the fastest racket sports in the world. Despite doubles matches being more prevalent in international tournaments than singles, previous research has mainly focused on singles due to the challenges in data availability and multi-person tracking. To address this gap, we designed an approach that transfers singles-trained models to doubles analysis. We extracted keypoints from the ShuttleSet singles match dataset using ViT-Pose and embedded them through a contrastive learning framework based on ST-GCN. To improve tracking stability, we incorporated a custom multi-object tracking algorithm that resolves ID switching issues caused by fast and overlapping player movements. A Transformer-based classifier then determines shot occurrences based on the learned embeddings. Our findings demonstrate the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. This work establishes a foundation for doubles-specific datasets to enhance understanding of this predominant yet understudied format of the fast racket sport.

[21] Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency cs.CV | cs.AIPDF

Yanbiao Ma, Wei Dai, Bowei Liu, Jiayi Chen, Wenke Huang

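TL;DR: This paper proposes a method that calibrates biased distributions in the latent space derived from vision foundation models (VFMs) through cross-domain geometric consistency. It addresses the gap between training samples and the true distribution, and is validated on federated learning and long-tailed recognition.

Details

Motivation: In deep learning there is a gap between training samples and the true distribution (caused by, e.g., sampling bias and noise), which limits model performance. The gap is especially pronounced when features are extracted with vision foundation models such as CLIP and DINOv2. This paper aims to address the problem through cross-domain geometric consistency.

Result: Experiments show that the proposed geometric knowledge-guided distribution calibration effectively overcomes the information deficits caused by data heterogeneity and sample imbalance, and significantly improves performance.

Insight: Calibrating feature distributions through cross-domain geometric consistency can significantly improve model performance in complex settings such as federated learning and long-tailed recognition, offering a new way to use foundation models against distribution bias.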

Abstract: Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap has multiple causes, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique for acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, the framework utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
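
One way to picture the long-tailed case is to borrow the covariance of a sample-rich class and use it to draw extra features around a tail-class mean. The sketch below does exactly that and nothing more: it assumes a Gaussian model and treats "geometric shape" as a covariance matrix, both of which are simplifying assumptions rather than details taken from the paper, and the function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def calibrated_samples(tail_mean, head_cov, n, seed=0):
    """Draw n features centred on the tail-class mean with the head-class covariance."""
    rng = random.Random(seed)
    low = cholesky(head_cov)
    dim = len(tail_mean)
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        samples.append([tail_mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                        for i in range(dim)])
    return samples


# Assumed toy inputs: a 2-D tail-class mean and a head-class covariance.
extra = calibrated_samples([0.5, -1.0], [[4.0, 2.0], [2.0, 3.0]], n=5)
```

The generated features could then be mixed into the tail class before training a classifier; whether the paper samples in this way is not stated in the abstract.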

[22] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models cs.CV | cs.AIPDF

Vamsi Krishna Mulukutla, Sai Supriya Pavarala, Srinivasa Raju Rudraraju, Sridevi Bonthu

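TL;DR: This paper presents the first empirical comparison of open-source vision-language models (VLMs) and traditional deep learning models on facial emotion recognition (FER). Traditional models such as EfficientNet-B0 significantly outperform VLMs such as CLIP. The paper also proposes a new pipeline that incorporates image restoration to address data mismatch.

Details

Motivation: FER matters in areas such as human-computer interaction and mental health diagnostics. This study evaluates open-source VLMs (such as Phi-3.5 Vision and CLIP) on FER and compares them with traditional deep learning models (such as VGG19, ResNet-50, and EfficientNet-B0) to fill a gap in existing research.

Result: Traditional models significantly outperform VLMs: EfficientNet-B0 (86.44%) and ResNet-50 (85.72%) perform best, while CLIP (64.07%) and Phi-3.5 Vision (51.66%) perform worse. Image restoration partially improves VLM performance, but the VLMs still do not surpass the traditional models.

Insight: 1. VLMs perform poorly on low-quality or noisy data and need further adaptation or improvement.
2. Traditional deep learning models still hold an advantage on specific tasks such as FER.
3. Image restoration may be one effective way to improve VLM performance, but more research is needed.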

Abstract: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.

[23] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors cs.CV | cs.AIPDF

Shikun Zhang, Cunjian Chen, Yiqun Wang, Qiuhong Ke, Yong Li

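TL;DR: EAvatar is a head reconstruction framework based on 3D Gaussian Splatting. Using a sparse expression control mechanism and generative geometry priors, it addresses the difficulties existing methods have with capturing subtle expressions and preserving local texture continuity.

Details

Motivation: Existing 3D Gaussian Splatting methods struggle to capture subtle expressions and preserve local texture continuity, especially in highly deformable regions.

Result: Experiments show that EAvatar produces more accurate and visually coherent head reconstructions, with improved expression controllability and detail fidelity.

Insight: Combining generative geometry priors with a sparse control mechanism can significantly improve the expressiveness of complex geometry modeling while retaining the advantage of real-time rendering.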

Abstract: High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.

[24] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations cs.CV | cs.AIPDF

Sukhun Ko, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

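TL;DR: FLAIR is a new implicit neural representation (INR) method that introduces the RC-GAUSS activation function and the WEGE encoding mechanism to address the lack of frequency selectivity and spatial localization in existing INRs.

Details

Motivation: Existing INR methods lack frequency selectivity and spatial locality, which leads to reliance on redundant signal components and to spectral bias, making high-frequency details hard to capture.

Result: FLAIR outperforms existing INR methods on 2D image representation and restoration and on 3D reconstruction.

Insight: Frequency selection and spatial localization are key to improving INR performance; explicitly guiding the network to learn frequency information can effectively mitigate spectral bias.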

Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
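
To make the "energy scores from the discrete wavelet transform" idea concrete, here is a minimal sketch of a one-level Haar DWT on a 1D signal, followed by the energy of each sub-band. The choice of the Haar wavelet, the 1D setting, and the function names are assumptions for illustration; how WEGE turns such energies into guidance for the network is not described in the abstract and is not modelled here.

```python
import math


def haar_dwt(signal):
    """One-level orthonormal Haar DWT of an even-length sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def band_energy(coeffs):
    """Sum of squared coefficients in a sub-band."""
    return sum(c * c for c in coeffs)


smooth = [1.0, 1.0, 1.0, 1.0]       # energy ends up in the low-frequency band
wiggly = [1.0, -1.0, 1.0, -1.0]     # energy ends up in the high-frequency band
for name, sig in (("smooth", smooth), ("wiggly", wiggly)):
    a, d = haar_dwt(sig)
    print(name, band_energy(a), band_energy(d))
```

Because the transform is orthonormal, the two band energies always sum to the energy of the input, so the split between them is a direct measure of how much high-frequency content a signal region carries.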

[25] GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering cs.CVPDF

Farhaan Ebadulla, Chiraag Mudlapur, Gaurav BV

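TL;DR: GazeProphet is a software-only gaze prediction method for VR foveated rendering. It requires no dedicated hardware, combines a Spherical Vision Transformer with an LSTM, and outperforms traditional methods.

Details

Motivation: Current hardware-based eye tracking systems are costly, complex to calibrate, and limited in compatibility, which restricts the adoption of foveated rendering. A software-only solution is therefore needed.

Result: On a VR dataset, GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional methods by 24%, with reliable confidence calibration.

Insight: Software-only gaze prediction can approach the performance of hardware-based tracking, offering a practical route to wider adoption of VR foveated rendering.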

Abstract: Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction can work for VR foveated rendering, making this performance boost more accessible to different VR platforms and apps.
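
The headline metric, median angular error, can be computed as the median angle between predicted and ground-truth gaze directions. The sketch below assumes gaze is given as 3D direction vectors and uses hypothetical function names; the paper's exact gaze parameterization and evaluation protocol are not given in the abstract.

```python
import math
from statistics import median


def angular_error_deg(pred, target):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cos = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
    return math.degrees(math.acos(cos))


def median_angular_error(preds, targets):
    """Median per-sample angular error over a dataset."""
    return median(angular_error_deg(p, t) for p, t in zip(preds, targets))


preds = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
targets = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(median_angular_error(preds, targets))
```

The median is less sensitive than the mean to occasional large misses, such as a prediction made during a saccade, which is one reason it is a common choice for gaze benchmarks.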

[26] A Lightweight Dual-Mode Optimization for Generative Face Video Coding cs.CV | eess.IVPDF

Zihan Zhang, Shanzhi Yin, Bolin Chen, Ru-Ling Liao, Shiqi Wang

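TL;DR: This paper proposes a lightweight dual-mode optimization for Generative Face Video Coding (GFVC). Through architectural redesign and operational refinement, it substantially reduces parameter count and computational cost while preserving reconstruction quality.

Details

Motivation: Existing GFVC methods deliver excellent rate-distortion performance, but their large parameter counts and high computational costs hinder practical deployment, especially in resource-constrained environments. A lightweight solution is therefore needed.

Result: Experiments show that the method cuts parameters by 90.4% and computation by 88.9% relative to the baseline, while outperforming the state-of-the-art video coding standard VVC on perceptual quality metrics.

Insight: The core of the work is achieving a lightweight model through joint optimization of architecture and operations. It shows that computational requirements can be reduced substantially without sacrificing performance, paving the way for GFVC deployment on mobile edge devices.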

Abstract: Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization, combining architectural redesign and operational refinement, to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to the state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
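
The two-stage pruning idea can be sketched in a few lines: a soft, differentiable gate during training, then a binary mask that removes channels for good. Everything below is a simplified stand-in: the importance scores, the sigmoid gate, and the fixed threshold (which would be learned in the paper's setting) are assumptions, not the authors' formulation.

```python
import math


def soft_mask(scores, threshold, sharpness=10.0):
    """Training-time gate: a smooth value in (0, 1) per channel, near 0 below the threshold."""
    return [1.0 / (1.0 + math.exp(-sharpness * (s - threshold))) for s in scores]


def hard_mask(scores, threshold):
    """Post-training mask: keep a channel only if its score exceeds the threshold."""
    return [s > threshold for s in scores]


def prune_channels(channels, mask):
    """Physically drop the channels whose mask entry is False."""
    return [c for c, keep in zip(channels, mask) if keep]


scores = [0.9, 0.1, 0.5, 0.05]            # hypothetical per-channel importance
channels = ["c0", "c1", "c2", "c3"]       # stand-ins for real filter tensors
gates = soft_mask(scores, threshold=0.3)  # would scale activations during training
kept = prune_channels(channels, hard_mask(scores, threshold=0.3))
print(kept)
```

Keeping the gate soft during training lets gradients flow through channels that are on the way out, which is the stability argument the abstract makes; the hard mask is what actually yields the parameter and computation savings at inference.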

[27] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer cs.CVPDF

Hsieh Ching-Teng, Wang Yuan-Kai

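TL;DR: This paper proposes a Neuron-like Encoding method inspired by biological neurons. Combined with an artificial photoreceptor layer, it generates spike data that carries both color and luminance information, improving spiking neural network performance while adhering to neuromorphic computing principles.

Details

Motivation: Spiking neural networks (SNNs) lag behind convolutional neural networks (CNNs) in performance, partly because spike data has limited information capacity. Existing methods that train SNNs on non-spiking inputs depart from the original intent of neuromorphic computing.

Result: Experiments show that the method effectively increases the information content of spike signals and improves SNN performance.

Insight: Biologically inspired encoding can make spike data more expressive while preserving the character of neuromorphic computing, and may help SNNs find wider application.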

Abstract: In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
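
For readers unfamiliar with the Integrate-and-Fire model mentioned above, the sketch below shows the basic mechanism: a membrane potential accumulates input and emits a spike each time it crosses a threshold. It illustrates only the generic model with a reset-by-subtraction rule (an assumption); the paper's Neuron-like Encoding and photoreceptor layer are not reproduced.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Turn a sequence of input currents into a binary spike train."""
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current            # integrate
        if potential >= threshold:      # fire
            spikes.append(1)
            potential -= threshold      # reset by subtraction
        else:
            spikes.append(0)
    return spikes


# A brighter pixel (larger constant input) produces a higher spike rate.
dim = integrate_and_fire([0.25] * 8)     # [0, 0, 0, 1, 0, 0, 0, 1]
bright = integrate_and_fire([0.5] * 8)   # [0, 1, 0, 1, 0, 1, 0, 1]
```

Under this rate-style view, a single intensity channel maps to a single spike train; carrying color as well as luminance, as the paper aims to do, requires an encoding richer than this one-channel example.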

[28] DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup cs.CVPDF

Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang

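TL;DR: DictAS is a class-generalizable few-shot anomaly segmentation framework based on dictionary lookup. It detects anomalies in unseen categories using only a few normal reference images, without retraining on the target data.

Details

Motivation: Existing few-shot anomaly segmentation (FSAS) methods built on vision-language models depend on prior knowledge of anomaly samples from seen classes, which limits cross-category generalization. DictAS uses self-supervised learning to bring dictionary lookup capabilities to the FSAS task, rather than merely memorizing patterns from the training set.

Result: Experiments on seven industrial and medical datasets show that DictAS consistently outperforms existing FSAS methods.

Insight: With self-supervised learning and a dictionary lookup strategy, DictAS detects anomalies in unseen categories without relying on known anomaly samples, giving it strong generality.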

Abstract: Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability on unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, employing only a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) Dictionary Construction: simulating the index and content of a real dictionary using features from normal reference images. (2) Dictionary Lookup: retrieving queried region features from the dictionary via a sparse lookup strategy; when a query feature cannot be retrieved, it is classified as an anomaly. (3) Query Discrimination Regularization: enhancing anomaly discrimination by making abnormal features harder to retrieve from the dictionary, for which a Contrastive Query Constraint and a Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
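
The three components above can be illustrated with a toy lookup: normal reference features act as dictionary keys ("index") and values ("content"), a query retrieves a sparse mix of its closest entries, and a query that cannot be reconstructed well scores as anomalous. This is a minimal sketch under stated assumptions: the cosine similarity, top-k softmax weighting, temperature, and all function names are illustrative choices, not the paper's design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def lookup(query, keys, values, top_k=2, temperature=0.1):
    """Sparse lookup: softmax-weighted mix of the top_k most similar entries."""
    sims = [cosine(query, k) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)[:top_k]
    weights = [math.exp(sims[i] / temperature) for i in top]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)]


def anomaly_score(query, keys, values, **kwargs):
    """1 - cosine(query, retrieved): near 0 if retrievable, larger if not."""
    return 1.0 - cosine(query, lookup(query, keys, values, **kwargs))


# Dictionary built from (hypothetical) normal reference features.
keys = values = [[1.0, 0.0], [0.0, 1.0]]
print(anomaly_score([1.0, 0.1], keys, values, top_k=1))    # close to a normal entry
print(anomaly_score([-1.0, -1.0], keys, values, top_k=2))  # unlike any normal entry
```

Restricting the mix to a few entries is what keeps abnormal queries hard to reconstruct: a dense weighted sum over the whole dictionary could approximate almost anything, including defects.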

[29] Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics cs.CVPDF

Yuchen Yang, Linfeng Dong, Wei Wang, Zhihang Zhong, Xiao Sun

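TL;DR: Learnable SMPLify is a neural, optimization-free method for the inverse kinematics problem in 3D human pose and shape estimation, with significantly improved computational efficiency.

Details

Motivation: The conventional SMPLify method solves inverse kinematics through iterative optimization, which is computationally expensive and limits practical use. Inspired by the efficiency gains of data-driven neural networks, the authors propose a regression-based neural framework.

Result: The method runs nearly 200x faster than SMPLify, generalizes well to the unseen 3DPW and RICH datasets, and works as a model-agnostic plug-in when evaluated on LucidAction.

Insight: Data-driven methods can significantly improve efficiency, but data construction and generalization must be addressed; this work provides a simple, practical baseline for neural inverse kinematics.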

Abstract: In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
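
The temporal sampling and residual learning ideas can be sketched as follows: pair each frame's pose (as the initialization) with the pose a few frames later (as the target), and have the regressor predict the difference. The gap range, the flat parameter vectors, and the plain subtraction are simplifying assumptions; real SMPL pose parameters are joint rotations, and the paper's sampling rule and normalization scheme are not detailed in the abstract.

```python
def temporal_pairs(poses, max_gap):
    """Pair each frame's pose (initialization) with a pose 1..max_gap frames later (target)."""
    return [(poses[t], poses[t + gap])
            for gap in range(1, max_gap + 1)
            for t in range(len(poses) - gap)]


def residual_target(init, target):
    """Residual the network would regress: target minus initialization, per parameter."""
    return [b - a for a, b in zip(init, target)]


# Hypothetical sequence of 1-parameter poses.
poses = [[0.0], [0.25], [0.75]]
pairs = temporal_pairs(poses, max_gap=2)
residuals = [residual_target(init, tgt) for init, tgt in pairs]
```

Predicting a residual from a nearby pose rather than an absolute pose keeps the output range small, which is one plausible reading of the abstract's phrase "narrow the solution space".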

[30] The 9th AI City Challenge cs.CV | cs.AI | cs.LG | cs.ROPDF

Zheng Tang, Shuo Wang, David C. Anastasiu, Ming-Ching Chang, Anuj Sharma

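TL;DR: The 9th AI City Challenge focuses on computer vision and AI applications in transportation, industrial automation, and public safety. It comprised four tracks and drew 245 participating teams. The tracks covered multi-class 3D tracking, video question answering, spatial reasoning in warehouse environments, and efficient road object detection. The Track 1 and Track 3 datasets were generated in NVIDIA Omniverse, and new benchmarks were established.

Details

Motivation: To advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety, and to drive technical progress and algorithm optimization through multi-task challenges.

Result: Several teams achieved top-tier results and set new benchmarks; the challenge datasets have been downloaded more than 30,000 times.

Insight: 1. Multi-modal data (such as 3D annotations and RGB-D) can significantly improve AI task performance; 2. The need for edge-device deployment has driven the development of lightweight algorithms; 3. Public datasets and fair evaluation are vital to community progress.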

Abstract: The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.

[31] Generative Model-Based Feature Attention Module for Video Action Analysis cs.CVPDF

Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo

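TL;DR: This paper proposes a generative model-based attention module for video action analysis. By learning the differences between foreground and background, it captures the temporal dependencies of feature semantics and significantly improves the accuracy of action recognition and detection.

Details

Motivation: Existing video action analysis methods neglect feature semantics during feature extraction, leaving them insufficiently precise for high-performance Internet of Things (IoT) applications. A more robust and scalable solution is needed.

Result: Experiments on mainstream datasets show that the method performs strongly on action recognition and detection.

Insight: The temporal dependencies of feature semantics are crucial for video action analysis, especially in high-performance applications such as IoT.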

Abstract: Video action analysis is a foundational technology for intelligent video comprehension, particularly for applications in the Internet of Things (IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals. Their limited precision makes them unsuitable for widespread adoption in high-performance IoT applications, such as autonomous driving, which require robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relations among feature semantics. Specifically, by leveraging the differences between an action's foreground and background, our model simultaneously learns the frame- and segment-level dependencies of temporal action feature semantics, making effective use of feature semantics during feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. For action detection, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of our proposed method to a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.

[32] Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model cs.CVPDF

Ruixin Zhang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li

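TL;DR: This paper proposes a Temporal-Conditional Referring Video Object Segmentation model that improves performance through a better segmentation head design and a simplified text-to-video diffusion model.

Details

Motivation: Existing RVOS methods focus heavily on feature extraction and temporal modeling while neglecting segmentation head design. In addition, the traditional noise prediction module introduces randomness that hurts segmentation accuracy.

Result: The method achieves state-of-the-art performance on four public RVOS benchmarks.

Insight: Segmentation head design and a simpler, noise-free pipeline are key directions for improving RVOS performance.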

Abstract: Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.

[33] Bridging Clear and Adverse Driving Conditions cs.CVPDF

Yoel Shapiro, Yahia Showgan, Koustav Mullick

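TL;DR: This paper proposes a new domain adaptation (DA) pipeline that converts clear-weather images into fog, rain, snow, and nighttime images to improve the performance of autonomous driving systems under adverse conditions.

Details

Motivation: Autonomous driving systems degrade markedly under adverse conditions such as low illumination and precipitation, and existing datasets cover these conditions poorly. To avoid the high cost of collecting and annotating real data, the paper proposes a synthetic data generation method.

Result: On the ACDC dataset, semantic segmentation improves by 1.85% overall and by 4.62% on nighttime scenes, supporting the value of synthetic data for robust autonomous driving perception.

Insight: Combining simulated and real data helps bridge the simulation-to-real gap, offering a new approach to improving autonomous driving performance under adverse conditions.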

Abstract: Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.

[34] Towards Efficient Vision State Space Models via Token Merging cs.CVPDF

Jinyoung Park, Minseok Son, Changick Kim

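TL;DR: The paper proposes MaMe, a token-merging strategy designed for SSM-based vision models, which aims to improve computational efficiency while maintaining performance.

Details

Motivation: State Space Models (SSMs) are powerful in computer vision, but their computational efficiency still needs to improve for practical, large-scale deployment. Token reduction is an effective way to improve efficiency, but applying it to SSMs requires particular attention to their unique sequential modeling capabilities.

Result: Experiments show that MaMe reduces tokens efficiently while maintaining performance, and generalizes well beyond image classification to the video and audio domains.

Insight: MaMe not only improves the efficiency of SSMs but also demonstrates applicability across multiple domains, offering a new approach to broader SSM adoption.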

Abstract: State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $\mathbf{\Delta}$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.
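
A generic order-preserving token merge, driven by a per-token importance score, might look like the sketch below. The scores stand in for values derived from the state transition parameter $\mathbf{\Delta}$; the rule of averaging each dropped token into its nearest kept predecessor is an assumption made for illustration, since the abstract does not describe MaMe's actual arrangement strategy.

```python
def merge_tokens(tokens, scores, r):
    """Merge the r lowest-scoring tokens into their preceding kept token, keeping order."""
    # Token 0 is never dropped, so every dropped token has a left neighbour.
    candidates = sorted(range(1, len(tokens)), key=lambda i: scores[i])
    drop = set(candidates[:r])
    groups = []  # each entry: (running sum vector, number of tokens merged)
    for i, tok in enumerate(tokens):
        if i in drop:
            acc, count = groups[-1]
            groups[-1] = ([a + t for a, t in zip(acc, tok)], count + 1)
        else:
            groups.append((list(tok), 1))
    return [[a / count for a in acc] for acc, count in groups]


tokens = [[0.0], [2.0], [4.0], [6.0]]
scores = [0.9, 0.1, 0.8, 0.2]        # hypothetical informativeness per token
print(merge_tokens(tokens, scores, r=2))
```

Merging into a sequence neighbour, rather than into the most similar token anywhere in the sequence, keeps the scan order intact, which matters for SSMs because their hidden state is built up token by token.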

[35] Unleashing Semantic and Geometric Priors for 3D Scene Completion cs.CVPDF

Shiyuan Chen, Wei Sui, Bohao Zhang, Zeyd Boukhers, John See

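TL;DR: This paper proposes FoundationSSC, a new framework whose dual-decoupling design jointly optimizes geometric and semantic information for 3D semantic scene completion, improving performance on multiple datasets.

Details

Motivation: Existing 3D semantic scene completion methods typically use a coupled encoder to handle both geometric and semantic information, which limits performance. The paper aims to solve this with a decoupled design.

Result: On SemanticKITTI, the method improves on the previous best results by 0.23 mIoU and 2.03 IoU; on SSCBench-KITTI-360, it achieves state-of-the-art performance with 21.78 mIoU and 48.61 IoU.

Insight: A decoupled design can significantly improve the joint optimization of semantic and geometric information in 3D semantic scene completion, and the AAF module offers a new approach to feature fusion.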

Abstract: Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU. The code will be released upon acceptance.
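
The two reported metrics measure different things: IoU scores occupancy (geometry) regardless of class, while mIoU averages per-class IoU (semantics). The sketch below computes both on flattened voxel labels, assuming label 0 means empty and skipping classes that appear in neither prediction nor ground truth; benchmark-specific details such as ignored voxels are not modelled.

```python
def occupancy_iou(pred, gt, empty=0):
    """Geometric IoU: occupied vs. empty, ignoring the semantic class."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else 0.0


def mean_iou(pred, gt, classes):
    """Semantic mIoU: mean of per-class IoU over classes present in pred or gt."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 1, 1, 2]   # hypothetical voxel labels, 0 = empty
gt = [0, 1, 2, 2]
print(occupancy_iou(pred, gt), mean_iou(pred, gt, classes=[1, 2]))  # 1.0 0.5
```

In this toy case the geometry is perfect but one voxel has the wrong class, so IoU is 1.0 while mIoU is 0.5, which is why the two numbers can move independently.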

[36] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction cs.CV

Xiaolu Hou, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu

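TL;DR: The paper proposes PersonaVlog, a framework for automatically generating personalized multimodal Vlogs. It combines multi-agent collaboration and iterative self-correction with multimodal large language models to produce high-quality Vlogs from a theme and a reference image.

Details

Motivation: Existing Vlog generation methods rely on predefined scripts and lack dynamism and personal expression. With growing demand for short videos and personalized content, an automated approach that supports effective multimodal collaboration and strong personalization is needed.

Result: Experiments show that PersonaVlog significantly outperforms baseline methods at generating personalized multimodal Vlogs, with gains in efficiency and creativity.

Insight: Combining multimodal large language models with agent collaboration improves the personalization and dynamism of automated content generation, and iterative self-correction offers a new way to optimize multimodal generation.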

Abstract: With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.

[37] Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm cs.CV

Zakiah Ayop, Wan Mohamad Hariz Bin Wan Mohamad Rosdi, Looi Wei Hua, Syarulnaziah Anawar, Nur Fadzilah Othman

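TL;DR: The paper proposes a two-factor authentication smart entryway system based on a modified LBPH algorithm. It combines face recognition with passcode verification and uses an automated process to notify the owner and activate surveillance when a stranger is detected.

Details

Motivation: Demand for face mask detection rose during the COVID-19 pandemic, but existing smart entryway systems lack this capability. The paper aims to fill this gap.

Result: The system averages about 70% accuracy, about 80% precision, and about 83.26% recall, indicating that it can perform face recognition and mask detection.

Insight: The modified LBPH algorithm performs well on occluded faces, showing the potential of two-factor authentication in smart entryway systems.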

Abstract: Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been developed in smart entryways using IoT. However, there is a lack of IoT development on face mask detection. This paper proposes a two-factor authentication system for smart entryway access control using facial recognition and passcode verification and an automation process to alert the owner and activate the surveillance system when a stranger is detected and controls the system remotely via Telegram on a Raspberry Pi platform. The system employs the Local Binary Patterns Histograms for the full face recognition algorithm and modified LBPH algorithm for occluded face detection. On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. The results indicate that the system is capable of conducting face recognition and mask detection, automating the operation of the remote control to register users, locking or unlocking the door, and notifying the owner. The sample participants highly accept it for future use in the user acceptance test.
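For readers unfamiliar with LBPH, the following is a minimal sketch of the textbook 3×3 Local Binary Pattern operator and its histogram, written in plain Python. It is not the paper's modified algorithm, whose details the abstract does not give; the function names (`lbp_code`, `lbp_histogram`, `chi_square`) and the clockwise bit order are illustrative assumptions.

```python
def lbp_code(img, r, c):
    """8-bit LBP code for the pixel at (r, c) from its 3x3 neighbourhood.

    img is a list of rows of grayscale intensities (assumed layout).
    """
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


def chi_square(h1, h2):
    """Chi-square distance between two histograms (smaller means more similar)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
```

A full LBPH recognizer typically splits the face into a grid of cells, concatenates the per-cell histograms, and assigns the identity whose stored histogram is nearest under a distance such as the one above.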

[38] TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis cs.CV

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen

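TL;DR: TalkVid is a large-scale, high-quality, diverse dataset for audio-driven talking head synthesis, containing 1244 hours of video from 7729 unique speakers, and it markedly improves model generalization.

Details

Motivation: Existing audio-driven talking head models generalize poorly across ethnicities, languages, and age groups because their training data falls short in scale, quality, and diversity.

Result: A model trained on TalkVid outperforms models trained on other datasets, and TalkVid-Bench reveals performance disparities that traditional aggregate metrics hide.

Insight: Data diversity is critical to generalization; future work needs more balanced evaluation to measure model performance comprehensively.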

Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid

[39] RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance cs.CV

Sheng Yu, Di-Hua Zhai, Yuanqing Xia

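TL;DR: The paper proposes RCGNet, a category-level 6D object pose estimation method that uses only RGB images and relies on geometric guidance to improve accuracy.

Details

Motivation: Most current RGB-D-based methods perform poorly in scenes lacking depth information, so the paper explores an RGB-only solution for real-world settings where depth data is unavailable.

Result: Experiments on benchmark datasets show the method is efficient and more accurate than previous RGB-based methods.

Insight: RGB-only pose estimation has practical potential in real-world scenes, and geometric guidance is key to its performance.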

Abstract: While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object’s geometry, we introduce a geometric feature-guided algorithm, which enhances the network’s ability to effectively represent the object’s geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object’s pose, addressing the challenges associated with variable object scales in pose estimation. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.

[40] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation cs.CV | cs.AI | cs.CL

Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

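TL;DR: The paper studies how well multimodal large language models (MLLMs) identify image rotation and finds that current models struggle with a task that is simple for humans, especially distinguishing 90° from 270° rotations.

Details

Motivation: To study MLLM performance on a spatial reasoning task (image rotation identification) and measure the gap between their visual perception and human perception.

Result: Most models reliably identify 0° images and some identify 180° images, but none reliably distinguishes 90° from 270°. Auxiliary information gives only small, inconsistent gains, and fine-tuning improves 180° identification but not the 90°/270° distinction.

Insight: MLLMs lag significantly behind humans at identifying image rotation, indicating that their reasoning about spatial relationships and context still needs improvement.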

Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench – a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information – including captions, depth maps, and more – or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.

[41] OmniTry: Virtual Try-On Anything without Masks cs.CV

Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng

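TL;DR: OmniTry proposes a unified virtual try-on framework that supports mask-free try-on of many kinds of wearable objects (such as jewelry and accessories) and addresses the difficulty of obtaining paired data.

Details

Motivation: Existing virtual try-on methods mostly focus on clothing and require mask input, which limits practical use. OmniTry aims to widen the range of try-on objects and remove the dependence on masks.

Result: Across multiple object classes, OmniTry outperforms existing methods on object localization and appearance preservation.

Insight: Training first on large-scale unpaired images substantially reduces the number of paired samples needed and improves generalization.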

Abstract: Virtual Try-ON (VTON) is a practical and widely-applied task, for which most of existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garment to encompass any wearable objects, e.g., jewelries and accessories, with mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at https://omnitry.github.io/.

[42] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction cs.CV

Dengxian Gong, Shunping Ji

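TL;DR: DeH4R is a hybrid model that combines the efficiency of graph-generating methods with the dynamics of graph-growing methods. By decoupling the task, it achieves fast inference and better topology fidelity, outperforming existing methods on the CityScale and SpaceNet benchmarks.

Details

Motivation: Existing approaches to road network graph extraction (segmentation-based, graph-growing, or graph-generating) suffer from either poor topology fidelity or low computational efficiency. DeH4R aims to combine the strengths of graph-generating and graph-growing methods.

Result: On CityScale, it improves on RNGDet++ by 4.62 APLS and 10.18 IoU while running about 10 times faster.

Insight: Decoupling balances efficiency and dynamic vertex insertion, offering a new direction for road network graph extraction.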

Abstract: The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference, but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10× faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.

[43] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes cs.CV

Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang

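TL;DR: HumanPCR is an evaluation suite for assessing multimodal large language models (MLLMs) on human-related visual scenes. It spans three levels, Perception (Human-P), Comprehension (Human-C), and Reasoning (Human-R), covers tasks in 9 dimensions, and exposes weaknesses in detailed perception, temporal understanding, and mind modeling.

Details

Motivation: To assess multimodal models in complex human-centric scenes and fill gaps in existing benchmarks, particularly in the comprehensive evaluation of perception, comprehension, and reasoning.

Result: Evaluating more than 30 state-of-the-art models reveals clear weaknesses in detailed spatial perception, temporal understanding, and mind modeling, and shows that models struggle to proactively extract visual evidence.

Insight: On human-centric tasks, models rely on query-guided retrieval rather than proactive reasoning, and techniques such as scaling visual context and test-time thinking yield only limited improvement.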

Abstract: The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs’ capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing massive tasks of 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging manually curated video reasoning test that requires integrating multiple visual evidences, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.

[44] Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture cs.CV

Ali Abdari, Alex Falcon, Giuseppe Serra

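TL;DR: The paper proposes a hierarchical vision-language model for retrieving agriculture-themed virtual museums and introduces a new dataset, AgriMuseums. Experiments show strong retrieval performance.

Details

Motivation: Online educational content is growing rapidly but lacks effective organization and retrieval. The Metaverse offers immersive educational experiences, yet existing datasets are too small to train advanced models.

Result: The model reaches up to about 62% R@1 and 78% MRR, and improves on existing benchmarks by up to 6% R@1 and 11% MRR.

Insight: A hierarchical model better combines visual and language information to improve retrieval, and the new dataset supports future research.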

Abstract: Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users’ interests remains a challenging task. A first step in this direction has been done recently, but existing datasets are small and not sufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval .
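The R@1 and MRR figures quoted above are standard retrieval metrics. As a minimal sketch of how they are usually computed, assuming one correct item per query and 1-based ranks (the helper names are illustrative and not taken from the paper's repository):

```python
def rank_of_target(scores, target_index):
    """1-based rank of the correct item given similarity scores for all candidates."""
    target_score = scores[target_index]
    # Count candidates scored strictly higher than the correct one.
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k=1):
    """Fraction of queries whose correct item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Four hypothetical queries whose correct museum was ranked 1st, 2nd, 1st, and 4th.
ranks = [1, 2, 1, 4]
r_at_1 = recall_at_k(ranks, k=1)     # 0.5
mrr = mean_reciprocal_rank(ranks)    # 0.6875
```

MRR rewards near misses (a rank of 2 still contributes 0.5), which is why it is typically higher than R@1 for the same system.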

[45] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance cs.CV

Yiming Cao, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao

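TL;DR: The paper proposes IPGA (Intermediate Projector Guided Attack), a targeted adversarial attack that attacks the projector module of vision-language models (such as the Q-Former), giving more precise control over perturbations and more effective attacks.

Details

Motivation: Current targeted adversarial attacks typically operate on global similarity at the image-encoder level and overlook the key role of the projector module in vision-language models, which limits attack granularity and effectiveness.

Result: Experiments show that IPGA outperforms existing methods on both global image captioning and fine-grained visual question answering, and transfers to commercial models such as Google Gemini and OpenAI GPT.

Insight: The projector module is a key semantic bridge in vision-language models; attacking its intermediate stage improves the granularity and generality of adversarial attacks.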

Abstract: Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the language model in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in black-box environment. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.

[46] Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks cs.CV | cs.AI

Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe

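TL;DR: The paper proposes FOCUS, a decoding strategy that mitigates cross-image information leakage in large vision-language models (LVLMs) on multi-image tasks, improving performance through random-noise masking and contrastive refinement.

Details

Motivation: Existing LVLMs perform well on single-image tasks but degrade significantly on multi-image inputs because visual cues from different images become entangled in the model's output, a phenomenon termed cross-image information leakage.

Result: FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families.

Insight: FOCUS offers a general solution for multi-image reasoning without additional training or architectural changes, showing the potential of improving model performance through the decoding strategy alone.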

Abstract: Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model’s output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
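The decoding procedure described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `model` is a hypothetical callable returning next-token logits, images are stood in for by flat lists of floats, aggregation is assumed to be a plain mean, and the contrastive step is assumed to take the common form `(1 + alpha) * aggregated - alpha * reference`. The abstract specifies none of these details.

```python
import random


def noise_like(image, rng):
    """Replace an image (here a flat list of floats) with uniform random noise."""
    return [rng.random() for _ in image]


def focus_logits(model, images, prompt, alpha=1.0, seed=0):
    """Sketch of noise-masked, contrastively refined logits for one decoding step."""
    rng = random.Random(seed)

    # 1. For each image in turn, keep it clean and mask all the others with noise.
    per_image = []
    for keep in range(len(images)):
        masked = [img if i == keep else noise_like(img, rng)
                  for i, img in enumerate(images)]
        per_image.append(model(masked, prompt))

    # 2. Aggregate the per-image logits (assumed: element-wise mean).
    n = len(per_image)
    aggregated = [sum(column) / n for column in zip(*per_image)]

    # 3. Noise-only reference pass: every image is replaced by noise.
    reference = model([noise_like(img, rng) for img in images], prompt)

    # 4. Contrastive refinement (assumed form): push away from the reference.
    return [(1 + alpha) * a - alpha * r for a, r in zip(aggregated, reference)]
```

With N images this costs N + 1 forward passes per decoding step, which is the price of remaining training-free and architecture-agnostic.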

[47] Shape-from-Template with Generalised Camera cs.CV

Agniva Sengupta, Stefan Zachow

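TL;DR: The paper proposes a new method based on the generalised camera model for non-rigidly registering a 3D shape to 2D keypoints observed by a multi-camera system, extending the scope of Shape-from-Template (SfT).

Details

Motivation: Traditional SfT methods mainly use a single camera; exploiting constraints from multi-camera systems can improve the accuracy and applicability of non-rigid registration, for example in medical imaging and hand-held camera applications.

Result: The method's accuracy is validated on synthetic and real data, demonstrating the advantages of multi-camera systems for non-rigid registration.

Insight: The generalised camera model provides a unified framework for multi-camera systems, and multi-view constraints improve the accuracy of non-rigid 3D reconstruction.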

Abstract: This article presents a new method for non-rigidly registering a 3D shape to 2D keypoints observed by a constellation of multiple cameras. Non-rigid registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template (SfT), has been widely studied using single images, but SfT with information from multiple-cameras jointly opens new directions for extending the scope of known use-cases such as 3D shape registration in medical imaging and registration from hand-held cameras, to name a few. We represent such multi-camera setup with the generalised camera model; therefore any collection of perspective or orthographic cameras observing any deforming object can be registered. We propose multiple approaches for such SfT: the first approach where the corresponded keypoints lie on a direction vector from a known 3D point in space, the second approach where the corresponded keypoints lie on a direction vector from an unknown 3D point in space but with known orientation w.r.t some local reference frame, and a third approach where, apart from correspondences, the silhouette of the imaged object is also known. Together, these form the first set of solutions to the SfT problem with generalised cameras. The key idea behind SfT with generalised camera is the improved reconstruction accuracy from estimating deformed shape while utilising the additional information from the mutual constraints between multiple views of a deformed object. The correspondence-based approaches are solved with convex programming while the silhouette-based approach is an iterative refinement of the results from the convex solutions. We demonstrate the accuracy of our proposed methods on many synthetic and real data

[48] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization cs.CV

Jiajing Lin, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang

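TL;DR: VisionLaw proposes a bilevel optimization framework that infers interpretable intrinsic dynamics from visual observations, combining an LLM-driven decoupled constitutive evolution strategy with a vision-guided evaluation mechanism, and significantly outperforms existing methods.

Details

Motivation: Existing methods either rely on manually defined constitutive priors, which generalize poorly, or model dynamics with neural networks, which limits interpretability and generalization. VisionLaw addresses both by inferring an object's intrinsic dynamics as interpretable expressions.

Result: VisionLaw significantly outperforms existing methods on synthetic and real-world datasets and generalizes to interactive simulation in novel scenarios.

Insight: Combining LLM-driven symbolic modeling with vision-guided evaluation balances interpretability and generalization, offering a new approach to physical simulation.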

Abstract: The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted as a knowledgeable physics expert to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.

[49] A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports cs.CV | cs.AI

Enobong Adahada, Isabel Sassoon, Kate Hone, Yongmin Li

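TL;DR: Med-CTX is a Transformer-based multimodal framework that combines medical images with clinical reports for explainable breast cancer ultrasound segmentation, outperforming existing baselines.

Details

Motivation: Medical image segmentation needs both high accuracy and interpretability, which existing methods struggle to deliver together. Multimodal information from clinical reports can improve both performance and transparency.

Result: On the BUS-BRA dataset it reports a Dice score of 99% and an IoU of 95%, beating baselines such as U-Net. Clinical text is important to both segmentation accuracy and explanation quality.

Insight: 1. Adding clinical text markedly improves segmentation performance and interpretability. 2. Cross-modal attention is the key mechanism behind the performance and transparency. 3. The model's confidence calibration (ECE 3.2%) and multimodal alignment (CLIP score 85%) increase its trustworthiness for medical use.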

Abstract: We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cross-modal attention, allowing the model to provide clinically grounded, model generated explanations. Our methodology generates segmentation masks, uncertainty maps, and diagnostic rationales all at once, increasing confidence and transparency in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and Swin. Clinical text plays a key role in segmentation accuracy and explanation quality, as evidenced by ablation studies that show a -5.4% decline in Dice score and -31% in CIDEr. Med-CTX achieves good multimodal alignment (CLIP score: 85%) and increased confidence calibration (ECE: 3.2%), setting a new bar for trustworthy, multimodal medical architecture.
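For reference, Dice and IoU on a single pair of binary masks can be computed as in the sketch below. The flat 0/1 list representation and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; the paper's evaluation code may differ.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + target_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# Predicted lesion covers 3 pixels, ground truth covers 2, and they share 2.
dice, iou = dice_and_iou([1, 1, 1, 0], [1, 1, 0, 0])
```

For any single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU on the same prediction.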

[50] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering cs.CV | cs.LG

Diaa Addeen Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte

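TL;DR: The paper proposes an unsupervised clustering framework that combines street-level imagery with spatial planting patterns to estimate urban tree biodiversity without labeled data, suited to cities lacking detailed inventories.

Details

Motivation: Urban tree biodiversity is vital to urban ecosystems, but traditional field surveys are costly, and supervised AI methods need labeled data and generalize poorly. The paper aims to develop a low-cost, scalable unsupervised method.

Result: Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, with low Wasserstein distances to ground truth and preserved spatial autocorrelation.

Insight: The method gives cities without detailed tree inventories a low-cost, scalable way to monitor biodiversity and supports adaptive management of urban ecosystems.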

Abstract: Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
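The Shannon and Simpson indices mentioned above are both functions of the per-group proportions, so they can be computed from cluster sizes alone. A minimal sketch follows, assuming natural logarithms for Shannon and the Gini-Simpson form (1 − Σp²) for Simpson; other Simpson variants exist, and the abstract does not say which one the paper uses.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over groups with non-zero count."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p^2): chance two random trees differ."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical cluster sizes standing in for three genera on one street segment.
cluster_sizes = [50, 30, 20]
h = shannon_index(cluster_sizes)
d = simpson_index(cluster_sizes)
```

Because only the proportions matter, unlabeled clusters are enough to estimate diversity as long as each cluster corresponds roughly to one genus.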

[51] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation cs.CV

Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto

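TL;DR: The paper proposes SAGA, a training-free method that explicitly models the signal component during denoising to improve alignment with the prompt in text-to-image generation, and supports additional conditioning modalities (such as bounding boxes) for better spatial alignment.

Details

Motivation: Current text-to-image models produce visually impressive results but often fail to align precisely with the text prompt, leading to missing key elements or blended concepts.

Result: Experiments show the method outperforms current state-of-the-art methods on text-to-image generation.

Insight: Learning signal-aligned distributions effectively addresses alignment problems in text-to-image generation and applies to different architectures without additional training.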

Abstract: State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities – such as bounding boxes – for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.

[52] RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection cs.CV

Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool

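TL;DR: The paper introduces two new incremental learning (IL) benchmarks under the name RICO (D-RICO and EC-RICO), targeting domain shift and class expansion respectively. Built from 14 diverse datasets, they expose the shortcomings of current IL methods in adaptability and knowledge retention.

Details

Motivation: Existing IL evaluations mostly rely on simplified or synthetic datasets that do not reflect the challenges of IL in real-world settings, so more realistic benchmarks are needed.

Result: Experiments show that existing IL methods underperform in adaptability and retention; replaying a small amount of previous data already outperforms all IL methods, yet individual training remains best.

Insight: The performance bottleneck of IL methods may stem from weak teachers in distillation, a single model's difficulty handling diverse tasks, and insufficient plasticity.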

Abstract: Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models’ inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.

[53] In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging cs.CV

Valentina Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm

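TL;DR: The paper proposes LCRReg, which uses Latent Concept Representations (LCRs) and a small auxiliary dataset, with no concept labels in the main training set, to steer models toward clinically meaningful features and improve the robustness of deep learning models in medical imaging.

Details

Motivation: Deep learning models in medical imaging often rely on spurious correlations rather than clinically relevant features, so they generalize poorly under distribution shift.

Result: Evaluated on synthetic and real-world medical tasks, LCRReg significantly improves robustness to spurious correlations and OOD generalization.

Insight: LCRReg offers a lightweight way to improve model robustness without dense concept supervision.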

Abstract: Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization

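[54] SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation cs.CV

Weixin Xu, Ziliang Wang

TL;DR: SCRNet is a new architecture for medical ultrasound image segmentation. Its Spatial-Channel Regulation Module (SCRM) and Feature Aggregation Module (FAM) combine convolution with cross-attention to capture both long-range dependencies and local context.

Details

Motivation: CNN-based methods tend to neglect long-range dependencies, while Transformer-based methods may overlook local context; medical ultrasound segmentation needs a method that handles both.

Result: Experiments show that SCRNet achieves state-of-the-art (SOTA) performance on medical ultrasound image segmentation.

Insight: Running convolution and attention in parallel lets each compensate for the other's weaknesses, which suits the complex feature extraction required in medical images.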

Abstract: Medical ultrasound image segmentation presents a formidable challenge in the realm of computer vision. Traditional approaches rely on Convolutional Neural Networks (CNNs) and Transformer-based methods to address the intricacies of medical image segmentation. Nevertheless, inherent limitations persist, as CNN-based methods tend to disregard long-range dependencies, while Transformer-based methods may overlook local contextual information. To address these deficiencies, we propose a novel Feature Aggregation Module (FAM) designed to process two input features from the preceding layer. These features are seamlessly directed into two branches of the Convolution and Cross-Attention Parallel Module (CCAPM) to endow them with different roles in each of the two branches to help establish a strong connection between the two input features. This strategy enables our module to focus concurrently on both long-range dependencies and local contextual information by judiciously merging convolution operations with cross-attention mechanisms. Moreover, by integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM), the ability to discern salient regions and informative features warranting increased attention is enhanced. Furthermore, by incorporating the SCRM into the encoder block of the UNet architecture, we introduce a novel framework dubbed Spatial-Channel Regulation Network (SCRNet). The results of our extensive experiments demonstrate the superiority of SCRNet, which consistently achieves state-of-the-art (SOTA) performance compared to existing methods.

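[55] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis cs.CV

Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li

TL;DR: PhysGM is a feed-forward framework that predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering, significantly faster than prior methods.

Details

Motivation: Existing physics-grounded 3D motion synthesis methods rely on pre-reconstructed 3D Gaussian Splatting representations, and their physics integration is limited by either inflexible, manually defined attributes or unstable, optimization-heavy guidance from video models.

Result: PhysGM generates high-fidelity 4D simulations from a single image in one minute, a significant speedup with realistic rendering results.

Insight: By jointly modeling the 3D representation and its physical properties, PhysGM achieves efficient, high-quality 4D synthesis and offers a new end-to-end approach to physical simulation.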

Abstract: While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization, which requires back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate the training, we introduce a new dataset, PhysAssets, of over 24,000 3D assets, annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/

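[56] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts cs.CV

Ziang Wang, Xiaoqin Wang, Dingyi Wang, Qiang Li, Shushan Qiao

TL;DR: The paper proposes DIME-Net, a dual-illumination adaptive enhancement network based on Retinex theory and Mixture-of-Experts. It handles low-light and backlit degradation in a unified way, selecting expert networks through a sparse gating mechanism and improving image quality with a restoration module.

Details

Motivation: Complex real-world lighting, such as low-light and backlit conditions, degrades image quality and harms downstream vision tasks. Most existing methods address a single illumination problem and cannot handle diverse conditions in a unified manner.

Result: DIME-Net performs competitively on synthetic and real-world low-light and backlit datasets without retraining, demonstrating generalization ability and practical potential.

Insight: Combining Retinex theory with a Mixture-of-Experts network can handle multiple illumination problems in one model, and illumination-aware attention improves restoration quality.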

Abstract: Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
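
The sparse gating idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the gating function or the form of the S-curve experts, so the top-k softmax gate (a common Mixture-of-Experts choice), the rescaled-sigmoid `s_curve`, and the `gains` values are all hypothetical.

```python
import math


def s_curve(x, gain):
    """Hypothetical S-curve expert: a sigmoid centred at 0.5, rescaled so
    that an input of 0 maps to 0 and an input of 1 maps to 1."""
    lo = 1.0 / (1.0 + math.exp(gain * 0.5))
    hi = 1.0 / (1.0 + math.exp(-gain * 0.5))
    y = 1.0 / (1.0 + math.exp(-gain * (x - 0.5)))
    return (y - lo) / (hi - lo)


def top_k_gate(logits, k):
    """Sparse gate: softmax over the k largest logits, zero weight elsewhere."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def estimate_illumination(pixel, gate_logits, gains, k=2):
    """Mix only the selected experts; unselected experts are never evaluated."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * s_curve(pixel, g) for w, g in zip(weights, gains) if w > 0.0)


# Toy gate logits for four experts; in a real model they come from the image.
logits = [2.0, 0.1, 1.0, -1.0]
weights = top_k_gate(logits, k=2)
out = estimate_illumination(0.2, logits, gains=[2.0, 4.0, 8.0, 16.0])
```

Only experts 0 and 2 receive non-zero weight here, which is what makes the gate sparse: the cost per image depends on `k`, not on the total number of experts.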

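[57] ViT-FIQA: Assessing Face Image Quality using Vision Transformers cs.CV

Andrea Atzori, Fadi Boutros, Naser Damer

TL;DR: The paper proposes ViT-FIQA, a Vision Transformer (ViT) based method for Face Image Quality Assessment (FIQA) that predicts a face image's utility score using a learnable quality token and a regression head.

Details

Motivation: Existing FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of ViT architectures for FIQA underexplored.

Result: Across multiple benchmarks and different face recognition (FR) models, ViT-FIQA consistently achieves top-tier performance, supporting the effectiveness of ViTs for FIQA.

Insight: ViT architectures provide a scalable foundation for FIQA and show potential to surpass CNNs.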

Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample’s utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research (https://cutt.ly/irHlzXUC).

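[58] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving cs.CV

Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Keyuan Zhou

TL;DR: ROVR-Open-Dataset is a large-scale depth dataset for autonomous driving that addresses the limited diversity and scalability of existing datasets and provides a new platform for depth estimation research.

Details

Motivation: Existing depth datasets (such as KITTI, nuScenes, and DDAD) are limited in diversity and scalability, and benchmark performance on them is approaching saturation. New large-scale, diverse datasets are needed for the era of foundation models and multi-modal learning.

Result: Benchmark experiments with standard monocular depth estimation models validate the dataset's utility and reveal performance gaps under challenging conditions.

Insight: The new dataset provides more complex scenarios and new challenges for depth estimation research, helping to advance the field.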

Abstract: Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. Benchmark experiments with standard monocular depth estimation models validate the dataset’s utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.

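[59] OmViD: Omni-supervised active learning for video action detection cs.CV

Aayush Rana, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

TL;DR: OmViD is an active-learning approach to video action detection that cuts annotation cost by choosing an annotation type of suitable granularity for each video (video-level tags, points, scribbles, and so on) and uses 3D superpixels to generate pseudo-labels for effective training.

Details

Motivation: Video action detection usually requires dense spatio-temporal annotations, which are expensive and time-consuming to obtain. Videos differ in difficulty and annotation needs, so it matters how to pick the right annotation type for each video and how to learn from the resulting mix.

Result: Validated on UCF101-24 and JHMDB-21, the approach significantly cuts annotation cost with minimal performance loss.

Insight: Selecting the annotation type per video and generating pseudo-labels from it is an effective way to lower annotation cost for video action detection.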

Abstract: Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.

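[60] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment cs.CV

Samuel Seligardi, Pietro Musoni, Eleonora Iotti, Gianluca Contesso, Alessandro Dal Palù

TL;DR: The paper presents a physics-based 3D simulation system for generating synthetic data and analysing failures in packaging stability assessment, reducing the need for physical testing and improving measurement accuracy.

Details

Motivation: Growing demand for automation in logistics, together with the search for eco-friendly wrapping materials, motivated the development of a controllable and accurate physical simulation system.

Result: The system reduces the cost and environmental impact of physical testing while improving measurement accuracy for analysing pallet dynamics.

Insight: Combining physical simulation with deep learning gives an efficient, environmentally friendly approach to safety analysis of logistics packaging.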

Abstract: The design and analysis of pallet setups are essential for ensuring the safety of package transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system’s utility in safety analysis.

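[61] Self-Supervised Sparse Sensor Fusion for Long Range Perception cs.CV

Edoardo Palladin, Samuel Brucker, Filippo Ghilotti, Praveen Narayanan, Mario Bijelic

TL;DR: The paper proposes a self-supervised sparse sensor fusion method that improves long-range (250 m) perception for autonomous vehicles. An efficient 3D encoding of multi-modal and temporal features, together with a self-supervised pre-training scheme, markedly improves object detection and LiDAR forecasting.

Details

Motivation: Driving on intercity highways requires long-range perception (250 m), whereas existing methods mostly target short ranges (50-100 m), and the compute and memory cost of BEV representations grows quadratically with distance.

Result: Perception range is extended to 250 m, with a 26.6% improvement in object detection mAP and a 30.5% decrease in Chamfer Distance for LiDAR forecasting.

Insight: Sparse representations and self-supervised learning are effective ways to contain compute and annotation costs in long-range perception, especially for intercity highways and large trucks.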

Abstract: Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows autonomy to be extended from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird’s Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we built on top of a sparse representation and introduced an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP in object detection and a decrease of 30.5% in Chamfer Distance in LiDAR forecasting compared to existing methods. Project Page: https://light.princeton.edu/lrs4fusion/
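
The quadratic cost of dense BEV grids noted in the abstract follows from simple arithmetic, sketched below. The square grid centred on the vehicle and the 0.5 m cell size are assumptions chosen for illustration; the abstract gives only the ranges.

```python
def bev_cells(range_m, cell_size_m):
    """Number of cells in a square BEV grid covering [-range, +range] on both axes."""
    side = round(2 * range_m / cell_size_m)
    return side * side


short_range = bev_cells(50, 0.5)    # 200 x 200 grid
long_range = bev_cells(250, 0.5)    # 1000 x 1000 grid
ratio = long_range / short_range    # 5x the range -> 25x the cells
```

Extending the range fivefold multiplies the number of dense cells by 25 regardless of the cell size, which is why a sparse representation becomes attractive at 250 m.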

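[62] ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans cs.CV | cs.RO | 68T45

Mohamed Abouagour, Eleftherios Garyfallidis

TL;DR: ResPlan is a large-scale dataset of 17,000 detailed, structurally rich residential floor plans, created to advance spatial AI research.

Details

Motivation: Existing datasets such as RPLAN and MSD are limited in visual fidelity and structural diversity. ResPlan fills this gap with more realistic, non-idealized residential layouts.

Result: ResPlan offers a significant advance in scale and usability over existing benchmarks and supports applications such as robotics, generative AI, and game development.

Insight: ResPlan provides a robust foundation for developing spatial intelligence systems, and its open benchmark tasks support further work by the research community.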

Abstract: We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.

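[63] Backdooring Self-Supervised Contrastive Learning by Noisy Alignment cs.CV

Tuo Chen, Jie Gui, Minjing Dong, Ju Jia, Lanting Fang

TL;DR: The paper proposes Noisy Alignment (NA), a data-poisoning backdoor attack on self-supervised contrastive learning (CL), a class of attacks it abbreviates as DPCLs. NA explicitly suppresses noise components in poisoned images and optimizes the image layout around CL's random cropping mechanism, significantly improving attack effectiveness while preserving clean-data accuracy.

Details

Motivation: Existing DPCLs depend on fragile implicit co-occurrence and do not adequately suppress discriminative features in poisoned images, which limits their effectiveness. The paper aims to address both problems with a more effective backdoor attack.

Result: Experiments show that NA outperforms existing DPCLs in attack effectiveness while maintaining clean-data accuracy and robustness against common defenses. The code is open source.

Insight: The paper shows that CL's random cropping mechanism plays a key role in data-poisoning attacks, opening new directions for research on backdoor attacks and defenses.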

Abstract: Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but is vulnerable to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between the backdoor and the target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning’s random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Code is available at https://github.com/jsrdcht/Noisy-Alignment.

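[64] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing cs.CV

Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu

TL;DR: The paper proposes InfiniteTalk, an audio-driven video generation method for sparse-frame video dubbing. It overcomes the restriction of conventional methods to mouth-region editing and enables audio-synchronized full-body motion editing.

Details

Motivation: Conventional video dubbing edits only the mouth region, leaving facial expressions and body gestures out of step with the audio. The paper proposes sparse-frame video dubbing, a new paradigm that preserves reference keyframes to keep identity, iconic gestures, and camera trajectories consistent while supporting audio-synchronized full-body motion editing.

Result: Evaluations on the HDTF, CelebV-HQ, and EMTD datasets show SOTA performance in visual realism, emotional coherence, and full-body motion synchronization.

Insight: By preserving keyframes, sparse-frame video dubbing maintains identity consistency while allowing synchronized full-body motion editing, offering a new research direction for video dubbing.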

Abstract: Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

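[65] Distilled-3DGS: Distilled 3D Gaussian Splatting cs.CV

Lintao Xiang, Xinkai Chen, Jianhuang Lai, Guangcong Wang

TL;DR: The paper proposes Distilled-3DGS, a knowledge distillation framework for 3D Gaussian Splatting (3DGS). It aggregates the outputs of multiple teacher models to optimize a lightweight student model, substantially reducing storage requirements while keeping high-fidelity rendering quality.

Details

Motivation: 3DGS performs well in novel view synthesis, but high-fidelity rendering requires a large number of 3D Gaussians, leading to heavy storage and memory consumption. The paper uses knowledge distillation to address this.

Result: In quantitative and qualitative evaluations across multiple datasets, Distilled-3DGS achieves rendering quality comparable to state-of-the-art methods with lower storage requirements.

Insight: Knowledge distillation can compress 3DGS models effectively while preserving rendering ability; the structural similarity loss is a key contributor.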

Abstract: 3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher model. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising rendering results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io . Code: https://github.com/lt-xiang/Distilled-3DGS .
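
The multi-teacher step described in the abstract can be sketched in a few lines. The abstract says only that teacher outputs are "aggregated", so the per-pixel mean, the L1 loss, and the function names below are assumptions for illustration; the paper's structural similarity loss is not reproduced here.

```python
def aggregate_teachers(renders):
    """Per-pixel mean of several teacher renders (each a flat list of floats)."""
    n = len(renders)
    return [sum(pixels) / n for pixels in zip(*renders)]


def l1_loss(pred, target):
    """Mean absolute error between a student render and the pseudo-target."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy three-pixel renders from three hypothetical teachers
# (for example vanilla, noise-augmented, and dropout-regularized).
teachers = [[0.2, 0.4, 0.6], [0.4, 0.4, 0.8], [0.3, 0.7, 0.7]]
pseudo_target = aggregate_teachers(teachers)
student_render = [0.3, 0.5, 0.6]
loss = l1_loss(student_render, pseudo_target)
```

The student is optimized against the aggregated pseudo-target rather than against any single teacher, so the idiosyncrasies of one teacher are averaged out.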

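[66] Beyond Simple Edits: Composed Video Retrieval with Dense Modifications cs.CV

Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah

TL;DR: The paper introduces a new dataset, Dense-WebVid-CoVR, and a cross-attention fusion model for handling fine-grained modifications in composed video retrieval, achieving state-of-the-art results on all metrics.

Details

Motivation: Existing video retrieval frameworks struggle with fine-grained compositional queries and variations in temporal understanding, which limits their retrieval ability on complex tasks.

Result: The model achieves 71.3% Recall@1 in the visual+text setting, 3.4% higher than the previous best method.

Insight: Fine-grained composed video retrieval needs richer datasets and more precise cross-modal alignment.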

Abstract: Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
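
For readers unfamiliar with the Recall@1 figure quoted above, the metric can be computed as in the sketch below. The similarity matrix and target indices are toy values invented for illustration, not data from the paper, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(similarity, target_index, k=1):
    """Fraction of queries whose correct video is among the top-k candidates.

    similarity[q][v] is the score of candidate video v for query q, and
    target_index[q] is the index of the correct video for query q.
    """
    hits = 0
    for scores, target in zip(similarity, target_index):
        ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries (query video + modification text), 3 candidate videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.3, 0.8],
    [0.4, 0.6, 0.5],
]
targets = [0, 2, 2]
r1 = recall_at_k(sim, targets, k=1)
r2 = recall_at_k(sim, targets, k=2)
```

In this toy case the third query ranks its target second, so it counts as a miss for Recall@1 but a hit for Recall@2.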

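[67] LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos cs.CV

Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin

TL;DR: LongSplat proposes a 3D Gaussian Splatting framework for long videos with unknown camera poses, irregular motion, and large scenes, improving rendering quality, pose accuracy, and computational efficiency.

Details

Motivation: Current methods often fail on long videos because of camera pose drift, inaccurate geometry initialization, and memory limits. LongSplat targets these problems for scene reconstruction and synthesis from casually captured long videos.

Result: On several challenging benchmarks, LongSplat outperforms existing methods in rendering quality, pose accuracy, and computational efficiency, reaching state-of-the-art results.

Insight: Incremental optimization and spatial density information can markedly improve the robustness and efficiency of 3D reconstruction from long videos; jointly optimizing camera poses and 3D structure is key.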

Abstract: LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/
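
One way to picture density-based anchor formation is an octree that keeps subdividing crowded cells, sketched below. This is a generic octree illustration rather than the paper's Octree Anchor Formation mechanism, whose details the abstract does not give; the `max_points` and `max_depth` thresholds and the choice of leaf centres as anchors are assumptions.

```python
def octree_anchors(points, center, half_size, max_points=4, max_depth=6, depth=0):
    """Recursively split a cube until each leaf holds at most max_points
    points (or max_depth is reached); return one anchor, the leaf centre,
    per non-empty leaf. Dense regions therefore receive more anchors."""
    if not points:
        return []
    if len(points) <= max_points or depth == max_depth:
        return [center]
    children = {}
    for p in points:
        # Octant key: 1 where the point lies on the upper side of the centre.
        key = tuple(int(p[i] >= center[i]) for i in range(3))
        children.setdefault(key, []).append(p)
    quarter = half_size / 2.0
    anchors = []
    for key, pts in children.items():
        child_center = tuple(
            center[i] + (quarter if key[i] else -quarter) for i in range(3)
        )
        anchors.extend(
            octree_anchors(pts, child_center, quarter, max_points, max_depth, depth + 1)
        )
    return anchors


# Toy cloud: two isolated points plus a dense cluster along a diagonal.
sparse = [(-0.5, -0.5, -0.5), (-0.5, 0.5, -0.5)]
dense = [(0.5 + 0.02 * i, 0.5 + 0.02 * i, 0.5 + 0.02 * i) for i in range(20)]
points = sparse + dense
anchors = octree_anchors(points, (0.0, 0.0, 0.0), 1.0)
```

Each isolated point ends up with a single coarse anchor, while the dense cluster is covered by several finer ones, so the anchor count tracks spatial density instead of the raw number of points.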

cs.CL [Back]

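[68] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents cs.CL | cs.AI | cs.CV

Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong

TL;DR: MM-BrowseComp is a new benchmark for evaluating multimodal browsing agents, filling the gap left by existing benchmarks that overlook multimodal content.

Details

Motivation: Existing benchmarks such as BrowseComp focus mainly on textual information and overlook the multimodal content that is prevalent on webpages, so a new benchmark is needed to assess agents' multimodal retrieval and reasoning.

Result: Even top models such as OpenAI o3 with tools reach only 29.02% accuracy on MM-BrowseComp, showing that current models' multimodal capabilities fall short.

Insight: Current AI agents perform poorly at multimodal reasoning, underlining the need for future research to strengthen multimodal capabilities.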

Abstract: AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents’ multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.

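[69] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection cs.CL

Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Thomas Pickard, Maggie Mi

TL;DR: The paper studies how reasoning affects large language models on idiomaticity detection and finds that the gains from reasoning are limited and vary with model size.

Details

Motivation: Idiomaticity detection requires understanding the underlying meaning of an expression, so reasoning might improve performance. Whether the gain is substantial, and how it depends on model size, was unclear.

Result: Reasoning gives smaller models a limited boost, while larger models (14B and above) perform better and produce accurate definitions of idiomatic expressions. Providing definitions in the prompt can improve smaller models' performance in some cases.

Insight: Model size is critical for idiomaticity detection: larger models are better at semantic understanding, while smaller models can partly compensate with external information such as definitions.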

Abstract: The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated and serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.

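[70] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis cs.CL | cs.AI

Ayoub Ben Chaliah, Hela Dellagi

TL;DR: Datarus-R1-14B is an open-weights 14B-parameter language model built for data analysis and problem solving. It is trained on multi-step reasoning with code execution and outperforms models of similar size.

Details

Motivation: Conventional language models often produce redundant reasoning steps or circular logic on complex data analysis and problem solving. Datarus-R1 addresses this by training on full analytical trajectories, including reasoning, code execution, and self-correction.

Result: On standard public benchmarks, Datarus surpasses similarly sized models and approaches larger ones, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens.

Insight: With its dual reasoning interface (agentic mode and reflection mode) and multi-step reasoning training, the model completes complex tasks efficiently and avoids redundant reasoning steps, showing analytical behavior closer to that of a human expert.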

Abstract: We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by and tags. On demanding postgraduate-level problems, Datarus exhibits an “AHA-moment” pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similarly sized models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.
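
The cosine curriculum mentioned in the abstract can be pictured with a small sketch. The exact schedule and the way the two reward signals are combined are not given in the abstract, so the half-cosine weight, the linear blend, and the function names below are assumptions for illustration.

```python
import math


def structure_weight(step, total_steps):
    """Half-cosine schedule: 1.0 at step 0, falling smoothly to 0.0 at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def blended_reward(structural, semantic, step, total_steps):
    """Early training emphasises the structural (tag-based) signal,
    late training emphasises the semantic (reward-model) signal."""
    w = structure_weight(step, total_steps)
    return w * structural + (1.0 - w) * semantic


# Weight on the structural signal at the start, midpoint, and end of training.
start = structure_weight(0, 1000)
middle = structure_weight(500, 1000)
end = structure_weight(1000, 1000)
```

Because the cosine is flat near both ends, the emphasis changes slowly at the start and finish of training and fastest in the middle, which avoids an abrupt switch between objectives.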

[71] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs cs.CL | cs.AI

Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang

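TL;DR: ProMed proposes a reinforcement learning framework guided by Shapley Information Gain that shifts medical LLMs from reactive answering to proactive questioning, significantly improving performance in interactive medical question answering.

Details

Motivation: Current medical LLMs excel at static question answering but lack the ability to ask questions proactively in interactive clinical consultations, which can lead to misdiagnosis. ProMed aims to address this gap.

Result: On partial-information medical benchmarks, ProMed outperforms existing methods by 6.29% on average and improves on the reactive paradigm by 54.45%, while remaining robust on out-of-domain cases.

Insight: Bringing Shapley values into interactive medical questioning to quantify the value of each question offers a new approach for similar tasks; the proactive questioning paradigm can substantially improve a model's clinical decision-making.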

Abstract: Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
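The abstract says contextual importance is estimated via Shapley values. As a reminder of what that quantity is, here is a minimal exact Shapley computation over a handful of patient facts. It is not the SIG reward itself: the fact names and the toy value function are invented for illustration, and exact enumeration is only practical for very small sets.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen: FrozenSet[str] = frozenset()
        for p in order:
            with_p = seen | {p}
            totals[p] += value(with_p) - value(seen)
            seen = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(facts: FrozenSet[str]) -> float:
    """Hypothetical diagnostic-confidence score for a set of known facts."""
    score = 0.0
    if "fever" in facts:
        score += 0.2
    if "cough" in facts:
        score += 0.1
    if "fever" in facts and "travel" in facts:
        score += 0.4  # interaction: travel history only matters with fever
    return score


values = shapley_values(["fever", "cough", "travel"], toy_value)
```

In this toy setting the interaction term is split evenly between "fever" and "travel", which is the kind of context-dependent credit a question-level reward would need.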

[72] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation cs.CL | cs.LG

Hassan Barmandah

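TL;DR: The paper LoRA-tunes ALLaM-7B-Instruct-preview to address the weak support for Saudi dialects (Najdi and Hijazi) in LLMs dominated by Modern Standard Arabic (MSA), improving dialect generation.

Details

Motivation: Although Arabic LLMs are advancing quickly, support for Saudi dialects such as Najdi and Hijazi remains limited, so models struggle to capture authentic dialectal variation. This study aims to fill that gap.

Result: The Dialect-Token model performs best, raising the Saudi dialect rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%. Text fidelity also improves (chrF++ +3.53, BERTScore +0.059).

Insight: An explicit dialect tag (Dialect-Token) improves both control and quality of dialect generation, while avoiding the metadata-tag echoing common in generic instruction models. The approach offers a scalable route for dialect handling.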

Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.

[73] AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings cs.CL

Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin

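TL;DR: AdaDocVQA is an adaptive framework for long-document visual question answering (Document VQA) in low-resource settings; it improves performance through hybrid text retrieval, intelligent data augmentation, and adaptive ensemble inference.

Details

Motivation: Long-document VQA in low-resource settings suffers from context limitations and insufficient training data, and needs an efficient solution.

Result: The framework achieves substantial gains on Japanese document VQA benchmarks, including 83.04% accuracy on Yes/No questions in JDocQA and 59% accuracy on the LAVA dataset.

Insight: The framework provides a scalable foundation for other low-resource languages and specialized domains, and ablation studies confirm the contribution of each component.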

Abstract: Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.

[74] CRISP: Persistent Concept Unlearning via Sparse Autoencoders cs.CL | I.2.7

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov

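TL;DR: CRISP is a parameter-efficient method based on sparse autoencoders (SAEs) for persistent concept unlearning; it outperforms existing methods while preserving the model's general capabilities.

Details

Motivation: As LLMs are widely deployed, selectively removing unwanted knowledge while preserving model utility has become critical. Existing SAE methods mostly intervene at inference time, do not persistently modify model parameters, and are easy to bypass.

Result: Experiments show that CRISP outperforms existing methods on the WMDP benchmark, removing harmful knowledge while preserving model capabilities. Feature analysis shows that CRISP achieves semantic separation between target and benign concepts.

Insight: By intervening at the feature level, CRISP gives precise control over model knowledge and offers a new approach to deploying LLMs safely.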

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model’s parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

[75] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions? cs.CL | cs.LG

Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim

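TL;DR: ViExam is a Vietnamese educational assessment benchmark of 2,548 multimodal questions that tests vision language models (VLMs) on a low-resource language (Vietnamese); most models perform below the human average.

Details

Motivation: To examine how VLMs perform on real multimodal educational tasks in a low-resource language (Vietnamese), filling a gap in existing research.

Result: State-of-the-art VLMs reach only 57.74% accuracy and open-source models 27.70%, both below the human average (66.54%). Human-in-the-loop collaboration adds only 5 percentage points.

Insight: 1) VLMs perform poorly on low-resource language tasks; 2) cross-lingual prompting brings no improvement; 3) human-in-the-loop collaboration may be a direction for improvement.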

Abstract: Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: https://vi-exam.github.io.

[76] Generics and Default Reasoning in Large Language Models cs.CL | cs.AI | cs.LO

James Ravi Kirkpatrick, Rachel Katharine Sterken

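TL;DR: The paper evaluates 28 large language models (LLMs) on 20 defeasible reasoning patterns involving generics. Frontier models do well in many cases, but performance varies widely across models and prompting styles. Chain-of-thought (CoT) prompting often degrades performance markedly, and most models struggle to distinguish defeasible from deductive inference.

Details

Motivation: Reasoning with generics (e.g., "Birds fly") is a core topic in linguistics, philosophy, and cognitive science because of its complex exception-permitting behaviour and its importance to default reasoning. The study evaluates how LLMs perform on such tasks.

Result: Frontier models perform well in the zero-shot condition (some above 75% accuracy), but CoT prompting lowers mean accuracy by 11.14% among those models. Models commonly confuse generics with universal statements.

Insight: Current LLMs show promise for default reasoning but have clear limits, notably performance degradation under more complex prompting and an incomplete grasp of the semantics of generics.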

Abstract: This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., ‘Birds fly’, ‘Ravens are black’) central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.

[77] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA cs.CL

Kaiwei Zhang, Qi Jia, Zijian Chen, Wei Sun, Xiangyang Zhu

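TL;DR: The paper studies sycophancy in large language models (LLMs) for scientific question answering (QA), that is, the tendency to side with the user's view at the expense of factual accuracy, and proposes an evaluation framework and a lightweight post-training method (Pressure-Tune) to mitigate it.

Details

Motivation: In domains that demand factual rigor, such as scientific QA, LLMs show sycophantic behaviour that can lead to wrong decisions in high-stakes settings. Preference-based alignment techniques may reinforce the problem, yet it remains underexamined. The authors aim to fill this gap.

Result: Experiments show that Pressure-Tune significantly improves resistance to sycophancy without sacrificing accuracy or responsiveness to valid feedback.

Insight: Alignment strategy is the main driver of sycophancy; future work should focus on balancing user preference against factual accuracy.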

Abstract: Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model’s ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.

[78] Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study cs.CL

Hanna Woloszyn, Benjamin Gagl

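TL;DR: The paper compares how large language models (LLMs) and children describe pictures. LLM-generated texts are longer but less lexically rich, and differ markedly from child language semantically. Few-shot prompting raises similarity slightly but still fails to reproduce child language patterns.

Details

Motivation: As LLMs are increasingly used in education, knowing whether they can produce child-like language matters for the development of educational tools and for psycholinguistic research.

Result: LLM-generated texts are longer than children's texts, less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic analysis shows substantial differences between the two, and few-shot prompting improves similarity only slightly.

Insight: LLMs currently cannot fully reproduce child language patterns, and prompt design has limited effect; their suitability for child-directed educational tools needs further study.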

Abstract: The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children’s descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompt increased similarities between children and LLM text to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.

[79] TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain cs.CL

Bohao Chu, Meijie Li, Sameh Frihat, Chengyu Gu, Georg Lodde

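TL;DR: TracSum is a new benchmark for traceable, aspect-based summarization in the medical domain. It pairs summaries with sentence-level citations so that their factual accuracy can be checked, and provides an evaluation framework and a baseline method.

Details

Motivation: Document summaries generated by LLMs may lack factual accuracy in the medical domain, so traceable summarization is needed to let users assess and trust the summaries.

Result: Experiments show that TracSum is an effective benchmark, that the Track-Then-Sum method improves generation accuracy, and that including the full context further improves completeness.

Insight: Explicit sentence-level tracking before summarization improves accuracy, while incorporating the full context improves completeness.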

Abstract: While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.

[80] Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding cs.CL | cs.HC | 68T50, 62F15, 62P25 | I.2.7; K.4.1; J.4

Maciej Skorski, Alina Landowska

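TL;DR: The paper uses a Bayesian approach to evaluate how well large language models understand moral values and finds that AI models outperform the average human annotator.

Details

Motivation: To examine how LLMs compare with humans on moral dimensions and to characterize their sensitivity and accuracy in moral detection.

Result: AI models typically rank among the top 25% of human annotators, achieve better-than-average balanced accuracy, and produce fewer false negatives in moral detection.

Insight: AI is more sensitive than humans in moral detection, especially in avoiding false negatives, which indicates its potential for moral understanding.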

Abstract: How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.

[81] Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs cs.CL | cs.AI

Juncheng Xie, Hung-yi Lee

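TL;DR: The paper proposes a prompt-based, one-shot method for exact length control. By adding countdown markers and counting rules, it lets an off-the-shelf LLM generate text of a specified length (in words or characters) without fine-tuning or iterative sampling.

Details

Motivation: Existing LLMs struggle to control output length precisely and often overshoot or undershoot the requested length. The paper uses prompt engineering to achieve exact length control, working around the model's inability to count reliably internally.

Result: Across several settings (open-ended generation, XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track), length compliance improves markedly (for example, from below 30% to above 95% on MT-Bench-LI), with answer quality preserved.

Insight: Prompt engineering can substantially improve LLM performance where precise control is required, offering an efficient solution that needs no additional training.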

Abstract: Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model “writes while counting.” We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
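The countdown idea can be sketched as a prompt builder plus a checker that verifies the markers and strips them from the model output. The marker syntax `<N>`, the prompt wording, and both function names are assumptions made for illustration; the abstract does not give the paper's actual prompt.

```python
import re
from typing import Tuple

# Hypothetical marker format: <N> placed before each word, counting down to <1>.
MARKER = re.compile(r"<(\d+)>")


def build_countdown_prompt(task: str, n_words: int) -> str:
    """Append counting rules to a task prompt (illustrative wording only)."""
    return (
        f"{task}\n"
        f"Write exactly {n_words} words. Before each word, emit a countdown "
        f"marker showing how many words remain, starting at <{n_words}> and "
        "ending at <1>. Stop immediately after the word that follows <1>."
    )


def check_and_strip(output: str, n_words: int) -> Tuple[bool, str]:
    """Return (length_ok, text_without_markers) for a model output."""
    markers = [int(m) for m in MARKER.findall(output)]
    words = MARKER.sub(" ", output).split()
    ok = markers == list(range(n_words, 0, -1)) and len(words) == n_words
    return ok, " ".join(words)


ok, text = check_and_strip("<3>The <2>cat <1>sleeps", 3)
```

The checker makes the "strict length compliance" criterion concrete: the output passes only if the word count matches exactly and the countdown is unbroken.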

[82] The illusion of a perfect metric: Why evaluating AI’s words is harder than it looks cs.CL | cs.AI

Maria Paz Oliva, Adriana Correia, Ivan Vankov, Viktor Botev

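TL;DR: The paper argues that no existing automatic evaluation metric (AEM) is a perfect solution for natural language generation (NLG), challenging the pursuit of a "perfect metric".

Details

Motivation: Although AEMs are widely used for NLG, existing metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices are not standardized, and their correlation with human judgment is inconsistent.

Result: The study finds that no existing AEM is a universal solution, and that the same limitations apply to LLM-as-a-Judge and to the evaluation of Retrieval Augmented Generation (RAG).

Insight: The paper recommends choosing complementary metrics according to task needs and suggests that future work focus on better validation methodology rather than on chasing a "perfect metric".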

Abstract: Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the ‘perfect metric’. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.

[83] MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models cs.CL | cs.CV

Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu

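TL;DR: MME-SCI is a comprehensive and challenging science benchmark for multimodal large language models (MLLMs), covering multiple languages, modalities, and subjects; experiments show it is markedly harder than existing benchmarks.

Details

Motivation: Current science benchmarks fall short in evaluating multilingual reasoning, in modality coverage, and in fine-grained annotation of scientific knowledge points, so a more comprehensive evaluation tool is needed.

Result: Experiments show that MME-SCI is considerably harder (for example, o4-mini reaches only 24.73%-52.11% accuracy in the Image-only mode) and reveal model weaknesses in specific domains.

Insight: MME-SCI highlights the importance of multilingual coverage and fine-grained knowledge annotation, pointing to directions for improving MLLMs in science.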

Abstract: Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models’ reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs’ comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI’s multilingual and fine-grained knowledge attributes, we analyzed existing models’ performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.

[84] Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization cs.CL | cs.AI

Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan

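TL;DR: The paper proposes LongMab-PO, a framework that uses a multi-armed bandit (MAB) strategy to select the most informative chunks of a long context, samples high-quality and diverse responses from them, and trains the LLM with Direct Preference Optimization (DPO), improving the diversity and quality of preference data pairs.

Details

Motivation: When LLMs are fine-tuned on synthetic data, the generated data often lacks diversity and contains factual inconsistencies, which limits gains in long-context capability.

Result: Experiments show that LongMab-PO significantly improves the diversity and quality of preference data pairs and performs strongly on long-context reasoning tasks.

Insight: Dynamically selecting context chunks and updating their scores from reward feedback addresses the low diversity of synthetic data and improves LLM performance on long-context tasks.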

Abstract: Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on https://github.com/NEUIR/LongMab-PO.
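Treating chunks as bandit arms can be illustrated with a small rollout loop. The sketch below uses UCB1 as the selection rule, which is a swapped-in textbook choice: the abstract only says chunks are selected by expected reward scores that are updated from feedback. The reward function stands in for "generate a response from this chunk and score it" and is entirely hypothetical.

```python
import math
from typing import Callable, List, Tuple


def ucb_rollout(chunks: List[str],
                reward_fn: Callable[[str], float],
                rounds: int,
                c: float = 1.0) -> Tuple[List[float], List[int]]:
    """Run a UCB1 rollout over context chunks.

    Returns the running mean reward and pull count of each chunk.
    """
    counts = [0] * len(chunks)
    means = [0.0] * len(chunks)
    for t in range(1, rounds + 1):
        untried = [i for i, n in enumerate(counts) if n == 0]
        if untried:
            arm = untried[0]  # pull every arm once before using UCB scores
        else:
            arm = max(
                range(len(chunks)),
                key=lambda i: means[i] + c * math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = reward_fn(chunks[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return means, counts


# Toy reward: 1.0 if the chunk contains the evidence, else 0.0 (assumption).
toy_chunks = ["noise a", "the answer is 42", "noise b"]
means, counts = ucb_rollout(toy_chunks, lambda ch: 1.0 if "answer" in ch else 0.0, 30)
```

The exploration bonus keeps low-scoring chunks from being dropped too early, while most of the sampling budget goes to the chunk that keeps paying off.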

[85] Ask Good Questions for Large Language Models cs.CL | cs.AI

Qi Wu, Zhongqi Lu

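TL;DR: The paper proposes the Ask-Good-Question (AGQ) framework, which uses an improved Concept-Enhanced Item Response Theory (CEIRT) model to identify users' knowledge levels and an LLM to generate guiding questions, significantly improving information retrieval efficiency.

Details

Motivation: Current LLMs perform well in dialog systems but fail to guide the topic accurately, particularly when it comes to recognizing users' confusion about related concepts.

Result: Experiments show that AGQ outperforms other baseline methods and significantly improves information retrieval efficiency.

Insight: Combining psychometric theory (CEIRT) with LLMs makes it possible to identify user needs more precisely and generate effective guidance, offering a new approach to dialog system design.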

Abstract: Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to guide the topic accurately because they cannot discern users' confusion about related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users’ knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question-and-answer process. In comparisons with other baseline methods, our approach performs better, significantly enhancing users’ information retrieval experience.

[86] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR cs.CL

Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu

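TL;DR: The paper proposes SvS (Self-play with Variational problem Synthesis), a strategy for Reinforcement Learning with Verifiable Rewards (RLVR) that synthesizes variational problems to maintain policy entropy, substantially improving Pass@k.

Details

Motivation: Standard RLVR training improves Pass@1 at the cost of policy entropy, which reduces generation diversity and limits Pass@k. The paper aims to solve this problem.

Result: On the AIME24 and AIME25 benchmarks, Pass@32 improves by 18.3% and 22.8% in absolute terms. Experiments on 12 reasoning benchmarks across model sizes confirm the method's generalizability and robustness.

Insight: Dynamically updating the training problems prevents policy entropy from collapsing, which improves the model's diversity and reasoning ability.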

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
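For readers unfamiliar with Pass@k, the commonly used unbiased estimator computes, from n sampled solutions of which c are correct, the probability that at least one of k randomly drawn samples is correct: 1 - C(n-c, k) / C(n, k). The sketch below implements that standard formula; whether the paper's evaluation uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Probability that a random size-k subset of the n samples contains
    at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to c / n, which is why a policy can raise Pass@1 while Pass@32 stalls: concentrating probability on one solution path helps the former but not the latter.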

cs.CY [Back]

Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto

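TL;DR: The paper presents a new multimodal dataset for detecting and measuring social bias against people experiencing homelessness, and uses large language models (LLMs) to automate detection. It also compares the classification performance of local and closed-source LLMs.

Details

Motivation: Homelessness is a global social challenge, and social bias is a barrier to addressing it. By analyzing bias in online and public discourse, the study aims to raise public awareness and inform policymaking.

Result: 1. Local LLMs are inconsistent in zero-shot classification, but with in-context learning their performance approaches that of closed-source LLMs. 2. LLMs outperform BERT overall.

Insight: LLMs show promise for detecting social bias, but zero-shot classification still needs improvement. The multimodal dataset provides a new tool for studying bias against people experiencing homelessness.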

Abstract: Homelessness is a persistent social challenge, impacting millions worldwide. Over 770,000 people experienced homelessness in the U.S. in 2024. Social stigmatization is a significant barrier to alleviation, shifting public perception, and influencing policymaking. Given that online and city council discourse reflect and influence part of public opinion, they provide valuable insights to identify and track social biases. This research contributes to alleviating homelessness by acting on public opinion. It introduces novel methods, building on natural language processing (NLP) and large language models (LLMs), to identify and measure social bias against people experiencing homelessness (PEH) expressed in digital spaces. We present a new, manually-annotated multi-modal dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across 10 U.S. cities. This unique dataset provides evidence of the typologies of homelessness bias described in the literature. In order to scale up and automate the detection of homelessness bias online, we evaluate LLMs as classifiers. We applied both zero-shot and few-shot classification techniques to this data. We utilized local LLMs (Llama 3.2 3B Instruct, Qwen 2.5 7B Instruct, and Phi4 Instruct Mini) as well as closed-source API models (GPT-4.1, Gemini 2.5 Pro, and Grok-4). Our findings reveal that although there are significant inconsistencies in local LLM zero-shot classification, the in-context learning classification scores of local LLMs approach the classification scores of closed-source LLMs. Furthermore, LLMs outperform BERT when averaging across all categories. This work aims to raise awareness about the pervasive bias against PEH, develop new indicators to inform policy, and ultimately enhance the fairness and ethical application of Generative AI technologies.

cs.LG [Back]

[88] Input Time Scaling cs.LG | cs.AI | cs.CL

Rapheal Huang, Weilong Guo

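TL;DR: The paper proposes a new scaling paradigm, Input Time Scaling, which uses LLM meta-knowledge to refine inputs during both training and testing. It finds that training-testing co-design is essential and shows that seemingly low-quality data and high performance can coexist.

Details

Motivation: Current large language models (LLMs) typically scale data and training on one side and inference-time compute on the other, while the potential of optimizing input quality remains underexplored.

Result: Models trained from Qwen2.5-32B-Instruct reach SOTA performance among 32B models (AIME24: 76.7%, AIME25: 76.7%), with majority voting raising AIME25 to 80%.

Insight: 1. Training and testing strategies must be designed together; 2. higher dataset quality is not always better; 3. a small amount of data may be enough to elicit a model's reasoning ability.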

Abstract: Current Large Language Models (LLMs) are usually post-trained on large-scale, carefully curated datasets (data and training scaling) and perform reasoning at test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design: query strategies need to be applied during both training and testing, and applying them only during training or only during testing seriously degrades performance. We are also surprised to find that datasets of seemingly low quality can achieve high performance. Adding irrelevant information to the queries, or randomly selecting examples from a minimally filtered dataset, can even perform best. These findings contradict the widely held inductive bias, “garbage in, garbage out”. Curating datasets with seemingly high-quality data can even limit the performance ceiling. In addition, models trained on more data of similar quality (15k vs. 1k) perform worse, so simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon: a small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we reach SOTA performance among 32B models on AIME24 (76.7%) and AIME25 (76.7%) pass@1. We further achieve AIME24 (76.7%) and AIME25 (80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result is 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-sourcing our datasets, data pipelines, evaluation results, and checkpoints.
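
The abstract reports a gain from a majority vote of three models but does not say how ties are broken. The following is a minimal sketch of per-problem majority voting, assuming each model emits one final answer string; the function name, the tie-breaking rule (earliest answer wins), and the sample answers are illustrative assumptions, not details from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest answer in the list."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in the original order so that ties resolve deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical final answers from three models for a single problem.
print(majority_vote(["204", "204", "113"]))  # prints 204
```

With three voters, a vote only changes the outcome on problems where at least two models agree on an answer that a single model would have missed.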

[89] BERT-VQA: Visual Question Answering on Plots cs.LG | cs.CV

Tai Vu, Robert Yang

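TL;DR: The paper studies visual question answering on plots and proposes BERT-VQA, a VisualBERT-based architecture with a ResNet 101 image encoder, to test whether the cross-modality module is effective. The experiments disprove the core hypothesis, but they offer useful insight into the difficulty of plot question answering and into model selection.

Details

Motivation: Visual question answering (VQA) sits at the intersection of natural language understanding and vision, yet VQA on plots remains underexplored. The authors aim to address this subtask with BERT-VQA and to test the importance of the cross-modality module.

Result: The experiments disproved the core hypothesis: the cross-modality module turned out not to be essential for plot question answering. BERT-VQA did not outperform the baseline, which points to the complexity of plot question answering and the difficulty of model design.

Insight: Plot question answering needs finer-grained cross-modal alignment, and general-purpose models such as VisualBERT may not transfer directly. Future work should explore more targeted architectures and feature extraction methods.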

Abstract: Visual question answering has been an exciting challenge in the field of natural language understanding, as it requires deep learning models to exchange information from both vision and language domains. In this project, we aim to tackle a subtask of this problem, namely visual question answering on plots. To achieve this, we developed BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of an LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases. Therefore, our work provided valuable insights into the difficulty of the plot question answering challenge as well as the appropriateness of different model architectures in solving this problem.

[90] RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning cs.LG | cs.CV

Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang

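TL;DR: RISE is a two-stage framework that strengthens the image annotation ability of vision-language models (VLMs) through self-supervised reasoning, significantly improving performance and explainability on complex tasks.

Details

Motivation: VLMs perform poorly on complex image annotation tasks such as emotion classification and context-driven object detection, because standard supervised fine-tuning (SFT) and visual reinforcement fine-tuning (Visual-RFT) fail to produce high-quality, logically consistent chains of thought (CoTs).

Result: On both complex and simple image annotation tasks, the RISE-trained Qwen2-VL-2B model outperforms SFT and Visual-RFT, with stronger performance and better explainability.

Insight: RISE offers a self-supervised solution that needs no manually annotated CoTs, suggesting a new route to improving VLM reasoning.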

Abstract: Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven “annotation-reasoning-annotation” closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.

[91] Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification cs.LG | cs.CV

Ruobing Jiang, Mengzhe Liu, Haobing Liu, Yanwei Yu

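TL;DR: The paper proposes HCAL, a model that uses prototype contrastive learning and an adaptive task-weighting mechanism to address structural consistency and task balancing in hierarchical multi-label classification (HMC), achieving higher classification accuracy and better hierarchical consistency.

Details

Motivation: HMC struggles to maintain structural consistency and to balance losses in multi-task learning; traditional approaches are prone to optimization bias.

Result: Experiments on three datasets show that HCAL outperforms baseline models in both classification accuracy and hierarchical consistency.

Insight: Adaptive task weighting and prototype perturbation are key to improving HMC performance, and prototype contrastive learning effectively preserves semantic consistency.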

Abstract: Hierarchical Multi-Label Classification (HMC) faces critical challenges in maintaining structural consistency and balancing loss weighting in Multi-Task Learning (MTL). To address these issues, we propose a classifier called HCAL, based on MTL integrated with prototype contrastive learning and adaptive task-weighting mechanisms. The most significant advantage of our classifier is semantic consistency, achieved through both prototypes that explicitly model labels and feature aggregation from child classes to parent classes. The other important advantage is an adaptive loss-weighting mechanism that dynamically allocates optimization resources by monitoring task-specific convergence rates. It effectively resolves the “one-strong-many-weak” optimization bias inherent in traditional MTL approaches. To further enhance robustness, a prototype perturbation mechanism is formulated by injecting controlled noise into prototypes to expand decision boundaries. Additionally, we formalize a quantitative metric called Hierarchical Violation Rate (HVR) to evaluate hierarchical consistency and generalization. Extensive experiments across three datasets demonstrate both the higher classification accuracy and the reduced hierarchical violation rate of the proposed classifier over baseline models.
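
The abstract names the Hierarchical Violation Rate (HVR) but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: the share of samples whose predicted label set contains a child class without its parent. The function name, the `parent` mapping, and the toy labels are hypothetical and are not taken from the paper.

```python
def hierarchical_violation_rate(predictions, parent):
    """Fraction of samples with at least one predicted label whose parent is missing.

    predictions: list of sets of predicted labels, one set per sample.
    parent: dict mapping each label to its parent label (None for root classes).
    NOTE: this definition is an assumption made for illustration.
    """
    if not predictions:
        return 0.0
    violations = 0
    for labels in predictions:
        label_set = set(labels)
        for label in label_set:
            p = parent.get(label)
            if p is not None and p not in label_set:
                violations += 1
                break  # count each sample at most once
    return violations / len(predictions)


# Toy two-level hierarchy (hypothetical).
parent = {"animal": None, "dog": "animal", "cat": "animal"}
preds = [{"animal", "dog"}, {"cat"}, {"animal"}, {"dog", "cat"}]
print(hierarchical_violation_rate(preds, parent))  # prints 0.5
```

Under this reading, a lower value means the classifier's outputs respect the label hierarchy more often.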

cs.AR [Back]

[92] White-Box Reasoning: Synergizing LLM Strategy and gm/Id Data for Automated Analog Circuit Design cs.AR | cs.AI | cs.CL

Jianqiu Chen, Siqi Li, Xu He

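TL;DR: The paper proposes a synergistic reasoning framework that combines the strategic reasoning of large language models (LLMs) with the gm/Id methodology for automated analog circuit design, significantly improving efficiency and design quality.

Details

Motivation: Analog IC design relies on experience and inefficient simulation, traditional formulas fail at advanced nodes, and applying LLMs directly risks "guessing" without engineering principles.

Result: Validation shows that the framework enabled the Gemini model to meet all TT corner specifications within 5 iterations and extended optimization to all PVT corners, with an order-of-magnitude improvement in efficiency.

Insight: Combining the reasoning ability of LLMs with scientific circuit design methodology opens a new path toward true analog design automation.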

Abstract: Analog IC design is a bottleneck due to its reliance on experience and inefficient simulations, as traditional formulas fail in advanced nodes. Applying Large Language Models (LLMs) directly to this problem risks mere “guessing” without engineering principles. We present a “synergistic reasoning” framework that integrates an LLM’s strategic reasoning with the physical precision of the gm/Id methodology. By empowering the LLM with gm/Id lookup tables, it becomes a quantitative, data-driven design partner. We validated this on a two-stage op-amp, where our framework enabled the Gemini model to meet all TT corner specs in 5 iterations and extended optimization to all PVT corners. A crucial ablation study proved gm/Id data is key for this efficiency and precision; without it, the LLM is slower and deviates. Compared to a senior engineer’s design, our framework achieves quasi-expert quality with an order-of-magnitude improvement in efficiency. This work validates a path for true analog design automation by combining LLM reasoning with scientific circuit design methodologies.

cs.RO [Back]

[93] A Surveillance Based Interactive Robot cs.RO | cs.AI | cs.CV | I.2.9; I.2.10; I.2.7

Kshitij Kavimandan, Pooja Mangal, Devanshi Mehta

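TL;DR: The paper builds an interactive mobile surveillance robot that streams video in real time and responds to voice commands; users monitor and control it from a phone or browser. The system is built on Raspberry Pi 4 and existing open-source technology.

Details

Motivation: To meet the need for remote monitoring and interaction, the authors propose a low-cost, easily reproducible mobile surveillance robot that combines real-time video streaming, voice interaction, and object detection.

Result: In indoor tests the robot detects common objects at interactive frame rates, reliably recognizes voice commands, and translates them into actions without manual control.

Insight: The design shows what existing technology and low-cost hardware can do, and points to future improvements such as sensor fusion and GPU acceleration.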

Abstract: We build a mobile surveillance robot that streams video in real time and responds to speech so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential drive base with camera, mic, and speaker, and a central unit that serves the live feed and runs perception. Video is sent with FFmpeg. Objects in the scene are detected using YOLOv3 to support navigation and event awareness. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech, so the robot can take spoken commands and read back responses in the requested language. A Kinect RGB-D sensor provides visual input and obstacle cues. In indoor tests the robot detects common objects at interactive frame rates on CPU, recognises commands reliably, and translates them to actions without manual control. The design relies on off-the-shelf hardware and open software, making it easy to reproduce. We discuss limits and practical extensions, including sensor fusion with ultrasonic range data, GPU acceleration, and adding face and text recognition.

[94] ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments cs.RO | cs.CV

Jingwen Yu, Jiayi Yang, Anjun Hu, Jiankun Wang, Ping Tan

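TL;DR: ROVER is a loop closure verification method for repetitive environments that uses the trajectory as a prior; by exploiting the robot's spatial-temporal motion cues, it significantly improves the robustness and accuracy of loop detection.

Details

Motivation: In repetitive environments, appearance-based loop closure detection fails easily and produces false detections. Existing methods ignore the prior knowledge in the robot's trajectory, so a more robust verification method is needed.

Result: Experiments show that ROVER significantly reduces false detections in repetitive environments and integrates successfully into state-of-the-art SLAM systems.

Insight: Spatial-temporal motion cues such as the trajectory can significantly improve the robustness of loop closure verification, especially in environments with high visual similarity.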

Abstract: Loop closure detection is important for simultaneous localization and mapping (SLAM), which associates current observations with historical keyframes, achieving drift correction and global relocalization. However, a falsely detected loop can be fatal, and this is especially difficult in repetitive environments where appearance-based features fail due to the high similarity. Therefore, verification of a loop closure is a critical step in avoiding false positive detections. Existing works in loop closure verification predominantly focus on learning invariant appearance features, neglecting the prior knowledge of the robot’s spatial-temporal motion cue, i.e., trajectory. In this letter, we propose ROVER, a loop closure verification method that leverages the historical trajectory as a prior constraint to reject false loops in challenging repetitive environments. For each loop candidate, it is first used to estimate the robot trajectory with pose-graph optimization. This trajectory is then submitted to a scoring scheme that assesses its compliance with the trajectory without the loop, which we refer to as the trajectory prior, to determine if the loop candidate should be accepted. Benchmark comparisons and real-world experiments demonstrate the effectiveness of the proposed method. Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify its robustness and efficiency. Our source code and self-collected dataset are available at https://github.com/jarvisyjw/ROVER.

[95] MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence cs.RO | cs.AI | cs.CV

Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong

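TL;DR: MimicFunc is a framework for imitating tool manipulation skills from a single human video. It establishes function-level correspondence through a function frame, addressing the generalization problem caused by geometric differences between tools.

Details

Motivation: Humans can imitate tool manipulation after watching a task once and generalize to other tools, whereas current robots struggle to do so. This work targets the generalization problem caused by geometric variation among functionally similar tools.

Result: Experiments show that MimicFunc generalizes from a single human RGB-D video to manipulation tasks with novel tools and generates rollouts that can be used for policy training.

Insight: Function-level correspondence is key to generalizing tool manipulation, and keypoint-based abstraction provides an effective geometry-agnostic representation.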

Abstract: Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc’s one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.

[96] Multimodal Data Storage and Retrieval for Embodied AI: A Survey cs.RO | cs.CV

Yihao Lu, Hao Tang

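TL;DR: This survey reviews multimodal data storage and retrieval for embodied AI (EAI). It systematically evaluates five storage architectures and five retrieval paradigms, reveals a tension between semantic coherence and real-time responsiveness, and proposes future research directions.

Details

Motivation: EAI agents continuously interact with the physical world and generate vast, heterogeneous multimodal data streams that traditional management systems cannot handle well, so storage and retrieval methods suited to EAI are needed.

Result: The survey identifies bottlenecks ranging from the Physical Grounding Gap to systemic challenges in cross-modal integration, and proposes directions such as standardized benchmarking and adaptive optimization.

Insight: Data management for EAI must balance long-term semantic coherence with real-time responsiveness; future research should focus on physics-aware models and dynamically adaptive optimization.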

Abstract: Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI’s core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.

[97] Augmenting cobots for sheet-metal SMEs with 3D object recognition and localisation cs.RO | cs.CV

Martijn Cramer, Yanming Wu, David De Schepper, Eric Demeester

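TL;DR: The paper explores the potential of augmenting collaborative robots (cobots) with 3D object recognition and localisation in sheet-metal SMEs, targeting automation in high-mix, low-volume production.

Details

Motivation: Because of high-mix, low-volume production, standard automation solutions fall short for SMEs, which leads to repetitive manual labour and underused technically skilled staff.

Result: The paper presents the opportunities and challenges of augmenting cobot systems in an industrial setting and demonstrates the feasibility of the technology through a real-world case.

Insight: 3D object recognition and localisation can significantly improve the suitability of cobots for flexible production tasks, offering SMEs a new direction for automation.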

Abstract: Due to high-mix-low-volume production, sheet-metal workshops today are challenged by small series and varying orders. As standard automation solutions tend to fall short, SMEs resort to repetitive manual labour, impacting production costs and leading to tech-skilled workforces not being used to their full potential. The COOCK+ ROBUST project aims to transform cobots into mobile and reconfigurable production assistants by integrating existing technologies, including 3D object recognition and localisation. This article explores both the opportunities and challenges of enhancing cobotic systems with these technologies in an industrial setting, outlining the key steps involved in the process. Additionally, insights from a past project, carried out by the ACRO research unit in collaboration with an industrial partner, serve as a concrete implementation example throughout.

cs.IR [Back]

[98] TaoSR1: The Thinking Model for E-commerce Relevance Search cs.IR | cs.AI | cs.CL

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin

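TL;DR: The paper proposes TaoSR1, a framework for directly deploying large language models (LLMs) for query-product relevance prediction in e-commerce search, using a staged approach to address complex reasoning, generation quality, and deployment feasibility.

Details

Motivation: Existing BERT models excel at semantic matching but lack complex reasoning, and most current LLM work still relies on discriminative fine-tuning or distillation, which does not exploit the potential of deploying LLMs directly. The paper targets three key challenges: CoT error accumulation, discriminative hallucination, and deployment feasibility.

Result: TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online human evaluation, establishing a new paradigm for applying CoT reasoning to relevance classification.

Insight: Through staged optimization and dynamic sampling, TaoSR1 demonstrates the potential of LLMs in e-commerce search, especially the feasibility of direct deployment for complex reasoning tasks. Post-CoT processing and the partitioning method offer a new approach to efficient online deployment.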

Abstract: Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
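
Stage (3) relies on GRPO, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative normalization, assuming scalar rewards. The function name, the `eps` constant, and the reward values are illustrative; TaoSR1's actual reward design and its difficulty-based sampling are not described in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same query-product pair:
# 1.0 if the predicted relevance label was correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # prints [1.0, -1.0, -1.0, 1.0]
```

When every response in a group receives the same reward, all advantages are zero and the group carries no learning signal, which is one common argument for sampling prompts of intermediate difficulty.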

eess.SY [Back]

[99] Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations eess.SY | cs.CV | cs.SY

Jan Krejčí, Oliver Kost, Yuxuan Xia, Lennart Svensson, Ondřej Straka

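TL;DR: The paper studies model-based multi-object visual tracking with the standard point-object (SPO) model, computing the posterior density with the Poisson multi-Bernoulli mixture (PMBM) filter. It finds a mismatch between the SPO model and the data and points to directions for future improvement.

Details

Motivation: The work applies multi-object tracking methods from the radar tracking community to pedestrian tracking with 2D bounding box detections and evaluates how the standard model performs in practice.

Result: Experiments reveal a mismatch between the SPO model and the data; although the PMBM algorithm performs reasonably well, the model leaves considerable room for improvement.

Insight: Model-based methods need further adjustment to resolve the model-data mismatch, which gives direction for future algorithm development.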

Abstract: This paper uses multi-object tracking methods known from the radar tracking community to address the problem of pedestrian tracking using 2D bounding box detections. The standard point-object (SPO) model is adopted, and the posterior density is computed using the Poisson multi-Bernoulli mixture (PMBM) filter. The selection of the model parameters rooted in continuous time is discussed, including the birth and survival probabilities. Some parameters are selected from first principles, while others are identified from the data, which is, in this case, the publicly available MOT-17 dataset. Although the resulting PMBM algorithm yields promising results, a mismatch between the SPO model and the data is revealed. The model-based approach assumes that modifying the problematic components causing the SPO model-data mismatch will lead to better model-based algorithms in future developments.

cs.GR [Back]

[100] Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing cs.GR | cs.CV

Feng-Lin Liu, Shi-Yang Li, Yan-Pei Cao, Hongbo Fu, Lin Gao

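TL;DR: The paper proposes Sketch3DVE, a sketch-based 3D-aware video editing method that addresses the challenge of editing 3D scene structure in videos, in particular content consistency and geometry control under large viewpoint changes.

Details

Motivation: Existing video editing methods work well for style transfer or appearance modification, but editing 3D scene structure faces challenges such as large viewpoint changes, content consistency, and translating 2D inputs into 3D outputs.

Result: Experiments show that Sketch3DVE outperforms existing methods when editing videos with significant viewpoint changes and produces high-quality, consistent results.

Insight: Sketching as an interaction tool handles geometry control effectively, and using 3D information markedly improves the consistency of edited content across viewpoint changes.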

Abstract: Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing. Homepage and code: http://geometrylearning.com/Sketch3DVE/

eess.IV [Back]

[101] Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology eess.IV | cs.CV

Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li

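TL;DR: The paper evaluates GPT-5 and its variants on zero-shot multimodal medical reasoning. GPT-5 significantly outperforms GPT-4o on radiology and radiation oncology tasks, showing its potential in the medical domain.

Details

Motivation: To evaluate GPT-5 on multimodal medical reasoning that combines images, text, and quantitative data, and so gauge its reliability for high-stakes medical applications.

Result: GPT-5 achieves the highest accuracy on all tasks, with gains over GPT-4o of up to +20.00% on questions about challenging anatomical regions and 90.7% accuracy on the physics exam questions (versus 78.0% for GPT-4o).

Insight: GPT-5's strong performance in multimodal medical reasoning suggests it could play an important role in expert medical workflows, especially in high-stakes decision making.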

Abstract: Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o of up to +20.00% in challenging anatomical regions such as the chest-mediastinal region, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
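
The board-exam figures can be checked with simple arithmetic. The sketch below recomputes the reported 90.7% from 136/150 and, assuming GPT-4o was scored on the same 150 questions, converts its 78.0% into a question count. The count of 117 is derived here and is not stated in the abstract; the helper name is illustrative.

```python
def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


TOTAL_QUESTIONS = 150  # size of the board-style question set, from the abstract

print(accuracy_pct(136, TOTAL_QUESTIONS))  # prints 90.7 (GPT-5, 136/150)

# Derived, not reported: 78.0% of 150 questions is 117 correct answers.
gpt4o_correct = round(0.780 * TOTAL_QUESTIONS)
print(gpt4o_correct, accuracy_pct(gpt4o_correct, TOTAL_QUESTIONS))  # prints 117 78.0
print(136 - gpt4o_correct)  # prints 19 (questions separating the two models)
```

On a 150-question set each question is worth about 0.67 percentage points, so the 12.7-point gap corresponds to 19 questions under this assumption.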

[102] PediDemi – A Pediatric Demyelinating Lesion Segmentation Dataset eess.IV | cs.CV

Maria Popa, Gabriela Adriana Visa

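TL;DR: The paper introduces PediDemi, the first publicly available dataset for pediatric demyelinating lesion segmentation, filling a gap in pediatric data, and evaluates a model trained on an existing MS dataset on it.

Details

Motivation: Publicly available datasets mainly cover adult multiple sclerosis (MS); the lack of pediatric demyelinating lesion data limits research and model development.

Result: Existing models show limited performance on the pediatric dataset, underscoring the importance of diverse datasets.

Insight: Pediatric demyelinating lesion data differ from adult MS data, so dedicated datasets and model improvements are needed to raise segmentation accuracy.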

Abstract: Demyelinating disorders of the central nervous system may have multiple causes, the most common being infections, autoimmune responses, and genetic or vascular etiology. Demyelination lesions are characterized by areas where the myelin sheath of the nerve fibers is broken or destroyed. Among these disorders, Multiple Sclerosis (MS) is the most well-known and aggressive form. Acute Disseminated Encephalomyelitis (ADEM) is another type of demyelinating disease, typically with a better prognosis. Magnetic Resonance Imaging (MRI) is widely used for diagnosing and monitoring disease progression by detecting lesions. While both adults and children can be affected, there is a significant lack of publicly available datasets for pediatric cases and demyelinating disorders beyond MS. This study introduces, for the first time, a publicly available pediatric dataset for demyelinating lesion segmentation. The dataset comprises MRI scans from 13 pediatric patients diagnosed with demyelinating disorders, including 3 with ADEM. In addition to lesion segmentation masks, the dataset includes extensive patient metadata, such as diagnosis, treatment, personal medical background, and laboratory results. To assess the quality of the dataset and demonstrate its relevance, we evaluate a state-of-the-art lesion segmentation model trained on an existing MS dataset. The results underscore the importance of diverse datasets.

[103] InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting eess.IV | cs.CV

Shuxin Liang, Yihan Xiao, Wenlu Tang

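TL;DR: The paper proposes InnerGS, which models the volumetric density of internal scenes directly with factorized 3D Gaussians and reconstructs complex internal structures from sparse sliced data without camera poses.

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) methods mainly model external surfaces, but reconstructing internal scenes is essential in applications that need a deep understanding of an object's interior, such as medical imaging or industrial inspection.

Result: Experiments show that InnerGS efficiently reconstructs complex internal scenes from sparse data and outperforms existing methods.

Insight: The work shows the potential of modeling internal scenes through volumetric density, opening a new technical direction for applications such as medical imaging and industrial inspection.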

Abstract: 3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object’s interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modality. We provide a CUDA implementation at: https://github.com/Shuxin-Liang/InnerGS.

[104] Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration eess.IV | cs.CV

Tiago Assis, Ines P. Machado, Benjamin Zwick, Nuno C. Garcia, Reuben Dorent

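TL;DR: The paper proposes a biomechanically guided deep learning method that generates dense, physically plausible brain deformation fields from sparse matched keypoints and significantly outperforms classical geometric interpolation.

Details

Motivation: Accurate brain shift compensation is critical for reliable neuronavigation during neurosurgery. Keypoint-based registration methods are robust to large deformations and topological changes, but they usually rely on simple geometric interpolators that ignore tissue biomechanics.

Result: Experiments show that the method halves the mean squared error relative to classical interpolators, with negligible computational overhead at inference time.

Insight: Deep learning combined with biomechanical priors can significantly improve registration accuracy, offering a new solution for intraoperative brain deformation compensation.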

Abstract: Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. Then, a residual 3D U-Net is trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, reducing by half the mean square error while introducing negligible computational overhead at inference time. Code available at: https://github.com/tiago-assis/Deep-Biomechanical-Interpolator.

[105] Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction eess.IV | cs.CV

Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo

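TL;DR: The paper proposes CaLID, a diffusion-based method for cardiac volume reconstruction. With a data-driven interpolation scheme, efficient latent-space computation, and only sparse 2D CMR images as input, it significantly improves reconstruction accuracy and efficiency.

Details

Motivation: Because conventional 2D short-axis slices are sparsely sampled, cardiac magnetic resonance (CMR) imaging often lacks complete volumetric information; existing methods depend on predefined interpolation schemes or additional semantic inputs and struggle to deliver efficient, accurate 3D reconstruction.

Result: The method performs well in volumetric evaluation and downstream segmentation, surpasses baselines in reconstruction quality and efficiency, and achieves spatiotemporally consistent cardiac reconstruction.

Insight: The results further support the potential of diffusion models for medical image reconstruction; operating in the latent space significantly improves computational efficiency and yields a more practical clinical solution.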

Abstract: Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which captures complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatial and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
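
For context, the predefined schemes that the abstract contrasts with its learned interpolation (linear and spherical) can be written in a few lines. The sketch below shows both on plain Python lists standing in for latent vectors of two neighbouring slices. It illustrates the baselines only, not CaLID itself; the function names and toy vectors are assumptions.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between vectors a and b at fraction t in [0, 1]."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t):
    """Spherical linear interpolation; assumes a and b are non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or anti-parallel vectors: fall back to linear interpolation.
        return lerp(a, b, t)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Toy 2D "latents" for two adjacent slices (hypothetical values).
z0, z1 = [1.0, 0.0], [0.0, 1.0]
print(lerp(z0, z1, 0.5))   # midpoint on the straight line, norm below 1
print(slerp(z0, z1, 0.5))  # midpoint on the arc, roughly [0.707, 0.707], norm 1
```

Both schemes fix the path between slices in advance, which is the limitation a data-driven interpolation aims to remove.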

[106] A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler eess.IV | cs.AI | cs.CV

Wenxuan Zhang, Shuai Li, Xinyi Wang, Yu Sun, Hongyu Kang

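TL;DR: Proposes a real-time brain vessel segmentation system for TCCD data, built on an Attention-Augmented Wavelet YOLO (AAW-YOLO) network, that reduces reliance on operator experience and delivers strong segmentation results.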

Details

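Motivation: TCCD imaging for cerebrovascular assessment is limited by its heavy dependence on operator skill, so AI-assisted automatic segmentation is needed to improve efficiency and accessibility.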

Result: Dice score of 0.901, IoU of 0.823, and an inference speed of 14.199 ms per frame, outperforming traditional methods.

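Insight: AI-assisted TCCD can substantially lower the barrier to operation and suits resource-constrained settings; future work can extend to bilateral modeling and larger-scale validation.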

Abstract: The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
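
The Dice and IoU figures reported above are overlap metrics between a predicted mask and a ground-truth mask. The sketch below shows how they are typically computed for a single pair of binary masks; it is an illustration, not the authors' evaluation code. The function name `dice_and_iou`, the flat 0/1 list representation, and the convention of scoring two empty masks as a perfect match are all assumptions.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks (0/1 sequences)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks are empty: treated here as a perfect match (assumption).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Averages taken over many frames or instances need not satisfy this identity exactly, which is one reason papers report both.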

[107] Learning to See Through Flare eess.IV | cs.CV

Xiaopeng Peng, Heath Gemar, Erin Fleet, Kyle Novak, Abbie Watnik

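TL;DR: NeuSee is a new computational imaging framework that jointly learns a diffractive optical element (DOE) and a frequency-space Mamba-GAN network to achieve high-fidelity sensor protection and image restoration, suppressing peak laser irradiance of up to 10^6 times the sensor saturation threshold.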

Details

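Motivation: Machine vision systems are vulnerable to laser flare, which can oversaturate or permanently damage the sensor. Existing methods struggle to deliver high-quality sensor protection and image restoration across the full visible spectrum.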

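Result: Trained on 100K images, NeuSee suppresses peak laser irradiance of up to 10^6 times the sensor saturation threshold, improves restored image quality by 10.1%, and achieves full-spectrum imaging and laser suppression for the first time.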

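Insight: By jointly learning the DOE and the image restoration network, NeuSee effectively addresses laser flare and offers a new approach to keeping machine vision systems stable under complex lighting conditions.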

Abstract: Machine vision systems are susceptible to laser flare, where unwanted intense laser illumination blinds them and distorts their perception of the environment through oversaturation or permanent damage to sensor pixels. We introduce NeuSee, the first computational imaging framework for high-fidelity sensor protection across the full visible spectrum. It jointly learns a neural representation of a diffractive optical element (DOE) and a frequency-space Mamba-GAN network for image restoration. The NeuSee system is adversarially trained end-to-end on 100K unique images to suppress peak laser irradiance as high as $10^6$ times the sensor saturation threshold $I_{\textrm{sat}}$, the point at which camera sensors may experience damage without the DOE. Our system leverages heterogeneous data and model parallelism for distributed computing, integrating hyperspectral information and multiple neural networks for realistic simulation and image restoration. NeuSee takes into account open-world scenes with dynamically varying laser wavelengths, intensities, and positions, as well as lens flare effects, unknown ambient lighting conditions, and sensor noise. It outperforms other learned DOEs, achieving full-spectrum imaging and laser suppression for the first time, with a 10.1% improvement in restored image quality.

[108] MMIS-Net for Retinal Fluid Segmentation and Detection eess.IV | cs.CV

Nchongmaje Ndipenoch, Alina Miron, Kezhi Wang, Yongmin Li

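TL;DR: MMIS-Net fuses multimodal medical image data, using Similarity Fusion blocks and a one-hot label space, and performs strongly on retinal fluid segmentation and detection.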

Details

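Motivation: Most existing deep learning models are trained on a single data source, overlooking the combined potential of annotated data spanning multiple modalities, organs, and diseases.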

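Result: On the RETOUCH hidden test set, the model achieves a mean Dice score of 0.83 for segmentation and an AUC of 1 for fluid detection, surpassing existing methods.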

Insight: Multimodal data fusion and consistent label handling can substantially improve model performance.

Abstract: Purpose: Deep learning methods have shown promising results in the segmentation and detection of diseases in medical images. However, most methods are trained and tested on data from a single source, modality, organ, or disease type, overlooking the combined potential of other available annotated data. Numerous small annotated medical image datasets from various modalities, organs, and diseases are publicly available. In this work, we aim to leverage the synergistic potential of these datasets to improve performance on unseen data. Approach: To this end, we propose a novel algorithm called MMIS-Net (MultiModal Medical Image Segmentation Network), which features Similarity Fusion blocks that utilize supervision and pixel-wise similarity knowledge selection for feature map fusion. Additionally, to address inconsistent class definitions and label contradictions, we created a one-hot label space to handle classes absent in one dataset but annotated in another. MMIS-Net was trained on 10 datasets encompassing 19 organs across 2 modalities to build a single model. Results: The algorithm was evaluated on the RETOUCH grand challenge hidden test set, outperforming large foundation models for medical image segmentation and other state-of-the-art algorithms. We achieved the best mean Dice score of 0.83 and an absolute volume difference of 0.035 for the fluids segmentation task, as well as a perfect Area Under the Curve of 1 for the fluid detection task. Conclusion: The quantitative results highlight the effectiveness of our proposed model due to the incorporation of Similarity Fusion blocks into the network’s backbone for supervision and similarity knowledge selection, and the use of a one-hot label space to address label class inconsistencies and contradictions.
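
The abstract mentions a one-hot label space for classes that are annotated in one dataset but absent in another, without giving details. One plausible reading, sketched below, is to take the union of all class names as a shared label space and to carry a per-dataset mask marking which classes that dataset actually annotates. The dataset names, class names, the functions `build_label_space` and `encode`, and the mask itself are all illustrative assumptions rather than the paper's implementation.

```python
def build_label_space(dataset_classes):
    """Map every class name seen in any dataset to a global index (sorted)."""
    names = sorted({c for classes in dataset_classes.values() for c in classes})
    return {name: i for i, name in enumerate(names)}


def encode(label, dataset, dataset_classes, space):
    """Return (one_hot, annotated_mask) for one pixel label from one dataset.

    one_hot marks the labelled class in the shared space; annotated_mask marks
    the classes this dataset annotates at all, so that a loss could skip
    classes the dataset never labels (an assumption of this sketch).
    """
    one_hot = [0] * len(space)
    one_hot[space[label]] = 1
    mask = [0] * len(space)
    for c in dataset_classes[dataset]:
        mask[space[c]] = 1
    return one_hot, mask


# Hypothetical datasets with partially overlapping class sets.
dataset_classes = {
    "ds_a": ["background", "liver"],
    "ds_b": ["background", "fluid"],
}
space = build_label_space(dataset_classes)
one_hot, mask = encode("liver", "ds_a", dataset_classes, space)
```

Here `ds_a` never annotates `fluid`, so its mask is 0 at that index. A pixel labelled `background` in `ds_a` then says nothing about whether fluid is present, which avoids the label contradiction between the two datasets.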

cs.AI [Back]

[109] Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL cs.AI | cs.CL

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu

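TL;DR: The paper proposes Chain-of-Agents (CoA), a new LLM reasoning paradigm that uses multi-agent distillation and reinforcement learning to solve complex problems end to end, substantially improving agent performance.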

Details

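Motivation: Existing agent systems built on prompt engineering are computationally inefficient and cannot benefit from data-driven learning, which motivates a more efficient end-to-end agent framework.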

Result: AFM achieves state-of-the-art performance on multiple benchmarks, and the research is fully open-sourced.

Insight: Combining agent distillation with reinforcement learning points to an important direction for future agent models.

Abstract: Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models’ capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open-sourced, which offers a solid starting point for future research on agent models and agentic RL.

[110] Explicit v.s. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information cs.AI | cs.CL | cs.IR

Zeyu Zhang, Yang Zhang, Haoran Tan, Rui Li, Xu Chen

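TL;DR: This paper examines how explicit and implicit memory mechanisms in language model agents differ on multi-hop reasoning over personalized information, and proposes a hybrid memory method, HybridMem, to improve performance.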

Details

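Motivation: Complex real-world tasks often require multi-hop reasoning over large amounts of user information, which existing memory mechanisms struggle to handle.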

Result: Experiments show that HybridMem performs better on multi-hop reasoning tasks.

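Insight: Explicit memory is better suited to structured information retrieval, while implicit memory has an advantage with ambiguous information; a hybrid approach lets each offset the other's weaknesses.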

Abstract: In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users’ information. Although some previous studies have adopted memory to implement user personalization, they typically focus on preference alignment and simple question-answering. However, in the real world, complex tasks often require multi-hop reasoning on a large amount of user information, which poses significant challenges for current memory approaches. To address this limitation, we propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information. We explicitly define this task and construct a dataset along with a unified evaluation framework. Then, we implement various explicit and implicit memory methods and conduct comprehensive experiments. We evaluate their performance on this task from multiple perspectives and analyze their strengths and weaknesses. Besides, we explore hybrid approaches that combine both paradigms and propose the HybridMem method to address their limitations. We demonstrate the effectiveness of our proposed model through extensive experiments. To benefit the research community, we release this project at https://github.com/nuster1128/MPR.

[111] TASER: Table Agents for Schema-guided Extraction and Recommendation cs.AI | cs.CL | cs.IR | cs.LG

Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso

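TL;DR: TASER is a continuously learning, agentic table extraction system that extracts structured data from messy, multi-page, heterogeneous financial tables and refines its output through a Recommender Agent, outperforming existing methods.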

Details

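Motivation: Tables in financial documents are often messy and fragmented, and existing table detection models struggle with them, so a system that can learn continuously and refine its extraction results is needed.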

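Result: TASER outperforms the Table Transformer model by 10.1% on table detection; larger batch sizes yield a 104.3% increase in schema recommendations that are actionable and utilized, and a 9.8% increase in extracted holdings.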

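Insight: Continuous learning and batching are key to improving table extraction performance, and schema-guided agentic systems can handle complex real-world tables effectively.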

Abstract: Real-world financial documents report essential information about an entity’s financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.

[112] Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation cs.AI | cs.CV

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng

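TL;DR: The paper proposes Multimodal Structured Reinforcement Learning (MSRL) to break through the performance plateau of supervised fine-tuning (SFT) in chart-to-code generation, using a multi-granularity reward system that combines textual and visual feedback to substantially improve generation quality.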

Details

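Motivation: For this task, supervised fine-tuning hits a performance plateau in understanding information-rich images in depth and generating structured outputs, so more effective reinforcement learning strategies are needed.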

Result: Improves by 6.2% and 9.9% on the ChartMimic and ReachQA benchmarks respectively, reaching performance competitive with advanced closed-source models.

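Insight: Multimodal feedback, especially visual feedback, is critical to the quality of structured outputs, and a two-stage curriculum effectively stabilizes training.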

Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
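
The abstract describes a reward with two levels: rule-based textual checks on fine-grained code details, and a model-based visual score obtained by rendering the generated code and passing it to an evaluator. The sketch below shows one simple way such signals could be combined into a scalar reward. It is not the paper's formulation: the substring-matching rule, the equal weights, and the names `textual_reward` and `combined_reward` are assumptions, and the visual score is passed in as a number because the rendering and evaluator model are outside the scope of this example.

```python
def textual_reward(generated_code, required_snippets):
    """Fraction of required code details that appear in the generated code.

    A deliberately crude rule: plain substring matching (assumption).
    """
    if not required_snippets:
        return 0.0
    hits = sum(1 for s in required_snippets if s in generated_code)
    return hits / len(required_snippets)


def combined_reward(generated_code, required_snippets, visual_score,
                    w_text=0.5, w_visual=0.5):
    """Weighted sum of the textual reward and an externally supplied visual score.

    visual_score stands in for an evaluator model's similarity rating in
    [0, 1]; the weights are hypothetical.
    """
    if not 0.0 <= visual_score <= 1.0:
        raise ValueError("visual_score must lie in [0, 1]")
    text = textual_reward(generated_code, required_snippets)
    return w_text * text + w_visual * visual_score


# Toy example: the generated code matches two of three required details.
code = "plt.bar(x, y)\nplt.title('Sales')\nplt.show()"
required = ["plt.bar", "plt.title('Sales')", "plt.xlabel"]
reward = combined_reward(code, required, visual_score=0.9)
```

Keeping the two terms separate makes it easy to log them individually, which helps show whether a policy is improving on code details, on rendered appearance, or on both.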

Table of Contents

cs.CV [Back]

[1] YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection cs.CV | eess.IV

[2] MIRAGE: Towards AI-Generated Image Detection in the Wild cs.CV | cs.AI

[3] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model cs.CV

[4] Exploration of Deep Learning Based Recognition for Urdu Text cs.CV

[5] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis cs.CV | cs.AI

[6] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving cs.CV

[7] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples cs.CV | cs.LG

[8] Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery cs.CV

[9] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies cs.CV

[10] Mitigating Easy Option Bias in Multiple-Choice Question Answering cs.CV | cs.AI | cs.MM

[11] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference cs.CV | cs.AI | cs.CL | eess.IV

[12] EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis cs.CV

[13] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding cs.CV

[14] Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs cs.CV | cs.LG

[15] STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models cs.CV | cs.AI

[16] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding cs.CV

[17] CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving cs.CV | cs.AI

[18] Multi-view Clustering via Bi-level Decoupling and Consistency Learning cs.CV | cs.LG

[19] AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes cs.CV | eess.IV

[20] Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models cs.CV

[21] Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency cs.CV | cs.AI

[22] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models cs.CV | cs.AI

[23] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors cs.CV | cs.AI

[24] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations cs.CV | cs.AI

[25] GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering cs.CV

[26] A Lightweight Dual-Mode Optimization for Generative Face Video Coding cs.CV | eess.IV

[27] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer cs.CV

[28] DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup cs.CV

[29] Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics cs.CV

[30] The 9th AI City Challenge cs.CV | cs.AI | cs.LG | cs.RO

[31] Generative Model-Based Feature Attention Module for Video Action Analysis cs.CV

[32] Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model cs.CV

[33] Bridging Clear and Adverse Driving Conditions cs.CV

[34] Towards Efficient Vision State Space Models via Token Merging cs.CV

[35] Unleashing Semantic and Geometric Priors for 3D Scene Completion cs.CV

[36] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction cs.CV

[37] Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm cs.CV

[38] TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis cs.CV

[39] RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance cs.CV

[40] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation cs.CV | cs.AI | cs.CL

[41] OmniTry: Virtual Try-On Anything without Masks cs.CV

[42] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction cs.CV

[43] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes cs.CV

[44] Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture cs.CV

[45] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance cs.CV

[46] Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks cs.CV | cs.AI

[47] Shape-from-Template with Generalised Camera cs.CV

[48] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization cs.CV

[49] A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports cs.CV | cs.AI

[50] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering cs.CV | cs.LG

[51] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation cs.CV

[52] RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection cs.CV

[53] In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging cs.CV

[54] SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation cs.CV

[55] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis cs.CV

[56] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts cs.CV

[57] ViT-FIQA: Assessing Face Image Quality using Vision Transformers cs.CV

[58] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving cs.CV

[59] OmViD: Omni-supervised active learning for video action detection cs.CV

[60] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment cs.CV

[61] Self-Supervised Sparse Sensor Fusion for Long Range Perception cs.CV

[62] ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans cs.CV | cs.RO | 68T45

[63] Backdooring Self-Supervised Contrastive Learning by Noisy Alignment cs.CV

[64] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing cs.CV

[65] Distilled-3DGS: Distilled 3D Gaussian Splatting cs.CV

[66] Beyond Simple Edits: Composed Video Retrieval with Dense Modifications cs.CV

[67] LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos cs.CV

cs.CL [Back]

[68] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents cs.CL | cs.AI | cs.CV

[69] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection cs.CL

[70] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis cs.CL | cs.AI

[71] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs cs.CL | cs.AI

[72] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation cs.CL | cs.LG

[73] AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings cs.CL

[74] CRISP: Persistent Concept Unlearning via Sparse Autoencoders cs.CL | I.2.7

[75] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions? cs.CL | cs.LG

[76] Generics and Default Reasoning in Large Language Models cs.CL | cs.AI | cs.LO

[77] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA cs.CL