Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 66]
- cs.IR [Total: 1]
- q-bio.NC [Total: 1]
- cs.GR [Total: 3]
- eess.IV [Total: 7]
- cs.RO [Total: 3]
- cs.CR [Total: 1]
- cs.LG [Total: 4]
- cs.AI [Total: 2]
- cs.SD [Total: 1]
cs.CL
[1] Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
Yimeng Zhang,Tian Wang,Jiri Gesi,Ziyi Wang,Yuxuan Lu,Jiacheng Lin,Sinong Zhan,Vianne Gao,Ruochen Jiao,Junze Liu,Kun Qian,Yuxin Tang,Ran Xue,Houyu Zhang,Qingjun Cui,Yufan Guo,Dakuo Wang
Main category: cs.CL
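TL;DR: Shop-R1 is a reinforcement learning framework that applies separate reward signals to rationale generation and action prediction, improving how well LLMs reason when simulating online shopping behavior. Experiments report a relative improvement of over 65% compared to the baseline.
Details
Motivation: Current methods for generating believable human behavior with LLMs are bounded by the reasoning ability of the model that produces the rationales. Shop-R1 uses an RL framework to lift that bound and simulate real human behavior more accurately.
Contribution: Proposes the Shop-R1 framework, which combines stage-specific reward signals (rationale generation and action prediction) with a difficulty-aware hierarchical reward structure to improve LLM reasoning in simulated online shopping.
Method: The task is decomposed into two stages, rationale generation and action prediction, each with its own reward design. Rationale generation is guided in a self-supervised manner by internal model signals (e.g., logit distributions); action prediction uses a hierarchical reward with difficulty-aware scaling to prevent reward hacking.
Result: Experiments show a relative improvement of over 65% compared to the baseline.
Insight: Stage-specific rewards and a hierarchical reward structure can improve LLM reasoning, particularly for complex human behavior simulation tasks.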
Abstract: Large Language Models (LLMs) have recently demonstrated strong potential in generating ‘believable human-like’ behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
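The hierarchical, difficulty-aware action reward described in the abstract could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the action schema (`type`, `attribute`, `value`), the action names, the difficulty weights, and the function name `action_reward` are all assumptions made for the example.

```python
# Hypothetical difficulty weights (not from the paper): harder action types
# earn larger rewards, so always predicting the easiest action pays little.
DIFFICULTY = {"terminate": 0.2, "click": 1.0, "type_and_submit": 1.5}


def action_reward(pred: dict, gold: dict) -> float:
    """Score a predicted action against the logged human action."""
    # Level 1: the high-level action type must match to earn anything.
    if pred.get("type") != gold["type"]:
        return 0.0
    scale = DIFFICULTY.get(gold["type"], 1.0)
    reward = 0.3 * scale
    # Level 2: the sub-action attribute (e.g. which element was targeted).
    if gold.get("attribute") is not None and pred.get("attribute") == gold["attribute"]:
        reward += 0.3 * scale
        # Level 3: the value is credited only once the attribute is right.
        if gold.get("value") is not None and pred.get("value") == gold["value"]:
            reward += 0.4 * scale
    return reward
```

Under these assumed weights, a fully correct `type_and_submit` action earns more than a correct `terminate`, and partially correct predictions receive partial credit instead of an all-or-nothing score.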
[2] Dynamic and Generalizable Process Reward Modeling
Zhangyue Yin,Qiushi Sun,Zhiyuan Zeng,Qinyuan Cheng,Xipeng Qiu,Xuanjing Huang
Main category: cs.CL
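TL;DR: This paper proposes Dynamic and Generalizable Process Reward Modeling (DG-PRM), which stores multi-dimensional reward criteria in a reward tree and dynamically selects reward signals, markedly improving cross-domain generalization.
Details
Motivation: Existing process reward models (PRMs) rely on heuristics and generalize poorly, while LLM-as-judge approaches overlook the guidance embedded in the text. Static, coarse-grained evaluation criteria also struggle to adapt to complex process supervision.
Contribution: Proposes DG-PRM, which combines a reward tree with dynamic reward-signal selection and adopts Pareto dominance estimation to identify discriminative positive and negative pairs among multi-dimensional reward signals.
Method: A reward tree stores fine-grained reward criteria, from which reward signals are selected dynamically for step-wise scoring; Pareto dominance estimation handles the multi-dimensional reward signals.
Result: Experiments show that DG-PRM performs strongly on prevailing benchmarks, significantly improves task performance, and adapts well to out-of-distribution scenarios.
Insight: The dynamic, multi-dimensional reward design is the key to DG-PRM's generalization and offers a new direction for process supervision on complex tasks.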
Abstract: Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
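To make the Pareto dominance idea concrete, here is a minimal sketch of how multi-dimensional step scores could be turned into positive/negative pairs. It assumes each reasoning step already has a tuple of scores, one per reward criterion, with higher being better; the function names and the data layout are illustrative and not taken from the paper.

```python
from itertools import combinations


def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b on every criterion and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def preference_pairs(scores: dict) -> list:
    """Return (positive, negative) step ids where one step Pareto-dominates the other."""
    pairs = []
    for (id_a, s_a), (id_b, s_b) in combinations(scores.items(), 2):
        if dominates(s_a, s_b):
            pairs.append((id_a, id_b))
        elif dominates(s_b, s_a):
            pairs.append((id_b, id_a))
    return pairs
```

Steps that trade off against each other (better on one criterion, worse on another) produce no pair, so only unambiguous comparisons are kept.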
[3] One Whisper to Grade Them All
Nhan Phan,Anusha Porwal,Yaroslav Getman,Ekaterina Voskoboinik,Tamás Grósz,Mikko Kurimo
Main category: cs.CL
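TL;DR: The paper presents an efficient end-to-end approach to automatic speaking assessment that handles a multi-part speaking test with a single Whisper-small encoder and a lightweight aggregator. It needs neither transcription nor per-part models, which cuts inference time and makes scoring more efficient.
Details
Motivation: To make Automatic Speaking Assessment (ASA) more efficient and less compute-hungry, so that it is practical for large-scale Computer-Assisted Language Learning systems.
Contribution: 1. A single Whisper-small encoder processes all parts of the speaking test, and a lightweight aggregator predicts the final score, reducing model complexity and inference time; 2. A data sampling strategy reaches strong performance while training on only 44.8% of the speakers.
Method: A Whisper-small encoder processes every spoken response, a lightweight aggregator combines the information, and the model predicts the score; a data sampling strategy is used to optimize training.
Result: The system achieves an RMSE of 0.384, outperforming the text-based baseline (0.44), with at most 168M parameters (about 70% of Whisper-small); with the data sampling strategy the RMSE is 0.383.
Insight: The single-encoder, lightweight-aggregator architecture substantially improves efficiency; the data sampling strategy handles class imbalance and improves data efficiency.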
Abstract: We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system’s main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
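For reference, the Root Mean Squared Error quoted above is the square root of the mean squared difference between predicted and reference scores. A minimal sketch follows; the function name and the plain-list inputs are assumptions for illustration and are unrelated to the challenge's official scoring code.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length sequences of scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

Because errors are squared before averaging, a few large scoring mistakes raise RMSE more than many small ones.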
[4] Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text
Hulayyil Alshammari,Praveen Rao
Main category: cs.CL
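TL;DR: The paper evaluates how well six AI text detectors recognize text generated by DeepSeek. Adversarial attacks such as paraphrasing and humanizing significantly reduce detection accuracy, while few-shot and chain-of-thought prompting achieve high accuracy.
Details
Motivation: As large language models (LLMs) spread, detecting AI-generated text becomes harder, yet the recently released DeepSeek model has received little study. The paper fills this gap by evaluating how existing detectors perform on DeepSeek-generated text.
Contribution: 1) A systematic evaluation of six AI detection tools on DeepSeek-generated text; 2) A study of how adversarial attacks (paraphrasing and humanizing) affect detection performance; 3) Evidence that few-shot and chain-of-thought (CoT) prompting classify text with high accuracy.
Method: 1) Collect human-written samples and generate matching AI text with DeepSeek; 2) Produce adversarial samples by paraphrasing and humanizing; 3) Test the six detectors on original and adversarial samples; 4) Evaluate DeepSeek itself as a detector using few-shot and CoT prompting.
Result: 1) QuillBot and Copyleaks perform near-perfectly on original and paraphrased text, while other tools (such as AI Text Classifier and GPT-2) give inconsistent results; 2) Humanization is the most effective attack and significantly reduces detection accuracy; 3) Few-shot and CoT prompting perform best, with AI recall of 96% and human recall of 100%.
Insight: 1) Adversarial attacks, humanization in particular, pose a serious challenge to existing detectors; 2) Few-shot and CoT prompting offer a new approach to AI text detection; 3) More robust detection techniques are needed to keep up with emerging LLMs.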
Abstract: Large language models (LLMs) have rapidly transformed the creation of written materials. LLMs have led to questions about writing integrity, thereby driving the creation of artificial intelligence (AI) detection technologies. Adversarial attacks, such as standard and humanized paraphrasing, inhibit detectors’ ability to detect machine-generated text. Previous studies have mainly focused on ChatGPT and other well-known LLMs and have shown varying accuracy across detectors. However, there is a clear gap in the literature about DeepSeek, a recently published LLM. Therefore, in this work, we investigate whether six generally accessible AI detection tools – AI Text Classifier, Content Detector AI, Copyleaks, QuillBot, GPT-2, and GPTZero – can consistently recognize text generated by DeepSeek. The detectors were exposed to the aforementioned adversarial attacks. We also considered DeepSeek as a detector by performing few-shot prompting and chain-of-thought reasoning (CoT) for classifying AI and human-written text. We collected 49 human-authored question-answer pairs from before the LLM era and generated matching responses using DeepSeek-v3, producing 49 AI-generated samples. Then, we applied adversarial techniques such as paraphrasing and humanizing to add 196 more samples. These were used to challenge detector robustness and assess accuracy impact. While QuillBot and Copyleaks showed near-perfect performance on original and paraphrased DeepSeek text, others – particularly AI Text Classifier and GPT-2 – showed inconsistent results. The most effective attack was humanization, reducing accuracy to 71% for Copyleaks, 58% for QuillBot, and 52% for GPTZero. Few-shot and CoT prompting showed high accuracy, with the best five-shot result misclassifying only one of 49 samples (AI recall 96%, human recall 100%).
[5] Technical Report of TeleChat2, TeleChat2.5 and T1
Zihan Wang,Xinzhang Liu,Yitong Yao,Chao Wang,Yu Zhao,Zhihao Yang,Wenmin Deng,Kaipeng Jia,Jiaxin Peng,Yuyao Huang,Sishi Xiong,Zhuo Jiang,Kaidong Yu,Xiaohui Hu,Fubei Yao,Ruiyu Fang,Zhuoru Jiang,Ruiting Song,Qiyi Xie,Rui Xue,Xuewei He,Yanlei Xue,Zhu Yuan,Zhaoxi Zhang,Zilu Huang,Shiquan Wang,Xin Wang,Hanming Wu,Mingyuan Wang,Xufeng Zhan,Yuhan Sun,Zhaohu Xing,Yuhao Jiang,Bingkai Yang,Shuangyong Song,Yongxiang Li,Zhongjiang He,Xuelong Li
Main category: cs.CL
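TL;DR: TeleChat2, TeleChat2.5 and T1 are an upgraded series of Transformer-based language models that gain substantial performance from improved training strategies (pretraining, SFT, DPO and reinforcement learning). T1 targets complex reasoning, while TeleChat2.5 emphasizes fast inference. T1-115B outperforms proprietary models such as OpenAI's o1-mini and GPT-4o.
Details
Motivation: To give developers and researchers stronger open language models by improving training strategies for both reasoning and general tasks.
Contribution: 1. Releases the TeleChat2, TeleChat2.5 and T1 model series; 2. Uses an enhanced training pipeline (pretraining + SFT + DPO + reinforcement learning); 3. T1 supports long chain-of-thought reasoning, and TeleChat2.5 emphasizes inference speed.
Method: 1. Pretraining on 10 trillion tokens, followed by SFT and DPO; 2. TeleChat2.5 and T1 add continual pretraining on domain-specific data and reinforcement learning; 3. The flagship models are dense Transformer architectures with 115B parameters.
Result: T1-115B outperforms proprietary models such as OpenAI's o1-mini and GPT-4o; TeleChat2.5 delivers rapid inference.
Insight: Reinforcement learning and domain-specific training substantially improve performance, and open large models can surpass proprietary models on specific tasks.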
Abstract: We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models of T1 and TeleChat2.5 are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.
[6] NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Weizhi Fei,Hao Shi,Jing Xu,Jingchen Peng,Jiazheng Li,Jingzhao Zhang,Bo Bai,Wei Han,Zhenyuan Chen,Xueyan Niu
Main category: cs.CL
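TL;DR: NeuralDB is a framework for efficiently editing knowledge in large language models (LLMs). It uses a neural key-value (KV) database with a non-linear gated retrieval module to support large-scale editing (up to 100,000 facts) while preserving the model's general abilities.
Details
Motivation: Conventional Locate-and-Edit (L&E) methods can degrade a model's general abilities, or even forget edited facts, when many facts are edited. NeuralDB aims to solve this problem.
Contribution: Proposes the NeuralDB framework, which explicitly represents edited facts as a neural KV database and uses a gated retrieval module to balance editing efficiency against general ability.
Method: Linear L&E methods are modeled as queries to a KV database, and a non-linear gated retrieval module is designed that activates only when inference involves the edited facts.
Result: Experiments on the ZsRE and CounterFacts datasets show that NeuralDB outperforms baselines in editing efficacy, generalization and task performance, and scales to 100,000 facts.
Insight: By representing edited facts explicitly and limiting the reach of the gated module, NeuralDB makes efficient editing compatible with preserved model performance.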
Abstract: Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).
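The gated key-value lookup can be pictured with a small sketch. The abstract does not specify the gate, so the cosine-similarity test, the hard threshold, and the names `gated_retrieve`, `keys`, and `values` below are assumptions chosen only to illustrate "retrieve a stored edit when the query matches, otherwise leave the hidden state untouched".

```python
import numpy as np


def gated_retrieve(hidden, keys, values, threshold=0.9):
    """Return the hidden state plus a stored residual, or unchanged if no key matches."""
    # Cosine similarity between the query and every stored key.
    sims = keys @ hidden / (np.linalg.norm(keys, axis=1) * np.linalg.norm(hidden) + 1e-8)
    best = int(np.argmax(sims))
    # Hard (non-linear) gate: the module stays inactive unless a key matches closely.
    if sims[best] < threshold:
        return hidden
    return hidden + values[best]
```

Because unmatched queries pass through unchanged, inputs unrelated to the edited facts see the original model behavior, which is the property the abstract credits for preserving general abilities.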
[7] GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
Duy Nguyen,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal
Main category: cs.CL
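TL;DR: GrAInS uses gradient-based attribution to adjust the internal activations of LLMs and VLMs at inference time, improving performance and alignment without fine-tuning or weight updates.
Details
Motivation: Existing inference-time interventions rely on fixed, global vectors, ignore the causal influence of individual input tokens, and make little use of gradient information, which hurts them most in multimodal settings.
Contribution: Proposes GrAInS, which uses contrastive gradient-based attribution to identify key tokens and build semantic steering vectors, enabling fine-grained and interpretable control over model behavior.
Method: Integrated Gradients computes token-level attributions, which are used to build steering vectors; at inference time, hidden activations are adjusted and then normalized.
Result: Improves accuracy on TruthfulQA by 13.22%, reduces the hallucination rate on MMHal-Bench from 0.624 to 0.514, and raises the alignment win rate on SPA-VL by 8.11%.
Insight: In multimodal tasks, gradient-based attribution accounts for the uneven influence of visual and textual tokens, improving both controllability and performance.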
Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
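The inference-time step, shifting hidden activations along a steering direction and then normalizing to preserve representational scale, could look roughly like the sketch below. The steering vector is assumed to be given (the attribution step that builds it is not shown), and the function name `steer`, the `strength` parameter, and the per-token norm restoration are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np


def steer(hidden, direction, strength=1.0):
    """Shift each token's hidden state along `direction`, then restore its original norm."""
    original_norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    shifted = hidden + strength * direction
    shifted_norm = np.linalg.norm(shifted, axis=-1, keepdims=True)
    # Rescale so steering changes the direction of the activation, not its scale.
    return shifted * original_norm / (shifted_norm + 1e-8)
```

Restoring the norm is one simple way to keep downstream layers from seeing activations much larger than those they were trained on.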
[8] Synthetic Data Generation for Phrase Break Prediction with Large Language Model
Hoyeon Lee,Sejung Son,Ye-Eun Kang,Jong-Hwan Kim
Main category: cs.CL
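TL;DR: The paper explores using large language models (LLMs) to generate synthetic data for phrase break prediction, addressing the high cost of human annotation, and validates the approach with multilingual experiments.
Details
Motivation: Existing phrase break prediction methods depend heavily on human-annotated audio or text, which is costly and labor-intensive, and the inherent variability of speech makes data acquisition harder still. The success of LLMs in addressing data problems in NLP motivates this work.
Contribution: Proposes using LLMs to generate synthetic phrase break annotations, reducing reliance on human annotation, and validates the approach by comparing against traditional annotations and evaluating across multiple languages.
Method: An LLM generates synthetic phrase break annotations, which are compared with traditional human annotations and evaluated across several languages.
Result: Experiments show that LLM-generated synthetic data effectively mitigates the data challenges in phrase break prediction, demonstrating the potential of LLMs in the speech domain.
Insight: LLMs are useful beyond NLP tasks: they extend to the speech domain, easing data scarcity and annotation cost and suggesting a path for cross-domain data generation.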
Abstract: Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.
[9] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
Xiaoyuan Li,Moxin Li,Wenjie Wang,Rui Men,Yichang Zhang,Fuli Feng,Dayiheng Liu,Junyang Lin
Main category: cs.CL
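TL;DR: The paper proposes MathOPEval, a benchmark that evaluates how well multi-modal large language models (MLLMs) perform visual operations through code, filling a gap left by existing evaluations that focus on text output.
Details
Motivation: Existing evaluations focus mainly on the textual reasoning output of MLLMs and neglect their ability to perform visual operations via code, which is essential in multi-modal mathematical reasoning.
Contribution: Proposes the first framework dedicated to evaluating Multi-modal Code Generation (MCG) and Multi-modal Code Editing (MCE) in MLLMs, covering five types of mathematical figures.
Method: Defines MCG and MCE tasks that assess a model's ability to build visualizations from scratch and to perform fine-grained edits (deletion, modification and annotation).
Result: Experiments show that current mainstream MLLMs lag significantly behind humans on fine-grained visual operation tasks.
Insight: Existing MLLMs remain clearly limited in the precision and granularity of visual operations; future work needs to improve code representation and visual understanding.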
Abstract: Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the ability of MLLMs to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating the code-based capabilities of MLLMs in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
[10] HIVMedQA: Benchmarking large language models for HIV medical decision support
Gonzalo Cardenal Antolin,Jacques Fellay,Bashkim Jaha,Roger Kouyos,Niko Beerenwinkel,Diane Duroux
Main category: cs.CL
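TL;DR: This paper introduces HIVMedQA, a benchmark for evaluating large language models (LLMs) in HIV medical decision support, and analyzes the performance and limitations of several models.
Details
Motivation: HIV management is a complex clinical domain, yet LLMs have not been systematically evaluated for clinical decision support there, particularly for HIV treatment.
Contribution: 1) Introduces the HIVMedQA benchmark dataset; 2) Evaluates a range of general-purpose and medically specialized LLMs; 3) Reveals model limitations in comprehension, reasoning and related dimensions.
Method: 1) Develops a dataset of clinically relevant open-ended questions; 2) Applies prompt engineering to enhance model performance; 3) Evaluates models with lexical similarity and an LLM-as-a-judge approach.
Result: 1) Gemini 2.5 Pro performs best across most dimensions; 2) Two of the top three models are proprietary; 3) Performance declines as question complexity increases; 4) Medically fine-tuned models do not always outperform general-purpose ones.
Insight: LLMs need targeted improvement before clinical use, particularly in comprehension and reasoning, bias mitigation, and safety.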
Abstract: Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.
[11] Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen,Dongxia Wang,Yi Liu,Haonan Zhang,Wenhai Wang
Main category: cs.CL
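TL;DR: This paper studies ‘sticky tokens’ in Transformer text embedding models: tokens that, when repeatedly inserted into sentences, disrupt the normal distribution of embedding distances and degrade downstream performance. The authors propose an efficient detection method, STD, find 868 sticky tokens across multiple models, and analyze their origins and downstream impact.
Details
Motivation: Transformer text embedding models are widely used in NLP, but certain tokens (sticky tokens) distort the distribution of embedding distances and hurt performance. Such tokens have not been studied systematically, so the authors set out to characterize them and their effects.
Contribution: 1. The first systematic study of sticky tokens, with a formal definition; 2. The STD method for detecting sticky tokens efficiently; 3. The discovery of 868 sticky tokens in 40 checkpoints across 14 model families, with an analysis of their origins; 4. Evidence of significant negative effects on downstream tasks.
Method: The Sticky Token Detector (STD) finds sticky tokens through sentence and token filtering. STD is applied to checkpoints from multiple models, and attention-layer analysis is used to examine the internal mechanism.
Result: 868 sticky tokens are found in 40 checkpoints across 14 model families. They originate mainly from special or unused vocabulary entries and from fragmented subwords in multilingual corpora. Performance on downstream tasks such as clustering and retrieval drops by up to 50%.
Insight: Sticky tokens expose weaknesses in current tokenization and representation learning; better tokenization strategies and model design are needed to avoid their negative effect on performance.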
Abstract: Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
[12] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
Wonjun Jeong,Dongseok Kim,Taegkeun Whangbo
Main category: cs.CL
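TL;DR: The paper proposes SCOPE, an evaluation framework that uses randomized, counter-biased option placement to reduce the influence of selection bias on large language models in multiple-choice tasks, giving a fairer measure of their actual ability.
Details
Motivation: Large language models can inflate their scores on multiple-choice tasks by exploiting inherent biases in option positions or labels instead of understanding the question, which makes evaluation results unreliable.
Contribution: Proposes the SCOPE framework, which estimates a model's position-bias distribution and redistributes answer positions to equalize the probability of a lucky guess, while keeping semantically similar distractors away from the answer, improving the fairness and reliability of evaluation.
Method: The model is invoked repeatedly with a null prompt that lacks semantic content to estimate its position-bias distribution; answer positions are then reassigned according to the inverse-bias distribution, and semantically similar distractors are not placed adjacent to the answer.
Result: Across multiple benchmark experiments, SCOPE consistently outperforms existing debiasing methods and shows clearer confidence distributions over correct options.
Insight: Removing a model's reliance on option-placement bias allows a more accurate assessment of its real ability and offers a fairer, standardized approach to future LLM evaluation.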
Abstract: Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
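A rough sketch of the counter-biased placement step follows. It assumes the position-bias distribution has already been estimated (for example, from the fraction of null-prompt calls in which the model picked each slot) and that "inverse-bias" means weights proportional to the reciprocal of that preference; both the normalization and the function names are assumptions, since the abstract does not give the exact formula.

```python
import random


def inverse_bias_distribution(position_bias):
    """Turn a model's position-preference distribution into answer-slot sampling weights."""
    # Clamp to avoid division by zero for slots the model never chose.
    inverse = [1.0 / max(p, 1e-6) for p in position_bias]
    total = sum(inverse)
    return [w / total for w in inverse]


def sample_answer_slot(position_bias, rng=random):
    """Pick the slot for the correct answer, favoring positions the model rarely guesses."""
    weights = inverse_bias_distribution(position_bias)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

With an estimated bias of `[0.5, 0.25, 0.125, 0.125]`, the correct answer would be placed in the first slot least often, so a model that blindly favors that slot gains little from the habit.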
[13] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks
Keyu Wu,Qianjin Yu,Manlin Mei,Ruiting Liu,Jun Wang,Kailai Zhang,Yelun Bao
Main category: cs.CL
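TL;DR: This paper presents TN-AutoRCA, an automated, self-improving agentic framework for alarm-based root cause analysis (RCA) in telecommunication networks, together with a realistic benchmark.
Details
Motivation: Root cause analysis (RCA) in telecommunication networks is a complex and critical task, and it poses a major challenge for AI because it requires graph-based reasoning and realistic benchmarks are scarce.
Contribution: 1) Builds a realistic RCA benchmark for telecommunication networks; 2) Proposes TN-AutoRCA, a self-improving agentic framework.
Method: Combines graph-based reasoning with an automated agentic framework so that the RCA process can improve itself.
Result: The TN-AutoRCA framework performs well on realistic telecommunication network data and carries out root cause analysis effectively.
Insight: The work underlines the importance of realistic benchmarks and the potential of automated agentic frameworks for continual improvement.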
Abstract: Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.
[14] Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke,Nisansa de Silva
Main category: cs.CL
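TL;DR: The paper compares the zero-shot performance of six OCR engines on two low-resourced languages, Sinhala and Tamil, finds clear differences between engines across the two languages, and introduces a new synthetic Tamil OCR benchmarking dataset.
Details
Motivation: OCR for Latin-script and other high-resourced languages is largely solved, but low-resourced languages with unique scripts and scarce data remain challenging. The paper evaluates how different OCR engines perform on such languages.
Contribution: 1. A comprehensive comparison of the zero-shot performance of six OCR engines on Sinhala and Tamil. 2. A new synthetic Tamil OCR benchmarking dataset.
Method: Six OCR engines, both commercial and open-source, are rigorously evaluated with five measurement techniques, including character error rate (CER) and word error rate (WER).
Result: Surya performs best on Sinhala (WER 2.61%), while Document AI performs best on Tamil (CER 0.78%).
Insight: Open-source and commercial OCR engines each have strengths on low-resourced languages, and synthetic datasets may help fill the data gap.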
Abstract: Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
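The character- and word-level error rates cited above are both normalized edit distances. The sketch below shows the standard definitions in plain Python; it is not the study's evaluation code, and it counts Unicode code points as "characters", which is a simplifying assumption: for scripts such as Sinhala and Tamil, a visible character can span several code points, so a grapheme-aware split may be preferable.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Both rates can exceed 1.0 when the OCR output contains many more insertions than the reference has characters or words.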
[15] BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit
Biao Yi,Zekun Fei,Jianing Geng,Tong Li,Lihai Nie,Zheli Liu,Yiming Li
Main category: cs.CL
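TL;DR: The paper presents a new attack on large reasoning models (LRMs), the ‘overthinking backdoor’, which uses data poisoning to introduce a controllable number of redundant reasoning steps, substantially lengthening the reasoning process without affecting output correctness.
Details
Motivation: Large reasoning models (LRMs) excel at complex reasoning tasks, but their extensive chain-of-thought (CoT) capability may become a new attack target. The authors explore a controllable attack that leaves correctness intact in order to expose this security weakness.
Contribution: Introduces the concept of the ‘overthinking backdoor’ and designs a tunable trigger that lets an attacker precisely control reasoning verbosity; data poisoning with programmatically generated redundant reasoning steps makes the attack stealthy and resource-consuming.
Method: A data poisoning approach pairs a tunable trigger (whose number of repetitions sets the attack intensity) with correspondingly verbose CoT responses; a teacher LLM generates reasoning that is correct but redundant, so the final answer remains correct.
Result: Experiments show that the method substantially lengthens the reasoning process without affecting output correctness, demonstrating that the attack is reliable and controllable.
Insight: The study reveals a new threat to LRM security and stresses that model design and deployment must account for potential vulnerabilities in the reasoning process.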
Abstract: Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger (where the number of repetitions signals the desired intensity) with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.
[16] AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data
Rana Alshaikh,Israa Alghanmi,Shelan Jeawak
Main category: cs.CL
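TL;DR: The paper presents AraTable, a benchmark for evaluating the reasoning and understanding abilities of large language models (LLMs) on Arabic tabular data. Results show that LLMs perform adequately on simple tasks but still struggle with complex reasoning.
Details
Motivation: Existing benchmarks mainly target English tabular data; Arabic lacks comparable resources because public data is limited and the language has unique features.
Contribution: 1. AraTable, the first comprehensive benchmark for Arabic tabular data; 2. A method for building a high-quality dataset that combines LLM generation with human verification; 3. A fully automated evaluation framework whose performance is close to that of human judges.
Method: A hybrid pipeline in which LLMs generate initial content that human experts then filter and verify, plus an automated evaluation framework based on a self-deliberation mechanism.
Result: Initial analyses show that LLMs perform well on simple tasks such as direct question answering but still face significant challenges on complex reasoning and fact verification.
Insight: Arabic tabular tasks call for further improvement in complex reasoning; the proposed automated evaluation framework offers a practical route to efficient evaluation.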
Abstract: The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.
[17] Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language
Md Obyedullahil Mamun,Md Adyelullahil Mamun,Arif Ahmad,Md. Imran Hossain Emu
Main category: cs.CL
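TL;DR: The paper studies punctuation restoration for Bangla, a low-resource language. It uses XLM-RoBERTa-large to predict four punctuation marks across diverse text domains and builds a large training corpus with data augmentation, reaching high accuracy.
Details
Motivation: Bangla is a low-resource language with little annotated data for punctuation restoration, yet the task matters for text readability and for post-processing Automatic Speech Recognition (ASR) output.
Contribution: 1) Applies XLM-RoBERTa-large to Bangla punctuation restoration; 2) releases publicly available datasets and code; 3) addresses data scarcity through data augmentation.
Method: XLM-RoBERTa-large predicts four punctuation marks across diverse text domains; data augmentation is applied, with the best model trained at an augmentation factor of alpha = 0.20%.
Result: The model reaches 97.1% accuracy on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set, indicating strong generalization.
Insight: 1) Data augmentation is important for NLP tasks in low-resource languages; 2) transformer models perform well on punctuation restoration; 3) public datasets support future research.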
Abstract: Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
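Punctuation restoration of this kind is commonly framed as per-word labeling: a model assigns each word a label naming the mark that follows it, and the text is rebuilt from words plus labels. The sketch below shows only that reconstruction step. The label names and the ASCII marks are assumptions for illustration and are not taken from the paper; the per-word labels would come from a token classifier such as the one described above.

```python
# Hypothetical label set: four marks plus "O" for "no punctuation after this word".
LABEL_TO_MARK = {
    "O": "",
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTION": "?",
    "EXCLAMATION": "!",
}


def restore_punctuation(words, labels):
    """Rebuild punctuated text from words and one predicted label per word."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    # Append the predicted mark (possibly empty) to each word, then join.
    return " ".join(word + LABEL_TO_MARK[label] for word, label in zip(words, labels))


if __name__ == "__main__":
    words = ["hello", "world", "how", "are", "you"]
    labels = ["COMMA", "PERIOD", "O", "O", "QUESTION"]
    print(restore_punctuation(words, labels))
```

Accuracy figures like those reported above would then be computed by comparing predicted labels with reference labels word by word.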
[18] Generation of Synthetic Clinical Text: A Systematic Review
Basel Alshaikhdeeb,Ahmed Abdelmonem Hemedan,Soumyabrata Ghosh,Irina Balaur,Venkata Satagopam
Main category: cs.CL
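TL;DR: This systematic review covers research on generating synthetic clinical text, analyzing the purposes, techniques, and evaluation methods. Transformer architectures (especially GPTs) are the dominant technique and utility is the most common evaluation aspect. Synthetic text helps with data sparsity and privacy, but privacy protection still needs more human assessment.
Details
Motivation: Clinical NLP tasks often suffer from data sparsity and privacy constraints, and generating synthetic clinical text is an effective way to address both. The paper systematically reviews the relevant research to guide future work.
Contribution: 1) Categorizes the purposes of synthetic text generation (e.g., augmentation, privacy preservation); 2) summarizes the dominant generation techniques (e.g., Transformers); 3) organizes the evaluation methods (e.g., utility, privacy).
Method: A systematic literature search (1,398 articles collected, 94 retained) with quantitative analysis of three core questions: purpose of generation, techniques, and evaluation methods.
Result: Synthetic text works well for data augmentation and downstream tasks, but privacy remains unresolved. Transformers (especially GPTs) are the dominant technique, and utility is the most frequently used evaluation aspect.
Insight: Synthetic text cannot fully replace real data, but it has considerable potential for easing data sparsity and supporting NLP tasks. More human assessment is needed to ensure privacy.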
Abstract: Generating synthetic clinical text is an effective solution for common clinical NLP issues such as sparsity and privacy. This paper presents a systematic review of synthetic medical free-text generation, formulating a quantitative analysis of three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched the PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We identified 94 relevant articles out of 1,398 collected. Considerable attention has been given to the generation of synthetic medical text from 2018 onwards, with the main purposes being text augmentation, assistive writing, corpus building, privacy preservation, annotation, and usefulness. Transformer architectures, especially the GPTs, were the predominant technique used to generate the text. There were four main aspects of evaluation (similarity, privacy, structure, and utility), of which utility was the most frequently used to assess the generated synthetic medical text. Although the generated synthetic medical text demonstrated a moderate ability to stand in for real medical documents in different downstream NLP tasks, it has proven to be a great asset as an augmentation that complements real documents, improving accuracy and overcoming sparsity/undersampling issues. Yet privacy remains a major issue in generating synthetic medical text, and more human assessments are needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.
[19] The Moral Gap of Large Language Models
Maciej Skorski,Alina Landowska
Main category: cs.CL
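TL;DR: The paper provides the first comprehensive comparison of large language models (LLMs) and fine-tuned transformers on moral content detection. LLMs show a substantial performance gap on this task, so fine-tuned models remain the better choice.
Details
Motivation: To assess how well LLMs perform on specialized moral reasoning, in particular whether they can reliably detect moral content in social media text.
Contribution: The first comprehensive performance comparison between LLMs and fine-tuned models on moral foundation detection, exposing the limitations of LLMs.
Method: LLMs and fine-tuned transformers are compared on Twitter and Reddit datasets using ROC, PR, and DET curve analysis.
Result: LLMs exhibit high false negative rates and systematically under-detect moral content, even after prompt engineering; fine-tuned models still perform better.
Insight: For specialized moral reasoning tasks, task-specific fine-tuning remains more effective than prompting a general-purpose LLM.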
Abstract: Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
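A DET curve plots the false negative rate against the false positive rate as the decision threshold varies, which makes under-detection of the kind reported here directly visible. The following is a minimal sketch of how such points can be computed from classifier scores. The function name and the toy scores are illustrative assumptions, not data or code from the study.

```python
def det_points(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, false_negative_rate) triples.

    scores: predicted probability that an item contains moral content.
    labels: 1 for moral content, 0 otherwise.
    An item is predicted positive when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = []
    for t in thresholds:
        false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((t, false_pos / negatives, false_neg / positives))
    return points


if __name__ == "__main__":
    toy_scores = [0.9, 0.4, 0.2, 0.7]  # made-up values for illustration
    toy_labels = [1, 1, 0, 0]
    for t, fpr, fnr in det_points(toy_scores, toy_labels, [0.3, 0.5]):
        print(f"threshold={t} FPR={fpr:.2f} FNR={fnr:.2f}")
```

A model that under-detects moral content shows up as a high false negative rate at thresholds where the false positive rate is already low.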
[20] Effective Multi-Task Learning for Biomedical Named Entity Recognition
João Ruano,Gonçalo M. Correia,Leonor Barreiros,Afonso Mendes
Main category: cs.CL
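TL;DR: The paper proposes SRU-NER, a new approach that handles nested named entity recognition while integrating multiple biomedical datasets through a multi-task learning strategy.
Details
Motivation: Biomedical named entity recognition (NER) is complicated by complex terminology and by annotation inconsistencies across datasets; the work targets both problems.
Contribution: The SRU-NER model, which dynamically adjusts loss computation to mitigate annotation gaps while supporting nested entities and multi-task learning.
Method: A Slot-based Recurrent Unit (SRU) combined with multi-task learning; the loss is adjusted dynamically so that predictions of entity types absent from a given dataset are not penalized.
Result: SRU-NER achieves competitive performance on biomedical and general-domain NER tasks and improves cross-domain generalization.
Insight: Dynamic loss adjustment and multi-task learning are effective ways to handle inconsistent annotation and improve cross-domain generalization.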
Abstract: Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.
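The idea of not penalizing entity types that a dataset never annotates can be illustrated with a masked loss. The sketch below is one simple way to do this, assuming a per-type binary cross-entropy. The function name, the dictionary layout, and the entity type names are hypothetical, and the paper's actual loss may differ.

```python
import math


def masked_bce(probs, targets, annotated_types):
    """Mean binary cross-entropy over the entity types a dataset annotates.

    probs: predicted probability per entity type for one candidate span.
    targets: gold 0/1 label per entity type (missing types count as 0).
    annotated_types: entity types this dataset actually labels; every other
    type is skipped, so predicting it is neither rewarded nor penalized.
    """
    total, count = 0.0, 0
    for entity_type, p in probs.items():
        if entity_type not in annotated_types:
            continue  # absent from this dataset's schema: no loss contribution
        y = targets.get(entity_type, 0)
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    probs = {"Gene": 0.9, "Disease": 0.8}
    gold = {"Gene": 1}
    # A dataset that only annotates genes ignores the "Disease" prediction.
    print(masked_bce(probs, gold, {"Gene"}))
    # A dataset that annotates both types treats "Disease" as a false positive.
    print(masked_bce(probs, gold, {"Gene", "Disease"}))
```

The masked variant returns a lower loss in the first call because the confident "Disease" prediction is not counted as an error when the dataset has no "Disease" annotations.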
[21] GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface
Urchade Zaratiana,Gil Pasternak,Oliver Boyd,George Hurn-Maloney,Ash Lewis
Main category: cs.CL
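TL;DR: GLiNER2 is an efficient multi-task information extraction system. A single unified architecture supports named entity recognition, text classification, and hierarchical structured data extraction, with a schema-based interface and efficient CPU inference.
Details
Motivation: Existing information extraction solutions usually need a specialized model per task or rely on computationally expensive large language models. GLiNER2 aims to be a unified, efficient alternative.
Contribution: GLiNER2, a unified framework for multi-task information extraction with a schema-based interface and CPU efficiency, which substantially improves deployment accessibility.
Method: Built on a pretrained transformer encoder architecture, with multi-task composition exposed through an intuitive schema-based interface.
Result: Experiments show competitive performance on extraction and classification tasks, with better deployment accessibility than LLM-based alternatives.
Insight: A unified schema and an efficient design allow multi-task information extraction with lower compute cost, giving practical applications a lighter-weight option.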
Abstract: Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.
[22] GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
Jiafeng Xiong,Yuting Zhao
Main category: cs.CL
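TL;DR: GIIFT is a graph-guided, inductive, image-free multimodal machine translation framework. A cross-modal Graph Attention Network adapter unifies multimodal knowledge and generalizes it inductively to image-free translation.
Details
Motivation: Existing multimodal machine translation (MMT) methods struggle to exploit the modality gap and are confined to the multimodal domains they were trained on. GIIFT uses scene graphs and cross-modal learning to improve image-free translation.
Contribution: 1. Builds multimodal scene graphs that preserve and integrate modality-specific information; 2. proposes the GIIFT framework, which achieves inductive generalization through a cross-modal Graph Attention Network adapter.
Method: 1. Construct multimodal scene graphs; 2. train in two stages, using a Graph Attention Network to learn multimodal knowledge in a unified fused space.
Result: On Multi30K (English-to-French and English-to-German), GIIFT surpasses existing approaches and reaches state-of-the-art results without images at inference. On the WMT benchmark it improves significantly over image-free translation baselines.
Insight: Scene graphs and a cross-modal adapter capture relations between modalities effectively and improve performance on image-free tasks.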
Abstract: Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.
[23] Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
Ganesh Sapkota,Md Hasibur Rahman
Main category: cs.CL
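TL;DR: The paper proposes a hybrid tokenization strategy that combines 6-mer tokens with BPE-600. It improves DNA language model performance, captures both short and long sequence patterns, and outperforms existing models on the reported tasks.
Details
Motivation: Traditional k-mer tokenization captures local DNA sequence structure well but suffers from uneven token distribution and a limited grasp of global context.
Contribution: A hybrid tokenization strategy combining 6-mer and BPE-600 tokens; the more balanced vocabulary improves the model's understanding of both local sequence structure and global context.
Method: Unique 6-mer tokens are merged with selected BPE tokens generated through 600 BPE cycles. A foundational DNA language model is trained on this vocabulary and fine-tuned on next-k-mer prediction.
Result: The model reaches prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming NT, DNABERT2, and GROVER.
Insight: Hybrid tokenization plays an important role in genomic language modeling and provides a solid basis for downstream DNA sequence analysis and biological research.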
Abstract: This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
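The hybrid vocabulary described above can be pictured as the union of fixed-length k-mers and variable-length tokens learned by byte pair encoding. The toy sketch below runs on a single short string. Several choices are assumptions for illustration and may differ from the paper: the overlapping k-mer windows, the stopping rule for merges, and the function names. The paper also selects a subset of the BPE tokens, which this sketch does not attempt.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k (assumed windowing scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bpe_tokens(seq, cycles):
    """Learn up to `cycles` BPE merges on one sequence; return the merged tokens."""
    symbols = list(seq)
    learned = []
    for _ in range(cycles):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats any more, so further merges add little
        merged = left + right
        learned.append(merged)
        # Replace each non-overlapping occurrence of the pair with the merged symbol.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return learned


def hybrid_vocab(seq, k=6, cycles=600):
    """Union of unique k-mers and learned BPE tokens, sorted for stable output."""
    return sorted(set(kmers(seq, k)) | set(bpe_tokens(seq, cycles)))


if __name__ == "__main__":
    print(hybrid_vocab("ACGTACGTACGT", k=6, cycles=3))
```

The k-mers give every position a fixed-size local view, while the BPE merges add shorter or longer tokens for patterns that repeat often.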
[24] Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
Feng Hong,Geng Yu,Yushi Ye,Haicheng Huang,Huangjie Zheng,Ya Zhang,Yanfeng Wang,Jiangchao Yao
Main category: cs.CL
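TL;DR: The paper proposes WINO, a decoding algorithm that addresses the quality-speed trade-off in Diffusion Large Language Models (DLLMs). Its revokable parallel decoding improves both speed and quality.
Details
Motivation: When existing DLLMs decode quickly in parallel, early errors cannot be undone and accumulate, which degrades performance significantly. A new decoding method is needed.
Contribution: The WINO algorithm, a training-free method that enables revokable parallel decoding in DLLMs and improves the quality-speed trade-off.
Method: WINO uses a parallel draft-and-verify mechanism: it drafts multiple tokens aggressively, then uses the model's bidirectional context to verify them and re-mask suspicious ones for refinement.
Result: On the GSM8K math benchmark WINO accelerates inference by 6x while improving accuracy by 2.58%; on Flickr30K captioning it achieves a 10x speedup with higher performance.
Insight: Revokable decoding mitigates early error accumulation in DLLMs and offers a new approach to efficient, high-quality parallel generation.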
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
[25] AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
Xiaopeng Ke,Hexuan Deng,Xuebo Liu,Jun Rao,Zhenxi Song,Jun Yu,Min Zhang
Main category: cs.CL
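TL;DR: AQuilt is a low-cost, high-relevance data synthesis framework for specialist LLMs that incorporates logic and self-inspection to improve model performance.
Details
Motivation: General-domain LLMs often underperform in specialized domains, and existing data synthesis methods are either computationally expensive or limited in performance. AQuilt targets these problems.
Contribution: The AQuilt framework, whose data comprises Answer, Question, Unlabeled data, Inspection, Logic, and Task type; logic and inspection raise data quality, and customizable task instructions support any task.
Method: Logic and self-inspection are integrated into the synthesis of instruction-tuning data from unlabeled data, at 17% of the production cost of DeepSeek-V3.
Result: Experiments show that AQuilt is comparable to DeepSeek-V3 and that its generated data is more relevant to downstream tasks.
Insight: Logic and self-inspection are key to improving specialist LLMs, and low-cost data synthesis has broad potential.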
Abstract: Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.
[26] TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards
Andreea Nica,Ivan Zakazov,Nicolas Mario Baldwin,Saibo Geng,Robert West
Main category: cs.CL
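TL;DR: TRPrompt incorporates textual feedback into the training of a prompt model, improving LLM reasoning without parameter updates to the target model and reaching state-of-the-art results on math datasets.
Details
Motivation: Prompt optimization methods currently split into training-free methods based on textual feedback and training-based methods that use numerical rewards. TRPrompt unifies the two by using textual feedback directly to train the prompt model.
Contribution: The TRPrompt framework, which iteratively improves a prompt model using textual feedback, requires no prior dataset collection, and yields state-of-the-art query-specific prompts on the GSMHard and MATH datasets.
Method: Textual feedback is used directly in prompt model training. The approach combines an LLM's internalized notion of a "good" prompt with the high-resolution signal of textual rewards, and refines prompt generation iteratively.
Result: On GSMHard and MATH, the query-specific prompts generated by TRPrompt achieve state-of-the-art performance.
Insight: Textual feedback can be used directly to train a prompt model, unifying training-free and training-based approaches and improving LLM reasoning.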
Abstract: Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.
[27] Checklists Are Better Than Reward Models For Aligning Language Models
Vijay Viswanathan,Yanchao Sun,Shuang Ma,Xiang Kong,Meng Cao,Graham Neubig,Tongshuang Wu
Main category: cs.CL
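TL;DR: The paper proposes "Reinforcement Learning from Checklist Feedback" (RLCF): checklists are extracted from instructions and responses are evaluated against them with AI judges and verifier programs, which improves how well language models handle instructions expressing many needs.
Details
Motivation: Reinforcement learning for alignment typically uses fixed criteria such as "helpfulness" and "harmfulness", which limits adaptation to diverse instructions. The paper uses flexible, instruction-specific criteria to broaden the role of reinforcement learning in instruction following.
Contribution: The RLCF method, which uses checklist feedback as the reward signal for reinforcement learning. Among the alignment methods compared, it is the only one that improves performance on every benchmark tested.
Method: Checklists are extracted from instructions; AI judges and specialized verifier programs score how well a response satisfies each item; the scores are combined into a reward for RL training.
Result: RLCF improves performance on all five benchmarks, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard.
Insight: Checklist feedback is a key tool for helping language models serve queries that express many needs, and it is more flexible than reward models built on fixed criteria.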
Abstract: Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
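The step that combines per-item checklist scores into a single reward can be sketched as a weighted average. The abstract does not specify the combination rule, so everything below is an assumption for illustration: the item weights, the averaging of a judge score with a verifier result, and all names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ChecklistItem:
    weight: float                    # hypothetical importance of the requirement
    judge_score: float               # AI-judge score in [0, 1]
    verifier_passed: Optional[bool]  # result of a verifier program, if one applies


def item_score(item: ChecklistItem) -> float:
    """Average the judge score with the verifier result when a verifier exists."""
    if item.verifier_passed is None:
        return item.judge_score
    return (item.judge_score + (1.0 if item.verifier_passed else 0.0)) / 2.0


def checklist_reward(items: Sequence[ChecklistItem]) -> float:
    """Weighted mean of per-item scores, used as the scalar reward for RL."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("checklist needs positive total weight")
    return sum(item.weight * item_score(item) for item in items) / total_weight


if __name__ == "__main__":
    checklist = [
        ChecklistItem(weight=1.0, judge_score=0.8, verifier_passed=True),
        ChecklistItem(weight=1.0, judge_score=0.4, verifier_passed=None),
    ]
    print(checklist_reward(checklist))
```

Because each item is scored separately, a response that satisfies most of a complex instruction still receives partial credit, which is the kind of fine-grained signal a single fixed criterion cannot provide.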
cs.CV
[28] Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
Yi Xin,Juncheng Yan,Qi Qin,Zhen Li,Dongyang Liu,Shicheng Li,Victor Shea-Jay Huang,Yupeng Zhou,Renrui Zhang,Le Zhuo,Tiancheng Han,Xiaoqing Sun,Siqi Luo,Mengmeng Wang,Bin Fu,Yuewen Cao,Hongsheng Li,Guangtao Zhai,Xiaohong Liu,Yu Qiao,Peng Gao
Main category: cs.CV
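TL;DR: Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive model for high-quality image generation and multi-task use. Its quality is on par with diffusion models while offering more flexibility and architectural freedom.
Details
Motivation: Existing approaches rely on pretrained components or hybrid architectures, which restricts design freedom. Lumina-mGPT 2.0 is trained entirely from scratch to provide more flexibility and a unified generative framework.
Contribution: 1. A stand-alone, purely autoregressive model with no dependence on pretrained components; 2. a unified multi-task generation framework; 3. efficient decoding strategies that improve generation quality and speed.
Method: 1. A decoder-only autoregressive architecture; 2. a unified tokenization scheme that supports multiple tasks; 3. inference-time scaling and speculative Jacobi sampling to improve decoding.
Result: Strong results on text-to-image benchmarks (GenEval, DPG), in some cases surpassing diffusion models, and strong multi-task results on the Graph200K benchmark.
Insight: Autoregressive models remain competitive; combined with efficient decoding strategies they can perform considerably better, which suggests a path toward unified multimodal generation.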
Abstract: We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks-including subject-driven generation, image editing, controllable synthesis, and dense prediction-within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
[29] SV3.3B: A Sports Video Understanding Model for Action Recognition
Sai Varun Kodathala,Yashwanth Reddy Vutukoori,Rakesh Vunnam
Main category: cs.CV
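TL;DR: The paper proposes SV3.3B, a lightweight 3.3B-parameter video understanding model. It combines a new temporal motion difference sampling method with self-supervised learning for efficient on-device deployment and improves fine-grained understanding of sports video.
Details
Motivation: Traditional sports video analysis is computationally intensive and lacks fine-grained understanding of movement. Existing models struggle to capture key biomechanical transitions such as preparation, execution, and follow-through. SV3.3B targets these problems.
Contribution: 1. The SV3.3B model, combining temporal motion difference sampling with self-supervised learning; 2. a DWT-VGG16-LDA based keyframe extraction mechanism; 3. better results than closed-source models such as GPT-4o on a subset of the NSVA basketball dataset.
Method: DWT-VGG16-LDA extracts keyframes, followed by a V-DWT-JEPA2 encoder pretrained with mask-denoising objectives and an LLM decoder fine-tuned to generate sports action descriptions.
Result: On a subset of the NSVA basketball dataset, SV3.3B outperforms GPT-4o on text generation metrics and sports-specific metrics, with clear gains in information density, action complexity, and measurement precision, at lower computational cost.
Insight: A lightweight model with efficient frame sampling and self-supervised learning can improve fine-grained sports video understanding while remaining suitable for on-device deployment.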
Abstract: This paper addresses the challenge of automated sports video analysis, which has traditionally been limited by computationally intensive models requiring server-side processing and lacking fine-grained understanding of athletic movements. Current approaches struggle to capture the nuanced biomechanical transitions essential for meaningful sports analysis, often missing critical phases like preparation, execution, and follow-through that occur within seconds. To address these limitations, we introduce SV3.3B, a lightweight 3.3B parameter video understanding model that combines novel temporal motion difference sampling with self-supervised learning for efficient on-device deployment. Our approach employs a DWT-VGG16-LDA based keyframe extraction mechanism that intelligently identifies the 16 most representative frames from sports sequences, followed by a V-DWT-JEPA2 encoder pretrained through mask-denoising objectives and an LLM decoder fine-tuned for sports action description generation. Evaluated on a subset of the NSVA basketball dataset, SV3.3B achieves superior performance across both traditional text generation metrics and sports-specific evaluation criteria, outperforming larger closed-source models including GPT-4o variants while maintaining significantly lower computational requirements. Our model demonstrates exceptional capability in generating technically detailed and analytically rich sports descriptions, achieving 29.2% improvement over GPT-4o in ground truth validation metrics, with substantial improvements in information density, action complexity, and measurement precision metrics essential for comprehensive athletic analysis. Model Available at https://huggingface.co/sportsvision/SV3.3B.
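Motion-based keyframe selection can be illustrated with a much simpler stand-in than the DWT-VGG16-LDA mechanism the paper describes. The sketch below scores each frame by its mean absolute pixel difference from the previous frame and keeps the k highest-scoring frames. This scoring rule is a substitute chosen for brevity, not the authors' method, and the flat-list frame format is an assumption.

```python
def select_keyframes(frames, k):
    """Pick the k frames that differ most from their predecessor.

    frames: list of equally sized flat lists of pixel intensities.
    Returns the selected frame indices in temporal order.
    """
    scored = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # Mean absolute difference as a crude motion score (stand-in metric).
        diff = sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
        scored.append((diff, i))
    top = sorted(scored, reverse=True)[:k]
    return sorted(index for _, index in top)


if __name__ == "__main__":
    toy_frames = [[0, 0], [0, 0], [5, 5], [5, 5], [9, 9]]  # made-up tiny "video"
    print(select_keyframes(toy_frames, k=2))
```

In the paper's setting k would be 16, and the selected frames would feed the encoder rather than being printed.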
[30] Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models
Lifeng Chen,Jiner Wang,Zihao Pan,Beier Zhu,Xiaofeng Yang,Chi Zhang
Main category: cs.CV
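TL;DR: Detail++ is a training-free detail enhancement framework. Its progressive detail injection strategy addresses complex prompts and multi-subject attribute binding in text-to-image generation.
Details
Motivation: Existing text-to-image models handle complex prompts and multi-subject attribute binding poorly. Inspired by how people draw, the authors propose generating in stages.
Contribution: 1. A Progressive Detail Injection (PDI) strategy; 2. use of self-attention's layout control to fix global composition and of cross-attention to bind attributes to the right subjects; 3. a Centroid Alignment Loss applied at test time to reduce binding noise.
Method: 1. Decompose a complex prompt into a sequence of simplified sub-prompts; 2. generate in stages, first securing the global layout and then refining details; 3. refine binding with cross-attention and the Centroid Alignment Loss.
Result: On T2I-CompBench and a newly constructed style composition benchmark, Detail++ significantly outperforms existing methods, especially with multiple objects and complex stylistic conditions.
Insight: Staged generation with progressive detail injection improves quality on complex scenes, and the test-time Centroid Alignment Loss is an effective way to tighten attribute binding.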
Abstract: Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
[31] FishDet-M: A Unified Large-Scale Benchmark for Robust Fish Detection and CLIP-Guided Model Selection in Diverse Aquatic Visual Domains
Muayad Abujabal,Lyes Saad Saoud,Irfan Hussain
Main category: cs.CV
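TL;DR: The paper presents FishDet-M, a unified large-scale benchmark for fish detection covering 13 datasets, together with a CLIP-guided model selection framework that improves cross-domain performance and adaptability.
Details
Motivation: Fish detection is held back by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols, which limits practical deployment. FishDet-M addresses these gaps with a standardized evaluation platform.
Contribution: 1) FishDet-M, the largest unified benchmark for fish detection; 2) a systematic evaluation of 28 contemporary detection models; 3) a CLIP-guided dynamic model selection framework.
Method: 1) Harmonize 13 fish datasets under COCO-style annotations; 2) evaluate a range of detectors with standard metrics; 3) use CLIP's vision-language alignment for zero-shot model selection.
Result: FishDet-M exposes the accuracy-efficiency trade-offs across models, and the CLIP-based selection strategy achieves high performance without ensemble computation.
Insight: CLIP's semantic alignment makes it possible to pick the most suitable detector per image, offering a scalable solution for real-time underwater vision.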
Abstract: Accurate fish detection in underwater imagery is essential for ecological monitoring, aquaculture automation, and robotic perception. However, practical deployment remains limited by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols. To address these gaps, we present FishDet-M, the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments including marine, brackish, occluded, and aquarium scenes. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks, enabling consistent and scalable cross-domain evaluation. We systematically benchmark 28 contemporary object detection models, covering the YOLOv8 to YOLOv12 series, R-CNN based detectors, and DETR based models. Evaluations are conducted using standard metrics including mAP, mAP@50, and mAP@75, along with scale-specific analyses (AP$_S$, AP$_M$, AP$_L$) and inference profiling in terms of latency and parameter count. The results highlight the varying detection performance across models trained on FishDet-M, as well as the trade-off between accuracy and efficiency across models of different architectures. To support adaptive deployment, we introduce a CLIP-based model selection framework that leverages vision-language alignment to dynamically identify the most semantically appropriate detector for each input image. This zero-shot selection strategy achieves high performance without requiring ensemble computation, offering a scalable solution for real-time applications. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes. All datasets, pretrained models, and evaluation tools are publicly available to facilitate future research in underwater computer vision and intelligent marine systems.
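Selecting a detector per image through vision-language alignment can be sketched as a nearest-prompt lookup in embedding space: each candidate detector is associated with a text embedding describing its domain, and the image embedding picks the closest one. The sketch below assumes the embeddings have already been computed by a CLIP-style model. The detector names, the two-dimensional vectors, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def select_detector(image_embedding, prompt_embeddings):
    """Return the detector whose text-prompt embedding best matches the image.

    prompt_embeddings maps a detector name to the embedding of a text prompt
    describing the scenes that detector handles well (assumed precomputed).
    """
    return max(
        prompt_embeddings,
        key=lambda name: cosine(image_embedding, prompt_embeddings[name]),
    )


if __name__ == "__main__":
    prompts = {
        "marine_detector": [1.0, 0.0],    # toy embedding for "open marine scene"
        "aquarium_detector": [0.0, 1.0],  # toy embedding for "aquarium tank"
    }
    print(select_detector([0.9, 0.1], prompts))
```

Only the selected detector is then run on the image, which is why this kind of routing avoids the cost of running an ensemble.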
[32] DiNAT-IR: Exploring Dilated Neighborhood Attention for High-Quality Image Restoration
Hanzhou Liu,Binghan Li,Chengkai Liu,Mi Lu
Main category: cs.CV
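TL;DR: The paper proposes DiNAT-IR, a Transformer-based architecture for image restoration. It combines Dilated Neighborhood Attention (DiNA) with a channel-aware module to balance global and local information and deliver efficient, high-quality restoration.
Details
Motivation: Transformer methods model long-range dependencies in image restoration but are computationally expensive and can miss local detail. Restormer gains efficiency through channel-wise self-attention but may overlook localized artifacts. The paper addresses this with DiNA and a channel-aware module.
Contribution: 1. Introduces Dilated Neighborhood Attention (DiNA), which integrates sliding-window attention with mixed dilation factors to balance global context and local precision. 2. Proposes a channel-aware module that strengthens global context understanding and compensates for the limits of local attention. 3. Designs the DiNAT-IR architecture, which achieves competitive results on multiple image restoration tasks.
Method: 1. Use DiNA to expand the attention receptive field, with mixed dilation factors capturing multi-scale information. 2. Add a channel-aware module that supplies global context on top of local attention. 3. Combine these modules in the DiNAT-IR architecture.
Result: DiNAT-IR achieves competitive results across multiple image restoration benchmarks, offering a high-quality solution for diverse low-level vision problems.
Insight: 1. Combining global and local information is key to restoration quality. 2. A channel-aware module compensates for the limited global context of local attention.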
Abstract: Transformers, with their self-attention mechanisms for modeling long-range dependencies, have become a dominant paradigm in image restoration tasks. However, the high computational cost of self-attention limits scalability to high-resolution images, making efficiency-quality trade-offs a key research focus. To address this, Restormer employs channel-wise self-attention, which computes attention across channels instead of spatial dimensions. While effective, this approach may overlook localized artifacts that are crucial for high-quality image restoration. To bridge this gap, we explore Dilated Neighborhood Attention (DiNA) as a promising alternative, inspired by its success in high-level vision tasks. DiNA balances global context and local precision by integrating sliding-window attention with mixed dilation factors, effectively expanding the receptive field without excessive overhead. However, our preliminary experiments indicate that directly applying this global-local design to the classic deblurring task hinders accurate visual restoration, primarily due to the constrained global context understanding within local attention. To address this, we introduce a channel-aware module that complements local attention, effectively integrating global context without sacrificing pixel-level precision. The proposed DiNAT-IR, a Transformer-based architecture specifically designed for image restoration, achieves competitive results across multiple benchmarks, offering a high-quality solution for diverse low-level computer vision problems.
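How dilation widens the receptive field of neighborhood attention without adding keys can be shown with index arithmetic alone. The one-dimensional sketch below lists which positions a query attends to for a given kernel size and dilation. It drops positions that fall outside the sequence, which is a simplification of boundary handling, and the function names are illustrative.

```python
def dilated_neighborhood(i, length, kernel_size, dilation):
    """Positions attended to by query i in a 1D sequence.

    The neighborhood holds kernel_size positions centered on i and spaced
    `dilation` apart; out-of-range positions are dropped (simplified borders).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the window is centered")
    half = kernel_size // 2
    candidates = (i + dilation * j for j in range(-half, half + 1))
    return [p for p in candidates if 0 <= p < length]


def mixed_dilation_neighborhoods(i, length, kernel_size, dilations):
    """Neighborhoods for several dilation factors, e.g. one per layer or head."""
    return {d: dilated_neighborhood(i, length, kernel_size, d) for d in dilations}


if __name__ == "__main__":
    # Same number of keys per query, but a wider span as dilation grows.
    print(mixed_dilation_neighborhoods(5, length=10, kernel_size=3, dilations=[1, 4]))
```

With dilation 1 the query sees only its immediate neighbors; with dilation 4 the same three-key window spans most of the sequence, which is the global-local balance the abstract refers to.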
[33] OPEN: A Benchmark Dataset and Baseline for Older Adult Patient Engagement Recognition in Virtual Rehabilitation Learning Environments
Ali Abedi,Sadaf Safa,Tracey J. F. Colella,Shehroz S. Khan
Main category: cs.CV
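TL;DR: The paper introduces OPEN, a dataset for recognizing engagement of older adult patients in virtual rehabilitation learning environments. It fills a gap in engagement research on older adults and provides multimodal data and baseline models.
Details
Motivation: Engagement is essential to outcomes in virtual learning and rehabilitation, yet research and datasets focused on older adults are limited, and existing methods neglect context and the longitudinal nature of engagement across sessions.
Contribution: 1) OPEN, a novel dataset focused on engagement in older adult patients and the largest of its kind; 2) multimodal data (facial, hand, and body joint landmarks, plus affective and behavioral features) with contextual annotations; 3) baseline model results.
Method: Data were collected from virtual group learning sessions held as part of cardiac rehabilitation. Multimodal features (landmarks and affective and behavioral features) were extracted from video and annotated with engagement states. Machine learning and deep learning models were then trained.
Result: Baseline models achieve engagement recognition accuracy of up to 81%, demonstrating the dataset's utility.
Insight: OPEN provides a data foundation for engagement research in older adults and highlights the value of combining context with multimodal data, opening a direction for personalized engagement modeling.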
Abstract: Engagement in virtual learning is essential for participant satisfaction, performance, and adherence, particularly in online education and virtual rehabilitation, where interactive communication plays a key role. Yet, accurately measuring engagement in virtual group settings remains a challenge. There is increasing interest in using artificial intelligence (AI) for large-scale, real-world, automated engagement recognition. While engagement has been widely studied in younger academic populations, research and datasets focused on older adults in virtual and telehealth learning settings remain limited. Existing methods often neglect contextual relevance and the longitudinal nature of engagement across sessions. This paper introduces OPEN (Older adult Patient ENgagement), a novel dataset supporting AI-driven engagement recognition. It was collected from eleven older adults participating in weekly virtual group learning sessions over six weeks as part of cardiac rehabilitation, producing over 35 hours of data, making it the largest dataset of its kind. To protect privacy, raw video is withheld; instead, the released data include facial, hand, and body joint landmarks, along with affective and behavioral features extracted from video. Annotations include binary engagement states, affective and behavioral labels, and context-type indicators, such as whether the instructor addressed the group or an individual. The dataset offers versions with 5-, 10-, 30-second, and variable-length samples. To demonstrate utility, multiple machine learning and deep learning models were trained, achieving engagement recognition accuracy of up to 81 percent. OPEN provides a scalable foundation for personalized engagement modeling in aging populations and contributes to broader engagement recognition research.
[34] Bearded Dragon Activity Recognition Pipeline: An AI-Based Approach to Behavioural Monitoring
Arsen Yermukan,Pedro Machado,Feliciano Domingos,Isibor Kennedy Ihianle,Jordan J. Bird,Stefano S. K. Kaburu,Samantha J. Ward
Main category: cs.CV
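TL;DR: The paper presents an automated system based on YOLO object detection models for real-time monitoring of two key bearded dragon behaviours, basking and hunting. After training several YOLO variants, YOLOv8s was selected as the best model. The system is fast and accurate overall, but hunting detection is less accurate.
Details
Motivation: Traditional monitoring of bearded dragon behaviour is time-consuming and error-prone, so an efficient, automated alternative is needed. Contribution: A YOLO-based real-time video analysis system that automatically identifies two key bearded dragon behaviours, with feasibility demonstrated on a publicly available dataset.
Method: Five YOLO variants (v5, v7, v8, v11, v12) were trained on a custom dataset. The system extracts per-frame object coordinates, applies temporal interpolation, and classifies behaviours with rule-based logic. YOLOv8s was chosen for its balance of speed and accuracy.
Result: YOLOv8s reached mAP@0.5:0.95 = 0.855 and basking detection was reliable, but hunting detection was weaker because of poor cricket detection (mAP@0.5 = 0.392).
Insight: Detecting small objects such as crickets remains challenging; expanded datasets or specialised small-object detectors may improve it.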
Abstract: Traditional monitoring of bearded dragon (Pogona vitticeps) behaviour is time-consuming and prone to errors. This project introduces an automated system for real-time video analysis, using You Only Look Once (YOLO) object detection models to identify two key behaviours: basking and hunting. We trained five YOLO variants (v5, v7, v8, v11, v12) on a custom, publicly available dataset of 1200 images, encompassing bearded dragons (600), heating lamps (500), and crickets (100). YOLOv8s was selected as the optimal model due to its superior balance of accuracy (mAP@0.5:0.95 = 0.855) and speed. The system processes video footage by extracting per-frame object coordinates, applying temporal interpolation for continuity, and using rule-based logic to classify specific behaviours. Basking detection proved reliable. However, hunting detection was less accurate, primarily due to weak cricket detection (mAP@0.5 = 0.392). Future improvements will focus on enhancing cricket detection through expanded datasets or specialised small-object detectors. This automated system offers a scalable solution for monitoring reptile behaviour in controlled environments, significantly improving research efficiency and data quality.
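The post-detection stage described in this abstract (per-frame coordinates, temporal interpolation, rule-based classification) could be sketched roughly as follows. This is a minimal illustration only: the function names, the box format, and the distance rule with its 150-pixel threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def interpolate_boxes(track: List[Optional[Box]]) -> List[Optional[Box]]:
    """Linearly fill missing detections (None) lying between two detected frames."""
    filled = list(track)
    known = [i for i, box in enumerate(track) if box is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = tuple(
                (1 - w) * a + w * b for a, b in zip(track[left], track[right])
            )
    return filled


def centre(box: Box) -> Tuple[float, float]:
    """Centre point of a bounding box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def is_basking(dragon: Optional[Box], lamp: Optional[Box],
               max_dist: float = 150.0) -> bool:
    """Hypothetical rule: the dragon's centre is within max_dist of the lamp's centre."""
    if dragon is None or lamp is None:
        return False
    (dx, dy), (lx, ly) = centre(dragon), centre(lamp)
    return math.hypot(dx - lx, dy - ly) <= max_dist
```

Gaps before the first or after the last detection are left as None here, since there is nothing to interpolate between.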
[35] AG-VPReID.VIR: Bridging Aerial and Ground Platforms for Video-based Visible-Infrared Person Re-ID
Huy Nguyen,Kien Nguyen,Akila Pemasiri,Akmal Jahan,Clinton Fookes,Sridha Sridharan
Main category: cs.CV
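TL;DR: The paper releases AG-VPReID.VIR, the first aerial-ground cross-modality video-based person re-identification dataset, and proposes TCC-VPReID, a three-stream architecture that uses style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling to improve cross-platform and cross-modality person Re-ID.
Details
Motivation: Existing person Re-ID datasets focus mainly on ground-level views and overlook the advantages of aerial views with respect to occlusion, coverage, and obstructions. The paper aims to fill this gap with the first Re-ID dataset spanning aerial and ground platforms and visible and infrared modalities. Contribution: 1. Release of AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset; 2. TCC-VPReID, a three-stream architecture that addresses the joint challenges through style-robust feature learning, cross-view adaptation, and temporal modeling.
Method: TCC-VPReID uses a three-stream architecture that integrates style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling to bridge the domain gaps between aerial and ground views and between RGB and IR modalities.
Result: Experiments show that TCC-VPReID significantly outperforms existing methods on AG-VPReID.VIR, supporting its effectiveness on cross-platform and cross-modality challenges.
Insight: Aerial views offer distinctive occlusion resistance and wide coverage for person Re-ID, and combining them with multimodal data can further improve round-the-clock surveillance systems.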
Abstract: Person re-identification (Re-ID) across visible and infrared modalities is crucial for 24-hour surveillance systems, but existing datasets primarily focus on ground-level perspectives. While ground-based IR systems offer nighttime capabilities, they suffer from occlusions, limited coverage, and vulnerability to obstructions, problems that aerial perspectives uniquely solve. To address these limitations, we introduce AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. AG-VPReID.VIR presents unique challenges including cross-viewpoint variations, modality discrepancies, and temporal dynamics. Additionally, we propose TCC-VPReID, a novel three-stream architecture designed to address the joint challenges of cross-platform and cross-modality person Re-ID. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling. Experiments show that AG-VPReID.VIR presents distinctive challenges compared to existing datasets, with our TCC-VPReID framework achieving significant performance gains across multiple evaluation protocols. Dataset and code are available at https://github.com/agvpreid25/AG-VPReID.VIR.
[36] Exploring the interplay of label bias with subgroup size and separability: A case study in mammographic density classification
Emma A. M. Stanley,Raghav Mehta,Mélanie Roschewitz,Nils D. Forkert,Ben Glocker
Main category: cs.CV
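TL;DR: The study examines how label bias in medical imaging data affects deep learning model performance and finds that the relative size and separability of the affected subgroup strongly influence feature representations and performance.
Details
Motivation: Systematic label bias affecting specific subgroups in medical imaging data is understudied and may undermine the fairness of medical AI systems. Contribution: Shows how label bias shapes a model's feature representations and performance through subgroup size and separability, offering insights relevant to medical AI fairness.
Method: Deep learning models were trained on the EMBED dataset with simulated label bias affecting separable and non-separable subgroups, and the resulting changes in feature representations and performance were analysed.
Result: Label bias caused prominent shifts in the feature space. Subgroup performance depended on whether the validation set used to set the classification threshold had clean labels; with bias affecting the majority separable subgroup, that subgroup's true positive rate fell from 0.898 (clean validation labels) to 0.518 (biased validation labels).
Insight: The effect of label bias on model fairness is complex and depends on subgroup characteristics and on validation label quality, which underlines the importance of a validation set with clean labels.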
Abstract: Systematic mislabelling affecting specific subgroups (i.e., label bias) in medical imaging datasets represents an understudied issue concerning the fairness of medical AI systems. In this work, we investigated how size and separability of subgroups affected by label bias influence the learned features and performance of a deep learning model. Therefore, we trained deep learning models for binary tissue density classification using the EMory BrEast imaging Dataset (EMBED), where label bias affected separable subgroups (based on imaging manufacturer) or non-separable “pseudo-subgroups”. We found that simulated subgroup label bias led to prominent shifts in the learned feature representations of the models. Importantly, these shifts within the feature space were dependent on both the relative size and the separability of the subgroup affected by label bias. We also observed notable differences in subgroup performance depending on whether a validation set with clean labels was used to define the classification threshold for the model. For instance, with label bias affecting the majority separable subgroup, the true positive rate for that subgroup fell from 0.898, when the validation set had clean labels, to 0.518, when the validation set had biased labels. Our work represents a key contribution toward understanding the consequences of label bias on subgroup fairness in medical imaging AI.
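The dependence of subgroup true positive rate on the labels used to set the classification threshold can be illustrated with a toy example. The scores, labels, and the threshold criterion (Youden's J) below are assumptions made for illustration; the abstract does not state how the threshold was chosen, and the numbers are not the paper's data.

```python
from typing import List, Tuple


def rates(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def pick_threshold(scores: List[float], labels: List[int]) -> float:
    """Choose the candidate threshold maximizing Youden's J = TPR - FPR (assumed criterion)."""
    def youden(t: float) -> float:
        tpr, fpr = rates(scores, labels, t)
        return tpr - fpr
    return max(sorted(set(scores)), key=youden)


# Toy validation scores; in the biased copy two true positives are mislabelled as negative.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
clean = [0, 0, 0, 0, 1, 1, 1, 1]
biased = [0, 0, 0, 0, 0, 0, 1, 1]

t_clean = pick_threshold(scores, clean)    # 0.6
t_biased = pick_threshold(scores, biased)  # 0.8

# TPR is always measured against the clean labels.
tpr_clean = rates(scores, clean, t_clean)[0]    # 1.0
tpr_biased = rates(scores, clean, t_biased)[0]  # 0.5
```

In this toy setting, under-labelling positives pushes the chosen threshold up, so genuinely positive cases with moderate scores are missed.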
[37] Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold
Jaeho Shin,Hyeonjae Gil,Junwoo Jang,Maani Ghaffari,Ayoung Kim
Main category: cs.CV
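TL;DR: The paper proposes an optimizable cost function on the Grassmann manifold for aligning affine subspaces under rigid transformations, addressing the inability of existing methods to express the distance as a function of the transformation.
Details
Motivation: Existing Grassmann-manifold methods can measure the proximity between affine subspaces but cannot express the distance as an explicit function of the rigid transformation, which limits their use in registration problems. Contribution: The first explicit derivation of an optimizable cost function with respect to the rigid transformation (rotation $\mathbf{R}$ and translation $\mathbf{t}$), with a mathematical proof that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost.
Method: Proposes an optimizable cost function based on the geodesic distance on the Grassmann manifold and extends it to an inlier-set maximizing branch-and-bound (BnB) solver.
Result: The method improves the convergence of existing solutions or outperforms them in various computer vision tasks. The code is open source.
Insight: Directly minimizing the geodesic distance, which is agnostic to representation ambiguity, makes it possible to find a globally optimal solution.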
Abstract: The affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing branch-and-bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available at https://github.com/joomeok/GrassmannRegistration.
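As a simplified illustration of a geodesic-distance cost that depends explicitly on a transformation, the sketch below handles only the direction part of 3D lines (one-dimensional linear subspaces) and a rotation about the z-axis. It is not the paper's cost function: the affine offset and the translation $\mathbf{t}$ are omitted, and all function names are hypothetical.

```python
import math
from typing import List

Vec = List[float]


def unit(v: Vec) -> Vec:
    """Normalize a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rotate_z(v: Vec, angle: float) -> Vec:
    """Rotate a 3D vector about the z-axis by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]


def line_geodesic(u: Vec, v: Vec) -> float:
    """Principal angle between span(u) and span(v): the geodesic distance on Gr(1, 3).

    The absolute value makes the result independent of the sign of either basis
    vector, i.e. of which of the two unit vectors represents the same line.
    """
    dot = abs(sum(a * b for a, b in zip(unit(u), unit(v))))
    return math.acos(min(1.0, dot))


def rotation_cost(angle: float, source_dirs: List[Vec], target_dirs: List[Vec]) -> float:
    """Sum of squared geodesic distances after rotating the source directions."""
    return sum(
        line_geodesic(rotate_z(s, angle), t) ** 2
        for s, t in zip(source_dirs, target_dirs)
    )
```

For subspaces of higher dimension, the principal angles would come from the singular values of the product of orthonormal bases; the one-dimensional case keeps the example dependency-free.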
[38] GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures
Jake R. Patock,Nicole Catherine Lewis,Kevin McCoy,Christina Gomez,Canling Chen,Lorenzo Luzi
Main category: cs.CV
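TL;DR: GRR-CoCa incorporates three LLM mechanisms (Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding) into the multimodal CoCa model and significantly improves its performance.
Details
Motivation: Leading multimodal model architectures lag behind LLMs in architectural sophistication, and the authors aim to improve CoCa by adopting techniques proven in LLMs. Contribution: Proposes GRR-CoCa, which brings three LLM techniques into a multimodal model and significantly improves CoCa on contrastive and generative tasks.
Method: Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding are incorporated into CoCa's textual decoders and ViT encoder.
Result: Compared with Baseline CoCa, GRR-CoCa performs significantly better in pretraining and fine-tuning, with clear improvements in contrastive loss, perplexity, and CoCa loss.
Insight: Architectural techniques from LLMs can transfer effectively to multimodal models, improving performance and generalization on vision-language tasks.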
Abstract: State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.
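The three mechanisms named in this abstract can be written compactly on plain Python lists. This is a dependency-free sketch of their standard formulations, not GRR-CoCa's implementation: the interleaved-pair RoPE layout, the base of 10000, and the omission of the linear projections that normally feed the GEGLU gate and value are simplifying assumptions.

```python
import math
from typing import List, Optional


def rms_norm(x: List[float], gain: Optional[List[float]] = None,
             eps: float = 1e-6) -> List[float]:
    """Root mean squared normalization: scale x by 1 / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


def gelu(v: float) -> float:
    """Exact Gaussian error linear unit."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def geglu(gate: List[float], value: List[float]) -> List[float]:
    """Gated linear unit with a GELU gate: GELU(gate) * value, elementwise."""
    return [gelu(g) * v for g, v in zip(gate, value)]


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotary positional embedding: rotate each (even, odd) pair by a position-dependent angle."""
    d = len(x)  # assumed to be even
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because RoPE applies pure rotations, it preserves the vector norm and leaves position 0 unchanged.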
[39] Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics
Yuezun Li,Delong Zhu,Xinjie Cui,Siwei Lyu
Main category: cs.CV
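TL;DR: The paper introduces Celeb-DF++, a large-scale and diverse DeepFake video benchmark aimed at the generalizable forensics challenge. The dataset covers three common forgery scenarios and uses many recent DeepFake methods to generate high-quality videos for evaluating how well detection methods generalize.
Details
Motivation: As AI technology advances, DeepFake videos are becoming more diverse, and existing datasets, with their limited forgery types, cannot support the development of generalizable detection methods. A larger and more diverse dataset is therefore needed to advance generalizable forensics. Contribution: 1. The Celeb-DF++ dataset, covering three common forgery scenarios with videos generated by 22 different DeepFake methods. 2. Evaluation protocols for measuring the generalizability of 24 existing detection methods.
Method: 1. Extends the earlier Celeb-DF dataset with more forgery types and high-quality videos. 2. Generates data with a range of DeepFake methods spanning different architectures and generation pipelines.
Result: Celeb-DF++ exposes the limited generalizability of existing detection methods and confirms the difficulty of the new dataset.
Insight: Forgery diversity is key to advancing generalizable forensics, and existing detection methods still need improvement to handle new DeepFake types.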
Abstract: The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 different recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.
[40] ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks
Ahmad ALBarqawi,Mahmoud Nazzal,Issa Khalil,Abdallah Khreishah,NhatHai Phan
Main category: cs.CV
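TL;DR: ViGText proposes a new approach that combines Vision Large Language Model (VLLM) text explanations with Graph Neural Networks (GNNs) to improve deepfake image detection, with marked gains in generalization and robustness.
Details
Motivation: Deepfake technology is increasingly sophisticated, and traditional detection methods perform poorly on generalization and under adversarial attacks. ViGText offers a more comprehensive analysis by combining visual and textual information. Contribution: 1. Integrates VLLM text explanations with image data for context-aware analysis; 2. Uses GNNs to combine image and text graph structures; 3. Significantly improves detection of customized deepfakes.
Method: 1. Divides images into patches and constructs image and text graphs; 2. Uses GNNs to integrate multi-level features from the spatial and frequency domains; 3. Generates detailed text explanations with a vision-language model to support the analysis.
Result: 1. Average F1 score rises from 72.45% to 98.32% under generalization evaluation; 2. Recall is 11.1% higher than that of other methods; 3. Classification performance degrades by less than 4% under targeted attacks.
Insight: Combining visual and textual information helps capture subtle inconsistencies in deepfakes, and integrating them through graph structures with GNNs further improves robustness and generalization.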
Abstract: The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy in detecting sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model’s superior ability to generalize to unseen, fine-tuned variations of stable diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
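The first step described in this abstract, dividing an image into patches and building an image graph, might look like the sketch below. The 4-neighbour grid connectivity and the function name are assumptions for illustration; the abstract does not specify how ViGText connects patches or attaches text nodes.

```python
from typing import List, Tuple

PatchBox = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels


def patch_grid_graph(height: int, width: int,
                     patch: int) -> Tuple[List[PatchBox], List[Tuple[int, int]]]:
    """Split an image into non-overlapping square patches and link 4-neighbours.

    Returns the patch boxes (graph nodes, in row-major order) and a list of
    undirected edges given as pairs of node indices.
    """
    rows, cols = height // patch, width // patch
    nodes = [
        (r * patch, c * patch, (r + 1) * patch, (c + 1) * patch)
        for r in range(rows)
        for c in range(cols)
    ]
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))      # right neighbour
            if r + 1 < rows:
                edges.append((i, i + cols))   # bottom neighbour
    return nodes, edges
```

A node list and edge list in this form is the usual input shape for GNN libraries, after per-patch features are attached to each node.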
[41] Enhancing Scene Transition Awareness in Video Generation via Post-Training
Hanwen Shen,Jiajie Lu,Yupeng Cao,Xiaonan Yang
Main category: cs.CV
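TL;DR: To address the lack of scene transition awareness in current AI video generation models when producing multi-scene videos, the paper proposes TAV, a video dataset containing multiple scene transitions, and uses post-training to improve the model's ability to generate scene transitions.
Details
Motivation: Existing text-to-video models perform well on short single-scene clips but struggle with longer videos that require multiple scene transitions, mainly because they have not learned to understand scene transitions. Contribution: The TAV dataset of videos with multiple scene transitions, and a post-training procedure that strengthens scene transition awareness and improves multi-scene video generation.
Method: Builds the TAV dataset of video clips containing multiple scene transitions and post-trains existing models on it to improve their understanding and generation of scene transitions.
Result: Experiments show that post-training on TAV improves multi-scene video generation, narrows the gap between the scenes required by the prompt and those generated, and maintains image quality.
Insight: Learning multi-scene transitions is key to improving long video generation, and purpose-built datasets such as TAV can compensate for this weakness in current models.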
Abstract: Recent advances in AI-generated video have shown strong performance on text-to-video tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we propose the Transition-Aware Video (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the TAV dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
[42] BokehDiff: Neural Lens Blur with One-Step Diffusion
Chengxuan Zhu,Qingnan Fan,Qi Zhang,Jinwei Chen,Huaqi Zhang,Chao Xu,Boxin Shi
Main category: cs.CV
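TL;DR: BokehDiff proposes a lens blur rendering method built on a diffusion prior, using a physics-inspired self-attention module and depth-dependent constraints to produce high-quality results with one-step inference.
Details
Motivation: Existing methods are limited by the accuracy of depth estimation and tend to produce artifacts at depth discontinuities. The authors aim to improve lens blur rendering with a generative diffusion prior. Contribution: 1. A physics-inspired self-attention module aligned with the image formation process; 2. A one-step diffusion inference scheme that introduces no additional noise; 3. Synthesis of photorealistic foregrounds with transparency to address data scarcity.
Method: Uses a self-attention module with a depth-dependent circle of confusion constraint and self-occlusion effects, and adapts the diffusion model to a one-step inference scheme.
Result: The method produces physically accurate and visually appealing lens blur and addresses artifacts at depth discontinuities.
Insight: Diffusion models can be applied to image processing tasks with one-step inference, and the diversity of synthetic data is important for model performance.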
Abstract: We introduce BokehDiff, a novel lens blur rendering method that achieves physically accurate and visually appealing outcomes, with the help of generative diffusion prior. Previous methods are bounded by the accuracy of depth estimation, generating artifacts in depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating depth-dependent circle of confusion constraint and self-occlusion effects. We adapt the diffusion model to the one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired data, we propose to synthesize photorealistic foregrounds with transparency with diffusion models, balancing authenticity and scene diversity.
[43] Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement
Xiaoran Sun,Liyan Wang,Cong Wang,Yeying Jin,Kin-man Lam,Zhixun Su,Yang Yang,Jinshan Pan
Main category: cs.CV
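TL;DR: The paper proposes VLM-IMI, a low-light image enhancement method that combines large vision-language models (VLMs) with iterative and manual instructions (IMIs), using semantic guidance to produce detailed and semantically aligned results.
Details
Motivation: Existing low-light image enhancement methods rely on pre-trained model priors or low-light inputs and make little use of semantic information from normal-light images, which limits their effectiveness under complex lighting. Contribution: 1. The VLM-IMI framework, which combines VLMs and IMIs for semantically guided low-light enhancement; 2. An instruction prior fusion module that dynamically aligns and fuses image and text features; 3. An iterative and manual instruction refinement strategy that progressively improves visual quality.
Method: 1. Uses textual descriptions from a vision-language model as enhancement cues; 2. Designs an instruction prior fusion module to align image and text features; 3. Refines the textual instructions through iterative and manual instructions to progressively improve the result.
Result: Across diverse scenarios and lighting conditions down to extremely low light, VLM-IMI outperforms existing methods in both quantitative metrics and perceptual quality.
Insight: Dynamically fusing semantic information (such as textual descriptions) with visual features substantially improves detail recovery and semantic consistency in low-light enhancement.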
Abstract: Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at https://github.com/sunxiaoran01/VLM-IMI.
[44] TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound
Pascal Spiegler,Taha Koleilat,Arash Harirpoush,Corey S. Miller,Hassan Rivaz,Marta Kersten-Oertel,Yiming Xiao
Main category: cs.CV
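TL;DR: TextSAM-EUS is a lightweight, text-driven adaptation that combines SAM with the BiomedCLIP text encoder to segment pancreatic tumors automatically, without manual geometric prompts. It outperforms existing supervised deep learning models and foundation models on an EUS dataset.
Details
Motivation: Pancreatic cancer has a poor prognosis and currently relies on EUS for targeted biopsy and radiotherapy. EUS images suffer from speckle noise and low contrast, and conventional fully supervised deep learning needs large expert-annotated datasets while remaining error-prone, so a more efficient automatic segmentation method is needed. Contribution: 1. TextSAM-EUS, the first method to bring prompt learning into SAM-based medical image segmentation; 2. Efficient segmentation through lightweight adaptation (tuning only 0.86% of the parameters); 3. Better performance than existing SOTA methods on a public dataset.
Method: Combines text prompt learning (context optimization) through the BiomedCLIP text encoder with LoRA-based adaptation of SAM's architecture to achieve segmentation without manual geometric prompts.
Result: On the EUS dataset, TextSAM-EUS reaches a Dice of 82.69% (automatic prompts) and 83.10% (manual prompts), with NSD of 85.28% and 85.70% respectively, surpassing existing methods.
Insight: Text prompt learning can be used effectively for medical image segmentation, and lightweight adaptation of SAM achieves strong performance while tuning few parameters.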
Abstract: Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM’s architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Our code will be publicly available upon acceptance.
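The LoRA-based adaptation mentioned in this abstract rests on a general idea: freeze a weight matrix and learn only a low-rank update. The sketch below shows that idea on nested lists. It is a generic illustration, not the TextSAM-EUS code; the function names and the alpha/rank scaling convention are assumptions, and the fraction computed here is for a single layer rather than the 0.86% reported for the whole model.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (m x k) and a (k x n) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float = 1.0) -> Matrix:
    """Effective weight W + (alpha / rank) * B @ A, with W frozen and only B, A trained.

    Shapes: W is (d_out x d_in), B is (d_out x rank), A is (rank x d_in).
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters that are trainable when only B and A are updated."""
    added = rank * (d_in + d_out)
    return added / (d_out * d_in + added)
```

For a 1000 x 1000 layer with rank 4, fewer than 1% of the parameters are trainable, which is why low-rank adapters keep fine-tuning lightweight.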
[45] Comparison of Segmentation Methods in Remote Sensing for Land Use Land Cover
Naman Srivastava,Joel D Joy,Yash Dixit,Swarup E,Rakshit Ramesh
Main category: cs.CV
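TL;DR: The paper evaluates advanced land use land cover (LULC) mapping techniques, focusing on Look-Up Table (LUT)-based atmospheric correction combined with supervised and semi-supervised learning models (DeeplabV3+ and Cross-Pseudo Supervision, CPS) for LULC prediction. The CPS model uses dynamic weighting to improve pseudo-label reliability. A case study of Hyderabad, India, shows the impact of urbanization on land use.
Details
Motivation: Rapid urbanization is causing significant changes in land use and land cover, which matter for urban planning and sustainable development. The study aims to evaluate and improve LULC mapping techniques to support more accurate urban planning and resource management. Contribution: 1) Combining LUT-based atmospheric correction with high-resolution Cartosat MX imagery; 2) Evaluating DeeplabV3+ and a dynamically weighted CPS model; 3) Demonstrating practical use in urban planning through a case study.
Method: 1) Atmospheric correction using a LUT; 2) LULC classification with DeeplabV3+ and CPS; 3) Dynamic weighting to improve the reliability of CPS pseudo-labels.
Result: The dynamically weighted CPS model performs well on LULC classification and captures land use changes such as urban expansion and shrinking green spaces. The case study supports the practical value of these techniques for urban planning.
Insight: Dynamic weighting can improve semi-supervised learning with CPS and offers a useful approach to LULC classification of high-resolution remote sensing imagery. The study also underlines the role of remote sensing in monitoring the effects of urbanization.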
Abstract: Land Use Land Cover (LULC) mapping is essential for urban and resource planning, and is one of the key elements in developing smart and sustainable cities. This study evaluates advanced LULC mapping techniques, focusing on Look-Up Table (LUT)-based Atmospheric Correction applied to Cartosat Multispectral (MX) sensor images, followed by supervised and semi-supervised learning models for LULC prediction. We explore DeeplabV3+ and Cross-Pseudo Supervision (CPS). The CPS model is further refined with dynamic weighting, enhancing pseudo-label reliability during training. This comprehensive approach analyses the accuracy and utility of LULC mapping techniques for various urban planning applications. A case study of Hyderabad, India, illustrates significant land use changes due to rapid urbanization. By analyzing Cartosat MX images over time, we highlight shifts such as urban sprawl, shrinking green spaces, and expanding industrial areas. This demonstrates the practical utility of these techniques for urban planners and policymakers.
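Cross-Pseudo Supervision trains two networks on unlabelled pixels, each supervised by the other's hard pseudo-labels. The sketch below shows that loss with one possible form of dynamic weighting. The confidence-based weights are an assumption made for illustration: the abstract says only that dynamic weighting improves pseudo-label reliability and does not give the scheme.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cps_loss(logits_a: List[List[float]], logits_b: List[List[float]]) -> float:
    """Confidence-weighted cross pseudo supervision loss over unlabelled pixels.

    Each argument holds per-pixel class logits from one of the two networks.
    """
    total, weight_sum = 0.0, 0.0
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        ya, yb = pa.index(max(pa)), pb.index(max(pb))  # hard pseudo-labels
        wa, wb = max(pa), max(pb)  # assumed weighting: the teacher's confidence
        # Network A learns from B's pseudo-label and vice versa.
        total += wb * -math.log(pa[yb]) + wa * -math.log(pb[ya])
        weight_sum += wa + wb
    return total / weight_sum
```

With this weighting, low-confidence pseudo-labels contribute less to the loss, which is one plausible way to limit the effect of unreliable labels.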
[46] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning
Ruizhe Chen,Zhiting Fan,Tianze Luo,Heqing Zou,Zhaopeng Feng,Guiyang Xie,Hansheng Zhang,Zhuochen Wang,Zuozhu Liu,Huaijian Zhang
Main category: cs.CV
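TL;DR: The paper proposes a two-stage training framework that combines supervised fine-tuning with reinforcement learning to improve the accuracy and robustness of video temporal grounding. With high-quality cold start data and difficulty-controlled reinforcement learning, the method performs strongly on multiple benchmarks.
Details
Motivation: Video temporal grounding has progressed with large vision-language models and instruction tuning, but existing methods still have limited temporal awareness and poor generalization. The paper aims to improve the model's localization and reasoning abilities through reinforcement learning. Contribution: 1. A two-stage training framework combining supervised fine-tuning and reinforcement learning. 2. High-quality cold start data for model initialization. 3. A difficulty-controlled reinforcement learning strategy. 4. Release of datasets, models, and code.
Method: 1. Stage one: supervised fine-tuning on high-quality cold start data. 2. Stage two: further optimization with difficulty-controlled reinforcement learning.
Result: The method outperforms existing models on multiple video temporal grounding benchmarks, particularly in challenging and open-domain scenarios.
Insight: 1. High-quality cold start data is important for model initialization. 2. Difficulty-controlled reinforcement learning improves robustness and generalization.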
Abstract: Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
[47] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
Qianyi He,Yuan Chang Leong
Main category: cs.CV
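TL;DR: The paper proposes a multimodal sequence-to-sequence Transformer that predicts whole-brain fMRI responses to naturalistic multimodal movies from visual, auditory, and language inputs, extracting features with pretrained models and integrating information through dual cross-attention.
Details
Motivation: The Algonauts 2025 Challenge called for encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. The paper aims to improve brain activity prediction through multimodal sequence modeling and the integration of temporal context. Contribution: 1. A sequence-to-sequence Transformer that predicts fMRI activity from multimodal inputs; 2. A dual cross-attention mechanism that integrates current stimuli with high-level narrative information; 3. A shared encoder combined with a partially subject-specific decoder, balancing structure common across subjects with individual variability.
Method: Visual, auditory, and language features are extracted with pretrained models. A sequence Transformer models the temporal relationship between inputs and fMRI responses, and dual cross-attention integrates current stimuli with long-range narrative information. The decoder mixes shared and subject-specific components.
Result: The model performs strongly on both in-distribution and out-of-distribution data, supporting the effectiveness of multimodal temporal modeling for brain activity prediction.
Insight: Temporally aware multimodal modeling can substantially improve brain activity prediction, and the mixed decoder design helps balance shared structure and individual differences.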
Abstract: The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states, current stimuli, and episode-level summaries via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of narrative content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
[48] Distributional Uncertainty for Out-of-Distribution Detection
JinYoung Kim,DaeUng Jo,Kimin Yun,Jeonghyo Song,Youngjoon Yoo
Main category: cs.CV
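TL;DR: The paper proposes the Free-Energy Posterior Network, a framework that jointly models distributional uncertainty and identifies out-of-distribution (OoD) samples. With two key innovations (a free-energy-based density estimator and a loss parameterized through a Beta distribution), it achieves finer-grained uncertainty estimation in semantic segmentation.
Details
Motivation: Conventional methods such as MC Dropout usually focus only on model or data uncertainty and do not fully align with the semantic objective of OoD detection, so a more effective way to jointly model distributional uncertainty is needed. Contribution: 1. A free-energy-based density estimator parameterized by a Beta distribution for fine-grained uncertainty estimation; 2. A loss integrated within a posterior network that estimates uncertainty directly from learned parameters without stochastic sampling.
Method: Uses the Free-Energy Posterior Network framework, combining the Beta-parameterized free-energy density estimator and the loss function, and integrates it with the RPL framework to improve detection of OoD regions.
Result: The method's effectiveness is validated on real-world benchmarks including Fishyscapes, RoadAnomaly, and Segment-Me-If-You-Can.
Insight: Learning OoD regions through the variance of the Beta distribution yields a semantically meaningful and computationally efficient solution for uncertainty-aware segmentation.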
Abstract: Estimating uncertainty from deep neural networks is a widely used approach for detecting out-of-distribution (OoD) samples, which typically exhibit high predictive uncertainty. However, conventional methods such as Monte Carlo (MC) Dropout often focus solely on either model or data uncertainty, failing to align with the semantic objective of OoD detection. To address this, we propose the Free-Energy Posterior Network, a novel framework that jointly models distributional uncertainty and identifies OoD and misclassified regions using free energy. Our method introduces two key contributions: (1) a free-energy-based density estimator parameterized by a Beta distribution, which enables fine-grained uncertainty estimation near ambiguous or unseen regions; and (2) a loss integrated within a posterior network, allowing direct uncertainty estimation from learned parameters without requiring stochastic sampling. By integrating our approach with the residual prediction branch (RPL) framework, the proposed method goes beyond post-hoc energy thresholding and enables the network to learn OoD regions by leveraging the variance of the Beta distribution, resulting in a semantically meaningful and computationally efficient solution for uncertainty-aware segmentation. We validate the effectiveness of our method on challenging real-world benchmarks, including Fishyscapes, RoadAnomaly, and Segment-Me-If-You-Can.
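Two quantities named in this abstract have standard closed forms: the free energy of a logit vector, used in energy-based OoD scoring, and the variance of a Beta distribution. The sketch below computes both. How the Free-Energy Posterior Network connects them is not described in the abstract, so this shows only the building blocks, with hypothetical function names.

```python
import math
from typing import List


def free_energy(logits: List[float], temperature: float = 1.0) -> float:
    """Free energy -T * log(sum(exp(logit / T))), computed stably.

    Lower (more negative) values correspond to more confident logits.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of Beta(alpha, beta): alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))
```

A Beta distribution with small parameters is spread out (high variance), while large parameters concentrate it; this is the sense in which Beta variance can act as an uncertainty signal that needs no sampling.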
[49] T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
Yubin Chen,Xuyang Guo,Zhenmei Shi,Zhao Song,Jiahao Zhang
Main category: cs.CV
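TL;DR: The paper proposes T2VWorldBench, the first benchmark for systematically evaluating the world knowledge generation abilities of text-to-video (T2V) models. It covers 6 major categories, 60 subcategories, and 1,200 prompts, and its human and automated evaluations reveal the shortcomings of current models.
Details
Motivation: Current T2V models generate visually reasonable scenes, but their ability to use world knowledge has not been studied in depth. The paper aims to fill this gap with a systematic evaluation framework. Contribution: T2VWorldBench, the first benchmark for evaluating world knowledge in T2V models, spanning multiple domains, combining human and automated evaluation, and analysing the performance of 10 advanced models.
Method: Designs 1,200 prompts across 6 major categories and 60 subcategories spanning multiple domains, and combines human evaluation with automated evaluation based on vision-language models (VLMs).
Result: Most T2V models fail to understand world knowledge correctly and generate content that contradicts facts, exposing an important limitation of current models.
Insight: T2V models remain markedly weak at commonsense reasoning and factual generation, which points to research directions for building more robust models.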
Abstract: Text-to-video (T2V) models have shown remarkable performance in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely understudied. In response to this challenge, we propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature, activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos. These findings point out a critical gap in the capability of current text-to-video models to leverage world knowledge, providing valuable research opportunities and entry points for constructing models with robust capabilities for commonsense reasoning and factual generation.
[50] Information Entropy-Based Framework for Quantifying Tortuosity in Meibomian Gland Uneven Atrophy
Kesheng Wang,Xiaoyu Chen,Chunlei He,Fenfen Li,Xinxin Yu,Dexing Kong,Shoujun Huang,Qi Dai
Main category: cs.CV
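TL;DR: Proposes an information entropy-based framework for quantifying curve tortuosity, validates it on the assessment of meibomian gland atrophy, and shows its potential for clinical use.
Details
Motivation: In medical image analysis, precise quantification of curve tortuosity is critical for auxiliary diagnosis and pathological assessment, but traditional methods rely on comparison with an idealized straight line and make no use of biologically plausible reference curves.
Contribution: 1. Proposes an information entropy-based tortuosity quantification framework; 2. Combines probability modeling with domain transformation and outperforms traditional methods; 3. Validates the method on meibomian gland atrophy assessment.
Method: Quantifies tortuosity by comparing a target curve with a biologically plausible reference curve, using information entropy and probability modeling. The method is validated with numerical simulations and experiments on clinical data.
Result: A significant difference in tortuosity was found between the Demodex-negative and Demodex-positive groups (AUC = 0.8768, sensitivity 0.75, specificity 0.93).
Insight: The entropy-based framework provides a more robust and objective tortuosity assessment, and is especially suited to medical data where biologically plausible reference curves are available.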
Abstract: In the medical image analysis field, precise quantification of curve tortuosity plays a critical role in the auxiliary diagnosis and pathological assessment of various diseases. In this study, we propose a novel framework for tortuosity quantification and demonstrate its effectiveness through the evaluation of meibomian gland atrophy uniformity, serving as a representative application scenario. We introduce an information entropy-based tortuosity quantification framework that integrates probability modeling with entropy theory and incorporates domain transformation of curve data. Unlike traditional methods such as curvature or arc-chord ratio, this approach evaluates the tortuosity of a target curve by comparing it to a designated reference curve. Consequently, it is more suitable for tortuosity assessment tasks in medical data where biologically plausible reference curves are available, providing a more robust and objective evaluation metric without relying on idealized straight-line comparisons. First, we conducted numerical simulation experiments to preliminarily assess the stability and validity of the method. Subsequently, the framework was applied to quantify the spatial uniformity of meibomian gland atrophy and to analyze the difference in this uniformity between Demodex-negative and Demodex-positive patient groups. The results demonstrated a significant difference in tortuosity-based uniformity between the two groups, with an area under the curve of 0.8768, sensitivity of 0.75, and specificity of 0.93. These findings highlight the clinical utility of the proposed framework in curve tortuosity analysis and its potential as a generalizable tool for quantitative morphological evaluation in medical diagnostics.
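For context, the arc-chord ratio that the abstract names as a traditional baseline can be sketched as below. This is not the authors' entropy-based method, whose details the abstract does not give; the function name `arc_chord_ratio` and the two sample curves are illustrative assumptions.

```python
import math


def arc_chord_ratio(points):
    """Polyline arc length divided by the straight-line distance between its endpoints.

    A value of 1.0 means a perfectly straight curve; larger values mean more tortuous.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    arc = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    chord = math.dist(points[0], points[-1])
    if chord == 0:
        raise ValueError("closed curve: chord length is zero")
    return arc / chord


# Made-up sample curves: a straight segment and a single bend.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]

print(arc_chord_ratio(straight))  # 1.0
print(round(arc_chord_ratio(zigzag), 4))  # 1.4142
```

The ratio always measures deviation from a straight line, which is the limitation the abstract points to when a biologically plausible reference curve is itself not straight.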
[51] Degradation-Consistent Learning via Bidirectional Diffusion for Low-Light Image Enhancement
Jinhong He,Minglong Xue,Zhipu Liu,Mingliang Zhou,Aoxiang Ning,Palaiahnakote Shivakumara
Main category: cs.CV
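TL;DR: The paper proposes a bidirectional diffusion optimization mechanism for low-light image enhancement that jointly models the degradation processes of low-light and normal-light images to improve generation quality. Combined with an adaptive feature interaction block and a reflection-aware correction module, it clearly outperforms existing methods.
Details
Motivation: Existing diffusion-based methods underperform in low-light image enhancement, mainly because their unidirectional modeling of degradation struggles to capture complex real-world degradation patterns, which leads to structural inconsistencies and pixel misalignments.
Contribution: 1. Proposes a bidirectional diffusion optimization mechanism that jointly models the degradation processes of low-light and normal-light images; 2. Introduces an adaptive feature interaction block (AFI) and a reflection-aware correction module (RACM) to improve detail recovery and color correction; 3. Validates the method's superiority and generalization on multiple benchmark datasets.
Method: 1. Performs bidirectional diffusion during training (low-light to normal-light and normal-light to low-light); 2. Uses the adaptive feature interaction block (AFI) to refine feature representations; 3. Designs the reflection-aware correction module (RACM) for post-denoising color restoration and overexposure suppression.
Result: Experiments show that the method outperforms existing methods in both quantitative and qualitative evaluations and generalizes well across diverse degradation scenarios.
Insight: The bidirectional diffusion mechanism matches degradation parameters more precisely, and combining the feature interaction and reflection-aware modules improves image consistency and visual quality.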
Abstract: Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion (from low-to-normal light and from normal-to-low light) during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model's ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code at https://github.com/hejh8/BidDiff
[52] WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
Haodong Zhu,Wenhao Dong,Linlin Yang,Hong Li,Yuguang Yang,Yangyang Ren,Qingcheng Zhu,Zichao Feng,Changbai Li,Shaohui Lin,Runqi Wang,Xiaoyan Luo,Baochang Zhang
Main category: cs.CV
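TL;DR: WaveMamba proposes an RGB-infrared object detection fusion method based on the wavelet transform and the Mamba framework. By comprehensively fusing low- and high-frequency sub-bands, it significantly improves detection performance.
Details
Motivation: RGB and infrared images have complementary characteristics for object detection, but existing methods do not fully exploit their frequency features.
Contribution: 1. Proposes the WaveMamba Fusion Block (WMFB) for deep fusion of low- and high-frequency features; 2. Improves the detection head to reduce information loss; 3. Raises average mAP by 4.5% on four benchmarks.
Method: 1. Uses the Discrete Wavelet Transform (DWT) to decompose the frequency features of RGB and infrared images; 2. Fuses features through the WMFB, which comprises the LMFB and a high-frequency enhancement strategy; 3. Uses an improved detection head that incorporates the IDWT to produce the final detections.
Result: On four benchmarks, WaveMamba's average mAP exceeds the previous state of the art by 4.5%.
Insight: Combining wavelet decomposition with the Mamba framework exploits the multimodal characteristics of RGB and infrared images more efficiently and significantly improves detection performance.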
Abstract: Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an “absolute maximum” fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
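The DWT decomposition and the "absolute maximum" rule for high-frequency sub-bands can be illustrated with a minimal NumPy sketch. It rests on several assumptions not stated in the abstract: a single-level Haar wavelet, single-channel arrays instead of learned feature maps, and a plain average standing in for the Mamba-based low-frequency block (LMFB). All function names are hypothetical.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar DWT of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0   # low-frequency approximation
    h1 = (a - b + c - d) / 2.0    # differences across columns
    h2 = (a + b - c - d) / 2.0    # differences across rows
    h3 = (a - b - c + d) / 2.0    # diagonal differences
    return low, (h1, h2, h3)


def haar_idwt2(low, highs):
    """Exact inverse of haar_dwt2."""
    h1, h2, h3 = highs
    out = np.empty((low.shape[0] * 2, low.shape[1] * 2), dtype=float)
    out[0::2, 0::2] = (low + h1 + h2 + h3) / 2.0
    out[0::2, 1::2] = (low - h1 + h2 - h3) / 2.0
    out[1::2, 0::2] = (low + h1 - h2 - h3) / 2.0
    out[1::2, 1::2] = (low - h1 - h2 + h3) / 2.0
    return out


def abs_max_fuse(x, y):
    """Keep, element-wise, whichever coefficient has the larger magnitude."""
    return np.where(np.abs(x) >= np.abs(y), x, y)


def fuse(rgb, ir):
    """Fuse two aligned single-channel arrays in the wavelet domain."""
    low_r, high_r = haar_dwt2(rgb)
    low_i, high_i = haar_dwt2(ir)
    low = 0.5 * (low_r + low_i)  # placeholder only; the paper uses LMFB here
    high = tuple(abs_max_fuse(p, q) for p, q in zip(high_r, high_i))
    return haar_idwt2(low, high)
```

Taking the larger-magnitude coefficient keeps the sharper edge response from whichever modality has it, which is the usual motivation for this rule in wavelet-domain fusion.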
[53] Real-Time Object Detection and Classification using YOLO for Edge FPGAs
Rashed Al Amin,Roman Obermaisser
Main category: cs.CV
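TL;DR: This paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 and optimized for edge FPGA platforms. Experiments show that on the Xilinx Kria KV260 FPGA the system reaches 99% classification accuracy at a power consumption of only 3.5 W and a processing speed of 9 frames per second.
Details
Motivation: Although existing deep learning methods (such as CNNs, SSDs, and YOLO) perform well on FPGAs, YOLO-based systems still face resource-efficiency challenges on edge FPGA platforms. This paper aims to address that problem.
Contribution: Proposes an FPGA-optimized YOLOv5 system that achieves resource-efficient real-time object detection and classification suitable for edge computing.
Method: Optimizes the YOLOv5 architecture, trains it on the COCO and GTSRD datasets, and deploys it on the Xilinx Kria KV260 FPGA board.
Result: Trained on the COCO and GTSRD datasets, the system reaches 99% classification accuracy at 3.5 W power consumption and a processing speed of 9 FPS.
Insight: The optimized YOLOv5 performs well on an edge FPGA platform, balancing high accuracy against low resource consumption and offering a practical option for real-time object detection and classification.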
Abstract: Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.
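The reported power and throughput figures imply a per-frame time and energy budget, worked through in the short sketch below. It assumes the 3.5 W figure is the average power while the system runs at a steady 9 FPS; the abstract does not say how power was measured, and the helper name is hypothetical.

```python
def per_frame_budget(power_w, fps):
    """Return (frame period in ms, energy per frame in joules).

    Assumes constant average power and steady throughput (an assumption,
    not a measurement protocol taken from the paper).
    """
    frame_period_ms = 1000.0 / fps
    energy_j = power_w / fps
    return frame_period_ms, energy_j


frame_period_ms, energy_j = per_frame_budget(3.5, 9)
print(f"{frame_period_ms:.1f} ms/frame, {energy_j:.3f} J/frame")
# prints: 111.1 ms/frame, 0.389 J/frame
```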
[54] Unsupervised Domain Adaptation for 3D LiDAR Semantic Segmentation Using Contrastive Learning and Multi-Model Pseudo Labeling
Abhishek Kaushik,Norbert Haala,Uwe Soergel
Main category: cs.CV
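TL;DR: The paper proposes a two-stage unsupervised domain adaptation method for 3D LiDAR semantic segmentation that combines contrastive learning with multi-model pseudo-labeling and significantly improves cross-domain segmentation accuracy.
Details
Motivation: Addresses the performance drop of 3D LiDAR semantic segmentation under domain shift (e.g., sensor type, geographic location) while avoiding costly manual annotation of target-domain data.
Contribution: Proposes a two-stage unsupervised domain adaptation framework that combines contrastive learning with a multi-model pseudo-labeling strategy and effectively mitigates domain shift.
Method: 1. Pre-trains the backbone network with unsupervised contrastive learning to learn domain-invariant features; 2. Generates high-quality labels for fine-tuning through pseudo-labeling with multiple models (projection-, voxel-, hybrid-, and cylinder-based).
Result: Experiments show that on domain adaptation from SemanticKITTI to SemanticPOSS and SemanticSlamantic, the method significantly outperforms direct transfer and single-model approaches.
Insight: Combining contrastive learning with multi-model pseudo-labeling improves cross-domain semantic segmentation without target-domain annotations.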
Abstract: Addressing performance degradation in 3D LiDAR semantic segmentation due to domain shifts (e.g., sensor type, geographical location) is crucial for autonomous systems, yet manual annotation of target data is prohibitive. This study addresses the challenge using Unsupervised Domain Adaptation (UDA) and introduces a novel two-stage framework to tackle it. Initially, unsupervised contrastive learning at the segment level is used to pre-train a backbone network, enabling it to learn robust, domain-invariant features without labels. Subsequently, a multi-model pseudo-labeling strategy is introduced, utilizing an ensemble of diverse state-of-the-art architectures (including projection, voxel, hybrid, and cylinder-based methods). Predictions from these models are aggregated via hard voting to generate high-quality, refined pseudo-labels for the unlabeled target domain, mitigating single-model biases. The contrastively pre-trained network is then fine-tuned using these robust pseudo-labels. Experiments adapting from SemanticKITTI to unlabeled target datasets (SemanticPOSS, SemanticSlamantic) demonstrate significant improvements in segmentation accuracy compared to direct transfer and single-model UDA approaches. These results highlight the effectiveness of combining contrastive pre-training with refined ensemble pseudo-labeling for bridging complex domain gaps without requiring target domain annotations.
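The hard-voting aggregation of per-point predictions can be sketched as follows. The abstract states only that ensemble predictions are combined by hard voting; the strict-majority threshold and the `ignore_label` used for points without enough agreement are assumptions added for illustration, as are the function name and toy labels.

```python
from collections import Counter


def hard_vote(predictions, ignore_label=-1, min_agreement=None):
    """Fuse per-point class labels from several models by majority vote.

    predictions: list of per-model label lists, all of the same length.
    Points whose top label has fewer than `min_agreement` votes are set to
    `ignore_label` (an assumed refinement rule, not taken from the paper).
    """
    n_models = len(predictions)
    if min_agreement is None:
        min_agreement = n_models // 2 + 1  # strict majority
    fused = []
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        fused.append(label if count >= min_agreement else ignore_label)
    return fused


# Four hypothetical models labelling the same four points.
preds = [
    [0, 1, 2, 2],
    [0, 1, 1, 3],
    [0, 2, 1, 2],
    [0, 1, 2, 4],
]
print(hard_vote(preds))  # [0, 1, -1, -1]
```

Dropping low-agreement points trades pseudo-label coverage for purity; lowering `min_agreement` keeps more points at the cost of noisier labels.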
[55] Differential-UMamba: Rethinking Tumor Segmentation Under Limited Data Scenarios
Dhruv Jain,Romain Modzelewski,Romain Hérault,Clement Chatelain,Eva Torfeh,Sebastien Thureau
Main category: cs.CV
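TL;DR: The paper proposes Diff-UMamba, a new architecture that combines UNet with the mamba mechanism to suppress noise and strengthen task-relevant feature representations in data-scarce medical image segmentation, improving generalization and segmentation accuracy.
Details
Motivation: In data-scarce medical image segmentation, deep learning models tend to overfit to noise and irrelevant patterns, which degrades performance. The paper addresses this by improving the model architecture.
Contribution: Introduces the Diff-UMamba architecture, which combines UNet with the mamba mechanism and adds a Noise Reduction Module (NRM) that uses a signal differencing strategy to reduce noisy activations, improving segmentation performance under limited data.
Method: 1) Combines UNet with the mamba mechanism to model long-range dependencies; 2) Uses the NRM to suppress irrelevant activations through signal differencing; 3) Validates performance on multiple public datasets and an internal dataset.
Result: On public datasets including MSD (lung and pancreas) and AIIB23, Diff-UMamba improves on baseline methods by 1-3%. Limited-data experiments are also run on BraTS-21, and on an internal NSCLC dataset GTV segmentation improves by 4-5%.
Insight: In data-scarce settings, suppressing noise and irrelevant features lets the model focus on clinically relevant regions, which significantly improves segmentation performance and robustness.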
Abstract: In data-scarce scenarios, deep learning models often overfit to noise and irrelevant patterns, which limits their ability to generalize to unseen samples. To address these challenges in medical image segmentation, we introduce Diff-UMamba, a novel architecture that combines the UNet framework with the mamba mechanism for modeling long-range dependencies. At the heart of Diff-UMamba is a Noise Reduction Module (NRM), which employs a signal differencing strategy to suppress noisy or irrelevant activations within the encoder. This encourages the model to filter out spurious features and enhance task-relevant representations, thereby improving its focus on clinically meaningful regions. As a result, the architecture achieves improved segmentation accuracy and robustness, particularly in low-data settings. Diff-UMamba is evaluated on multiple public datasets, including MSD (lung and pancreas) and AIIB23, demonstrating consistent performance gains of 1-3% over baseline methods across diverse segmentation tasks. To further assess performance under limited-data conditions, additional experiments are conducted on the BraTS-21 dataset by varying the proportion of available training samples. The approach is also validated on a small internal non-small cell lung cancer (NSCLC) dataset for gross tumor volume (GTV) segmentation in cone beam CT (CBCT), where it achieves a 4-5% improvement over the baseline.
[56] LEAF: Latent Diffusion with Efficient Encoder Distillation for Aligned Features in Medical Image Segmentation
Qilin Huang,Tianyu Lin,Zhiguang Chen,Fudan Zheng
Main category: cs.CV
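TL;DR: LEAF is a medical image segmentation method based on latent diffusion models. It replaces noise prediction with direct prediction of the segmentation map and uses feature distillation to align convolutional-layer features with those of a transformer-based vision encoder, improving segmentation performance without affecting inference efficiency.
Details
Motivation: Existing diffusion models for medical image segmentation reuse the original training process without task-specific adjustments, and pre-trained diffusion models remain weak at feature extraction, so improvement is needed.
Contribution: Proposes the LEAF model, which reduces the variance of results by predicting the segmentation map directly and uses feature distillation to align convolutional-layer features with transformer encoder features, improving segmentation performance while keeping inference efficient.
Method: During fine-tuning, replaces noise prediction with direct prediction of the segmentation map, and uses feature distillation to align the hidden states of the convolutional layers with the features of a transformer-based vision encoder.
Result: Experiments show that the method improves the original diffusion model on segmentation datasets covering multiple disease types, without increasing the parameter count or computation at inference.
Insight: Direct segmentation-map prediction and feature alignment improve medical image segmentation performance while preserving model efficiency.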
Abstract: Leveraging the powerful capabilities of diffusion models has yielded quite effective results in medical image segmentation tasks. However, existing methods typically transfer the original training process directly without specific adjustments for segmentation tasks. Furthermore, the commonly used pre-trained diffusion models still have deficiencies in feature extraction. Based on these considerations, we propose LEAF, a medical image segmentation model grounded in latent diffusion models. During the fine-tuning process, we replace the original noise prediction pattern with a direct prediction of the segmentation map, thereby reducing the variance of segmentation results. We also employ a feature distillation method to align the hidden states of the convolutional layers with the features from a transformer-based vision encoder. Experimental results demonstrate that our method enhances the performance of the original diffusion model across multiple segmentation datasets for different disease types. Notably, our approach does not alter the model architecture, nor does it increase the number of parameters or computation during the inference phase, making it highly efficient.
[57] 3D Test-time Adaptation via Graph Spectral Driven Point Shift
Xin Wei,Qin Yang,Yijie Fang,Mingrui Zhu,Nannan Wang
Main category: cs.CV
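TL;DR: The paper proposes GSDTTA, a graph spectral domain test-time adaptation method for 3D point clouds. Moving the optimization into the graph spectral domain makes adaptation more efficient and reduces computational overhead.
Details
Motivation: Existing 3D test-time adaptation (TTA) methods typically rely on expensive spatial-domain optimization and may need additional training data, which limits their use on 3D point clouds.
Contribution: Proposes GSDTTA, which uses the Graph Fourier Transform (GFT) to represent point clouds as signals in the graph spectral domain and adapts efficiently by optimizing only the low-frequency components.
Method: Transforms point clouds into the graph spectral domain with the GFT, optimizes the lowest 10% of frequency components, and reconstructs the adapted point cloud with the IGFT. An eigenmap-guided self-training strategy refines the process iteratively.
Result: On benchmark datasets, GSDTTA outperforms existing 3D TTA methods.
Insight: Low-frequency components in the graph spectral domain capture the global structure of 3D point clouds, which makes the optimization more efficient and yields clear performance gains.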
Abstract: While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by their irregular and unordered structure. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in the target domain are represented as outlier-aware graphs and transformed into the graph spectral domain by the Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud’s energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D point cloud classification.
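The spectral pipeline (GFT, adjustment of the lowest 10% of frequency components, IGFT) can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: a plain symmetrized k-NN graph with unit weights replaces the paper's outlier-aware graph, the combinatorial Laplacian is used, and `delta` stands in for the learned spectral adjustment. All names are hypothetical.

```python
import numpy as np


def knn_laplacian(points, k=4):
    """Combinatorial Laplacian L = D - W of a symmetrized k-NN graph (unit weights)."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-matches
    nbrs = np.argsort(d2, axis=1)[:, :k]
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    w = np.maximum(w, w.T)                       # make the graph undirected
    return np.diag(w.sum(1)) - w


def gft(points, lap):
    """Graph Fourier Transform: project coordinates onto Laplacian eigenvectors."""
    _, evecs = np.linalg.eigh(lap)               # eigenvalues ascending = low to high frequency
    return evecs, evecs.T @ points               # spectral coefficients, shape (n, 3)


def igft(evecs, coeffs):
    """Inverse GFT: back to point coordinates."""
    return evecs @ coeffs


def shift_low_freq(points, delta, ratio=0.1, k=4):
    """Add `delta` (shape (m, 3)) to the m lowest-frequency coefficients and reconstruct."""
    evecs, coeffs = gft(points, knn_laplacian(points, k))
    m = max(1, round(ratio * len(points)))
    coeffs = coeffs.copy()
    coeffs[:m] += delta                          # delta stands in for the learned adjustment
    return igft(evecs, coeffs)
```

Because only m coefficient rows are adjusted, the number of optimized values is m x 3 rather than n x 3, which is the parameter saving the abstract refers to.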
[58] LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li,Chunyu Xie,Ji Ao,Dawei Leng,Yuhui Yin
Main category: cs.CV
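TL;DR: LMM-Det proposes a simple yet effective way to perform object detection with a large multimodal model (LMM) without specialized detection modules, raising recall through data distribution adjustment and inference optimization.
Details
Motivation: Although large multimodal models perform well on multimodal tasks, their object detection ability lags far behind specialist detectors. To close this gap, LMM-Det explores how to use an LMM for object detection without adding extra detection modules.
Contribution: 1) Reveals that recall degrades significantly when LMMs perform object detection; 2) Proposes raising recall through data distribution adjustment and inference optimization; 3) Shows that an LMM can detect objects without extra modules.
Method: 1) Conducts an exploratory analysis of combining LMMs with object detection; 2) Raises recall through data distribution adjustment and inference optimization; 3) Reorganizes the instruction conversations to strengthen detection ability.
Result: Experiments show that LMM-Det is effective at object detection and support the claim that an LMM can detect objects without extra modules.
Insight: With suitable adjustment and optimization, LMMs can realize their potential in object detection without relying on specialist detection modules.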
Abstract: Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model is applied to object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.
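Since the analysis centers on recall, a minimal sketch of how detection recall is commonly computed may help. It uses class-agnostic greedy matching at a fixed IoU threshold, which is a simplification and not necessarily the evaluation protocol used in the paper; the function names and sample boxes are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall(gt_boxes, pred_boxes, thr=0.5):
    """Fraction of ground-truth boxes matched by a distinct prediction with IoU >= thr."""
    used = set()
    hits = 0
    for g in gt_boxes:
        best, best_j = 0.0, None
        for j, p in enumerate(pred_boxes):
            if j in used:
                continue
            v = iou(g, p)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            used.add(best_j)
            hits += 1
    return hits / len(gt_boxes) if gt_boxes else 0.0


# Two ground-truth objects, but the model emits only one box.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(1, 1, 10, 10)]
print(recall(gt, pred))  # 0.5
```

A model that outputs too few boxes caps its own recall regardless of how accurate each box is, which is why recall is the metric singled out in the abstract.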
[59] Improving Large Vision-Language Models’ Understanding for Field Data
Xiaomei Zhang,Hanyu Zheng,Xiangyu Zhu,Jinghuan Wei,Junhong Zou,Zhen Lei,Zhaoxiang Zhang
Main category: cs.CV
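TL;DR: The FieldLVLM framework significantly improves large vision-language models' understanding of complex scientific field data through field-aware language generation and data-compressed multimodal model tuning.
Details
Motivation: Large vision-language models excel at general tasks but remain weak in scientific domains, such as understanding the complex field data of the natural sciences, and need domain-specific optimization.
Contribution: Proposes the FieldLVLM framework, comprising field-aware language generation and data-compressed multimodal model tuning, which significantly improves model understanding of scientific field data.
Method: 1. Field-aware language generation: extracts key features from field data with machine learning and converts them into structured text; 2. Data-compressed multimodal tuning: compresses the complexity of the data to fit the model's language decoder.
Result: On newly proposed benchmark datasets, FieldLVLM significantly outperforms existing methods, opening new possibilities for scientific applications.
Insight: This domain-specific optimization helps bridge the gap between large models and scientific discovery.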
Abstract: Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models’ understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning focuses on LVLMs with these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the model’s language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.
[60] A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation
Minje Park,Jeonghwa Lim,Taehyung Yu,Sunghoon Joo
Main category: cs.CV
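TL;DR: The paper presents the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in electrocardiogram (ECG) delineation. It unifies multiple public datasets and adopts representative algorithms from computer vision, and the results show that the transformer outperforms the convolutional network.
Details
Motivation: The scarcity of publicly available annotated ECG data has limited deep learning progress in ECG delineation, and semi-supervised learning can address this by exploiting abundant unlabeled data.
Contribution: 1. Presents the first systematic benchmark for semi-supervised semantic segmentation in ECG delineation; 2. Curates and standardizes multiple public datasets; 3. Evaluates representative algorithms on two architectures (a convolutional network and a transformer); 4. Proposes ECG-specific training configurations and augmentation strategies.
Method: 1. Uses five representative semi-supervised semantic segmentation algorithms; 2. Implements them on two architectures, a convolutional network and a transformer; 3. Designs two evaluation settings (in-domain and cross-domain); 4. Proposes ECG-specific training configurations and data augmentation methods.
Result: The transformer outperforms the convolutional network in semi-supervised ECG delineation.
Insight: Semi-supervised learning can use unlabeled data to improve ECG delineation, and the transformer architecture shows promise for this task.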
Abstract: Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that our benchmark will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.
[61] GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences
Gabriel Jarry,Ramon Dalmau,Philippe Very,Franck Ballerini,Stephania-Denisa Bocu
Main category: cs.CV
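TL;DR: The paper presents GVCCS, a new open dataset for identifying and tracking contrails in visible-range sequences from a ground-based all-sky camera, together with a deep learning framework for semantic segmentation, instance segmentation, and temporal tracking.
Details
Motivation: Aviation's climate impact comes not only from CO2 emissions but also from non-CO2 effects, especially contrails. Existing datasets lack detailed tracking of contrail dynamics and sources, which hampers the validation and calibration of physical models.
Contribution: Presents the GVCCS dataset, with 122 video sequences and 24,228 frames in which each contrail is individually labeled and tracked. Also proposes a unified deep learning framework for semantic segmentation, instance segmentation, and temporal tracking of contrails.
Method: Uses a panoptic segmentation model to perform semantic segmentation, instance segmentation, and temporal tracking of contrails in a single architecture.
Result: The GVCCS dataset and the proposed deep learning framework support high-quality contrail monitoring and the calibration of physical models.
Insight: High-quality, temporally resolved contrail datasets can improve climate impact assessments and the accuracy of physical models, and ultimately support a fuller understanding of aviation's climate impact.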
Abstract: Aviation’s climate impact includes not only CO2 emissions but also significant non-CO2 effects, especially from contrails. These ice clouds can alter Earth’s radiative balance, potentially rivaling the warming effect of aviation CO2. Physics-based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity-driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets do not explore all aspects of contrail dynamics and formation: they typically lack temporal tracking, and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open dataset of contrails recorded with a ground-based all-sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high-quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.
[62] Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
Runmin Zhang,Zhu Yu,Si-Yuan Cao,Lingyu Zhu,Guangyi Zhang,Xiaokai Bai,Hui-Liang Shen
Main category: cs.CV
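TL;DR: SGCDet is a multi-view indoor 3D object detection framework based on adaptive 3D volume construction. A geometry and context aware module dynamically integrates multi-view information, and a sparse volume construction strategy cuts redundant computation.
Details
Motivation: In previous methods, the receptive field of each voxel is restricted to fixed locations on the images, which limits feature expressiveness, and much of the computation is redundant. SGCDet aims to improve detection efficiency and accuracy through adaptive volume construction and dynamic adjustment of view contributions.
Contribution: Proposes a geometry and context aware module that dynamically adjusts the contributions of different views; designs a sparse volume construction strategy that optimizes feature extraction; requires only 3D bounding box supervision, with no ground-truth scene geometry.
Method: Takes multi-view images and adaptively integrates their information through the geometry and context aware module; the sparse volume construction strategy selects high-probability voxels to reduce computation; training is supervised with 3D bounding boxes only.
Result: Achieves state-of-the-art performance on the ScanNet, ScanNet200, and ARKitScenes datasets.
Insight: Adaptive volume construction and dynamic multi-view fusion significantly improve 3D object detection, while the sparse strategy reduces the computational burden.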
Abstract: This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
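The idea of refining only voxels with high occupancy probability can be sketched as a top-k selection over an occupancy grid. The fixed `keep_ratio` and the function name are assumptions for illustration; the abstract describes the selection as adaptive and does not specify the rule.

```python
import numpy as np


def select_voxels(occupancy, keep_ratio=0.25):
    """Return (k, 3) integer indices of the voxels with the highest occupancy probability.

    occupancy: array of shape (D, H, W) with values in [0, 1].
    keep_ratio: fraction of voxels kept for refinement (an assumed hyperparameter).
    """
    flat = occupancy.ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    idx = np.argpartition(flat, -k)[-k:]          # top-k without a full sort
    return np.stack(np.unravel_index(idx, occupancy.shape), axis=1)


# Toy 2x2x2 grid with two likely-occupied voxels.
occ = np.zeros((2, 2, 2))
occ[0, 1, 1] = 0.9
occ[1, 0, 0] = 0.8
selected = select_voxels(occ, keep_ratio=0.25)
print(sorted(map(tuple, selected.tolist())))  # [(0, 1, 1), (1, 0, 0)]
```

Feature refinement then runs only on the selected indices, so cost scales with the number of likely-occupied voxels rather than with the full volume.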
[63] EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
Yuping He,Yifei Huang,Guo Chen,Baoqi Pei,Jilan Xu,Tong Lu,Jiangmiao Pang
Main category: cs.CV
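TL;DR: The paper introduces EgoExoBench, the first benchmark for evaluating multimodal large language models (MLLMs) on first- and third-person video understanding and cross-view reasoning. It finds that existing models perform poorly on cross-view tasks.
Details
Motivation: Humans can transfer and integrate knowledge between first-person (egocentric) and third-person (exocentric) views, but how well current MLLMs do this has not been explored.
Contribution: 1. Proposes the EgoExoBench benchmark, with over 7,300 question-answer pairs covering 11 sub-tasks; 2. Provides the first systematic evaluation of 13 advanced MLLMs on cross-view tasks.
Method: Builds EgoExoBench from public datasets around three core challenges (semantic alignment, viewpoint association, and temporal reasoning) and evaluates 13 MLLMs.
Result: Existing MLLMs excel at single-view tasks but perform poorly on cross-view semantic alignment, viewpoint association, and temporal reasoning.
Insight: Cross-view reasoning is an important direction for future MLLM research, and EgoExoBench provides a benchmark for developing agents and assistants with human-like cross-view intelligence.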
Abstract: Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.
[64] VB-Mitigator: An Open-source Framework for Evaluating and Advancing Visual Bias Mitigation
Ioannis Sarridis,Christos Koutlis,Symeon Papadopoulos,Christos Diou
Main category: cs.CV
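TL;DR: VB-Mitigator is an open-source framework that simplifies and unifies the evaluation and comparison of visual bias mitigation techniques to advance fairness research.
Details
Motivation: Bias in computer vision models makes AI systems unfair, unreliable, and hard to generalize, while fragmented implementations and inconsistent evaluation in existing research hinder progress.
Contribution: Proposes the VB-Mitigator framework, which integrates 12 mitigation methods and 7 benchmark datasets and supports extensibility and unified evaluation.
Method: Provides an open-source framework that integrates multiple methods and datasets and supports quick integration and comparison of new methods and metrics.
Result: The framework gives the research community a unified evaluation environment and a performance comparison of existing methods.
Insight: A unified research environment helps accelerate fairness research, and standardized evaluation practice is essential to the field's progress.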
Abstract: Bias in computer vision models remains a significant challenge, often resulting in unfair, unreliable, and non-generalizable AI systems. Although research into bias mitigation has intensified, progress continues to be hindered by fragmented implementations and inconsistent evaluation practices. Disparate datasets and metrics used across studies complicate reproducibility, making it difficult to fairly assess and compare the effectiveness of various approaches. To overcome these limitations, we introduce the Visual Bias Mitigator (VB-Mitigator), an open-source framework designed to streamline the development, evaluation, and comparative analysis of visual bias mitigation techniques. VB-Mitigator offers a unified research environment encompassing 12 established mitigation methods and 7 diverse benchmark datasets. A key strength of VB-Mitigator is its extensibility, allowing for seamless integration of additional methods, datasets, metrics, and models. VB-Mitigator aims to accelerate research toward fairness-aware computer vision models by serving as a foundational codebase for the research community to develop and assess their approaches. To this end, we also recommend best evaluation practices and provide a comprehensive performance comparison among state-of-the-art methodologies.
[65] Deformable Convolution Module with Globally Learned Relative Offsets for Fundus Vessel Segmentation
Lexuan Zhu,Yuxuan Li,Yuning Ren
Main category: cs.CV
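TL;DR: The paper proposes a novel deformable convolution module that learns global relative offsets to strengthen the model's ability to represent complex shape features. Applied to fundus vessel segmentation, it reaches state-of-the-art performance.
Details
Motivation: Existing deformable convolutions deform the kernel directly and struggle to capture long-range global features. Fundus vessels have complex self-similar edges and call for a method that learns offsets globally to strengthen feature representation.
Contribution: 1) Proposes a plug-and-play deformable convolution module that learns global offsets with attention and feedforward networks; 2) Adaptively warps the feature maps with a sub-pixel displacement field, which amounts to a relative deformation of the kernel sampling grid; 3) Builds the GDCUnet model on this module, which performs strongly on fundus vessel segmentation.
Method: The module learns a sub-pixel displacement field to warp the feature maps rather than deforming the convolution kernel directly. It combines attention and feedforward networks to capture long-range global features and decouples the kernel size from the learning network.
Result: Experiments on public datasets show that GDCUnet reaches state-of-the-art performance. Ablation experiments further confirm that the proposed module significantly improves the model's representation of complex vessel features and its generalization.
Insight: The module has an interface similar to that of conventional convolution and suits other machine vision tasks with complex global self-similar features.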
Abstract: Deformable convolution adaptively changes the shape of the convolution kernel by learning offsets, which lets it handle complex shape features. We propose a novel plug-and-play deformable convolutional module that uses attention and feedforward networks to learn offsets, so that the deformable patterns can capture long-distance global features. Unlike existing deformable convolutions, the proposed module learns a sub-pixel displacement field and adaptively warps the feature maps across all channels rather than directly deforming the convolution kernel. This is equivalent to a relative deformation of the kernel sampling grids, and it achieves global feature deformation and decouples the kernel size from the learning network. Considering that fundus blood vessels have globally self-similar complex edges, we design GDCUnet, a deep learning model for fundus blood vessel segmentation based on the proposed convolutional module. Empirical evaluations under the same configuration and a unified framework show that GDCUnet achieves state-of-the-art performance on public datasets. Further ablation experiments demonstrate that the proposed deformable convolutional module learns the complex features of fundus blood vessels more effectively, enhancing the model's representation and generalization capabilities. Because the proposed module has an interface similar to that of conventional convolution, we suggest applying it to more machine vision tasks with complex global self-similar features.
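Warping a feature map with a sub-pixel displacement field, instead of deforming the kernel, can be sketched with bilinear sampling in NumPy. The displacement field is taken as given here, whereas in the paper it is learned by attention and feedforward networks. The function name, the (dy, dx) channel order, and the clamped border handling are assumptions.

```python
import numpy as np


def warp_bilinear(feat, flow):
    """Sample `feat` at positions shifted by a sub-pixel displacement field.

    feat: (C, H, W) feature map, warped identically across all channels.
    flow: (2, H, W) offsets as (dy, dx) in pixels; sampling positions are
          clamped to the image border (an assumed boundary rule).
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(ys + flow[0], 0, h - 1)
    x = np.clip(xs + flow[1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0                      # fractional parts = interpolation weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


feat = np.arange(12, dtype=float).reshape(1, 3, 4)
half_px_right = np.stack([np.zeros((3, 4)), np.full((3, 4), 0.5)])
warped = warp_bilinear(feat, half_px_right)
print(warped[0, 0])  # [0.5 1.5 2.5 3. ]
```

An ordinary fixed-grid convolution applied to the warped map then samples the original map on a relatively deformed grid, and the warp itself does not depend on the kernel size.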
[66] Towards Effective Human-in-the-Loop Assistive AI Agents
Filippos Bellos,Yayuan Li,Cary Shu,Ruey Day,Jeffrey M. Siskind,Jason J. Corso
Main category: cs.CV
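TL;DR: The paper presents an evaluation framework and a multimodal dataset for studying how AI assistance affects human task completion, develops an augmented reality (AR)-based AI agent, and shows through experiments that AI assistance improves human performance.
Details
Motivation: Human-AI collaboration on physical tasks has potential in both everyday life and professional domains, but evaluating such collaboration remains challenging.
Contribution: 1. An evaluation framework and multimodal dataset for human-AI collaboration; 2. an interactive AR-based AI agent; 3. experimental evidence that AI assistance improves human task completion.
Method: 1. Design an evaluation framework and a multimodal dataset; 2. develop an AR-based AI agent; 3. validate the effect of AI assistance through human studies.
Result: Experiments show that AI assistance significantly improves task completion, reduces errors, and improves learning outcomes.
Insight: Interactive guidance from AI agents has practical potential in complex human-collaboration tasks, especially in scenarios that need real-time feedback (such as cooking and battlefield medicine).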
Abstract: Effective human-AI collaboration for physical task completion has significant potential in both everyday activities and professional domains. AI agents equipped with informative guidance can enhance human performance, but evaluating such collaboration remains challenging due to the complexity of human-in-the-loop interactions. In this work, we introduce an evaluation framework and a multimodal dataset of human-AI interactions designed to assess how AI guidance affects procedural task performance, error reduction and learning outcomes. In addition, we develop an augmented reality (AR)-equipped AI agent that provides interactive guidance in real-world tasks, from cooking to battlefield medicine. Through human studies, we share empirical insights into AI-assisted human performance and demonstrate that AI-assisted collaboration improves task completion.
[67] Towards Consistent Long-Term Pose Generation
Yayuan Li,Filippos Bellos,Jason Corso
Main category: cs.CV
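TL;DR: The paper proposes a new one-stage architecture that generates poses directly in continuous coordinate space from an RGB image and a text description, avoiding intermediate representations and token-based generation, and markedly improving the consistency of long-term pose generation.
Details
Motivation: Current pose generation methods depend on intermediate representations or autoregressive models, which degrades performance in long-term generation and makes temporal coherence hard to maintain. The paper aims to address this problem.
Contribution: A one-stage architecture that generates poses directly in continuous coordinate space; a relative movement prediction mechanism and a unified placeholder token avoid intermediate representations and the mismatch between training and inference.
Method: The approach combines a relative movement prediction mechanism, which preserves spatial relationships, with unified placeholder tokens, which enable generation in a single forward pass with identical behavior during training and inference.
Result: Experiments on the Penn Action and F-PHAB datasets show that the method significantly outperforms existing quantization-based and autoregressive methods in long-term generation scenarios.
Insight: Operating directly on pose coordinates and avoiding intermediate representations can markedly improve the consistency and performance of long-term pose generation.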
Abstract: Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or autoregressive models that accumulate errors during inference. This fundamental limitation leads to degraded performance, particularly in long-term pose generation where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context - a single RGB image and text description - while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates through a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
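To make "relative movement prediction" concrete, the sketch below turns per-frame joint displacements into absolute poses. The abstract does not say what the movements are relative to, so treating them as offsets from the previous frame is an assumption, as are the function name `poses_from_relative` and the array shapes.

```python
import numpy as np

def poses_from_relative(initial_pose, deltas):
    """Recover absolute poses from predicted relative movements.

    initial_pose: (J, 2) joint coordinates of the starting frame.
    deltas: (T, J, 2) displacement of each joint from the previous frame
            (an assumed convention, not taken from the paper).
    Returns (T, J, 2) absolute coordinates for the T generated frames.
    """
    return initial_pose[None] + np.cumsum(deltas, axis=0)

start = np.array([[0.0, 0.0], [1.0, 1.0]])      # two joints
moves = np.tile(np.array([0.5, 0.0]), (3, 2, 1))  # each joint moves +0.5 in x per frame
trajectory = poses_from_relative(start, moves)
```

In a single-forward-pass model all `deltas` would come out at once, so the cumulative sum is a fixed post-processing step and not a loop that feeds predictions back into the network.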
[68] HumanMaterial: Human Material Estimation from a Single Image via Progressive Training
Yu Jiang,Jiahao Xia,Jiongming Qin,Yusen Wang,Tuo Cao,Chunxia Xiao
Main category: cs.CV
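TL;DR: The paper proposes HumanMaterial, a progressively trained model for estimating high-quality human materials from a single image, and builds a high-quality dataset (OpenHumanBRDF) to improve rendering realism, especially for skin.
Details
Motivation: Physically-based human inverse rendering is ill-posed because the material maps are not constrained, and the simplified material data and rendering methods of existing datasets limit the realism of rendered results, especially for skin.
Contribution: 1. OpenHumanBRDF, a high-quality dataset that supports more complex materials (such as displacement and subsurface scattering). 2. HumanMaterial, a progressively trained model, together with a Controlled PBR Rendering (CPR) loss, to improve material estimation.
Method: HumanMaterial first estimates the material maps with three prior models and then refines the results with a finetuning model. The CPR loss increases the weight of the materials being optimized while the prior models are trained.
Result: Experiments on the OpenHumanBRDF dataset and on real data show that the method achieves state-of-the-art performance in material estimation.
Insight: The progressive training strategy and the CPR loss balance the importance of the different material maps and improve the realism of rendered results.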
Abstract: Full-body human inverse rendering based on physically-based rendering aims to acquire high-quality materials, which helps achieve photo-realistic rendering under arbitrary illuminations. This task requires estimating multiple material maps and usually relies on a constraint on the rendering result. The absence of constraints on the material maps makes inverse rendering an ill-posed task. Previous works alleviated this problem by building material datasets for training, but their simplified material data and rendering equation lead to rendering results with limited realism, especially that of skin. To further alleviate this problem, we construct a higher-quality dataset (OpenHumanBRDF) based on scanned real data and statistical material data. In addition to the normal, diffuse albedo, roughness, and specular albedo, we produce displacement and subsurface scattering to enhance the realism of rendering results, especially for the skin. With the increase in prediction tasks for more materials, using an end-to-end model as in previous work struggles to balance the importance among various material maps, and leads to model underfitting. Therefore, we design a model (HumanMaterial) with a progressive training strategy to make full use of the supervision information of the material maps and improve the performance of material estimation. HumanMaterial first obtains the initial material results via three prior models, and then refines the results with a finetuning model. Prior models estimate different material maps, and each map has different significance for rendering results. Thus, we design a Controlled PBR Rendering (CPR) loss, which enhances the importance of the materials to be optimized during the training of prior models. Extensive experiments on the OpenHumanBRDF dataset and real data demonstrate that our method achieves state-of-the-art performance.
[69] Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
Simin Huo,Ning Li
Main category: cs.CV
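TL;DR: Iwin Transformer is a position-embedding-free hierarchical vision Transformer that combines interleaved window attention with depthwise separable convolution, can be fine-tuned directly from low to high resolution, and overcomes a limitation of Swin Transformer.
Details
Motivation: Swin Transformer needs two consecutive blocks to approximate global attention; the paper proposes a more efficient way to exchange information globally.
Contribution: Iwin Transformer, which combines interleaved window attention and depthwise separable convolution for efficient global information exchange and performs strongly on several vision tasks.
Method: Interleaved window attention connects distant tokens and depthwise separable convolution links neighboring tokens, so global information exchange completes within a single module.
Result: Reaches 87.4% top-1 accuracy on ImageNet-1K and performs strongly on semantic segmentation and video action recognition.
Insight: The core component of Iwin can serve as a standalone module that replaces self-attention, is effective in image generation, and may inspire future work such as 3D attention.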
Abstract: We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer’s limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
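One way to picture how interleaving lets a window reach distant tokens is the 1D grouping below, where each window collects tokens spaced evenly across the whole sequence. This is a simplified illustration under assumed conventions (1D tokens, the hypothetical name `interleaved_windows`); the released Iwin code works on 2D feature maps and may arrange tokens differently.

```python
import numpy as np

def interleaved_windows(tokens, window_size):
    """Group a (N, D) token sequence into interleaved windows.

    With G = N // window_size, window g holds tokens g, g + G, g + 2G, ...
    so every window spans the full sequence instead of a contiguous patch.
    Returns an array of shape (G, window_size, D).
    """
    n, d = tokens.shape
    if n % window_size != 0:
        raise ValueError("sequence length must be divisible by window_size")
    groups = n // window_size
    # (window_size, G, D): entry [r, g] is token r * G + g; swap the first two axes
    return tokens.reshape(window_size, groups, d).transpose(1, 0, 2)

tokens = np.arange(8).reshape(8, 1)
windows = interleaved_windows(tokens, window_size=4)
# windows[0] contains tokens 0, 2, 4, 6; windows[1] contains tokens 1, 3, 5, 7
```

Attention inside each such window links far-apart positions, while a depthwise convolution over the original layout links immediate neighbors, which matches the division of labor the abstract describes.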
[70] DCFFSNet: Deep Connectivity Feature Fusion Separation Network for Medical Image Segmentation
Xun Ye,Ruixiang Tang,Mingda Zhang,Jianglong Qin
Main category: cs.CV
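TL;DR: DCFFSNet is a deep connectivity feature fusion-separation network that quantifies the relative strength of connectivity features against other features and dynamically balances multi-scale feature expression, improving the accuracy and edge consistency of medical image segmentation.
Details
Motivation: Existing medical image segmentation methods usually inject connectivity features forcibly, which couples the feature spaces and leaves no standardized mechanism for quantifying the strength of different features, limiting segmentation quality.
Contribution: 1. A feature space decoupling strategy that quantifies the relative strength of connectivity features against other features; 2. a deep connectivity feature fusion-separation architecture that dynamically balances multi-scale feature expression.
Method: DCFFSNet dynamically decouples the feature space, quantifies feature strength, and fuses multi-scale information.
Result: On ISIC2018, DSB2018, and MoNuSeg, DCFFSNet outperforms existing mainstream methods on both Dice and IoU, with smoother edge transitions.
Insight: Dynamically quantifying feature strength and decoupling the feature space can significantly improve medical image segmentation, and is particularly suited to addressing segmentation fragmentation.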
Abstract: Medical image segmentation leverages topological connectivity theory to enhance edge precision and regional consistency. However, existing deep networks integrating connectivity often forcibly inject it as an additional feature module, resulting in coupled feature spaces with no standardized mechanism to quantify different feature strengths. To address these issues, we propose DCFFSNet (Dual-Connectivity Feature Fusion-Separation Network). It introduces an innovative feature space decoupling strategy. This strategy quantifies the relative strength between connectivity features and other features. It then builds a deep connectivity feature fusion-separation architecture. This architecture dynamically balances multi-scale feature expression. Experiments were conducted on the ISIC2018, DSB2018, and MoNuSeg datasets. On ISIC2018, DCFFSNet outperformed the next best model (CMUNet) by 1.3% (Dice) and 1.2% (IoU). On DSB2018, it surpassed TransUNet by 0.7% (Dice) and 0.9% (IoU). On MoNuSeg, it exceeded CSCAUNet by 0.8% (Dice) and 0.9% (IoU). The results demonstrate that DCFFSNet exceeds existing mainstream methods across all metrics. It effectively resolves segmentation fragmentation and achieves smooth edge transitions. This significantly enhances clinical usability.
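The reported gains are in Dice and IoU, so a short reference computation of both metrics for binary masks may help when reading the numbers. The helper name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are choices made here, not taken from the paper.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Two empty masks agree perfectly (a convention chosen for this sketch).
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)

dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single pair of masks the two are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.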
[71] Self-Supervised Ultrasound-Video Segmentation with Feature Prediction and 3D Localised Loss
Edward Ellis,Robert Mendel,Andrew Bulpitt,Nasim Parsa,Michael F Byrne,Sharib Ali
Main category: cs.CV
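TL;DR: The paper proposes a V-JEPA-based self-supervised learning method for ultrasound video segmentation that uses feature prediction and a 3D localisation loss to improve locality understanding in ViT models.
Details
Motivation: Ultrasound data is hard to acquire and annotate, requiring substantial time and expertise. Self-supervised learning can learn useful representations from unlabelled data and so improve segmentation when annotated data is limited.
Contribution: 1. The first application of the V-JEPA framework to ultrasound video data; 2. a new 3D localisation auxiliary task that strengthens locality in ViT representations.
Method: 1. Use V-JEPA for feature prediction, avoiding pixel-level reconstruction; 2. add a 3D localisation auxiliary task to improve locality understanding in the ViT.
Result: Across various frozen encoder configurations, the method significantly improves segmentation performance, with gains of up to 3.4% using 100% of the training data and up to 8.35% using 10%.
Insight: V-JEPA suits ultrasound imaging because it exploits temporal information effectively and is less sensitive to noise; the auxiliary task compensates for the ViT's weak locality.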
Abstract: Acquiring and annotating large datasets in ultrasound imaging is challenging due to low contrast, high noise, and susceptibility to artefacts. This process requires significant time and clinical expertise. Self-supervised learning (SSL) offers a promising solution by leveraging unlabelled data to learn useful representations, enabling improved segmentation performance when annotated data is limited. Recent state-of-the-art developments in SSL for video data include V-JEPA, a framework solely based on feature prediction, avoiding pixel level reconstruction or negative samples. We hypothesise that V-JEPA is well-suited to ultrasound imaging, as it is less sensitive to noisy pixel-level detail while effectively leveraging temporal information. To the best of our knowledge, this is the first study to adopt V-JEPA for ultrasound video data. Similar to other patch-based masking SSL techniques such as VideoMAE, V-JEPA is well-suited to ViT-based models. However, ViTs can underperform on small medical datasets due to lack of inductive biases, limited spatial locality and absence of hierarchical feature learning. To improve locality understanding, we propose a novel 3D localisation auxiliary task to improve locality in ViT representations during V-JEPA pre-training. Our results show V-JEPA with our auxiliary task improves segmentation performance significantly across various frozen encoder configurations, with gains up to 3.4% using 100% and up to 8.35% using only 10% of the training data.
[72] NLML-HPE: Head Pose Estimation with Limited Data via Manifold Learning
Mahdi Ghafourian,Federico M. Sukno
Main category: cs.CV
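TL;DR: The paper proposes NLML-HPE, a new deep learning method that uses non-linear manifold learning for head pose estimation with limited data, combining tensor decomposition with feedforward neural networks and formulating head pose estimation as a regression task.
Details
Motivation: Conventional head pose estimation methods usually need large amounts of annotated data, and real data often carries inaccurate annotations. The paper aims for efficient, accurate head pose estimation under limited data by using manifold learning and precisely generated 2D pose data.
Contribution: 1. A regression method combining tensor decomposition and neural networks that predicts continuous pose angles. 2. A precise 2D pose dataset generated by rotating 3D head models. 3. Real-time performance and strong results with limited data.
Method: Tucker decomposition separates the Euler angles (yaw, pitch, roll) into subspaces, each modeled as a cosine curve; a feedforward neural network learns the rotation manifold.
Result: Experiments show that the model predicts unseen data quickly with limited training data and outperforms conventional classification-based methods.
Insight: Treating head pose estimation as regression and combining it with manifold learning makes more efficient use of limited data and improves generalization.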
Abstract: Head pose estimation (HPE) plays a critical role in various computer vision applications such as human-computer interaction and facial recognition. In this paper, we propose a novel deep learning approach for head pose estimation with limited training data via non-linear manifold learning called NLML-HPE. This method is based on the combination of tensor decomposition (i.e., Tucker decomposition) and feedforward neural networks. Unlike traditional classification-based approaches, our method formulates head pose estimation as a regression problem, mapping input landmarks into a continuous representation of pose angles. To this end, our method uses tensor decomposition to split each Euler angle (yaw, pitch, roll) into separate subspaces and models each dimension of the underlying manifold as a cosine curve. We address two key challenges: 1. Almost all HPE datasets suffer from incorrect and inaccurate pose annotations. Hence, we generated a precise and consistent 2D head pose dataset for our training set by rotating 3D head models for a fixed set of poses and rendering the corresponding 2D images. 2. We achieved real-time performance with limited training data as our method accurately captures the nature of rotation of an object from facial landmarks. Once the underlying manifold for rotation around each axis is learned, the model is very fast in predicting unseen data. Our training and testing code is available online along with our trained models: https://github.com/MahdiGhafoorian/NLML_HPE.
[73] DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition
Haiyang Jiang,Songhao Piao,Chao Gao,Lei Yu,Liguo Chen
Main category: cs.CV
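TL;DR: The paper proposes DSFormer, a dual-scale cross-learning Transformer framework for visual place recognition (VPR) that combines dual-scale feature interaction with a new block clustering strategy, improving performance and reducing the amount of training data required.
Details
Motivation: Robustness to dynamic environments and viewpoint changes is a key challenge in visual place recognition. Conventional methods struggle to capture both semantic richness and spatial detail, and need large amounts of training data.
Contribution: 1. The DSFormer framework, which learns cross-scale features with a dual-scale Transformer; 2. a block clustering strategy that improves dataset partitioning and reduces the required training data by about 30%; 3. state-of-the-art performance on multiple benchmark datasets.
Method: 1. The DSFormer module combines self-attention (long-range dependencies within each scale) with cross-attention (cross-scale learning); 2. the block clustering strategy repartitions the SF-XL dataset from multiple perspectives.
Result: Outperforms existing methods such as DELG and Patch-NetVLAD on multiple VPR benchmarks while significantly improving computational efficiency.
Insight: Dual-scale feature interaction and dynamic data organization are effective ways to improve VPR robustness.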
Abstract: Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.
[74] PDB-Eval: An Evaluation of Large Multimodal Models for Description and Explanation of Personalized Driving Behavior
Junda Wu,Jessica Echterhoff,Kyungtae Han,Amr Abdelraouf,Rohit Gupta,Julian McAuley
Main category: cs.CV
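TL;DR: The paper introduces the PDB-Eval benchmark for evaluating how well large multimodal models (MLLMs) describe and explain personalized driving behavior, and uses the PDB-QA task to align them with the driving domain, substantially improving zero-shot performance.
Details
Motivation: Existing datasets are limited in describing and explaining driving behavior from external visual evidence, so a new approach is needed to improve MLLMs' driving comprehension and reasoning for personalized driving behavior analysis and risk assessment.
Contribution: 1. The PDB-Eval benchmark, comprising PDB-X and PDB-QA; 2. alignment of MLLMs with driving tasks through PDB-QA; 3. significant gains on zero-shot tasks and other driving-related tasks.
Method: 1. PDB-X evaluates MLLMs' understanding of temporal driving scenes; 2. PDB-QA, a visual explanation question-answering task, is used for MLLM instruction fine-tuning; 3. fine-tuning improves the model's performance in the driving domain.
Result: Zero-shot question-answering performance improves by up to 73.2%, turn intention prediction on Brain4Cars by up to 12.5%, and all AIDE tasks by up to 11.0%.
Insight: Fine-tuning on fine-grained descriptions and explanations bridges the gap between MLLMs and the driving domain while preserving generalizability.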
Abstract: Understanding a driver’s behavior and intentions is important for potential risk assessment and early accident prevention. Safety and driver assistance systems can be tailored to individual drivers’ behavior, significantly enhancing their effectiveness. However, existing datasets are limited in describing and explaining general vehicle movements based on external visual evidence. This paper introduces a benchmark, PDB-Eval, for a detailed understanding of Personalized Driver Behavior, and aligning Large Multimodal Models (MLLMs) with driving comprehension and reasoning. Our benchmark consists of two main components, PDB-X and PDB-QA. PDB-X can evaluate MLLMs’ understanding of temporal driving scenes. Our dataset is designed to find valid visual evidence from the external view to explain the driver’s behavior from the internal view. To align MLLMs’ reasoning abilities with driving tasks, we propose PDB-QA as a visual explanation question-answering task for MLLM instruction fine-tuning. As a generic learning task for generative models like MLLMs, PDB-QA can bridge the domain gap without harming MLLMs’ generalizability. Our evaluation indicates that fine-tuning MLLMs on fine-grained descriptions and explanations can effectively bridge the gap between MLLMs and the driving domain, which improves zero-shot performance on question-answering tasks by up to 73.2%. We further evaluate the MLLMs fine-tuned on PDB-X in Brain4Cars’ intention prediction and AIDE’s recognition tasks. We observe up to 12.5% performance improvements on the turn intention prediction task in Brain4Cars, and consistent performance improvements up to 11.0% on all tasks in AIDE.
[75] CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios using Gaussian Splatting
Haoran Xu,Saining Zhang,Peishuo Li,Baijun Ye,Xiaoxue Chen,Huan-ang Gao,Jv Zheng,Xiaowei Song,Ziqiao Peng,Run Miao,Jinrang Jia,Yifeng Shi,Guangqi Yi,Hang Zhao,Hao Tang,Hongyang Li,Kaicheng Yu,Hao Zhao
Main category: cs.CV
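TL;DR: CRUISE uses Gaussian Splatting for high-fidelity reconstruction and flexible editing of vehicle-to-everything (V2X) scenes, supports data augmentation and generation of challenging cases, and improves 3D detection and tracking performance.
Details
Motivation: The cooperative potential of V2X in autonomous driving is not yet fully exploited; CRUISE aims to fill the gap in simulation-based generation and augmentation of V2X data.
Contribution: 1) A V2X scene reconstruction and editing framework based on decomposed Gaussian Splatting; 2) support for multi-view rendering and data augmentation; 3) verified gains in 3D detection and tracking on the V2X-Seq benchmark.
Method: Decomposed Gaussian Splatting represents dynamic traffic participants as editable Gaussian models, allowing flexible scene modification and multi-view rendering.
Result: 1) High-fidelity reconstruction of V2X scenes; 2) improved 3D detection and cooperative tracking; 3) effective generation of challenging cases.
Insight: Gaussian Splatting applies not only to static scenes but also extends to editing and augmenting dynamic traffic participants, providing a practical tool for V2X data generation.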
Abstract: Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases.
[76] Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
Francesco Dalmonte,Emirhan Bayar,Emre Akbas,Mariana-Iuliana Georgescu
Main category: cs.CV
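TL;DR: The paper proposes the Q-Former Autoencoder, a Q-Former-based autoencoder framework for medical image anomaly detection that extracts features with pretrained vision foundation models (such as DINO, DINOv2, and Masked Autoencoder) and achieves strong performance without domain-specific fine-tuning.
Details
Motivation: Medical image anomaly detection is challenged by the diversity of anomalies and the scarcity of annotated data. Conventional methods train encoders from scratch, whereas the strong performance of pretrained vision foundation models on natural images motivates the authors to explore how well they generalize to medical tasks.
Contribution: 1. The Q-Former Autoencoder framework, which uses frozen pretrained vision foundation models directly as feature extractors; 2. a Q-Former architecture that controls the length of the reconstruction sequence and aggregates multi-scale features; 3. a perceptual loss that makes reconstructions more semantically meaningful.
Method: 1. Freeze pretrained vision foundation models as feature extractors; 2. use the Q-Former as the bottleneck; 3. compute a perceptual loss from features of a pretrained Masked Autoencoder.
Result: Evaluated on four medical anomaly detection benchmarks, the framework achieves state-of-the-art results on BraTS2021, RESC, and RSNA, supporting the ability of models pretrained on natural images to generalize to medical tasks.
Insight: Models pretrained on natural images can supply high-level semantic representations for medical tasks without domain-specific fine-tuning, and the Q-Former architecture performs well at multi-scale feature aggregation.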
Abstract: Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection by proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose using the Q-Former architecture as the bottleneck, which enables control of the length of the reconstruction sequence while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.
[77] A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears
Frauke Wilm,Luis Carlos Rivera Monroy,Mathias Öttl,Lukas Mürdter,Leonid Mill,Andreas Maier
Main category: cs.CV
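TL;DR: The paper presents an improved malaria dataset with detailed instance-level annotations in COCO format, validates it with Faster R-CNN, and shows the importance of high-quality annotated data for automated malaria detection.
Details
Motivation: In developing countries, malaria diagnosis relies heavily on accurate examination of Giemsa-stained blood smears, but existing datasets lack detailed instance-level annotations, which limits the adoption of deep learning-based automated detection.
Contribution: An enhanced version of a malaria dataset with detailed COCO-format annotations, with experiments confirming the quality of the resulting training data.
Method: A Faster R-CNN model is trained to detect infected and non-infected red blood cells as well as white blood cells, and its performance is assessed by cross-validation.
Result: Cross-validation on the original dataset yields F1 scores of up to 0.88 for infected cell detection, supporting the validity of the data.
Insight: Producing high-quality annotations requires combining automated annotation refinement with targeted manual correction, which is essential for robust detection models.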
Abstract: Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via GitHub: https://github.com/MIRA-Vision-Microscopy/malaria-thin-smear-coco.
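For readers unfamiliar with the format, the sketch below shows the general shape of a COCO detection file and how an F1 score follows from detection counts. The top-level keys and the `[x, y, width, height]` box convention are standard COCO; the file name, category names, and box values are invented placeholders and are not taken from the released dataset.

```python
# Minimal COCO-style detection record (all values are illustrative placeholders).
coco_example = {
    "images": [
        {"id": 1, "file_name": "smear_0001.png", "width": 1600, "height": 1200},
    ],
    "categories": [  # hypothetical category names
        {"id": 1, "name": "infected_rbc"},
        {"id": 2, "name": "uninfected_rbc"},
        {"id": 3, "name": "wbc"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [412.0, 230.0, 96.0, 88.0],  # x, y, width, height in pixels
            "area": 96.0 * 88.0,
            "iscrowd": 0,
        },
    ],
}

def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

example_f1 = f1_score(tp=88, fp=12, fn=12)
```

With 88 correct detections, 12 false alarms, and 12 misses, precision and recall are both 0.88, so the F1 score is 0.88 as well; the counts are made up to mirror the reported figure.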
[78] Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments
Xiao Yang,Lingxuan Wu,Lizhong Wang,Chengyang Ying,Hang Su,Jun Zhu
Main category: cs.CV
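TL;DR: The paper proposes Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that improves the robustness of visual perception in adversarial 3D environments through adaptive exploration and interaction with the environment. Supported by a multi-step objective and an uncertainty-oriented reward mechanism, it substantially reduces attack success rates while preserving standard task accuracy.
Details
Motivation: Adversarial attacks in 3D environments seriously threaten the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. Existing passive defenses rely on pre-defined assumptions about adversarial tactics and adapt poorly to dynamic environments, which calls for a proactive defense framework.
Contribution: The Rein-EAD framework, the first to combine reinforcement learning with adaptive environment interaction for active defense. Through a multi-step objective and an uncertainty-oriented reward mechanism, it markedly improves perception robustness in adversarial environments and generalizes well.
Method: Rein-EAD uses a multi-step objective that balances immediate prediction accuracy against minimizing predictive entropy, and an uncertainty-oriented reward mechanism that makes policy updates efficient and avoids high computational overhead.
Result: Experiments show that Rein-EAD substantially reduces attack success rates while preserving standard accuracy across diverse tasks, and generalizes to unseen and adaptive attacks.
Insight: The paper highlights the importance of active interaction and adaptive strategies in defending against adversarial attacks, especially in complex, dynamic 3D environments. The framework offers a new direction for designing robust visual perception systems.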
Abstract: Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. These attacks employ adversarial patches and 3D objects to manipulate deep neural network (DNN) predictions by exploiting vulnerabilities within complex scenes. Existing defense mechanisms, such as adversarial training and purification, primarily employ passive strategies to enhance robustness. However, these approaches often rely on pre-defined assumptions about adversarial tactics, limiting their adaptability in dynamic 3D settings. To address these challenges, we introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment to improve perception robustness in 3D adversarial contexts. By implementing a multi-step objective that balances immediate prediction accuracy with predictive entropy minimization, Rein-EAD optimizes defense strategies over a multi-step horizon. Additionally, Rein-EAD involves an uncertainty-oriented reward-shaping mechanism that facilitates efficient policy updates, thereby reducing computational overhead and supporting real-world applicability without the need for differentiable environments. Comprehensive experiments validate the effectiveness of Rein-EAD, demonstrating a substantial reduction in attack success rates while preserving standard accuracy across diverse tasks. Notably, Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks, including 3D object classification, face recognition and autonomous driving.
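The trade-off between immediate accuracy and predictive entropy over several interaction steps can be sketched numerically. The exact objective used by Rein-EAD is not given in the abstract, so the form below (mean over steps of cross-entropy plus a weighted entropy term), the function names, and the `entropy_weight` parameter are assumptions for illustration only.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def multi_step_objective(step_probs, label, entropy_weight=0.5):
    """Average, over T steps, of cross-entropy plus weighted predictive entropy.

    step_probs: (T, K) class probabilities after each interaction step.
    label: index of the true class.
    Lower is better: predictions should be both correct and confident.
    """
    step_probs = np.asarray(step_probs, dtype=float)
    cross_entropy = -np.log(np.clip(step_probs[:, label], 1e-12, 1.0))
    entropy = np.array([predictive_entropy(p) for p in step_probs])
    return float(np.mean(cross_entropy + entropy_weight * entropy))

uncertain = multi_step_objective([[0.5, 0.5], [0.5, 0.5]], label=0)
improving = multi_step_objective([[0.5, 0.5], [0.9, 0.1]], label=0)
```

A trajectory whose later views sharpen the prediction toward the true class scores lower than one that stays uncertain, which is the behavior an exploration policy would be rewarded for under this assumed objective.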
[79] Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention
João Luzio,Alexandre Bernardino,Plinio Moreno
Main category: cs.CV
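TL;DR: The paper presents SemBA-FAST, a framework that combines deep object detection with a probabilistic semantic fusion mechanism to predict human attention scanpaths in target-present visual search. It performs well on the COCO-Search18 dataset, surpassing baselines and other top-down models.
Details
Motivation: In goal-directed visual tasks, human attention is guided by both top-down and bottom-up cues, and foveal vision is crucial for directing attention efficiently. Existing work has advanced by using scanpath data, but how to combine semantics with dynamic foveal updates to improve attention prediction remains open.
Contribution: The SemBA-FAST framework, which integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically, and updates top-down knowledge sequentially through pre-trained detectors and artificial foveation to improve scanpath prediction.
Method: A pre-trained object detector is combined with a probabilistic semantic fusion mechanism to generate attention maps dynamically, and artificial foveation updates the attention prediction step by step. The approach is validated on COCO-Search18.
Result: SemBA-FAST produces fixation sequences that closely match human ground-truth scanpaths, surpasses baselines and other top-down models, and in some cases competes with scanpath-informed models.
Insight: A framework combining semantics with foveal probabilistic modelling can model human attention behavior effectively, with implications for real-time cognitive computing and robotics.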
Abstract: In goal-directed visual tasks, human perception is guided by both top-down and bottom-up cues. At the same time, foveal vision plays a crucial role in directing attention efficiently. Modern research on bio-inspired computational attention models has taken advantage of advancements in deep learning by utilizing human scanpath data to achieve new state-of-the-art performance. In this work, we assess the performance of SemBA-FAST, i.e. Semantic-based Bayesian Attention for Foveal Active visual Search Tasks, a top-down framework designed for predicting human visual attention in target-present visual search. SemBA-FAST integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically, leveraging pre-trained detectors and artificial foveation to update top-down knowledge and improve fixation prediction sequentially. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. Our methodology achieves fixation sequences that closely match human ground-truth scanpaths. Notably, it surpasses baseline and other top-down approaches and competes, in some cases, with scanpath-informed models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling, with implications for real-time cognitive computing and robotics.
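A generic Bayesian update over a grid of candidate locations gives a feel for how sequential fusion can drive fixation selection. This is a textbook prior-times-likelihood sketch, not SemBA-FAST's actual fusion rule: the names `fuse_observation` and `next_fixation`, the grid representation, and the use of detector scores as likelihoods are assumptions, and foveation (resolution falling off away from the fixation) is left out.

```python
import numpy as np

def fuse_observation(prior, likelihood):
    """Posterior belief over target location: normalized prior * likelihood."""
    posterior = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    total = posterior.sum()
    if total <= 0:
        # No evidence survives: fall back to a uniform belief.
        return np.full(posterior.shape, 1.0 / posterior.size)
    return posterior / total

def next_fixation(belief):
    """Grid cell with the highest belief, as a (row, column) tuple."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(belief), belief.shape))

prior = np.full((2, 2), 0.25)                    # nothing known before the first glance
detector_scores = np.array([[0.1, 0.1],
                            [0.1, 0.7]])         # hypothetical target-class scores
belief = fuse_observation(prior, detector_scores)
fixation = next_fixation(belief)
```

Repeating the update after each new fixation, with the previous posterior as the next prior, yields a fixation sequence that can be compared against human scanpaths.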
[80] Explaining How Visual, Textual and Multimodal Encoders Share Concepts
Clément Cornet,Romaric Besançon,Hervé Le Borgne
Main category: cs.CV
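TL;DR: The paper proposes a new method that uses features extracted by sparse autoencoders (SAEs) to compare models quantitatively across modalities (visual, textual, and multimodal), and introduces an indicator of how much features are shared.
Details
Motivation: Prior studies mostly compare models within the same modality, whereas the rise of multimodal models makes cross-modal feature comparison an important research direction.
Contribution: 1. A new indicator for quantitative comparison of models across modalities; 2. a tool for quantifying how much individual features are shared; 3. a comprehensive analysis of 21 encoders of different sizes and types.
Method: Sparse autoencoders extract features; a quantitative comparison indicator and a feature-sharedness measure are then used to analyze visual, textual, and multimodal encoders.
Result: Visual features that are specific to vision-language models (VLMs) are found to be shared with text encoders, indicating the influence of text pretraining on visual features.
Insight: Training in a multimodal context increases cross-modal feature sharing, and text pretraining contributes substantially to the visual features that are learned.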
Abstract: Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural network activations. Previous works compared different models based on SAE-derived features, but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain-specific datasets. The results make it possible to revisit previous studies in light of encoders trained in a multimodal context and to quantify to what extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts
[81] Towards Large Scale Geostatistical Methane Monitoring with Part-based Object Detection
Adhemar de Senneville,Xavier Bou,Thibaud Ehret,Rafael Grompone,Jean Louis Bonne,Nicolas Dumelie,Thomas Lauvaux,Gabriele Facciolo
Main category: cs.CV
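TL;DR: The paper proposes a part-based object detection method for detecting rare objects (such as bio-digesters) over large geographic areas, and uses geostatistical methods to estimate methane emissions.
Details
Motivation: Remote sensing data is voluminous and the targets are rare; existing methods struggle to detect rare objects efficiently, although this is crucial for applications such as large-scale environmental monitoring.
Contribution: 1. A new dataset containing bio-digesters; 2. a part-based object detection method that improves detection by attending to sub-structures; 3. geostatistical estimation of methane emissions.
Method: A part-based detector identifies bio-digester sub-elements to boost initial detections; it is then applied to new regions to build an inventory and estimate methane emissions.
Result: The method performs well on rare-object detection and is applied to estimating methane emissions from bio-digesters in France.
Insight: Attending to an object's sub-structures can improve detection when data is scarce, offering a new approach to large-scale environmental monitoring.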
Abstract: Object detection is one of the main applications of computer vision in remote sensing imagery. Despite its increasing availability, the sheer volume of remote sensing data poses a challenge when detecting rare objects across large geographic areas. Paradoxically, this common challenge is crucial to many applications, such as estimating environmental impact of certain human activities at scale. In this paper, we propose to address the problem by investigating the methane production and emissions of bio-digesters in France. We first introduce a novel dataset containing bio-digesters, with small training and validation sets, and a large test set with a high imbalance towards observations without objects since such sites are rare. We develop a part-based method that considers essential bio-digester sub-elements to boost initial detections. To this end, we apply our method to new, unseen regions to build an inventory of bio-digesters. We then compute geostatistical estimates of the quantity of methane produced that can be attributed to these infrastructures in a given area at a given time.
[82] Object segmentation in the wild with foundation models: application to vision assisted neuro-prostheses for upper limbs
Bolutife Atoki,Jenny Benois-Pineau,Renaud Péteri,Fabien Baldacci,Aymar de Rugy
Main category: cs.CV
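TL;DR: The paper proposes an object segmentation method built on foundation models (such as SAM), in which prompts generated from gaze fixations guide the segmentation without fine-tuning on specific objects; the target application is vision assistance for upper-limb neuroprostheses.
Details
Motivation: The study explores how well foundation models segment objects in complex real-world scenes, particularly for vision-assisted neuroprostheses, where the segmentation task is very difficult.
Contribution: 1. Uses gaze fixations to generate segmentation prompts for SAM; 2. Fine-tunes SAM on egocentric visual data; 3. Validates the method on a real-world dataset.
Method: 1. Generate prompts from gaze fixations to guide SAM segmentation; 2. Fine-tune the model on the Grasping-in-the-Wild dataset; 3. Evaluate the improvement in segmentation quality.
Result: On the Grasping-in-the-Wild dataset, the IoU segmentation quality metric improves by up to 0.51 points.
Insight: Foundation models generalize well to complex real-world scenes, and simple prompt generation combined with fine-tuning can substantially improve segmentation.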
Abstract: In this work, we address the problem of semantic object segmentation using foundation models. We investigate whether foundation models, trained on a large number and variety of objects, can perform object segmentation without fine-tuning on specific images containing everyday objects, but in highly cluttered visual scenes. The “in the wild” context is driven by the target application of vision-guided upper limb neuroprostheses. We propose a method for generating prompts based on gaze fixations to guide the Segment Anything Model (SAM) in our segmentation scenario, and fine-tune it on egocentric visual data. Evaluation results of our approach show an improvement of the IoU segmentation quality metric by up to 0.51 points on challenging real-world data from the Grasping-in-the-Wild corpus, which is made available on the RoboFlow Platform (https://universe.roboflow.com/iwrist/grasping-in-the-wild).
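The IoU metric cited above can be illustrated with a minimal sketch. The nested-list mask format, the function name `iou`, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 3x3 masks (hypothetical data).
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 0],
          [0, 1, 0]]

# 3 overlapping pixels / 5 pixels in the union = 0.6
print(iou(pred, target))
```

Because IoU lies in [0, 1], a gain of 0.51 points is a large absolute change in mask overlap.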
[83] GaussianFusionOcc: A Seamless Sensor Fusion Approach for 3D Occupancy Prediction Using 3D Gaussians
Tomislav Pavković,Mohammad-Ali Nikouei Mahani,Johannes Niedermayer,Johannes Betz
Main category: cs.CV
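TL;DR: GaussianFusionOcc proposes a multi-modal sensor fusion method based on a 3D Gaussian representation for 3D semantic occupancy prediction; its attention mechanism and efficient representation improve prediction accuracy and speed.
Details
Motivation: 3D semantic occupancy prediction is crucial for autonomous driving, but existing methods rely on dense grid representations that are inefficient in memory and computation. Multi-modal sensor fusion can make predictions more reliable, yet conventional methods struggle to fuse modalities efficiently.
Contribution: 1. Uses a 3D Gaussian representation to improve memory efficiency and inference speed; 2. Proposes a modality-agnostic deformable attention mechanism for seamless fusion of multi-sensor data; 3. Validates the generality of the method across various sensor combinations.
Method: Combines the 3D Gaussian representation with deformable attention, refining Gaussian properties through fusion of multiple sensors (camera, LiDAR, radar) to model the environment accurately.
Result: GaussianFusionOcc outperforms current state-of-the-art models, demonstrating the benefits of multi-sensor fusion and the Gaussian representation.
Insight: 3D Gaussian representations are a promising direction for efficient 3D occupancy prediction, and deformable attention offers a flexible, unified solution for multi-modal fusion.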
Abstract: 3D semantic occupancy prediction is one of the crucial tasks of autonomous driving. It enables precise and safe interpretation and navigation in complex environments. Reliable predictions rely on effective sensor fusion, as different modalities can contain complementary information. Unlike conventional methods that depend on dense grid representations, our approach, GaussianFusionOcc, uses semantic 3D Gaussians alongside an innovative sensor fusion mechanism. Seamless integration of data from camera, LiDAR, and radar sensors enables more precise and scalable occupancy prediction, while 3D Gaussian representation significantly improves memory efficiency and inference speed. GaussianFusionOcc employs modality-agnostic deformable attention to extract essential features from each sensor type, which are then used to refine Gaussian properties, resulting in a more accurate representation of the environment. Extensive testing with various sensor combinations demonstrates the versatility of our approach. By leveraging the robustness of multi-modal fusion and the efficiency of Gaussian representation, GaussianFusionOcc outperforms current state-of-the-art models.
[84] IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning
Tianheng Qiu,Jingchun Gao,Jingyu Li,Huiyi Leong,Xuan Huang,Xi Wang,Xiaocheng Zhang,Kele Xu,Lan Zhang
Main category: cs.CV
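TL;DR: IntentVCNet combines temporal and spatial understanding knowledge to bridge the spatio-temporal gap that large visual language models face in fine-grained, intention-controlled video captioning.
Details
Motivation: Existing large visual language models (LVLMs) have spatial and temporal understanding capabilities, but cannot exercise fine-grained spatial control over time sequences in direct response to instructions, which makes intention-oriented video captioning difficult.
Contribution: 1. Proposes a prompt combination strategy that models the implicit relationship between user intent and video sequences; 2. Designs a parameter-efficient box adapter that augments object semantic information in the global visual context.
Method: 1. Use the prompt combination strategy to connect user intent with the video; 2. Use the box adapter to give visual tokens prior information about the intent.
Result: Experiments show that the method achieves state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge.
Insight: Combining prompting with an adapter can substantially improve an LVLM's ability to model spatial details in video sequences, enabling accurate intention-oriented captions.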
Abstract: Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs have demonstrated proficiency in spatial and temporal understanding separately, they are not able to perform fine-grained spatial control over time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. To this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments the object semantic information in the global visual context so that the visual tokens have a priori information about the user intent. Experiments show that the combination of the two strategies further enhances the LVLM’s ability to model spatial details in video sequences and helps LVLMs accurately generate controlled intent-oriented captions. Our proposed method achieved state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.
[85] COT-AD: Cotton Analysis Dataset
Akbar Ali,Mahek Vyas,Soumyaratna Debnath,Chanda Grover Kamra,Jaidev Sanjay Khalane,Reuben Shibu Devanesan,Indra Deep Mastan,Subramanian Sankaranarayanan,Pankaj Khanna,Shanmuganathan Raman
Main category: cs.CV
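TL;DR: The paper introduces COT-AD, a comprehensive dataset for enhancing cotton crop analysis through computer vision, comprising over 25,000 images, 5,000 of which are annotated.
Details
Motivation: Few datasets exist for cotton crops, which is not enough to support data-driven agricultural management; COT-AD fills this gap.
Contribution: Provides COT-AD, a cotton crop dataset supporting multiple tasks, covering pest and disease recognition, vegetation analysis, and weed analysis.
Method: Images are captured throughout the cotton growth cycle using aerial imagery and high-resolution DSLR photography, then annotated to support classification, segmentation, image restoration, and other tasks.
Result: The resulting dataset contains over 25,000 images, including 5,000 annotated images, and provides a rich resource for cotton research.
Insight: COT-AD can advance data-driven cotton crop management, with particular potential for early disease management and applications of generative models.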
Abstract: This paper presents COT-AD, a comprehensive dataset designed to enhance cotton crop analysis through computer vision. Comprising over 25,000 images captured throughout the cotton growth cycle, with 5,000 annotated images, COT-AD includes aerial imagery for field-scale detection and segmentation and high-resolution DSLR images documenting key diseases. The annotations cover pest and disease recognition, vegetation, and weed analysis, addressing a critical gap in cotton-specific agricultural datasets. COT-AD supports tasks such as classification, segmentation, image restoration, enhancement, deep generative model-based cotton crop synthesis, and early disease management, advancing data-driven crop management.
[86] TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen,Ruihang Chu,Yukang Chen,Shiwei Zhang,Yujie Wei,Yingya Zhang,Xihui Liu
Main category: cs.CV
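TL;DR: TTS-VAR is a test-time scaling framework for visual auto-regressive generation models that improves generation quality and efficiency through an adaptive batch size schedule combined with scale-specific search strategies.
Details
Motivation: Scaling visual generation models usually requires substantial training and computational resources. Test-time scaling has attracted attention for its resource efficiency and promising performance, but no general framework existed for visual auto-regressive models.
Contribution: Proposes TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, which improves generation quality through an adaptive batch size schedule and multi-scale generation strategies.
Method: The framework models generation as a path searching problem and introduces an adaptive descending batch size schedule. At coarse scales it uses clustering-based diversity search to preserve structural variety; at fine scales it uses resampling-based potential selection, with potential scores computed from the multi-scale generation history.
Result: Experiments on the VAR model Infinity show an 8.7% improvement in GenEval score (from 0.69 to 0.75).
Insight: Early-stage structural features strongly influence final generation quality, and the effectiveness of resampling varies across generation scales.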
Abstract: Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. Besides, inspired by VAR’s hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard for evaluation, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) In fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
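As a rough illustration of resampling-based potential selection, the sketch below draws a smaller batch of candidates with probability proportional to the exponential of their potential scores. The softmax weighting, the `temperature` parameter, and the function name are assumptions made for illustration; the abstract says only that potential scores are reward functions incorporating multi-scale generation history, so the scores here are plain inputs.

```python
import math
import random


def resample_by_potential(candidates, potentials, k, temperature=1.0, seed=0):
    """Draw k candidates (with replacement), favoring higher potential scores.

    Weights are exp((score - max_score) / temperature); subtracting the
    maximum keeps the exponentials numerically stable.
    """
    rng = random.Random(seed)
    top = max(potentials)
    weights = [math.exp((s - top) / temperature) for s in potentials]
    return rng.choices(candidates, weights=weights, k=k)


# Hypothetical partial generations and their potential scores at a fine scale.
candidates = ["sample_a", "sample_b", "sample_c", "sample_d"]
potentials = [0.10, 0.90, 0.20, 0.85]

# Shrinking the batch from 4 to 2 mimics a descending batch size schedule.
print(resample_by_potential(candidates, potentials, k=2, temperature=0.1))
```

A lower temperature concentrates the batch on the highest-scoring candidates, while a higher one keeps more variety; that trade-off is a design choice of this sketch, not something reported in the paper.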
[87] A 3D Cross-modal Keypoint Descriptor for MR-US Matching and Registration
Daniil Morozov,Reuben Dorent,Nazim Haouchine
Main category: cs.CV
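TL;DR: The paper proposes a new 3D cross-modal keypoint descriptor for matching and registering MRI and iUS images, trained with a patient-specific matching-by-synthesis approach and contrastive learning, which improves registration accuracy and robustness.
Details
Motivation: Registering MRI with iUS images is critical during surgery, but the large differences between the two modalities (such as resolution and field of view) cause existing methods to perform poorly. The paper aims to address this problem.
Contribution: 1. Proposes a new 3D cross-modal keypoint descriptor; 2. Trains it with contrastive learning on synthetic data; 3. Combines probabilistic keypoint detection with dynamic hard negative mining to improve the robustness and accuracy of registration.
Method: 1. Train the cross-modal descriptor on patient-specific synthetic iUS data; 2. Use contrastive learning with a curriculum-based triplet loss and dynamic hard negative mining; 3. Perform rigid registration from keypoint matches.
Result: On the ReMIND dataset, the method reaches an average keypoint-matching precision of 69.8% across 11 patients, outperforming state-of-the-art keypoint matching methods, and achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark.
Insight: Training a cross-modal descriptor on synthetic data can markedly improve registration performance, and the method is interpretable and robust.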
Abstract: Intraoperative registration of real-time ultrasound (iUS) to preoperative Magnetic Resonance Imaging (MRI) remains an unsolved problem due to severe modality-specific differences in appearance, resolution, and field-of-view. To address this, we propose a novel 3D cross-modal keypoint descriptor for MRI-iUS matching and registration. Our approach employs a patient-specific matching-by-synthesis approach, generating synthetic iUS volumes from preoperative MRI. This enables supervised contrastive training to learn a shared descriptor space. A probabilistic keypoint detection strategy is then employed to identify anatomically salient and modality-consistent locations. During training, a curriculum-based triplet loss with dynamic hard negative mining is used to learn descriptors that are i) robust to iUS artifacts such as speckle noise and limited coverage, and ii) rotation-invariant. At inference, the method detects keypoints in MR and real iUS images and identifies sparse matches, which are then used to perform rigid registration. Our approach is evaluated using 3D MRI-iUS pairs from the ReMIND dataset. Experiments show that our approach outperforms state-of-the-art keypoint matching methods across 11 patients, with an average precision of 69.8%. For image registration, our method achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark. Compared to existing iUS-MR registration approaches, our framework is interpretable, requires no manual initialization, and shows robustness to iUS field-of-view variation. Code is available at https://github.com/morozovdd/CrossKEY.
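The triplet loss with hard negative mining mentioned above can be sketched generically as follows. This is a textbook batch-hard formulation under stated assumptions, not the authors' implementation: descriptors are plain lists, `anchors[i]` (for example an MR descriptor) matches `positives[i]` (the corresponding synthetic iUS descriptor), every other entry of `positives` serves as a negative, and the curriculum schedule is omitted.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_negative_triplet_loss(anchors, positives, margin=0.5):
    """Mean triplet loss using the closest non-matching descriptor as negative.

    Requires at least two pairs so that each anchor has a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        d_pos = l2(anchor, positives[i])
        # Hard negative: the nearest descriptor that is NOT the true match.
        d_neg = min(l2(anchor, positives[j])
                    for j in range(len(positives)) if j != i)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses)


# Hypothetical 3-D descriptors for two keypoints in each modality.
mr_desc = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
ius_desc = [[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]]

print(hard_negative_triplet_loss(mr_desc, ius_desc, margin=0.5))
print(hard_negative_triplet_loss(mr_desc, ius_desc, margin=1.0))
```

With the smaller margin every triplet is already satisfied and contributes nothing; raising the margin makes the same descriptors incur a positive loss.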
[88] VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding
Baoyao Yang,Wanyun Li,Dixin Chen,Junxiang Chen,Wenbin Yao,Haifeng Lin
Main category: cs.CV
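TL;DR: The paper introduces VideoMind, an omni-modal dataset focused on deep cognition of video content, with hierarchical textual descriptions (factual, abstract, intent) and intent expressions generated by a Chain-of-Thought approach, supporting evaluation on downstream tasks.
Details
Motivation: Existing video datasets lack support for deep cognition of video content, especially intent-level expressions, which hinders progress in multi-modal video understanding.
Contribution: 1) Proposes the first video dataset with hierarchical intent expressions; 2) Uses a Chain-of-Thought approach to generate deep-cognitive descriptions; 3) Provides a gold-standard benchmark.
Method: Videos are annotated with hierarchical textual descriptions (factual, abstract, intent), intent expressions are generated with a Chain-of-Thought approach, and hybrid-cognitive retrieval experiments are designed to evaluate model performance.
Result: Evaluation results for several models (e.g., InternVideo, VAST) are released, demonstrating the dataset's usefulness.
Insight: Intent expressions require integrating context across the whole video rather than surface observation alone, which is essential for deep video understanding tasks such as emotion recognition.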
Abstract: This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio is described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind’s key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (COT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, https://github.com/cdx-cindy/VideoMind.
[89] Deep Learning-Based Age Estimation and Gender Classification for Targeted Advertisement
Muhammad Imran Zaman,Nisar Ahmed
Main category: cs.CV
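TL;DR: The paper proposes a deep learning approach for simultaneous age and gender classification from facial images, aimed at improving targeted advertising. A custom CNN architecture optimized for both tasks learns shared representations, which improves performance.
Details
Motivation: Existing age and gender classification methods usually treat the two tasks independently and ignore the correlation between them. The paper aims to improve classification and estimation accuracy by jointly learning shared representations.
Contribution: 1. Proposes an optimized CNN architecture that jointly learns shared representations for age and gender; 2. Validates performance on a large, diverse dataset; 3. Analyzes age estimation in depth across age groups.
Method: A custom CNN architecture is trained jointly on the age and gender tasks, with data pre-processing and shared representation learning to improve robustness.
Result: Gender classification accuracy reaches 95%, and the mean absolute error for age estimation is 5.77 years. Age estimates for younger individuals are biased and need targeted improvement.
Insight: Joint learning is more effective than treating the tasks independently; age estimation for younger individuals still needs work; the choice of CNN architecture and hyperparameters has a key impact on performance.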
Abstract: This paper presents a novel deep learning-based approach for simultaneous age and gender classification from facial images, designed to enhance the effectiveness of targeted advertising campaigns. We propose a custom Convolutional Neural Network (CNN) architecture, optimized for both tasks, which leverages the inherent correlation between age and gender information present in facial features. Unlike existing methods that often treat these tasks independently, our model learns shared representations, leading to improved performance. The network is trained on a large, diverse dataset of facial images, carefully pre-processed to ensure robustness against variations in lighting, pose, and image quality. Our experimental results demonstrate a significant improvement in gender classification accuracy, achieving 95%, and a competitive mean absolute error of 5.77 years for age estimation. Critically, we analyze the performance across different age groups, identifying specific challenges in accurately estimating the age of younger individuals. This analysis reveals the need for targeted data augmentation and model refinement to address these biases. Furthermore, we explore the impact of different CNN architectures and hyperparameter settings on the overall performance, providing valuable insights for future research.
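The per-age-group error analysis described above can be reproduced in spirit with a small sketch that computes the mean absolute error (MAE) separately for each age bin. The bin edges and the toy ages are hypothetical and chosen only for illustration; the abstract does not state which groups the authors used.

```python
from collections import defaultdict


def mae_by_age_group(true_ages, pred_ages, bins=((0, 17), (18, 39), (40, 120))):
    """Mean absolute error of age predictions, grouped by true-age bin.

    `bins` holds inclusive (low, high) ranges; these are assumed values.
    """
    errors = defaultdict(list)
    for t, p in zip(true_ages, pred_ages):
        for lo, hi in bins:
            if lo <= t <= hi:
                errors[(lo, hi)].append(abs(t - p))
                break
    return {group: sum(e) / len(e) for group, e in errors.items()}


# Hypothetical ground-truth and predicted ages.
true_ages = [5, 10, 25, 30, 60]
pred_ages = [12, 15, 27, 29, 55]

for group, mae in mae_by_age_group(true_ages, pred_ages).items():
    print(group, mae)
```

Reporting MAE per group rather than only the overall figure is what exposes a bias such as the larger error on younger individuals noted in the abstract.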
[90] Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
Yanzuo Lu,Yuxi Ren,Xin Xia,Shanchuan Lin,Xing Wang,Xuefeng Xiao,Andy J. Ma,Xiaohua Xie,Jian-Huang Lai
Main category: cs.CV
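TL;DR: The paper proposes Adversarial Distribution Matching (ADM), a new method for distilling diffusion models for efficient image and video generation. By combining adversarial training with diffusion-based discriminators, ADM avoids the mode collapse that reverse KL divergence matching can induce, and performs strongly in both one-step and multi-step distillation.
Details
Motivation: Conventional Distribution Matching Distillation (DMD) relies on minimizing the reverse KL divergence, which can lead to mode collapse. To overcome this drawback, the authors propose ADM, which uses adversarial training to align the latent predictions of the real and fake score estimators.
Contribution: 1. Proposes the ADM framework, which improves diffusion model distillation through adversarial training; 2. Uses hybrid discriminators in latent and pixel spaces to improve one-step distillation; 3. Proposes the unified DMDX pipeline, which outperforms DMD2.
Method: 1. Use diffusion-based discriminators to align the latent predictions of the real and fake score estimators; 2. Introduce a distributional loss in the pre-training stage to improve initialization; 3. Combine adversarial distillation pre-training with ADM fine-tuning into the DMDX pipeline.
Result: DMDX achieves better one-step performance on SDXL than DMD2 while consuming less GPU time; multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX sets a new benchmark for efficient image and video synthesis.
Insight: Adversarial training can avoid mode collapse in diffusion model distillation, and joint use of latent-space and pixel-space discriminators can improve generation quality and efficiency.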
Abstract: Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates the distributional loss on ODE pairs collected from the teacher model, thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed DMDX, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
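The mode-seeking tendency of reverse KL minimization noted above can be seen in a small discrete example. The three-bin distributions below are toy values made up for illustration and have nothing to do with the paper's models: `p` is a bimodal target, `q_mode` covers one mode only, and `q_spread` is uniform.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions given as lists of probabilities."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


p = [0.495, 0.01, 0.495]      # bimodal "real" distribution (toy values)
q_mode = [0.98, 0.01, 0.01]   # student that collapses onto one mode
q_spread = [1 / 3, 1 / 3, 1 / 3]  # student that covers everything

# Reverse KL, KL(q || p): the collapsed student scores better (lower).
print("reverse KL:", kl(q_mode, p), kl(q_spread, p))

# Forward KL, KL(p || q): the spread-out student scores better (lower).
print("forward KL:", kl(p, q_mode), kl(p, q_spread))
```

Reverse KL penalizes the student mainly where it places mass that the target lacks, so dropping a mode is cheap; this is the behavior the abstract refers to as mode collapse or mode-seeking.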
[91] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
Xuecheng Bai,Yuxiang Wang,Boyu Hu,Qinyuan Jie,Chuanzhi Xu,Hongru Xiao,Kechen Li,Vera Chung
Main category: cs.CV
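TL;DR: DRWKV proposes a new low-light image enhancement method that combines the Global Edge Retinex theory, a spiral-scanning attention mechanism, and the Bilateral Spectrum Aligner to improve edge preservation and image quality.
Details
Motivation: Under low-light conditions, image enhancement struggles to preserve edges and structural details, and existing methods have difficulty maintaining continuity and detail under extreme illumination degradation.
Contribution: 1. Proposes the GER theory to decouple illumination and edges; 2. Designs Evolving WKV Attention, a spiral-scanning attention mechanism; 3. Introduces Bi-SAB and MS2-Loss to jointly align luminance and chrominance features.
Method: Combines the GER theory, Evolving WKV Attention, and Bi-SAB, and optimizes the model with MS2-Loss.
Result: On five low-light image enhancement benchmarks, DRWKV achieves leading PSNR, SSIM, and NIQE with low computational complexity, and also improves performance on low-light multi-object tracking.
Insight: Decoupling illumination from edge structure and modeling spatial continuity can improve the quality and practical usefulness of low-light images.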
Abstract: Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value), which integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Secondly, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Thirdly, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS2-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.
[92] SIDA: Synthetic Image Driven Zero-shot Domain Adaptation
Ye-Chan Kim,SeungJu Cha,Si-Woo Kim,Taewhan Kim,Dong-Jin Kim
Main category: cs.CV
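TL;DR: SIDA proposes a zero-shot domain adaptation method based on synthetic images, generating diverse synthetic images to capture fine-grained style features of the target domain, which improves adaptation efficiency and reduces time cost.
Details
Motivation: Existing zero-shot domain adaptation methods rely on text descriptions, which struggle to capture complex real-world variation and are time-consuming, so the paper explores synthetic images as a source of finer-grained style cues.
Contribution: 1. Proposes SIDA, which uses synthetic images to drive zero-shot domain adaptation; 2. Introduces the Domain Mix and Patch Style Transfer modules to expand intra-domain representation diversity; 3. Achieves state-of-the-art performance in challenging domains with much higher efficiency.
Method: 1. Generate detailed source-like images and translate them to reflect the target-domain style; 2. Use Domain Mix to blend multiple styles; 3. Use Patch Style Transfer to assign different styles to individual image patches.
Result: SIDA achieves state-of-the-art performance across diverse zero-shot adaptation scenarios, particularly in challenging domains, while significantly reducing adaptation time.
Insight: Synthetic images capture real-world variation better than text, and modular designs such as style mixing and patch-wise style transfer are key to improving domain adaptation performance.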
Abstract: Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP’s embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.
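One way to picture Domain Mix and Patch Style Transfer is through feature statistics: treat a style as a (mean, standard deviation) pair, blend two styles by convex combination, and re-normalize each patch of content features to its assigned style. The abstract does not specify the formulation, so this AdaIN-style sketch, the 1-D feature lists, and all function names are assumptions made purely for illustration.

```python
import statistics


def style_stats(features):
    """Style of a feature vector, summarized as (mean, population std)."""
    return statistics.fmean(features), statistics.pstdev(features)


def domain_mix(style_a, style_b, alpha):
    """Convex combination of two styles; alpha=1 returns style_a."""
    (mean_a, std_a), (mean_b, std_b) = style_a, style_b
    return (alpha * mean_a + (1 - alpha) * mean_b,
            alpha * std_a + (1 - alpha) * std_b)


def apply_style(content, style, eps=1e-6):
    """Normalize content features, then rescale them to the target style."""
    mu, sigma = style_stats(content)
    target_mu, target_sigma = style
    return [target_sigma * (x - mu) / (sigma + eps) + target_mu for x in content]


def patch_style_transfer(content, patch_size, styles):
    """Give each consecutive patch its own style, cycling through `styles`."""
    out = []
    for k, start in enumerate(range(0, len(content), patch_size)):
        patch = content[start:start + patch_size]
        out.extend(apply_style(patch, styles[k % len(styles)]))
    return out


# Hypothetical styles extracted from two synthetic target-like images.
foggy, night = (0.0, 1.0), (10.0, 3.0)
mixed = domain_mix(foggy, night, alpha=0.5)
print(mixed)
print(patch_style_transfer([1.0, 2.0, 3.0, 4.0], 2, [foggy, night]))
```

Mixing enlarges the set of styles seen during adaptation, and the patch-wise variant lets a single image carry several styles at once, matching the roles the abstract assigns to the two modules.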
[93] Captain Cinema: Towards Short Movie Generation
Junfei Xiao,Ceyuan Yang,Lvmin Zhang,Shengqu Cai,Yang Zhao,Yuwei Guo,Gordon Wetzstein,Maneesh Agrawala,Alan Yuille,Lu Jiang
Main category: cs.CV
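TL;DR: Captain Cinema is a short movie generation framework that produces high-quality, coherent short movies by generating keyframes and then conditioning a video synthesis model on them.
Details
Motivation: Existing video generation models struggle to keep story and visuals consistent over long narratives and multiple scenes, so an efficient and stable approach is needed.
Contribution: Proposes the Captain Cinema framework, which combines top-down keyframe planning with bottom-up video synthesis, and designs an interleaved training strategy for MM-DiT models that supports long-context video generation.
Method: 1. Top-down keyframe planning ensures narrative and visual coherence; 2. Bottom-up video synthesis fills in the dynamics between keyframes; 3. An interleaved MM-DiT training strategy is adapted for long-context video data.
Result: Experiments show that Captain Cinema efficiently generates high-quality short movies that are visually and narratively coherent.
Insight: Staged generation combined with long-context optimization can improve the stability and quality of video generation for complex narratives.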
Abstract: We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach firstly generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative consistent short movies in high quality and efficiency. Project page: https://thecinema.ai
cs.IR
[94] LLM-based Embedders for Prior Case Retrieval
Damith Premasiri,Tharindu Ranasinghe,Ruslan Mitkov
Main category: cs.IR
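TL;DR: The paper uses LLM-based text embedders to address two main challenges of prior case retrieval (PCR) in the legal domain: input length limits on long texts and the lack of training data. It shows that these embedders, used in an unsupervised manner, outperform traditional BM25 and supervised transformer-based models.
Details
Motivation: In common law systems, lawyers and judges rely on precedents to build their arguments, and the sheer volume of cases makes effective retrieval essential. Most PCR methods still rely on traditional IR methods such as BM25, while deep learning methods are held back by input length limits on long legal texts and by scarce training data.
Contribution: The main contribution is applying LLM-based text embedders to PCR in an unsupervised manner, addressing both the long-text input limit and the shortage of training data, and demonstrating their advantage experimentally.
Method: LLM-based text embedders, which support longer inputs than BERT-based models, are used without supervision and therefore need no training data. They are evaluated on four PCR benchmark datasets.
Result: The LLM-based embedders outperform BM25 and supervised transformer-based models on the PCR task.
Insight: Unsupervised LLM-based embedders offer a new approach to legal information retrieval, with particular promise where data is scarce and texts are long.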
Abstract: In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation: powerful BERT-based transformer models limit the input text length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information. ii. Lack of legal training data: due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
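Unsupervised retrieval with text embeddings typically reduces to ranking candidates by vector similarity to the query. The sketch below assumes the embeddings have already been produced by some embedder and ranks by cosine similarity; the three-dimensional vectors, the case identifiers, and the choice of cosine similarity are illustrative assumptions, since the abstract does not state the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_cases(query_vec, case_vecs, top_k=3):
    """Return the top_k (case_id, score) pairs, most similar first."""
    scored = [(case_id, cosine(query_vec, vec)) for case_id, vec in case_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings of candidate prior cases and of a query case.
case_vecs = {
    "case_A": [1.0, 0.0, 0.0],
    "case_B": [0.8, 0.6, 0.0],
    "case_C": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

for case_id, score in rank_cases(query_vec, case_vecs):
    print(case_id, round(score, 3))
```

No parameters are trained in this step, which is why the approach sidesteps the shortage of labeled PCR data.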
q-bio.NC
[95] Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Semih Eren,Deniz Kucukahmetler,Nico Scherf
Main category: q-bio.NC
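TL;DR: The paper proposes a hierarchical multimodal recurrent ensemble that combines visual, auditory, and semantic information to predict brain responses to naturalistic movies, and it performed strongly in the Algonauts 2025 challenge.
Details
Motivation: Accurately predicting brain responses to naturalistic stimuli requires integrating multiple modalities (visual, auditory, semantic) and temporal dynamics. The Algonauts 2025 challenge provides an fMRI dataset recorded during movie watching, which is well suited to testing such integrative models.
Contribution: Proposes a hierarchical multimodal recurrent ensemble; designs a bidirectional RNN architecture that fuses multimodal information; improves performance through curriculum learning and loss design; ranks third in the challenge.
Method: (1) Use pretrained video, audio, and language embeddings; (2) encode temporal dynamics with bidirectional RNNs; (3) fuse the multimodal hidden states and pass them to a second recurrent layer; (4) predict fMRI responses with lightweight subject-specific output heads; (5) train with a composite MSE-correlation loss and a curriculum learning strategy.
Result: The model ranked third in the Algonauts 2025 challenge, with an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63), and performed especially well on the most challenging subject (Subject 5).
Insight: Multimodal fusion and temporal dynamics modeling are essential for predicting brain responses; curriculum learning and model ensembling improve robustness and performance; lightweight subject-specific output heads support per-subject modeling.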
Abstract: Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
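A composite MSE-correlation loss can be sketched as a weighted sum of the mean squared error and one minus the Pearson correlation between predicted and measured time series. The weighting `lam`, the `1 - r` form of the correlation term, and the toy series are assumptions for illustration; the abstract names the loss but does not give its exact formula.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson_r(pred, target):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    return cov / denom if denom else 0.0


def composite_loss(pred, target, lam=0.5):
    """lam * MSE + (1 - lam) * (1 - r); lam is an assumed trade-off weight."""
    return lam * mse(pred, target) + (1 - lam) * (1 - pearson_r(pred, target))


# Hypothetical fMRI time series for one parcel and two candidate predictions.
target = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]   # right shape, wrong offset
flipped = [4.0, 3.0, 2.0, 1.0]   # anti-correlated

print(composite_loss(shifted, target))
print(composite_loss(flipped, target))
```

The MSE term penalizes amplitude and offset errors, while the correlation term rewards matching the temporal shape, which is what a Pearson-based leaderboard score measures.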
cs.GR
[96] Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA
Rameen Abdal,Or Patashnik,Ekaterina Deyneka,Hao Chen,Aliaksandr Siarohin,Sergey Tulyakov,Daniel Cohen-Or,Kfir Aberman
Main category: cs.GR
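TL;DR: The paper proposes a zero-shot framework for dynamic concept personalization that uses grid-based LoRA adapters for efficient training and inference.
Details
Motivation: Existing dynamic concept personalization methods require per-instance fine-tuning, which limits scalability. The authors therefore propose a zero-shot framework that generalizes to unseen concepts without test-time optimization.
Contribution: The main contribution is a fully zero-shot framework for dynamic concept personalization, built on Grid-LoRA adapters and a Grid Fill module.
Method: The method is based on 2x2 video grids: lightweight Grid-LoRA adapters are trained for editing and composition, and a Grid Fill module completes partially observed layouts, producing temporally coherent and identity-preserving outputs.
Result: Experiments show that the method produces high-quality, consistent results on concepts beyond those seen in training and across a wide range of editing scenarios.
Insight: The structured grid design and LoRA adapters are key to efficient zero-shot dynamic concept personalization.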
Abstract: Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.
[97] GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
SeungJun Moon,Hah Min Lew,Seungeun Lee,Ji-Su Kang,Gyeong-Moon Park
Main category: cs.GR
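TL;DR: The paper proposes GeoAvatar, an adaptive geometrical Gaussian Splatting framework for 3D head avatar generation, which improves reconstruction and animation quality through adaptive pre-allocation and a dedicated mouth structure.
Details
Motivation: When balancing reconstruction and animation, existing methods struggle to adapt to the different geometrical deviations across facial regions, resulting in suboptimal quality. The paper aims to solve this problem.
Contribution: 1. Proposes GeoAvatar, an adaptive geometrical Gaussian Splatting framework. 2. Introduces the Adaptive Pre-allocation Stage (APS), an unsupervised segmentation method. 3. Designs a mouth structure and a part-wise deformation strategy to improve animation fidelity. 4. Proposes a regularization loss for rigging Gaussians to 3DMM faces. 5. Releases the DynamicFace dataset.
Method: 1. Use APS to segment Gaussians into rigid and flexible sets. 2. Design a part-wise deformation strategy based on mouth anatomy. 3. Apply a regularization loss to refine the rigging.
Result: Experiments show that GeoAvatar outperforms existing methods in reconstruction and animation.
Insight: Handling Gaussians by region and modeling mouth dynamics explicitly can improve the animation quality and geometric adaptability of 3D head avatars.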
Abstract: Despite recent progress in 3D head avatar generation, balancing identity preservation, i.e., reconstruction, with novel poses and expressions, i.e., animation, remains a challenge. Existing methods struggle to adapt Gaussians to varying geometrical deviations across facial regions, resulting in suboptimal quality. To address this, we propose GeoAvatar, a framework for adaptive geometrical Gaussian Splatting. GeoAvatar leverages Adaptive Pre-allocation Stage (APS), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization. Then, based on mouth anatomy and dynamics, we introduce a novel mouth structure and the part-wise deformation strategy to enhance the animation fidelity of the mouth. Finally, we propose a regularization loss for precise rigging between Gaussians and 3DMM faces. Moreover, we release DynamicFace, a video dataset with highly expressive facial motions. Extensive experiments show the superiority of GeoAvatar compared to state-of-the-art methods in reconstruction and novel animation scenarios.
[98] PS-GS: Gaussian Splatting for Multi-View Photometric Stereo
Yixiao Chen,Bin Liang,Hanzhi Guo,Yongqing Cheng,Jiayi Zhao,Dongdong Weng
Main category: cs.GR
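TL;DR: PS-GS combines multi-view photometric stereo (MVPS) with Gaussian splatting to jointly and efficiently estimate an object's geometry, materials, and lighting. Regularization together with multi-view, multi-light images mitigates the ill-posedness of inverse rendering, and the method outperforms prior work in reconstruction accuracy and computational efficiency.
Details
Motivation: Inverse rendering approaches that rely on fixed environment illumination reconstruct less accurately than those integrated with multi-view photometric stereo (MVPS), yet efficient inverse rendering with MVPS remains challenging. A method that jointly estimates geometry, materials, and lighting is needed.
Contribution: Proposes PS-GS, which applies Gaussian splatting to MVPS for efficient joint estimation of geometry, materials, and lighting. Regularization and an improved lighting computation raise reconstruction accuracy and efficiency.
Method: 1) Initialize the geometry with a Gaussian splatting model; 2) perform inverse rendering with the full rendering equation, which contains a lighting-computing MLP; 3) regularize the rendered normals with normal maps from uncalibrated photometric stereo; 4) use 2D Gaussian ray-tracing to refine the light sources.
Result: On synthetic and real datasets, PS-GS outperforms prior methods in reconstruction accuracy and computational efficiency, and supports novel-view synthesis, relighting, and material and shape editing.
Insight: Combining Gaussian splatting with MVPS mitigates the ill-posedness of inverse rendering, and joint optimization over multi-light, multi-view data is key to higher reconstruction accuracy.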
Abstract: Integrating inverse rendering with multi-view photometric stereo (MVPS) yields more accurate 3D reconstructions than the inverse rendering approaches that rely on fixed environment illumination. However, efficient inverse rendering with MVPS remains challenging. To fill this gap, we introduce the Gaussian Splatting for Multi-view Photometric Stereo (PS-GS), which efficiently and jointly estimates the geometry, materials, and lighting of the object that is illuminated by diverse directional lights (multi-light). Our method first reconstructs a standard 2D Gaussian splatting model as the initial geometry. Based on the initialization model, it then proceeds with the deferred inverse rendering by the full rendering equation containing a lighting-computing multi-layer perceptron. During the whole optimization, we regularize the rendered normal maps by the uncalibrated photometric stereo estimated normals. We also propose the 2D Gaussian ray-tracing for single directional light to refine the incident lighting. The regularizations and the use of multi-view and multi-light images mitigate the ill-posed problem of inverse rendering. After optimization, the reconstructed object can be used for novel-view synthesis, relighting, and material and shape editing. Experiments on both synthetic and real datasets demonstrate that our method outperforms prior works in terms of reconstruction accuracy and computational efficiency.
eess.IV [Back]
[99] Improving Multislice Electron Ptychography with a Generative Prior
Christian K. Belardi,Chia-Hao Lee,Yingheng Wang,Justin Lovelace,Kilian Q. Weinberger,David A. Muller,Carla P. Gomes
Main category: eess.IV
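TL;DR: The paper proposes a hybrid approach that pairs a diffusion model (MEP-Diffusion) with iterative solvers, greatly improving reconstruction quality in multislice electron ptychography.
Details
Motivation: Current iterative solvers for multislice electron ptychography (MEP) are time-consuming and give suboptimal results, so reconstruction quality needs to improve.
Contribution: Proposes MEP-Diffusion, a diffusion model designed specifically for MEP that acts as a generative prior and markedly improves reconstruction.
Method: Uses a diffusion model as a generative prior and integrates it with existing iterative methods via Diffusion Posterior Sampling (DPS).
Result: Achieves a 90.50% improvement in SSIM over existing methods, with markedly better reconstructed 3D volumes.
Insight: Generative priors can counter the ill-posedness of inverse problems and offer a new direction for computational imaging.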
Abstract: Multislice electron ptychography (MEP) is an inverse imaging technique that computationally reconstructs the highest-resolution images of atomic crystal structures from diffraction patterns. Available algorithms often solve this inverse problem iteratively, but they are time-consuming and produce suboptimal solutions because the problem is ill-posed. We develop MEP-Diffusion, a diffusion model trained on a large database of crystal structures specifically for MEP to augment existing iterative solvers. MEP-Diffusion is easily integrated as a generative prior into existing reconstruction methods via Diffusion Posterior Sampling (DPS). We find that this hybrid approach greatly enhances the quality of the reconstructed 3D volumes, achieving a 90.50% improvement in SSIM over existing methods.
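For orientation, the guidance step of Diffusion Posterior Sampling can be written as below. This is the generic DPS update as it is commonly stated, not a formula taken from this abstract. The symbols are assumptions made for illustration: $\mathcal{A}$ stands for the MEP forward model, $y$ for the measured diffraction patterns, $s_\theta$ for the learned score, $x'_{t-1}$ for the unconditional reverse-diffusion sample, and $\zeta_t$ for a step size. How MEP-Diffusion interleaves this step with an iterative solver is not specified here.

```latex
\hat{x}_0(x_t) = \frac{1}{\sqrt{\bar{\alpha}_t}}
  \Big( x_t + (1 - \bar{\alpha}_t)\, s_\theta(x_t, t) \Big),
\qquad
x_{t-1} = x'_{t-1} - \zeta_t \, \nabla_{x_t}
  \big\lVert\, y - \mathcal{A}\big(\hat{x}_0(x_t)\big) \big\rVert_2^2
```

The first expression is the denoised estimate of the clean volume; the second nudges each reverse step toward agreement with the measurements.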
[100] Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging
Atif Bilal Asad,Achyut Paudel,Safal Kshetri,Chenchen Kang,Salik Ram Khanal,Nataliya Shcherbatyuk,Pierre Davadant,R. Paul Schreiner,Santosh Kalauni,Manoj Karkee,Markus Keller
Main category: eess.IV
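TL;DR: This paper studies how to assess nitrogen content in grapevine leaves using in-field hyperspectral imaging and machine learning combined with feature selection. By analyzing leaves across cultivars and growth stages, it identifies key spectral regions and validates the models.
Details
Motivation: Nitrogen is a key factor in grapevine growth and fruit quality, but soil nitrogen varies widely in space and time. Accurate estimates of leaf nitrogen are therefore needed to support precision fertilization.
Contribution: 1) Uses in-field hyperspectral imaging to collect data for estimating grapevine leaf nitrogen; 2) combines feature selection with machine learning to identify the key spectral regions related to nitrogen content; 3) validates predictive ability at both leaf and canopy level.
Method: Three steps: 1) collect hyperspectral images (400-1000 nm) of four grapevine cultivars over two growing seasons; 2) apply two feature selection methods to pick the spectral bands related to nitrogen content; 3) predict nitrogen content with two machine learning models, Gradient Boosting and XGBoost.
Result: The models reach an R² of 0.57 at leaf level and 0.49 at canopy level. The consistent selection of the key spectral regions (500-525 nm, 650-690 nm, 750-800 nm, 900-950 nm) indicates their robustness.
Insight: Combining hyperspectral imaging with machine learning allows vineyard nitrogen status to be monitored efficiently under field conditions, providing a practical tool for precision agriculture.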
Abstract: Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting plant growth and subsequent products such as wine and juice. Because soil N has high spatial and temporal variability, it is desirable to accurately estimate the N concentration of grapevine leaves and manage fertilization at the individual plant level to optimally meet plant needs. In this study, we used in-field hyperspectral images with wavelengths ranging from 400 to 1000 nm of four different grapevine cultivars collected from distinct vineyards and over two growth stages during two growing seasons to develop models for predicting N concentration at the leaf-level and canopy-level. After image processing, two feature selection methods were employed to identify the optimal set of spectral bands that were responsive to leaf N concentrations. The selected spectral bands were used to train and test two different Machine Learning (ML) models, Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The comparison of selected bands for both leaf-level and canopy-level datasets showed that most of the spectral regions identified by the feature selection methods were consistent across both methods and the dataset types (leaf- and canopy-level datasets), particularly in the key regions, 500-525 nm, 650-690 nm, 750-800 nm, and 900-950 nm. These findings indicated the robustness of these spectral regions for predicting nitrogen content. The results for N prediction demonstrated that the ML model achieved an R² of 0.49 for canopy-level data and an R² of 0.57 for leaf-level data, despite using different sets of selected spectral bands for each analysis level. The study demonstrated the potential of using in-field hyperspectral imaging and the use of spectral data in integrated feature selection and ML techniques to monitor N status in vineyards.
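To make the reported R² values concrete, here is a minimal sketch of how the coefficient of determination is computed from measured and predicted values. The function name and the sample numbers are hypothetical and chosen only for illustration; they are not data from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variance
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical leaf N concentrations and model predictions (illustrative only).
measured = [2.1, 2.4, 2.8, 3.0, 3.3]
predicted = [2.3, 2.3, 2.6, 3.1, 3.0]
print(round(r_squared(measured, predicted), 3))
```

An R² of 0.57 therefore means the model explains a little over half of the variance in measured leaf nitrogen, while a model that always predicts the mean would score 0.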
[101] Direct Dual-Energy CT Material Decomposition using Model-based Denoising Diffusion Model
Hang Xu,Alexandre Bousse,Alessandro Perelli
Main category: eess.IV
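TL;DR: This paper proposes DEcomp-MoD, a deep learning method that performs material decomposition directly from dual-energy CT projection data. By incorporating a model-based diffusion model, it avoids the limitations of post-processing and improves decomposition accuracy.
Details
Motivation: Conventional dual-energy CT material decomposition is usually done as post-processing in the image domain, which cannot account for beam hardening and yields suboptimal results. The paper aims to decompose directly from projection data, combining a physical model with deep learning to improve accuracy.
Contribution: Proposes DEcomp-MoD, which generates material images directly from dual-energy CT projection data and combines a model-based approach with a score-based denoising diffusion prior to improve decomposition accuracy.
Method: Incorporates knowledge of the spectral dual-energy CT model into the deep learning training loss and combines it with a score-based denoising diffusion prior in the material image domain, giving a conditional diffusion model that keeps the results consistent with the input data.
Result: Experiments on synthetic sinograms from the low-dose AAPM dataset show that DEcomp-MoD outperforms existing unsupervised score-based models and supervised deep learning networks, with potential for clinical diagnosis.
Insight: Decomposing directly from projection data avoids errors introduced by post-processing, and combining the physical model with a diffusion prior improves performance.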
Abstract: Dual-energy X-ray Computed Tomography (DECT) constitutes an advanced technology which enables automatic decomposition of materials in clinical images without manual segmentation using the dependency of the X-ray linear attenuation with energy. However, most methods perform material decomposition in the image domain as a post-processing step after reconstruction, but this procedure does not account for the beam-hardening effect and yields sub-optimal results. In this work, we propose a deep learning procedure called Dual-Energy Decomposition Model-based Diffusion (DEcomp-MoD) for quantitative material decomposition which directly converts the DECT projection data into material images. The algorithm is based on incorporating the knowledge of the spectral DECT model into the deep learning training loss and combining a score-based denoising diffusion learned prior in the material image domain. Importantly, the inference optimization loss takes the sinogram directly as input and converts it to material images through a model-based conditional diffusion model which guarantees consistency of the results. We evaluate the performance with both quantitative and qualitative estimation of the proposed DEcomp-MoD method on synthetic DECT sinograms from the low-dose AAPM dataset. Finally, we show that DEcomp-MoD outperforms state-of-the-art unsupervised score-based models and supervised deep learning networks, with the potential to be deployed for clinical diagnosis.
[102] Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor Networks
Binghua Li,Ziqing Chang,Tong Liang,Chao Li,Toshihisa Tanaka,Shigeki Aoki,Qibin Zhao,Zhe Sun
Main category: eess.IV
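TL;DR: This paper proposes TenVOO, a new parameter-efficient fine-tuning method for 3D U-Net diffusion models in MRI image generation. Using tensor network modeling, TenVOO captures complex spatial dependencies with very few parameters and outperforms existing methods in experiments.
Details
Motivation: Parameter-efficient representations of 3D convolution operations remain under-researched, yet MRI image generation requires capturing complex spatial dependencies. This motivates a parameter-efficient fine-tuning method.
Contribution: Proposes TenVOO, a tensor-network-based parameter-efficient fine-tuning method that represents 3D convolution kernels with lower-dimensional tensors and sharply reduces the number of trainable parameters.
Method: Uses tensor network modeling to represent 3D convolution kernels as lower-dimensional tensors, so that spatial dependencies are captured efficiently during fine-tuning.
Result: On three MRI datasets (ADNI, PPMI, BraTS2021), TenVOO achieves the best MS-SSIM while requiring only 0.3% of the trainable parameters of the original model.
Insight: Tensor networks can compress the parameters of 3D convolution kernels while preserving model performance, suggesting a new route to parameter-efficient methods for high-dimensional data.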
Abstract: We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets (ADNI, PPMI, and BraTS2021) by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: https://github.com/xiaovhua/tenvoo
[103] Deep Learning for Glioblastoma Morpho-pathological Features Identification: A BraTS-Pathology Challenge Solution
Juexin Zhang,Ying Weng,Ke Chen
Main category: eess.IV
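TL;DR: The paper fine-tunes a pre-trained deep learning model to identify morpho-pathological features of glioblastoma. Although the model performs poorly on the BraTS-Path validation set, the solution placed second in the testing phase.
Details
Motivation: The heterogeneity of glioblastoma makes diagnosis and treatment difficult, and traditional pathology methods are limited in efficiency, so it is worth exploring deep learning to improve diagnostic accuracy.
Contribution: Proposes a solution based on a pre-trained model that, after fine-tuning, shows some diagnostic ability in the BraTS-Path challenge and is strongest on specificity.
Method: Fine-tunes a pre-trained model on the BraTS-Path training dataset and evaluates how well it identifies morpho-pathological features of glioblastoma.
Result: On the validation set, accuracy, recall, and F1-score are all 0.392229, specificity is 0.898704, and the Matthews Correlation Coefficient is 0.255267. The solution ranked second in the testing phase.
Insight: Deep learning has potential for glioblastoma diagnosis, but the model needs further optimization to perform well under complex pathological conditions.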
Abstract: Glioblastoma, a highly aggressive brain tumor with diverse molecular and pathological features, poses a diagnostic challenge due to its heterogeneity. Accurate diagnosis and assessment of this heterogeneity are essential for choosing the right treatment and improving patient outcomes. Traditional methods rely on identifying specific features in tissue samples, but deep learning offers a promising approach for improved glioblastoma diagnosis. In this paper, we present our approach to the BraTS-Path Challenge 2024. We leverage a pre-trained model and fine-tune it on the BraTS-Path training dataset. Our model demonstrates poor performance on the challenging BraTS-Path validation set, as rigorously assessed by the Synapse online platform. The model achieves an accuracy of 0.392229, a recall of 0.392229, and an F1-score of 0.392229, indicating a consistent ability to correctly identify instances under the target condition. Notably, our model exhibits a high specificity of 0.898704, showing a strong capacity to correctly classify negative cases. Moreover, a Matthews Correlation Coefficient (MCC) of 0.255267 is calculated, signifying a limited positive correlation between predicted and actual values. Our solution also achieved second place during the testing phase.
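Identical accuracy, recall, and F1 values are what micro-averaging produces on a single-label multi-class task, because every misclassification counts once as a false positive and once as a false negative. The sketch below illustrates that identity on hypothetical labels; the abstract does not say which averaging the challenge platform used, so treating the reported scores as micro-averaged is an assumption.

```python
def micro_averaged_metrics(y_true, y_pred):
    """Pool one-vs-rest counts over all classes (micro-averaging)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally long")
    labels = set(y_true) | set(y_pred)
    if len(labels) < 2:
        raise ValueError("need at least two classes")
    tp = fp = fn = tn = 0
    for c in labels:
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif t != c and p == c:
                fp += 1
            elif t == c and p != c:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp),
    }


# Hypothetical three-class labels, for illustration only.
y_true = ["A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "C", "A"]
m = micro_averaged_metrics(y_true, y_pred)
print(m)
```

Under the same pooling, specificity follows from accuracy `a` and the number of classes `K` as `(K - 2 + a) / (K - 1)`. With `a = 0.392229` and seven classes this gives roughly 0.8987, close to the reported specificity, though whether the evaluation actually pooled seven classes this way is a guess from the arithmetic rather than something the abstract states.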
[104] UniSegDiff: Boosting Unified Lesion Segmentation via a Staged Diffusion Model
Yilong Hu,Shijie Chang,Lihe Zhang,Feng Tian,Weibing Sun,Huchuan Lu
Main category: eess.IV
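TL;DR: UniSegDiff is a staged training and inference framework built on the Diffusion Probabilistic Model (DPM) that aims to unify lesion segmentation across modalities and organs. By dynamically adjusting the prediction targets at different stages, it balances the model's attention across timesteps and significantly improves segmentation performance.
Details
Motivation: Current diffusion models for lesion segmentation distribute attention unevenly across timesteps, which lengthens training and gives suboptimal results. The authors propose UniSegDiff to fix this and to unify segmentation across modalities and organs.
Contribution: 1. Proposes UniSegDiff, a staged diffusion model framework; 2. dynamically adjusts the prediction targets to balance attention across timesteps; 3. achieves unified lesion segmentation by pre-training the feature extraction network.
Method: Uses staged training and inference with dynamically adjusted prediction targets, and pre-trains the feature extraction network to unify segmentation across modalities and organs.
Result: On datasets covering several imaging modalities and six different organs, UniSegDiff significantly outperforms previous state-of-the-art (SOTA) methods.
Insight: Dynamically adjusting the prediction targets gives a diffusion model a more even distribution of attention, which improves medical image segmentation.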
Abstract: The Diffusion Probabilistic Model (DPM) has demonstrated remarkable performance across a variety of generative tasks. The inherent randomness in diffusion models helps address issues such as blurring at the edges of medical images and labels, positioning Diffusion Probabilistic Models (DPMs) as a promising approach for lesion segmentation. However, we find that the current training and inference strategies of diffusion models result in an uneven distribution of attention across different timesteps, leading to longer training times and suboptimal solutions. To this end, we propose UniSegDiff, a novel diffusion model framework designed to address lesion segmentation in a unified manner across multiple modalities and organs. This framework introduces a staged training and inference approach, dynamically adjusting the prediction targets at different stages, forcing the model to maintain high attention across all timesteps, and achieves unified lesion segmentation through pre-training the feature extraction network for segmentation. We evaluate performance on six different organs across various imaging modalities. Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches. The code is available at https://github.com/HUYILONG-Z/UniSegDiff.
[105] DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis
Minxi Ouyang,Lianghui Zhu,Yaqing Bao,Qiang Huang,Jingli Ouyang,Tian Guan,Xitong Ling,Jiawen Li,Song Duan,Wenbin Dai,Li Zheng,Xuemei Zhang,Yonghong He
Main category: eess.IV
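TL;DR: The paper proposes DiagR1, a multimodal model trained with reinforcement learning for digestive pathology diagnosis. It addresses the data quality and reasoning transparency limits of existing methods by building a high-quality dataset, introducing a prompt augmentation strategy, and applying GRPO, which significantly improves the quality of generated diagnostic reports.
Details
Motivation: Current multimodal models for gastrointestinal pathology are limited by poor data quality and opaque reasoning. They produce factual errors when generating diagnostic text and lack explicit intermediate reasoning chains, which undermines clinical trust.
Contribution: 1. Builds a large-scale gastrointestinal pathology dataset; 2. proposes a prompt augmentation strategy that incorporates lesion classification and anatomical site information; 3. uses GRPO to improve reasoning quality and output structure.
Method: 1. Build a dataset containing microscopic descriptions and diagnostic conclusions; 2. design the prompt augmentation strategy; 3. post-train with supervised fine-tuning combined with GRPO.
Result: On real-world pathology report generation, the approach significantly outperforms existing methods in generation quality, structural completeness, and clinical relevance, with 18.7% higher clinical relevance, 32.4% better structural completeness, and 41.2% fewer diagnostic errors.
Insight: Using reinforcement learning to strengthen a multimodal model's reasoning, together with high-quality data and prompt strategies, can substantially improve generation quality and clinical utility in medical diagnosis tasks.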
Abstract: Multimodal large models have shown great potential in automating pathology image analysis. However, current multimodal models for gastrointestinal pathology are constrained by both data quality and reasoning transparency: pervasive noise and incomplete annotations in public datasets predispose vision-language models to factual hallucinations when generating diagnostic text, while the absence of explicit intermediate reasoning chains renders the outputs difficult to audit and thus less trustworthy in clinical practice. To address these issues, we construct a large-scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions, and propose a prompt augmentation strategy that incorporates lesion classification and anatomical site information. This design guides the model to better capture image-specific features and maintain semantic consistency in generation. Furthermore, we employ a post-training pipeline that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve reasoning quality and output structure. Experimental results on real-world pathology report generation tasks demonstrate that our approach significantly outperforms state-of-the-art open-source and proprietary baselines in terms of generation quality, structural completeness, and clinical relevance. Our solution outperforms state-of-the-art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors, demonstrating superior accuracy and clinical utility compared to existing solutions.
cs.RO [Back]
[106] Evaluation of facial landmark localization performance in a surgical setting
Ines Frajtag,Marko Švaco,Filip Šuligoj
Main category: cs.RO
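TL;DR: This paper experimentally tests how well the MediaPipe algorithm localizes facial landmarks in a surgical setting. Under surgical lighting, detection at large rotation angles improves, but some landmarks are still detected imprecisely.
Details
Motivation: As robotics and computer vision spread through medicine, detecting facial landmarks precisely under difficult lighting and varying positions has become a challenge. The paper aims to evaluate MediaPipe in a surgical environment.
Contribution: The main contribution is an experimental evaluation of MediaPipe's facial landmark detection at large rotation angles under surgical lighting, together with a discussion of its potential for medical use.
Method: A robotic arm automatically adjusts position while the surgical light and the phantom stay fixed, and MediaPipe's detection accuracy is measured at different angles.
Result: Under surgical lighting, detection performance improves at large yaw and pitch angles, but imprecise detection of selected landmarks increases the standard deviation.
Insight: The paper shows MediaPipe's potential in medical settings while stressing that landmark precision must improve to meet surgical requirements.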
Abstract: The use of robotics, computer vision, and their applications is becoming increasingly widespread in various fields, including medicine. Many face detection algorithms have found applications in neurosurgery, ophthalmology, and plastic surgery. A common challenge in using these algorithms is variable lighting conditions and the flexibility of detection positions to identify and precisely localize patients. The proposed experiment tests the MediaPipe algorithm for detecting facial landmarks in a controlled setting, using a robotic arm that automatically adjusts positions while the surgical light and the phantom remain in a fixed position. The results of this study demonstrate that the improved accuracy of facial landmark detection under surgical lighting significantly enhances the detection performance at larger yaw and pitch angles. The increase in standard deviation/dispersion occurs due to imprecise detection of selected facial landmarks. This analysis allows for a discussion on the potential integration of the MediaPipe algorithm into medical procedures.
[107] ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation
Chenyu Su,Weiwei Shang,Chen Qian,Fei Zhang,Shuang Cong
Main category: cs.RO
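TL;DR: ReSem3D is a robotic manipulation framework that dynamically refines 3D spatial constraints through fine-grained semantic grounding. It combines multimodal large language models with vision foundation models to address the coarse semantic granularity, missing real-time closed-loop planning, and limited robustness of existing methods.
Details
Motivation: Existing methods for robotic manipulation suffer from coarse semantic granularity, a lack of real-time closed-loop planning, and limited robustness. ReSem3D aims to combine VFMs and MLLMs so that fine-grained semantics dynamically refine 3D spatial constraints.
Contribution: 1. Proposes the ReSem3D framework, in which fine-grained semantics dynamically refine 3D spatial constraints; 2. combines MLLMs and VFMs to go from natural language instructions to real-time manipulation; 3. demonstrates zero-shot generalization in semantically diverse environments.
Method: MLLMs driven by hierarchical recursive reasoning interact with VFMs to construct 3D spatial constraints automatically in two stages (part-level extraction and region-level refinement), and the constraints are encoded as real-time optimization objectives.
Result: In semantically rich household environments and sparse lab environments, ReSem3D completes diverse manipulation tasks under zero-shot conditions, showing strong adaptability and generalization.
Insight: 1. Dynamically refining 3D spatial constraints with fine-grained semantics is key to robotic manipulation; 2. the synergy of MLLMs and VFMs enables cross-modal, real-time construction of spatial constraints; 3. zero-shot generalization supports the framework's robustness.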
Abstract: Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos at https://resem3d.github.io.
[108] Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
Xiaojie Zhang,Yuanfei Wang,Ruihai Wu,Kunqi Xu,Yu Li,Liuyu Xiang,Hao Dong,Zhaofeng He
Main category: cs.RO
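TL;DR: This paper proposes AdaRPG, a framework that uses foundation models to extract object parts, which improves the generalization of visual affordance skills, and reasons about complex mechanisms to generate high-level control code. It targets the geometric diversity and functional variation that make articulated object manipulation hard.
Details
Motivation: The internal structure of articulated objects cannot be observed directly, and their geometric diversity and differing mechanisms make manipulation challenging. Existing methods struggle to achieve adaptive manipulation across categories.
Contribution: 1. Proposes the AdaRPG framework, which uses foundation models to extract part-level geometry and improve the generalization of visual affordance skills; 2. builds a part-level affordance annotation dataset; 3. uses foundation models to reason about complex mechanisms and generate high-level control code.
Method: 1. Extract object parts with foundation models; 2. train a part-level affordance model; 3. use the common knowledge in foundation models to reason about complex mechanisms and generate control code.
Result: Simulation and real-world experiments show that AdaRPG generalizes strongly to novel articulated object categories.
Insight: Parts show greater local geometric similarity than whole objects and therefore support better generalization of affordance skills; the common knowledge in foundation models is effective for reasoning about complex mechanisms.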
Abstract: Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG’s strong generalization ability across novel articulated object categories.
cs.CR [Back]
[109] RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models
Haoran Gao,Yuanhe Zhang,Zhenhong Zhou,Lei Jiang,Fanyu Meng,Yujia Xiao,Kun Wang,Yang Liu,Junlan Feng
Main category: cs.CR
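TL;DR: RECALLED is presented as the first unbounded resource consumption attack on large vision-language models (LVLMs) that works through visual inputs, using pixel-level optimization and parallel objective losses to make the attack effective.
Details
Motivation: Existing studies of resource consumption attacks (RCAs) on large language models (LLMs) have overlooked visual inputs, yet adding a vision modality opens new attack vectors.
Contribution: 1. Proposes RECALLED, the first resource consumption attack on LVLMs through the visual modality; 2. designs "Vision Guided Optimization" and "Multi-Objective Parallel Losses" to generate effective, universal attack templates; 3. exposes security vulnerabilities in LVLMs.
Method: 1. Use "Vision Guided Optimization" (fine-grained pixel-level optimization) to obtain "Output Recall" adversarial perturbations; 2. inject the perturbations to trigger unbounded output; 3. introduce "Multi-Objective Parallel Losses" to resolve optimization conflicts.
Result: Experiments show that RECALLED increases service response latency more than 26-fold and raises GPU utilization and memory consumption by 20%.
Insight: The study shows that LVLMs have serious security vulnerabilities under visual inputs, and the proposed red-teaming framework gives future defense research a starting point.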
Abstract: Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (REsource Consumption Attack on Large Vision-LanguagE MoDels), the first red-teaming approach that exploits visual modalities to trigger unbounded RCAs. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization, to obtain Output Recall adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce Multi-Objective Parallel Losses to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26×, resulting in an additional 20% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.
cs.LG [Back]
[110] GenSelect: A Generative Approach to Best-of-N
Shubham Toshniwal,Ivan Sorokin,Aleksander Ficek,Ivan Moshkov,Igor Gitman
Main category: cs.LG
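TL;DR: GenSelect is a generative approach in which an LLM uses long reasoning to select the best of N candidate solutions, exploiting LLMs' comparative abilities while scaling efficiently.
Details
Motivation: Current approaches score solutions pointwise or compare them pairwise. Pointwise scoring underuses LLMs' comparative abilities, and pairwise comparison scales poorly with large sampling budgets.
Contribution: Proposes GenSelect, which selects the best candidate through long reasoning, combining LLMs' comparative strengths with efficient scaling across sampling budgets.
Method: GenSelect uses the long-reasoning ability of LLMs to pick the best of N candidates, avoiding the limits of pointwise scoring while scaling better than pairwise comparison.
Result: On math reasoning tasks, GenSelect outperforms existing scoring approaches and does so with simple prompting.
Insight: GenSelect shows that LLMs' long reasoning is effective for candidate selection and suggests a new direction for test-time scaling with generative reward models.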
Abstract: Generative reward models with parallel sampling have enabled effective test-time scaling for reasoning tasks. Current approaches employ pointwise scoring of individual solutions or pairwise comparisons. However, pointwise methods underutilize LLMs’ comparative abilities, while pairwise methods scale inefficiently with larger sampling budgets. We introduce GenSelect, where the LLM uses long reasoning to select the best solution among N candidates. This leverages LLMs’ comparative strengths while scaling efficiently across parallel sampling budgets. For math reasoning, we demonstrate that reasoning models, such as QwQ and DeepSeek-R1-0528, excel at GenSelect, outperforming existing scoring approaches with simple prompting.
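The selection step can be sketched as a single prompt that shows all N candidates and asks the model to name the best one. Everything below is a hypothetical illustration: the prompt wording, the `Best solution: <index>` answer format, the fallback to the first candidate, and the `stub_llm` stand-in are assumptions, not the authors' prompt or code.

```python
import re


def build_prompt(problem, candidates):
    """Assemble one prompt that shows the problem and all N candidates."""
    lines = ["Problem:", problem, ""]
    for i, solution in enumerate(candidates):
        lines += [f"Solution {i}:", solution, ""]
    lines.append(
        "Compare the solutions, reason step by step, and end with "
        "'Best solution: <index>'."
    )
    return "\n".join(lines)


def parse_choice(response, n_candidates):
    """Return the last index the judge named if it is in range, else None."""
    matches = re.findall(r"Best solution:\s*(\d+)", response)
    if not matches:
        return None
    index = int(matches[-1])
    return index if index < n_candidates else None


def gen_select(problem, candidates, llm):
    """Ask `llm` (any callable: prompt -> text) to pick one candidate."""
    response = llm(build_prompt(problem, candidates))
    choice = parse_choice(response, len(candidates))
    # Assumed fallback: keep the first candidate when no verdict is parsed.
    return candidates[choice] if choice is not None else candidates[0]


def stub_llm(prompt):
    """Stand-in for a reasoning model; a real call would go here."""
    return "Solutions 0 and 1 make sign errors; solution 2 checks out. Best solution: 2"


candidates = ["x = -3", "x = 5", "x = 3"]
print(gen_select("Solve 2x - 6 = 0.", candidates, stub_llm))
```

A single N-way prompt needs one judge call per problem, whereas exhaustive pairwise comparison needs a number of calls that grows quadratically with N.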
[111] Group Sequence Policy Optimization
Chujie Zheng,Shixuan Liu,Mingze Li,Xiong-Hui Chen,Bowen Yu,Chang Gao,Kai Dang,Yuqiong Liu,Rui Men,An Yang,Jingren Zhou,Junyang Lin
Main category: cs.LG
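TL;DR: This paper proposes GSPO, a reinforcement learning algorithm that defines and optimizes the importance ratio at the sequence level, improving training efficiency and performance.
Details
Motivation: Existing reinforcement learning algorithms such as GRPO can be inefficient and unstable when training large language models, especially Mixture-of-Experts (MoE) models. GSPO aims to solve these problems.
Contribution: Proposes GSPO, which optimizes at the sequence level rather than the token level to improve training efficiency and stability, particularly for MoE models, and which may simplify RL infrastructure design.
Method: GSPO defines the importance ratio from the sequence likelihood and performs sequence-level clipping, rewarding, and optimization. It outperforms GRPO and trains more stably.
Result: GSPO beats GRPO in training efficiency and performance, notably stabilizes MoE training, and has contributed to the improvements in the latest Qwen3 models.
Insight: Sequence-level optimization improves reinforcement learning efficiency more effectively than token-level optimization, especially for complex models such as MoE, and simplifies RL infrastructure design.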
Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
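The difference between token-level and sequence-level importance ratios can be sketched in a few lines. The abstract only states that the ratio is based on sequence likelihood and that clipping is applied per sequence; the length normalization, the group-normalized advantages, the PPO-style clipped objective, and the value `eps=0.2` below are assumptions made for illustration.

```python
import math


def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_level_ratio(logp_new, logp_old):
    """Single ratio per response: exp(mean token log-ratio).

    Assumption: the sequence likelihood ratio is length-normalized so that
    long and short responses are clipped on a comparable scale.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def group_advantages(rewards):
    """Normalize rewards within a group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


def clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate applied once to the whole sequence."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical per-token log-probabilities for one sampled response.
logp_new = [-1.0, -2.0, -0.5]
logp_old = [-1.2, -1.5, -0.5]
print(token_level_ratios(logp_new, logp_old))
print(sequence_level_ratio(logp_new, logp_old))
```

With token-level ratios, each token is clipped separately and a single outlier token can dominate the update; the sequence-level ratio collapses them into one number per response, so the whole response is either inside or outside the clipping range.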
[112] Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction
Yujia Tong,Jingling Yuan,Chuang Hu
Main category: cs.LG
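TL;DR: The paper proposes QuaRC, a framework that uses relative-entropy coreset selection and cascaded layer correction to run quantization-aware training efficiently on edge devices, significantly improving low-bit quantized models.
Details
Motivation: Because of privacy concerns, some data can only be processed on edge devices, while traditional quantization-aware training relies on the full dataset and is computationally expensive. Coreset selection can ease this, but existing methods fail to eliminate quantization errors on small datasets.
Contribution: 1) Proposes a coreset selection method based on a Relative Entropy Score; 2) designs a Cascaded Layer Correction strategy to reduce quantization errors in intermediate layers.
Method: 1) Use the Relative Entropy Score to select the most representative data subset; 2) use Cascaded Layer Correction to align the intermediate layer outputs of the quantized model with those of the full-precision model.
Result: On ImageNet-1K, a ResNet-18 quantized to 2-bit with a 1% data subset gains 5.72% Top-1 accuracy over state-of-the-art techniques.
Insight: Coreset selection combined with layer-wise correction can reduce quantization errors on small datasets, offering a new approach to quantization-aware training on edge devices.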
Abstract: With the development of mobile and edge computing, the demand for low-bit quantized models on edge devices is increasing to achieve efficient deployment. To enhance the performance, it is often necessary to retrain the quantized models using edge data. However, due to privacy concerns, certain sensitive data can only be processed on edge devices. Therefore, employing Quantization-Aware Training (QAT) on edge devices has become an effective solution. Nevertheless, traditional QAT relies on the complete dataset for training, which incurs a huge computational cost. Coreset selection techniques can mitigate this issue by training on the most representative subsets. However, existing methods struggle to eliminate quantization errors in the model when using small-scale datasets (e.g., only 10% of the data), leading to significant performance degradation. To address these issues, we propose QuaRC, a QAT framework with coresets on edge devices, which consists of two main phases: In the coreset selection phase, QuaRC introduces the “Relative Entropy Score” to identify the subsets that most effectively capture the model’s quantization errors. During the training phase, QuaRC employs the Cascaded Layer Correction strategy to align the intermediate layer outputs of the quantized model with those of the full-precision model, thereby effectively reducing the quantization errors in the intermediate layers. Experimental results demonstrate the effectiveness of our approach. For instance, when quantizing ResNet-18 to 2-bit using a 1% data subset, QuaRC achieves a 5.72% improvement in Top-1 accuracy on the ImageNet-1K dataset compared to state-of-the-art techniques.
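One way to picture a relative-entropy-based coreset score is to rank samples by how far the quantized model's output distribution drifts from the full-precision model's. The sketch below does exactly that. The abstract does not define the score, so the KL direction (full precision relative to quantized), the use of output logits, and the "keep the top fraction" rule are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relative_entropy(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_coreset(fp_logits, q_logits, fraction):
    """Keep the samples on which the quantized model diverges most.

    fp_logits / q_logits: per-sample logits from the full-precision and
    quantized models (hypothetical inputs for this sketch).
    """
    scores = [
        relative_entropy(softmax(fp), softmax(q))
        for fp, q in zip(fp_logits, q_logits)
    ]
    keep = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores


# Three hypothetical samples: the quantized model agrees, flips, and drifts.
fp_logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
q_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
selected, scores = select_coreset(fp_logits, q_logits, fraction=0.34)
print(selected, [round(s, 4) for s in scores])
```

Samples where both models already agree score near zero and are dropped first, which matches the stated goal of keeping the subset that best captures quantization error.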
[113] VIBE: Video-Input Brain Encoder for fMRI Response Modeling
Daniel Carlstrom Schad,Shrey Dixit,Janis Keck,Viktor Studenyak,Aleksandr Shpilevoi,Andrej Bicanski
Main category: cs.LG
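TL;DR: VIBE is a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity, and it performed strongly in the Algonauts 2025 Challenge.
Details
Motivation: The work aims to predict fMRI responses more accurately with a model that fuses video, audio, and text, improving the modeling of brain activity.
Contribution: Proposes VIBE, which combines multi-modal features with a two-stage Transformer architecture to improve fMRI response prediction.
Method: A modality-fusion Transformer merges the features, and a prediction Transformer with rotary embeddings decodes them over time. The training data is 65 hours of movie content from the CNeuroMod dataset.
Result: Mean parcel-wise Pearson correlations are 32.25 on Friends S07 (in-distribution) and 21.25 on six out-of-distribution films, and the architecture performed strongly in the Algonauts 2025 Challenge.
Insight: Fusing multi-modal features and decoding them temporally are key to fMRI prediction performance, and rotary embeddings also play an important role.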
Abstract: We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 32.25 on in-distribution Friends S07 and 21.25 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
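The evaluation metric, mean parcel-wise Pearson correlation, can be sketched as follows: compute one correlation per brain parcel over time, then average across parcels. The data layout (rows are time points, columns are parcels) and the toy values are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_parcelwise_pearson(predicted, measured):
    """Average the per-parcel correlation over all parcels.

    predicted / measured: lists of time points, each a list with one
    value per parcel (assumed layout).
    """
    n_parcels = len(measured[0])
    per_parcel = [
        pearson([row[j] for row in predicted], [row[j] for row in measured])
        for j in range(n_parcels)
    ]
    return sum(per_parcel) / n_parcels


# Toy data: parcel 0 is predicted perfectly, parcel 1 is anti-correlated.
measured = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
predicted = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
print(mean_parcelwise_pearson(predicted, measured))
```

Pearson correlation lies between -1 and 1, so the values 32.25 and 21.25 quoted in the abstract are presumably the same quantity scaled by 100, given that the earlier iteration is reported as 0.3198 and 0.2096.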
cs.AI [Back]
[114] Agentic AI framework for End-to-End Medical Data Inference
Soorya Ram Shimgekar,Shayan Vassef,Abhay Goyal,Navin Kumar,Koustuv Saha
Main category: cs.AI
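TL;DR: The paper proposes an Agentic AI framework for end-to-end medical data inference. Modular, task-specific agents automate the pipeline from ingestion to inference, addressing fragmented preprocessing, model compatibility, and privacy constraints in healthcare.
Details
Motivation: Deploying machine learning in healthcare is expensive and labor-intensive, mainly because of fragmented preprocessing workflows, model compatibility issues, and strict data privacy requirements. The work proposes an automated framework to address these challenges.
Contribution: The main contribution is an end-to-end Agentic AI framework that handles structured and unstructured medical data and automates feature selection, model selection, and preprocessing recommendation, reducing the need for expert intervention.
Method: The framework contains several specialized agents for ingestion (Ingestion Identifier Agent), anonymization (Data Anonymizer Agent), feature extraction (Feature Extraction Agent), model matching (Model-Data Feature Matcher Agent), and preprocessing recommendation and implementation (Preprocessing Recommender Agent and Preprocessing Implementor Agent). Finally, the Model Inference Agent produces interpretable outputs.
Result: The framework is evaluated on public datasets from geriatrics, palliative care, and colonoscopy imaging. For example, it automates the full path from ingestion to inference on structured data (anxiety data) and unstructured data (colonoscopy polyp images).
Insight: Through modular design and task-specific agents, the framework lowers the cost and complexity of deploying machine learning in healthcare and offers a scalable, efficient route to AI in clinical settings.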
Abstract: Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the “Model Inference Agent” runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.
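The hand-off between agents described above can be sketched as a chain of functions that each read and extend a shared context. This is only a structural illustration: the agent logic, the `PII_COLUMNS` set, the `MODEL_REPOSITORY` registry, and the overlap-based matching are hypothetical stand-ins, whereas the paper's agents rely on embeddings, MedGemma, and a curated model repository.

```python
PII_COLUMNS = {"name", "patient_id", "date_of_birth"}  # assumed identifiers

MODEL_REPOSITORY = {  # assumed toy registry: model name -> expected features
    "anxiety_classifier": {"age", "sleep_hours", "heart_rate"},
    "polyp_detector": {"colonoscopy", "polyp"},
}


def ingestion_identifier(ctx):
    """Detect the data type from the file extension."""
    name = ctx["filename"].lower()
    if name.endswith(".csv"):
        ctx["data_type"] = "tabular"
    elif name.endswith((".png", ".jpg", ".jpeg")):
        ctx["data_type"] = "image"
    else:
        ctx["data_type"] = "unknown"
    return ctx


def data_anonymizer(ctx):
    """Drop columns that look like direct identifiers."""
    ctx["columns"] = [c for c in ctx.get("columns", []) if c not in PII_COLUMNS]
    return ctx


def feature_extractor(ctx):
    """For tabular data, treat the remaining column names as features."""
    ctx["features"] = set(ctx["columns"])
    return ctx


def model_matcher(ctx):
    """Pick the registry model whose expected features overlap most."""
    ctx["model"] = max(
        MODEL_REPOSITORY,
        key=lambda m: len(MODEL_REPOSITORY[m] & ctx["features"]),
    )
    return ctx


def run_pipeline(ctx, agents):
    """Run the agents in order and record which ones touched the context."""
    for agent in agents:
        ctx = agent(ctx)
        ctx.setdefault("trace", []).append(agent.__name__)
    return ctx


result = run_pipeline(
    {"filename": "anxiety.csv",
     "columns": ["name", "age", "sleep_hours", "heart_rate"]},
    [ingestion_identifier, data_anonymizer, feature_extractor, model_matcher],
)
print(result["data_type"], result["model"], result["trace"])
```

Keeping each agent as a function over a shared context makes the order explicit, and the recorded trace gives a simple audit trail of which stage changed the data.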
[115] SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law
Shanghai AI Lab: Yicheng Bao,Guanxu Chen,Mingkang Chen,Yunhao Chen,Chiyu Chen,Lingjie Chen,Sirui Chen,Xinquan Chen,Jie Cheng,Yu Cheng,Dengke Deng,Yizhuo Ding,Dan Ding,Xiaoshan Ding,Yi Ding,Zhichen Dong,Lingxiao Du,Yuyu Fan,Xinshun Feng,Yanwei Fu,Yuxuan Gao,Ruijun Ge,Tianle Gu,Lujun Gui,Jiaxuan Guo,Qianxi He,Yuenan Hou,Xuhao Hu,Hong Huang,Kaichen Huang,Shiyang Huang,Yuxian Jiang,Shanzhe Lei,Jie Li,Lijun Li,Hao Li,Juncheng Li,Xiangtian Li,Yafu Li,Lingyu Li,Xueyan Li,Haotian Liang,Dongrui Liu,Qihua Liu,Zhixuan Liu,Bangwei Liu,Huacan Liu,Yuexiao Liu,Zongkai Liu,Chaochao Lu,Yudong Lu,Xiaoya Lu,Zhenghao Lu,Qitan Lv,Caoyuan Ma,Jiachen Ma,Xiaoya Ma,Zhongtian Ma,Lingyu Meng,Ziqi Miao,Yazhe Niu,Yuezhang Peng,Yuan Pu,Han Qi,Chen Qian,Xingge Qiao,Jingjing Qu,Jiashu Qu,Wanying Qu,Wenwen Qu,Xiaoye Qu,Qihan Ren,Qingnan Ren,Qingyu Ren,Jing Shao,Wenqi Shao,Shuai Shao,Dongxing Shi,Xin Song,Xinhao Song,Yan Teng,Xuan Tong,Yingchun Wang,Xuhong Wang,Shujie Wang,Xin Wang,Yige Wang,Yixu Wang,Yuanfu Wang,Futing Wang,Ruofan Wang,Wenjie Wang,Yajie Wang,Muhao Wei,Xiaoyu Wen,Fenghua Weng,Yuqi Wu,Yingtong Xiong,Xingcheng Xu,Chao Yang,Yue Yang,Yang Yao,Yulei Ye,Zhenyun Yin,Yi Yu,Bo Zhang,Qiaosheng Zhang,Jinxuan Zhang,Yexin Zhang,Yinqiang Zheng,Hefeng Zhou,Zhanhui Zhou,Pengyu Zhu,Qingzi Zhu,Yubo Zhu,Bowen Zhou
Main category: cs.AI
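TL;DR: SafeWork-R1 is a multimodal reasoning model that co-evolves capability and safety through the SafeLadder framework, performing strongly on safety-related benchmarks without compromising general capabilities.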
Details
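Motivation: Existing alignment methods such as RLHF only learn human preferences and do not yield intrinsic safety reasoning or self-reflection abilities. SafeWork-R1 aims to fill this gap and improve the safety and reliability of AI.
Contribution: 1. Proposes the SafeLadder framework, which supports large-scale, progressive, safety-oriented reinforcement learning; 2. Gives rise to safety "aha" moments in the model's reasoning; 3. Develops multiple model variants that validate the framework's generalizability.
Method: 1. Trains with safety-oriented reinforcement learning under the SafeLadder framework; 2. Incorporates a suite of multi-principled verifiers; 3. Applies inference-time intervention and a deliberative search mechanism.
Result: On safety-related benchmarks, SafeWork-R1 improves on its base model Qwen2.5-VL-72B by 46.54% on average, and its safety performance surpasses that of GPT-4.1 and Claude Opus 4.
Insight: Safety and capability can co-evolve, and the SafeLadder framework offers a scalable approach to building reliable, trustworthy general-purpose AI.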
Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety ‘aha’ moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
cs.SD [Back]
[116] Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation
Jaechul Roh,Zachary Novack,Yuefeng Peng,Niloofar Mireshghallah,Taylor Berg-Kirkpatrick,Amir Houmansadr
Main category: cs.SD
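TL;DR: The paper studies training-data memorization in Lyrics-to-Song (LS2) and text-to-video generation models and proposes Adversarial PhoneTic Prompting (APT), an attack based on homophonic substitution that exposes memorization in both the audio and visual domains.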
Details
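Motivation: Prior work shows that generative models memorize training data, but how this manifests across modalities such as audio and video remains underexplored. The paper aims to reveal the severity of this vulnerability and its potential risks.
Contribution: 1. Proposes the Adversarial PhoneTic Prompting (APT) attack; 2. Finds that audio models still generate similar content when given homophonically substituted lyrics; 3. Is the first to reveal phonetic-to-visual memorization in text-to-video models.
Method: Lyrics are modified through homophonic substitution (e.g., “mom’s spaghetti” → “Bob’s confetti”) to form adversarial prompts that test whether models reproduce their original training data. Similarity of the generated audio is measured with the audio-domain metrics CLAP, AudioJudge, and CoverID; for video, the generated visual elements are compared with the original music video.
Result: Experiments show that audio generation models such as SUNO and YuE, and the video model Veo 3, produce outputs strikingly similar to known training content, confirming that cross-modal memorization exists.
Insight: Phonetic prompts alone can trigger a model's memory of its training data, even across modalities (from phonetic input to visual output), which poses new challenges for copyright, safety, and content provenance in generative systems.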
Abstract: Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem’s famous “mom’s spaghetti” $\rightarrow$ “Bob’s confetti”). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video – including character appearance and scene composition – despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (jrohsc.github.io/music_attack/).
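The homophonic substitution step described above can be illustrated with a minimal word-level sketch. This is an assumption-laden toy, not the authors' APT procedure: the substitution table is hand-written and contains only the single pair quoted in the abstract, and the function name `phonetic_substitute` is hypothetical. How the paper actually selects sound-alike replacements is not specified here.

```python
import re
from typing import Dict

# Hypothetical, hand-written table holding only the example pair
# quoted in the abstract; a real attack would need a far larger
# mapping or a phoneme-based matching step.
HOMOPHONE_TABLE: Dict[str, str] = {
    "mom's": "Bob's",
    "spaghetti": "confetti",
}


def phonetic_substitute(lyric: str, table: Dict[str, str] = HOMOPHONE_TABLE) -> str:
    """Replace each word found in the table with its sound-alike."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Look up case-insensitively; leave unknown words unchanged.
        return table.get(word.lower(), word)

    # Words are runs of letters and straight apostrophes.
    return re.sub(r"[A-Za-z']+", swap, lyric)
```

For instance, `phonetic_substitute("mom's spaghetti")` yields `"Bob's confetti"`, changing the meaning while keeping the sound pattern close to the original.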