Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 60]
- cs.RO [Total: 2]
- cs.AI [Total: 3]
- cs.IR [Total: 2]
- cs.CR [Total: 2]
- q-bio.NC [Total: 1]
- eess.IV [Total: 6]
- cs.LG [Total: 3]
cs.CL
[1] Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models
Atharva Bhargude,Ishan Gonehal,Chandler Haney,Dave Yoon,Kevin Zhu,Aaron Sandoval,Sean O’Brien,Kaustubh Vinnakota
Main category: cs.CL
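TL;DR: This paper proposes Adaptive Linguistic Prompting (ALP), a method for detecting phishing webpages with multimodal large language models such as GPT-4o and Gemini 1.5 Pro. ALP uses structured semantic reasoning to analyze textual deception and integrates textual, visual, and URL information, significantly improving phishing detection accuracy.
Details
Motivation: Phishing attacks are a serious cybersecurity threat that calls for adaptive detection techniques. Traditional methods struggle with sophisticated phishing content, so the study explores the potential of multimodal LLMs to improve detection.
Contribution: 1. Proposes ALP, which analyzes phishing content through structured semantic reasoning. 2. Integrates multimodal information (text, visuals, URL) into a unified detection model. 3. Shows experimentally that ALP significantly improves detection accuracy (F1-score of 0.93).
Method: ALP guides LLMs to analyze linguistic patterns, urgency cues, and manipulative diction, and combines multimodal data (text, images, URLs) for structured reasoning and contextual analysis.
Result: ALP-integrated multimodal LLMs reach an F1-score of 0.93, surpassing traditional approaches and showing advantages in precision and interpretability.
Insight: ALP illustrates the potential of LLMs in cybersecurity: structured linguistic analysis combined with multimodal integration can make phishing detection more robust and adaptive.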
Abstract: Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93, surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.
[2] SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation
Rafiq Kamel,Filippo Guerranti,Simon Geisler,Stephan Günnemann
Main category: cs.CL
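TL;DR: SAFT is a structure-aware fine-tuning method that injects graph topology into pretrained large language models (LLMs), significantly improving AMR-to-text generation without modifying the model architecture.
Details
Motivation: Current methods for graph-structured AMR inputs often discard structural information or rely on architectures incompatible with standard LLMs. SAFT aims to use graph structure to improve LLM performance.
Contribution: Proposes SAFT, a new structure-aware fine-tuning method that uses direction-sensitive magnetic Laplacian positional encodings to bring graph structure into the LLM's embedding space.
Method: Computes direction-sensitive positional encodings from the magnetic Laplacian of the transformed AMR and projects them into the LLM's embedding space, so that graph structure is preserved during fine-tuning.
Result: Achieves a 3.5 BLEU improvement over baselines on AMR 3.0, with larger gains on more complex graphs.
Insight: Structural information matters when LLMs process graph-structured data; SAFT offers a general approach that may extend to other tasks with graph-structured inputs.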
Abstract: Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.
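The positional-encoding step described in the SAFT abstract can be illustrated with a minimal sketch. It assumes the common unnormalized definition of the magnetic Laplacian (symmetrized adjacency with a complex phase that encodes edge direction); the function names, the toy graph, and the values q = 0.25 and k = 2 are illustrative choices, not taken from the paper. The learned projection into the LLM embedding space is omitted.

```python
import numpy as np


def magnetic_laplacian(edges, num_nodes, q=0.25):
    """Unnormalized magnetic Laplacian of a directed graph (assumed definition)."""
    adj = np.zeros((num_nodes, num_nodes))
    for src, dst in edges:
        adj[src, dst] = 1.0
    sym = 0.5 * (adj + adj.T)                # undirected edge weights
    phase = 2.0 * np.pi * q * (adj - adj.T)  # antisymmetric, encodes direction
    hermitian_adj = sym * np.exp(1j * phase)
    degree = np.diag(sym.sum(axis=1))
    return degree - hermitian_adj


def positional_encodings(edges, num_nodes, k=2, q=0.25):
    """Stack real and imaginary parts of the k lowest eigenvectors per node."""
    lap = magnetic_laplacian(edges, num_nodes, q)
    eigvals, eigvecs = np.linalg.eigh(lap)   # Hermitian input, ascending eigenvalues
    lowest = eigvecs[:, :k]
    return eigvals, np.concatenate([lowest.real, lowest.imag], axis=1)


# Toy directed graph standing in for a small AMR (hypothetical example).
edges = [(0, 1), (0, 2), (2, 3)]
lap = magnetic_laplacian(edges, num_nodes=4)
eigvals, pe = positional_encodings(edges, num_nodes=4, k=2)
print(pe.shape)
```

Reversing every edge yields the complex conjugate of the matrix, which is what makes the encoding direction-sensitive. Eigenvectors are only defined up to a complex phase, so a practical pipeline would need some convention to fix it.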
[3] PARAM-1 BharatGen 2.9B Model
Kundeshwar Pundalik,Piyush Sawarkar,Nihar Sahoo,Abhishek Shinde,Prateek Chanda,Vedant Goswami,Ajay Nagpal,Atul Singh,Viraj Thakur,Vijay Dewane,Aamod Thakur,Bhargav Patel,Smita Gautam,Bhagwan Panditi,Shyam Pawar,Madhav Kotcha,Suraj Racha,Saral Sureka,Pankaj Singh,Rishi Bal,Rohit Saluja,Ganesh Ramakrishnan
Main category: cs.CL
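TL;DR: PARAM-1 BharatGen 2.9B is a language model designed for India's multilingual setting and focused on Hindi and English. Through equitable corpus allocation and a tokenizer adapted to Indian morphological structures, it improves performance on multilingual tasks.
Details
Motivation: Existing LLMs are predominantly English-centric, which leaves multilingual regions such as India under-represented. PARAM-1 aims to close this gap through an architecture and training strategy designed for diversity.
Contribution: 1. Presents PARAM-1, an LLM focused on Indian diversity; 2. Proposes equitable corpus allocation and tokenizer design; 3. Provides culturally aligned evaluation benchmarks.
Method: 1. Trains on a bilingual dataset containing only Hindi and English; 2. Uses a SentencePiece tokenizer adapted to Indian morphological structures; 3. Builds diversity in at the pretraining stage.
Result: PARAM-1 serves both as a competent general-purpose model and as a baseline for India-centric tasks.
Insight: Designing for diversity at the pretraining stage, rather than through post-hoc alignment, is a more effective path to linguistically equitable models.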
Abstract: Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level, rather than deferring it to post-hoc alignment, PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.
[4] TopicImpact: Improving Customer Feedback Analysis with Opinion Units for Topic Modeling and Star-Rating Prediction
Emil Häglund,Johanna Björklund
Main category: cs.CL
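TL;DR: The paper proposes an opinion-unit-based approach to customer feedback analysis that restructures the topic modeling pipeline to improve topic coherence and sentiment association, and to predict star ratings.
Details
Motivation: Traditional topic modeling methods underuse the sentiment information in customer feedback and are not linked to business metrics.
Contribution: 1. Proposes a topic modeling pipeline built on opinion units, improving topic coherence and interpretability; 2. Associates sentiment with topics and analyzes its effect on business metrics such as star ratings; 3. Presents the system implementation and an evaluation of the method's effectiveness.
Method: Uses large language models to extract opinion units (text excerpts with sentiment scores), restructures the topic modeling pipeline around them, and combines the topic and sentiment modalities to predict star ratings.
Result: Experiments show that the restructured pipeline produces more coherent topics and improves star-rating prediction accuracy.
Insight: Opinion units capture the link between sentiment and topic in customer feedback at a finer granularity, giving more precise input to business decisions.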
Abstract: We improve the extraction of insights from customer reviews by restructuring the topic modeling pipeline to operate on opinion units: distinct statements that include relevant text excerpts and associated sentiment scores. Prior work has demonstrated that such units can be reliably extracted using large language models. The result is improved performance in the subsequent topic modeling, leading to coherent and interpretable topics while also capturing the sentiment associated with each topic. By correlating the topics and sentiments with business metrics, such as star ratings, we can gain insights on how specific customer concerns impact business outcomes. We present our system’s implementation, use cases, and advantages over other topic modeling and classification solutions. We also evaluate its effectiveness in creating coherent topics and assess methods for integrating topic and sentiment modalities for accurate star-rating prediction.
[5] Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers
Liang Lin,Zhihao Xu,Xuehai Tang,Shi Liu,Biyu Zhou,Fuqing Zhu,Jizhong Han,Songlin Hu
Main category: cs.CL
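TL;DR: The paper proposes Paper Summary Attack (PSA), a new jailbreaking method that exploits LLMs' trust in authoritative academic content. It builds adversarial prompts by synthesizing content from attack-focused or defense-focused LLM safety papers, and successfully attacks several models, including Claude3.5-Sonnet and Deepseek-R1.
Details
Motivation: LLMs tend to trust information from authoritative sources such as academic papers. The authors examine the security vulnerabilities this trust may create and propose a new attack method.
Contribution: 1. Proposes PSA, which constructs adversarial prompts from synthesized academic paper content; 2. Reveals vulnerability biases across models, and even across versions of the same model, when exposed to attack-focused versus defense-focused paper content.
Method: PSA systematically synthesizes content from attack-focused or defense-focused LLM safety papers into an adversarial prompt template, then inserts harmful queries as adversarial payloads within predefined subsections.
Result: PSA reaches a 97% attack success rate (ASR) on Claude3.5-Sonnet and a 98% ASR on Deepseek-R1, and reveals vulnerability biases across models.
Insight: LLMs' trust in authoritative content can create serious safety risks, and models differ markedly in how they handle attack-focused versus defense-focused content, which offers leads for future work on adversarial methods and safety alignment.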
Abstract: The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work has further revealed diametrically opposed vulnerability bias across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at https://github.com/233liang/Paper-Summary-Attack
[6] Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters
Shanbo Cheng,Yu Bao,Qian Cao,Luyang Huang,Liyan Kang,Zhicheng Liu,Yu Lu,Wenhao Zhu,Zhichao Huang,Tao Li,Sitong Liu,Ningxin Peng,Shuaijie She,Lu Xu,Nuo Xu,Sen Yang,Runsheng Yu,Yiming Yu,Liehao Zou,Hang Li,Lu Lu,Yuxuan Wang,Yonghui Wu
Main category: cs.CL
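TL;DR: Seed-X is an open-source multilingual translation LLM that, at 7B parameters, shows strong translation ability across 28 languages, matching closed-source models and outperforming larger open-source models.
Details
Motivation: Multilingual translation is complex, with intricate language patterns and stilted output in automated translation, and calls for an efficient open-source solution.
Contribution: Presents the Seed-X model family, including an instruct model fine-tuned with Chain-of-Thought reasoning and optimized with reinforcement learning, showing that high-performing multilingual translation is possible at a small parameter scale.
Method: The base model is pre-trained on diverse, high-quality multilingual data; the instruct model is fine-tuned with CoT reasoning and further improved through reinforcement learning for better generalization.
Result: Seed-X performs comparably to Gemini-2.5 and GPT-4o across 28 languages and significantly outperforms larger open-source models.
Insight: With well-chosen training methods and diverse data, a small model can deliver high-quality multilingual translation efficiently.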
Abstract: Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability at a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices from our optimization process and make the parameters publicly available for advancing translation research and applications.
[7] CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer
Teerapong Panboonyuen
Main category: cs.CL
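TL;DR: The paper proposes CU-ICU, which adapts unsupervised instruction-finetuned language models to ICU datasets using the T5 architecture. By combining sparse fine-tuning with few-shot prompting, it significantly improves predictive accuracy and interpretability.
Details
Motivation: Labeled data is scarce in healthcare, and in ICU settings in particular, so efficient adaptation methods are needed to support clinical decisions.
Contribution: 1) Proposes CU-ICU, which combines sparse fine-tuning with few-shot prompting; 2) Adapts efficiently to ICU tasks while updating fewer than 1% of parameters; 3) Performs well on tasks such as sepsis detection.
Method: Uses the T5 architecture with sparse fine-tuning (selective parameter updates) and few-shot prompting to minimize the need for supervision.
Result: Up to a 15% increase in sepsis detection accuracy and a 20% improvement in generating clinically relevant explanations, while updating fewer than 1% of model parameters.
Insight: Combining sparse fine-tuning with few-shot prompting is key to efficient domain adaptation, which matters most in resource-constrained clinical settings.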
Abstract: Integrating large language models into specialized domains like healthcare presents unique challenges, including domain adaptation and limited labeled data. We introduce CU-ICU, a method for customizing unsupervised instruction-finetuned language models for ICU datasets by leveraging the Text-to-Text Transfer Transformer (T5) architecture. CU-ICU employs a sparse fine-tuning approach that combines few-shot prompting with selective parameter updates, enabling efficient adaptation with minimal supervision. Our evaluation across critical ICU tasks (early sepsis detection, mortality prediction, and clinical note generation) demonstrates that CU-ICU consistently improves predictive accuracy and interpretability over standard fine-tuning methods. Notably, CU-ICU achieves up to a 15% increase in sepsis detection accuracy and a 20% enhancement in generating clinically relevant explanations while updating fewer than 1% of model parameters in its most efficient configuration. These results establish CU-ICU as a scalable, low-overhead solution for delivering accurate and interpretable clinical decision support in real-world ICU environments.
[8] KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs
Woo-Chan Kim,Ji-Hoon Park,Seong-Whan Lee
Main category: cs.CL
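TL;DR: KiC is a keyword-inspired cascade framework for cost-efficient text generation with LLMs. By assessing semantic alignment among outputs, it reduces reliance on the stronger model and lowers cost.
Details
Motivation: Existing cascade methods rely on exact text matching, so they struggle to assess the reliability of free-form outputs, which makes them costly and inefficient. KiC aims to address this.
Contribution: Proposes the KiC framework, which identifies a representative answer among a weaker model's outputs, evaluates semantic alignment, and decides dynamically whether to escalate to a stronger model, improving cost efficiency.
Method: KiC first generates multiple outputs with a weaker model, selects the most representative answer, and then uses the degree of semantic alignment to decide whether to accept it or escalate.
Result: On three free-form text generation benchmarks, KiC achieves 97.53% of GPT-4's accuracy while reducing API costs by 28.81% on average, and even outperforms GPT-4 on one benchmark.
Insight: Assessing semantic alignment works better than exact matching, and a dynamic cascade can cut costs substantially while keeping output quality high.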
Abstract: Large language models (LLMs) have demonstrated state-of-the-art performance across a wide range of natural language processing tasks. However, high-performing models are typically accessible only via APIs, incurring substantial inference costs. Cascade methods address this by initially employing a cheaper model and escalating to a stronger one only when necessary. Nevertheless, existing cascade approaches struggle to select a reliable representative response and assess the overall reliability of free-form outputs, as they rely on exact text matching. To overcome these limitations, we propose Keyword-inspired Cascade (KiC), a novel framework for cost-efficient free-form text generation. KiC identifies the most representative answer among multiple outputs from a weaker model and evaluates the semantic alignment of other responses with it. Based on the degree of alignment, KiC determines whether to accept the weaker model’s output or escalate to a stronger model. Experiments on three free-form text generation benchmarks show that KiC achieves 97.53 percent of GPT-4’s accuracy while reducing API costs by 28.81 percent on average, and even outperforms GPT-4 in a specific benchmark.
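The accept-or-escalate decision described in the KiC abstract can be sketched as follows. This is a simplified illustration, not the paper's method: Jaccard overlap of keyword sets stands in for the semantic alignment measure, and the stop-word list, the threshold of 0.5, and the `strong_model` callable are all hypothetical.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "it"}  # illustrative, not exhaustive


def keywords(text):
    """Lowercased content words of an answer."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cascade(weak_answers, strong_model, threshold=0.5):
    """Accept the most representative weak answer, or escalate."""
    kw = [keywords(a) for a in weak_answers]
    n = len(kw)

    def mean_alignment(i):
        others = [jaccard(kw[i], kw[j]) for j in range(n) if j != i]
        return sum(others) / len(others) if others else 1.0

    scores = [mean_alignment(i) for i in range(n)]
    best = max(range(n), key=scores.__getitem__)
    if scores[best] >= threshold:
        return weak_answers[best], "weak"
    return strong_model(), "strong"  # only pay for the strong model when needed


agree = cascade(["Paris is the capital", "The capital is Paris", "Paris"], lambda: "Paris")
disagree = cascade(["Paris", "London", "Berlin"], lambda: "Paris")
print(agree, disagree)
```

The design point is that the strong model is called lazily: when the weak samples agree with each other, the API cost of the stronger model is never incurred.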
[9] LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Haoyang Li,Zhanchao Xu,Yiming Li,Xuejia Chen,Darian Li,Anxin Tian,Qingfa Xiao,Cheng Deng,Jun Wang,Qing Li,Lei Chen,Mingxuan Yuan
Main category: cs.CL
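TL;DR: LoopServe is an adaptive dual-phase LLM inference acceleration framework that targets the computational and memory challenges of multi-turn dialogues, improving inference efficiency through dynamic attention sparsification and progressive key-value compression.
Details
Motivation: LLMs are computationally and memory inefficient in multi-turn dialogues, and existing methods rely on fixed or position-based heuristics that adapt poorly to dynamic conversation patterns.
Contribution: 1. Proposes a dynamic attention sparsification technique; 2. Designs a progressive key-value compression method; 3. Releases a multi-turn long-context benchmark.
Method: 1. During prefilling, dynamically selects the most important parts of the attention matrix; 2. During decoding, adaptively maintains an efficient key-value cache.
Result: Experiments show that LoopServe significantly speeds up inference on multi-turn dialogue tasks and outperforms existing baselines.
Insight: Adapting to conversation patterns dynamically and compressing the cache progressively are key to efficient inference.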
Abstract: Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark (https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs) with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
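To make the idea of input-dependent attention sparsification concrete, here is a minimal sketch for a single query row. The selection rule (keep the smallest set of keys whose attention weights cover a target mass) is an assumption chosen for illustration; the abstract does not specify LoopServe's actual criterion, and `select_keys` and the 0.9 default are hypothetical.

```python
def select_keys(attn_row, mass=0.9):
    """Indices of the fewest keys whose weights sum to at least `mass`."""
    order = sorted(range(len(attn_row)), key=lambda i: attn_row[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += attn_row[i]
        if total >= mass:
            break
    return sorted(kept)


# One softmax-normalized attention row over four cached tokens (toy values).
row = [0.5, 0.05, 0.3, 0.15]
print(select_keys(row))        # keys needed to cover 90% of the mass
print(select_keys(row, 0.5))   # a looser budget keeps fewer keys
```

Unlike a fixed window or a fixed top-k, the number of keys kept here varies with how concentrated each row is, which is the sense in which the selection adapts to the input.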
[10] Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models
Palash Nandi,Maithili Joshi,Tanmoy Chakraborty
Main category: cs.CL
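TL;DR: The paper studies how prompt sensitivity in Visual Language Models (VLMs) can be exploited to generate inappropriate content, analyzes the effect of three key factors on jailbreak success, and proposes a framework that uses an internal skip connection to raise jailbreak success rates.
Details
Motivation: VLMs' sensitivity to their inputs can lead to inappropriate outputs. Studying how prompt design exploits this sensitivity helps expose model vulnerabilities and improve safety.
Contribution: 1) Analyzes the effect of three key factors on successful VLM jailbreaks; 2) Proposes a framework using a skip connection that substantially increases jailbreak success rates; 3) Finds that memes can be as effective as toxic images in eliciting harmful content.
Method: 1) Analyzes model sensitivity under unimodal and multimodal inputs; 2) Tests experimentally how the three key factors affect jailbreaks; 3) Designs a skip-connection framework to increase jailbreak success.
Result: VLMs can distinguish benign from harmful inputs in unimodal settings, but this ability degrades significantly in multimodal settings; the skip-connection framework substantially increases jailbreak success rates; memes are about as effective as toxic images in eliciting harmful content.
Insight: VLMs are fragile under multimodal inputs, small changes in prompt design can cause serious safety problems, and the skip-connection mechanism gives attackers a new tool.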
Abstract: Language models are highly sensitive to prompt formulations - small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.
[11] An Enhanced Model-based Approach for Short Text Clustering
Enhao Cheng,Shoujia Zhang,Jianhua Yin,Xuemeng Song,Tian Gan,Liqiang Nie
Main category: cs.CL
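TL;DR: The paper proposes GSDMM+, an improved short text clustering method based on the Dirichlet Multinomial Mixture model. It improves clustering by reducing initialization noise, adaptively adjusting word weights, and strategically merging clusters.
Details
Motivation: Short text clustering is challenging because social media data is sparse, high-dimensional, and large-scale. Existing methods are computationally expensive and limited in performance, so a more efficient solution is needed.
Contribution: 1) Improves the GSDMM model (GSDMM+); 2) Reduces initialization noise; 3) Adaptively adjusts word weights based on entropy; 4) Refines clustering granularity through strategic cluster merging.
Method: Builds on GSDMM and adds initialization denoising, entropy-based word weighting, and strategic cluster merging.
Result: Experiments show that GSDMM+ outperforms classical and state-of-the-art methods in both efficiency and effectiveness.
Insight: More careful initialization and dynamic weight adjustment can significantly improve short text clustering.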
Abstract: Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.
[12] Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models
Hosein Azarbonyad,Zi Long Zhu,Georgios Cheirmpos,Zubair Afzal,Vikrant Yadav,Georgios Tsatsaronis
Main category: cs.CL
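TL;DR: The paper proposes two approaches for extracting question-answer (QA) pairs from scientific articles: one based on a large language model (LLM) and one based on a knowledge graph (KG). Experiments show that the KG approach captures the main ideas of an article more effectively.
Details
Motivation: When deciding whether to read or cite an article, scholars need to grasp its core content quickly. The paper aims to help them extract an article's key concepts and contributions efficiently by generating QA pairs.
Contribution: 1. Proposes two QA generation approaches: a text-only LLM approach and a KG-based approach; 2. Designs a salient triplet extraction method for the knowledge graph that scores triplet importance with a TF-IDF-like measure; 3. Validates the KG approach through expert evaluation.
Method: 1. Text-only approach: select salient paragraphs, generate questions with an LLM, rank them, then generate answers; 2. KG approach: build a knowledge graph from scientific articles, extract salient triplets (using entity centrality and a triplet's importance within the article relative to its prevalence in the literature), and generate QA pairs.
Result: Expert evaluation shows that the KG approach captures the main ideas of an article more accurately, and that fine-tuning the entity relationship extraction model on the domain is crucial for triplet quality.
Insight: 1. The text-only approach relies solely on article content and suits quick QA generation; 2. The KG approach, by drawing on a domain knowledge graph, is better suited to assessing an article's novelty and contributions; 3. Domain-specific fine-tuning is key to knowledge graph quality.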
Abstract: When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article’s novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct a KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and using it to build the graph. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
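The triplet saliency idea in the abstract (importance within the article versus prevalence in the literature) resembles TF-IDF, and a minimal sketch under that reading is shown below. The exact formula is an assumption: term frequency is the triplet's share of the article's triplets, and the inverse document frequency uses add-one smoothing. The entity centrality component mentioned in the abstract is not modeled, and all triplets are toy data.

```python
import math
from collections import Counter


def triplet_saliency(article_triplets, literature):
    """Score each (head, relation, tail) triplet with a TF-IDF-like measure.

    `literature` is a list of triplet sets, one per other article.
    """
    counts = Counter(article_triplets)
    total = len(article_triplets)
    n_docs = len(literature)
    scores = {}
    for triplet, count in counts.items():
        doc_freq = sum(1 for doc in literature if triplet in doc)
        tf = count / total
        idf = math.log((1 + n_docs) / (1 + doc_freq))  # 0 when present everywhere
        scores[triplet] = tf * idf
    return scores


article = [
    ("model", "uses", "attention"),
    ("model", "uses", "attention"),
    ("method", "improves", "recall"),
]
literature = [
    {("model", "uses", "attention")},
    {("model", "uses", "attention"), ("dataset", "contains", "reviews")},
]
scores = triplet_saliency(article, literature)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

A triplet that appears in every other article scores zero even if it is frequent in this one, which matches the intuition that novelty requires comparison with existing literature.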
[13] The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words
Lizhi Ma,Tong Zhao,Shuai Zhang,Nirui Song,Hongliang He,Anqi Li,Ran Feng,Huachuan Qiu,Jingsong Ma,Zhenzhong Lan
Main category: cs.CL
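TL;DR: This study examines how the use of first-person singular pronouns and negative emotional words in Chinese psycho-counseling relates to depression and anxiety. Negative emotional words correlate positively with symptom severity, while the frequency of first-person singular pronouns shows no significant difference. Cultural context and the dynamics of counseling conversations are important factors.
Details
Motivation: The study aims to clarify the relationship between linguistic expression (first-person pronouns and negative emotional words) and states of depression and anxiety, focusing on Chinese psycho-counseling to complement findings from Western contexts.
Contribution: Shows that negative emotional words correlate significantly with the severity of depression and anxiety, whereas first-person singular pronoun use does not differ significantly in the Chinese context, highlighting the distinct influence of culture and conversational dynamics on language use.
Method: Uses a corpus of 735 online counseling sessions, quantifies linguistic patterns with the LIWC software, and analyzes the data with a general linear mixed-effect model.
Result: The frequency of negative emotional words correlates significantly and positively with the severity of depression and anxiety, while first-person singular pronoun frequency does not vary significantly with psychological state.
Insight: Cultural differences (collectivist versus individualistic) and the particular dynamics of counseling interactions shape language use, so Chinese-language mental health practice should attend to these distinct linguistic markers.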
Abstract: This study explores the relationship between linguistic expressions and psychological states of depression and anxiety within Chinese psycho-counseling interactions, focusing specifically on the usage of first-person singular pronouns and negative emotional words. Utilizing a corpus derived from 735 online counseling sessions, the analysis employed a general linear mixed-effect model to assess linguistic patterns quantified by the Linguistic Inquiry and Word Count (LIWC) software. Results indicate a significant positive correlation between the frequency of negative emotional words and the severity of both depressive and anxious states among clients. However, contrary to prior findings predominantly derived from English-language contexts, the usage frequency of first-person singular pronouns did not vary significantly with the clients’ psychological conditions. These outcomes are discussed within the framework of cultural distinctions between collectivist Chinese contexts and individualistic Western settings, as well as the interactive dynamics unique to psycho-counseling conversations. The findings highlight the nuanced influence of cultural and conversational contexts on language use in mental health communications, providing insights into psycholinguistic markers relevant to therapeutic practices in Chinese-speaking populations.
[14] InTraVisTo: Inside Transformer Visualisation Tool
Nicolò Brunello,Davide Rigamonti,Andrea Sassella,Vincenzo Scotti,Mark James Carman
Main category: cs.CL
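TL;DR: The paper introduces InTraVisTo, a tool for visualizing the internal states and information flow of Transformer models, helping researchers understand the internal computation of LLMs.
Details
Motivation: As LLMs grow in size and complexity, their unpredictability and the gap between desired behavior and actual output make production use challenging. Tools are needed to understand their internal computation.
Contribution: Develops InTraVisTo, which visualizes a Transformer model's internal state (by decoding token embeddings at each layer) and its information flow (using a Sankey diagram).
Method: Traces the model's computation by decoding token embeddings at each layer and visualizing information flow with a Sankey diagram.
Result: The tool helps researchers better understand the internal computations and reasoning patterns of LLMs.
Insight: Visualization tools are valuable for understanding and debugging complex LLMs, particularly the internal workings of Transformer models.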
Abstract: The reasoning capabilities of Large Language Models (LLMs) have increased greatly over the last few years, as have their size and complexity. Nonetheless, the use of LLMs in production remains challenging due to their unpredictable nature and discrepancies that can exist between their desired behavior and their actual model output. In this paper, we introduce a new tool, InTraVisTo (Inside Transformer Visualisation Tool), designed to enable researchers to investigate and trace the computational process that generates each token in a Transformer-based LLM. InTraVisTo provides a visualization of both the internal state of the Transformer model (by decoding token embeddings at each layer of the model) and the information flow between the various components across the different layers of the model (using a Sankey diagram). With InTraVisTo, we aim to help researchers and practitioners better understand the computations being performed within the Transformer model and thus to shed some light on internal patterns and reasoning processes employed by LLMs.
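Decoding token embeddings at each layer can be sketched with a toy example. The sketch assumes the simplest possible decoder, a dot product between each layer's hidden vector and an unembedding matrix followed by an argmax; whether InTraVisTo applies normalization or other decoding choices is not stated in the abstract. The vocabulary, matrix, and hidden states are made up for illustration.

```python
def decode_layers(hidden_by_layer, unembed, vocab):
    """Map each layer's hidden vector to its highest-scoring vocabulary token."""
    decoded = []
    for hidden in hidden_by_layer:
        # One logit per vocabulary row: dot product with the hidden vector.
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        best = max(range(len(vocab)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy identity matrix
hidden_by_layer = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.1, 0.8]]
print(decode_layers(hidden_by_layer, unembed, vocab))
```

Reading the decoded token layer by layer shows how the model's intermediate prediction shifts before the final output is produced.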
[15] Label Unification for Cross-Dataset Generalization in Cybersecurity NER
Maciej Jalocha,Johan Hausted Schmidt,William Michelseen
Main category: cs.CL
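TL;DR: The paper studies label unification for named entity recognition (NER) in cybersecurity. Through coarse-grained label unification and cross-dataset evaluation, it exposes the limits of model generalization and tries alternative architectures, including a multihead model and a graph-based transfer model.
Details
Motivation: Cybersecurity NER lacks standardized labels, which makes datasets hard to combine and limits both model generalization and the usability of data resources.
Contribution: 1) Performs coarse-grained label unification across four cybersecurity datasets; 2) Reveals generalization problems through cross-dataset evaluation; 3) Proposes alternative architectures, a multihead model and a graph-based transfer model, to address the limits of label unification.
Method: Runs pairwise cross-dataset evaluations with BiLSTM models and proposes two alternative architectures: a multihead model with weight sharing and a graph-based transfer model built on BERT-base-NER.
Result: Models trained on unified datasets generalize poorly across datasets. The multihead model gives only marginal improvements, and the graph-based transfer model shows no significant gains over BERT-base-NER.
Insight: The limits of label unification may stem from inherent differences between datasets. The alternative architectures did not resolve them, which suggests that more dataset-specific or task-specific methods are needed.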
Abstract: The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.
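Coarse-grained label unification amounts to mapping each dataset's label set onto a shared one. A minimal sketch for BIO-tagged data follows; the dataset names, the source labels, and the unified labels are all hypothetical, since the abstract does not list them. Labels with no counterpart in the unified scheme are mapped to `O`.

```python
# Hypothetical per-dataset mappings; None means "no unified counterpart".
UNIFIED_LABELS = {
    "dataset_a": {"MALWARE": "Malware", "TOOL": "Malware", "ORG": "Organization"},
    "dataset_b": {"Malware": "Malware", "SecTeam": "Organization", "Version": None},
}


def unify_tag(tag, dataset):
    """Map one BIO tag to the unified scheme, keeping its B-/I- prefix."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    coarse = UNIFIED_LABELS[dataset].get(label)
    return f"{prefix}-{coarse}" if coarse else "O"


def unify_sequence(tags, dataset):
    """Apply the mapping to a whole tagged sentence."""
    return [unify_tag(tag, dataset) for tag in tags]


print(unify_sequence(["B-TOOL", "I-TOOL", "O", "B-ORG"], "dataset_a"))
print(unify_sequence(["B-Malware", "O", "B-Version"], "dataset_b"))
```

Mapping several fine-grained labels to one coarse label loses information, which is one plausible reason unified training can generalize poorly when datasets draw their label boundaries differently.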
[16] Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies
Carlos Mena,Pol Serra,Jacobo Romero,Abir Messaoudi,Jose Giraldo,Carme Armentano-Oller,Rodolfo Zevallos,Ivan Meza,Javier Hernando
Main category: cs.CL
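TL;DR: The paper explores how to optimize automatic speech recognition (ASR) for Catalan-Spanish code-switching (CS). It compares three strategies involving data synthesis and language tokens and finds that synthetic data combined with the dominant language token works best.
Details
Motivation: Code-switching is common in real-world settings, but the lack of dedicated datasets limits ASR performance, particularly for mixed Catalan-Spanish speech.
Contribution: Proposes and compares three strategies for improving ASR: synthetic CS data, concatenated monolingual audio, and real CS data with language tokens. Also releases fine-tuned Whisper models.
Method: 1. Generate synthetic CS data; 2. Concatenate monolingual audio; 3. Use real CS data with language tokens. OpenAI's Whisper models are fine-tuned and released on Hugging Face.
Result: Combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
Insight: Combining synthetic data with language tokens can ease data scarcity in code-switching ASR and improve performance in real-world settings.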
Abstract: Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI’s Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
[17] Using LLMs to identify features of personal and professional skills in an open-response situational judgment test
Cole Walsh,Rodica Ivan,Muhammad Zafar Iqbal,Colleen Robb
Main category: cs.CL
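TL;DR: The paper explores using large language models (LLMs) to extract features related to personal and professional skills from open-response situational judgment tests (SJTs), offering a new and feasible basis for automated scoring.
Details
Motivation: Academic programs increasingly value personal and professional skills, but open-response SJTs have relied on human raters and are hard to scale. Earlier NLP-based scoring systems fell short because of construct validity issues, so the authors turn to LLMs.
Contribution: Proposes a new approach that uses LLMs to extract construct-relevant features from SJT responses and demonstrates it on the Casper SJT, laying groundwork for future automated scoring systems.
Method: Uses LLMs to extract features from open-response SJT answers and evaluates the approach by analyzing construct-relevant features.
Result: LLMs can effectively extract skill-related features from SJT responses, which makes automated scoring feasible.
Insight: LLMs show promise for evaluating open-response SJTs, overcoming the limits of human rating and earlier NLP methods and enabling assessment at scale.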
Abstract: Academic programs are increasingly recognizing the importance of personal and professional skills and their critical role alongside technical expertise in preparing students for future success in diverse career paths. With this growing demand comes the need for scalable systems to measure, evaluate, and develop these skills. Situational Judgment Tests (SJTs) offer one potential avenue for measuring these skills in a standardized and reliable way, but open-response SJTs have traditionally relied on trained human raters for evaluation, presenting operational challenges to delivering SJTs at scale. Past attempts at developing NLP-based scoring systems for SJTs have fallen short due to issues with construct validity of these systems. In this article, we explore a novel approach to extracting construct-relevant features from SJT responses using large language models (LLMs). We use the Casper SJT to demonstrate the efficacy of this approach. This study sets the foundation for future developments in automated scoring for personal and professional skills.
[18] Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need
Bhishma Dedhia,Yuval Kansal,Niraj K. Jha
Main category: cs.CL
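TL;DR: The paper proposes a bottom-up approach that builds domain-specific superintelligence from a knowledge graph (KG) and validates it in medicine.
Details
Motivation: The conventional top-down training of language models struggles to support the abstractions needed for deep domain expertise, motivating a bottom-up approach that acquires expertise by composing basic domain concepts.
Contribution: 1. Proposes a KG-based task generation pipeline that composes KG primitives into complex reasoning tasks. 2. Validates the approach in medicine and introduces the ICD-Bench evaluation suite. 3. Shows that the QwQ-Med-3 model has a significant advantage on medical reasoning tasks.
Method: 1. Use a knowledge graph as the base of domain knowledge. 2. Design a task generation pipeline that synthesizes reasoning tasks from the KG. 3. Fine-tune a language model on the generated curriculum.
Result: QwQ-Med-3 significantly outperforms existing reasoning models on ICD-Bench, does even better on the hardest tasks, and transfers acquired expertise to improve on the base model's performance.
Insight: Domain-specific superintelligence can be achieved by composing basic concepts; future artificial general intelligence (AGI) may emerge from the interaction of efficient domain-specific agents.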
Abstract: Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model’s performance. While the industry’s approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.
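The abstract describes KG primitives as head-relation-tail edges whose paths encode higher-level concepts. The following is a minimal sketch of that idea only, not the authors' task generation pipeline: the toy triples, entity and relation names, and the `enumerate_paths` helper are all hypothetical placeholders introduced for illustration.

```python
from collections import defaultdict

# Toy head-relation-tail triples; every name here is an illustrative
# placeholder and does not come from the paper's medical KG.
TRIPLES = [
    ("concept_a", "causes", "concept_b"),
    ("concept_b", "treated_by", "concept_c"),
    ("concept_a", "associated_with", "concept_d"),
]


def enumerate_paths(triples, start, max_hops):
    """Return every cycle-free path of 1..max_hops edges starting at `start`."""
    outgoing = defaultdict(list)
    for head, relation, tail in triples:
        outgoing[head].append((relation, tail))

    paths = []

    def walk(node, path, visited):
        if path:
            paths.append(list(path))  # record a copy of the current path
        if len(path) == max_hops:
            return
        for relation, tail in outgoing[node]:
            if tail in visited:  # skip cycles
                continue
            path.append((node, relation, tail))
            walk(tail, path, visited | {tail})
            path.pop()

    walk(start, [], {start})
    return paths


all_paths = enumerate_paths(TRIPLES, "concept_a", max_hops=2)
# Multi-hop paths are the candidates for "composed" concepts.
two_hop = [p for p in all_paths if len(p) == 2]
```

Each multi-hop path could then be turned into a question whose answer requires chaining the individual edges; how the paper phrases such tasks or attaches thinking traces is not specified in the abstract and is not modeled here.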
[19] Efficient Temporal Tokenization for Mobility Prediction with Large Language Models
Haoyu He,Haozheng Luo,Yan Chen,Qi R. Wang
Main category: cs.CL
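TL;DR: RHYTHM is an efficient temporal tokenization framework that uses large language models (LLMs) for mobility prediction; daily-segment tokens with hierarchical attention shorten the sequence, improving both computational efficiency and accuracy.
Details
Motivation: Current mobility prediction methods trade off capturing spatio-temporal dependencies against computational efficiency; a solution that is both efficient and accurate is needed.
Contribution: Proposes the RHYTHM framework, which improves prediction accuracy and computational efficiency through hierarchical temporal tokenization and a frozen LLM backbone.
Method: Segments trajectories into daily tokens, applies hierarchical attention to capture daily and weekly dependencies, and enriches representations with pre-computed prompt embeddings.
Result: On three real-world datasets, RHYTHM improves accuracy by 2.4%, improves weekend prediction by 5.0%, and reduces training time by 24.6%.
Insight: Freezing the LLM backbone effectively reduces computational overhead, while hierarchical tokenization captures spatio-temporal dependencies efficiently.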
Abstract: We introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a framework that leverages large language models (LLMs) as spatio-temporal predictors and trajectory reasoners. RHYTHM partitions trajectories into daily segments encoded as discrete tokens with hierarchical attention, capturing both daily and weekly dependencies while substantially reducing the sequence length. Token representations are enriched with pre-computed prompt embeddings via a frozen LLM, enhancing the model’s ability to capture interdependencies without extensive computational overhead. By freezing the LLM backbone, RHYTHM achieves significant computational efficiency. Evaluation on three real-world datasets demonstrates a 2.4% improvement in accuracy, 5.0% increase on weekends, and 24.6% reduction in training time compared to state-of-the-art methods.
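To make the sequence-length argument concrete, here is a minimal sketch of the daily segmentation step, assuming hourly observations (the sampling rate is an assumption, not stated in the abstract). The `segment_by_day` helper and the placeholder location IDs are hypothetical; how each segment is encoded as a token and how hierarchical attention is applied are not modeled.

```python
def segment_by_day(locations, steps_per_day=24):
    """Split a flat list of location IDs into consecutive daily segments."""
    if steps_per_day <= 0:
        raise ValueError("steps_per_day must be positive")
    return [
        locations[i:i + steps_per_day]
        for i in range(0, len(locations), steps_per_day)
    ]


# One week of hourly observations: 7 * 24 = 168 time steps.
week = [hour % 5 for hour in range(7 * 24)]  # placeholder location IDs
days = segment_by_day(week)

# Attention over per-day tokens sees 7 positions instead of 168.
print(len(week), len(days))  # prints "168 7"
```

Under these assumptions, attention across days operates on 7 positions rather than 168, which is the kind of reduction the abstract attributes to daily segmentation.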
[20] Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks
Israt Jahan,Md Tahmid Rahman Laskar,Chun Peng,Jimmy Huang
Main category: cs.CL
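TL;DR: This paper evaluates several cost-efficient large language models (LLMs) on biomedical tasks and finds that different LLMs perform best on different tasks, with open-source models even outperforming closed-source ones on some.
Details
Motivation: To determine which LLM performs best across diverse biomedical tasks and so inform model selection in practice.
Contribution: A comprehensive evaluation of closed-source and open-source LLMs on biomedical tasks that reveals each model's strengths and suitable use cases.
Method: Benchmarks multiple LLMs on text classification, text generation, question answering, and multimodal image processing tasks.
Result: No single LLM performs best on every task; different models excel on different tasks, and open-source models match or even exceed closed-source ones on some.
Insight: Model choice should follow task requirements; open-source models offer additional benefits in speed and privacy.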
Abstract: This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as biomedical text classification and generation, question answering, and multimodal image processing. Our experimental findings indicate that there is no single LLM that can consistently outperform others across all tasks. Instead, different LLMs excel in different tasks. While some closed-source LLMs demonstrate strong performance on specific tasks, their open-source counterparts achieve comparable results (sometimes even better), with additional benefits like faster inference and enhanced privacy. Our experimental results offer valuable insights for selecting models that are optimally suited for specific biomedical applications.
[21] Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog
Lautaro Estienne,Gabriel Ben Zenou,Nona Naderi,Jackie Cheung,Pablo Piantanida
Main category: cs.CL
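TL;DR: CRSA extends the RSA framework with an information-theoretic formulation that models multi-turn dialog by optimizing a gain function, improving the collaborativeness and interpretability of dialog.
Details
Motivation: Existing RSA frameworks are hard to scale to multi-turn collaborative settings; a method that can model shared goals and beliefs is needed.
Contribution: Proposes the CRSA framework, which combines information theory with multi-turn dialog optimization to improve collaboration and interpretability.
Method: Extends RSA on information-theoretic grounds, modeling how private information is shared over a multi-turn dialog through an optimized gain function.
Result: On referential games and medical-domain dialogs, CRSA outperforms baselines and behaves more consistently and interpretably.
Insight: CRSA points to a new direction for building more pragmatic and socially aware language models.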
Abstract: As AI systems take on collaborative roles, they must reason about shared goals and beliefs, not just generate fluent language. The Rational Speech Act (RSA) framework offers a principled approach to pragmatic reasoning, but existing extensions face challenges in scaling to multi-turn, collaborative scenarios. In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain is an extension of the gain model that is maximized in the original RSA model but takes into account the scenario in which both agents in a conversation have private information and produce utterances conditioned on the dialog. We demonstrate the effectiveness of CRSA on referential games and template-based doctor-patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines, paving the way for more pragmatic and socially aware language agents.
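For readers unfamiliar with the baseline that CRSA extends, the following is a minimal sketch of the original single-turn RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on a toy referential game. It is not CRSA: the multi-turn gain function and private information are not modeled, and the objects, utterances, uniform prior, and rationality parameter `alpha` are all assumptions chosen for illustration.

```python
MEANINGS = ["blue_square", "blue_circle", "green_square"]
UTTERANCES = ["blue", "green", "square", "circle"]


def is_true(utterance, meaning):
    """Literal semantics: an utterance is true if it names a feature of the object."""
    return utterance in meaning.split("_")


def normalize(scores):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    # L0(m | u): uniform over the meanings the utterance is literally true of.
    return normalize({m: float(is_true(utterance, m)) for m in MEANINGS})


def pragmatic_speaker(meaning, alpha=1.0):
    # S1(u | m) is proportional to L0(m | u) ** alpha.
    return normalize({u: literal_listener(u)[meaning] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    # L1(m | u) is proportional to S1(u | m) under a uniform prior over meanings.
    return normalize({m: pragmatic_speaker(m, alpha)[utterance] for m in MEANINGS})


beliefs = pragmatic_listener("blue")
```

With `alpha = 1`, the pragmatic listener who hears "blue" favors the blue square (probability 0.6) over the blue circle (0.4), because a speaker who meant the circle would more likely have said "circle".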
cs.CV
[22] Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives
Yang Zhou,Junjie Li,CongYang Ou,Dawei Yan,Haokui Zhang,Xizhe Xue
Main category: cs.CV
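TL;DR: This survey reviews open-vocabulary object detection (OVOD) in UAV imagery, which builds on cross-modal text-image alignment (e.g., CLIP), and covers its core principles, a taxonomy of methods, datasets, and future research directions.
Details
Motivation: Traditional UAV object detection is limited to predefined categories; cross-modal techniques make open-vocabulary detection possible, advancing UAV intelligence and autonomy.
Contribution: 1. A systematic taxonomy of OVOD in UAV imagery; 2. A review of the relevant datasets; 3. An analysis of key challenges and open problems; 4. An outline of future research directions.
Method: Surveys open-vocabulary detection enabled by cross-modal text-image alignment (e.g., CLIP), organizing and reviewing methods in light of the characteristics of UAV vision.
Result: Systematically organizes the current state, challenges, and future development paths of OVOD in UAV imagery.
Insight: Cross-modal techniques open new opportunities for UAV object detection, but data scarcity and annotation cost remain to be solved.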
Abstract: Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in Unmanned Aerial Vehicles (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text-image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep tracing related works at https://github.com/zhouyang2002/OVOD-in-UVA-imagery
[23] VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
Shmuel Berman,Jia Deng
Main category: cs.CV
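TL;DR: The paper evaluates leading vision-language models (VLMs) on nonlocal visual reasoning tasks and finds that even top models perform poorly on tasks that humans find simple.
Details
Motivation: Although VLMs excel at complex visual tasks, they do poorly on simple tasks that require reasoning across image regions, so the authors design experiments to test their nonlocal visual reasoning.
Contribution: Proposes a structured evaluation framework that tests VLMs on three nonlocal visual tasks (comparative perception, saccadic search, and smooth visual search), exposing the limitations of current models.
Method: Designs three tasks (comparative perception, saccadic search, smooth visual search) that evaluate models in settings requiring cross-region reasoning, and compares the results with human ability.
Result: Leading models (e.g., Gemini 2.5 Pro, Claude Vision 3.7) perform near chance level on these tasks, far below human ability.
Insight: Despite gains in visual acuity, VLMs lack core nonlocal visual reasoning capabilities, suggesting that their reasoning mechanisms differ fundamentally from those of humans.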
Abstract: Visual Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models’ capacity for nonlocal visual reasoning – reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test if VLMs can perform similar visual algorithms to humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
[24] Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
Binbin Ji,Siddharth Agrawal,Qiance Tang,Yvonne Wu
Main category: cs.CV
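TL;DR: The paper studies how Chain-of-Thought prompting and reinforcement learning can improve spatial reasoning in vision-language models (VLMs), finding that structured scene-graph prompting significantly outperforms simple prompting and that the reinforcement learning method GRPO beats supervised fine-tuning in generalization and robustness.
Details
Motivation: Improving the spatial reasoning of VLMs is an active research focus, but existing methods such as simple Chain-of-Thought prompting have limited effect and can even hurt model performance.
Contribution: 1) Proposes structured multi-stage prompting based on scene graphs (SceneGraph CoT), which significantly improves spatial reasoning accuracy; 2) Fine-tunes models with GRPO reinforcement learning, which outperforms SFT in generalization and robustness.
Method: 1) Compares prompting strategies (e.g., simple CoT vs. SceneGraph CoT); 2) Fine-tunes models with GRPO on the SAT dataset and evaluates them on CVBench.
Result: SceneGraph CoT significantly improves spatial reasoning accuracy; GRPO outperforms SFT on Pass@1 evaluations and is more robust under OOD conditions, avoiding SFT's overfitting to linguistic patterns.
Insight: Structured prompting and reinforcement learning are key to improving VLM spatial reasoning, in particular for addressing the poor generalization of simple prompting and SFT.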
Abstract: This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model’s original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from “closer to” to “farther from”). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: https://github.com/Yvonne511/spatial-vlm-investigator
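The abstract does not spell out how GRPO scores responses. As commonly described, GRPO samples a group of responses per prompt and uses each response's reward relative to the group, in place of a learned value baseline. The sketch below shows only that normalization step under stated assumptions: the `group_relative_advantages` helper, the use of the population standard deviation, the `eps` term, and the binary rewards are illustrative choices, not details from this paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one spatial question; 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and a group whose answers all earn the same reward contributes advantages of zero.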
[25] Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop
Atharv Goel,Mehar Khurana
Main category: cs.CV
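TL;DR: This paper proposes an open-vocabulary 3D object detection method that needs no human annotation, using 2D vision-language models and geometric reasoning to avoid costly 3D labels.
Details
Motivation: Existing 3D detection datasets are limited by narrow class taxonomies and costly manual annotation, making them hard to adapt to open-world settings. In contrast, 2D vision-language models trained on image-text pairs offer rich semantic understanding and open-vocabulary detection.
Contribution: 1. Proposes, for the first time, a training-free open-vocabulary 3D detection framework; 2. Introduces a geometric inflation strategy and the Rotating Calipers algorithm to infer 3D bounding boxes from 2D proposals; 3. Builds the Pseudo-nuScenes dataset, which simulates real-world foggy conditions.
Method: 1. Generate text-conditioned proposals with a 2D vision-language detector; 2. Segment them with SAM and back-project them into 3D using camera geometry and either LiDAR or monocular pseudo-depth; 3. Apply geometric inflation with DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes.
Result: Experiments show that the method achieves competitive localization performance across multiple input settings (including LiDAR-based and purely RGB-D inputs) while remaining training-free and open-vocabulary.
Insight: 2D foundation models have great potential for scalable 3D perception, and geometric reasoning can compensate for the lack of 3D annotations.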
Abstract: Modern 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotations, limiting their ability to scale to open-world settings. In contrast, 2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection via natural language prompts. In this work, we leverage the maturity and category diversity of 2D foundation models to perform open-vocabulary 3D object detection without any human-annotated 3D labels. Our pipeline uses a 2D vision-language detector to generate text-conditioned proposals, which are segmented with SAM and back-projected into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We introduce a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. To simulate adverse real-world conditions, we construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset. Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, all while remaining training-free and open-vocabulary. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception. We open-source our code and resources at https://github.com/atharv0goel/open-world-3D-det.
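The abstract names Rotating Calipers as the way boxes are fitted without training. Below is a minimal sketch of the underlying geometric step only, assuming the box footprint is fitted to 2D points (for example, a cluster projected onto the ground plane, which is an assumption here). It relies on the known property that a minimum-area enclosing rectangle has one side collinear with a convex hull edge, and for brevity it sweeps every hull edge rather than walking the calipers in linear time. The helper names are hypothetical, and the DBSCAN clustering and geometric inflation stages are not modeled.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_box(points):
    """Return (area, width, height, angle_radians) of the smallest enclosing rectangle."""
    hull = convex_hull(points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")
    best = None
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        angle = math.atan2(y2 - y1, x2 - x1)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        # Project hull vertices onto the edge direction and its normal.
        along = [x * cos_a + y * sin_a for x, y in hull]
        across = [-x * sin_a + y * cos_a for x, y in hull]
        width = max(along) - min(along)
        height = max(across) - min(across)
        if best is None or width * height < best[0]:
            best = (width * height, width, height, angle)
    return best


# Usage: footprint of a small point cluster (coordinates are placeholders).
cluster = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0), (1.0, 0.5)]
area, width, height, angle = min_area_box(cluster)
```

The returned angle is the orientation of the hull edge that the best rectangle is aligned with, which is what lets an oriented box fit a rotated object more tightly than an axis-aligned one.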
[26] OmniVec2 – A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
Siddharth Srivastava,Gaurav Sharma
Main category: cs.CV
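TL;DR: OmniVec2 is a Transformer-based multimodal multitask learning network that supports 12 modalities; modality-specific tokenizers and a shared Transformer architecture project inputs into a unified embedding space, and a novel pretraining strategy and multitask training algorithm enable efficient learning.
Details
Motivation: To meet the need for large-scale multimodal and multitask learning, the paper proposes a unified architecture that supports many modalities and improves cross-modal and multitask performance.
Contribution: 1. Proposes a multimodal multitask network supporting 12 modalities; 2. Uses modality-specific tokenizers and a shared Transformer architecture; 3. Proposes a pretraining strategy with iterative modality switching and an efficient multitask training algorithm.
Method: 1. Modality-specific tokenizers and a shared Transformer architecture; 2. Cross-attention mechanisms; 3. Iterative modality-switching pretraining and a multitask training algorithm.
Result: Achieves state-of-the-art performance on 25 datasets across 12 modalities.
Insight: A unified embedding space and cross-attention effectively improve multimodal learning efficiency; the iterative pretraining strategy is especially important for multimodal initialization.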
Abstract: We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities namely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral. The proposed approach utilizes modality specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network, and a training algorithm which trades off fully joint training over all modalities, with training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state of the art performances, demonstrating the effectiveness of the proposed architecture, pretraining strategy and adapted multitask training.
[27] Transformer-Based Framework for Motion Capture Denoising and Anomaly Detection in Medical Rehabilitation
Yeming Cai,Yang Wang,Zhenglin Li
Main category: cs.CV
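TL;DR: The paper proposes a Transformer-based end-to-end deep learning framework for motion capture denoising and anomaly detection in medical rehabilitation, addressing the noise and missing data caused by occlusion and environmental factors.
Details
Motivation: Motion capture data in medical rehabilitation often suffers from noise and gaps due to occlusion and environmental interference, which harms data accuracy and real-time anomaly detection. The paper aims to improve data quality with deep learning and safeguard patients.
Contribution: 1. Proposes an end-to-end framework that combines a Transformer with optical motion capture; 2. Denoises and completes data while supporting real-time anomaly detection; 3. Validates its efficiency and robustness on real rehabilitation datasets.
Method: Uses the Transformer's temporal sequence modeling to denoise and complete motion capture data, together with an anomaly detection module that monitors patient movements in real time.
Result: Experiments on stroke and orthopedic rehabilitation datasets show superior data reconstruction and anomaly detection, making the method suitable for low-cost remote rehabilitation.
Insight: The work extends the potential of Transformers in time-series tasks and offers a scalable, cost-effective solution for medical rehabilitation.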
Abstract: This paper proposes an end-to-end deep learning framework integrating optical motion capture with a Transformer-based model to enhance medical rehabilitation. It tackles data noise and missing data caused by occlusion and environmental factors, while detecting abnormal movements in real time to ensure patient safety. Utilizing temporal sequence modeling, our framework denoises and completes motion capture data, improving robustness. Evaluations on stroke and orthopedic rehabilitation datasets show superior performance in data reconstruction and anomaly detection, providing a scalable, cost-effective solution for remote rehabilitation with reduced on-site supervision.
[28] Enhancing Breast Cancer Detection with Vision Transformers and Graph Neural Networks
Yeming Cai,Zhenglin Li,Yang Wang
Main category: cs.CV
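TL;DR: The paper proposes a new framework that combines Vision Transformers (ViT) and Graph Neural Networks (GNN) to improve the accuracy and interpretability of breast cancer detection, outperforming traditional methods on the CBIS-DDSM dataset.
Details
Motivation: Breast cancer is a leading cause of death among women worldwide, and early detection is critical for improving survival rates. Existing methods are limited in feature capture and structural modeling, so new approaches are needed to improve detection performance.
Contribution: 1. Proposes a framework that integrates ViT and GNN; 2. Achieves 84.2% accuracy on the CBIS-DDSM dataset, outperforming traditional methods; 3. Provides interpretable attention heatmaps that help clinicians understand model decisions.
Method: 1. Use ViT to capture global image features; 2. Apply a GNN to model structural relationships in the image; 3. Combine the strengths of both networks to improve detection performance.
Result: Reaches 84.2% accuracy on the CBIS-DDSM dataset and produces interpretable attention heatmaps.
Insight: Combining ViT and GNN can effectively improve medical image detection performance while making the model more interpretable, which has practical value for clinical diagnosis.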
Abstract: Breast cancer is a leading cause of death among women globally, and early detection is critical for improving survival rates. This paper introduces an innovative framework that integrates Vision Transformers (ViT) and Graph Neural Networks (GNN) to enhance breast cancer detection using the CBIS-DDSM dataset. Our framework leverages ViT’s ability to capture global image features and GNN’s strength in modeling structural relationships, achieving an accuracy of 84.2%, outperforming traditional methods. Additionally, interpretable attention heatmaps provide insights into the model’s decision-making process, aiding radiologists in clinical settings.
[29] Butter: Frequency Consistency and Hierarchical Fusion for Autonomous Driving Object Detection
Xiaojian Lin,Wenxin Zhang,Yuchu Jiang,Wangyu Wu,Yiran Guo,Kangxu Wang,Zongzheng Zhang,Guijin Wang,Lei Jin,Hao Zhao
Main category: cs.CV
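TL;DR: Butter is an object detection framework for autonomous driving that improves detection accuracy and computational efficiency through frequency-consistency enhancement and hierarchical feature fusion.
Details
Motivation: Existing object detection architectures such as YOLO and DETR fall short in multi-scale feature consistency and semantic understanding, which limits detection robustness in dynamic environments.
Contribution: Proposes the Butter framework, comprising Frequency-Adaptive Feature Consistency Enhancement (FAFCE) and the Progressive Hierarchical Feature Fusion Network (PHFFNet), which markedly improve feature representation and detection accuracy.
Method: FAFCE refines multi-scale feature consistency through adaptive frequency filtering; PHFFNet progressively fuses multi-level features to reduce semantic gaps.
Result: Butter's advantage is validated on the BDD100K, KITTI, and Cityscapes datasets, with reduced model complexity.
Insight: By refining and integrating hierarchical features, Butter balances accuracy and efficiency in real-time autonomous driving scenarios.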
Abstract: Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.
[30] Smart Routing for Multimodal Video Retrieval: When to Search What
Kevin Dela Rosa
Main category: cs.CV
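TL;DR: The paper proposes ModaRoute, an LLM-based intelligent routing system that dynamically selects the best modalities for multimodal video retrieval, significantly cutting computational overhead while maintaining strong retrieval effectiveness.
Details
Motivation: Multimodal video retrieval typically has to handle several modalities (e.g., speech, text, visual), but exhaustively searching all of them is computationally expensive, while a fixed modality selection may not meet dynamic information needs.
Contribution: The main contribution is ModaRoute, an intelligent routing system that dynamically selects the optimal modalities based on query intent, significantly reducing computational overhead while keeping retrieval effective.
Method: Uses GPT-4.1 to analyze query intent and predict information needs, then routes queries to ASR (speech), OCR (text), and visual indices, searching 1.78 modalities per query on average instead of all 3.
Result: Experiments on 1.8M video clips show that ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5, outperforming fixed modality selection.
Insight: Dynamically selecting modalities with an intelligent router can substantially cut computation in multimodal retrieval while maintaining strong effectiveness, offering a practical solution for real-world deployment.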
Abstract: We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
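The two headline numbers can be related with a short calculation. The sketch below assumes that search cost scales linearly with the number of indices queried (an assumption; the abstract does not say how the 41% figure was measured) and uses a generic definition of Recall@k. Both helper functions and the example IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def search_cost_reduction(avg_modalities_searched, total_modalities):
    """Relative saving if cost scales linearly with the number of indices queried."""
    return 1.0 - avg_modalities_searched / total_modalities


# Figures from the abstract: 1.78 modalities per query versus 3.0 exhaustive.
saving = search_cost_reduction(1.78, 3.0)
print(f"{saving:.1%}")  # prints "40.7%"

# Toy ranking with placeholder clip IDs: one of two relevant clips is in the top 5.
example_recall = recall_at_k(["a", "b", "c", "d", "e", "f"], ["b", "f"], k=5)
```

Under the linear-cost assumption, averaging 1.78 of 3 modalities gives a saving of about 40.7%, which is consistent with the reported 41% reduction.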
[31] A Comprehensive Survey for Real-World Industrial Defect Detection: Challenges, Approaches, and Prospects
Yuqi Cheng,Yunkang Cao,Haiming Yao,Wei Luo,Cheng Jiang,Hui Zhang,Weiming Shen
Main category: cs.CV
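TL;DR: This paper is a comprehensive survey of industrial defect detection, focusing on the shift from closed-set to open-set detection methods and on the challenges and emerging trends in 2D and 3D modalities.
Details
Motivation: As manufacturing demands greater precision, automation, and scalability, conventional inspection methods can no longer meet real-world needs. Advances in computer vision and deep learning have driven defect detection forward, but a systematic survey of the field is lacking.
Contribution: 1. Reviews progress in closed-set and open-set defect detection methods; 2. Analyzes the state of research in 2D and 3D modalities; 3. Summarizes key challenges and emerging trends in practical applications.
Method: Conducts a systematic literature review that categorizes recent research and contrasts closed-set and open-set detection methods.
Result: Shows the potential of open-set detection methods to reduce annotation requirements and improve recognition of novel anomalies, and outlines their future trends.
Insight: Open-set detection methods could address the high annotation cost and defect diversity of industrial settings and are an important direction for future research.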
Abstract: Industrial defect detection is vital for upholding product quality across contemporary manufacturing systems. As the expectations for precision, automation, and scalability intensify, conventional inspection approaches are increasingly found wanting in addressing real-world demands. Notable progress in computer vision and deep learning has substantially bolstered defect detection capabilities across both 2D and 3D modalities. A significant development has been the pivot from closed-set to open-set defect detection frameworks, which diminishes the necessity for extensive defect annotations and facilitates the recognition of novel anomalies. Despite such strides, a cohesive and contemporary understanding of industrial defect detection remains elusive. Consequently, this survey delivers an in-depth analysis of both closed-set and open-set defect detection strategies within 2D and 3D modalities, charting their evolution in recent years and underscoring the rising prominence of open-set techniques. We distill critical challenges inherent in practical detection environments and illuminate emerging trends, thereby providing a current and comprehensive vista of this swiftly progressing field.
[32] Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery
Arjun Rao,Esther Rolf
Main category: cs.CV
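TL;DR: The paper shows that, in machine learning tasks on satellite imagery, combining multiple geographic data modalities (e.g., digital elevation models, temperature data) can significantly improve model performance, especially with limited data and under out-of-distribution generalization.
Details
Motivation: Most machine learning models for satellite imagery rely only on optical image data and ignore other available geographic data layers. The paper evaluates how combining multiple input modalities affects model performance.
Contribution: Shows experimentally that fusing additional geographic data layers significantly improves performance on satellite imagery tasks, especially in data-scarce and out-of-distribution settings, and finds that hard-coded fusion strategies outperform learned ones.
Method: Generates augmented versions of satellite imagery benchmark tasks by appending additional geographic data layers (e.g., digital elevation models, temperature data) to the datasets, and evaluates models on classification, regression, and segmentation tasks.
Result: Fusing multiple input modalities significantly improves model performance, especially with limited data and under out-of-distribution generalization. Hard-coded fusion strategies outperform learned ones.
Insight: Multimodal inputs are valuable for machine learning on satellite imagery, particularly for data efficiency and generalization. Future work could further explore how to optimize fusion strategies.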
Abstract: A large variety of geospatial data layers is available around the world ranging from remotely-sensed raster data like satellite imagery, digital elevation models, predicted land cover maps, and human-annotated data, to data derived from environmental sensors such as air temperature or wind speed data. A large majority of machine learning models trained on satellite imagery (SatML), however, are designed primarily for optical input modalities such as multi-spectral satellite imagery. To better understand the value of using other input modalities alongside optical imagery in supervised learning settings, we generate augmented versions of SatML benchmark tasks by appending additional geographic data layers to datasets spanning classification, regression, and segmentation. Using these augmented datasets, we find that fusing additional geographic inputs with optical imagery can significantly improve SatML model performance. Benefits are largest in settings where labeled data are limited and in geographic out-of-sample settings, suggesting that multi-modal inputs may be especially valuable for data-efficiency and out-of-sample performance of SatML models. Surprisingly, we find that hard-coded fusion strategies outperform learned variants, with interesting implications for future work.
[33] Minimalist Concept Erasure in Generative Models
Yang Zhang,Er Jin,Yanfei Dong,Yixuan Wu,Philip Torr,Ashkan Khakzar,Johannes Stegmaier,Kenji Kawaguchi
Main category: cs.CV
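TL;DR: The paper proposes a minimalist concept erasure method based only on the distributional distance of generation outputs, trained end-to-end through differentiable optimization, with neuron masking to improve robustness.
Details
Motivation: Generative models can produce high-quality images, but their reliance on large-scale unlabeled data raises safety and copyright concerns. Existing erasure methods often over-modify the model and harm its overall utility. The paper addresses this with minimalist concept erasure.
Contribution: Proposes a minimalist concept erasure objective based on the distributional distance of generation outputs, derives a tractable loss for differentiable optimization, and introduces neuron masking to improve robustness.
Method: Optimizes through all generation steps with end-to-end backpropagation and uses neuron masking as an alternative to fine-tuning to erase concepts.
Result: Evaluations on flow-matching models show that the method robustly erases concepts without degrading overall model performance.
Insight: Minimalist concept erasure can address safety and copyright concerns without damaging generative model performance, offering a new route to safer generative models.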
Abstract: Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based only on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.
[34] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction
Chihiro Noguchi,Takaki Yamamoto
Main category: cs.CV
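TL;DR: The paper proposes a framework that uses large-scale binary occupancy data (without semantic labels) to enhance 3D semantic occupancy prediction, through pre-training and learning-based auto-labeling, with significant performance gains.
Details
Motivation: 3D semantic occupancy prediction requires annotated LiDAR point clouds, which are costly; large-scale binary occupancy data (marking only occupied or free space) is cheap to collect but has been underused.
Contribution: Proposes a decomposed framework that splits prediction into binary and semantic occupancy modules, and is the first to explore the potential of binary data for 3D semantic occupancy prediction.
Method: Exploits binary data through pre-training and auto-labeling, with a modular binary-semantic decomposition framework that improves semantic prediction.
Result: Outperforms existing methods on both pre-training and auto-labeling tasks.
Insight: With a modular design, low-cost binary data can effectively improve performance on a high-cost semantic task.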
Abstract: Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction – where each voxel is assigned a semantic label – annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at https://github.com/ToyotaInfoTech/b2s-occupancy
[35] UL-DD: A Multimodal Drowsiness Dataset Using Video, Biometric Signals, and Behavioral Data
Morteza Bodaghi,Majid Hosseini,Raju Gottumukkala,Ravi Teja Bhupatiraju,Iftikhar Ahmad,Moncef Gabbouj
Main category: cs.CV
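TL;DR: The paper presents UL-DD, a multimodal dataset for driver drowsiness detection that comprises video, biometric signals, and behavioral data from multiple signal sources and records gradual changes in driver state.
Details
Motivation: Existing drowsiness detection datasets mostly use discrete labels and limited signal sources, so they cannot fully reflect changes in driver state. The study aims to provide more comprehensive multimodal data to support more precise drowsiness detection research.
Contribution: Presents a comprehensive multimodal dataset that integrates 3D facial video, IR camera footage, biometric signals, and behavioral data, recording the driver's gradual transition from alert to drowsy.
Method: Collects data with a depth camera, an IR camera, and multiple biometric sensors, combined with KSS self-reports and a simulated driving environment; each session records 40 minutes of continuous data.
Result: The dataset totals 1,400 minutes from 19 subjects and covers multimodal signals in alert and drowsy states, supporting the development of future drowsiness detection algorithms.
Insight: With multimodal signals and continuous recording, the dataset better captures changes in driver drowsiness and provides richer data for drowsiness detection.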
Abstract: In this study, we present a comprehensive public dataset for driver drowsiness detection, integrating multimodal signals of facial, behavioral, and biometric indicators. Our dataset includes 3D facial video using a depth camera, IR camera footage, posterior videos, and biometric signals such as heart rate, electrodermal activity, blood oxygen saturation, skin temperature, and accelerometer data. This data set provides grip sensor data from the steering wheel and telemetry data from the American truck simulator game to provide more information about drivers’ behavior while they are alert and drowsy. Drowsiness levels were self-reported every four minutes using the Karolinska Sleepiness Scale (KSS). The simulation environment consists of three monitor setups, and the driving condition is completely like a car. Data were collected from 19 subjects (15 M, 4 F) in two conditions: when they were fully alert and when they exhibited signs of sleepiness. Unlike other datasets, our multimodal dataset has a continuous duration of 40 minutes for each data collection session per subject, contributing to a total length of 1,400 minutes, and we recorded gradual changes in the driver state rather than discrete alert/drowsy labels. This study aims to create a comprehensive multimodal dataset of driver drowsiness that captures a wider range of physiological, behavioral, and driving-related signals. The dataset will be available upon request to the corresponding author.
[36] AortaDiff: Volume-Guided Conditional Diffusion Models for Multi-Branch Aortic Surface Generation
Delin An,Pan Du,Jian-Xun Wang,Chaoli Wang
Main category: cs.CV
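TL;DR: AortaDiff is a diffusion-based framework that generates smooth aortic surfaces from CT/MRI images, addressing the reliance of existing methods on annotated data and manual intervention.
Details
Motivation: Accurate 3D aortic construction is crucial for clinical diagnosis, preoperative planning, and computational fluid dynamics (CFD) simulation, but existing methods require large annotated datasets and manual intervention, and the meshes they produce have poor geometric consistency.
Contribution: Proposes AortaDiff, a method based on a volume-guided conditional diffusion model (CDM) that generates geometrically consistent aortic surfaces suitable for CFD analysis while reducing the dependence on annotated data and manual intervention.
Method: A volume-guided CDM iteratively generates aortic centerlines; vessel contours are then extracted automatically and fitted into a smooth 3D surface, giving an end-to-end workflow.
Result: Experiments show that AortaDiff produces high-quality aortic meshes even with limited training data, for both normal and pathological cases (such as aneurysms and coarctation).
Insight: Volume-guided diffusion models show promise for medical image generation tasks: they can improve geometric consistency while reducing data requirements, which suits the modeling of complex anatomical structures.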
Abstract: Accurate 3D aortic construction is crucial for clinical diagnosis, preoperative planning, and computational fluid dynamics (CFD) simulations, as it enables the estimation of critical hemodynamic parameters such as blood flow velocity, pressure distribution, and wall shear stress. Existing construction methods often rely on large annotated training datasets and extensive manual intervention. While the resulting meshes can serve for visualization purposes, they struggle to produce geometrically consistent, well-constructed surfaces suitable for downstream CFD analysis. To address these challenges, we introduce AortaDiff, a diffusion-based framework that generates smooth aortic surfaces directly from CT/MRI volumes. AortaDiff first employs a volume-guided conditional diffusion model (CDM) to iteratively generate aortic centerlines conditioned on volumetric medical images. Each centerline point is then automatically used as a prompt to extract the corresponding vessel contour, ensuring accurate boundary delineation. Finally, the extracted contours are fitted into a smooth 3D surface, yielding a continuous, CFD-compatible mesh representation. AortaDiff offers distinct advantages over existing methods, including an end-to-end workflow, minimal dependency on large labeled datasets, and the ability to generate CFD-compatible aorta meshes with high geometric fidelity. Experimental results demonstrate that AortaDiff performs effectively even with limited training data, successfully constructing both normal and pathologically altered aorta meshes, including cases with aneurysms or coarctation. This capability enables the generation of high-quality visualizations and positions AortaDiff as a practical solution for cardiovascular research.
[37] COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark
Ishant Chintapatla,Kazuma Choji,Naaisha Agarwal,Andrew Lin,Hannah You,Charles Duong,Kevin Zhu,Sean O’Brien,Vasu Sharma
Main category: cs.CV
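TL;DR: This paper introduces COREVQA, a new visual question answering benchmark focused on visual entailment reasoning in crowded scenes, and exposes limitations of current vision-language models.
Details
Motivation: Existing VQA benchmarks rarely evaluate models on visual entailment, especially in crowded scenes. COREVQA is proposed to fill this gap.
Contribution: The COREVQA benchmark, comprising 5,608 pairs of images and synthetically generated true/false statements, for evaluating visual entailment reasoning in crowded scenes.
Method: Images are drawn from the CrowdHuman dataset and paired with generated true/false statements to build the benchmark, on which several vision-language models are tested.
Result: Even the best-performing models achieve accuracy below 80%, and other models perform substantially worse (39.98%-69.95%), revealing the limits of model reasoning in crowded scenes.
Insight: Visual entailment reasoning in crowded scenes is a weak point of current vision-language models, and future work needs to improve performance on this kind of task.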
Abstract: Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model’s ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image-question pairs in crowded scenes.
[38] IConMark: Robust Interpretable Concept-Based Watermark For AI Images
Vinu Sankar Sadasivan,Mehrdad Saberi,Soheil Feizi
Main category: cs.CV
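TL;DR: IConMark is a robust semantic watermarking method based on interpretable concepts. By embedding semantic attributes, it makes watermarks human-readable and resistant to adversarial attacks, improves detection accuracy while maintaining image quality, and can be combined with other watermarking techniques for additional robustness.
Details
Motivation: With the rapid development of generative AI and synthetic media, distinguishing AI-generated images from real ones has become key to preventing misinformation and ensuring digital authenticity. Traditional watermarking techniques are vulnerable to adversarial attacks, so a more robust and interpretable watermarking method is needed.
Contribution: 1. Proposes IConMark, a robust semantic watermarking method based on interpretable concepts; 2. Embeds semantic attributes so that the watermark is readable and resistant to adversarial attacks; 3. Presents hybrid methods that combine IConMark with existing watermarking techniques (IConMark+SS/TM) for further robustness.
Method: 1. Embeds interpretable semantic attributes into AI-generated images as the watermark; 2. Combines IConMark with StegaStamp and TrustMark to form the hybrid methods (IConMark+SS/TM); 3. Evaluates watermark detection and robustness with metrics such as AUROC.
Result: IConMark and its variants (+TM and +SS) achieve mean AUROC scores 10.8%, 14.5%, and 15.9% higher than the best baseline, respectively, demonstrating better detection accuracy and robustness.
Insight: Interpretable semantic watermarks not only improve robustness but also let humans verify the watermark by reading it, which suggests a new direction for digital content authentication.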
Abstract: With the rapid rise of generative AI and synthetic media, distinguishing AI-generated images from real ones has become crucial in safeguarding against misinformation and ensuring digital authenticity. Traditional watermarking techniques have shown vulnerabilities to adversarial attacks, undermining their effectiveness in the presence of attackers. We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Unlike traditional methods, which rely on adding noise or perturbations to AI-generated images, IConMark incorporates meaningful semantic attributes, making it interpretable to humans and hence resilient to adversarial manipulation. This method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We present a detailed evaluation of IConMark’s effectiveness, demonstrating its superiority in detection accuracy and in maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness. We introduce IConMark+SS and IConMark+TM, hybrid approaches combining IConMark with StegaStamp and TrustMark, respectively, to further bolster robustness against multiple types of image manipulations. Our base watermarking technique (IConMark) and its variants (+TM and +SS) achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets.
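The AUROC figure reported above can be read as the probability that a detector assigns a higher score to a watermarked image than to a clean one. The following is a minimal sketch of that computation using the pairwise (Mann-Whitney) definition; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical detector scores: higher means "more likely watermarked".
watermarked = [0.9, 0.8, 0.7, 0.4]
clean = [0.6, 0.3, 0.2, 0.1]
print(auroc(watermarked, clean))  # 15 of 16 pairs ranked correctly -> 0.9375
```

A score of 0.5 corresponds to chance-level detection and 1.0 to perfect separation of watermarked and clean images.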
[39] “PhyWorldBench”: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
Jing Gu,Xian Liu,Yu Zeng,Ashwin Nagarajan,Fangrui Zhu,Daniel Hong,Yue Fan,Qianqi Yan,Kaiwen Zhou,Ming-Yu Liu,Xin Eric Wang
Main category: cs.CV
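TL;DR: PhyWorldBench is a comprehensive benchmark for evaluating the physical realism of text-to-video models. It covers multiple levels of testing, from basic physical phenomena to complex scenarios, and systematically analyzes 12 state-of-the-art models through human evaluation and a zero-shot evaluation method.
Details
Motivation: Video generation models have made remarkable progress in producing high-quality, photorealistic content, but their ability to simulate physical phenomena remains a critical unresolved challenge.
Contribution: 1. Proposes PhyWorldBench, a comprehensive benchmark for evaluating physical realism; 2. Introduces an "Anti-Physics" category that evaluates how models behave on prompts that violate physical laws; 3. Designs a zero-shot evaluation method that uses existing MLLMs to assess physical realism.
Method: 1. Builds a multi-level test set of physical phenomena, including basic physical scenarios and complex interaction scenarios; 2. Evaluates models through human evaluation and the MLLM-based zero-shot method.
Result: Twelve state-of-the-art video generation models were evaluated, revealing the main challenges they face in physical realism and yielding targeted recommendations for prompt design.
Insight: Physical realism is an important but under-addressed aspect of video generation models; systematic testing and evaluation can give clear direction for model improvement.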
Abstract: Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel “Anti-Physics” category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that uses existing MLLMs to evaluate physical realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts, spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.
[40] Uncertainty Quantification Framework for Aerial and UAV Photogrammetry through Error Propagation
Debao Huang,Rongjun Qin
Main category: cs.CV
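TL;DR: This paper proposes an uncertainty quantification framework for aerial and UAV photogrammetry that uses error propagation to provide an accuracy estimate for every point in the point cloud.
Details
Motivation: The accuracy of photogrammetric point clouds is highly scene-dependent, and error estimation in the MVS stage remains unsolved and non-standardized. The authors propose a self-calibrating method to close this gap.
Contribution: An uncertainty quantification framework that propagates error through both the SfM and MVS stages, together with a self-supervised, self-calibrating method for estimating the error covariance matrix in the MVS stage.
Method: Reliable n-view points (n≥6) are selected, and cues from the MVS stage (such as matching cost values) are used to regress the disparity error, giving a self-calibrated error estimate.
Result: Validated on publicly available airborne and UAV imagery datasets, the method achieves high bounding rates without overestimating uncertainty.
Insight: The method is self-supervised, follows the error propagation path of the photogrammetry process, and applies across diverse scenes, providing a reliable basis for assessing point cloud accuracy.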
Abstract: Uncertainty quantification of the photogrammetry process is essential for providing per-point accuracy credentials of the point clouds. Unlike airborne LiDAR, which typically delivers consistent accuracy across various scenes, the accuracy of photogrammetric point clouds is highly scene-dependent, since it relies on algorithm-generated measurements (i.e., stereo or multi-view stereo). Generally, errors of the photogrammetric point clouds propagate through a two-step process: Structure-from-Motion (SfM) with bundle adjustment (BA), followed by Multi-view Stereo (MVS). While uncertainty estimation in the SfM stage has been well studied using the first-order statistics of the reprojection error function, uncertainty estimation in the MVS stage remains largely unsolved and non-standardized, primarily due to its non-differentiable and multi-modal nature (i.e., from pixel values to geometry). In this paper, we present an uncertainty quantification framework that closes this gap by associating an error covariance matrix with each point, accounting for this two-step photogrammetry process. Specifically, to estimate the uncertainty in the MVS stage, we propose a novel, self-calibrating method that takes reliable n-view points (n>=6) per view to regress the disparity uncertainty using highly relevant cues (such as matching cost values) from the MVS stage. Compared to existing approaches, our method uses self-contained, reliable 3D points extracted directly from the MVS process, with the benefit of being self-supervised and naturally adhering to the error propagation path of the photogrammetry process, thereby providing a robust and certifiable uncertainty quantification across diverse scenes. We evaluate the framework using a variety of publicly available airborne and UAV imagery datasets. Results demonstrate that our method outperforms existing approaches by achieving high bounding rates without overestimating uncertainty.
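To make the idea of propagating a disparity uncertainty into geometry concrete, here is a minimal sketch for the textbook rectified two-view case, where depth is z = f·B/d and first-order propagation gives σ_z = (f·B/d²)·σ_d. This is an assumption chosen for illustration: the paper's framework propagates full covariance matrices through SfM and MVS, and the function names and numbers below are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_sigma(focal_px, baseline_m, disparity_px, disparity_sigma_px):
    """First-order propagation: sigma_z = |dz/dd| * sigma_d = f * B / d**2 * sigma_d."""
    return focal_px * baseline_m / disparity_px ** 2 * disparity_sigma_px


# Hypothetical values: 1000 px focal length, 0.5 m baseline, 10 px disparity.
z = depth_from_disparity(1000.0, 0.5, 10.0)
sigma_z = depth_sigma(1000.0, 0.5, 10.0, 0.5)
print(z, sigma_z)  # 50.0 2.5
```

Because the depth error grows with the square of depth for a fixed disparity error, the same matching uncertainty yields very different point accuracy across a scene, which is consistent with the scene dependence described above.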
[41] $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
Dmitrii Mikhailov,Aleksey Letunovskiy,Maria Kovaleva,Vladimir Arkhipkin,Vladimir Korviakov,Vladimir Polovnikov,Viacheslav Vasilev,Evelina Sidorova,Denis Dimitrov
Main category: cs.CV
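TL;DR: NABLA is a new Neighborhood Adaptive Block-Level Attention mechanism that addresses the quadratic complexity of full attention in video diffusion transformers, substantially speeding up training and inference while preserving generation quality.
Details
Motivation: Transformer-based video generation has been highly successful in recent years, but the quadratic complexity of full attention remains a key bottleneck for high-resolution and long-duration video sequences. The authors propose NABLA to address this.
Contribution: A neighborhood adaptive block-level attention mechanism that adapts dynamically to sparsity patterns, substantially reducing computational overhead while preserving generation quality.
Method: NABLA combines block-wise attention with an adaptive sparsity-driven threshold to adjust the computational cost of attention dynamically. It needs no custom low-level operator design and integrates directly with PyTorch's Flex Attention operator.
Result: Experiments show that NABLA makes training and inference up to 2.7x faster than the baseline, with almost no loss in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality.
Insight: Block-level attention that adapts to sparsity patterns is an efficient way to reduce computational cost in video generation without sacrificing generation quality.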
Abstract: Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch’s Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no loss in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
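One way to picture an adaptive block-level sparsity threshold is sketched below. The abstract does not specify the exact rule, so this is an assumption: for each query block, coarse block-to-block scores are turned into probabilities and the smallest set of key blocks whose cumulative probability reaches `keep_mass` is kept. The function names, the `keep_mass` parameter, and the example scores are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of block scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_block_mask(block_scores, keep_mass=0.9):
    """For each query block, keep the fewest key blocks covering keep_mass probability."""
    mask = []
    for row in block_scores:
        probs = softmax(row)
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        keep = [False] * len(probs)
        mass = 0.0
        for j in order:
            keep[j] = True
            mass += probs[j]
            if mass >= keep_mass:
                break
        mask.append(keep)
    return mask


# Hypothetical coarse scores: row 0 is peaked, row 1 is flat.
scores = [[4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = adaptive_block_mask(scores)
print(mask[0])  # [True, False, False, False]
print(mask[1])  # [True, True, True, True]
```

The peaked row keeps a single block while the flat row keeps all of them, so the amount of computation adapts to how concentrated the attention is. In a real system such a boolean block mask would be handed to a block-sparse attention kernel.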
[42] LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning
Kaihong Wang,Donghyun Kim,Margrit Betke
Main category: cs.CV
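TL;DR: The paper proposes LoRA-Loop, which improves synthetic replay with low-rank adapters (LoRA). It addresses the mismatch between generated samples and real data that affects existing methods in continual vision-language model (VLM) learning, and improves model performance and robustness.
Details
Motivation: In existing synthetic replay methods for continual learning, the generated samples may fail to capture the domain-specific nuances and fine-grained semantics of downstream tasks, which misguides finetuning and harms the retention of prior knowledge.
Contribution: 1. Proposes the LoRA-Loop framework, which improves the generator's synthetic replay with task-specific low-rank adapters; 2. Designs a two-stage, confidence-based sample selection strategy that makes generation and distillation more effective; 3. Shows on the MTIL benchmark that the method outperforms prior techniques.
Method: 1. Injects task-specific LoRA adapters into a frozen Stable Diffusion model; 2. Applies two-stage confidence-based sample selection: real task data are filtered first to focus LoRA finetuning, and generated samples are then filtered for distillation; 3. Integrates seamlessly with existing replay pipelines.
Result: Experiments show that LoRA-Loop achieves the best balance among stability, plasticity, and zero-shot capability, outperforming previous synthetic replay techniques.
Insight: Adapting the generator (for example with LoRA) effectively improves the quality of synthetic samples, which in turn improves knowledge retention and task adaptation in continual learning.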
Abstract: Continual learning for vision-language models has achieved remarkable performance through synthetic replay, where samples are generated using Stable Diffusion to regularize during finetuning and retain knowledge. However, real-world downstream applications often exhibit domain-specific nuances and fine-grained semantics not captured by generators, causing synthetic-replay methods to produce misaligned samples that misguide finetuning and undermine retention of prior knowledge. In this work, we propose a LoRA-enhanced synthetic-replay framework that injects task-specific low-rank adapters into a frozen Stable Diffusion model, efficiently capturing each new task’s unique visual and semantic patterns. Specifically, we introduce a two-stage, confidence-based sample selection: we first rank real task data by post-finetuning VLM confidence to focus LoRA finetuning on the most representative examples, then generate synthetic samples and again select them by confidence for distillation. Our approach integrates seamlessly with existing replay pipelines: simply swap in the adapted generator to boost replay fidelity. Extensive experiments on the Multi-domain Task Incremental Learning (MTIL) benchmark show that our method outperforms previous synthetic-replay techniques, achieving an optimal balance among plasticity, stability, and zero-shot capability. These results demonstrate the effectiveness of generator adaptation via LoRA for robust continual learning in VLMs.
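The confidence-based selection described above can be sketched as a simple ranking step that is applied twice, once to real task data and once to generated samples. The helper name, the `keep_fraction` parameter, and the example values are hypothetical; the abstract does not state how many samples are kept at each stage.

```python
def select_by_confidence(samples, confidences, keep_fraction):
    """Return the most confident samples, keeping at least one."""
    k = max(1, int(len(samples) * keep_fraction))
    ranked = sorted(zip(confidences, samples), key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in ranked[:k]]


# Stage 1 (hypothetical data): pick real images to finetune the adapter on.
real_images = ["real_a", "real_b", "real_c", "real_d"]
real_conf = [0.9, 0.2, 0.7, 0.4]
print(select_by_confidence(real_images, real_conf, 0.5))  # ['real_a', 'real_c']

# Stage 2 (hypothetical data): pick generated images to use for distillation.
synthetic_images = ["syn_a", "syn_b", "syn_c"]
synthetic_conf = [0.3, 0.8, 0.6]
print(select_by_confidence(synthetic_images, synthetic_conf, 0.34))  # ['syn_b']
```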
[43] NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision
Tengkai Wang,Weihao Li,Ruikai Cui,Shi Qiu,Nick Barnes
Main category: cs.CV
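TL;DR: NoiseSDF2NoiseSDF learns clean neural SDFs from noisy point clouds. By minimizing an MSE loss under noisy supervision, it denoises implicitly and improves surface reconstruction.
Details
Motivation: Point clouds captured by low-quality scanning devices usually contain substantial noise, which leads to inaccurate surface reconstruction. This work addresses the problem by extending the Noise2Noise paradigm for 2D images to 3D neural fields.
Contribution: Proposes NoiseSDF2NoiseSDF, which brings the Noise2Noise idea to neural SDFs and learns clean surface representations directly from noisy point clouds.
Method: By minimizing the MSE loss between noisy SDF representations, the network implicitly denoises and refines its surface estimates.
Result: Experiments on the ShapeNet, ABC, Famous, and Real datasets show that the method significantly improves surface reconstruction quality from noisy inputs.
Insight: Noisy supervision can be used to learn clean neural fields effectively, offering a new approach to denoising in 3D reconstruction.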
Abstract: Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs directly from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.
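The reason noisy supervision can recover a clean signal is that the minimizer of an MSE loss is the mean of its targets, so zero-mean noise averages out. The toy sketch below shows this for a single scalar SDF value at one query point; it is not the paper's network or training setup, and the function name, the noise level, and the clean value are hypothetical.

```python
import random


def fit_scalar_sdf(noisy_targets, steps=50, lr=0.25):
    """Gradient descent on mean((s - t)**2) over noisy targets t."""
    s = 0.0
    n = len(noisy_targets)
    for _ in range(steps):
        grad = 2.0 * sum(s - t for t in noisy_targets) / n
        s -= lr * grad
    return s


rng = random.Random(0)
clean_sdf = 0.25  # hypothetical true signed distance at the query point
# Only noisy observations are available; the noise is assumed zero-mean Gaussian.
noisy_targets = [clean_sdf + rng.gauss(0.0, 0.1) for _ in range(10000)]
estimate = fit_scalar_sdf(noisy_targets)
print(abs(estimate - clean_sdf) < 0.01)  # True
```

The estimate lands close to the clean value even though no clean target was ever used, which is the property the Noise2Noise paradigm relies on.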
[44] Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
Chengxu Liu,Lu Qi,Jinshan Pan,Xueming Qian,Ming-Hsuan Yang
Main category: cs.CV
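TL;DR: This paper proposes a diffusion model (DM)-based framework that learns texture priors from unpaired data, achieving efficient blind image deblurring that outperforms existing methods.
Details
Motivation: Because large numbers of realistic blurry-sharp image pairs are difficult and expensive to obtain, learning blind image deblurring from unpaired data is the more practical solution. Existing methods rely heavily on adversarial learning and struggle with the complex and unpredictable blur patterns of the real world.
Contribution: 1. Proposes a diffusion model-based framework (\ours) that learns spatially varying texture priors. 2. Designs a Texture Prior Encoder (TPE) and a Texture Transfer Transformer layer (TTformer), introducing a memory mechanism and adaptive filtering. 3. Implements a wavelet-based adversarial loss that preserves high-frequency texture details.
Method: 1. A diffusion model generates texture priors that aid the restoration of blurry images. 2. The TPE encodes textures and provides supervision. 3. The TTformer removes spatially varying blur efficiently through the FM-MSA mechanism. 4. A wavelet-based adversarial loss is added to preserve high-frequency details.
Result: On widely used benchmarks, \ours outperforms existing methods and provides an efficient unsupervised deblurring solution.
Insight: Diffusion models are good at capturing complex texture priors; combined with adaptive filtering and a memory mechanism, they can handle real-world blur patterns effectively.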
Abstract: Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, dominant approaches rely heavily on adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blur patterns. In this paper, we propose a novel diffusion model (DM)-based framework, dubbed \ours, for image deblurring by learning spatially varying texture prior from unpaired data. In particular, \ours performs DM to generate the prior knowledge that aids in recovering the textures of blurry images. To implement this, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to represent the image textures and provides supervision for DM training. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. Furthermore, we implement a wavelet-based adversarial loss to preserve high-frequency texture details. Extensive evaluations show that \ours provides a promising unsupervised deblurring solution and outperforms SOTA methods in widely-used benchmarks.
[45] Efficient Burst Super-Resolution with One-step Diffusion
Kento Kawai,Takeru Oba,Kyotaro Tokoro,Kazutoshi Akita,Norimichi Ukita
Main category: cs.CV
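TL;DR: This paper improves the efficiency of super-resolution from burst low-resolution images through one-step diffusion and an efficient sampling method, while maintaining good visual quality.
Details
Motivation: Burst low-resolution images can improve super-resolution, but traditional deterministic methods produce blurry images with poor perceptual quality. This paper aims to generate sharper, high-fidelity super-resolved images with a diffusion model.
Contribution: 1. Proposes an efficient diffusion approach that uses a stochastic sampler with a high-order ODE. 2. Achieves one-step diffusion through knowledge distillation, substantially reducing computation time.
Method: 1. A stochastic sampler with a high-order ODE improves the efficiency of the diffusion model. 2. Knowledge distillation reduces multi-step diffusion to a single step, greatly lowering runtime.
Result: Experiments show that the method reduces runtime to 1.6% of the baseline while maintaining SR quality as measured by image distortion and perceptual quality.
Insight: Combining a high-order ODE sampler with one-step diffusion can greatly improve computational efficiency while preserving super-resolution quality, which makes practical deployment feasible.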
Abstract: While burst Low-Resolution (LR) images yield a better Super-Resolution (SR) image than a single LR image, prior burst SR methods are trained in a deterministic manner, which produces blurry SR images. Since such blurry images are perceptually degraded, we aim to reconstruct sharp and high-fidelity SR images with a diffusion model. Our method improves the efficiency of the diffusion model with a stochastic sampler with a high-order ODE as well as one-step diffusion using knowledge distillation. Our experimental results demonstrate that our method can reduce the runtime to 1.6% of its baseline while maintaining the SR quality measured based on image distortion and perceptual quality.
[46] CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
Yanan Wang,Julio Vizcarra,Zhi Li,Hao Niu,Mori Kurokawa
Main category: cs.CV
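TL;DR: CoTasks is a chain-of-thought based framework of video instruction tuning tasks. By decomposing complex video questions into four entity-level foundational tasks, it significantly improves model performance on video reasoning.
Details
Motivation: Current video large language models lack chain-of-thought reasoning grounded in fine-grained, object-level video understanding. Existing instruction-tuned models are trained on high-level video-text pairs and lack the structured annotations needed for step-by-step reasoning.
Contribution: The CoTasks framework decomposes complex video questions into four foundational tasks (frame localization, entity tracking, and spatial and temporal relation extraction) and embeds these intermediate reasoning steps, enabling explicit object-centric spatiotemporal reasoning.
Method: CoTasks decomposes complex questions from existing datasets (such as NeXT-QA and STAR) into four entity-level tasks and embeds the intermediate reasoning steps into the input.
Result: On the NeXT-QA benchmark, LLaVA-video-7B and Qwen2.5-VL-3B improve by +3.3 and +17.4 points, respectively, with especially large gains in the causal, temporal, and descriptive subcategories.
Insight: As a structured chain-of-thought supervision framework, CoTasks can significantly improve video reasoning, particularly on tasks that require breaking a question down step by step.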
Abstract: Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs, often lacking structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions of existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly enhance inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4, with large boosts in causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured CoT-style supervision framework for improving compositional video reasoning.
[47] Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation
Masahiro Ogawa,Qi An,Atsushi Yamashita
Main category: cs.CV
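TL;DR: The paper proposes FoELS, a moving object detection method based on optical flow and texture information. By combining a focus of expansion likelihood with segmentation, it handles object detection in complex scenes viewed from a moving camera.
Details
Motivation: Existing methods rely mainly on optical flow, which performs poorly at detecting moving objects in complex, structured scenes. A more robust method that combines optical flow with texture information is needed.
Contribution: Proposes FoELS, which integrates a focus of expansion (FoE) likelihood derived from optical flow with a segmentation prior to detect moving objects from a moving camera.
Method: FoELS first computes the focus of expansion (FoE) from optical flow and uses the outliers of the FoE computation to produce an initial motion likelihood. It then fuses this likelihood with a segmentation prior to estimate the final moving probability.
Result: Experiments on the DAVIS 2016 dataset and real-world traffic videos show that FoELS performs well in complex scenes and reaches state-of-the-art performance.
Insight: Combining optical flow with texture information makes moving object detection more robust, especially in complex scenes and under camera motion.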
Abstract: Separating moving and static objects from a moving camera viewpoint is essential for 3D reconstruction, autonomous navigation, and scene understanding in robotics. Existing approaches often rely primarily on optical flow, which struggles to detect moving objects in complex, structured scenes involving camera motion. To address this limitation, we propose Focus of Expansion Likelihood and Segmentation (FoELS), a method based on the core idea of integrating both optical flow and texture information. FoELS computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation. This likelihood is then fused with a segmentation-based prior to estimate the final moving probability. The method effectively handles challenges including complex structured scenes, rotational camera motion, and parallel motion. Comprehensive evaluations on the DAVIS 2016 dataset and real-world traffic videos demonstrate its effectiveness and state-of-the-art performance.
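The focus of expansion idea can be sketched with a small least-squares example. Under pure camera translation, the flow vector (u, v) at a static pixel (x, y) lies on a line through the FoE (x0, y0), which gives the linear constraint v·x0 − u·y0 = v·x − u·y. Points whose flow lines miss the estimated FoE by a large distance are candidates for independently moving objects. The function names and data are hypothetical, and the static set is given by hand here; a real pipeline would more likely use a robust estimator, which the abstract does not detail.

```python
import math


def estimate_foe(points, flows):
    """Least-squares FoE from rows [v, -u] . [x0, y0] = v*x - u*y (normal equations)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), (u, v) in zip(points, flows):
        rhs = v * x - u * y
        a11 += v * v
        a12 += -u * v
        a22 += u * u
        b1 += v * rhs
        b2 += -u * rhs
    det = a11 * a22 - a12 * a12
    x0 = (a22 * b1 - a12 * b2) / det
    y0 = (a11 * b2 - a12 * b1) / det
    return x0, y0


def foe_residual(point, flow, foe):
    """Perpendicular distance from the FoE to the line along the flow vector."""
    (x, y), (u, v), (x0, y0) = point, flow, foe
    return abs(v * (x - x0) - u * (y - y0)) / math.hypot(u, v)


# Hypothetical static pixels whose flow radiates from (100, 50).
static_points = [(200.0, 50.0), (100.0, 150.0), (200.0, 150.0), (0.0, 0.0)]
static_flows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, -0.5)]
foe = estimate_foe(static_points, static_flows)

# A pixel on a moving object: its flow does not point away from the FoE.
moving_residual = foe_residual((150.0, 100.0), (-2.0, 1.0), foe)
static_residual = foe_residual(static_points[2], static_flows[2], foe)
print(foe, moving_residual > static_residual)
```

Mapping such residuals to a likelihood and fusing it with a segmentation prior are the later steps described in the abstract and are not shown here.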
[48] EPSilon: Efficient Point Sampling for Lightening of Hybrid-based 3D Avatar Generation
Seungjun Moon,Sangjoon Yu,Gyeong-Moon Park
Main category: cs.CV
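TL;DR: EPSilon proposes two efficient point sampling strategies (ERO and EIO) that substantially speed up training and inference for 3D avatar generation based on hybrid representations (NeRF plus SMPL mesh), while maintaining generation quality.
Details
Motivation: Hybrid representations can generate photorealistic 3D avatars, but inference is extremely slow because deformation based on SMPL skinning weights is computationally expensive. Most sampled points lie in empty space; they do not affect generation quality but add latency.
Contribution: 1. Proposes two strategies for omitting empty points, ERO and EIO, which greatly reduce wasted computation; 2. Enables a single-stage NeRF structure without hierarchical sampling through efficient sampling; 3. Maintains quality while using only 3.9% of the sampled points, with around 20 times faster inference and 4 times faster training convergence.
Method: ERO removes rays that pass only through empty space, and EIO further narrows the sampling interval on each ray so that only regions occupied by clothes or mesh are sampled. This reduces the amount of deformation computation and supports a single-stage NeRF structure.
Result: EPSilon matches existing methods in generation quality while using only 3.9% of the sampled points, with around 20 times faster inference and 4 times faster training convergence.
Insight: An efficient sampling strategy not only reduces computational overhead but also simplifies the model structure (for example, a single-stage NeRF), which improves overall performance.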
Abstract: The rapid advancement of neural radiance fields (NeRF) has paved the way to generate animatable human avatars from a monocular video. However, the sole usage of NeRF suffers from a lack of details, which results in the emergence of hybrid representation that utilizes SMPL-based mesh together with NeRF representation. While hybrid-based models show photo-realistic human avatar generation qualities, they suffer from extremely slow inference due to their deformation scheme: to be aligned with the mesh, hybrid-based models use the deformation based on SMPL skinning weights, which needs high computational costs on each sampled point. We observe that since most of the sampled points are located in empty space, they do not affect the generation quality but result in inference latency with deformation. In light of this observation, we propose EPSilon, a hybrid-based 3D avatar generation scheme with novel efficient point sampling strategies that boost both training and inference. In EPSilon, we propose two methods to omit empty points at rendering; empty ray omission (ERO) and empty interval omission (EIO). In ERO, we wipe out rays that progress through the empty space. Then, EIO narrows down the sampling interval on the ray, which wipes out the region not occupied by either clothes or mesh. The delicate sampling scheme of EPSilon enables not only great computational cost reduction during deformation but also the designation of the important regions to be sampled, which enables a single-stage NeRF structure without hierarchical sampling. Compared to existing methods, EPSilon maintains the generation quality while using only 3.9% of sampled points and achieves around 20 times faster inference, together with 4 times faster training convergence. We provide video results on https://github.com/seungjun-moon/epsilon.
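The two omission steps can be illustrated with a standard ray-box (slab) intersection test. This is a simplified stand-in, not the paper's procedure: an axis-aligned bounding box is assumed to enclose the avatar, whereas EPSilon restricts sampling to regions occupied by clothes or mesh. Rays that miss the box are dropped (in the spirit of ERO), and remaining rays are sampled only inside the hit interval (in the spirit of EIO). All names and values are hypothetical.

```python
def ray_box_interval(origin, direction, box_min, box_max):
    """Slab test: return (t_near, t_far) where the ray is inside the box, or None."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it must already lie between the planes.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near, t_far


def sample_interval(t_near, t_far, count):
    """Evenly spaced sample depths at the midpoints of count sub-intervals."""
    step = (t_far - t_near) / count
    return [t_near + step * (i + 0.5) for i in range(count)]


box_min, box_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
hit = ray_box_interval((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), box_min, box_max)
miss = ray_box_interval((3.0, 0.0, -5.0), (0.0, 0.0, 1.0), box_min, box_max)
print(hit)   # (4.0, 6.0)
print(miss)  # None
print(sample_interval(hit[0], hit[1], 4))  # [4.25, 4.75, 5.25, 5.75]
```

Only the samples inside the interval would need the costly skinning-weight deformation, which is where the saving comes from.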
[49] When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework
Xiao Wang,Qian Zhu,Shujuan Wu,Bo Jiang,Shiliang Zhang,Yaowei Wang,Yonghong Tian,Bin Luo
Main category: cs.CV
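TL;DR: This paper introduces EvReID, a large-scale person re-identification dataset combining RGB and event camera data, and designs TriPro-ReID, a contrastive learning framework guided by pedestrian attributes that significantly improves recognition performance.
Details
Motivation: Existing event camera-based person re-identification methods are mostly trained on small-scale or simulated datasets and lack evaluation in real scenarios. This paper aims to address the data scarcity and improve model performance.
Contribution: 1. Introduces EvReID, a large-scale RGB-event person re-identification dataset; 2. Designs the TriPro-ReID framework, which combines RGB data, event data, and pedestrian attributes; 3. Evaluates 15 existing algorithms.
Method: TriPro-ReID uses visual features from RGB frames and event streams, adds pedestrian attributes as mid-level semantic features, and improves re-identification through contrastive learning.
Result: Experiments on the EvReID and MARS datasets validate the effectiveness of TriPro-ReID, which significantly outperforms existing methods.
Insight: Fusing RGB data, event data, and pedestrian attributes can significantly improve person re-identification, and large-scale real-world datasets are important for advancing research.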
Abstract: Event cameras have recently been proposed for person re-identification (ReID) because of their promising performance and better balance in terms of privacy protection, and event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID and MARS datasets validate the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
[50] MaskHOI: Robust 3D Hand-Object Interaction Estimation via Masked Pre-training
Yuechen Xie,Haobo Jiang,Jian Yang,Yigong Zhang,Jin Xie
Main category: cs.CV
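TL;DR: MaskHOI is a Masked Autoencoder (MAE)-based pretraining framework that improves geometric awareness and occlusion robustness in 3D hand-object interaction (HOI) pose estimation. With region-specific mask ratio allocation and masked signed distance field (SDF)-driven multimodal learning, it significantly outperforms existing techniques.
Details
Motivation: In 3D hand-object interaction tasks, estimating precise joint poses of hands and objects from monocular RGB input is very difficult, mainly because of the geometric ambiguity of RGB images and the severe occlusions that occur during interaction.
Contribution: 1. Proposes MaskHOI, an MAE-based pretraining framework; 2. Designs a region-specific mask ratio allocation strategy that balances the feature learning difficulty of hands and objects; 3. Introduces a masked SDF-driven multimodal learning mechanism that strengthens geometric awareness.
Method: 1. An MAE-driven masking-then-reconstruction strategy; 2. Region-specific mask ratio allocation, comprising region-specific masking assignment and skeleton-driven hand masking guidance; 3. Masked SDF-driven multimodal learning for perceiving global geometric structure.
Result: Experiments show that MaskHOI significantly outperforms existing methods on 3D hand-object interaction pose estimation.
Insight: Because hands and objects differ in geometric complexity, a non-uniform masking strategy learns occlusion-robust features more effectively. The masked SDF mechanism further compensates for the limits of monocular input and improves the perception of global geometric structure.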
Abstract: In 3D hand-object interaction (HOI) tasks, estimating precise joint poses of hands and objects from monocular RGB input remains highly challenging due to the inherent geometric ambiguity of RGB images and the severe mutual occlusions that occur during interaction. To address these challenges, we propose MaskHOI, a novel Masked Autoencoder (MAE)-driven pretraining framework for enhanced HOI pose estimation. Our core idea is to leverage the masking-then-reconstruction strategy of MAE to encourage the feature encoder to infer missing spatial and structural information, thereby facilitating geometric-aware and occlusion-robust representation learning. Specifically, we observe that human hands exhibit far greater geometric complexity than rigid objects, so conventional uniform masking fails to effectively guide the reconstruction of fine-grained hand structures. To overcome this limitation, we introduce a Region-specific Mask Ratio Allocation, primarily comprising the region-specific masking assignment and the skeleton-driven hand masking guidance. The former adaptively assigns lower masking ratios to hand regions than to rigid objects, balancing their feature learning difficulty, while the latter prioritizes masking critical hand parts (e.g., fingertips or entire fingers) to realistically simulate occlusion patterns in real-world interactions. Furthermore, to enhance the geometric awareness of the pretrained encoder, we introduce a novel Masked Signed Distance Field (SDF)-driven multimodal learning mechanism. Through the self-masking 3D SDF prediction, the learned encoder is able to perceive the global geometric structure of hands and objects beyond the 2D image plane, overcoming the inherent limitations of monocular input and alleviating self-occlusion issues. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches.
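The region-specific masking assignment can be sketched as sampling a different fraction of patches to mask in each region. The ratios below (0.5 for hand patches, 0.75 for object patches) are hypothetical values chosen only to respect the stated ordering, with hands masked less than rigid objects; the paper's adaptive assignment and its skeleton-driven guidance are not reproduced here.

```python
import random


def region_mask(labels, ratios, seed=0):
    """Mask round(ratio * count) randomly chosen patches within each labeled region."""
    rng = random.Random(seed)
    masked = [False] * len(labels)
    for region, ratio in ratios.items():
        indices = [i for i, label in enumerate(labels) if label == region]
        for i in rng.sample(indices, round(ratio * len(indices))):
            masked[i] = True
    return masked


# Hypothetical patch labels for one image: 4 hand patches and 8 object patches.
labels = ["hand"] * 4 + ["object"] * 8
mask = region_mask(labels, {"hand": 0.5, "object": 0.75})
hand_masked = sum(m for m, label in zip(mask, labels) if label == "hand")
object_masked = sum(m for m, label in zip(mask, labels) if label == "object")
print(hand_masked, object_masked)  # 2 6
```

The masked patches would then be hidden from the encoder and reconstructed during pretraining, as in a standard MAE.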
[51] HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors
Chuheng Wei,Ziye Qin,Walter Zimmer,Guoyuan Wu,Matthew J. Barth
Main category: cs.CV
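TL;DR: HeCoFuse is a V2X cooperative perception framework for heterogeneous sensors (such as cameras and LiDAR). With a hierarchical fusion mechanism and adaptive spatial resolution adjustment, it addresses cross-modality feature misalignment and imbalanced representation quality, and performs strongly on the TUMTraf-V2X dataset.
Details
Motivation: Real-world V2X cooperative perception systems often use heterogeneous sensor configurations because of cost and deployment differences, which makes feature fusion and reliable perception difficult. A unified framework is needed to handle this heterogeneity.
Contribution: Proposes the HeCoFuse framework, which uses hierarchical fusion and adaptive spatial resolution adjustment to address cross-modality feature misalignment under heterogeneous sensors and achieves high performance across multiple configurations.
Method: A hierarchical fusion mechanism (channel-wise and spatial attention) and an adaptive spatial resolution adjustment module are used, the fusion strategy is adjusted dynamically, and cooperative learning is added to improve robustness.
Result: On the TUMTraf-V2X dataset, HeCoFuse reaches 43.22% 3D mAP in the LC+LC configuration, outperforming the CoopDet3D baseline, improves further to 43.38% in the L+LC scenario, and remains stable across nine heterogeneous configurations.
Insight: Dynamically adjusting the fusion strategy and adapting the resolution are key to high performance in cooperative perception with heterogeneous sensors. HeCoFuse shows that cross-modal complementarity brings a clear gain in perception performance.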
Abstract: Real-world Vehicle-to-Everything (V2X) cooperative perception systems often operate under heterogeneous sensor configurations due to cost constraints and deployment variability across vehicles and infrastructure. This heterogeneity poses significant challenges for feature fusion and perception reliability. To address these issues, we propose HeCoFuse, a unified framework designed for cooperative perception across mixed sensor setups where nodes may carry Cameras (C), LiDARs (L), or both. By introducing a hierarchical fusion mechanism that adaptively weights features through a combination of channel-wise and spatial attention, HeCoFuse can tackle critical challenges such as cross-modality feature misalignment and imbalanced representation quality. In addition, an adaptive spatial resolution adjustment module is employed to balance computational cost and fusion effectiveness. To enhance robustness across different configurations, we further implement a cooperative learning strategy that dynamically adjusts fusion type based on available modalities. Experiments on the real-world TUMTraf-V2X dataset demonstrate that HeCoFuse achieves 43.22% 3D mAP under the full sensor configuration (LC+LC), outperforming the CoopDet3D baseline by 1.17%, and reaches an even higher 43.38% 3D mAP in the L+LC scenario, while maintaining 3D mAP in the range of 21.74% to 43.38% across nine heterogeneous sensor configurations. These results, validated by our first-place finish in the CVPR 2025 DriveX challenge, establish HeCoFuse as the current state-of-the-art on the TUMTraf-V2X dataset while demonstrating robust performance across diverse sensor deployments.
[52] Gaussian kernel-based motion measurement
Hongyi Liu,Haifeng Wang
Main category: cs.CV
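TL;DR: The paper proposes a Gaussian kernel-based motion measurement method that achieves sub-pixel motion measurement by tracking the positions of Gaussian kernels, addressing the insufficient accuracy or heavy manual parameter tuning of existing vision-based methods.
Details
Motivation: Structural health monitoring increasingly demands high-precision motion measurement. For sub-pixel motion, existing vision-based methods either lack accuracy or require extensive manual parameter tuning, so a more efficient and accurate method is needed.
Contribution: Proposes a new Gaussian kernel-based motion measurement method that combines motion consistency with a super-resolution constraint to improve accuracy and robustness, without sample-specific parameter customization.
Method: Extracts inter-frame motion by tracking the positions of Gaussian kernels, and introduces motion consistency and a super-resolution constraint to improve accuracy and robustness.
Result: Numerical and experimental validation shows that the method consistently reaches high accuracy without parameter customization for different test samples.
Insight: Gaussian kernel tracking combined with a consistency constraint and super-resolution can substantially improve the accuracy and practicality of motion measurement, making it suitable for scenarios such as structural health monitoring.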
Abstract: The growing demand for structural health monitoring has driven increasing interest in high-precision motion measurement, as structural information derived from extracted motions can effectively reflect the current condition of the structure. Among various motion measurement techniques, vision-based methods stand out due to their low cost, easy installation, and large-scale measurement. However, when it comes to sub-pixel-level motion measurement, current vision-based methods either lack sufficient accuracy or require extensive manual parameter tuning (e.g., pyramid layers, target pixels, and filter parameters) to reach good precision. To address this issue, we developed a novel Gaussian kernel-based motion measurement method, which extracts the motion between different frames by tracking the locations of Gaussian kernels. Motion consistency, which fits practical structural conditions, and a super-resolution constraint are introduced to increase the accuracy and robustness of our method. Numerical and experimental validations show that it can consistently reach high accuracy without customized parameter setup for different test samples.
[53] GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms
Ángel F. García-Fernández,Jinhao Gu,Lennart Svensson,Yuxuan Xia,Jan Krejčí,Oliver Kost,Ondřej Straka
Main category: cs.CV
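TL;DR: The paper proposes two quasi-metrics for evaluating multi-object tracking (MOT) algorithms: one extends the GOSPA metric and the other extends the T-GOSPA metric, measuring the discrepancy between sets of objects and between sets of trajectories, respectively. The quasi-metrics allow asymmetric localisation costs and different penalties for missed and false objects.
Details
Motivation: The existing GOSPA and T-GOSPA metrics lack flexibility in MOT evaluation, particularly for applications that need asymmetric costs or different penalties. The paper aims to address this.
Contribution: Proposes two new quasi-metrics (extensions of GOSPA and T-GOSPA) that support asymmetric localisation costs and different penalties for false and missed objects.
Method: Extends the GOSPA and T-GOSPA metrics with asymmetric localisation costs and flexible penalties, and assesses Bayesian MOT algorithms via simulations.
Result: The T-GOSPA quasi-metric is used to assess several Bayesian MOT algorithms in simulation, illustrating its use in asymmetric settings.
Insight: The flexibility of the quasi-metrics suits applications that need differentiated penalties, for example when some objects matter more than others.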
Abstract: This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of trajectories. Similar to the GOSPA-based metrics, these quasi-metrics include costs for localisation error for properly detected objects, the number of false objects and the number of missed objects. The T-GOSPA quasi-metric also includes a track switching cost. Differently from the GOSPA and T-GOSPA metrics, the proposed quasi-metrics have the flexibility of penalising missed and false objects with different costs, and the localisation costs are not required to be symmetric. These properties can be useful in MOT evaluation in certain applications. The performance of several Bayesian MOT algorithms is assessed with the T-GOSPA quasi-metric via simulations.
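To make the asymmetry concrete, here is a minimal sketch of a GOSPA-style set cost with separate penalties for missed and false objects. It is not the paper's definition: the function name `asymmetric_set_cost`, the exponent of 1, the brute-force search, and the example costs are all assumptions chosen for illustration, and the sketch is only practical for very small sets.

```python
from typing import Callable, Sequence


def asymmetric_set_cost(
    truth: Sequence[float],
    estimates: Sequence[float],
    loc_cost: Callable[[float, float], float],
    missed_cost: float,
    false_cost: float,
) -> float:
    """Minimum total cost over all partial assignments (brute force).

    Hypothetical illustration: localisation cost for assigned pairs, plus
    missed_cost per unassigned ground-truth object and false_cost per
    unassigned estimate.
    """
    n_est = len(estimates)

    def best(i: int, used: frozenset) -> float:
        # All ground-truth objects handled: every unused estimate is false.
        if i == len(truth):
            return false_cost * (n_est - len(used))
        # Option 1: ground-truth object i is missed.
        cost = missed_cost + best(i + 1, used)
        # Option 2: assign it to any estimate that is still free.
        for j in range(n_est):
            if j not in used:
                cand = loc_cost(truth[i], estimates[j]) + best(i + 1, used | {j})
                cost = min(cost, cand)
        return cost

    return best(0, frozenset())


def abs_error(x: float, y: float) -> float:
    """Symmetric localisation cost used in the example below."""
    return abs(x - y)


# Swapping the roles of the two sets changes the value because a missed
# object (cost 4) is penalised more heavily than a false one (cost 1).
forward = asymmetric_set_cost([0.0, 10.0], [0.5], abs_error, 4.0, 1.0)
backward = asymmetric_set_cost([0.5], [0.0, 10.0], abs_error, 4.0, 1.0)
print(forward, backward)
```

Because `forward` and `backward` differ, the cost is not symmetric in its arguments, which is the property that separates a quasi-metric from a metric. Passing an asymmetric `loc_cost` would add a second source of asymmetry.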
[54] Augmented Reality in Cultural Heritage: A Dual-Model Pipeline for 3D Artwork Reconstruction
Daniele Pannone,Alessia Castronovo,Maurizio Mancini,Gian Luca Foresti,Claudio Piciarelli,Rossana Gabrieli,Muhammad Yasir Bilal,Danilo Avola
Main category: cs.CV
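TL;DR: The paper proposes an augmented reality (AR) pipeline for museum environments that combines two pre-trained depth estimation models (GLPN and Depth-Anything) to generate accurate 3D artwork models from single images, improving reconstruction accuracy and visual realism.
Details
Motivation: Museums want to use AR to make visits more interactive, but the complex contours and textures of artworks make 3D reconstruction difficult. The paper targets this problem.
Contribution: Proposes a dual-model pipeline that combines the complementary strengths of GLPN and Depth-Anything to refine depth-map generation and reconstruct high-quality 3D artworks.
Method: GLPN captures global scene structure and Depth-Anything handles local detail; their outputs are combined into an optimized depth map, which is converted into point clouds and meshes.
Result: Experiments report significant improvements in reconstruction accuracy and visual realism, making the system a robust tool for museum AR applications.
Insight: A complementary dual-model strategy is effective for 3D reconstruction of complex artworks, especially from a single input image.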
Abstract: This paper presents an innovative augmented reality pipeline tailored for museum environments, aimed at recognizing artworks and generating accurate 3D models from single images. By integrating two complementary pre-trained depth estimation models, i.e., GLPN for capturing global scene structure and Depth-Anything for detailed local reconstruction, the proposed approach produces optimized depth maps that effectively represent complex artistic features. These maps are then converted into high-quality point clouds and meshes, enabling the creation of immersive AR experiences. The methodology leverages state-of-the-art neural network architectures and advanced computer vision techniques to overcome challenges posed by irregular contours and variable textures in artworks. Experimental results demonstrate significant improvements in reconstruction accuracy and visual realism, making the system a highly robust tool for museums seeking to enhance visitor engagement through interactive digital content.
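The abstract does not say how depth maps are converted into point clouds. A common approach is pinhole back-projection, sketched below under the assumptions that the depth map is metric and that the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are known; the function name and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue                   # skip invalid or missing depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (assumed values).
cloud = depth_to_points([[2.0, 2.0], [2.0, 0.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```

A mesh can then be fitted to the resulting points with a surface reconstruction library; that step is outside this sketch.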
[55] Tackling fake images in cybersecurity – Interpretation of a StyleGAN and lifting its black-box
Julia Laubmann,Johannes Reschke
Main category: cs.CV
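TL;DR: The paper analyzes StyleGAN's generator to explain how it works, covering key techniques such as the Equalized Learning Rate, and shows that weight pruning can reduce computational requirements. It also examines how the latent vector affects generated faces and the related ethical issues.
Details
Motivation: As AI-generated images become more common, a closer study of powerful tools such as StyleGAN helps explain how they work and exposes the risk of misuse.
Contribution: 1. A detailed analysis of the internal mechanisms of the StyleGAN generator; 2. A demonstration that weight pruning improves computational efficiency; 3. An analysis of the fine-grained control the latent vector gives over generated images.
Method: 1. Train a StyleGAN model with the PyTorch framework; 2. Reduce model complexity through weight pruning; 3. Study how global and local modifications of the latent vector affect image generation.
Result: Pruning removes a significant number of weights without drastically affecting the output, which lowers computational requirements, and targeted changes to individual latent dimensions allow precise control over specific facial features.
Insight: StyleGAN's latent vector offers strong control over generated images, but it also raises ethical concerns about misuse, particularly the fabrication of fake identities.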
Abstract: In today’s digital age, concerns about the dangers of AI-generated images are increasingly common. One powerful tool in this domain is StyleGAN (style-based generative adversarial networks), a generative adversarial network capable of producing highly realistic synthetic faces. To gain a deeper understanding of how such a model operates, this work focuses on analyzing the inner workings of StyleGAN’s generator component. Key architectural elements and techniques, such as the Equalized Learning Rate, are explored in detail to shed light on the model’s behavior. A StyleGAN model is trained using the PyTorch framework, enabling direct inspection of its learned weights. Through pruning, it is revealed that a significant number of these weights can be removed without drastically affecting the output, leading to reduced computational requirements. Moreover, the role of the latent vector – which heavily influences the appearance of the generated faces – is closely examined. Global alterations to this vector primarily affect aspects like color tones, while targeted changes to individual dimensions allow for precise manipulation of specific facial features. This ability to finetune visual traits is not only of academic interest but also highlights a serious ethical concern: the potential misuse of such technology. Malicious actors could exploit this capability to fabricate convincing fake identities, posing significant risks in the context of digital deception and cybercrime.
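The abstract does not state which pruning criterion was used. As an illustration only, the sketch below applies magnitude pruning, a common choice that zeroes the weights with the smallest absolute values; the function names and the toy weight list are assumptions, and a real experiment would operate on the trained generator's tensors.

```python
from typing import List


def magnitude_prune(weights: List[float], fraction: float) -> List[float]:
    """Return a copy with the smallest-magnitude `fraction` of weights set to 0."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


layer = [0.5, -0.01, 0.2, -0.9, 0.03, 0.0]   # toy weights (assumed values)
pruned_layer = magnitude_prune(layer, 0.5)
print(pruned_layer, sparsity(pruned_layer))
```

Zeroed weights only reduce computation when the runtime exploits sparsity or the pruned units are physically removed, so the sparsity figure is an upper bound on the saving.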
[56] Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning
Junsu Kim,Yunhoe Ku,Seungryul Baek
Main category: cs.CV
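TL;DR: The paper proposes Diffusion-FSCIL, which uses a pre-trained text-to-image diffusion model as a frozen backbone and reduces forgetting through multi-scale feature extraction and feature distillation, performing well in few-shot class-incremental learning.
Details
Motivation: Few-shot class-incremental learning (FSCIL) suffers from catastrophic forgetting because training data are extremely limited; the authors aim to address this with the representational power of large generative models.
Contribution: 1) Uses a pre-trained diffusion model as a frozen backbone; 2) Proposes multi-scale feature extraction and a latent replay mechanism; 3) Adds feature distillation to reduce generative bias.
Method: Uses a frozen text-to-image diffusion model, extracts multi-scale features as latent replay, applies feature distillation, and fine-tunes only a few components for efficiency.
Result: Surpasses existing methods on CUB-200, miniImageNet, and CIFAR-100, preserving performance on old classes while adapting effectively to new ones.
Insight: The pre-trained features and multi-scale representations of large generative models are a clear advantage for few-shot class-incremental learning.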
Abstract: Few-shot class-incremental learning (FSCIL) is challenging because training data are extremely limited while the model must both reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model’s capabilities, benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features that act as latent replay, with slight support from feature distillation to prevent generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.
[57] Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
Tongtong Su,Chengyu Wang,Bingyan Liu,Jun Huang,Dongming Lu
Main category: cs.CV
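TL;DR: EVS composes text-to-image (T2I) and text-to-video (T2V) models, encapsulating the strengths of both to synthesize high-quality video with better motion consistency and faster inference.
Details
Motivation: Existing T2V models struggle to deliver both high imaging quality and good motion, and refining frames with T2I models causes flickering and artifacts across frames. EVS aims to solve this by combining the strengths of T2I and T2V models.
Contribution: Proposes EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to improve visual fidelity and motion smoothness while speeding up inference.
Method: EVS uses a diffusion-based T2I model to refine low-quality video frames (treated as out-of-distribution samples) and a T2V backbone to keep motion consistent, encapsulating the T2V temporal prior into the T2I generation process.
Result: Experiments show that EVS outperforms previous approaches in video quality and motion, with a 1.6x to 4.5x inference speedup.
Insight: Encapsulating the complementary strengths of T2I and T2V models gives an efficient, high-quality approach to video synthesis that avoids the inter-frame inconsistency of earlier methods.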
Abstract: In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their ability to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both the visual fidelity and the motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also yields a 1.6x-4.5x speedup in inference time. Source code: https://github.com/Tonniia/EVS.
[58] Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Pu Jian,Donglei Yu,Wen Yang,Shuo Ren,Jiajun Zhang
Main category: cs.CV
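TL;DR: The paper introduces ClearVQA, a new benchmark for evaluating how well vision-language models (VLMs) resolve ambiguity in visual question answering (VQA) through interaction, and examines the tendency of existing VLMs to answer rather than ask.
Details
Motivation: In VQA, user questions are often ambiguous because of differing expression habits. Existing methods mainly rephrase the question and overlook the potential of user interaction.
Contribution: 1. Introduces the ClearVQA benchmark, covering three common categories of ambiguity in VQA; 2. Identifies that VLMs are trained to prefer answering over asking.
Method: Uses the ClearVQA benchmark to evaluate VLMs' ability to resolve ambiguity and explores how interactive clarification can be realized.
Result: No experimental results are stated explicitly; the benchmark design fills a gap in research on interactive ambiguity resolution.
Insight: Interactive clarification using user feedback is a promising way to resolve ambiguity in VQA, and current VLM designs need adjusting to support proactive questioning.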
Abstract: In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) benchmarks are absent to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.
[59] Feature Engineering is Not Dead: Reviving Classical Machine Learning with Entropy, HOG, and LBP Feature Fusion for Image Classification
Abhijit Sen,Giridas Maiti,Bikram K. Parida,Bhanu P. Mishra,Mahima Arya,Denys I. Bondar
Main category: cs.CV
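TL;DR: The paper proposes a lightweight image classification method based on fusing Permutation Entropy (PE), HOG, and LBP features, showing the potential of classical feature engineering, with performance close to that of deep learning models.
Details
Motivation: Classical machine learning remains valuable for image classification because of its interpretability and computational efficiency, but more effective feature extraction is needed. PE performs well in time-series analysis yet is rarely applied to image data.
Contribution: 1) Extends PE to two-dimensional images with a multiscale, multi-orientation feature extraction approach; 2) Combines it with HOG and LBP into a compact, efficient feature set; 3) Shows that classical machine learning is competitive for image classification.
Method: Multiscale PE captures spatial complexity, HOG captures shape, and LBP encodes texture; the resulting 780-dimensional feature vector trains SVM classifiers tuned by grid search.
Result: On benchmarks such as Fashion-MNIST, the method delivers competitive classification performance without deep architectures, supporting the effectiveness and interpretability of classical feature engineering.
Insight: PE has potential for image classification, and feature fusion improves classical methods, pointing toward lightweight, interpretable image classification.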
Abstract: Feature engineering continues to play a critical role in image classification, particularly when interpretability and computational efficiency are prioritized over deep learning models with millions of parameters. In this study, we revisit classical machine learning-based image classification through a novel approach centered on Permutation Entropy (PE), a robust and computationally lightweight measure traditionally used in time series analysis but rarely applied to image data. We extend PE to two-dimensional images and propose a multiscale, multi-orientation entropy-based feature extraction approach that characterizes spatial order and complexity along rows, columns, diagonals, anti-diagonals, and local patches of the image. To enhance the discriminatory power of the entropy features, we integrate two classic image descriptors: the Histogram of Oriented Gradients (HOG) to capture shape and edge structure, and Local Binary Patterns (LBP) to encode the micro-texture of an image. The resulting hand-crafted feature set, comprising 780 dimensions, is used to train Support Vector Machine (SVM) classifiers optimized through grid search. The proposed approach is evaluated on multiple benchmark datasets, including Fashion-MNIST, KMNIST, EMNIST, and CIFAR-10, where it delivers competitive classification performance without relying on deep architectures. Our results demonstrate that the fusion of PE with HOG and LBP provides a compact, interpretable, and effective alternative to computationally expensive deep learning models with limited interpretability. This shows the potential of entropy-based descriptors in image classification and contributes a lightweight and generalizable solution to interpretable machine learning in image classification and computer vision.
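For readers unfamiliar with Permutation Entropy, the sketch below computes the standard normalised (Bandt-Pompe) PE of a 1D sequence and applies it along the rows and columns of a small image. It does not reproduce the paper's multiscale, multi-orientation scheme; the function names, the default `order` and `delay`, and the tie-breaking rule are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def permutation_entropy(series: Sequence[float], order: int = 3, delay: int = 1) -> float:
    """Normalised permutation entropy in [0, 1] of a 1D sequence."""
    n_windows = len(series) - (order - 1) * delay
    if order < 2 or n_windows < 1:
        raise ValueError("series is too short for the requested order and delay")
    patterns: Counter = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: the indices that sort the window (ties broken by position).
        pattern = tuple(sorted(range(order), key=lambda k: (window[k], k)))
        patterns[pattern] += 1
    # Shannon entropy of the pattern distribution.
    entropy = sum(-(c / n_windows) * math.log(c / n_windows) for c in patterns.values())
    # Normalise by the maximum entropy, log(order!).
    return entropy / math.log(math.factorial(order))


def row_column_entropies(
    image: Sequence[Sequence[float]], order: int = 3
) -> Tuple[List[float], List[float]]:
    """PE of every row and every column of a 2D image (hypothetical helper)."""
    rows = [permutation_entropy(r, order) for r in image]
    cols = [permutation_entropy(c, order) for c in zip(*image)]
    return rows, cols


flat = permutation_entropy([1, 2, 3, 4, 5])            # monotone: one pattern only
mixed = permutation_entropy([4, 7, 9, 10, 6, 11, 3])   # three distinct patterns
rows, cols = row_column_entropies([[1, 2, 3, 4], [4, 1, 3, 2], [2, 2, 2, 2], [9, 7, 8, 6]])
print(flat, round(mixed, 4), rows, cols)
```

Low values indicate ordered structure and values near 1 indicate disorder. Concatenating such per-orientation values is one plausible way to build an entropy feature vector that can sit alongside HOG and LBP descriptors.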
[60] SuperCM: Improving Semi-Supervised Learning and Domain Adaptation through differentiable clustering
Durgesh Singh,Ahcène Boubekki,Robert Jenssen,Michael Kampffmeyer
Main category: cs.CV
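TL;DR: The paper proposes SuperCM, which adds a differentiable clustering module to exploit the clustering assumption explicitly, improving semi-supervised learning (SSL) and unsupervised domain adaptation (UDA), especially in low-supervision settings.
Details
Motivation: Existing SSL and UDA methods usually enforce the clustering assumption implicitly, which does not fully exploit the supervised data. The paper instead introduces an explicit differentiable clustering module whose centroids are computed with the help of supervised data.
Contribution: 1. Proposes SuperCM, an end-to-end training approach with an explicit differentiable clustering module; 2. Demonstrates its effectiveness on SSL and UDA tasks, particularly under low supervision; 3. Shows that SuperCM works both as a standalone model and as a regularizer for existing methods.
Method: Introduces a differentiable clustering module that uses supervised data to compute its centroids and trains end to end, applying the clustering assumption directly during training and supporting joint training with other methods.
Result: Extensive experiments validate SuperCM's effectiveness on SSL and UDA tasks, especially in low-supervision regimes; used as a regularizer, it also improves existing methods.
Insight: Explicitly exploiting the clustering assumption while using supervised data to compute centroids improves SSL and UDA more effectively and suggests a new direction for both.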
Abstract: Semi-Supervised Learning (SSL) and Unsupervised Domain Adaptation (UDA) enhance model performance by exploiting information from labeled and unlabeled data. The clustering assumption, which has proven advantageous for learning with limited supervision, states that data points belonging to the same cluster in a high-dimensional space should be assigned to the same category. Recent works have used different training mechanisms to implicitly enforce this assumption for SSL and UDA. In this work, we take a different approach by explicitly involving a differentiable clustering module which is extended to leverage the supervised data to compute its centroids. We demonstrate the effectiveness of our straightforward end-to-end training strategy for SSL and UDA through extensive experiments and highlight its benefits, especially in low supervision regimes, both as a standalone model and as a regularizer for existing approaches.
[61] Localized FNO for Spatiotemporal Hemodynamic Upsampling in Aneurysm MRI
Kyriakos Flouris,Moritz Halter,Yolanne Y. R. Lee,Samuel Castonguay,Luuk Jacobs,Pietro Dirix,Jonathan Nestmann,Sebastian Kozerke,Ender Konukoglu
Main category: cs.CV
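TL;DR: The paper proposes LoFNO, a new architecture that uses geometric priors and a neural operator framework to raise the spatiotemporal resolution of MRI flow data and predict wall shear stress (WSS) directly, outperforming interpolation and alternative deep learning methods.
Details
Motivation: The low spatiotemporal resolution and signal-to-noise ratio of magnetic resonance flow imaging limit its usefulness in aneurysm diagnosis, so a method that raises resolution and predicts WSS directly is needed.
Contribution: Proposes the LoFNO architecture, which integrates Laplacian eigenvectors as geometric priors and an EDSR layer to de-noise and spatiotemporally upsample flow data, improving prediction accuracy.
Method: LoFNO combines geometric priors (Laplacian eigenvectors) with a neural operator framework, uses an EDSR layer for robust upsampling, and predicts WSS and blood velocity directly.
Result: LoFNO predicts velocity and WSS better than interpolation and other deep learning methods, enabling more precise cerebrovascular diagnostics.
Insight: Combining geometric priors with neural operators handles irregular geometries well and raises the spatiotemporal resolution of flow data, which matters for clinical diagnosis.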
Abstract: Hemodynamic analysis is essential for predicting aneurysm rupture and guiding treatment. While magnetic resonance flow imaging enables time-resolved volumetric blood velocity measurements, its low spatiotemporal resolution and signal-to-noise ratio limit its diagnostic utility. To address this, we propose the Localized Fourier Neural Operator (LoFNO), a novel 3D architecture that enhances both spatial and temporal resolution with the ability to predict wall shear stress (WSS) directly from clinical imaging data. LoFNO integrates Laplacian eigenvectors as geometric priors for improved structural awareness on irregular, unseen geometries and employs an Enhanced Deep Super-Resolution Network (EDSR) layer for robust upsampling. By combining geometric priors with neural operator frameworks, LoFNO de-noises and spatiotemporally upsamples flow data, achieving superior velocity and WSS predictions compared to interpolation and alternative deep learning methods, enabling more precise cerebrovascular diagnostics.
[62] NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
Maksim Kuprashevich,Grigorii Alekseenko,Irina Tolstykh,Georgii Fedorov,Bulat Suleimanov,Vladimir Dokholyan,Aleksandr Gordeev
Main category: cs.CV
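TL;DR: The paper presents an automated pipeline that mines high-quality editing triplets (original image, instruction, edited image) across domains, resolutions, and styles without human intervention. A task-tuned Gemini validator and data augmentation substantially increase dataset size and edit quality.
Details
Motivation: Advances in generative modeling let image editing assistants follow natural-language instructions, but high-quality training triplets are hard to obtain, which calls for an automated solution.
Contribution: 1. A fully automated, modular pipeline for mining high-quality editing triplets; 2. Inversion and compositional bootstrapping that enlarge the dataset by about 2.2x; 3. Release of the NHR-Edit dataset and the Bagel-NHR-Edit model to support open research.
Method: 1. Uses public generative models and a task-tuned Gemini validator to score instruction adherence and aesthetics directly; 2. Expands the dataset with inversion and compositional bootstrapping; 3. Needs no segmentation or grounding models.
Result: NHR-Edit contains 358k high-quality triplets and surpasses all public alternatives in the cross-dataset evaluation; Bagel-NHR-Edit achieves state-of-the-art metrics in the authors' experiments.
Insight: Automated pipelines and data augmentation can greatly reduce reliance on human annotation and offer a route to large-scale, high-quality training data.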
Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
[63] One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion
Haoang Lu,Yuanqi Su,Xiaoning Zhang,Hao Hu
Main category: cs.CV
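TL;DR: The paper proposes CF-SSC, a new temporal semantic scene completion framework that predicts pseudo-future frames to extend the model's perceptual range and improve monocular 3D scene completion.
Details
Motivation: In real traffic scenes, parts of the scene are often occluded or outside the camera's field of view, which existing monocular SSC methods handle poorly.
Contribution: Proposes the CF-SSC framework, which combines pseudo-future frame prediction with geometrically consistent 3D fusion to improve scene completion and occlusion reasoning.
Method: Builds 3D correspondences from poses and depths, fuses past, present, and predicted future frames in 3D space, and models spatial-temporal relationships explicitly.
Result: Achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
Insight: Explicit spatial-temporal modeling and geometrically consistent fusion are key to better monocular 3D scene completion.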
Abstract: In recent years, visual 3D Semantic Scene Completion (SSC) has emerged as a critical perception task for autonomous driving due to its ability to infer complete 3D scene layouts and semantics from single 2D images. However, in real-world traffic scenarios, a significant portion of the scene remains occluded or outside the camera’s field of view, a fundamental challenge that existing monocular SSC methods fail to address adequately. To overcome these limitations, we propose Creating the Future SSC (CF-SSC), a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model’s effective perceptual range. Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically consistent fusion of past, present, and predicted future frames in 3D space. Unlike conventional methods that rely on simple feature stacking, our 3D-aware architecture achieves more robust scene completion by explicitly modeling spatial-temporal relationships. Comprehensive experiments on the SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate state-of-the-art performance, validating the effectiveness of our approach and highlighting its ability to improve occlusion reasoning and 3D scene completion accuracy.
[64] GRAM-MAMBA: Holistic Feature Alignment for Wireless Perception with Adaptive Low-Rank Compensation
Weiqi Yang,Xu Zhou,Jingfu Guan,Hao Du,Tianyu Bai
Main category: cs.CV
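TL;DR: The paper proposes GRAM-MAMBA, a framework that tackles model complexity, insufficient modality alignment, and weak robustness to missing modalities in multi-modal perception. It combines the Mamba model with a GRAM matrix strategy for efficient multi-modal alignment and adds an adaptive low-rank compensation strategy for missing modalities. Experiments confirm its performance.
Details
Motivation: Existing multi-modal perception systems suffer from excessive model complexity, unidirectional alignment that neglects inter-modal relationships, and poor robustness when modalities are missing, which limits their use in resource-constrained IoT environments.
Contribution: 1) Proposes GRAM-MAMBA, combining the Mamba model with a GRAM matrix for efficient multi-modal alignment; 2) Introduces an adaptive low-rank compensation strategy that improves robustness to missing modalities; 3) Validates efficiency and performance gains on positioning and activity recognition tasks.
Method: 1) Processes sensor time series with the linear-complexity Mamba model; 2) Uses an optimized GRAM matrix strategy for multi-modal alignment; 3) Handles missing modalities with an adaptive low-rank layer compensation strategy (inspired by LoRA) that fine-tunes only the relevant layers.
Result: On the SPAWC2021 indoor positioning dataset, the pre-trained model has lower error than baselines; adapting to missing modalities gives a 24.5% performance boost while training less than 0.2% of parameters. On the USC-HAD activity recognition dataset, it reaches 93.55% F1 and 93.81% OA, outperforming prior work; the update strategy raises F1 by 23% while training less than 0.3% of parameters.
Insight: Efficient alignment plus adaptive compensation for missing modalities improves multi-modal perception in resource-constrained environments and gives a practical solution for IoT applications.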
Abstract: Multi-modal fusion is crucial for Internet of Things (IoT) perception, widely deployed in smart homes, intelligent transport, industrial automation, and healthcare. However, existing systems often face challenges: high model complexity hinders deployment in resource-constrained environments, unidirectional modal alignment neglects inter-modal relationships, and robustness suffers when sensor data is missing. These issues impede efficient and robust multimodal perception in real-world IoT settings. To overcome these limitations, we propose GRAM-MAMBA. This framework utilizes the linear-complexity Mamba model for efficient sensor time-series processing, combined with an optimized GRAM matrix strategy for pairwise alignment among modalities, addressing the shortcomings of traditional single-modality alignment. Inspired by Low-Rank Adaptation (LoRA), we introduce an adaptive low-rank layer compensation strategy to handle missing modalities post-training. This strategy freezes the pre-trained model core and irrelevant adaptive layers, fine-tuning only those related to available modalities and the fusion process. Extensive experiments validate GRAM-MAMBA’s effectiveness. On the SPAWC2021 indoor positioning dataset, the pre-trained model shows lower error than baselines; adapting to missing modalities yields a 24.5% performance boost by training less than 0.2% of parameters. On the USC-HAD human activity recognition dataset, it achieves 93.55% F1 and 93.81% Overall Accuracy (OA), outperforming prior work; the update strategy increases F1 by 23% while training less than 0.3% of parameters. These results highlight GRAM-MAMBA’s potential for achieving efficient and robust multimodal perception in resource-constrained environments.
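The compensation strategy is described as inspired by LoRA. The sketch below shows generic LoRA only, not the paper's adaptive layers: a frozen weight matrix `W` is augmented with a trainable low-rank product `B @ A`, so only a small share of parameters is updated. The class name, the constant initialisation of `A`, and the toy sizes are assumptions made to keep the example deterministic.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0) -> None:
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                             # frozen, shape (d_out, d_in)
        # LoRA normally draws A at random; a constant keeps this sketch deterministic.
        self.a = [[0.01] * d_in for _ in range(rank)]    # trainable, shape (rank, d_in)
        self.b = [[0.0] * rank for _ in range(d_out)]    # trainable, zero init, (d_out, rank)
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_fraction(self) -> float:
        frozen = len(self.weight) * len(self.weight[0])
        trainable = sum(len(r) for r in self.a) + sum(len(r) for r in self.b)
        return trainable / (frozen + trainable)


layer = LowRankAdapter([[1.0, 2.0], [3.0, 4.0]], rank=1)
before = layer.forward([1.0, 1.0])     # B is zero, so this equals W @ x
layer.b = [[1.0], [0.0]]               # stand-in for a training update
after = layer.forward([1.0, 1.0])
print(before, after, layer.trainable_fraction())
```

For a square layer of width `d`, the adapter adds `2 * rank * d` parameters against `d * d` frozen ones, so the trainable share shrinks as the layer grows. That is consistent in spirit with the sub-1% figures the abstract reports, although the exact percentages depend on the paper's architecture.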
[65] SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
Yingying Zhang,Lixiang Ru,Kang Wu,Lei Yu,Lei Liang,Yansheng Li,Jingdong Chen
Main category: cs.CV
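TL;DR: SkySense V2 is a unified multi-modal remote sensing foundation model that handles multiple modalities with a single transformer backbone and combines a tailored self-supervised learning strategy with adaptive modules, improving performance and generalization.
Details
Motivation: Existing multi-modal remote sensing foundation models usually train a separate backbone per modality, which is redundant and inefficient. Conventional self-supervised learning methods also fail to account for the characteristics of remote sensing images.
Contribution: Proposes SkySense V2, which supports multi-modal input with a single transformer backbone, designs an adaptive patch merging module and learnable modality prompt tokens, and adds a mixture of experts (MoE) module.
Method: Combines a self-supervised learning strategy tailored to remote sensing data with adaptive modules that address varying resolutions and limited feature diversity across modalities.
Result: Evaluated on 16 datasets over 7 tasks, it outperforms SkySense by an average of 1.8 points.
Insight: A unified multi-modal backbone and a self-supervised strategy designed for remote sensing data are key to better performance and generalization.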
Abstract: The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches require a separate backbone network to be trained for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate a mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.
[66] Team of One: Cracking Complex Video QA with Model Synergy
Jun Xie,Zhaoran Zhao,Xiongjun Guan,Yingjian Zhu,Hongzhu Yi,Xinming Wang,Feng Chen,Zhepeng Wang
Main category: cs.CV
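TL;DR: The paper proposes a framework that coordinates multiple heterogeneous video-language models (VLMs) to improve reasoning depth and robustness in open-ended video question answering, significantly outperforming existing baselines.
Details
Motivation: Existing Video-Large Multimodal Models (Video-LMMs) do poorly in complex scenarios, showing limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. The paper aims to address these problems.
Contribution: Proposes a lightweight, extensible prompting-and-response integration mechanism that coordinates multiple VLMs through structured chains of thought, with an external large language model (LLM) as evaluator and integrator.
Method: Exploits the heterogeneity of multiple VLMs, tailoring structured chains of thought to distinct reasoning pathways, and has an LLM evaluate and fuse the most reliable outputs.
Result: Significantly outperforms existing baselines on the CVRR-ES dataset, with better generalization and robustness.
Insight: The method needs no model retraining and offers a lightweight, extensible strategy for multimodal reasoning, laying groundwork for future Video-LMM development.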
Abstract: We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.
[67] Depth3DLane: Fusing Monocular 3D Lane Detection with Self-Supervised Monocular Depth Estimation
Max van den Hoven,Kishaan Jeeveswaran,Pieter Piscaer,Thijs Wensveen,Elahe Arani,Bahram Zonooz
Main category: cs.CV
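TL;DR: Depth3DLane combines self-supervised monocular depth estimation with 3D lane detection, needs no expensive sensors or ground-truth depth data, removes the dependence on known camera parameters, and achieves competitive performance on the OpenLane dataset.
Details
Motivation: Existing 3D lane detection methods rely on expensive depth sensors or ground-truth depth data and assume known camera parameters, which limits their use at scale or where camera calibration is infeasible.
Contribution: 1. Proposes a dual-pathway framework that fuses self-supervised depth estimation with 3D lane detection; 2. Extends the framework to predict camera parameters per frame; 3. Works without camera calibration.
Method: 1. A self-supervised depth network produces a point cloud of the scene; 2. A bird's-eye view pathway extracts spatial information and a front view pathway extracts semantic information; 3. 3D lane anchors sample features from both pathways to infer lane geometry.
Result: Achieves competitive performance on the OpenLane benchmark and remains applicable without ground-truth camera parameters.
Insight: Self-supervised depth estimation gives 3D lane detection a practical, low-cost solution, especially where camera calibration is infeasible.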
Abstract: Monocular 3D lane detection is essential for autonomous driving, but challenging due to the inherent lack of explicit spatial information. Multi-modal approaches rely on expensive depth sensors, while methods incorporating fully-supervised depth networks rely on ground-truth depth data that is impractical to collect at scale. Additionally, existing methods assume that camera parameters are available, limiting their applicability in scenarios like crowdsourced high-definition (HD) lane mapping. To address these limitations, we propose Depth3DLane, a novel dual-pathway framework that integrates self-supervised monocular depth estimation to provide explicit structural information, without the need for expensive sensors or additional ground-truth depth data. Leveraging a self-supervised depth network to obtain a point cloud representation of the scene, our bird’s-eye view pathway extracts explicit spatial information, while our front view pathway simultaneously extracts rich semantic information. Depth3DLane then uses 3D lane anchors to sample features from both pathways and infer accurate 3D lane geometry. Furthermore, we extend the framework to predict camera parameters on a per-frame basis and introduce a theoretically motivated fitting procedure to enhance stability on a per-segment basis. Extensive experiments demonstrate that Depth3DLane achieves competitive performance on the OpenLane benchmark dataset. Furthermore, experimental results show that using learned parameters instead of ground-truth parameters allows Depth3DLane to be applied in scenarios where camera calibration is infeasible, unlike previous methods.
[68] When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
Francesco Ortu,Zhijing Jin,Diego Doimo,Alberto Cazzaniga
Main category: cs.CV
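TL;DR: The paper studies how vision-language models (VLMs) handle conflicts between internal parametric knowledge and external information. Using a dataset of multimodal counterfactual queries, it localizes a small set of heads that control the conflict and steers model behavior by modifying them.
Details
Motivation: VLMs handling complex tasks often face conflicts between internal knowledge and external information, leading to hallucinations and unreliable responses. How models resolve such cross-modal conflicts is unclear.
Contribution: 1. Introduces a dataset of multimodal counterfactual queries for studying conflict resolution in VLMs; 2. Localizes the heads that control the conflict; 3. Shows that modifying these heads steers model behavior.
Method: 1. Introduce the multimodal counterfactual query dataset; 2. Localize conflict-controlling heads with logit inspection; 3. Modify the heads to verify their effect.
Result: The localized heads can steer the model toward its internal knowledge or the visual input, and attention from these heads pinpoints the image regions driving visual overrides more precisely than gradient-based attribution.
Insight: VLMs resolve cross-modal conflicts through a small set of heads; intervening on them can steer model behavior and opens a new direction for understanding model internals.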
Abstract: Vision-language models (VLMs) increasingly leverage diverse knowledge sources to address complex tasks, often encountering conflicts between their internal parametric knowledge and external information. Knowledge conflicts can result in hallucinations and unreliable responses, but the mechanisms governing such interactions remain unknown. To address this gap, we analyze the mechanisms that VLMs use to resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. We localize with logit inspection a small set of heads that control the conflict. Moreover, by modifying these heads, we can steer the model towards its internal knowledge or the visual inputs. Finally, we show that attention from such heads pinpoints localized image regions driving visual overrides, outperforming gradient-based attribution in precision.
[69] Real-Time Fusion of Visual and Chart Data for Enhanced Maritime Vision
Marten Kreis,Benjamin Kiefer
Main category: cs.CV
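TL;DR: This paper proposes a new method for fusing real-time visual data with nautical chart information. A transformer-based end-to-end neural network matches detected navigational aids with their chart representations, improving the accuracy of maritime vision.
Details
Motivation: In dynamic, complex environments, maritime vision struggles with object localization and association, so a more accurate way to fuse live video with chart data is needed. Contribution: A transformer-based end-to-end neural network that directly matches image detections with world-space chart markers, significantly improving object localization and association accuracy.
Method: A transformer architecture predicts bounding boxes and confidence scores for navigational aids (buoy queries), enabling direct matching between image-domain detections and chart data.
Result: On a dataset of real-world maritime scenes, the method outperforms the baselines (a YOLOv7-based network and a ray-casting model) in dynamic and challenging environments.
Insight: Fusing visual and chart data significantly improves the localization of navigational aids, especially in complex maritime environments.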
Abstract: This paper presents a novel approach to enhancing marine vision by fusing real-time visual data with chart information. Our system overlays nautical chart data onto live video feeds by accurately matching detected navigational aids, such as buoys, with their corresponding representations in chart data. To achieve robust association, we introduce a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, enabling the direct matching of image-domain detections with world-space chart markers. The proposed method is compared against baseline approaches, including a ray-casting model that estimates buoy positions via camera projection and a YOLOv7-based network extended with a distance estimation module. Experimental results on a dataset of real-world maritime scenes demonstrate that our approach significantly improves object localization and association accuracy in dynamic and challenging environments.
[70] PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
Yu Wei,Jiahui Zhang,Xiaoqin Zhang,Ling Shao,Shijian Lu
Main category: cs.CV
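TL;DR: PCR-GS is a COLMAP-free 3D Gaussian Splatting technique that uses camera pose co-regularization to achieve high-quality 3D scene modeling and camera pose estimation, addressing scenes with complex camera trajectories.
Details
Motivation: Existing COLMAP-free 3D Gaussian Splatting methods perform poorly on scenes with complex camera trajectories (drastic rotation and translation), where pose estimation degrades and joint optimization falls into local minima, limiting scene modeling quality. Contribution: PCR-GS, which uses two co-regularizations (feature reprojection regularization and wavelet-based frequency regularization) to improve the joint optimization of camera poses and 3D Gaussian Splatting, significantly improving results on complex camera trajectories.
Method: 1. Feature reprojection regularization: align semantic information using DINO features from adjacent views;
2. Wavelet-based frequency regularization: optimize the rotation matrix using discrepancies in high-frequency details.
Result: Across multiple real-world scenes, PCR-GS achieves better pose-free 3D Gaussian Splatting modeling than existing techniques under drastically changing camera trajectories.
Insight: Co-regularization is effective for pose estimation under complex camera trajectories; semantic alignment and frequency-domain analysis are key to the gains.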
Abstract: COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories characterized by drastic rotation and translation across adjacent camera views, leading to degraded camera pose estimation and to local minima in the joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization, which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization, which exploits discrepancies in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
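The wavelet-based term compares high-frequency detail between images. As a rough illustration only, the sketch below computes a one-level 2D Haar transform and the mean absolute difference of the detail coefficients of two grayscale images. The choice of Haar, the averaging normalization, and the names `haar_details` and `high_freq_discrepancy` are assumptions; the abstract does not specify the wavelet or the exact loss.

```python
def haar_details(img):
    """One-level 2D Haar transform of a grayscale image (list of rows).

    Returns the three detail sub-bands (row-difference, column-difference,
    diagonal) as flat lists. Image height and width are assumed to be even.
    """
    row_diff, col_diff, diag = [], [], []
    for i in range(0, len(img), 2):
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            row_diff.append((a + b - c - d) / 4.0)  # top rows vs bottom rows
            col_diff.append((a - b + c - d) / 4.0)  # left cols vs right cols
            diag.append((a - b - c + d) / 4.0)      # diagonal detail
    return row_diff, col_diff, diag


def high_freq_discrepancy(rendered, target):
    """Mean absolute difference between the detail coefficients of two images."""
    total, count = 0.0, 0
    for band_r, band_t in zip(haar_details(rendered), haar_details(target)):
        for r, t in zip(band_r, band_t):
            total += abs(r - t)
            count += 1
    return total / count


flat = [[1.0, 1.0], [1.0, 1.0]]     # no high-frequency content
checker = [[1.0, 0.0], [0.0, 1.0]]  # purely diagonal detail
loss_same = high_freq_discrepancy(checker, checker)
loss_diff = high_freq_discrepancy(flat, checker)
```

In a pose-optimization setting this quantity would be computed between a rendered view and its reference image and minimized with respect to the camera rotation; the sketch covers only the discrepancy itself.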
[71] TimeNeRF: Building Generalizable Neural Radiance Fields across Time from Few-Shot Input Views
Hsiang-Hui Hung,Huu-Phu Do,Yung-Hui Li,Ching-Chun Huang
Main category: cs.CV
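TL;DR: TimeNeRF is a generalizable neural rendering method that renders novel views at arbitrary viewpoints and times from only a few input views, suited to modeling 3D scenes that change over time.
Details
Motivation: Collecting multi-view data in the real world is expensive, and existing NeRF methods have limited ability to model the time dimension. Digital applications such as the metaverse need 3D scene modeling that transitions naturally between day and night. Contribution: 1. TimeNeRF, which generalizes from few input views; 2. An implicit scene representation suited to temporal change; 3. The first time-varying rendering without per-scene optimization.
Method: Combines multi-view stereo, neural radiance fields (NeRF), and disentanglement strategies to build an implicit scene representation that includes the time dimension, then synthesizes novel views via volume rendering.
Result: Experiments show that TimeNeRF renders novel views efficiently from few inputs and transitions smoothly between day and night, capturing complex natural scene dynamics.
Insight: Bringing the time dimension into the NeRF framework extends its ability to model dynamic scenes and provides a new tool for applications such as the metaverse.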
Abstract: We present TimeNeRF, a generalizable neural rendering approach for rendering novel views at arbitrary viewpoints and at arbitrary times, even with few input views. For real-world applications, it is expensive to collect multiple views and inefficient to re-optimize for unseen scenes. Moreover, as the digital realm, particularly the metaverse, strives for increasingly immersive experiences, the ability to model 3D environments that naturally transition between day and night becomes paramount. While current techniques based on Neural Radiance Fields (NeRF) have shown remarkable proficiency in synthesizing novel views, the exploration of NeRF’s potential for temporal 3D scene modeling remains limited, with no dedicated datasets available for this purpose. To this end, our approach harnesses the strengths of multi-view stereo, neural radiance fields, and disentanglement strategies across diverse datasets. This equips our model with the capability for generalizability in a few-shot setting, allows us to construct an implicit content radiance field for scene representation, and further enables the building of neural radiance fields at any arbitrary time. Finally, we synthesize novel views of that time via volume rendering. Experiments show that TimeNeRF can render novel views in a few-shot setting without per-scene optimization. Most notably, it excels in creating realistic novel views that transition smoothly across different times, adeptly capturing intricate natural scene changes from dawn to dusk.
[72] DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization
Marzieh Gheisari,Auguste Genovesio
Main category: cs.CV
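TL;DR: DiViD is a diffusion-based video disentanglement framework and the first to explicitly factorize static and dynamic content. It pairs a sequence encoder with a conditional DDPM decoder and adds several key inductive biases and constraints to improve video disentanglement.
Details
Motivation: Existing VAE- and GAN-based video disentanglement methods suffer from information leakage and blurry reconstructions and cannot cleanly separate static and dynamic content. DiViD aims for cleaner disentanglement with a diffusion model. Contribution: 1. DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization; 2. A sequence encoder and conditional DDPM decoder with several inductive biases (such as a shared-noise schedule and a time-varying KL bottleneck); 3. An orthogonality regularizer that prevents information leakage.
Method: 1. A sequence encoder extracts a global static token and per-frame dynamic tokens; 2. The conditional DDPM decoder uses a shared-noise schedule, a time-varying KL bottleneck, and cross-attention; 3. Orthogonality regularization prevents static-dynamic leakage.
Result: On real-world datasets, DiViD outperforms existing methods on swap-based accuracy and cross-leakage metrics, achieving the highest joint accuracy while preserving static fidelity and dynamic transfer.
Insight: Diffusion models show promise for video disentanglement; explicit factorization plus carefully designed constraints can address the information leakage of earlier methods.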
Abstract: Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD’s sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
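The abstract does not give the form of the orthogonality regularizer. A common choice, assumed here purely for illustration, is to penalize the squared cosine similarity between the static token and each per-frame dynamic token, so the penalty is zero when the two codes are orthogonal. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def orthogonality_penalty(static_token, dynamic_tokens):
    """Mean squared cosine similarity between the static and dynamic tokens."""
    sims = [cosine(static_token, d) ** 2 for d in dynamic_tokens]
    return sum(sims) / len(sims)


static_token = [1.0, 0.0]
# Both dynamic tokens are orthogonal to the static token: no penalty.
penalty_orthogonal = orthogonality_penalty(static_token, [[0.0, 1.0], [0.0, -2.0]])
# The second dynamic token is parallel to the static token: leakage is penalized.
penalty_leaky = orthogonality_penalty(static_token, [[0.0, 1.0], [3.0, 0.0]])
```

Squaring the cosine makes the penalty sign-agnostic and scale-invariant, which is why a parallel token of any length contributes the maximum value of 1.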
[73] Generalist Forecasting with Frozen Video Models via Latent Diffusion
Jacob C Walker,Pedro Vélez,Luisa Polania Cabrera,Guangyao Zhou,Rishabh Kabra,Carl Doersch,Maks Ovsjanikov,João Carreira,Shiry Ginosar
Main category: cs.CV
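TL;DR: This work proposes a generalist forecasting framework that uses frozen pretrained vision models and latent diffusion to forecast future features, and validates it on multiple tasks.
Details
Motivation: Forecasting future dynamics is a key capability for general-purpose systems. The study finds that a vision model’s perceptual ability correlates strongly with its short-horizon forecasting performance, and aims to improve video understanding by combining representation learning with generative modeling. Contribution: 1. A generalist forecasting framework that works with any frozen vision model; 2. Distributional metrics that enable consistent evaluation across tasks; 3. Evidence of a strong correlation between a pretrained model’s perceptual ability and its forecasting performance.
Method: 1. Use latent diffusion models to forecast future features in the frozen representation space; 2. Decode them with lightweight, task-specific readouts; 3. Evaluate across nine pretrained models and four tasks.
Result: The method performs well across multiple tasks, supporting the value of combining representation learning and generative modeling for temporally grounded video understanding.
Insight: 1. The link between perceptual ability and forecasting performance holds across diverse models; 2. The feature space of a frozen pretrained model can be used for efficient forecasting; 3. Distributional metrics give a more complete picture of forecasting performance.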
Abstract: Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model’s perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models, including those trained generatively, and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
[74] CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
Quang-Binh Nguyen,Minh Luu,Quang Nguyen,Anh Tran,Khoi Nguyen
Main category: cs.CV
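TL;DR: This paper proposes CSD-VAR, which performs content-style decomposition (CSD) in visual autoregressive (VAR) models. It introduces three innovations: scale-aware alternating optimization, an SVD-based rectification method, and an augmented key-value memory, which together improve content preservation and stylization.
Details
Motivation: Existing content-style decomposition methods mainly target diffusion models, while VAR models perform well on generation but remain unexplored for this task. The paper uses VAR to achieve better content-style decomposition, improving generative flexibility and quality. Contribution: 1) The first CSD method for VAR; 2) Three new techniques (scale-aware optimization, SVD rectification, augmented key-value memory); 3) The CSD-100 dataset for content-style decomposition.
Method: Within the VAR framework, scale-aware alternating optimization aligns content and style representations with their respective scales, SVD-based rectification reduces content leakage into style, and an augmented key-value memory improves content identity preservation.
Result: Experiments show that CSD-VAR outperforms existing methods in content preservation and stylization fidelity.
Insight: VAR’s next-scale prediction is naturally suited to content-style decomposition, and the proposed techniques improve the decomposition further.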
Abstract: Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored the decomposition of explicit content style, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory enhancing content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
[75] Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations
Yong Feng,Xiaolei Zhang,Shijin Feng,Yong Zhao,Yihan Chen
Main category: cs.CV
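TL;DR: The paper proposes a deep-learning method for automatic classification and segmentation of tunnel cracks. A two-stage model (classification, then segmentation) improves detection accuracy and efficiency, and a visual explanation technique improves interpretability.
Details
Motivation: Tunnel lining cracks are an important indicator of tunnel safety, and traditional inspection methods are inefficient and insufficiently accurate. The study uses deep learning to improve the accuracy and efficiency of crack detection. Contribution: 1. A two-stage deep learning method (DenseNet-169 for classification, DeepLabV3+ for segmentation). 2. A combined classification-segmentation pipeline that streamlines detection. 3. A visual explanation technique that improves model interpretability.
Method: 1. Stage one classifies tunnel images with DenseNet-169 and selects those containing cracks. 2. Stage two segments the cracks with DeepLabV3+. 3. A score-weighted visual explanation technique is used to examine the model’s internal logic.
Result: The classification model reaches 92.23% accuracy at 39.80 FPS; the segmentation model reaches 57.01% IoU and a 67.44% F1 score. Both outperform other state-of-the-art models. The visual explanations help in understanding the model’s decisions.
Insight: The two-stage approach improves detection efficiency, with the combination of classification and segmentation being the key optimization; visual explanations help address the black-box problem.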
Abstract: Tunnel lining cracks are a crucial indicator of a tunnel’s safety status. Aiming to classify and segment tunnel cracks with enhanced accuracy and efficiency, this study proposes a two-step deep learning-based method. An automatic tunnel image classification model is developed using DenseNet-169 in the first step. The proposed crack segmentation model in the second step is based on DeepLabV3+, whose internal logic is evaluated via a score-weighted visual explanation technique. The proposed method combines tunnel image classification and segmentation, so that the images selected in the first step as containing cracks are segmented in the second step to improve detection accuracy and efficiency. The superior performance of the two-step method is validated by experiments. The results show that the accuracy and frames per second (FPS) of the tunnel crack classification model are 92.23% and 39.80, respectively, which are higher than those of other convolutional neural network (CNN)-based and Transformer-based models. Also, the intersection over union (IoU) and F1 score of the tunnel crack segmentation model are 57.01% and 67.44%, respectively, outperforming other state-of-the-art models. Moreover, the visual explanations provided in this study are conducive to understanding the “black box” of deep learning-based models. The developed two-stage deep learning-based method integrating visual explanations provides a basis for fast and accurate quantitative assessment of tunnel health status.
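For reference, the two segmentation metrics quoted above can be computed for a single binary mask as in the sketch below. It uses the standard definitions IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN); the toy masks and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact averaging protocol is not stated in the abstract.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives over a mask."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred, truth):
    """Intersection over union of the crack class."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom  # both masks empty: perfect match


def f1_score(pred, truth):
    """F1 (Dice) score of the crack class."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


pred = [[1, 1], [0, 0]]   # toy predicted crack mask
truth = [[1, 0], [1, 0]]  # toy ground-truth mask
mask_iou = iou(pred, truth)
mask_f1 = f1_score(pred, truth)
```

For a single mask the two metrics are tied by F1 = 2·IoU / (1 + IoU). Figures averaged over many images need not satisfy this identity, which may explain why a reported IoU and F1 pair does not match it exactly.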
[76] Moodifier: MLLM-Enhanced Emotion-Driven Image Editing
Jiarong Ye,Sharon X. Huang
Main category: cs.CV
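TL;DR: The paper proposes Moodifier, a system that combines multimodal large language models (MLLMs) with an emotion-annotated image dataset to enable precise emotion-driven image editing.
Details
Motivation: Emotion-driven image editing has great potential in creative industries, but the abstract and varied nature of emotions makes precise visual transformation difficult. Contribution: 1) MoodArchive, a dataset of more than 8 million images with hierarchical emotional annotations; 2) MoodifyCLIP, a model that maps abstract emotions to specific visual attributes; 3) Moodifier, a training-free editing model that combines MLLMs and MoodifyCLIP for emotion-driven editing.
Method: Three parts: 1) the MoodArchive dataset, with annotations generated by LLaVA and partially validated by humans; 2) MoodifyCLIP, fine-tuned on MoodArchive; 3) Moodifier, a training-free editing framework combining MLLMs and MoodifyCLIP.
Result: Experiments show that Moodifier outperforms existing methods in emotional accuracy and content preservation and applies to domains such as character expressions and fashion design.
Insight: Linking abstract emotions to concrete visual changes opens new possibilities for creative content generation.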
Abstract: Bridging emotions and visual content for emotion-driven image editing holds great potential in creative industries, yet precise manipulation remains challenging due to the abstract nature of emotions and their varied manifestations across different contexts. We tackle this challenge with an integrated approach consisting of three complementary components. First, we introduce MoodArchive, an 8M+ image dataset with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. Second, we develop MoodifyCLIP, a vision-language model fine-tuned on MoodArchive to translate abstract emotions into specific visual attributes. Third, we propose Moodifier, a training-free editing model leveraging MoodifyCLIP and multimodal large language models (MLLMs) to enable precise emotional transformations while preserving content integrity. Our system works across diverse domains such as character expressions, fashion design, jewelry, and home décor, enabling creators to quickly visualize emotional variations while preserving identity and structure. Extensive experimental evaluations show that Moodifier outperforms existing methods in both emotional accuracy and content preservation, providing contextually appropriate edits. By linking abstract emotions to concrete visual changes, our solution unlocks new possibilities for emotional content creation in real-world applications. We will release the MoodArchive dataset and the MoodifyCLIP model, and make the Moodifier code and demo publicly available upon acceptance.
[77] Training-free Token Reduction for Vision Mamba
Qiankun Ma,Ziyao Zhang,Chi Su,Jie Chen,Zhen Song,Hairong Zheng,Wen Gao
Main category: cs.CV
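TL;DR: This paper proposes MTR, a training-free token reduction method for Vision Mamba. A Mamba structure-aware importance score lets it cut computation substantially while preserving model performance.
Details
Motivation: Vision Mamba models long-range dependencies well thanks to its linear computational complexity, but token reduction for it is underexplored. Directly applying ViT token reduction methods to Mamba causes significant performance degradation, so a method better suited to Mamba is needed. Contribution: 1. A Mamba structure-aware importance score. 2. The MTR framework, a training-free plug-and-play component for Mamba models. 3. Validation across multiple tasks and backbones.
Method: 1. Score token importance using a measure based on Mamba’s structure. 2. Remove redundant tokens dynamically within the MTR framework to improve computational efficiency. 3. Require no additional training or tuning parameters.
Result: On the Vim-B backbone, MTR reduces FLOPs by about 40% with only a 1.6% drop in ImageNet performance and no retraining.
Insight: Because Mamba lacks an attention mechanism, it needs a different token reduction approach than ViTs; MTR balances efficiency and performance by evaluating token importance dynamically.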
Abstract: Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. Yet token reduction, an effective compression technique in ViTs, has rarely been explored in Vision Mamba. Exploring Vision Mamba’s efficiency is essential for enabling broader applications. However, we find that directly applying existing token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention mechanisms for importance measurement and overlook the order of compressed tokens. In this paper, we investigate a Mamba structure-aware importance score to evaluate token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free Mamba Token Reduction framework. Without the need for training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component across various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.
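Because Mamba processes tokens as an ordered sequence, a reduction step has to keep the surviving tokens in their original scan order. The sketch below shows only that order-preserving top-k selection. The importance scores are passed in as plain numbers and stand in for the paper's Mamba structure-aware score, which the abstract does not define; `reduce_tokens` and the toy data are hypothetical.

```python
def reduce_tokens(tokens, scores, num_keep):
    """Keep the num_keep highest-scoring tokens, preserving sequence order.

    Returns the kept tokens and their original indices.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    num_keep = max(1, min(num_keep, len(tokens)))
    # Rank token positions by importance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by position so the scan order is unchanged.
    kept_indices = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept_indices], kept_indices


patches = ["p0", "p1", "p2", "p3", "p4"]  # toy patch tokens
importance = [0.1, 0.9, 0.3, 0.8, 0.2]    # placeholder importance scores
kept_tokens, kept_indices = reduce_tokens(patches, importance, num_keep=3)
```

Here the three most important patches are p1, p3, and p2 in score order, but they are returned as p1, p2, p3 so that a downstream sequence model still sees them in scan order.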
[78] VLA-Mark: A cross modal watermark for large vision-language alignment model
Shuliang Liu,Qi Zheng,Jesse Jiaxi Xu,Yibo Yan,He Geng,Aiwei Liu,Peijie Jiang,Jia Liu,Yik-Cheung Tam,Xuming Hu
Main category: cs.CV
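TL;DR: VLA-Mark is a cross-modal watermarking framework that uses vision-language alignment to protect the copyright of multimodal models while preserving semantic fidelity.
Details
Motivation: Existing text watermarking methods disrupt visual-textual alignment; a solution is needed that protects intellectual property without harming multimodal coherence. Contribution: VLA-Mark, a multiscale watermarking framework based on vision-language alignment that embeds detectable watermarks without retraining the model and preserves semantic fidelity.
Method: Integrates multiscale visual-textual alignment metrics (localized patch affinity, global semantic coherence, and contextual attention patterns) and dynamically balances watermark strength against semantic preservation.
Result: VLA-Mark achieves 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with 98.8% detection AUC and 96.1% attack resilience.
Insight: Through cross-modal coordination and an entropy-sensitive mechanism, VLA-Mark shows how to embed watermarks in multimodal models efficiently while keeping semantics consistent.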
Abstract: Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
[79] Unmasking Performance Gaps: A Comparative Study of Human Anonymization and Its Effects on Video Anomaly Detection
Sara Abdulaziz,Egor Bondarev
Main category: cs.CV
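TL;DR: The paper compares how four human anonymization techniques affect video anomaly detection performance, revealing how sensitive algorithm designs are to anonymized data and the trade-off between privacy and detection utility.
Details
Motivation: Deep learning has improved video anomaly detection but raises privacy concerns, so the effect of anonymization on detection needs study. Contribution: 1. A comparison of how four anonymization techniques affect anomaly detection performance; 2. An analysis of how algorithm design responds to anonymization; 3. A characterization of the privacy-utility trade-off.
Method: Apply four anonymization techniques (blurring, masking, encryption, avatar replacement) to the UCF-Crime dataset and evaluate four anomaly detection methods (MGFN, UR-DMU, BN-WVAD, PEL4VAD).
Result: Anomaly detection remains viable on anonymized data; some anonymization patterns (such as encryption and masking) even raise the AUC of certain models, and results depend on algorithm design.
Insight: Algorithm designs have specific sensitivities to anonymization; privacy protection must be traded against detection utility, and privacy-by-design solutions need to account for flexibility.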
Abstract: Advancements in deep learning have improved anomaly detection in surveillance videos, yet they raise urgent privacy concerns due to the collection of sensitive human data. In this paper, we present a comprehensive analysis of anomaly detection performance under four human anonymization techniques, including blurring, masking, encryption, and avatar replacement, applied to the UCF-Crime dataset. We evaluate four anomaly detection methods, MGFN, UR-DMU, BN-WVAD, and PEL4VAD, on the anonymized UCF-Crime to reveal how each method responds to different obfuscation techniques. Experimental results demonstrate that anomaly detection remains viable under anonymized data and is dependent on the algorithmic design and the learning strategy. For instance, under certain anonymization patterns, such as encryption and masking, some models inadvertently achieve higher AUC performance compared to raw data, due to the strong responsiveness of their algorithmic components to these noise patterns. These results highlight the algorithm-specific sensitivities to anonymization and emphasize the trade-off between preserving privacy and maintaining detection utility. Furthermore, we compare these conventional anonymization techniques with the emerging privacy-by-design solutions, highlighting an often overlooked trade-off between robust privacy protection and utility flexibility. Through comprehensive experiments and analyses, this study provides a compelling benchmark and insights into balancing human privacy with the demands of anomaly detection.
[80] C-DOG: Training-Free Multi-View Multi-Object Association in Dense Scenes Without Visual Feature via Connected δ-Overlap Graphs
Yung-Hong Sun,Ting-Hung Lin,Jiangang Chen,Hongrui Jiang,Yu Hen Hu
Main category: cs.CV
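TL;DR: C-DOG is a training-free multi-view multi-object association framework that uses connected δ-overlap graphs and epipolar geometry to reliably associate object detections in dense scenes without visual features.
Details
Motivation: Existing methods rely on visual features or geometric constraints and perform poorly when objects look alike or observations are noisy. C-DOG addresses this with a training-free, robust association method. Contribution: C-DOG, a training-free association framework that combines connected δ-overlap graphs with epipolar geometry to associate multi-view detections reliably without visual features.
Method: Represent 2D observations as graph nodes with edges weighted by epipolar consistency; identify strongly consistent groups with δ-neighbor-overlap clustering; remove inconsistent observations with IQR filtering and a 3D back-projection error criterion.
Result: On synthetic benchmarks, C-DOG outperforms geometry-based baselines and stays robust under high object density, absent visual features, and limited camera overlap.
Insight: C-DOG suits scalable 3D reconstruction in real-world settings, especially where visual features are unreliable or unavailable.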
Abstract: Multi-view multi-object association is a fundamental step in 3D reconstruction pipelines, enabling consistent grouping of object instances across multiple camera views. Existing methods often rely on appearance features or geometric constraints such as epipolar consistency. However, these approaches can fail when objects are visually indistinguishable or observations are corrupted by noise. We propose C-DOG, a training-free framework that serves as an intermediate module bridging object detection (or pose estimation) and 3D reconstruction, without relying on visual features. It combines connected delta-overlap graph modeling with epipolar geometry to robustly associate detections across views. Each 2D observation is represented as a graph node, with edges weighted by epipolar consistency. A delta-neighbor-overlap clustering step identifies strongly consistent groups while tolerating noise and partial connectivity. To further improve robustness, we incorporate Interquartile Range (IQR)-based filtering and a 3D back-projection error criterion to eliminate inconsistent observations. Extensive experiments on synthetic benchmarks demonstrate that C-DOG outperforms geometry-based baselines and remains robust under challenging conditions, including high object density, absence of visual features, and limited camera overlap, making it well-suited for scalable 3D reconstruction in real-world scenarios.
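Two of the building blocks named above, epipolar consistency and IQR-based filtering, can be sketched in a few lines. The example below uses the textbook point-to-epipolar-line distance and the usual 1.5·IQR fence; how C-DOG turns distances into edge weights and where exactly it applies the filter are not specified in the abstract, so the function names, the fence factor, and the toy data are assumptions.

```python
import math
import statistics


def epipolar_distance(F, x1, x2):
    """Distance (in pixels) from point x2 in view 2 to the epipolar line F @ x1."""
    p1 = (x1[0], x1[1], 1.0)
    line = [sum(F[r][c] * p1[c] for c in range(3)) for r in range(3)]
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        return float("inf")  # degenerate line: treat the pair as inconsistent
    return abs(line[0] * x2[0] + line[1] * x2[1] + line[2]) / norm


def iqr_filter(errors, k=1.5):
    """Indices of the errors that lie inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, e in enumerate(errors) if low <= e <= high]


# Fundamental matrix of an ideal rectified stereo pair (pure horizontal shift):
# corresponding points must lie on the same image row.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

d_match = epipolar_distance(F_RECTIFIED, (10.0, 5.0), (3.0, 5.0))     # same row
d_mismatch = epipolar_distance(F_RECTIFIED, (10.0, 5.0), (3.0, 7.0))  # 2 rows off
kept = iqr_filter([1.0, 1.1, 0.9, 1.2, 1.0, 10.0])  # last error is an outlier
```

A small distance suggests that two detections may belong to the same object, and the IQR fence discards observations whose error is far outside the bulk of the group.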
[81] Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Shashanka Venkataramanan,Valentinos Pariza,Mohammadreza Salehi,Lukas Knobel,Spyros Gidaris,Elias Ramzi,Andrei Bursuc,Yuki M. Asano
Main category: cs.CV
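TL;DR: Franca is a fully open-source vision foundation model that matches or surpasses existing proprietary models (such as DINOv2 and CLIP), built with a transparent training pipeline and public datasets. Its main innovations, a nested Matryoshka clustering projector and a positional disentanglement strategy, improve the semantic quality of the feature space and downstream performance.
Details
Motivation: Proprietary vision foundation models (such as DINOv2 and CLIP) perform well but lack transparency, and clustering methods in self-supervised learning (SSL) suffer from semantic ambiguity. Franca aims to give the community a high-performance, transparent open-source model and to address the limitations of clustering methods. Contribution: 1. The first fully open-source (data, code, weights) vision foundation model; 2. A nested Matryoshka clustering projector that refines features at multiple levels without adding model parameters; 3. A positional disentanglement strategy that removes positional bias and improves semantic encoding; 4. Gains on multiple downstream tasks.
Method: 1. A transparent training pipeline on public data (ImageNet-21K and a subset of ReLAION-2B); 2. A multi-head nested Matryoshka clustering projector that progressively refines features; 3. A positional disentanglement strategy that explicitly removes positional bias from dense representations.
Result: Franca matches or surpasses existing proprietary models (such as DINOv2 and CLIP) on multiple downstream tasks, demonstrating its efficiency and generalizability.
Insight: 1. Transparency and open source are important for advancing general-purpose foundation models; 2. Multi-level clustering addresses semantic ambiguity; 3. Positional disentanglement yields a cleaner semantic feature space.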
Abstract: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
cs.RO [Back]
[82] EdgeVLA: Efficient Vision-Language-Action Models
Paweł Budzianowski,Wesley Maa,Matthew Freed,Jingxiang Mo,Winston Hsiao,Aaron Xie,Tomasz Młoduchowski,Viraj Tipnis,Benjamin Bolte
Main category: cs.RO
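TL;DR: The paper proposes EdgeVLA (EVLA), an efficient vision-language-action model aimed at deploying large VLA models on resource-constrained mobile devices. It speeds up inference by removing the autoregressive requirement and using a small language model (SLM).
Details
Motivation: Large VLM-based models such as OpenVLA show promise in robotics, but real-time deployment on resource-constrained devices remains difficult, so an efficient solution that preserves model performance is needed. Contribution: 1) Removing the autoregressive requirement for end-effector position prediction, giving a 7x inference speedup; 2) Using a small language model (SLM) to lower compute demands while maintaining performance.
Method: EVLA speeds up inference by removing the autoregressive dependency and exploiting the efficiency of SLMs, which involves changes to the model architecture and more economical use of compute.
Result: EVLA matches OpenVLA’s training performance while substantially improving inference speed and memory efficiency.
Insight: Optimizing model structure and compute demands makes efficient vision-language-action deployment on edge devices feasible.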
Abstract: Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training codebase (https://github.com/kscalelabs/evla) to foster further research.
[83] Safety Certification in the Latent space using Control Barrier Functions and World Models
Mehul Anand,Shishir Kolathaya
Main category: cs.RO
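TL;DR: The paper proposes a semi-supervised framework that learns control barrier certificates (CBCs) in the latent space of a world model to synthesize safe visuomotor policies.
Details
Motivation: Synthesizing safe controllers from visual data usually requires large amounts of labeled safety-critical data, which is often impractical in real-world settings. The predictive power of a world model’s latent space makes safe control achievable more efficiently. Contribution: 1. A semi-supervised framework combining limited labeled data with a world model’s predictive power; 2. A neural barrier function and safe controller learned in latent space; 3. Latent dynamics modeled with modern vision transformers.
Method: 1. Learn CBCs in the latent space; 2. Jointly optimize a neural barrier function and a safe controller; 3. Model latent dynamics with vision transformers.
Result: The method synthesizes safe visuomotor policies from limited labeled data, showing that safe control in latent space can scale.
Insight: Latent-space prediction opens a new direction for safe control; combining semi-supervised learning with modern vision models can greatly reduce the need for labeled data.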
Abstract: Synthesising safe controllers from visual data typically requires extensive supervised labelling of safety-critical data, which is often impractical in real-world settings. Recent advances in world models enable reliable prediction in latent spaces, opening new avenues for scalable and data-efficient safe control. In this work, we introduce a semi-supervised framework that leverages control barrier certificates (CBCs) learned in the latent space of a world model to synthesise safe visuomotor policies. Our approach jointly learns a neural barrier function and a safe controller using limited labelled data, while exploiting the predictive power of modern vision transformers for latent dynamics modelling.
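To make the barrier idea concrete, the sketch below checks a standard discrete-time control barrier condition, h(z') ≥ (1 − α)·h(z) with α in (0, 1], along a rollout of latent states. The hand-written `barrier` function stands in for the learned neural barrier, and the rollouts, the value of α, and the function names are hypothetical; the abstract does not state the exact certificate conditions the authors use.

```python
def barrier(z):
    """Toy barrier: positive inside the unit ball of the latent space (safe set)."""
    return 1.0 - sum(x * x for x in z)


def satisfies_cbf(h, trajectory, alpha):
    """Check h(z_next) >= (1 - alpha) * h(z) for every consecutive pair of states."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return all(
        h(z_next) >= (1.0 - alpha) * h(z)
        for z, z_next in zip(trajectory, trajectory[1:])
    )


# Latent rollouts, e.g. as predicted by a world model (made-up numbers).
safe_rollout = [[0.0, 0.0], [0.5, 0.0], [0.7, 0.0]]
unsafe_rollout = [[0.0, 0.0], [0.9, 0.0], [1.2, 0.0]]

safe_ok = satisfies_cbf(barrier, safe_rollout, alpha=0.5)
unsafe_ok = satisfies_cbf(barrier, unsafe_rollout, alpha=0.5)
```

The condition lets the barrier value shrink by at most a factor of (1 − α) per step, so a state that starts with h ≥ 0 cannot cross to h < 0 while the condition holds. In a learning setup, violations of this inequality would typically be turned into a loss term for the barrier network and the controller.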
cs.AI [Back]
[84] DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs
Ye Tian,Xiaoyuan Ren,Zihao Wang,Onat Gungor,Xiaofan Yu,Tajana Rosing
Main category: cs.AI
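TL;DR: DailyLLM is a lightweight LLM framework that uses multimodal sensor data to improve the accuracy, efficiency, and semantic richness of activity log generation.
Details
Motivation: Existing activity log generation methods fall short in accuracy, efficiency, and semantic richness; the semantic understanding of LLMs offers a new way to address this. Contribution: DailyLLM, a lightweight LLM framework that is the first to integrate contextual information across four dimensions (location, motion, environment, and physiology) and that outperforms SOTA methods.
Method: Structured prompting and efficient feature extraction integrate multimodal sensor data to enable high-level activity understanding.
Result: With a 1.5B-parameter model, DailyLLM improves BERTScore precision by 17% over the 70B-parameter SOTA baseline and runs inference nearly 10x faster.
Insight: A lightweight LLM combined with multimodal data can generate high-quality activity logs efficiently and suits deployment on resource-constrained devices.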
Abstract: Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.
[85] Cross-modal Causal Intervention for Alzheimer’s Disease Prediction
Yutao Jin,Haowen Xiao,Jielei Chu,Fengmao Lv,Yuxiao Li,Tianrui Li
Main category: cs.AI
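TL;DR: The paper proposes ADPC, a visual-language causal intervention framework that assists Alzheimer’s disease (AD) prediction and improves classification by eliminating confounders.
Details
Motivation: AD diagnosis is hampered by selection bias in multimodal data and complex relationships between variables; non-causal models tend to capture spurious correlations and produce unreliable results. Contribution: The ADPC framework, which uses a large language model (LLM) to generate structured text data and eliminates confounders through causal intervention.
Method: Uses MRI and fMRI images together with LLM-generated text, and applies a causal intervention model to classify participants as CN, MCI, or AD.
Result: ADPC performs strongly at distinguishing CN/MCI/AD, reaching SOTA on most evaluation metrics.
Insight: Combining causal reasoning with multimodal learning can improve the reliability of neurological disease diagnosis.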
Abstract: Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer’s Disease (AD), where early identification and intervention can effectively slow the progression to dementia. However, diagnosing AD remains a significant challenge in neurology due to the confounders caused mainly by the selection bias of multimodal data and the complex relationships between variables. To address these issues, we propose a novel visual-language causal intervention framework named Alzheimer’s Disease Prediction with Cross-modal Causal Intervention (ADPC) for diagnostic assistance. Our ADPC employs a large language model (LLM) to summarize clinical data under strict templates, maintaining structured text outputs even with incomplete or unevenly distributed datasets. The ADPC model utilizes Magnetic Resonance Imaging (MRI), functional MRI (fMRI) images and textual data generated by the LLM to classify participants into Cognitively Normal (CN), MCI, and AD categories. Because of the presence of confounders, such as neuroimaging artifacts and age-related biomarkers, non-causal models are likely to capture spurious input-output correlations, generating less reliable results. Our framework implicitly eliminates confounders through causal intervention. Experimental results demonstrate the outstanding performance of our method in distinguishing CN/MCI/AD cases, achieving state-of-the-art (SOTA) results across most evaluation metrics. The study showcases the potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis.
[86] Generative AI-Driven High-Fidelity Human Motion Simulation
Hari Iyer,Neel Macwan,Atharva Jitendra Hude,Heejin Jeong,Shenghan Guo
Main category: cs.AI
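TL;DR: The paper proposes a generative-AI-driven, high-fidelity human motion simulation method (G-AI-HMS) that combines text-to-text and text-to-motion models to improve the quality of motion simulation for industrial tasks.
Details
Motivation: Existing human motion simulation methods suffer from low motion fidelity, which limits cost-effective evaluation of worker behavior, safety, and productivity. The paper uses generative AI to raise simulation quality.
Contribution: 1) A G-AI-HMS framework that combines text-to-text and text-to-motion models to generate high-fidelity motion sequences; 2) validation of AI-generated motions against real human movements using computer vision.
Method: Two steps: 1) a large language model aligned with MotionGPT’s training vocabulary translates task descriptions into motion-aware language; 2) posture estimation on real-time videos and motion similarity metrics are used to check the AI-generated motions against real movements.
Result: In a case study of eight tasks, AI-enhanced motions showed lower error than motions from human-created descriptions in most scenarios: six tasks by spatial accuracy, four by alignment after pose normalization, and seven by overall temporal similarity. Statistical analysis showed that AI-enhanced prompts significantly reduced joint error and temporal misalignment.
Insight: Generative AI can improve the fidelity of motion simulation, especially when paired with computer-vision validation; the approach offers a new direction for human motion simulation in industrial tasks.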
Abstract: Human motion simulation (HMS) supports cost-effective evaluation of worker behavior, safety, and productivity in industrial tasks. However, existing methods often suffer from low motion fidelity. This study introduces Generative-AI-Enabled HMS (G-AI-HMS), which integrates text-to-text and text-to-motion models to enhance simulation quality for physical tasks. G-AI-HMS tackles two key challenges: (1) translating task descriptions into motion-aware language using Large Language Models aligned with MotionGPT’s training vocabulary, and (2) validating AI-enhanced motions against real human movements using computer vision. Posture estimation algorithms are applied to real-time videos to extract joint landmarks, and motion similarity metrics are used to compare them with AI-enhanced sequences. In a case study involving eight tasks, the AI-enhanced motions showed lower error than human-created descriptions in most scenarios, performing better in six tasks based on spatial accuracy, four tasks based on alignment after pose normalization, and seven tasks based on overall temporal similarity. Statistical analysis showed that AI-enhanced prompts significantly ($p < 0.0001$) reduced joint error and temporal misalignment while retaining comparable posture accuracy.
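The abstract does not name the motion similarity metrics it uses. As a rough illustration of what comparing joint landmarks against a generated sequence can look like, the sketch below computes a mean per-joint distance for one frame and a dynamic-time-warping (DTW) cost for two sequences of different length. Both metrics, and the names `frame_error` and `dtw_distance`, are assumptions made for illustration, not the paper's evaluation code.

```python
import math


def frame_error(pose_a, pose_b):
    """Mean Euclidean distance between matching joints of two poses."""
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)


def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping cost between two pose sequences.

    Each sequence is a list of poses; each pose is a list of (x, y) joints.
    DTW is an assumed choice of temporal similarity measure.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_error(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy data: two joints per pose, two frames.
real = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
slow = [real[0], real[0], real[1]]  # same motion, first frame held longer
shifted = [[(x + 3.0, y) for x, y in pose] for pose in real]  # spatial offset

same_motion_cost = dtw_distance(real, slow)
offset_cost = dtw_distance(real, shifted)
```

In this toy case the time-stretched copy aligns with zero cost, while the spatially shifted copy accumulates the offset in every aligned frame, which is the kind of separation between temporal and spatial error the abstract reports.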
cs.IR [Back]
[87] DyG-RAG: Dynamic Graph Retrieval-Augmented Generation with Event-Centric Reasoning
Qingyun Sun,Jiaqi Yuan,Shan He,Xiao Guan,Haonan Yuan,Xingcheng Fu,Jianxin Li,Philip S. Yu
Main category: cs.IR
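TL;DR: DyG-RAG is a dynamic graph retrieval-augmented generation framework for temporal reasoning; using Dynamic Event Units and an event graph, it retrieves temporal knowledge efficiently and generates temporally consistent answers.
Details
Motivation: Existing Graph RAG methods struggle to model the evolving temporal structure of real-world events, which limits temporal reasoning.
Contribution: 1. Dynamic Event Units (DEUs) that explicitly encode semantic content and temporal anchors; 2. an event graph that captures temporal and causal relations between events; 3. time-aware retrieval and generation strategies.
Method: Encodes temporal information in DEUs, builds an event graph that links related events, and uses a Time Chain-of-Thought strategy to generate temporally consistent answers.
Result: Significantly improves accuracy and recall on temporal reasoning questions in temporal QA benchmarks.
Insight: Event-centric temporal modeling addresses the temporal ambiguity of traditional RAG and provides reliable support for complex, time-sensitive queries.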
Abstract: Graph Retrieval-Augmented Generation has emerged as a powerful paradigm for grounding large language models with external structured knowledge. However, existing Graph RAG methods struggle with temporal reasoning, due to their inability to model the evolving structure and order of real-world events. In this work, we introduce DyG-RAG, a novel event-centric dynamic graph retrieval-augmented generation framework designed to capture and reason over temporal knowledge embedded in unstructured text. To eliminate temporal ambiguity in traditional retrieval units, DyG-RAG proposes Dynamic Event Units (DEUs) that explicitly encode both semantic content and precise temporal anchors, enabling accurate and interpretable time-aware retrieval. To capture temporal and causal dependencies across events, DyG-RAG constructs an event graph by linking DEUs that share entities and occur close in time, supporting efficient and meaningful multi-hop reasoning. To ensure temporally consistent generation, DyG-RAG introduces an event timeline retrieval pipeline that retrieves event sequences via time-aware traversal, and proposes a Time Chain-of-Thought strategy for temporally grounded answer generation. This unified pipeline enables DyG-RAG to retrieve coherent, temporally ordered event sequences and to answer complex, time-sensitive queries that standard RAG systems cannot resolve. Extensive experiments on temporal QA benchmarks demonstrate that DyG-RAG significantly improves the accuracy and recall of three typical types of temporal reasoning questions, paving the way for more faithful and temporal-aware generation. DyG-RAG is available at https://github.com/RingBDStack/DyG-RAG.
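The linking rule described in the abstract (connect DEUs that share entities and occur close in time, then retrieve a time-ordered event sequence) can be sketched in a few lines. The `EventUnit` class, the 30-day window, and the traversal below are illustrative assumptions; they are not taken from the DyG-RAG repository.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class EventUnit:
    """Hypothetical stand-in for a Dynamic Event Unit (DEU)."""
    uid: str
    text: str
    when: date
    entities: frozenset


def build_event_graph(units, max_gap_days=30):
    """Link units that share an entity and lie within max_gap_days of each other."""
    edges = {u.uid: set() for u in units}
    for a, b in combinations(units, 2):
        close = abs((a.when - b.when).days) <= max_gap_days
        if close and a.entities & b.entities:
            edges[a.uid].add(b.uid)
            edges[b.uid].add(a.uid)
    return edges


def timeline(units, edges, start_uid):
    """Collect every unit reachable from start_uid and sort the result by time."""
    by_id = {u.uid: u for u in units}
    seen, stack = {start_uid}, [start_uid]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return [u.uid for u in sorted((by_id[i] for i in seen), key=lambda u: u.when)]


units = [
    EventUnit("e1", "Acme announces merger", date(2024, 1, 5), frozenset({"Acme", "Bolt"})),
    EventUnit("e2", "Regulator reviews Acme deal", date(2024, 1, 20), frozenset({"Acme"})),
    EventUnit("e3", "Bolt CEO resigns", date(2024, 6, 1), frozenset({"Bolt"})),
]
graph = build_event_graph(units)
ordered = timeline(units, graph, "e2")
```

Here `e1` and `e2` are linked because they share an entity and are 15 days apart, while `e3` shares an entity with `e1` but falls outside the window, so a query anchored at `e2` retrieves only the two related events in chronological order.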
[88] SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection
Aleksandr Gashkov,Aleksandr Perevalov,Maria Eltsova,Andreas Both
Main category: cs.IR
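TL;DR: The paper proposes a new method for evaluating LLMs on generating SPARQL queries from natural-language questions under zero-shot, knowledge-injection, and anonymized knowledge-injection conditions, in order to measure the effect of training data memorization.
Details
Motivation: Generating SPARQL queries is a core task in question answering over knowledge graphs (KGQA). LLMs could improve QA quality, but their training data may include the benchmark or the knowledge graph, which can bias results. The paper aims to measure how training data memorization affects LLM performance.
Contribution: 1. A new method for evaluating the quality of LLM-generated SPARQL queries; 2. tests of LLM performance under zero-shot, knowledge-injection, and anonymized knowledge-injection conditions; 3. the first estimate of the influence of training data memorization on LLM results.
Method: 1. Generate SPARQL queries zero-shot; 2. test the LLM with knowledge injection (using the knowledge graph directly) and with anonymized knowledge injection (hiding knowledge graph information); 3. compare performance across conditions to assess the effect of memorization.
Result: Experiments show how LLM performance on SPARQL generation differs across the conditions, revealing the potential influence of training data memorization.
Insight: LLM performance on KGQA may depend partly on memorized training data rather than genuine reasoning ability, which matters when assessing the portability and applicability of LLM-based methods.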
Abstract: Nowadays, the importance of software with natural-language user interfaces should not be underestimated. In particular, in Question Answering (QA) systems, generating a SPARQL query for a given natural-language question (often named Query Building) from the information retrieved from the same question is the central task of QA systems working over Knowledge Graphs (KGQA). Due to the rise of Large Language Models (LLMs), they are considered a well-suited method to increase the quality of the question-answering functionality, as there is still a lot of room for improvement, aiming for enhanced quality and trustworthiness. However, LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question under various conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with “anonymized” knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality improved by LLMs. Ultimately, this will help to identify how portable a method is or whether good results might mostly be achieved because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; therefore, it could be easily applied to any KGQA or LLM, so that generating consistent insights into the actual LLM capabilities is possible.
cs.CR [Back]
[89] A Novel APVD Steganography Technique Incorporating Pseudorandom Pixel Selection for Robust Image Security
Mehrab Hosain,Rajiv Kapoor
Main category: cs.CR
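TL;DR: The paper proposes a new steganography technique that combines Adaptive Pixel Value Differencing (APVD) with pseudorandom pixel selection, addressing the “unused blocks” problem of conventional APVD and improving security, data hiding capacity, and image quality.
Details
Motivation: APVD steganography suffers from the “unused blocks” problem, which reduces security, limits embedding capacity, and lowers visual quality. The authors propose a method with pseudorandom pixel selection to address this.
Contribution: A steganographic technique that combines APVD with pseudorandom pixel selection, mitigating the “unused blocks” problem and improving security, data hiding capacity, and image quality.
Method: Combines APVD with pseudorandom pixel selection to improve embedding efficiency and avoid the unused-blocks problem; the method supports both color and grayscale images.
Result: Experiments show that the new method outperforms existing techniques on Peak Signal-to-Noise Ratio (PSNR), Universal Image Quality Index (UIQ), and Structural Similarity Index (SSIM).
Insight: Combining pseudorandom pixel selection with APVD improves steganographic security and increases data hiding capacity without sacrificing visual quality.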
Abstract: Steganography is the process of embedding secret information discreetly within a carrier, ensuring secure exchange of confidential data. The Adaptive Pixel Value Differencing (APVD) steganography method, while effective, encounters certain challenges like the “unused blocks” issue. This problem can cause a decrease in security, compromise the embedding capacity, and lead to lower visual quality. This research presents a novel steganographic strategy that integrates APVD with pseudorandom pixel selection to effectively mitigate these issues. The results indicate that the new method outperforms existing techniques in aspects of security, data hiding capacity, and the preservation of image quality. Empirical results reveal that the combination of APVD with pseudorandom pixel selection significantly enhances key image quality metrics such as Peak Signal-to-Noise Ratio (PSNR), Universal Image Quality Index (UIQ), and Structural Similarity Index (SSIM), surpassing other contemporary methods in performance. The newly proposed method is versatile, able to handle a variety of cover and secret images in both color and grayscale, thereby ensuring secure data transmission without compromising the aesthetic quality of the image.
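The abstract does not specify how pixels are selected pseudorandomly. One common way to realize the idea is to derive the order in which pixel blocks are visited from a key shared by sender and receiver, as in the sketch below. The function name `block_order` and the key values are hypothetical, the APVD embedding step itself is omitted, and Python's `random` module is used only for illustration since it is not a cryptographically secure generator.

```python
import random


def block_order(num_blocks, key):
    """Return a key-dependent permutation of block indices.

    Sender and receiver call this with the same key to visit blocks
    in the same pseudorandom order.
    """
    order = list(range(num_blocks))
    random.Random(key).shuffle(order)  # seeded generator, so the order is reproducible
    return order


sender = block_order(8, key=1234)
receiver = block_order(8, key=1234)
```

Because the permutation covers every block index exactly once, a traversal of this kind visits each block, and an observer without the key cannot tell which block holds which part of the payload from the visiting order alone.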
[90] GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention
Amro Abdalla,Ismail Shaheen,Dan DeGenaro,Rupayan Mallick,Bogdan Raita,Sarah Adel Bargal
Main category: cs.CR
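TL;DR: The paper proposes GIFT, a gradient-aware immunization technique that protects diffusion models against malicious fine-tuning while preserving their ability to generate safe content.
Details
Motivation: Existing safety mechanisms such as safety checkers are easily bypassed, and concept erasure methods fail under adversarial fine-tuning, so a more effective defense against malicious fine-tuning is needed.
Contribution: GIFT frames immunization as a bi-level optimization problem: the upper-level objective degrades the model’s ability to represent harmful concepts using representation noising and maximization, while the lower-level objective preserves performance on safe data.
Method: A bi-level optimization framework in which the upper level uses representation noising and maximization to degrade representations of harmful concepts and the lower level optimizes to retain generation performance on safe content.
Result: Experiments show that GIFT significantly impairs the model’s ability to re-learn harmful concepts while maintaining generation quality on safe content.
Insight: GIFT points toward inherently safer generative models, in particular ones that are robust to adversarial fine-tuning attacks.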
Abstract: We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms like safety checkers are easily bypassed, and concept erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model’s ability to represent harmful concepts using representation noising and maximization, while the lower-level objective preserves performance on safe data. GIFT achieves robust resistance to malicious fine-tuning while maintaining safe generative quality. Experimental results show that our method significantly impairs the model’s ability to re-learn harmful concepts while maintaining performance on safe content, offering a promising direction for creating inherently safer generative models resistant to adversarial fine-tuning attacks.
q-bio.NC [Back]
[91] Convergent transformations of visual representation in brains and models
Pablo Marcos-Manchón,Lluís Fuentemilla
Main category: q-bio.NC
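TL;DR: Using a unified framework that combines inter-subject similarity with alignment to model hierarchies, the paper reveals a convergent computational solution for visual encoding in humans and deep neural networks.
Details
Motivation: To investigate whether visual perception is shaped by the structure of the external world or by the brain’s internal architecture, and to test whether human brains and deep neural networks follow a common trajectory in transforming visual representations.
Contribution: A unified framework for tracing representational flow; it reveals two conserved functional pathways in the cortex and shows that the hierarchies of vision DNNs capture this organization.
Method: Combines inter-subject analysis of fMRI datasets with alignment to model hierarchies to identify conserved functional pathways, and compares them against different models (vision DNNs and language models).
Result: Identifies two cortical pathways (a medial-ventral stream and a lateral-dorsal stream); the hierarchies of vision DNNs, but not language models, match the human transformation of visual representations.
Insight: The structure of the external world drives a convergent computational solution for visual encoding in human and artificial vision, underscoring the specificity of the visual transformation.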
Abstract: A fundamental question in cognitive neuroscience is what shapes visual perception: the external world’s structure or the brain’s internal architecture. Although some perceptual variability can be traced to individual differences, brain responses to naturalistic stimuli evoke similar activity patterns across individuals, suggesting a convergent representational principle. Here, we test if this stimulus-driven convergence follows a common trajectory across people and deep neural networks (DNNs) during its transformation from sensory to high-level internal representations. We introduce a unified framework that traces representational flow by combining inter-subject similarity with alignment to model hierarchies. Applying this framework to three independent fMRI datasets of visual scene perception, we reveal a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This functional organization is captured by the hierarchies of vision DNNs but not language models, reinforcing the specificity of the visual-to-semantic transformation. These findings show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.
eess.IV [Back]
[92] Flatten Wisely: How Patch Order Shapes Mamba-Powered Vision for MRI Segmentation
Osama Hardan,Omar Elshenhabi,Tamer Khattab,Mohamed Mabrok
Main category: eess.IV
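TL;DR: The paper presents the first systematic study of how patch scan order affects Vision Mamba models in MRI segmentation, introduces a parameter-free module (MS2D), and shows through a large-scale benchmark that scan order significantly affects performance.
Details
Motivation: Vision Mamba models have an advantage in computational cost, but serializing 2D images into 1D sequences leaves patch scan order as an overlooked design choice; in medical images, strong anatomical priors make this choice especially important.
Contribution: 1. The first systematic study of how scan order affects MRI segmentation; 2. the parameter-free MS2D module, which supports exploring diverse scan paths; 3. large-scale experiments establishing the significance of scan order, with recommendations for the best paths.
Method: Introduces the Multi-Scan 2D (MS2D) module and benchmarks multiple scan strategies on public datasets (BraTS 2020, ISLES 2022, LGG), analyzing the performance differences.
Result: Scan order significantly affects performance (differences of up to 27 Dice points); spatially contiguous paths such as horizontal or vertical rasters perform best.
Insight: Scan order is an important hyperparameter that can improve Mamba model performance at no extra computational cost, particularly in medical image segmentation.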
Abstract: Vision Mamba models promise transformer-level performance at linear computational cost, but their reliance on serializing 2D images into 1D sequences introduces a critical, yet overlooked, design choice: the patch scan order. In medical imaging, where modalities like brain MRI contain strong anatomical priors, this choice is non-trivial. This paper presents the first systematic study of how scan order impacts MRI segmentation. We introduce Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures that facilitates exploring diverse scan paths without additional computational cost. We conduct a large-scale benchmark of 21 scan strategies on three public datasets (BraTS 2020, ISLES 2022, LGG), covering over 70,000 slices. Our analysis shows conclusively that scan order is a statistically significant factor (Friedman test: $\chi^{2}_{20}=43.9, p=0.0016$), with performance varying by as much as 27 Dice points. Spatially contiguous paths – simple horizontal and vertical rasters – consistently outperform disjointed diagonal scans. We conclude that scan order is a powerful, cost-free hyperparameter, and provide an evidence-based shortlist of optimal paths to maximize the performance of Mamba models in medical imaging.
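To make the notion of a scan order concrete, the sketch below serializes the same small patch grid along three paths: a horizontal raster, a vertical raster, and an anti-diagonal scan. The function names are illustrative and this is not the MS2D module; it only shows that each path is a different permutation of the same patches, which is why switching paths adds no parameters.

```python
def horizontal_raster(rows, cols):
    """Row by row, left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def vertical_raster(rows, cols):
    """Column by column, top to bottom."""
    return [(r, c) for c in range(cols) for r in range(rows)]


def diagonal_scan(rows, cols):
    """Anti-diagonal by anti-diagonal, each one visited from top to bottom."""
    return [
        (r, s - r)
        for s in range(rows + cols - 1)
        for r in range(rows)
        if 0 <= s - r < cols
    ]


def flatten(grid, path):
    """Serialize a 2D grid of patches into a 1D sequence along the given path."""
    return [grid[r][c] for r, c in path]


# A 2x3 grid whose entries stand in for patch embeddings.
grid = [[0, 1, 2],
        [3, 4, 5]]

row_major = flatten(grid, horizontal_raster(2, 3))
col_major = flatten(grid, vertical_raster(2, 3))
diagonal = flatten(grid, diagonal_scan(2, 3))
```

A sequence model sees the patches only in the serialized order, so patches that are neighbors in the image can end up far apart in the sequence depending on the path chosen.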
[93] Domain-randomized deep learning for neuroimage analysis
Malte Hoffmann
Main category: eess.IV
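TL;DR: This paper presents a domain-randomization approach to deep learning that addresses the limited robustness and generalizability caused by narrow training datasets in neuroimage analysis. By training on synthetic images with randomized intensities and anatomical content, models can process image types unseen during training without retraining or fine-tuning.
Details
Motivation: In neuroimage analysis, the appearance of magnetic resonance imaging (MRI) and other images varies widely across pulse sequences and scanner hardware, which limits generalization. Conventional methods depend on specific training datasets and cannot adapt to diverse image types.
Contribution: Presents a domain-randomization strategy based on synthetic images that generates diverse training data, markedly improving generalization, and that applies to many imaging modalities and to fields beyond neuroimaging.
Method: Generates synthetic images with randomized intensities and content from anatomical segmentation maps to build a diverse training set, improving the generalization and robustness of deep learning models.
Result: The approach has been effective across modalities including MRI, CT, and PET, and also in other domains such as ultrasound and electron microscopy.
Insight: A synthesis-driven training paradigm can overcome data scarcity and limited diversity, giving domain experts a route to general-purpose tools without extensive computational resources or machine learning knowledge.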
Abstract: Deep learning has revolutionized neuroimage analysis by delivering unprecedented speed and accuracy. However, the narrow scope of many training datasets constrains model robustness and generalizability. This challenge is particularly acute in magnetic resonance imaging (MRI), where image appearance varies widely across pulse sequences and scanner hardware. A recent domain-randomization strategy addresses the generalization problem by training deep neural networks on synthetic images with randomized intensities and anatomical content. By generating diverse data from anatomical segmentation maps, the approach enables models to accurately process image types unseen during training, without retraining or fine-tuning. It has demonstrated effectiveness across modalities including MRI, computed tomography, positron emission tomography, and optical coherence tomography, as well as beyond neuroimaging in ultrasound, electron and fluorescence microscopy, and X-ray microtomography. This tutorial paper reviews the principles, implementation, and potential of the synthesis-driven training paradigm. It highlights key benefits, such as improved generalization and resistance to overfitting, while discussing trade-offs such as increased computational demands. Finally, the article explores practical considerations for adopting the technique, aiming to accelerate the development of generalizable tools that make deep learning more accessible to domain experts without extensive computational resources or machine learning knowledge.
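The core synthesis step, drawing images with randomized intensities from a segmentation map, can be illustrated with a toy generator. The sketch below covers only intensity randomization (one random mean per label plus Gaussian noise); the randomized anatomical content mentioned in the abstract is left out, and the function name `synthesize` and the noise level are assumptions.

```python
import random


def synthesize(label_map, rng, noise_std=0.05):
    """Draw one synthetic image from an integer label map.

    Every label gets a random mean intensity in [0, 1); each pixel is that
    mean plus Gaussian noise, clipped to [0, 1].
    """
    labels = sorted({lab for row in label_map for lab in row})
    means = {lab: rng.random() for lab in labels}
    return [
        [min(1.0, max(0.0, rng.gauss(means[lab], noise_std))) for lab in row]
        for row in label_map
    ]


# Toy 3x3 segmentation with three labels (e.g. background and two tissues).
label_map = [[0, 0, 1],
             [0, 2, 1],
             [2, 2, 1]]

rng = random.Random(0)
samples = [synthesize(label_map, rng) for _ in range(3)]
```

Each call produces a different contrast for the same anatomy, so a network trained on such samples against the fixed label map cannot rely on any single intensity profile.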
[94] Converting T1-weighted MRI from 3T to 7T quality using deep learning
Malo Gicquel,Ruoyi Zhao,Anika Wuestefeld,Nicola Spotorno,Olof Strandberg,Kalle Åström,Yu Xiao,Laura EM Wisse,Danielle van Westen,Rik Ossenkoppele,Niklas Mattsson-Carlgren,David Berron,Oskar Hansson,Gabrielle Flood,Jacob Vogel
Main category: eess.IV
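TL;DR: The paper proposes a deep learning method that converts 3T MRI images into images approaching 7T quality, improving resolution and tissue contrast while reducing artifacts.
Details
Motivation: 7T MRI provides high-quality images but is less accessible; using deep learning to bring 3T MRI up to 7T quality addresses this.
Contribution: Proposes a deep learning model that combines a U-Net with a GAN to generate synthetic MRI images approaching 7T quality, performing well on image quality metrics and in evaluation by human raters.
Method: Trains two models, a specialized U-Net and a U-Net integrated with a GAN (GAN U-Net), to synthesize 7T images from 3T images.
Result: Synthetic 7T images outperform the original 3T images in visual quality and automated segmentation, and perform similarly to real 3T images in downstream prediction of cognitive status.
Insight: Deep learning can raise 3T MRI to near-7T quality without compromising downstream task performance, offering a potential route to clinical use.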
Abstract: Ultra-high resolution 7 tesla (7T) magnetic resonance imaging (MRI) provides detailed anatomical views, offering better signal-to-noise ratio, resolution and tissue contrast than 3T MRI, though at the cost of accessibility. We present an advanced deep learning model for synthesizing 7T brain MRI from 3T brain MRI. Paired 7T and 3T T1-weighted images were acquired from 172 participants (124 cognitively unimpaired, 48 impaired) from the Swedish BioFINDER-2 study. To synthesize 7T MRI from 3T images, we trained two models: a specialized U-Net, and a U-Net integrated with a generative adversarial network (GAN U-Net). Our models outperformed two additional state-of-the-art 3T-to-7T models in image-based evaluation metrics. Four blinded MRI professionals judged our synthetic 7T images as comparable in detail to real 7T images, and superior in subjective visual quality to 7T images, apparently due to the reduction of artifacts. Importantly, automated segmentations of the amygdalae of synthetic GAN U-Net 7T images were more similar to manually segmented amygdalae (n=20), than automated segmentations from the 3T images that were used to synthesize the 7T images. Finally, synthetic 7T images showed similar performance to real 3T images in downstream prediction of cognitive status using MRI derivatives (n=3,168). In all, we show that synthetic T1-weighted brain images approaching 7T quality can be generated from 3T images, which may improve image quality and segmentation, without compromising performance in downstream tasks. Future directions, possible clinical use cases, and limitations are discussed.
[95] Divide and Conquer: A Large-Scale Dataset and Model for Left-Right Breast MRI Segmentation
Maximilian Rokuss,Benjamin Hamm,Yannick Kirchhoff,Klaus Maier-Hein
Main category: eess.IV
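TL;DR: The paper introduces the first publicly available large-scale breast MRI dataset with left and right breast segmentation labels for more than 13,000 annotated cases, together with a deep learning segmentation model.
Details
Motivation: Public breast MRI datasets are lacking, especially ones with explicit left and right breast segmentation labels, which limits research and tool development.
Contribution: 1. Releases the first large-scale public breast MRI dataset with left and right breast segmentation labels; 2. provides a deep learning model for automatic segmentation.
Method: Trains a deep learning model for left-right breast segmentation on the large-scale breast MRI data.
Result: The dataset and model are publicly available, providing an important resource for breast MRI analysis.
Insight: Public datasets and pretrained models can accelerate tool development and research in breast health.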
Abstract: We introduce the first publicly available breast MRI dataset with explicit left and right breast segmentation labels, encompassing more than 13,000 annotated cases. Alongside this dataset, we provide a robust deep-learning model trained for left-right breast segmentation. This work addresses a critical gap in breast MRI analysis and offers a valuable resource for the development of advanced tools in women’s health. The dataset and trained model are publicly available at: www.github.com/MIC-DKFZ/BreastDivider
[96] Blind Super Resolution with Reference Images and Implicit Degradation Representation
Huu-Phu Do,Po-Chih Hu,Hao-Chien Hsueh,Che-Kai Liu,Vu-Hoang Tran,Ching-Chun Huang
Main category: eess.IV
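TL;DR: The paper proposes a new blind super-resolution (BSR) method that uses reference images and an implicit degradation representation to adapt degradation kernels and scaling factors across super-resolution scales, improving performance.
Details
Motivation: Existing BSR methods mainly estimate degradation kernels directly from low-resolution inputs and overlook how the kernels depend on the scaling factor, which leads to unstable performance across super-resolution scales.
Contribution: 1. Uses high-resolution reference images to adapt the degradation kernel; 2. combines an implicit degradation representation with the generation of additional LR-HR pairs to improve super-resolution performance; 3. applies to both pretrained and zero-shot blind SR models.
Method: Pairs content-irrelevant HR reference images with the target LR image to learn the degradation process adaptively, then generates LR-HR pairs for training, overcoming limitations of conventional kernel estimation.
Result: Outperforms previous methods in both pretrained and zero-shot blind SR settings, demonstrating the value of adapting to both degradation kernels and scaling factors.
Insight: Adapting to both the degradation kernel and the scaling factor is key to blind SR, and reference images provide a richer training signal for the implicit representation.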
Abstract: Previous studies in blind super-resolution (BSR) have primarily concentrated on estimating degradation kernels directly from low-resolution (LR) inputs to enhance super-resolution. However, these degradation kernels, which model the transition from a high-resolution (HR) image to its LR version, should account for not only the degradation process but also the downscaling factor. Applying the same degradation kernel across varying super-resolution scales may be impractical. Our research acknowledges degradation kernels and scaling factors as pivotal elements for the BSR task and introduces a novel strategy that utilizes HR images as references to establish scale-aware degradation kernels. By employing content-irrelevant HR reference images alongside the target LR image, our model adaptively discerns the degradation process. It is then applied to generate additional LR-HR pairs through down-sampling the HR reference images, which are keys to improving the SR performance. Our reference-based training procedure is applicable to proficiently trained blind SR models and zero-shot blind SR methods, consistently outperforming previous methods in both scenarios. This dual consideration of blur kernels and scaling factors, coupled with the use of a reference image, contributes to the effectiveness of our approach in blind super-resolution tasks.
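The step of generating additional LR-HR pairs by down-sampling HR reference images can be illustrated with an explicit blur-then-subsample degradation. The paper learns an implicit, scale-aware degradation representation; the fixed 3x3 box kernel, the function name `degrade`, and the edge clamping below are simplifying assumptions used only to show how one reference image yields a training pair per scale.

```python
def degrade(hr, kernel, scale):
    """Blur hr with kernel (edges clamped), then keep every scale-th pixel."""
    h, w = len(hr), len(hr[0])
    k = len(kernel) // 2
    blurred = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index at the border
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index at the border
                    acc += kernel[dy + k][dx + k] * hr[yy][xx]
            blurred[y][x] = acc
    return [row[::scale] for row in blurred[::scale]]


# Assumed stand-in for an estimated degradation kernel: a 3x3 box blur.
box = [[1 / 9] * 3 for _ in range(3)]

# A 4x4 HR reference image with a simple intensity ramp.
hr_ref = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# One (LR, HR) training pair per super-resolution scale.
pairs = {s: (degrade(hr_ref, box, s), hr_ref) for s in (2, 4)}
flat_lr = degrade([[5.0] * 4 for _ in range(4)], box, 2)
```

Generating pairs separately for each scale reflects the abstract's point that the degradation from HR to LR depends on the downscaling factor as well as on the blur.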
[97] OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models
Ningyong Wu,Jinzhi Wang,Wenhong Zhao,Chenzhan Yu,Zhigang Xiu,Duwei Dai
Main category: eess.IV
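TL;DR: OrthoInsight is a multimodal deep learning framework that combines a YOLOv9 model, a medical knowledge graph, and a LLaVA language model for rib fracture diagnosis and report generation, outperforming models such as GPT-4 and Claude-3.
Details
Motivation: Medical imaging data is growing quickly, and manual diagnosis of rib fractures is time-consuming and error-prone, so automated tools are needed.
Contribution: Proposes the OrthoInsight framework, which integrates visual and textual data to improve diagnostic accuracy and report quality.
Method: Combines YOLOv9 for fracture detection, a medical knowledge graph for clinical context, and a fine-tuned LLaVA model for report generation.
Result: Evaluated on 28,675 annotated CT images, it achieves an average score of 4.28, outperforming GPT-4 and Claude-3.
Insight: Multimodal learning has strong potential in medical image analysis and can effectively support radiologists.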
Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.
cs.LG [Back]
[98] Generalist Bimanual Manipulation via Foundation Video Diffusion Models
Yao Feng,Hengkai Tan,Xinyi Mao,Guodong Liu,Shuhe Huang,Chendong Xiang,Hang Su,Jun Zhu
Main category: cs.LG
TL;DR: 论文提出了VIDAR,一个基于视频扩散模型的双手机器人操作框架,通过两阶段训练和掩码逆动力学模型实现高效动作预测,仅需少量人类演示即可泛化到新任务和背景。
Details
Motivation: 双手机器人操作的数据稀缺性和异构性限制了其扩展性,作者希望通过利用大规模视频扩散模型和掩码动作预测来解决这一问题。Contribution: 1. 提出VIDAR框架,结合视频扩散预训练和掩码逆动力学模型;2. 在750K多视角视频上预训练,统一编码多模态观测空间;3. 掩码模型无需像素级标签即可提取动作相关信息,泛化能力强。
Method: 1. 在大规模双手机器人视频上预训练扩散模型;2. 设计掩码逆动力学模型,通过掩码提取动作相关轨迹信息,无需额外标注。
Result: 仅需20分钟人类演示(典型数据量的1%),VIDAR在新机器人平台上展现强大泛化能力,超越现有方法。
Insight: 视频基础模型与掩码动作预测结合,为机器人操作提供了可扩展且泛化性强的解决方案。
Abstract: Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
[99] Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models
Charvi Rastogi,Tian Huey Teh,Pushkar Mishra,Roma Patel,Ding Wang,Mark Díaz,Alicia Parrish,Aida Mostafazadeh Davani,Zoe Ashwood,Michela Paganini,Vinodkumar Prabhakaran,Verena Rieser,Lora Aroyo
Main category: cs.LG
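TL;DR: The paper introduces DIVE, a multimodal dataset for pluralistic alignment of text-to-image (T2I) models, designed to capture the diverse views of safety held by different demographic groups, and discusses implications for T2I model development.
Details
Motivation: Current T2I models do not adequately account for diverse human experiences, leading to systems misaligned with human values. The paper advocates pluralistic alignment, so that AI can understand and be steered toward diverse and possibly conflicting human values.
Contribution: 1. DIVE, the first multimodal dataset for pluralistic alignment; 2. empirical confirmation that demographics are a crucial proxy for diverse viewpoints; 3. a discussion of strategies for building aligned T2I models.
Method: Collects feedback from a large pool of demographically diverse human raters across 1000 prompts to capture pluralistic perceptions of safety.
Result: The data show significant differences in safety perception across demographic groups, diverging from conventional evaluations.
Insight: Pluralistic alignment is key to building equitable T2I systems, and demographic diversity offers an effective way to capture diverse viewpoints.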
Abstract: Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralistic alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) – the first multimodal dataset for pluralistic alignment. It enables deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems. Content Warning: The paper includes sensitive content that may be harmful.
[100] Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
Seyyed Saeid Cheshmi,Buyao Lyu,Thomas Lisko,Rajesh Rajamani,Robert A. McGovern,Yogatheesan Varatharajah
Main category: cs.LG
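TL;DR: The paper proposes a new method based on IMU-video cross-modal self-supervised pretraining; by learning representations from large-scale unlabeled IMU-video data, it improves the generalization of human activity recognition (HAR) to out-of-distribution (OOD) data, including a dataset from patients with Parkinson’s disease.
Details
Motivation: Conventional HAR relies on application-specific labeled data and generalizes poorly to data from different environments or populations, while the dynamic nature of IMU data makes cross-modal learning a promising option.
Contribution: A new cross-modal self-supervised pretraining approach that learns general representations from IMU and video data, improving HAR performance on OOD data.
Method: Cross-modal pretraining on large-scale unlabeled IMU-video data, using techniques such as contrastive learning to learn representations, with evaluation in zero-shot and few-shot settings.
Result: The proposed method outperforms existing IMU-video and IMU-only pretraining approaches on OOD data, including the Parkinson’s disease dataset.
Insight: For highly dynamic data modalities such as IMU signals, cross-modal pretraining may be an effective tool for learning generalizable representations.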
Abstract: Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson’s disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.