Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 76]
- cs.SE [Total: 1]
- cs.RO [Total: 3]
- cs.DC [Total: 1]
- cs.AI [Total: 1]
- cs.IR [Total: 1]
- cs.CR [Total: 4]
- eess.IV [Total: 3]
- cs.LG [Total: 3]
cs.CL
[1] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
Chi Zhang,Changjia Zhu,Junjie Xiong,Xiaoran Xu,Lingyao Li,Yao Liu,Zhuo Lu
Main category: cs.CL
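TL;DR: This paper systematically surveys research on harmful content generation by large language models (LLMs) and on its safety mitigation. It proposes a unified taxonomy, analyzes multimodal and LLM-assisted attack strategies, and assesses existing mitigation methods.
Details
Motivation: LLMs serve both as powerful tools for solving real-world problems and as potential sources of harmful language; this dual role poses a pressing sociotechnical challenge.
Contribution: The main contribution is a unified taxonomy of LLM-related harms and defenses, together with a systematic analysis of attack strategies and safety mitigation techniques.
Method: The paper reviews recent studies covering unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques, and assesses approaches such as RLHF, prompt engineering, and safety alignment.
Result: The survey synthesizes recent progress in LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions.
Insight: LLM safety is a rapidly evolving field; more robust and ethically aligned language technologies are needed to handle multimodal attacks and complex sociotechnical challenges.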
Abstract: Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.
[2] FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
Xiangyan Chen,Yufeng Li,Yujian Gan,Arkaitz Zubiaga,Matthew Purver
Main category: cs.CL
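TL;DR: The paper introduces FineDialFact, a benchmark for fine-grained dialogue fact verification that addresses the overly simplistic single-label treatment of dialogue responses containing a mix of facts. The authors build a verification dataset from publicly available dialogue datasets and evaluate several baselines; methods that incorporate Chain-of-Thought reasoning improve performance, but the task remains challenging.
Details
Motivation: LLMs hallucinate in dialogue, and existing factual-consistency checks are too coarse-grained to handle responses that mix accurate, inaccurate, and unverifiable facts.
Contribution: Introduces FineDialFact, a benchmark for fine-grained dialogue fact verification, and constructs the accompanying dataset.
Method: Extracts atomic facts from dialogue responses and verifies each one; evaluates several baseline methods, including ones that use Chain-of-Thought reasoning.
Result: Chain-of-Thought reasoning improves performance, but the best F1-score on HybriDialogue is only 0.75, which shows that the task remains challenging.
Insight: Fine-grained fact verification is a key challenge for dialogue systems. Chain-of-Thought reasoning looks like a promising direction, but further research is needed.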
Abstract: Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be public on GitHub.
[3] Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models
Sree Bhattacharyya,Lucas Craig,Tharun Dilliraj,Jia Li,James Z. Wang
Main category: cs.CL
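TL;DR: Drawing on cognitive appraisal theory, the paper proposes a way to evaluate the cognitive dimensions that large language models (LLMs) use in emotional reasoning, going beyond conventional emotion-labeling tasks.
Details
Motivation: Prior work concentrates on supervised tasks with discrete emotion labels and overlooks the cognitive dimensions of emotional reasoning in LLMs. The paper aims to fill this gap by examining whether LLMs can reason about emotions through cognitive dimensions.
Contribution: (1) Introduces CoRE, a large-scale benchmark for evaluating the cognitive dimensions of LLM emotional reasoning; (2) reveals diverse patterns across LLMs on emotional reasoning tasks; (3) interprets the internal representations of LLMs through cognitive appraisal dimensions.
Method: Building on cognitive appraisal theory, the authors design a series of experiments on the CoRE benchmark. The analysis covers the models' reliance on specific appraisal dimensions, the relationship between appraisal dimensions and emotions, and a cognitive interpretation of the models' internal representations.
Result: Different LLMs show diverse cognitive patterns in emotional reasoning, and certain appraisal dimensions matter particularly for modeling specific emotions.
Insight: LLMs exhibit appraisal-style reasoning about emotions, though behavior varies from model to model. The findings offer a new framework for interpreting how LLMs understand emotion.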
Abstract: Affective Computing has been established as a crucial field of inquiry to advance the holistic development of Artificial Intelligence (AI) systems. Foundation models – especially Large Language Models (LLMs) – have been evaluated, trained, or instruction-tuned in several past works, to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when reasoning about emotionally charged stimuli. We introduce a large-scale benchmark on Cognitive Reasoning for Emotions - CoRE - to evaluate internal cognitive structures implicitly used by LLMs for emotional reasoning. Through a plethora of evaluation experiments and analysis, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions?, (b) What cognitive dimensions are important for characterizing specific emotions?, and, (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs. Our benchmark and code will be made publicly available.
[4] Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation
Zhanghao Hu,Qinglin Zhu,Siya Qi,Yulan He,Hanqi Yan,Lin Gui
Main category: cs.CL
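TL;DR: The paper proposes the Spectrum Projection Score (SPS), a lightweight, supervision-free metric for the semantic alignment between a retrieved summary and the reader model, and builds on it xCompress, a framework that dynamically samples, ranks, and compresses retrieved summaries. Experiments show that SPS improves performance across multiple tasks.
Details
Motivation: Existing retrieval-augmented generation (RAG) evaluations assess the retriever and reader jointly, which makes it hard to isolate the true contribution of retrieval, especially because LLM readers are prompt-sensitive. A metric is needed that measures the alignment between a retrieved summary and the reader on its own.
Contribution: Proposes SPS, a metric that measures the semantic alignment between retrieved summaries and the reader model without supervision; designs xCompress, a framework that dynamically optimizes the selection and compression of retrieved summaries.
Method: SPS assesses semantic alignment by comparing the area formed by tokens generated from the summary with the principal directions of the reader model's subspace; xCompress uses SPS to dynamically sample, rank, and compress candidate summaries.
Result: Experiments on five QA benchmarks with four open-source LLMs show that SPS improves task performance and offers a principled perspective on the interaction between retrieval and generation.
Insight: SPS gives RAG a quantitative analysis tool that helps explain what retrieval contributes to generation, while xCompress shows the practical value of dynamically optimizing retrieved summaries.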
Abstract: Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that allows the reader to gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by generated tokens from the summary, and the principal directions of subspace in the reader and to measure the relevance. Building on SPS we present xCompress, an inference time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
[5] Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale
Rafal Kocielnik,Min Kim,Penphob Boonyarungsrit,Fereshteh Soltani,Deshawn Sambrano,Animashree Anandkumar,R. Michael Alvarez
Main category: cs.CL
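TL;DR: The paper proposes a three-stage pipeline for efficiently detecting prosocial behavior in game chat, combining human-AI collaboration, task-definition refinement, and a low-cost inference system.
Details
Motivation: Prosociality detection is a new challenge that lacks well-established definitions and labeled data, so it needs new approaches to both annotation and deployment.
Contribution: Proposes a three-stage pipeline that uses human-AI collaboration to improve label quality and an efficient inference architecture to reduce cost.
Method: Three stages: (1) select the best LLM-based labeling strategy using a small seed set of human-labeled examples; (2) refine the task definition in a human-AI loop; (3) train a two-stage inference system (a lightweight classifier, with ambiguous cases escalated to GPT-4o).
Result: The system cuts inference cost by about 70% while maintaining high precision (about 0.90).
Insight: Refining definitions through human-AI collaboration and designing a cost-aware architecture can yield scalable solutions for emerging responsible-AI tasks.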
Abstract: Detecting prosociality in text (communication intended to affirm, support, or improve others’ behavior) is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition, a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only ~35% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by ~70% while achieving high precision (~0.90). Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
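The two-stage inference idea in this abstract can be illustrated with a small confidence cascade. This is only a sketch under stated assumptions: `cascade_classify`, the `threshold` value, and both toy models are hypothetical stand-ins and are not taken from the paper; in a real system the cheap model would be the trained lightweight classifier and the expensive model a call to a large LLM.

```python
from typing import Callable, List, Tuple


def cascade_classify(
    texts: List[str],
    cheap_model: Callable[[str], Tuple[bool, float]],
    expensive_model: Callable[[str], bool],
    threshold: float = 0.8,  # hypothetical confidence cut-off
) -> Tuple[List[bool], float]:
    """Label each text, escalating low-confidence cases to the expensive model.

    Returns the labels and the fraction of inputs that were escalated.
    """
    labels: List[bool] = []
    escalated = 0
    for text in texts:
        label, confidence = cheap_model(text)
        if confidence < threshold:
            # Ambiguous case: defer to the slower, more accurate model.
            label = expensive_model(text)
            escalated += 1
        labels.append(label)
    rate = escalated / len(texts) if texts else 0.0
    return labels, rate


def toy_cheap(text: str) -> Tuple[bool, float]:
    """Toy stand-in for a lightweight classifier (keyword counting)."""
    hits = sum(word in text.lower() for word in ("thanks", "nice", "gg"))
    if hits >= 2:
        return True, 0.95
    if hits == 0:
        return False, 0.90
    return True, 0.50  # a single keyword is treated as ambiguous


def toy_expensive(text: str) -> bool:
    """Toy stand-in for an LLM call."""
    return "thanks" in text.lower() or "nice" in text.lower()


labels, escalation_rate = cascade_classify(
    ["gg nice play", "push mid", "thanks"], toy_cheap, toy_expensive
)
```

The escalation rate is what drives cost: the fewer messages fall below the threshold, the fewer expensive calls are made.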
[6] Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Yidong Wang,Xin Wang,Cunxiang Wang,Junfeng Fang,Qiufeng Wang,Jianing Chu,Xuran Meng,Shuxun Yang,Libo Qin,Yue Zhang,Wei Ye,Shikun Zhang
Main category: cs.CL
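TL;DR: The paper proposes Temporal Self-Rewarding Language Models, which address a key limitation of the existing Self-Rewarding paradigm by decoupling chosen and rejected responses across past and future model generations, yielding significant performance gains.
Details
Motivation: In existing Self-Rewarding methods the chosen and rejected responses improve in lockstep, so the representational difference between contrasting samples gradually narrows and preference learning weakens. The paper sets out to fix this.
Contribution: Proposes a dual-phase framework, Anchored Rejection (rejected responses are fixed to outputs of the past model) and Future-Guided Chosen (chosen responses are curated dynamically from the next-generation model), which sustains the learning signal.
Method: The method coordinates past, present, and future model generations to optimize preference learning. Concretely, rejected responses are fixed to the initial model's outputs, and chosen responses are curated dynamically from the predictions of the next-generation model.
Result: Experiments across several model families and sizes (e.g., Llama3B/8B/70B) show significant improvements over the baseline; for example, Llama3.1-8B improves its AlpacaEval 2.0 win rate by 9.75 points (29.44 vs. 19.69).
Insight: Decoupling chosen and rejected responses across past and future generations sustains the learning signal more effectively and improves generalization, even on tasks for which no training data was specifically collected.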
Abstract: Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model’s outputs and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
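The pairing scheme described above can be sketched as a small data-construction routine. This is a simplified illustration, not the authors' implementation: `build_temporal_pairs`, the callables `past_model`, `future_model`, and `judge`, the `n_samples` parameter, and the rule of keeping a pair only when the chosen response outscores the rejected one are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def build_temporal_pairs(
    prompts: List[str],
    past_model: Callable[[str], str],     # hypothetical: frozen initial model
    future_model: Callable[[str], str],   # hypothetical: next-generation model
    judge: Callable[[str, str], float],   # hypothetical: LLM-as-a-Judge score
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Build preference pairs whose two sides come from different model generations."""
    pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        # Anchored rejection: the rejected side always comes from the past model.
        rejected = past_model(prompt)
        # Future-guided chosen: pick the best-scoring sample from the newer model.
        candidates = [future_model(prompt) for _ in range(n_samples)]
        chosen = max(candidates, key=lambda response: judge(prompt, response))
        # Assumption: keep the pair only if the judge sees a real quality gap.
        if judge(prompt, chosen) > judge(prompt, rejected):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy stand-ins so the sketch runs without any model.
pairs = build_temporal_pairs(
    ["Q1"],
    past_model=lambda p: "ok",
    future_model=lambda p: p + " answered in detail",
    judge=lambda p, r: float(len(r)),
)
```

Pairs in this form are the usual input to DPO-style training; the point of the sketch is only that the two sides are drawn from different points in the model's training history, which keeps them from converging.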
[7] EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
Xinda Wang,Zhengxu Hou,Yangshijie Zhang,Bingren Yan,Zhibo Yang,Xingsheng Zhang,Luxi Xing,Qiang Zhou,Chen Zhang
Main category: cs.CL
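TL;DR: The paper proposes EvolvR, a self-evolving pairwise reasoning framework that self-synthesizes score-aligned Chain-of-Thought data and self-filters it. The framework substantially improves story evaluation and reaches SOTA on several benchmarks.
Details
Motivation: Existing LLM evaluation methods perform poorly on open-ended tasks such as story evaluation: prompt engineering for closed-source models adapts poorly, while fine-tuned open-source models lack rigorous reasoning. A solution is needed that both adapts to open-ended tasks and reasons rigorously.
Contribution: Proposes the Self-Evolving Pairwise Reasoning (EvolvR) framework, which self-synthesizes Chain-of-Thought data with a multi-persona strategy and safeguards data quality with a multi-agent filtering mechanism; the resulting evaluator significantly improves the quality of generated stories.
Method: Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought data via a multi-persona strategy, then self-filters the data with multiple agents to improve its quality, and finally trains an evaluator that serves as a reward model to guide story generation.
Result: Achieves SOTA on three benchmarks (StoryER, HANNA, and OpenMEVA) and, when used as a reward model, significantly improves the quality of generated stories.
Insight: Self-evolving data synthesis and filtering are key to better evaluation of open-ended tasks; multi-agent collaboration keeps the data logically rigorous and thus provides a more reliable signal for generation.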
Abstract: Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.
[8] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
Morris Alper,Moran Yanuka,Raja Giryes,Gašper Beguš
Main category: cs.CL
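TL;DR: ConlangCrafter uses a multi-hop LLM pipeline to support end-to-end design of constructed languages (conlangs), producing diverse and consistent languages through modular decomposition and meta-linguistic reasoning.
Details
Motivation: Constructed languages such as Esperanto and Quenya play important roles in art, philosophy, and international communication, and modern large language models offer new tools for creative generation. The paper aims to use LLMs to assist end-to-end conlang design.
Contribution: Proposes ConlangCrafter, a multi-hop LLM pipeline that decomposes language design into modular stages (phonology, morphology, syntax, lexicon generation, and translation) and uses injected randomness and self-refinement feedback to improve diversity and consistency.
Method: A multi-stage pipeline in which each stage draws on the LLM's meta-linguistic reasoning, injects randomness to encourage diversity, and applies self-refinement feedback to keep the language description coherent.
Result: Experiments show that ConlangCrafter produces coherent and typologically diverse conlangs without human linguistic expertise.
Insight: LLMs can be effective tools for conlang design; modular design and feedback mechanisms are key to the quality and diversity of the generated languages.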
Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages – phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.
[9] Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs
Mohamed Basem,Islam Oshallah,Ali Hamdi,Ammar Mohammed
Main category: cs.CL
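TL;DR: The paper presents two approaches to extractive question answering on the Quran. Using few-shot prompting with instruction-tuned large language models (such as Gemini and DeepSeek), it develops a specialized Arabic prompt framework and a post-processing system that improve precision and reduce hallucinations.
Details
Motivation: Question answering over Quranic text is difficult because of its complex language, unique terminology, and deep meaning, especially when resources are limited.
Contribution: (1) Combines few-shot prompting with instruction-tuned large models; (2) develops a specialized Arabic prompt framework and a post-processing system that includes subword alignment, overlap suppression, and semantic filtering.
Method: (1) Few-shot prompting with instruction-tuned large models such as Gemini and DeepSeek; (2) a specialized Arabic prompt framework; (3) a post-processing system that refines the extracted spans.
Result: Instruction-tuned large models outperform traditional fine-tuned models; the best configuration achieves a pAP10 score of 0.637, indicating effectiveness on low-resource, semantically rich tasks.
Insight: On low-resource tasks, combining instruction tuning with few-shot prompting improves performance and is well suited to semantically complex text.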
Abstract: This paper presents two effective approaches for Extractive Question Answering (QA) on the Quran. It addresses challenges related to complex language, unique terminology, and deep meaning in the text. The second uses few-shot prompting with instruction-tuned large language models such as Gemini and DeepSeek. A specialized Arabic prompt framework is developed for span extraction. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering. This improves precision and reduces hallucinations. Evaluations show that large language models with Arabic instructions outperform traditional fine-tuned models. The best configuration achieves a pAP10 score of 0.637. The results confirm that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.
[10] You Don’t Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
Shengyuan Chen,Chuang Zhou,Zheng Yuan,Qinggang Zhang,Zeyang Cui,Hao Chen,Yilin Xiao,Jiannong Cao,Xiao Huang
Main category: cs.CL
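TL;DR: The paper proposes LogicRAG, a framework that builds reasoning structures dynamically to improve retrieval-augmented generation (RAG). It avoids the cost and inflexibility of pre-built graphs and improves both performance and efficiency.
Details
Motivation: To mitigate LLM hallucination while overcoming two problems of existing GraphRAG methods caused by pre-built graphs: high construction cost and a mismatch between the graph and the logic a query actually requires.
Contribution: Proposes LogicRAG, which dynamically generates a reasoning structure with logical dependencies (a DAG), supports adaptive retrieval, and improves efficiency through graph linearization and pruning.
Method: (1) Decompose the query into subproblems and build a DAG of their logical dependencies; (2) linearize the graph with a topological sort to support multi-step reasoning; (3) reduce redundant retrieval through graph pruning and context pruning.
Result: Experiments show that LogicRAG outperforms existing methods in both performance and efficiency.
Insight: Dynamic reasoning structures adapt more flexibly to different query needs while lowering computational cost, which opens a new direction for optimizing RAG.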
Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
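The linearization step described above (subproblems arranged in a DAG, then ordered by topological sort) can be illustrated with Python's standard `graphlib` module. This is a minimal sketch, not LogicRAG itself: the example subproblems, the `solve_in_order` helper, and the `answer_fn` callback are hypothetical, and query decomposition, retrieval, and pruning are left out.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical decomposition: each subproblem maps to the subproblems it depends on.
dependencies: Dict[str, Set[str]] = {
    "q3: which author was born first?": {
        "q1: birth year of author A",
        "q2: birth year of author B",
    },
}


def solve_in_order(
    deps: Dict[str, Set[str]],
    answer_fn: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Answer subproblems in topological order, passing earlier answers as context."""
    answers: Dict[str, str] = {}
    for sub in TopologicalSorter(deps).static_order():
        # Every dependency of `sub` has already been answered at this point.
        context = {dep: answers[dep] for dep in deps.get(sub, ())}
        answers[sub] = answer_fn(sub, context)
    return answers


call_order: List[str] = []


def toy_answer(sub: str, context: Dict[str, str]) -> str:
    """Toy stand-in for 'retrieve, then ask the LLM'."""
    call_order.append(sub)
    return f"answer to {sub!r} using {len(context)} earlier answer(s)"


answers = solve_in_order(dependencies, toy_answer)
```

`static_order()` raises `graphlib.CycleError` if the dependency graph contains a cycle, which is a convenient check that the decomposition really is a DAG.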
[11] AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
Sayantan Adak,Pratyush Chatterjee,Somnath Banerjee,Rima Hazra,Somak Aditya,Animesh Mukherjee
Main category: cs.CL
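TL;DR: AURA is a multi-layered framework that uses Process Reward Models (PRMs) to evaluate logical coherence and safety at the step level, significantly improving the logical integrity and safety of model outputs.
Details
Motivation: Existing LLMs handle affordance-based safety risks poorly (cases where an output inadvertently facilitates harmful actions), and traditional safety solutions lack the granularity and proactive nature required.
Contribution: Proposes the AURA framework, which combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically guide the model toward safer reasoning trajectories.
Method: Uses multi-layered Process Reward Models (PRMs) to assess logic and safety during generation, combined with an adaptive decoding strategy.
Result: Experiments show that AURA significantly outperforms existing methods, improving the logical integrity and affordance-sensitive safety of model outputs.
Insight: Fine-grained step-level evaluation and dynamic intervention address safety and logical-coherence problems during LLM generation more effectively.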
Abstract: Present day LLMs face the challenge of managing affordance-based safety risks-situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence clearly demonstrates that this approach significantly surpasses existing methods, significantly improving the logical integrity and affordance-sensitive safety of model outputs. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
[12] Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
Lingyuan Liu,Mengxiang Zhang
Main category: cs.CL
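TL;DR: The paper proposes Selective Reflection Distillation (SRD), a framework that dynamically selects high-quality, student-compatible training data to improve knowledge distillation (KD) and reduce its computational cost.
Details
Motivation: Current white-box knowledge distillation methods overlook training data quality and student-model compatibility, which makes distillation inefficient.
Contribution: Proposes the SRD framework, which dynamically evaluates and selects high-quality, compatible training data and introduces it step by step through a curriculum scheduling strategy, improving both the effectiveness and the efficiency of distillation.
Method: SRD selects training data dynamically by comparing ground-truth data with student-model outputs, then introduces the selected data incrementally via curriculum scheduling.
Result: Experiments show that SRD consistently improves distilled-model performance and reduces training time by up to 39%, without modifying the underlying KD algorithm.
Insight: Data quality and compatibility are key to distilling large language models efficiently, and SRD offers a practical framework for data-centric distillation.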
Abstract: Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, as well as decreases computational cost significantly during KD training. Experiments on a range of language model benchmarks demonstrate SRD’s consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
[13] Scaling Personality Control in LLMs with Big Five Scaler Prompts
Gunhee Cho,Yun-Gyung Cheong
Main category: cs.CL
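TL;DR: The paper proposes Big5-Scaler, a method that controls Big Five personality traits in large language models (LLMs) through natural language prompts, enabling fine-grained personality control without additional training.
Details
Motivation: Existing dialogue systems usually lack fine-grained control over personality traits, which limits their use in personalized interaction. The authors therefore aim to control LLM personality through prompt engineering.
Contribution: The main contribution is the Big5-Scaler framework, which embeds numeric trait values into natural language prompts to achieve fine-grained control over the personality traits of LLMs.
Method: Big5-Scaler designs natural language prompts that contain numeric Big Five trait values and uses them directly for conditional generation. The authors evaluate the effect of different prompt types and trait intensities.
Result: Experiments show that Big5-Scaler induces consistent and distinguishable personality traits across models, with performance varying by prompt type and trait intensity. Concise prompts and lower trait intensities work best.
Insight: Simple prompt engineering can effectively control the personality traits of LLMs, offering an efficient route to personality-aware dialogue agents without complex model training.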
Abstract: We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing an efficient approach for building personality-aware dialogue agents.
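Embedding numeric trait values into a prompt can be illustrated with a small template function. This is a sketch only: the prompt wording, the 0 to 100 scale, and the `build_persona_prompt` helper are assumptions made for illustration, since the abstract does not give the paper's actual templates or scale.

```python
from typing import Dict

TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def build_persona_prompt(scores: Dict[str, int], scale_max: int = 100) -> str:
    """Render Big Five scores as a short system-style prompt (hypothetical template)."""
    missing = [trait for trait in TRAITS if trait not in scores]
    if missing:
        raise ValueError(f"missing trait scores: {missing}")
    for trait in TRAITS:
        if not 0 <= scores[trait] <= scale_max:
            raise ValueError(f"{trait} must be between 0 and {scale_max}")
    lines = [f"- {trait.capitalize()}: {scores[trait]}/{scale_max}" for trait in TRAITS]
    header = (
        "You have the following Big Five personality profile "
        f"(0 = very low, {scale_max} = very high):"
    )
    return header + "\n" + "\n".join(lines) + "\nRespond in a way that reflects this profile."


prompt = build_persona_prompt(
    {
        "openness": 70,
        "conscientiousness": 40,
        "extraversion": 90,
        "agreeableness": 55,
        "neuroticism": 20,
    }
)
```

The resulting string would be prepended to a conversation as a system or instruction message; no model weights are changed.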
[14] One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
Yingfeng Luo,Dingyang Lin,Junxin Wang,Ziqiang Xu,Kaiyan Chang,Tong Zheng,Bei Li,Anxiang Ma,Tong Xiao,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
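TL;DR: The paper proposes TADrop, an adaptive sparsification strategy that tailors the sparsity level to the distributional properties of each parameter tensor. It improves existing model merging methods and significantly boosts their performance.
Details
Motivation: Existing model merging methods apply a uniform, "one-size-fits-all" sparsity ratio that ignores the heterogeneity of model parameters, so critical parameters are pruned by mistake or redundant ones are retained, which limits performance.
Contribution: Proposes TADrop, an adaptive sparsification strategy based on the distributional properties of each tensor, which significantly improves model merging performance.
Method: TADrop assigns a tailored sparsity ratio to each parameter tensor: tensors with denser, more redundant distributions are pruned more aggressively, while sparser, more critical tensors are preserved.
Result: Experiments across tasks (vision, language, and multimodal) and models (ViT, BEiT) show consistent gains; for example, TADrop adds an average of 2.0% when enhancing a leading merging method across 8 ViT-B/32 tasks.
Insight: The distributional heterogeneity of model parameters matters for sparsification; an adaptive strategy mitigates parameter interference more effectively and offers a new baseline for high-performance model merging.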
Abstract: Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a "one-size-fits-all" strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model’s structure, offering a new baseline for high-performance model merging.
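The contrast between a global sparsity ratio and a per-tensor one can be shown with a toy task vector. This is a sketch under explicit assumptions: it uses plain magnitude pruning, the per-tensor `drop_ratios` are supplied by hand rather than derived from each tensor's distribution (the abstract does not say how TADrop computes them), and the helper names and tensor names are hypothetical.

```python
from typing import Dict, List


def sparsify_tensor(values: List[float], drop_ratio: float) -> List[float]:
    """Zero out the smallest-magnitude fraction `drop_ratio` of a flat tensor."""
    n_drop = int(len(values) * drop_ratio)
    if n_drop == 0:
        return list(values)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


def sparsify_task_vector(
    task_vector: Dict[str, List[float]],
    drop_ratios: Dict[str, float],
) -> Dict[str, List[float]]:
    """Apply a separate drop ratio to each named tensor instead of one global ratio."""
    return {
        name: sparsify_tensor(values, drop_ratios[name])
        for name, values in task_vector.items()
    }


# Toy task vector (fine-tuned weights minus base weights), flattened per tensor.
task_vector = {
    "attn": [0.5, -0.01, 0.02, -0.4],
    "mlp": [0.1, 0.2, -0.3, 0.05],
}
# Hand-picked ratios for illustration: prune "attn" harder than "mlp".
sparse = sparsify_task_vector(task_vector, {"attn": 0.5, "mlp": 0.25})
```

With a single global ratio both tensors would lose the same share of entries; here each tensor keeps a different share, which is the degree of freedom an adaptive method would set automatically.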
[15] UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li,Boran Xiang,Xiaolong Wang,Zhinan Gou,Weizhi Ma,Yang Liu
Main category: cs.CL
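TL;DR: The paper proposes UR2, a framework that uses reinforcement learning to unify retrieval-augmented generation (RAG) with the complex reasoning trained through Reinforcement Learning from Verifiable Rewards (RLVR), addressing the limited generalization and cross-domain applicability of existing methods.
Details
Motivation: RAG and RLVR have been developed in isolation, without a unified framework, which limits generalization and cross-domain applicability. The paper aims to coordinate retrieval and reasoning dynamically through reinforcement learning and improve performance across diverse tasks.
Contribution: (1) A difficulty-aware curriculum training scheme that invokes retrieval selectively; (2) a hybrid knowledge access strategy that combines domain-specific corpora with LLM-generated summaries.
Method: UR2 is a reinforcement learning framework that dynamically coordinates retrieval and reasoning, using curriculum training and a hybrid knowledge access strategy. The experiments use Qwen2.5 and LLaMA-3.1 as base models.
Result: On open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks, UR2 significantly outperforms existing RAG and RL methods and reaches performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks.
Insight: By coordinating retrieval and reasoning dynamically, UR2 adapts well across diverse tasks and suggests a way forward for unifying RAG and reasoning.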
Abstract: Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope-typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
[16] EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations
Nizi Nazar,Ehsaneddin Asgari
Main category: cs.CL
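TL;DR: The paper proposes a unified four-layer taxonomy of emotional intelligence (EI) for large language models (LLMs), develops the EICAP-Bench benchmark, and evaluates six open-source LLMs. It finds that existing approaches are limited in emotional reasoning and that only some EI layers improve with fine-tuning.
Details
Motivation: Emotional intelligence is an important dimension of aligning LLMs with humans, yet it remains underexplored. The paper aims to fill this gap with a framework for systematically assessing and enhancing EI in LLMs.
Contribution: (1) A psychologically grounded four-layer taxonomy of EI; (2) EICAP-Bench, a multi-turn conversation benchmark; (3) an evaluation of EI in six open-source LLMs; (4) fine-tuning experiments that expose the limitations of existing approaches.
Method: (1) Define a psychologically grounded, layered taxonomy of EI; (2) build an MCQ-style multi-turn conversation benchmark; (3) evaluate several LLMs; (4) fine-tune models with LoRA adapters in English and Arabic.
Result: Qwen2.5-Instruct performs best. In the fine-tuning experiments only the Appraisal layer improves significantly; the other layers show no clear gains.
Insight: Existing pretraining and instruction-tuning paradigms fall short on deeper emotional reasoning; targeted data and modeling strategies are needed.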
Abstract: Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EmoCap-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the five EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.
[17] Classification is a RAG problem: A case study on hate speech detection
Richard Willats,Josh Pennington,Aravind Mohan,Bertie Vidgen
Main category: cs.CL
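TL;DR: The paper proposes a classification approach based on Retrieval-Augmented Generation (RAG) that recasts traditional classification as dynamic evaluation against contextual knowledge, and shows its flexibility and transparency on hate speech detection.
Details
Motivation: Existing content moderation systems must be retrained frequently to keep up with policy changes, which is costly and inflexible. The authors use RAG to classify dynamically and reduce the reliance on retraining.
Contribution: 1. Proposes a RAG-based classification approach that turns static classification into dynamic contextual evaluation;
2. Develops the Contextual Policy Engine (CPE), which supports dynamic policy updates and explainability;
3. Demonstrates effectiveness and flexibility on hate speech detection.
Method: 1. Uses a RAG framework that retrieves contextual knowledge at inference time to support classification;
2. Implements policy-based dynamic classification and explainability through the CPE system;
3. Validates the approach on a hate speech detection task.
Result: 1. Achieves classification accuracy comparable to leading commercial systems;
2. Provides explanations via retrieved policy segments;
3. Supports dynamic policy updates without retraining the model.
Insight: RAG can turn classification from a static task driven by pre-trained parameters into a dynamic task driven by contextual knowledge, offering a new option for settings such as content moderation that need flexibility and transparency.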
Abstract: Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from “is this hate speech?” to “does this violate the hate speech policy?” Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.
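The shift from "is this hate speech?" to "does this violate the hate speech policy?" can be illustrated with a minimal sketch. It is not the authors' Contextual Policy Engine: the policy segments, the word-overlap retriever, and `toy_judge` (a stand-in for an LLM judgment) are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, List, Tuple

# Hypothetical policy segments; editing this list changes behaviour without retraining.
POLICY: List[str] = [
    "Attacks on people based on religion violate the policy.",
    "Attacks on people based on nationality violate the policy.",
    "Criticism of ideas or institutions is allowed.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and strip basic punctuation from each word."""
    return {word.strip(".,!?").lower() for word in text.split()}

def retrieve(content: str, policy: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank policy segments by word overlap with the content."""
    words = tokenize(content)
    ranked = sorted(policy, key=lambda seg: len(words & tokenize(seg)), reverse=True)
    return ranked[:k]

def toy_judge(content: str, segments: List[str]) -> bool:
    """Stand-in for an LLM judgment: true if a retrieved segment is a violation rule."""
    return any("violate" in seg for seg in segments)

def classify(
    content: str,
    policy: List[str],
    judge: Callable[[str, List[str]], bool] = toy_judge,
    k: int = 1,
) -> Tuple[bool, List[str]]:
    """Return the decision together with the policy segments that justify it."""
    segments = retrieve(content, policy, k)
    return judge(content, segments), segments
```

Returning the retrieved segments alongside the decision mirrors the explainability claim in the abstract: the evidence for each decision is the policy text itself, so a policy change amounts to editing the list rather than retraining a model.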
[18] InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
Keummin Ka,Junhyeong Park,Jahyun Jeon,Youngjae Yu
Main category: cs.CL
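TL;DR: The paper introduces InfoCausalQA, a benchmark for evaluating the non-explicit causal reasoning of multimodal models over infographics, and reveals the limitations of current Vision-Language Models (VLMs) on this task.
Details
Motivation: VLMs have made notable progress in perception and reasoning, but causal reasoning, a core human cognitive ability, remains underexplored, especially in multimodal settings.
Contribution: 1. Proposes the InfoCausalQA benchmark, covering quantitative and semantic causal reasoning tasks; 2. Manually collects infographic-text pairs and builds high-quality QA pairs from them; 3. Reveals substantial shortcomings in the causal reasoning of current VLMs.
Method: 1. Designs two tasks: quantitative causal reasoning (based on numerical trends) and semantic causal reasoning (covering five types of causal relations); 2. Generates QA pairs with GPT-4o and has humans revise them to keep the questions challenging; 3. Evaluates a range of VLMs.
Result: Current VLMs show limited ability on both quantitative and semantic causal reasoning, far below human performance, which indicates a substantial gap in infographic-based causal reasoning.
Insight: Infographics pose new challenges and opportunities for multimodal causal reasoning, and current models still need to improve. InfoCausalQA can help drive progress on causal reasoning in multimodal AI systems.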
Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference – a core aspect of human cognition – remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.
[19] Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering
Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James T. Kwok,Yu Zhang
Main category: cs.CL
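TL;DR: The paper proposes DynamicTRF, a dynamic, adaptive topology representation framework that designs and selects the best-suited graph topology representation to improve the accuracy and conciseness of zero-shot graph question answering.
Details
Motivation: Existing graph QA systems typically use a single Topology Representation Form (TRF) and ignore the preferences of different tasks and models, which leads to incorrect or overly long answers. The authors propose DynamicTRF to address this.
Contribution: 1. Analyzes the characteristics and limitations of existing TRFs; 2. Designs a set of TRFs, $F_{ZS}$, tailored to zero-shot graph QA; 3. Proposes a new metric, Graph Response Efficiency (GRE); 4. Develops the DynamicTRF framework.
Method: 1. Builds the TRF Preference (TRFP) dataset, which ranks TRFs by their GRE scores; 2. Trains a TRF router that assigns the best TRF to each question at inference time; 3. Validates the approach on 7 in-domain and 2 out-of-domain tasks.
Result: Experiments show that DynamicTRF significantly improves the accuracy and conciseness of large multimodal models on zero-shot graph QA.
Insight: Dynamically choosing a graph topology representation that suits the task and model can improve graph QA performance, and the GRE metric offers a new way to assess both the efficiency and the accuracy of model answers.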
Abstract: Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those “one-size-fits-all” approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy
[20] Cyberbullying Detection via Aggression-Enhanced Prompting
Aisha Saeid,Anu Sabu,Girish A. Koushik,Ferrante Neri,Diptesh Kanojia
Main category: cs.CL
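TL;DR: The study explores whether integrating aggression detection as an auxiliary task, combined with an enriched prompting approach, improves the performance of large language models on cyberbullying detection on social networks.
Details
Motivation: Cyberbullying is expressed in varied and subtle ways, so existing methods struggle to detect it accurately. The study aims to improve detection performance and generalisation by using an aggression detection task to supply context.
Contribution: Proposes an enriched prompt pipeline that embeds aggression predictions into cyberbullying detection prompts, significantly improving detection performance.
Method: Compares several strategies (zero-shot, few-shot, independent LoRA fine-tuning, multi-task learning) and then proposes an enriched prompt pipeline. Experiments use five aggression datasets and one cyberbullying dataset.
Result: The enriched prompt pipeline outperforms standard LoRA fine-tuning in detection performance and generalisation, indicating that aggression information matters.
Insight: Auxiliary tasks such as aggression detection can give safety-critical tasks such as cyberbullying detection meaningful context and improve overall model performance.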
Abstract: Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
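The enriched prompt pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording and the two model callables (`aggression_model`, `cyberbullying_model`) are hypothetical placeholders for instruction-tuned LLM calls.

```python
from typing import Callable

def enriched_prompt(post: str, aggression_label: str) -> str:
    """Embed an auxiliary aggression prediction into the cyberbullying prompt."""
    return (
        "An auxiliary classifier labelled the following post as: "
        f"{aggression_label}.\n"
        f"Post: {post}\n"
        "Is this post cyberbullying? Answer yes or no."
    )

def detect(
    post: str,
    aggression_model: Callable[[str], str],
    cyberbullying_model: Callable[[str], str],
) -> str:
    """Run the two-stage pipeline: auxiliary prediction first, then the enriched prompt."""
    label = aggression_model(post)          # stage 1: auxiliary aggression prediction
    prompt = enriched_prompt(post, label)   # stage 2: add the prediction as context
    return cyberbullying_model(prompt)      # stage 3: final cyberbullying decision
```

Keeping the two models as separate callables reflects the pipeline framing in the abstract, where the aggression prediction is context for the second model rather than a jointly trained objective.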
[21] GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5 Team,Aohan Zeng,Xin Lv,Qinkai Zheng,Zhenyu Hou,Bin Chen,Chengxing Xie,Cunxiang Wang,Da Yin,Hao Zeng,Jiajie Zhang,Kedong Wang,Lucen Zhong,Mingdao Liu,Rui Lu,Shulin Cao,Xiaohan Zhang,Xuancheng Huang,Yao Wei,Yean Cheng,Yifan An,Yilin Niu,Yuanhao Wen,Yushi Bai,Zhengxiao Du,Zihan Wang,Zilin Zhu,Bohan Zhang,Bosi Wen,Bowen Wu,Bowen Xu,Can Huang,Casey Zhao,Changpeng Cai,Chao Yu,Chen Li,Chendi Ge,Chenghua Huang,Chenhui Zhang,Chenxi Xu,Chenzheng Zhu,Chuang Li,Congfeng Yin,Daoyan Lin,Dayong Yang,Dazhi Jiang,Ding Ai,Erle Zhu,Fei Wang,Gengzheng Pan,Guo Wang,Hailong Sun,Haitao Li,Haiyang Li,Haiyi Hu,Hanyu Zhang,Hao Peng,Hao Tai,Haoke Zhang,Haoran Wang,Haoyu Yang,He Liu,He Zhao,Hongwei Liu,Hongxi Yan,Huan Liu,Huilong Chen,Ji Li,Jiajing Zhao,Jiamin Ren,Jian Jiao,Jiani Zhao,Jianyang Yan,Jiaqi Wang,Jiayi Gui,Jiayue Zhao,Jie Liu,Jijie Li,Jing Li,Jing Lu,Jingsen Wang,Jingwei Yuan,Jingxuan Li,Jingzhao Du,Jinhua Du,Jinxin Liu,Junkai Zhi,Junli Gao,Ke Wang,Lekang Yang,Liang Xu,Lin Fan,Lindong Wu,Lintao Ding,Lu Wang,Man Zhang,Minghao Li,Minghuan Xu,Mingming Zhao,Mingshu Zhai,Pengfan Du,Qian Dong,Shangde Lei,Shangqing Tu,Shangtong Yang,Shaoyou Lu,Shijie Li,Shuang Li,Shuang-Li,Shuxun Yang,Sibo Yi,Tianshu Yu,Wei Tian,Weihan Wang,Wenbo Yu,Weng Lam Tam,Wenjie Liang,Wentao Liu,Xiao Wang,Xiaohan Jia,Xiaotao Gu,Xiaoying Ling,Xin Wang,Xing Fan,Xingru Pan,Xinyuan Zhang,Xinze Zhang,Xiuqing Fu,Xunkai Zhang,Yabo Xu,Yandong Wu,Yida Lu,Yidong Wang,Yilin Zhou,Yiming Pan,Ying Zhang,Yingli Wang,Yingru Li,Yinpei Su,Yipeng Geng,Yitong Zhu,Yongkun Yang,Yuhang Li,Yuhao Wu,Yujiang Li,Yunan Liu,Yunqing Wang,Yuntao Li,Yuxuan Zhang,Zezhen Liu,Zhen Yang,Zhengda Zhou,Zhongpei Qiao,Zhuoer Feng,Zhuorui Liu,Zichen Zhang,Zihan Wang,Zijun Yao,Zikang Wang,Ziqiang Liu,Ziwei Chai,Zixuan Li,Zuodong Zhao,Wenguang Chen,Jidong Zhai,Bin Xu,Minlie Huang,Hongning Wang,Juanzi Li,Yuxiao Dong,Jie Tang
Main category: cs.CL
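TL;DR: GLM-4.5 is an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters that supports a hybrid reasoning method. With multi-stage training and post-training, it performs strongly on agentic, reasoning, and coding tasks.
Details
Motivation: To advance research on reasoning and agentic AI systems by providing efficient open models.
Contribution: 1. Presents GLM-4.5 and its compact version, GLM-4.5-Air; 2. Performs strongly across multi-task benchmarks; 3. Releases models and code to support community research.
Method: 1. Multi-stage training on 23T tokens; 2. Post-training that combines expert model iteration and reinforcement learning; 3. A hybrid reasoning method.
Result: Scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified, ranking 3rd overall and 2nd on agentic benchmarks.
Insight: Even with fewer parameters than several competitors, a model can reach strong performance through efficient training and reasoning methods.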
Abstract: We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
[22] HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Guimin Hu,Daniel Hershcovich,Hasti Seifi
Main category: cs.CL
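TL;DR: HapticLLaMA is a multimodal sensory language model that turns haptic signals (such as vibrations) into natural language descriptions. It investigates two haptic tokenizers, one frequency-based and one EnCodec-based, and is optimized with two-stage training (supervised fine-tuning and reinforcement learning). Experiments show strong performance on haptic captioning.
Details
Motivation: Haptic signals are rarely explored in multimodal research, yet haptic captioning matters for virtual reality, accessibility, and rehabilitation applications. The authors propose HapticLLaMA to address this underexplored problem.
Contribution: 1. Formalizes the haptic captioning task; 2. Proposes HapticLLaMA, which converts haptic signals into descriptions in several categories (sensory, emotional, or associative); 3. Studies two haptic tokenizers and designs a two-stage training procedure.
Method: 1. Discretizes haptic signals with the frequency-based and EnCodec-based tokenizers; 2. Performs supervised fine-tuning on the LLaMA architecture with LoRA; 3. Further optimizes the model with RLHF.
Result: HapticLLaMA reaches a METEOR score of 59.98 and a BLEU-4 score of 32.06. Over 61% of generated captions receive human ratings above 3.5 on a 7-point scale, and RLHF yields a 10% improvement in the overall rating distribution.
Insight: Large language models can process and adapt to sensory data, including in the underexplored haptic domain, which shows the potential of multimodal models.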
Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06 respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
[23] Post-training for Efficient Communication via Convention Formation
Yilun Hua,Evan Wang,Yoav Artzi
Main category: cs.CL
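TL;DR: The paper proposes a post-training method that fine-tunes on heuristically identified demonstrations of convention formation, significantly improving the ability of LLMs to form conventions in communication, and validates it on two new benchmarks.
Details
Motivation: Humans communicate more efficiently over multi-turn interactions by adapting their language and forming ad-hoc conventions, whereas existing LLMs lack this ability. The paper aims to teach LLMs human-like convention formation through post-training.
Contribution: 1. Proposes a post-training process that improves the convention formation ability of LLMs through fine-tuning; 2. Designs two new benchmarks (an interaction benchmark and a document-grounded reference completion task) for evaluating that ability.
Method: 1. Post-training: fine-tunes on heuristically identified demonstrations of convention formation; 2. Develops two evaluation benchmarks: a cognitively-motivated interaction task and a document-grounded reference completion task.
Result: Experiments show that post-trained LLMs form conventions significantly better on both benchmarks.
Insight: With targeted fine-tuning and purpose-built evaluation tasks, LLMs can learn human-like convention formation and communicate more efficiently over multiple turns.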
Abstract: Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.
cs.CV [Back]
[24] Boosting Adversarial Transferability via Residual Perturbation Attack
Jinjia Peng,Zeze Tao,Huibing Wang,Meng Wang,Yang Wang
Main category: cs.CV
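TL;DR: The paper proposes ResPA (Residual Perturbation Attack), a new adversarial attack that uses the residual gradient as the perturbation direction to guide adversarial examples toward flat regions of the loss function, improving their transferability.
Details
Motivation: Existing adversarial attacks overlook the influence of the perturbation direction, which limits transferability. The paper aims to improve transferability by capturing changes in the global perturbation direction.
Contribution: Proposes ResPA, a novel attack that uses the residual gradient as the perturbation direction and incorporates historical gradient information, significantly improving the transferability of adversarial examples.
Method: ResPA applies an exponential moving average to the input gradients to obtain a reference gradient that encodes the direction of historical gradients, and uses the residual between the current gradient and the reference gradient to capture changes in the global perturbation direction.
Result: Experiments show that ResPA transfers better than existing typical transfer-based attacks, and that combining it with current input transformation methods improves transferability further.
Insight: Capturing global changes in the perturbation direction is key to improving transferability, and the residual gradient design in ResPA achieves this.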
Abstract: Deep neural networks are susceptible to adversarial examples while suffering from incorrect predictions via imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability to alleviate overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
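The two quantities the abstract names, the reference gradient (an exponential moving average of input gradients) and the residual between the current and reference gradients, can be written as a small sketch. The decay `beta`, the choice to update the reference before taking the residual, and the plain-list representation of gradients are assumptions made for illustration; how ResPA feeds the residual into the perturbation update is not specified in the abstract and is not reproduced here.

```python
from typing import List

def ema_update(reference: List[float], gradient: List[float], beta: float = 0.9) -> List[float]:
    """First-moment estimate: blend the previous reference with the current gradient."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(reference, gradient)]

def residual(gradient: List[float], reference: List[float]) -> List[float]:
    """Element-wise difference between the current gradient and the reference gradient."""
    return [g - m for g, m in zip(gradient, reference)]

def step(reference: List[float], gradient: List[float], beta: float = 0.9):
    """Update the reference with the new gradient, then return (reference, residual)."""
    new_reference = ema_update(reference, gradient, beta)
    return new_reference, residual(gradient, new_reference)
```

A large residual means the current gradient has moved away from the accumulated historical direction, which is the signal the abstract describes as a change in the global perturbation direction.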
[25] Generalized Few-Shot Out-of-Distribution Detection
Pinxuan Li,Bing Cao,Changqing Zhang,Qinghua Hu
Main category: cs.CV
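TL;DR: The paper proposes the Generalized Few-shot OOD Detection (GOOD) framework, which brings in an auxiliary General Knowledge Model (GKM) to improve generalization, together with a Knowledge Dynamic Embedding (KDE) mechanism that adaptively modulates the guidance of general knowledge. Experiments show its advantage.
Details
Motivation: Existing few-shot out-of-distribution (OOD) detection methods generalize poorly because of limited data and struggle with the diverse scenarios of the open world, so their generalization needs to improve.
Contribution: 1. Proposes the GOOD framework, which uses a GKM to strengthen generalization; 2. Analyzes few-shot OOD detection from a generalization perspective and theoretically derives the Generality-Specificity balance (GS-balance); 3. Designs the KDE mechanism to dynamically align the model with the general knowledge distribution.
Method: 1. Uses the GKM to capture general knowledge; 2. Theoretically derives the GS-balance, which reduces the upper bound of the generalization error; 3. KDE dynamically modulates the model's output distribution based on the Generalized Belief (G-Belief).
Result: Shows superior performance on real-world OOD detection benchmarks.
Insight: An auxiliary general knowledge model and a dynamic embedding mechanism can balance generality and specificity and improve the generalization of few-shot OOD detection.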
Abstract: Few-shot Out-of-Distribution (OOD) detection has emerged as a critical research direction in machine learning for practical deployment. Most existing Few-shot OOD detection methods suffer from insufficient generalization capability for the open world. Due to the few-shot learning paradigm, the OOD detection ability is often overfit to the limited training data itself, thus degrading the performance on generalized data and performing inconsistently across different scenarios. To address this challenge, we proposed a Generalized Few-shot OOD Detection (GOOD) framework, which empowers the general knowledge of the OOD detection model with an auxiliary General Knowledge Model (GKM), instead of directly learning from few-shot data. We proceed to reveal the few-shot OOD detection from a generalization perspective and theoretically derive the Generality-Specificity balance (GS-balance) for OOD detection, which provably reduces the upper bound of generalization error with a general knowledge model. Accordingly, we propose a Knowledge Dynamic Embedding (KDE) mechanism to adaptively modulate the guidance of general knowledge. KDE dynamically aligns the output distributions of the OOD detection model to the general knowledge model based on the Generalized Belief (G-Belief) of GKM, thereby boosting the GS-balance. Experiments on real-world OOD benchmarks demonstrate our superiority. Codes will be available.
[26] UnGuide: Learning to Forget with LoRA-Guided Diffusion Models
Agnieszka Polowczyk,Alicja Polowczyk,Dawid Malarz,Artur Kasymov,Marcin Mazur,Jacek Tabor,Przemysław Spurek
Main category: cs.CV
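TL;DR: UnGuide proposes a new method based on LoRA and a dynamic inference mechanism for precisely removing specific knowledge from diffusion models while preserving overall performance.
Details
Motivation: Recent advances in large-scale text-to-image diffusion models have raised concerns about misuse, so effective methods are needed to remove harmful or misleading content from models without affecting their other capabilities.
Contribution: UnGuide introduces the UnGuidance mechanism, which uses Classifier-Free Guidance (CFG) to dynamically control unlearning by the LoRA adapter, removing specific concepts precisely while maintaining image fidelity and realism.
Method: Combining a LoRA adapter with CFG, UnGuide adjusts the guidance scale at inference time according to the stability of the first few denoising steps, which enables selective knowledge removal.
Result: Experiments show that UnGuide outperforms existing LoRA-based methods on object erasure and explicit content removal, removing the target concept while retaining the model's generative ability.
Insight: Dynamically adjusting the guidance scale is key to precise machine unlearning, and UnGuide offers a new approach to the safe use of diffusion models.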
Abstract: Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide – a novel approach which incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of a few first steps of denoising processes, enabling selective unlearning by LoRA adapter. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
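For readers unfamiliar with the guidance scale that UnGuide modulates, the standard Classifier-Free Guidance combination is sketched below on plain lists of floats. This shows only the generic CFG formula; the rule UnGuide uses to choose the scale from the stability of early denoising steps, and how the LoRA and base model predictions are paired, are not given in the abstract and are not reproduced here.

```python
from typing import List

def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `scale = 0` the output equals the unconditional prediction, with `scale = 1` it equals the conditional prediction, and larger values extrapolate past it, so changing the scale per prompt is one lever for deciding which prediction dominates generation.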
[27] Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks
Mengyu Li,Guoyao Shen,Chad W. Farris,Xin Zhang
Main category: cs.CV
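TL;DR: The paper proposes a few-shot deployment framework based on pretrained MRI transformers for a range of brain imaging tasks. With MAE pretraining plus either a lightweight linear head or the hybrid MAE-FUnet architecture, it achieves efficient and stable performance under limited data.
Details
Motivation: Annotated data is scarce in medical imaging, which limits the real-world use of transformer models. The paper addresses this with a practical few-shot deployment approach.
Contribution: 1. Proposes an MRI transformer framework based on MAE pretraining; 2. Designs a lightweight linear head and the hybrid MAE-FUnet architecture, for high-level and low-level tasks respectively; 3. Validates efficiency and scalability on several brain imaging tasks.
Method: 1. Pretrains with the MAE strategy on a large-scale, multi-cohort brain MRI dataset (over 31 million slices); 2. For high-level tasks (such as classification), uses a frozen MAE encoder with a linear head; 3. For low-level tasks (such as segmentation), designs MAE-FUnet, which fuses multiscale CNN features with pretrained MAE embeddings.
Result: Under data-limited conditions, the approach reaches state-of-the-art accuracy in MRI sequence identification and outperforms baselines on segmentation (skull stripping and multi-class anatomical segmentation), showing efficiency and stability.
Insight: Features from MAE pretraining transfer well and, combined with lightweight designs, suit low-resource clinical environments.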
Abstract: Machine learning using transformers has shown great potential in medical imaging, but its real-world applicability remains limited due to the scarcity of annotated data. In this study, we propose a practical framework for the few-shot deployment of pretrained MRI transformers in diverse brain imaging tasks. By utilizing the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset comprising over 31 million slices, we obtain highly transferable latent representations that generalize well across tasks and datasets. For high-level tasks such as classification, a frozen MAE encoder combined with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, we propose MAE-FUnet, a hybrid architecture that fuses multiscale CNN features with pretrained MAE embeddings. This model consistently outperforms other strong baselines in both skull stripping and multi-class anatomical segmentation under data-limited conditions. With extensive quantitative and qualitative evaluations, our framework demonstrates efficiency, stability, and scalability, suggesting its suitability for low-resource clinical environments and broader neuroimaging applications.
[28] Optimization-Free Style Transfer for 3D Gaussian Splats
Raphael Du Sablon,David Hart
Main category: cs.CV
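TL;DR: The paper proposes a style transfer method for 3D Gaussian splats that requires no reconstruction or optimization. It builds a graph structure over the implicit surface, so stylization is fast and needs no additional training.
Details
Motivation: Conventional style transfer methods for 3D Gaussian splats reconstruct or fine-tune the splat, or optimize a feature extraction network, which is computationally expensive and slow. The paper aims for a fast method that needs neither optimization nor reconstruction.
Contribution: Proposes a graph-based surface style transfer method that stylizes 3D Gaussian splats without optimization or training and is substantially faster.
Method: Generates a graph structure across the implicit surface of the splat representation, applies a feed-forward, surface-based stylization method, and interpolates the result back to the individual splats.
Result: Stylization takes under 2 minutes on consumer-grade hardware, and the results are compared against other 3D Gaussian splat style transfer methods.
Insight: Using a graph over the implicit surface avoids a costly optimization process and gives an efficient route to 3D style transfer.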
Abstract: The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
[29] MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses
Jong-Ik Park,Carlee Joe-Wong,Gary K. Fedder
Main category: cs.CV
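TL;DR: MZEN proposes a multi-zoom enhanced NeRF framework that addresses the failure of conventional NeRF to capture fine structures in industrial inspection. A learnable zoom scalar and an improved pose estimation strategy substantially improve 3D reconstruction quality.
Details
Motivation: Industrial inspection needs micron-level detail, but conventional NeRF handles multi-zoom image sets (such as zoom-in images) poorly because they break multi-view consistency.
Contribution: 1. Extends the camera model with a learnable zoom scalar; 2. Proposes a staged pose estimation strategy that solves wide-field images first, then handles zoom-in images, and improves accuracy through joint refinement.
Method: 1. Adds a learnable zoom scalar to the pinhole camera model that scales the focal length; 2. Staged pose estimation: wide-field images are solved first to establish a global frame, then zoom-in images are aligned to their wide-field counterparts via crop-and-match and jointly refined.
Result: On synthetic TCAD models, real SEM images, and BLEFF objects, MZEN consistently outperforms baselines, with a reported PSNR gain of up to 28%, an SSIM gain of 10%, and an LPIPS reduction of up to 222%.
Insight: With multi-zoom handling and pose refinement, MZEN opens a path for NeRF in industrial inspection that preserves global accuracy while capturing micron-level detail.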
Abstract: Neural Radiance Fields (NeRF) methods excel at 3D reconstruction from multiple 2D images, even those taken with unknown camera poses. However, they still miss the fine-detailed structures that matter in industrial inspection, e.g., detecting sub-micron defects on a production line or analyzing chips with Scanning Electron Microscopy (SEM). In these scenarios, the sensor resolution is fixed and compute budgets are tight, so the only way to expose fine structure is to add zoom-in images; yet, this breaks the multi-view consistency that pose-free NeRF training relies on. We propose Multi-Zoom Enhanced NeRF (MZEN), the first NeRF framework that natively handles multi-zoom image sets. MZEN (i) augments the pin-hole camera model with an explicit, learnable zoom scalar that scales the focal length, and (ii) introduces a novel pose strategy: wide-field images are solved first to establish a global metric frame, and zoom-in images are then pose-primed to the nearest wide-field counterpart via a zoom-consistent crop-and-match procedure before joint refinement. Across eight forward-facing scenes (synthetic TCAD models, real SEM of micro-structures, and BLEFF objects), MZEN consistently outperforms pose-free baselines and even high-resolution variants, boosting PSNR by up to 28%, SSIM by 10%, and reducing LPIPS by up to 222%. MZEN, therefore, extends NeRF to real-world factory settings, preserving global accuracy while capturing the micron-level details essential for industrial inspection.
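The camera model change in (i) can be illustrated with a small sketch: a pinhole projection in which a zoom scalar multiplies the focal length. The function name, argument layout, and principal-point parameters are assumptions for illustration; in MZEN the zoom scalar is learned during training, whereas here it is simply a fixed argument.

```python
from typing import Tuple

def project(
    point: Tuple[float, float, float],
    focal: float,
    cx: float,
    cy: float,
    zoom: float = 1.0,
) -> Tuple[float, float]:
    """Project a camera-space 3D point to pixels with focal length scaled by `zoom`."""
    x, y, z = point
    f = zoom * focal  # the zoom scalar scales the focal length
    return (f * x / z + cx, f * y / z + cy)
```

Doubling `zoom` doubles each projected point's offset from the principal point `(cx, cy)`, which is why a single scalar per image is enough to relate a zoom-in view to its wide-field counterpart in this simplified model.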
[30] TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios
Guoping Xu,Hua-Chieh Shao,You Zhang
Main category: cs.CV
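TL;DR: TSMS-SAM2 uses multi-scale temporal sampling augmentation and a memory splitting and pruning strategy to address complex motion dynamics and memory redundancy in SAM2-based promptable video object segmentation and tracking (VOST) in surgical scenarios, significantly improving segmentation accuracy.
Details
Motivation: Rapid motion in surgical videos and memory redundancy in SAM2 limit the effectiveness of promptable VOST, so improvements are needed for complex surgical scenes.
Contribution: Proposes the TSMS-SAM2 framework, which combines multi-temporal-scale video sampling augmentation with a memory splitting and pruning mechanism and significantly improves VOST performance on surgical videos.
Method: Uses multi-temporal-scale sampling augmentation to cope with motion variability, and a memory splitting and pruning mechanism to make better use of SAM2's memory.
Result: Achieves mean Dice scores of 95.24 and 86.73 on the EndoVis2017 and EndoVis2018 datasets, respectively, outperforming existing methods.
Insight: Multi-temporal-scale sampling and memory optimization are key to improving VOST in complex surgical scenes.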
Abstract: Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework’s potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
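One plausible reading of multi-temporal-scale sampling augmentation is to draw training clips at several frame strides, so the model sees both slow and fast apparent motion. The sketch below follows that reading; the stride set, the function names, and the uniform random choices are assumptions, not details taken from the paper.

```python
import random
from typing import List, Sequence

def clip_indices(start: int, clip_len: int, stride: int) -> List[int]:
    """Frame indices of a clip that starts at `start` and skips `stride` frames per step."""
    return [start + i * stride for i in range(clip_len)]

def sample_multiscale_clip(
    num_frames: int,
    clip_len: int,
    strides: Sequence[int] = (1, 2, 4),
    rng=random,
) -> List[int]:
    """Pick a temporal stride at random, then a start frame that keeps the clip in range."""
    valid = [s for s in strides if (clip_len - 1) * s < num_frames]
    if not valid:
        raise ValueError("video is too short for the requested clip length")
    stride = rng.choice(valid)
    # The last index is start + (clip_len - 1) * stride and must stay below num_frames.
    start = rng.randrange(num_frames - (clip_len - 1) * stride)
    return clip_indices(start, clip_len, stride)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which helps when comparing augmentation settings.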
[31] Temporal Cluster Assignment for Efficient Real-Time Video Segmentation
Ka-Wai Yung,Felix J. S. Bragman,Jialang Xu,Imanol Luengo,Danail Stoyanov,Evangelos B. Mazomenos
Main category: cs.CV
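TL;DR: The paper proposes Temporal Cluster Assignment (TCA), a lightweight, fine-tuning-free strategy that refines token clustering in video segmentation using temporal correlations across frames, significantly reducing computation while retaining fine-grained detail.
Details
Motivation: The Swin Transformer is computationally expensive for video segmentation, and existing token reduction methods do not fully exploit temporal redundancy, which limits real-time use.
Contribution: Proposes TCA, which refines token clustering through temporal correlations and improves the accuracy-speed trade-off.
Method: TCA uses temporal coherence across frames to refine token clusters rather than indiscriminately dropping redundant tokens, which preserves detail.
Result: On YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset, TCA consistently improves the accuracy-speed trade-off of existing clustering-based methods.
Insight: Temporal correlation is an effective tool for refining token clustering in video segmentation, for both natural and domain-specific videos.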
Abstract: Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight and effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes competently across both natural and domain-specific videos.
[32] VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments
Kaiser Hamid,Khandakar Ashrafi Akbar,Nade Liang
Main category: cs.CV
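TL;DR: The paper proposes VISTA, a vision-language framework that models how drivers allocate and shift visual attention in dynamic environments and expresses it in natural language. It uses few-shot and zero-shot learning and fine-tunes LLaVA on the BDD-A dataset, improving the interpretability of attention prediction.
Details
Motivation: Driver visual attention prediction is critical for autonomous driving and human-computer interaction, but most existing methods rely on static images and lack interpretability. The paper generates natural language descriptions of attention allocation to make predictions more dynamic and interpretable.
Contribution: 1. Proposes one of the first natural-language frameworks for driver visual attention prediction; 2. Uses few-shot and zero-shot learning to improve the model in dynamic environments; 3. Introduces domain-specific metrics for semantic alignment and response diversity.
Method: 1. Curates high-quality captions from the BDD-A dataset with human-in-the-loop feedback; 2. Fine-tunes LLaVA, integrating low-level visual cues and top-down context (such as route semantics and risk anticipation); 3. Models attention in dynamic environments with few-shot and zero-shot learning.
Result: The fine-tuned model outperforms general-purpose vision-language models in attention shift detection and interpretability.
Insight: Natural language descriptions make attention prediction far more interpretable and suggest new directions for behavior forecasting and human-AI teaming in autonomous driving.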
Abstract: Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers’ gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few shot, and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
[33] ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates
Hamidreza Dastmalchi,Aijun An,Ali Cheraghian
Main category: cs.CV
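TL;DR: The paper proposes ETTA (Efficient Test-Time Adaptation), an efficient test-time adaptation method that uses dynamic embedding updates to improve vision-language models under distribution shift.
Details
Motivation: Existing cache-based test-time adaptation methods store only high-confidence samples and ignore the influence of other test data, which limits how well the decision boundary can be refined. Contribution: Proposes a Recursive Updating module that dynamically integrates all test samples, an Adaptive Ensemble module that reduces prompt dependency, and a confidence-based fusion of the two modules' outputs.
Method: The Recursive Updating module progressively refines the decision boundary, the Adaptive Ensemble module dynamically selects the best prompts for each class, and the two sets of scores are combined according to confidence.
Result: On two benchmarks, ETTA surpasses existing methods in both computational complexity and accuracy.
Insight: By dynamically updating embeddings and adaptively ensembling prompts, ETTA shows how to exploit all test data without a large computational burden.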
Abstract: Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses the state-of-the-art TTA models in computational complexity and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
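The idea of folding every incoming test sample into a class embedding, rather than caching a few high-confidence samples, can be illustrated with a running-mean update. This is a minimal sketch under stated assumptions, not the released ETTA code: the class name `RecursiveClassEmbeddings`, the running-mean rule, and the use of the predicted class as the update target are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length (or a copy if v is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


class RecursiveClassEmbeddings:
    """Hypothetical running-mean class prototypes (an assumption, not ETTA's exact rule)."""

    def __init__(self, text_embeddings):
        # Start from the text embeddings of each class prompt.
        self.protos = [normalize(t) for t in text_embeddings]
        self.counts = [1] * len(text_embeddings)

    def predict(self, image_embedding):
        x = normalize(image_embedding)
        scores = [dot(x, p) for p in self.protos]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, image_embedding):
        """Fold one test sample into the prototype of its predicted class."""
        x = normalize(image_embedding)
        k = self.predict(x)
        n = self.counts[k]
        mixed = [(n * p + xi) / (n + 1) for p, xi in zip(self.protos[k], x)]
        self.protos[k] = normalize(mixed)
        self.counts[k] = n + 1
        return k


model = RecursiveClassEmbeddings([[1.0, 0.0], [0.0, 1.0]])
predicted = model.update([0.9, 0.1])  # sample is assigned to class 0 and shifts its prototype
```

Memory stays constant in the number of test samples because only one vector and one counter per class are stored, which is the property the abstract describes as mimicking an unbounded cache.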
[34] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing
Zixuan Bian,Ruohan Ren,Yue Yang,Chris Callison-Burch
Main category: cs.CV
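TL;DR: HOLODECK 2.0 is a vision-language-guided framework for 3D world generation that supports interactive, feedback-driven scene editing and produces diverse 3D scenes with high semantic fidelity for both indoor and open-domain environments.
Details
Motivation: 3D scene design still depends heavily on manual effort, and existing automated methods struggle to generate open-domain scenes or support flexible editing, so there is a need to generate high-quality 3D worlds directly from text. Contribution: HOLODECK 2.0 combines vision-language models with 3D generative models to produce semantically coherent and physically plausible scene layouts, with support for interactive editing.
Method: Uses vision-language models to identify and parse the objects a scene requires, generates the corresponding assets with 3D generative models, and iteratively applies spatial constraints to optimize the layout.
Result: In human evaluations and CLIP-based assessments, HOLODECK 2.0 consistently outperforms baselines in scene quality and supports flexible editing of layout and style.
Insight: Combining vision-language models with 3D generation enables semantics-driven scene design and offers a practical tool for games and virtual reality.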
Abstract: 3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.
[35] Enhancing Construction Site Analysis and Understanding with 3D Segmentation
Sri Ramana Saketh Vasanthawada,Pengkun Liu,Pingbo Tang
Main category: cs.CV
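TL;DR: The paper examines two 3D segmentation methods (SAM and Mask3D) in complex construction scenes, reveals the limitations of existing methods outdoors, and argues for tailored segmentation workflows.
Details
Motivation: Monitoring construction progress is resource-intensive, and traditional methods perform poorly in complex, changing construction environments, so efficient and scalable computer vision solutions are needed. Contribution: Evaluates how well SAM and Mask3D adapt to real construction scenes and highlights the absence of outdoor segmentation benchmarks, supporting progress toward automated monitoring.
Method: Compares SAM and Mask3D in indoor and outdoor construction scenes, focusing on model adaptability and performance gaps.
Result: The study exposes the limitations of existing models in outdoor environments and underlines the importance of tailored segmentation workflows.
Insight: The complexity and dynamics of construction scenes call for segmentation methods that are more adaptable and more targeted, and more outdoor benchmark datasets are needed.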
Abstract: Monitoring construction progress is crucial yet resource-intensive, prompting the exploration of computer-vision-based methodologies for enhanced efficiency and scalability. Traditional data acquisition methods, primarily focusing on indoor environments, falter in construction site’s complex, cluttered, and dynamically changing conditions. This paper critically evaluates the application of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging outdoor and indoor conditions. Trained initially on indoor datasets, both models’ adaptability and performance are assessed in real-world construction settings, highlighting the gap in current segmentation approaches due to the absence of benchmarks for outdoor scenarios. Through a comparative analysis, this study not only showcases the relative effectiveness of SAM and Mask3D but also addresses the critical need for tailored segmentation workflows capable of extracting actionable insights from construction site data, thereby advancing the field towards more automated and precise monitoring techniques.
[36] A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image
Yanxing Liang,Yinghui Wang,Jinlong Yang,Wei Li
Main category: cs.CV
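TL;DR: The paper proposes SINGAD, a framework that uses a diffusion model guided by 3D Gaussian Splatting to perform self-supervised normal estimation from a single image, addressing multi-view geometric inconsistency and data dependency.
Details
Motivation: Normal estimation from a single image lacks spatial dimensional information. Existing methods rely on data-driven statistical priors and do not model light-surface interaction explicitly, which leads to conflicting normals across views. The discrete sampling mechanism of diffusion models also causes gradient discontinuity, so 3D geometric errors cannot be backpropagated. Contribution: 1) Proposes the SINGAD framework, which combines physics-driven light-interaction modeling with a differentiable rendering-based reprojection strategy to turn 3D geometric errors directly into normal optimization signals; 2) Designs a cross-domain feature fusion module that embeds geometric priors to constrain normal generation; 3) Introduces a differentiable 3D reprojection loss for self-supervised optimization.
Method: 1) Builds a light-interaction-driven 3D Gaussian Splatting (3DGS) reparameterization model that generates multi-scale geometric features consistent with light transport principles; 2) Designs a cross-domain feature fusion module inside a conditional diffusion model; 3) Proposes a differentiable 3D reprojection loss strategy.
Result: Quantitative evaluation on the Google Scanned Objects dataset shows that SINGAD outperforms existing methods on multiple metrics.
Insight: Physical modeling and differentiable rendering address gradient discontinuity and multi-view consistency, offering a new approach to self-supervised normal estimation.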
Abstract: The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Although recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input images, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
[37] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin,Jaemin Cho,Amir Zadeh,Chuan Li,Mohit Bansal
Main category: cs.CV
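TL;DR: Bifrost-1 is a unified framework that uses CLIP image embeddings as latent variables to connect pretrained multimodal large language models (MLLMs) with diffusion models, enabling efficient, high-fidelity controllable image generation.
Details
Motivation: Existing methods that directly train LLMs, or that bridge LLMs and diffusion models, incur high training costs because the backbone LLMs have not seen image representations during pretraining. The work aims to address this challenge. Contribution: 1. Proposes the Bifrost-1 framework, which efficiently connects MLLMs and diffusion models through patch-level CLIP image embeddings used as latent variables; 2. Achieves high-fidelity image generation while retaining the multimodal reasoning capabilities of MLLMs; 3. Substantially lowers training cost through a lightweight ControlNet adaptation.
Method: 1. Uses patch-level CLIP image embeddings as latent variables; 2. Integrates the embeddings into the diffusion model through a lightweight ControlNet; 3. Adds a visual generation branch to the MLLM to predict the patch-level embeddings.
Result: Experiments show that Bifrost-1 matches or exceeds existing methods in visual fidelity and multimodal understanding with substantially lower training compute.
Insight: Cross-modal alignment from pretrained models such as CLIP can efficiently bridge generation and reasoning across modalities while reducing the training burden.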
Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM’s CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.
[38] PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
Zhihao Zhu,Yifan Zheng,Siyu Pan,Yaohui Jin,Yao Mu
Main category: cs.CV
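TL;DR: PASG is a closed-loop framework that uses geometric primitive extraction and semantic anchoring to bridge the gap between high-level semantics and low-level geometric features in robotic manipulation, and uses a VLM to model dynamic semantic-affordance relationships.
Details
Motivation: The fragmentation between high-level task semantics and low-level geometric features limits the flexibility and automation of robotic manipulation systems. Existing methods rely on manual annotation and cannot dynamically couple semantics with geometry. Contribution: 1. Proposes an automatic geometric primitive extraction method; 2. Introduces VLM-driven dynamic semantic anchoring; 3. Builds a spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA).
Method: 1. Extracts primitives through geometric feature aggregation; 2. Uses a VLM to dynamically bind geometric primitives to functional affordances and task descriptions; 3. Validates effectiveness within a closed-loop framework.
Result: Across diverse robotic manipulation tasks, PASG performs comparably to manual annotation and achieves finer-grained semantic-affordance understanding.
Insight: By combining a closed-loop framework with a VLM, PASG offers a unified paradigm for modeling geometric primitives together with task semantics, which may improve the automation and flexibility of robotic manipulation.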
Abstract: The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG’s effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
[39] AnimateScene: Camera-controllable Animation in Any Scene
Qingyang Liu,Bingjie Gao,Weiheng Huang,Jun Zhang,Zhongqian Sun,Yang Wei,Zelin Peng,Qianli Ma,Shuai Yang,Zhaohe Liao,Haonan Zhao,Li Niu
Main category: cs.CV
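TL;DR: AnimateScene proposes a unified framework for seamlessly integrating 4D human animation into 3D scenes, covering placement, style alignment, and camera trajectory control.
Details
Motivation: Combining 3D scene reconstruction with 4D human animation raises challenges in placement, lighting and style alignment, and camera trajectory control, which calls for a unified solution. Contribution: 1. Proposes an accurate 3D human placement module; 2. Develops a training-free style alignment method for the 4D human; 3. Designs a post-reconstruction method that supports inserting camera trajectories.
Method: 1. The placement module avoids interpenetration; 2. The style alignment method adapts the human to the background's lighting and style; 3. The post-reconstruction method supports camera trajectory control.
Result: Experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence.
Insight: Handling the fusion of 3D scenes and 4D humans within one framework offers a new approach to dynamic scene rendering.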
Abstract: 3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. However, seamlessly integrating reconstructed scenes with 4D human animation to produce visually engaging results remains challenging. One key difficulty lies in placing the human at the correct location and scale within the scene while avoiding unrealistic interpenetration. Another challenge is that the human and the background may exhibit different lighting and style, leading to unrealistic composites. In addition, appealing character motion videos are often accompanied by camera movements, which means that the viewpoints need to be reconstructed along a specified trajectory. We present AnimateScene, which addresses the above issues in a unified framework. First, we design an accurate placement module that automatically determines a plausible 3D position for the human and prevents any interpenetration within the scene during motion. Second, we propose a training-free style alignment method that adapts the 4D human representation to match the background’s lighting and style, achieving coherent visual integration. Finally, we design a joint post-reconstruction method for both the 4D human and the 3D scene that allows camera trajectories to be inserted, enabling the final rendered video to feature visually appealing camera movements. Extensive experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations.
[40] ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung,Hyoungseob Park,Patrick Rim,Xiaoran Zhang,Jihe He,Ziyao Zeng,Safa Cicek,Byung-Woo Hong,James S. Duncan,Alex Wong
Main category: cs.CV
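TL;DR: The paper proposes Energy-based Test-time Adaptation (ETA) for pretrained depth completion models, targeting prediction errors on out-of-distribution data caused by covariate shift. It explores the data space with adversarial perturbations, trains an energy model, and updates model parameters at test time to minimize energy, which improves prediction accuracy.
Details
Motivation: Depth completion models make erroneous predictions when transferred from source to target data because of covariate shift. Existing methods usually assume prior knowledge of the target distribution, which is hard to obtain in practice. Contribution: Proposes a test-time adaptation method (ETA) that requires no prior knowledge of the target data and aligns distributions through adversarial perturbations and an energy model, improving the robustness of depth completion.
Method: Uses adversarial perturbations to explore the data space and trains an energy model that scores local regions of depth predictions as in- or out-of-distribution. At test time, model parameters are updated to minimize energy.
Result: Across three indoor and three outdoor datasets, ETA improves over the previous state-of-the-art method by an average of 10.23% indoors and 6.94% outdoors.
Insight: Adapting to the test distribution through an energy model avoids strong assumptions about the target data and offers a new approach to out-of-distribution adaptation.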
Abstract: We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some "source" data, often predict erroneous outputs when transferred to "target" data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method "Energy-based Test-time Adaptation", or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
[41] Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision
Haichao Wang,Xinyue Xi,Jiangtao Wen,Yuxing Han
Main category: cs.CV
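TL;DR: This paper proposes an efficient video computer vision system that saves front-end computation by consuming Bayer-format data directly, and introduces a fast block-matching motion estimation algorithm with a context-aware motion vector (MV) refinement module, yielding a substantial efficiency gain.
Details
Motivation: Existing video computer vision systems do not fully reduce the high temporal redundancy in video and neglect the front-end computation overhead. Contribution: 1. Uses Bayer-format data directly, saving front-end computation;
2. Proposes a fast block-matching motion estimation algorithm and a context-aware refinement module;
3. Introduces a frame selection strategy to balance accuracy and efficiency.
Method: 1. Removes the image signal processor and feeds Bayer data directly to the model;
2. Applies block-matching motion estimation with an MV refinement module;
3. Uses a context-aware block refinement network to correct regions with large error.
Result: Achieves significant acceleration on multiple tasks with only a slight performance loss.
Insight: Optimizing the front-end input and the motion estimation method can substantially improve the efficiency of video computer vision systems.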
Abstract: The efficiency of video computer vision systems remains a challenge due to the high temporal redundancy inside a video. Existing works have been proposed for efficient video computer vision. However, they do not fully reduce the temporal redundancy and neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is fed directly into video computer vision models, thus saving the front-end computation. Second, instead of optical flow models and video codecs, a fast block matching-based motion estimation algorithm is proposed specifically for efficient video computer vision, with a motion vector (MV) refinement module. To correct the error, a context-aware block refinement network is introduced to refine regions with large error. To further balance accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with slight performance loss.
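Block-matching motion estimation, the building block named in the abstract, can be sketched with an exhaustive search that minimizes the sum of absolute differences (SAD). This is the textbook full-search variant, shown only to make the idea concrete; it is not the paper's fast algorithm, and the function names `sad` and `block_match`, the block size, and the search range are assumptions.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def block_match(cur, ref, block=4, search=2):
    """Return {(block_y, block_x): (dy, dx)} pointing from each block in the
    current frame to its best match in the reference frame (full search)."""
    h, w = len(cur), len(cur[0])
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            # Start from the zero-motion candidate.
            best = (sad(cur, ref, by, bx, by, bx, block), 0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = by + dy, bx + dx
                    if 0 <= ry <= h - block and 0 <= rx <= w - block:
                        cost = sad(cur, ref, by, bx, ry, rx, block)
                        if cost < best[0]:
                            best = (cost, dy, dx)
            vectors[(by, bx)] = (best[1], best[2])
    return vectors


def pattern(y, x):
    """Synthetic intensity pattern used only for this demo."""
    return y * 10 + x * x


# The current frame is the reference frame shifted one pixel to the right.
ref = [[pattern(y, x) for x in range(8)] for y in range(8)]
cur = [[pattern(y, x - 1) for x in range(8)] for y in range(8)]
vectors = block_match(cur, ref, block=4, search=2)
```

For the blocks in the right half of the frame the search recovers the vector `(0, -1)`, meaning the matching content sits one pixel to the left in the reference frame. Full search costs on the order of `(2 * search + 1) ** 2` SAD evaluations per block, which is the cost that fast block-matching schemes aim to reduce.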
[42] ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge
Juewen Hu,Yexin Li,Jiulin Li,Shuo Chen,Pring Wong
Main category: cs.CV
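TL;DR: The paper proposes ECMF, an enhanced cross-modal fusion framework for multimodal emotion recognition in the MER-SEMI challenge, which improves performance through a dual-branch visual encoder, context-enriched text processing, and a multi-source labeling strategy.
Details
Motivation: Emotion recognition is vital for enhancing human-computer interaction. To address data scarcity and multimodal fusion in the MER-SEMI challenge, the authors propose the ECMF framework. Contribution: 1. Designs a dual-branch visual encoder that captures global and local features; 2. Proposes a context-enriched text processing method; 3. Introduces a cross-modal fusion strategy built on self-attention and residual connections; 4. Applies a multi-source labeling strategy to refine noisy labels.
Method: 1. Extracts visual, audio, and textual features with large-scale pretrained models; 2. Captures global and local features with the dual-branch visual encoder; 3. Enriches emotional cues in text using large language models; 4. Fuses modalities with self-attention and residual connections; 5. Refines the training set with the multi-source labeling strategy.
Result: On the MER2025-SEMI dataset, the weighted F-score reaches 87.49%, well above the official baseline (78.63%).
Insight: Pretrained models and a cross-modal fusion strategy can effectively improve multimodal emotion recognition, and refining noisy labels also plays a key role.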
Abstract: Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
[43] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
Jun Feng,Zixin Wang,Zhentao Zhang,Yue Guo,Zhihan Zhou,Xiuyi Chen,Zhenyang Li,Dawei Yin
Main category: cs.CV
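TL;DR: MathReal is a real-scene benchmark for math reasoning that evaluates multimodal large language models in realistic K-12 educational settings, using 2,000 math questions photographed in authentic conditions to expose the limitations of current models.
Details
Motivation: Existing math reasoning benchmarks are mostly built on clean or processed multimodal inputs and lack the complex image inputs found in real educational scenarios, so they cannot fully assess how models perform in practice. Contribution: Introduces the MathReal dataset of 2,000 photographed math questions, classified into three main challenge categories (image quality degradation, perspective variation, and irrelevant content interference) and five core knowledge categories, and designs six experimental settings for systematic evaluation.
Method: MathReal combines real-scene image categorization, multi-level question design (knowledge category, question type, difficulty level), and six experimental settings to test the mathematical reasoning of multimodal large language models.
Result: Experiments show that existing models are significantly challenged in realistic educational contexts, with clear weaknesses in recognition, comprehension, and reasoning.
Insight: MathReal exposes the weaknesses of multimodal large language models in practical use and points to directions for improvement, especially adaptation to complex real-world scenes.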
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image, containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
[44] ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors
Minsu Kim,Subin Jeon,In Cho,Mijin Yoo,Seon Joo Kim
Main category: cs.CV
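TL;DR: ExploreGS proposes a pipeline based on 3D Gaussian Splatting (3DGS) that improves reconstruction quality through virtual camera sampling and diffusion priors, addressing failures of existing methods at viewpoints that deviate from the training trajectory.
Details
Motivation: Existing 3DGS-based novel view synthesis (NVS) methods produce artifacts and missing regions at viewpoints that deviate from the training trajectory, which limits seamless scene exploration. Contribution: 1) Proposes an information-gain-driven virtual camera placement strategy to maximize scene coverage;
2) Uses video diffusion priors to refine rendered results;
3) Introduces the Wild-Explore benchmark for evaluating challenging scene exploration.
Method: Generates additional training views through virtual camera sampling, places cameras according to information gain, refines renderings with diffusion priors, and finally fine-tunes the 3D Gaussians.
Result: Experiments show that ExploreGS outperforms existing 3DGS-based methods and supports high-quality, artifact-free rendering from arbitrary viewpoints.
Insight: Combining information-driven camera sampling with diffusion priors can markedly improve the robustness and rendering quality of 3DGS in complex scenes.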
Abstract: Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints. https://exploregs.github.io
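One simple way to picture information-gain-driven camera placement is a greedy loop that repeatedly picks the candidate view revealing the most not-yet-covered scene cells. The sketch below is only an illustration under that assumption: the abstract does not define its information-gain measure, so "gain" here is a hypothetical count of newly covered cells, and the function and camera names are made up for the example.

```python
def greedy_camera_selection(candidates, already_seen, budget):
    """Pick up to `budget` cameras, each time taking the one that adds the most
    unseen cells. `candidates` maps a camera name to the set of cell ids it sees."""
    covered = set(already_seen)
    remaining = dict(candidates)
    chosen = []
    for _ in range(budget):
        best_name, best_gain = None, 0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = len(remaining[name] - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no candidate adds new coverage
            break
        chosen.append(best_name)
        covered |= remaining.pop(best_name)
    return chosen, covered


# Cells 0-2 are already visible from the training trajectory.
training_coverage = {0, 1, 2}
virtual_cameras = {
    "cam_a": {1, 2, 3},
    "cam_b": {3, 4, 5, 6},
    "cam_c": {6, 7},
    "cam_d": {0, 1},
}
chosen, covered = greedy_camera_selection(virtual_cameras, training_coverage, budget=2)
```

With these inputs the loop selects `cam_b` first (four new cells) and then `cam_c` (one new cell). A view such as `cam_d`, which only re-observes what the training trajectory already covers, is never chosen, which matches the stated goal of maximizing scene coverage.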
[45] Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis
Utku Ozbulak,Michaela Cohrs,Hristo L. Svilenov,Joris Vankerschaver,Wesley De Neve
Main category: cs.CV
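TL;DR: The paper proposes a generative-AI-based image synthesis method that uses a diffusion model to generate high-fidelity particle images, addressing data imbalance in flow imaging microscopy and improving multi-class classifier performance.
Details
Motivation: In flow imaging microscopy, multi-class classifiers are limited by data scarcity and class imbalance (for example, silicone oil and air bubble particles are far rarer than protein particles), and traditional methods cope poorly with this. Contribution: (1) Develops a diffusion model that generates highly realistic particle images to augment training data; (2) Verifies that the generated images improve classification performance; (3) Publicly releases the models and code to support open research.
Method: Uses a diffusion model to generate synthetic images for data augmentation, checks the realism of the generated images in terms of visual quality and structure, and runs large-scale experiments on a validation set of 500,000 protein particle images.
Result: Experiments show that using diffusion-generated images improves classification performance with no notable downside.
Insight: Generative AI can effectively address data imbalance, particularly when samples are scarce, by synthesizing high-quality data that improves model performance.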
Abstract: Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with no negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.
[46] Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts
Kiran Chhatre,Christopher Peters,Srikrishna Karanam
Main category: cs.CV
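TL;DR: The paper proposes Spectrum, a unified network built on 3D texture-aware representations for parsing diverse clothing and body parts. It repurposes a fine-tuned Image-to-Texture (I2Tx) diffusion model to improve semantic alignment with clothing and body parts, and outperforms baselines in cross-dataset experiments.
Details
Motivation: Existing human parsing methods usually use fixed mask categories and cannot distinguish fine-grained clothing types or detailed body parts. Open-vocabulary segmentation methods leverage features from pretrained text-to-image (T2I) diffusion models, but they typically group an entire human into a single "person" category and fail to separate diverse clothing or body parts. Contribution: 1. Proposes Spectrum, a unified network based on 3D texture-aware representations for part-level pixel parsing (body parts and clothing) and instance-level grouping.
2. Improves semantic alignment with clothing and body parts by fine-tuning a T2I model into an Image-to-Texture (I2Tx) diffusion model.
3. Validates the advantage of Spectrum in prompt-guided segmentation across multiple datasets.
Method: 1. Fine-tunes a T2I diffusion model on 3D human texture maps to obtain the I2Tx diffusion model.
2. Extracts human-part features from the input image and generates semantically valid masks through prompt-guided grounding.
3. Produces semantic segmentation masks covering every visible body part and clothing category.
Result: In cross-dataset experiments, Spectrum outperforms baseline methods on body part parsing, clothing part segmentation, unseen clothing category segmentation, and full-body mask generation.
Insight: 1. 3D texture generators stay more faithful to the input image than general diffusion models, which yields stronger features for human parsing.
2. Fine-tuning a T2I model to focus on 3D texture generation can improve performance on a specific task such as human parsing.
3. Prompt-guided segmentation works well in open-vocabulary settings, especially for distinguishing diverse clothing and body parts.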
Abstract: Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model – obtained by fine-tuning a T2I model on 3D human texture maps – for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments – separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks – and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.
[47] More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
Jun Xie,Yingjian Zhu,Feng Chen,Zhenghao Zhang,Xiaohui Fan,Hongzhu Yi,Xinming Wang,Chen Yu,Yue Bi,Zhaoran Zhao,Xiongjun Guan,Zhepeng Wang
Main category: cs.CV
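TL;DR: This paper proposes an emotion recognition framework based on Mixture of Experts (MoE) that integrates multimodal inputs with a pseudo-labeling strategy and combines a voting ensemble with re-ranking to improve performance on a semi-supervised learning task.
Details
Motivation: In current emotion recognition tasks, a single modality carries too little information and semi-supervised data is underused. The paper uses a multi-expert system and pseudo-label refinement to improve robustness and alignment with human preferences. Contribution: 1) Proposes an MoE framework that integrates multimodal experts (such as VLM and AU signals); 2) Introduces a consensus-based pseudo-labeling strategy to make better use of unlabeled data; 3) Uses multi-expert voting with rule-based re-ranking to better align predictions.
Method: 1) Treats each input modality as an independent expert; 2) Generates pseudo-labels from the agreement between a baseline model and Gemini; 3) Trains in two stages; 4) Applies a voting ensemble followed by rule-based re-ranking.
Result: Achieves an F1-score of 0.8772 on the MER2025-SEMI test set, ranking 2nd in the track.
Insight: Multimodal integration and pseudo-labeling improve semi-supervised learning performance, and voting with re-ranking helps reduce prediction bias.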
Abstract: In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that “more is better,” to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
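The two label-handling steps described above, keeping a pseudo-label only when two annotators agree and combining expert predictions by vote, can be sketched in a few lines. This is a hedged illustration, not the released code: the function names, the sample ids, and the tie-breaking rule (first expert listed wins) are assumptions, and the rule-based re-ranking step is omitted because the abstract does not describe its rules.

```python
from collections import Counter


def consensus_pseudo_labels(baseline_preds, reference_preds):
    """Keep a pseudo-label only for samples where both sources agree.
    Both arguments map a sample id to a predicted label."""
    return {
        sample_id: label
        for sample_id, label in baseline_preds.items()
        if reference_preds.get(sample_id) == label
    }


def majority_vote(expert_preds):
    """Return the most common label; ties go to the earliest-listed expert
    (a hypothetical tie-break chosen for this sketch)."""
    counts = Counter(expert_preds)
    top = max(counts.values())
    for label in expert_preds:
        if counts[label] == top:
            return label


baseline = {"clip1": "happy", "clip2": "sad", "clip3": "angry"}
reference = {"clip1": "happy", "clip2": "neutral", "clip3": "angry"}
pseudo = consensus_pseudo_labels(baseline, reference)  # clip2 is dropped

final_label = majority_vote(["sad", "happy", "happy"])
```

Here `pseudo` keeps `clip1` and `clip3`, and `final_label` is `"happy"`. Filtering by agreement trades coverage for label quality: fewer unlabeled samples are used, but the ones kept are less likely to inject noise into the second training stage.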
[48] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang,Jushi Kai,Haoli Bai,Lu Hou,Bo Jiang,Ziwei He,Zhouhan Lin
Main category: cs.CV
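TL;DR: Fourier-VLM compresses visual features in the frequency domain, substantially reducing the computational overhead and inference latency of VLMs while maintaining performance.
Details
Motivation: Existing vision-language models (VLMs) incur high computational overhead and inference latency because of the large number of visual features. Prior approaches reduce the feature count by selecting important features or introducing learnable queries, but they usually sacrifice performance or add extra cost. Contribution: Proposes Fourier-VLM, an efficient method that compresses visual features in the frequency domain. Based on the observation that visual features concentrate their energy in low-frequency components, it applies low-pass filtering using the two-dimensional discrete cosine transform (DCT) and the fast Fourier transform (FFT), reducing computational overhead and inference time.
Method: Compresses visual features in the frequency domain with a 2D DCT and FFT, keeping only the low-frequency components. The method has O(n log n) time complexity and introduces no additional parameters.
Result: Performs strongly on multiple image benchmarks and generalizes well; compared with LLaVA-v1.5, inference FLOPs are reduced by 83.8% and generation speed is improved by 31.2%.
Insight: The concentration of visual features in low-frequency components gives a natural basis for efficient compression, making frequency-domain compression a practical solution that needs no extra parameters and does not sacrifice performance.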
Abstract: Vision-Language Models (VLMs) typically replace the predefined image placeholder token (
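The low-pass idea in the summary above can be made concrete with a small sketch: transform a grid of feature values with a 2D DCT, keep only the top-left (low-frequency) coefficients, and invert on the smaller grid so that fewer tokens remain. This is an illustration under stated assumptions, not the paper's implementation: it handles a single feature channel, uses a naive matrix DCT rather than an FFT-based O(n log n) one, and the amplitude rescaling factor and all function names are choices made for this example.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are frequencies)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(grid):
    """2D DCT-II of a rectangular grid: D_rows @ grid @ D_cols^T."""
    return matmul(matmul(dct_matrix(len(grid)), grid),
                  transpose(dct_matrix(len(grid[0]))))


def lowpass_compress(grid, keep_h, keep_w):
    """Keep the keep_h x keep_w lowest frequencies and invert on the smaller
    grid, turning H*W values into keep_h*keep_w values."""
    h, w = len(grid), len(grid[0])
    coeffs = dct2(grid)
    low = [row[:keep_w] for row in coeffs[:keep_h]]
    # Inverse of the orthonormal DCT on the reduced grid: D_h^T @ low @ D_w.
    small = matmul(matmul(transpose(dct_matrix(keep_h)), low), dct_matrix(keep_w))
    # Rescale so that amplitudes are preserved (assumption made for this sketch).
    gain = math.sqrt((keep_h * keep_w) / (h * w))
    return [[gain * v for v in row] for row in small]


# One channel of a 4x4 token grid, compressed to 2x2 (16 tokens -> 4 tokens).
features = [[2.0] * 4 for _ in range(4)]
compressed = lowpass_compress(features, 2, 2)
```

A constant grid carries only the zero-frequency component, so every entry of `compressed` stays at approximately 2.0 after the token count drops from 16 to 4. In a real model each token is a vector, so the same transform would be applied independently to every channel.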
[49] NEP: Autoregressive Image Editing via Next Editing Token Prediction
Huimin Wu,Xiaojian Ma,Haozhe Zhao,Yanpeng Zhao,Qing Li
Main category: cs.CV
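TL;DR: The paper proposes NEP, a new image editing method that uses autoregressive generation to regenerate only the regions that need editing, avoiding the unnecessary computation and degraded edit quality of existing methods, and achieves the best performance on zero-shot editing tasks.
Details
Motivation: Existing text-guided image editing methods typically generate the entire target image rather than only the regions to be edited, wasting computation and introducing a reconstruction bias toward non-edited regions. NEP aims to address this.
Contribution: Proposes NEP, which builds on an autoregressive image generation framework to selectively regenerate only the regions to be edited; pre-trains an any-order autoregressive text-to-image model that enables zero-shot editing.
Method: Builds on an autoregressive image generation framework and performs local editing via Next Editing-token Prediction; the pre-trained T2I model supports any-order editing.
Result: Achieves a new SOTA on widely used image editing benchmarks and supports zero-shot test-time scaling (TTS).
Insight: Selective editing not only improves computational efficiency but also avoids reconstruction errors in non-edited regions, while the pre-trained T2I model provides the foundation for zero-shot editing.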
Abstract: Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerating only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/
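The core idea of regenerating only the tokens inside the editing region can be sketched with a toy loop. Everything here is illustrative: `predict_token` is a hypothetical stand-in for an any-order autoregressive model, and the integer token lists stand in for image tokens.

```python
def edit_tokens(source_tokens, edit_mask, predict_token):
    """Regenerate only the positions flagged in edit_mask; copy the rest unchanged."""
    result = list(source_tokens)
    for pos, needs_edit in enumerate(edit_mask):
        if needs_edit:
            # The predictor sees the current partial result, so earlier edits
            # can condition later ones.
            result[pos] = predict_token(result, pos)
    return result


def toy_predictor(tokens, pos):
    """Hypothetical model: always emits token 99."""
    return 99


edited = edit_tokens([1, 2, 3, 4], [False, True, False, True], toy_predictor)
```

The number of model calls scales with the size of the edited region rather than the whole image, and unflagged tokens are guaranteed to be untouched.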
[50] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning
Linhan Cao,Wei Sun,Weixia Zhang,Xiangyang Zhu,Jun Jia,Kaiwei Zhang,Dandan Zhu,Guangtao Zhai,Xiongkuo Min
Main category: cs.CV
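TL;DR: The paper proposes VQAThinker, a reasoning-based video quality assessment framework that uses large multimodal models and reinforcement learning to improve generalization and explainability.
Details
Motivation: Existing video quality assessment (VQA) models fall short in generalization and explainability, which limits their use in real-world scenarios.
Contribution: Proposes the VQAThinker framework, which combines large multimodal models with reinforcement learning, and designs three VQA-specific reinforcement learning rewards, significantly improving generalization and explainability.
Method: Uses the group relative policy optimization (GRPO) algorithm with three rewards: a bell-shaped regression reward, a pairwise ranking reward, and a temporal consistency reward.
Result: Achieves state-of-the-art performance on in-domain and out-of-domain VQA benchmarks, and excels at quality understanding and description tasks.
Insight: Reinforcement learning combined with large multimodal models offers an effective path to building generalizable and explainable VQA models using only score-level supervision.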
Abstract: Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
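Two of the rewards and the group-relative normalization used by GRPO can be sketched as below. The exact reward formulas are not given in this entry, so the Gaussian shape, the `sigma` value, and the binary ranking reward are assumptions chosen to match the verbal description; the advantage computation follows the standard GRPO form of subtracting the group mean and dividing by the group standard deviation.

```python
import math
from statistics import mean, pstdev


def bell_reward(pred, target, sigma=0.5):
    """Assumed Gaussian-shaped reward: 1.0 at the target, flat near it, decaying with error."""
    return math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))


def pairwise_ranking_reward(pred_a, pred_b, mos_a, mos_b):
    """1.0 when the predicted order of two videos matches the ground-truth order."""
    return 1.0 if (pred_a - pred_b) * (mos_a - mos_b) > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Four sampled score predictions for a video whose ground-truth score is 3.0.
group_rewards = [bell_reward(p, 3.0) for p in (2.0, 2.8, 3.1, 4.5)]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group average receive positive advantages, so no separate value network is needed to provide a baseline.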
[51] LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer’s disease, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing
Wonjung Park,Suhyun Ahn,Jinah Park
Main category: cs.CV
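TL;DR: LV-Net is a new framework that produces individualized 3D lateral ventricle (LV) meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh, reducing boundary segmentation artifacts and improving reconstruction robustness.
Details
Motivation: LV shape analysis has great potential as a biomarker for neurological diseases, but large inter-individual shape variability and limited MRI resolution make segmentation difficult, so more robust methods are needed.
Contribution: Proposes the LV-Net framework, which uses an anatomy-aware template mesh and vertex classification to improve the accuracy and cross-dataset robustness of LV shape modeling.
Method: Deforms a joint LV-hippocampus template mesh to incorporate anatomical relationships and reduce segmentation artifacts; classifies template vertices to improve point correspondence across subjects.
Result: LV-Net achieves superior reconstruction accuracy, stays robust even when segmentation is imperfect, and identifies LV subregions significantly associated with Alzheimer's disease.
Insight: Combining anatomical priors with vertex classification significantly improves the reliability of shape modeling and provides a new tool for research on disease-related biomarkers.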
Abstract: Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer’s disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The codes for LV shape modeling are available at https://github.com/PWonjung/LV_Shape_Modeling.
[52] AGI for the Earth, the path, possibilities and how to evaluate intelligence of models that work with Earth Observation Data?
Mojtaba Valipour,Kelly Zheng,James Lowman,Spencer Szabados,Mike Gartner,Bobby Braswell
Main category: cs.CV
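TL;DR: This paper explores the potential and challenges of AGI in handling Earth observation data and calls for a more comprehensive benchmark to evaluate how well models generalize in this domain.
Details
Motivation: Progress toward AGI requires a comprehensive understanding of multimodal data, yet satellite spectral imagery, an important modality, has not received enough attention. The paper aims to promote the use of Earth observation data in AGI and to address the limitations of existing benchmarks.
Contribution: 1. Highlights the importance of Earth observation data for AGI; 2. Analyzes the shortcomings of existing benchmarks; 3. Proposes a comprehensive set of tasks as a future evaluation standard.
Method: Through a literature review and problem analysis, identifies the unique challenges of Earth observation data and designs a task set for evaluating model generalization.
Result: Existing benchmarks are limited for Earth observation data; a more comprehensive evaluation standard is needed to assess model capabilities fully.
Insight: Earth observation data holds great potential for AGI research, but it requires evaluation methods designed for its specific characteristics to advance the field.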
Abstract: Artificial General Intelligence (AGI) is closer than ever to becoming a reality, sparking widespread enthusiasm in the research community to collect and work with various modalities, including text, image, video, and audio. Despite recent efforts, satellite spectral imagery, as an additional modality, has yet to receive the attention it deserves. This area presents unique challenges, but also holds great promise in advancing the capabilities of AGI in understanding the natural world. In this paper, we argue why Earth Observation data is useful for an intelligent model, and then we review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. This paper emphasizes the need for a more comprehensive benchmark to evaluate earth observation models. To facilitate this, we propose a comprehensive set of tasks that a benchmark should encompass to effectively assess a model’s ability to understand and interact with Earth observation data.
[53] Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention
Shiyang Zhou,Haijin Zeng,Yunfan Lu,Yongyong Chen,Jie Liu,Jingyong Su
Main category: cs.CV
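TL;DR: The paper proposes TSANet, a lightweight network that addresses demosaicing for Quad Bayer HybridEVS cameras through two-stage state space augmented cross-attention, significantly outperforming existing methods at lower computational cost.
Details
Motivation: Event cameras such as HybridEVS show promise for mobile photography, but combining them with a Quad Bayer CFA sensor causes artifacts and aliasing during demosaicing. Current methods perform poorly on resource-limited mobile devices.
Contribution: Proposes TSANet, which achieves efficient demosaicing through a two-stage design (event-pixel inpainting, then demosaicing) and a lightweight Cross-Swin State Block.
Method: Uses a two-stage network that handles event-pixel inpainting and demosaicing separately; designs a Cross-Swin State Block that exploits a positional prior and strengthens global dependencies with a state space model.
Result: Experiments on seven datasets show that TSANet outperforms the existing method DemosaicFormer in PSNR and SSIM while reducing parameter and computation costs by 1.86× and 3.29×, respectively.
Insight: Through task decomposition and lightweight design, TSANet offers a new approach to efficient image demosaicing on mobile devices.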
Abstract: Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous “events” instead of frames, offering advanced application on mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts on the demosaicing process before downstream application. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce TSANet, a lightweight Two-stage network via State space augmented cross-Attention, which can handle event pixels inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes positional prior for demosaicing and enhances global dependencies through the state space model with linear complexity. In summary, TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than the previous state-of-the-art method DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while respectively reducing parameter and computation costs by 1.86× and 3.29×. Our approach presents new possibilities for efficient image demosaicing on mobile devices. Code is available in the supplementary materials.
[54] Distribution-Specific Learning for Joint Salient and Camouflaged Object Detection
Chao Hao,Zitong Yu,Xin Liu,Yuhao Wang,Weicheng Xie,Jingang Shi,Huanjing Yue,Jingyu Yang
Main category: cs.CV
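TL;DR: This paper proposes SCJoint, a joint learning framework that handles the seemingly contradictory tasks of salient object detection (SOD) and camouflaged object detection (COD) together, improving performance through task-specific distribution learning.
Details
Motivation: Salient object detection and camouflaged object detection are related but contradictory tasks. The conventional view holds that joint learning degrades performance, but the paper argues that with the right learning approach the two can benefit each other.
Contribution: Proposes the SCJoint framework, which decouples the contradictory attributes of the two tasks within a shared network by learning task-specific distributions (means and variances). It also designs a saliency-based sampling strategy (SBSS) to improve training-set balance and quality.
Method: Inserts a small number of task-specific learnable parameters into a fully shared network to learn the means and variances of the decoding process. Uses SBSS to balance the training-set sizes and quality of the two tasks.
Result: Proposes the JoNet network, which captures both salient and camouflaged objects; experiments demonstrate its competitive performance and effectiveness.
Insight: Jointly learning salient and camouflaged object detection is not only feasible but mutually beneficial; the key lies in decoupling task-specific distributions and optimizing the training set.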
Abstract: Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strong contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which has the ability to simultaneously capture both “salient” and “camouflaged”. Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at https://github.com/linuxsino/JoNet.
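One way to picture "task-specific means and variances inside a fully shared network" is a shared normalization followed by a per-task scale and shift. The sketch below is an interpretation, not the SCJoint implementation: the parameter values are hypothetical constants (in practice they would be learned), and where exactly such parameters are inserted in the decoder is not specified in this entry.

```python
from statistics import mean, pstdev

# Hypothetical per-task parameters; the only task-specific state in this sketch.
TASK_PARAMS = {
    "sod": {"shift": 0.5, "scale": 2.0},
    "cod": {"shift": -0.5, "scale": 0.5},
}


def normalize(features, eps=1e-5):
    """Shared step: zero-mean, unit-variance features."""
    mu, sd = mean(features), pstdev(features)
    return [(f - mu) / (sd + eps) for f in features]


def task_specific_decode(features, task):
    """Re-scale and re-center the shared features with the chosen task's parameters."""
    params = TASK_PARAMS[task]
    return [params["scale"] * f + params["shift"] for f in normalize(features)]


shared = [1.0, 2.0, 3.0, 4.0]
sod_features = task_specific_decode(shared, "sod")
cod_features = task_specific_decode(shared, "cod")
```

Only two numbers per task differ here, which mirrors the abstract's claim that the tasks are decoupled at minimal parameter cost.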
[55] Can Large Models Fool the Eye? A New Turing Test for Biological Animation
Zijian Chen,Lirong Deng,Zhengyu Chen,Kaiwei Zhang,Qi Jia,Yuan Tian,Yucheng Zhu,Guangtao Zhai
Main category: cs.CV
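TL;DR: The paper proposes BioMotion Arena, a new framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) through visual animation; it uses point-light source imaging to amplify performance differences between models, and analysis of human votes demonstrates its effectiveness.
Details
Motivation: Current benchmarks for evaluating large models are limited: they rely either on ground-truth scoring over static datasets or on indistinct chatbot-style collection of human preferences. Neither gives intuitive, easily perceptible feedback on performance differences.
Contribution: Proposes the BioMotion Arena framework, which uses point-light animation and human voting to differentiate the performance of 53 mainstream LLMs and MLLMs across 90 biological motion variants.
Method: Generates biological motion animations with point-light source imaging, evaluates them through pairwise comparison, and collects more than 45k human votes, comparing model-generated animations with real biological motion.
Result: More than 90% of the evaluated models, including the cutting-edge open-source InternVL3 and the proprietary Claude-4 series, fail to produce basic humanoid point-light groups or smooth, biologically plausible motion.
Insight: BioMotion Arena is both a challenging benchmark for performance visualization and a flexible evaluation framework that is not constrained by ground truth, offering a new perspective on model evaluation.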
Abstract: Evaluating the abilities of large models and manifesting their gaps are challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct textual chatbot-style human preferences collection, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of motion patterns characteristic of living organisms that utilizes point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground-truth.
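Pairwise human votes have to be aggregated into a ranking somehow. The entry does not say which aggregation BioMotion Arena uses, so the sketch below shows the simplest option, a per-model win rate; the vote format and model names are made up for illustration.

```python
from collections import defaultdict


def win_rates(votes):
    """votes: iterable of (winner, loser) pairs, one per human comparison."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def rank_models(votes):
    """Order models by win rate (descending), breaking ties by name."""
    rates = win_rates(votes)
    return sorted(rates, key=lambda m: (-rates[m], m))


# Hypothetical votes between three models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ranking = rank_models(votes)
```

Win rate ignores opponent strength; rating systems that account for it are a common alternative when matchups are unevenly distributed.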
[56] DreamVE: Unified Instruction-based Image and Video Editing
Bin Xia,Jiyang Liu,Yuechen Zhang,Bohao Peng,Ruihang Chu,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
Main category: cs.CV
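TL;DR: DreamVE is a unified instruction-based image and video editing model that achieves efficient editing through a two-stage training strategy (image first, then video) and diverse data synthesis (combining collage-based and generative-model-based data).
Details
Motivation: Instruction-based editing has great potential thanks to its simple and efficient interaction format, but video editing has been held back by insufficient data. DreamVE aims to address this.
Contribution: 1. Proposes a two-stage training strategy (image to video) in which efficient training on image data provides priors for video editing; 2. Proposes a data synthesis approach combining collage-based and generative-model-based pipelines to improve editing diversity and accuracy; 3. Designs an editing framework built on a SOTA T2V model to ensure consistency and editability.
Method: 1. Two-stage training (image-editing pre-training followed by video-editing fine-tuning); 2. Diverse data synthesis combining collage-based and generative-model-based pipelines; 3. A T2V-based editing framework that injects source-image guidance through token concatenation with early drop.
Result: DreamVE performs well on key editing types and shows strong generalization and transfer; collage-based data improves diversity, and generative-model-based data compensates for weak attribute editing.
Insight: Image data provides an efficient prior for video editing, collage-based and generative-model-based data complement each other, and editing-framework design must balance consistency and flexibility.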
Abstract: Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. Besides, we design an efficient and powerful editing framework for DreamVE. We build on the SOTA T2V model and use a token concatenation with early drop approach to inject source image guidance, ensuring strong consistency and editability. The codes and models will be released.
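Collage-based synthesis, as described above, builds an editing pair by compositing the same foreground onto a background in two different ways. A toy version for an object-move edit might look like this; the grid-of-integers image format, the function names, and the instruction template are all assumptions for illustration, and no bounds checking is done.

```python
def paste(background, patch, top, left):
    """Return a copy of background with patch composited at (top, left)."""
    out = [row[:] for row in background]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            out[top + dy][left + dx] = value
    return out


def make_move_pair(background, patch, src_pos, dst_pos, name="object"):
    """Build (source image, target image, instruction) for an object-move edit."""
    source = paste(background, patch, *src_pos)
    target = paste(background, patch, *dst_pos)
    instruction = f"Move the {name} from {src_pos} to {dst_pos}."
    return source, target, instruction


# 3x3 empty background and a 1x1 "object" with pixel value 1.
background = [[0] * 3 for _ in range(3)]
source, target, instruction = make_move_pair(background, [[1]], (0, 0), (2, 2))
```

Because the target is composited rather than generated, the pair is pixel-accurate outside the edited region, which is what makes this kind of data cheap to scale.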
[57] SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
Yanxiao Sun,Jiafu Wu,Yun Cao,Chengming Xu,Yabiao Wang,Weijian Cao,Donghao Luo,Chengjie Wang,Yanwei Fu
Main category: cs.CV
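TL;DR: SwiftVideo is a unified video generation framework that achieves high-quality video generation with few inference steps through trajectory-distribution alignment, combining the strengths of trajectory preservation and distribution matching.
Details
Motivation: Current diffusion-based or flow-based video generation models require many iterative sampling steps, which is computationally expensive. Existing distillation methods suffer performance degradation or more artifacts in few-step settings, so stability needs to improve.
Contribution: Proposes SwiftVideo, which combines trajectory-preserving and distribution-matching strategies and introduces continuous-time consistency distillation and dual-perspective alignment (distribution and trajectory alignment) for efficient few-step video generation.
Method: Uses continuous-time consistency distillation to preserve ODE trajectories precisely, and proposes dual-perspective alignment (distribution alignment between synthetic and real data, and trajectory alignment across different inference steps).
Result: On the OpenVid-1M benchmark, SwiftVideo significantly outperforms existing methods, generating high-quality video with far fewer inference steps.
Insight: Jointly aligning trajectories and distributions effectively balances efficiency and quality, offering a new approach to few-step video generation.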
Abstract: Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.
[58] AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Weichen Zhang,Zhui Zhu,Ningbo Li,Kebin Liu,Yunhao Liu
Main category: cs.CV
TL;DR: AdaptInfer is a dynamic token pruning framework for vision-language models (VLMs) that leverages layer-wise text-to-text attention maps for adaptive pruning, improving inference efficiency without significant accuracy loss.
Details
Motivation: Vision-language models (VLMs) face high inference costs due to processing a large number of vision tokens. Existing pruning methods fail to utilize dynamic internal signals during inference.
Contribution: 1. A fine-grained, dynamic text-guided pruning mechanism. 2. Identification of cross-modal attention shifts for a more principled pruning schedule. 3. A lightweight, plug-and-play framework generalizable across tasks.
Method: AdaptInfer uses layer-wise text-to-text attention maps to score vision tokens dynamically and identifies consistent inflection points in attention shifts to optimize pruning schedules.
Result: The method reduces CUDA latency by 61.3% while maintaining 92.9% accuracy on LLaVA-1.5-7B. It outperforms state-of-the-art methods under the same token budget.
Insight: Dynamic pruning guided by internal attention signals is more effective than static approaches, enabling efficient inference without sacrificing model performance.
Abstract: Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering (VQA), but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight and plug-and-play, also generalizable across multi-modal tasks. Experimental results have verified the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.
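The text-guided scoring step can be sketched as follows: text-to-text attention gives a soft importance prior over text tokens, that prior weights each text token's attention over vision tokens, and the lowest-scoring vision tokens are dropped. This is a simplified reading of the abstract; the averaging rule, the function names, and the toy attention matrices are assumptions, and the layer-wise pruning schedule is not modeled.

```python
def text_token_prior(text_to_text_attn):
    """Importance of each text token: attention it receives, averaged over query tokens."""
    n_queries = len(text_to_text_attn)
    n_keys = len(text_to_text_attn[0])
    return [sum(row[j] for row in text_to_text_attn) / n_queries for j in range(n_keys)]


def score_vision_tokens(text_to_vision_attn, prior):
    """Weight each text token's attention over vision tokens by its prior importance."""
    n_vision = len(text_to_vision_attn[0])
    return [
        sum(prior[t] * text_to_vision_attn[t][v] for t in range(len(prior)))
        for v in range(n_vision)
    ]


def prune(scores, keep):
    """Indices of the `keep` highest-scoring vision tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:keep]
    return sorted(top)


# Toy attention maps: 2 text tokens, 3 vision tokens (rows sum to 1).
t2t = [[0.9, 0.1], [0.8, 0.2]]
t2v = [[0.1, 0.7, 0.2], [0.6, 0.2, 0.2]]
prior = text_token_prior(t2t)
scores = score_vision_tokens(t2v, prior)
kept = prune(scores, keep=2)
```

Vision token 0 is favored only by the less important text token, so it is the one pruned even though its raw attention from that token is high.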
[59] Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Yachun Mi,Yu Li,Yanting Li,Shixin Sun,Chen Hui,Tong Zhang,Yuanyuan Liu,Chenyue Song,Shaohui Liu
Main category: cs.CV
TL;DR: Q-CLIP is a novel Vision-Language Model (VLM) based framework for Video Quality Assessment (VQA), utilizing a lightweight Shared Cross-Modal Adapter (SCMA) and quality-level prompts to achieve efficient and accurate quality assessment.
Details
Motivation: Traditional VQA methods rely on pretraining on large datasets, which is computationally expensive and lacks focus on quality-specific factors. VLMs offer generalization potential but have not been fully explored for VQA.
Contribution: 1) First fully VLM-based VQA framework (Q-CLIP). 2) Lightweight SCMA with minimal trainable parameters reduces computational cost. 3) Introduction of quality-level prompts to enhance sensitivity to video quality. 4) Analysis of frame sampling strategies for better generalization.
Method: 1) Uses a Shared Cross-Modal Adapter (SCMA) to enhance visual and textual representations with minimal trainable parameters. 2) Introduces learnable quality-level prompts to guide VLMs in perceiving quality variations. 3) Evaluates frame-difference-based sampling for improved VQA performance.
Result: Q-CLIP achieves excellent performance on multiple VQA datasets, demonstrating the effectiveness of VLMs in this domain while reducing computational costs.
Insight: Leveraging VLMs with lightweight adapters and task-specific prompts can bridge the gap between general vision-language tasks and specialized tasks like VQA, offering a scalable and efficient solution.
Abstract: Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model’s sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
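A common way to turn a handful of quality-level prompts into a single score is to take a softmax over the video-to-prompt similarities and compute the expected level. Whether Q-CLIP does exactly this is not stated in the abstract, so treat the sketch below as one plausible scheme: the level values 1 to 5, the temperature, and the toy embeddings are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quality_score(video_emb, prompt_embs, levels=(1, 2, 3, 4, 5), temperature=0.07):
    """Expected quality level under a softmax over video-to-prompt similarities."""
    logits = [cosine(video_emb, p) / temperature for p in prompt_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return sum(level * e / total for level, e in zip(levels, exps))


# Toy one-hot prompt embeddings for five quality levels (worst to best).
prompts = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
best_like = quality_score([0.0, 0.0, 0.0, 0.0, 1.0], prompts)
ambiguous = quality_score([1.0, 1.0, 1.0, 1.0, 1.0], prompts)
```

A video embedding equidistant from all five prompts lands on the middle level, while one aligned with the top prompt scores close to 5.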
[60] UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization
Yachun Mi,Xingyang He,Shixin Sun,Yu Li,Yanting Li,Zhixuan Li,Jian Jin,Chen Hui,Shaohui Liu
Main category: cs.CV
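TL;DR: The paper proposes UGD-IML, a diffusion-based generative framework that for the first time unifies the image manipulation localization (IML) and constrained image manipulation localization (CIML) tasks in a single framework, reduces reliance on large-scale annotated data, and demonstrates superior performance in experiments.
Details
Motivation: Advanced image editing tools in the digital age threaten the integrity of visual content, and current methods rely on large annotated datasets and are inefficient, so a new, efficient framework is needed.
Contribution: 1. Unifies the IML and CIML tasks in a single generative framework for the first time; 2. Reduces reliance on large-scale annotated data through diffusion models; 3. Introduces a class embedding mechanism and a parameter-sharing design so that switching tasks incurs no extra overhead.
Method: Designs a diffusion-based generative framework that uses a class embedding mechanism and parameter sharing to switch seamlessly between IML and CIML, and simplifies the data annotation process through an end-to-end design.
Result: Across multiple datasets, UGD-IML improves F1 over SOTA methods by an average of 9.66 on IML and 4.36 on CIML, and also excels in uncertainty estimation, visualization, and robustness.
Insight: By learning the data distribution, diffusion models can effectively reduce dependence on annotated data, and a unified generative framework provides an efficient solution for handling multiple tasks.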
Abstract: In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.
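Since the headline numbers above are F1 scores over manipulation masks, a minimal pixel-level F1 computation is sketched here for reference. The binary nested-list mask format is an assumption, as is the convention of returning 1.0 when both masks are empty; the paper's exact evaluation protocol (thresholding, per-image versus dataset-level averaging) is not described in this entry.

```python
def f1_score(pred_mask, true_mask):
    """Pixel-level F1 between two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
score = f1_score(pred, truth)
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so F1 is 0.5.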
[61] MCA: 2D-3D Retrieval with Noisy Labels via Multi-level Adaptive Correction and Alignment
Gui Zou,Chaofan Gan,Chern Hong Lim,Supavadee Aramvith,Weiyao Lin
Main category: cs.CV
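TL;DR: MCA is a multi-level adaptive correction and alignment framework for handling noisy labels in 2D-3D cross-modal retrieval; it improves performance through multimodal joint label correction and a multi-level adaptive alignment strategy.
Details
Motivation: Existing cross-modal retrieval methods tend to overfit under noisy labels, so a robust solution is needed.
Contribution: Proposes the MCA framework, which combines Multimodal Joint label Correction (MJC) with a Multi-level Adaptive Alignment (MAA) strategy and significantly improves cross-modal retrieval under noisy labels.
Method: 1. MJC uses multimodal historical self-predictions to jointly model the consistency of modality predictions; 2. MAA strengthens the semantics and discriminability of cross-modal features through alignment at multiple levels.
Result: MCA achieves SOTA performance on both conventional and realistic noisy 3D benchmarks.
Insight: Jointly exploiting multimodal information and aligning hierarchically are effective ways to improve the robustness of cross-modal retrieval under noisy labels.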
Abstract: With the increasing availability of 2D and 3D data, significant advancements have been made in the field of cross-modal retrieval. Nevertheless, the existence of imperfect annotations presents considerable challenges, demanding robust solutions for 2D-3D cross-modal retrieval in the presence of noisy label conditions. Existing methods generally address the issue of noise by dividing samples independently within each modality, making them susceptible to overfitting on corrupted labels. To address these issues, we propose a robust 2D-3D Multi-level cross-modal adaptive Correction and Alignment framework (MCA). Specifically, we introduce a Multimodal Joint label Correction (MJC) mechanism that leverages multimodal historical self-predictions to jointly model the modality prediction consistency, enabling reliable label refinement. Additionally, we propose a Multi-level Adaptive Alignment (MAA) strategy to effectively enhance cross-modal feature semantics and discrimination across different levels. Extensive experiments demonstrate the superiority of our method, MCA, which achieves state-of-the-art performance on both conventional and realistic noisy 3D benchmarks, highlighting its generality and effectiveness.
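The idea of correcting a label only when both modalities have consistently agreed over training can be sketched as below. This is an illustrative rule, not the MJC mechanism itself: the averaging of per-epoch probabilities, the agreement test, and the 0.8 confidence threshold are assumptions.

```python
def average_history(history):
    """Average a list of per-epoch class-probability vectors."""
    n = len(history)
    return [sum(step[c] for step in history) / n for c in range(len(history[0]))]


def correct_label(noisy_label, history_2d, history_3d, threshold=0.8):
    """Replace the label only if both modalities agree with high joint confidence."""
    p2d = average_history(history_2d)
    p3d = average_history(history_3d)
    joint = [(a + b) / 2 for a, b in zip(p2d, p3d)]
    top_2d = max(range(len(p2d)), key=p2d.__getitem__)
    top_3d = max(range(len(p3d)), key=p3d.__getitem__)
    if top_2d == top_3d and joint[top_2d] >= threshold:
        return top_2d
    return noisy_label


# A sample labeled class 0 whose image and point-cloud branches both favor class 1.
relabeled = correct_label(0, [[0.1, 0.9], [0.2, 0.8]], [[0.1, 0.9], [0.1, 0.9]])
# The same sample when the 3D branch disagrees: the original label is kept.
unchanged = correct_label(0, [[0.1, 0.9], [0.2, 0.8]], [[0.9, 0.1]])
```

Requiring cross-modal agreement is what distinguishes this from per-modality sample division, which the abstract identifies as prone to overfitting corrupted labels.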
[62] Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
Shree Mitra,Ritabrata Chakraborty,Nilkanta Sahu
Main category: cs.CV
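TL;DR: The paper proposes a self-supervised learning framework for handwritten mathematical expression recognition that does not require expensive labeled data. It pre-trains an image encoder with global and local contrastive losses, trains a self-supervised attention network with a progressive spatial masking strategy, and finally generates LaTeX sequences through supervised fine-tuning. Experiments show that it outperforms existing methods.
Details
Motivation: Handwritten mathematical expression recognition (HMER) is challenging because of its two-dimensional structure, varying symbol scales, and complex spatial relationships. Existing methods depend on large amounts of labeled data, which is costly. The paper aims to reduce this dependence through self-supervised learning and improve recognition performance.
Contribution: 1. Proposes a self-supervised pre-training method combining global and local contrastive losses; 2. Designs a progressive spatial masking strategy to train a self-supervised attention network that learns semantically meaningful focus regions without supervision; 3. Outperforms existing methods on the CROHME benchmarks.
Method: 1. Pre-trains the image encoder with global and local contrastive losses; 2. Trains the self-supervised attention network with progressive spatial masking, improving robustness to missing or occluded information; 3. Performs supervised fine-tuning with a Transformer decoder to generate LaTeX sequences.
Result: On the CROHME benchmarks, the method outperforms existing self-supervised and fully supervised baselines, validating the effectiveness of the progressive attention mechanism.
Insight: With a progressive masking strategy, self-supervised learning can effectively learn the semantic structure of mathematical expressions, reducing dependence on labeled data while improving understanding of complex spatial relationships.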
Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
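A progressive masking curriculum can be as simple as a mask ratio that grows with the epoch, combined with random patch selection. The sketch below assumes a linear schedule from 10% to 60% and uniform random patch sampling; the actual schedule and masking pattern used in the paper are not given in this entry.

```python
import random


def mask_ratio(epoch, total_epochs, start=0.1, end=0.6):
    """Assumed linear curriculum: fraction of patches masked at a given epoch."""
    if total_epochs <= 1:
        return end
    progress = epoch / (total_epochs - 1)
    return start + progress * (end - start)


def sample_patch_mask(grid_h, grid_w, ratio, rng):
    """Boolean grid with round(ratio * patches) randomly chosen patches masked (True)."""
    n_patches = grid_h * grid_w
    masked = set(rng.sample(range(n_patches), round(ratio * n_patches)))
    return [[(r * grid_w + c) in masked for c in range(grid_w)] for r in range(grid_h)]


rng = random.Random(0)  # seeded for reproducibility
early_mask = sample_patch_mask(4, 5, mask_ratio(0, 10), rng)
late_mask = sample_patch_mask(4, 5, mask_ratio(9, 10), rng)
```

Early epochs hide only a couple of patches, while late epochs hide more than half, which pushes the network to rely on structural context rather than any single symbol.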
[63] FMCE-Net++: Feature Map Convergence Evaluation and Training
Zhibo Zhu,Renyu Huang,Lei He
Main category: cs.CV
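TL;DR: FMCE-Net++ is a new training framework that integrates a pretrained FMCE-Net as an auxiliary head and dynamically balances the classification loss against feature convergence optimization, improving model performance.
Details
Motivation: The internal representations of deep neural networks (DNNs) are opaque, and existing Feature Map Convergence Evaluation (FMCE) lacks experimental validation and closed-loop integration, so it cannot dynamically optimize training. Contribution: Proposes the FMCE-Net++ framework, which uses a Representation Auxiliary Loss (RAL) to dynamically optimize feature convergence and improves performance without architectural changes or additional data.
Method: A pretrained FMCE-Net serves as an auxiliary head that produces Feature Map Convergence Scores (FMCS); the RAL dynamically balances the classification loss and feature convergence optimization.
Result: Performance improves on several datasets (such as MNIST and CIFAR-10); for example, ResNet-50 on CIFAR-10 gains 1.16 percentage points in accuracy.
Insight: FMCE-Net++ closes the loop between feature convergence evaluation and training, offering a new direction for model optimization.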
Abstract: Deep Neural Networks (DNNs) face interpretability challenges due to their opaque internal representations. While Feature Map Convergence Evaluation (FMCE) quantifies module-level convergence via Feature Map Convergence Scores (FMCS), it lacks experimental validation and closed-loop integration. To address this limitation, we propose FMCE-Net++, a novel training framework that integrates a pretrained, frozen FMCE-Net as an auxiliary head. This module generates FMCS predictions, which, combined with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL dynamically balances the primary classification loss and feature convergence optimization via a tunable Representation Abstraction Factor. Extensive experiments conducted on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 demonstrate that FMCE-Net++ consistently enhances model performance without architectural modifications or additional data. Key experimental outcomes include accuracy gains of +1.16 pp (ResNet-50/CIFAR-10) and +1.08 pp (ShuffleNet v2/CIFAR-100), validating that FMCE-Net++ can effectively elevate state-of-the-art performance ceilings.
[64] GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Jian Wang,Chaokang Jiang,Haitao Xu
Main category: cs.CV
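TL;DR: GMF-Drive is a new end-to-end autonomous driving framework that combines a geometrically augmented LiDAR representation with a hierarchical gated Mamba fusion architecture, addressing the limitations of Transformer-based models and achieving higher performance and efficiency.
Details
Motivation: Existing diffusion-based end-to-end driving methods rely on Transformers for feature fusion, but quadratic computational complexity and a lack of spatial priors limit further gains. Contribution: 1. A geometrically augmented LiDAR representation that preserves 3D geometric detail; 2. A hierarchical gated Mamba fusion (GM-Fusion) architecture that replaces the Transformer with an efficient state-space model (SSM).
Method: 1. Use the geometrically augmented LiDAR representation; 2. Design BEV-SSM, which uses directional sequencing and adaptive fusion to capture long-range dependencies with linear complexity while explicitly modeling the spatial properties of the driving scene.
Result: On the NAVSIM benchmark, GMF-Drive outperforms DiffusionDrive and sets a new state of the art.
Insight: Task-specific state-space models can surpass general-purpose Transformers in both performance and efficiency for autonomous driving, and geometrically augmented representations are important for 3D scene modeling.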
Abstract: Diffusion-based models are redefining the state-of-the-art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird’s Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.
[65] SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation
Weichen Zhang,Kebin Liu,Fan Dang,Zhui Zhu,Xikai Sun,Yunhao Liu
Main category: cs.CV
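TL;DR: SynSeg is a new weakly supervised method that uses Multi-Category Contrastive Learning (MCCL) and a Feature Synergy Structure (FSS) to address semantic misalignment in open-vocabulary semantic segmentation, improving performance.
Details
Motivation: In open-vocabulary semantic segmentation, the breadth and granularity of semantic categories pose major challenges. Existing methods rely on category-specific supervision and ill-suited feature construction, which leads to semantic misalignment and poor performance. Contribution: Proposes MCCL and FSS, which combine intra- and inter-category alignment and separation to improve semantic localization and discrimination.
Method: The MCCL strategy performs intra- and inter-category semantic alignment, and FSS reconstructs discriminative features through prior fusion and semantic-activation-map enhancement.
Result: Outperforms state-of-the-art methods on multiple benchmarks, for example by 4.5% on VOC and 8.9% on Context.
Insight: By combining category relationships with feature reconstruction, SynSeg provides a more robust training signal for weakly supervised semantic segmentation.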
Abstract: Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the knowledge of correlations from different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. In general, SynSeg effectively improves the abilities in semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) performance. For instance, SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on Context, 2.6% on Object and 2.0% on City.
[66] SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
Lin Zhang,Xianfang Zeng,Kangcong Li,Gang Yu,Tao Chen
Main category: cs.CV
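TL;DR: SC-Captioner is a reinforcement learning framework for self-correcting image captioning; its reward function incentivizes accurate caption corrections and improves caption quality.
Details
Motivation: Existing image captioning models lack self-correction ability, which can lead to inaccurate descriptions. SC-Captioner addresses this with a reinforcement learning framework. Contribution: 1. A reward function design based on scene-graph parsing; 2. Evaluation metrics refined from CAPTURE; 3. RefinedCaps, a dataset with fine-grained annotations.
Method: Scene-graph parsing decomposes captions into object, attribute, and relation sets; rewards are computed from set differences to incentivize self-correction.
Result: Experiments show that SC-Captioner significantly outperforms the direct preference optimization training strategy and produces more accurate captions.
Insight: Reinforcement learning with a fine-grained reward design can substantially improve both the self-correction ability and the output quality of image captioning models.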
Abstract: We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
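The set-difference reward described in the abstract can be sketched for a single element type (objects, attributes, or relations). This is a minimal illustration under stated assumptions: elements are matched by exact string equality, and the `bonus` and `penalty` weights are hypothetical. The paper's scene-graph parser, matching rule, and actual weighting are not reproduced here.

```python
def self_correction_reward(initial, corrected, reference, bonus=1.0, penalty=1.0):
    """Reward a caption correction from set differences (illustrative weights)."""
    added = corrected - initial      # elements the correction introduced
    removed = initial - corrected    # elements the correction dropped

    correct_additions = added & reference    # added and present in the reference
    wrong_additions = added - reference      # added but absent from the reference
    correct_removals = removed - reference   # dropped and indeed not in the reference
    wrong_removals = removed & reference     # dropped although the reference has it

    gains = len(correct_additions) + len(correct_removals)
    losses = len(wrong_additions) + len(wrong_removals)
    return bonus * gains - penalty * losses

# Hypothetical object sets parsed from three captions of the same image.
initial = {"dog", "frisbee", "car"}
corrected = {"dog", "frisbee", "grass", "cat"}
reference = {"dog", "frisbee", "grass"}

reward = self_correction_reward(initial, corrected, reference)
print(reward)  # 1.0: two good edits ("grass" added, "car" removed), one bad ("cat" added)
```

Elements that stay unchanged between the initial and corrected captions contribute nothing, so the reward reflects only what the correction step actually did.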
[67] SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures
Yi Qin,Rui Wang,Tao Huang,Tong Xiao,Liping Jing
Main category: cs.CV
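TL;DR: The paper proposes VeSCA, which exploits vulnerabilities in the SAM encoder to generate transferable adversarial examples, using a parametric simplicial complex to strengthen the attack; experiments show a 12.7% performance improvement.
Details
Motivation: Vulnerabilities in the Segment Anything Model (SAM) may affect downstream applications. Existing attacks transfer poorly, so the vulnerable regions that SAM shares with downstream models need to be explored. Contribution: Proposes the Vertex-Refining Simplicial Complex Attack (VeSCA), which generates transferable adversarial examples through a parametric simplicial complex and improves attack effectiveness.
Method: VeSCA identifies vulnerable regions through iterative vertex-wise refinement, applies a lightweight domain re-adaptation strategy, and generates adversarial examples by random simplicial complex sampling.
Result: Across five domain-specific datasets, VeSCA improves performance by 12.7% over existing methods, exposing the risk that SAM poses to downstream models.
Insight: SAM's vulnerabilities are a systemic risk; more robust foundation models are needed to limit the impact on downstream applications.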
Abstract: While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of numerous downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often present limited transferability due to insufficient exploration of common weakness across domains. To address this, we propose Vertex-Refining Simplicial Complex Attack (VeSCA), a novel method that leverages only the encoder of SAM for generating transferable adversarial examples. Specifically, it achieves this by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data during the initialization of simplicial complex. Ultimately, VeSCA generates consistently transferable adversarial examples through random simplicial complex sampling. Extensive experiments demonstrate that VeSCA achieves performance improved by 12.7% compared to state-of-the-art methods across three downstream model categories across five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM’s vulnerabilities and emphasize the urgency of developing more robust foundation models.
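The final sampling step can be illustrated for a single simplex. The sketch assumes that each vertex is a perturbation vector and that a sample is a random convex combination of the vertices; the vertex values, the budget `epsilon`, and the function name are hypothetical. The paper's parametric complex, vertex refinement, and domain re-adaptation are not modeled.

```python
import numpy as np

def sample_from_simplex(vertices, rng):
    """Draw a point uniformly from the simplex spanned by the rows of `vertices`."""
    # Dirichlet(1, ..., 1) weights are non-negative, sum to 1, and are uniform
    # over barycentric coordinates.
    weights = rng.dirichlet(np.ones(len(vertices)))
    return weights @ vertices

rng = np.random.default_rng(0)
epsilon = 8 / 255  # hypothetical L-infinity perturbation budget
# Three hypothetical vertex perturbations over a 4-value "image".
vertices = rng.uniform(-epsilon, epsilon, size=(3, 4))

delta = sample_from_simplex(vertices, rng)
print(delta.shape, bool(np.all(np.abs(delta) <= epsilon)))
```

Because an L-infinity ball is convex, any convex combination of in-budget vertices stays within the same budget, so sampled perturbations need no extra clipping in this simplified setting.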
[68] Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation
YoungChan Choi,HengFei Wang,YiHua Cheng,Boeun Kim,Hyung Jin Chang,YoungGeun Choi,Sang-Il Choi
Main category: cs.CV
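TL;DR: This paper proposes a gaze redirection framework based on an explicit 3D eyeball structure, using 3D Gaussian Splatting to generate high-quality images and outperforming existing implicit neural representation methods.
Details
Motivation: Existing gaze redirection methods are typically based on neural radiance fields (NeRF), which perform volume rendering with implicit neural representations and cannot explicitly model eyeball rotation and translation. This paper introduces an explicit 3D eyeball structure to address this. Contribution: 1. An explicit 3D eyeball structure represented with 3D Gaussian Splatting (3DGS) for precise control of gaze direction. 2. An adaptive deformation module that replicates subtle muscle movements around the eyes.
Method: 1. Model the eyeball structure explicitly with 3DGS; 2. Redirect gaze by rotating and translating the 3D eyeball; 3. Add an adaptive deformation module for greater realism.
Result: On the ETH-XGaze dataset, the method achieves better image quality and gaze estimation accuracy than existing methods.
Insight: An explicit 3D structure outperforms implicit approaches for gaze redirection and models eye movement more naturally.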
Abstract: We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.
[69] DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera
Shaohua Pan,Xinyu Yi,Yan Zhou,Weihua Jian,Yuan Zhang,Pengfei Wan,Feng Xu
Main category: cs.CV
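TL;DR: DiffCap is a diffusion-based method for real-time human motion capture from sparse IMUs and a monocular camera; fusing the two modalities improves robustness and accuracy.
Details
Motivation: Methods that rely on monocular visual information tend to fail under occlusion or when the subject moves out of view, while IMU signals are stable but lack global information; a framework that combines the strengths of both is needed. Contribution: 1. A diffusion-based multimodal motion capture framework; 2. Visual information is encoded as a whole into a condition embedding while IMU signals are fused frame by frame, improving robustness; 3. Experiments validate the design and show state-of-the-art performance.
Method: 1. Sequential visual information is encoded as a whole into a condition embedding; 2. IMU signals are concatenated frame by frame with the noisy pose to form the diffusion model input; 3. The diffusion model learns motion priors and fuses the two signals.
Result: Experiments show that DiffCap outperforms existing methods in pose estimation and remains stable when visual information degrades.
Insight: Encoding visual information as a whole while processing IMU signals frame by frame is the key to the multimodal fusion; the diffusion model learns motion priors and improves the robustness of real-time capture.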
Abstract: Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By delicately considering the characteristics of the two signals, the sequential visual information is considered as a whole and transformed into a condition embedding, while the inertial measurement is concatenated with the noisy body pose frame by frame to construct a sequential input for the diffusion model. Firstly, we observe that the visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Thus incorporating the sequential visual features as a whole to get a single feature embedding is robust to the occasional degenerations of visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and always stable when signal transmission has no problem. So incorporating them frame-wisely could better explore the temporal information for the system. Experiments have demonstrated the effectiveness of the system design and its state-of-the-art performance in pose estimation compared with the previous works. Our codes are available for research at https://shaohua-pan.github.io/diffcap-page.
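The two fusion paths in the abstract can be sketched in terms of array shapes. The example assumes per-frame feature vectors and uses mean pooling over visible frames as a stand-in for whatever learned sequence encoder the paper uses. All dimensions, the `visible` flag, and the function name are hypothetical.

```python
import numpy as np

def build_diffusion_inputs(noisy_pose, imu, visual_feats, visible):
    """Assemble the per-frame input and the sequence-level condition.

    noisy_pose:   (T, P) noisy body pose at the current diffusion step
    imu:          (T, I) inertial measurements, one row per frame
    visual_feats: (T, V) per-frame visual features
    visible:      (T,)   boolean flags, False where the subject is occluded or out of view
    """
    # IMU path: concatenate with the noisy pose frame by frame.
    sequence_input = np.concatenate([noisy_pose, imu], axis=-1)  # (T, P + I)

    # Visual path: summarize the whole sequence as one embedding, so a few
    # missing frames do not invalidate the condition (mean pooling is an assumption).
    if visible.any():
        condition = visual_feats[visible].mean(axis=0)           # (V,)
    else:
        condition = np.zeros(visual_feats.shape[1])
    return sequence_input, condition

rng = np.random.default_rng(0)
T, P, I, V = 6, 72, 36, 16  # hypothetical sizes
visible = np.array([True, True, False, False, True, True])
x, cond = build_diffusion_inputs(
    rng.normal(size=(T, P)), rng.normal(size=(T, I)), rng.normal(size=(T, V)), visible
)
print(x.shape, cond.shape)
```

The asymmetry mirrors the reasoning in the abstract: the always-available IMU signal keeps its temporal resolution, while the intermittently available visual signal is reduced to a single condition.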
[70] SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models
Hanqing Wang,Yuan Tian,Mingyu Liu,Zhenhao Zhang,Xiangyang Zhu
Main category: cs.CV
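TL;DR: SDEval is a dynamic evaluation framework for the safety of multimodal large language models (MLLMs). It generates new samples by dynamically adjusting the distribution and complexity of benchmarks, mitigating data contamination and exposing the safety limitations of models.
Details
Motivation: With the rapid development of MLLMs, the safety of their outputs has drawn considerable attention. Existing datasets become outdated easily and suffer from data contamination, so a dynamic evaluation method is needed. Contribution: Proposes SDEval, the first safety dynamic evaluation framework, which generates new samples through text, image, and text-image dynamic strategies, makes safety evaluation more flexible, and reveals the safety limitations of MLLMs.
Method: Three dynamic strategies (text dynamics, image dynamics, and text-image dynamics) generate new samples from original benchmarks; the paper studies their effect on model safety and applies the framework to several safety and capability benchmarks.
Result: Experiments show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs on multiple benchmarks.
Insight: Dynamically generating samples is an effective way to evaluate MLLM safety continuously while avoiding the evaluation bias caused by static data and models.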
Abstract: In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval
[71] Text-guided Visual Prompt DINO for Generic Segmentation
Yuchen Guan,Chong Sun,Canmiao Fu,Zhipeng Huang,Chun Yuan,Chen Li
Main category: cs.CV
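TL;DR: Proposes the Prompt-DINO framework, which uses early feature fusion, order-aligned query selection, and a generative data engine to address modality fusion and label noise in open-world segmentation.
Details
Motivation: Existing multimodal vision models fuse features at a late stage, select queries suboptimally, and are constrained by caption-derived vocabularies. Prompt-DINO aims to address these problems. Contribution: 1. An early modality fusion mechanism; 2. Order-aligned query selection; 3. A generative data engine.
Method: 1. Fuse text/visual prompts with backbone features at the initial encoding stage; 2. Use order-aligned query selection during decoding to improve semantic-spatial consistency; 3. Generate high-quality training data with the RAP model.
Result: Achieves state-of-the-art performance on open-world detection benchmarks and expands semantic coverage beyond fixed-vocabulary constraints; the data engine reduces label noise by 80.5% compared with conventional approaches.
Insight: Early fusion and aligned queries improve cross-modal interaction, and a generative data engine supplies high-quality data for open-world tasks.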
Abstract: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid prompts open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data&Code are available at https://github.com/WeChatCV/WeVisionOne.
[72] Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang,Zeyu Zhang,Yunzhong Hou,Zhuowan Li,Gaowen Liu,Ali Payani,Yuan-Sen Ting,Liang Zheng
Main category: cs.CV
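TL;DR: The paper proposes a modular and diversified data synthesis method that produces a high-quality synthetic chart dataset (ECD) and improves MLLM performance on chart understanding.
Details
Motivation: Existing MLLMs underperform on chart understanding (success rates of only 30%-50% on challenging benchmarks), and conventional synthetic chart data is insufficiently similar to real charts and lacks diversity, which limits training. Contribution: Designs a five-step data synthesis pipeline and builds the ECD dataset with 10k+ chart images and 300k+ QA pairs, improving the chart understanding performance of a range of MLLMs.
Method: 1. Separate data creation from function creation for single-plot generation; 2. Condition later subplots on earlier ones in multi-subplot figures; 3. Diversify visual details; 4. Filter out low-quality data; 5. Generate QA pairs with GPT-4o.
Result: ECD consistently improves MLLM performance on a range of real-world and synthetic test sets.
Insight: Modular and diversified data synthesis can effectively improve MLLM performance on complex chart understanding tasks.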
Abstract: Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
[73] DSConv: Dynamic Splitting Convolution for Pansharpening
Xuanyu Liu,Bonan An
Main category: cs.CV
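TL;DR: The paper proposes Dynamic Splitting Convolution (DSConv), which uses attention to select positions of interest and splits the original convolution kernel into multiple smaller kernels to improve feature extraction, reaching efficient, state-of-the-art results on pansharpening.
Details
Motivation: Most existing pansharpening methods rely on standard convolutions, whereas dynamic convolutions can better exploit the inter-pixel correlations of remote sensing images. The authors propose DSConv to address this. Contribution: 1. DSConv, which uses attention to dynamically select positions and split convolution kernels; 2. A new pansharpening network architecture with improved generalization and feature representation.
Method: DSConv combines attention with dynamic selection of positions of interest and splits the original convolution kernel into multiple smaller kernels to strengthen feature extraction. The network architecture is built on this operator.
Result: Experiments show that DSConv performs well on pansharpening and reaches state-of-the-art performance.
Insight: The dynamic splitting strategy exploits local features of remote sensing images and improves the network's adaptability and feature extraction.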
Abstract: Pansharpening fuses a multi-spectral (MS) image with a panchromatic (PAN) image to obtain a high-resolution image, and it remains a significant and challenging low-level vision task. Most existing approaches rely predominantly on standard convolutions; few explore adaptive convolutions, which are effective owing to the inter-pixel correlations of remote sensing images. In this paper, we propose DSConv, a novel strategy that dynamically splits convolution kernels in conjunction with attention: it selects positions of interest and splits the original convolution kernel into multiple smaller kernels. The proposed DSConv more effectively extracts features of different positions within the receptive field, enhancing the network’s generalization, optimization, and feature representation capabilities. Furthermore, we enrich the concept of dynamic splitting convolution and, building upon this methodology, provide a novel network architecture for pansharpening that performs the task more efficiently. Fair and thorough experiments illustrate the effectiveness and the state-of-the-art performance attained by DSConv. Comprehensive and rigorous discussions examine the superiority and optimal usage conditions of DSConv.
[74] VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation
Kaiyuan Jiang,Ruoxi Sun,Ying Cao,Yuqi Xu,Xinran Zhang,Junyan Guo,ChengSheng Deng
Main category: cs.CV
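TL;DR: VISTAR is a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. It combines deterministic metrics with a Hierarchical Weighted P/N Questioning (HWPQ) scheme and improves the accuracy of abstract-semantics evaluation.
Details
Motivation: Existing T2I metrics fall short on both quantifiable attributes and abstract semantics, and they lack multi-dimensional analysis oriented to user roles. VISTAR aims to provide a more comprehensive, user-oriented evaluation framework. Contribution: 1. The VISTAR benchmark, combining deterministic metrics with the HWPQ scheme; 2. User roles and evaluation angles defined through an expert study; 3. HWPQ reaches 85.9% accuracy on abstract semantics.
Method: 1. Use deterministic metrics for quantifiable attributes; 2. Introduce HWPQ, which uses constrained vision-language models to assess abstract semantics; 3. Build the benchmark from an expert study and human-validated data.
Result: VISTAR's metrics exceed 75% alignment with human judgments, and HWPQ significantly outperforms VQA baselines on abstract semantics. The evaluation finds no universally best model; role-weighted scores provide domain-specific guidance.
Insight: The diversity of user roles and evaluation angles is essential to T2I evaluation, and the HWPQ scheme shows promise for assessing abstract semantics.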
Abstract: We present VISTAR, a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. VISTAR introduces a two-tier hybrid paradigm: it employs deterministic, scriptable metrics for physically quantifiable attributes (e.g., text rendering, lighting) and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics (e.g., style fusion, cultural fidelity). Grounded in a Delphi study with 120 experts, we defined seven user roles and nine evaluation angles to construct the benchmark, which comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. Our metrics achieve high human alignment (>75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Comprehensive evaluation of state-of-the-art models reveals no universal champion, as role-weighted scores reorder rankings and provide actionable guidance for domain-specific deployment. All resources are publicly released to foster reproducible T2I assessment.
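The observation that role-weighted scores reorder rankings can be shown with a small worked example. The roles, evaluation angles, weights, and per-angle scores below are invented for illustration. VISTAR's actual seven roles, nine angles, and weighting scheme are not given in the abstract.

```python
def role_weighted_score(angle_scores, role_weights):
    """Weighted mean of per-angle scores, using one role's weights."""
    total_weight = sum(role_weights.values())
    weighted_sum = sum(angle_scores[angle] * weight for angle, weight in role_weights.items())
    return weighted_sum / total_weight

# Hypothetical per-angle scores for two T2I models.
models = {
    "model_a": {"text_rendering": 0.9, "style_fusion": 0.5},
    "model_b": {"text_rendering": 0.6, "style_fusion": 0.8},
}
# Hypothetical roles that care about the two angles differently.
roles = {
    "advertiser": {"text_rendering": 3, "style_fusion": 1},
    "illustrator": {"text_rendering": 1, "style_fusion": 3},
}

for role, weights in roles.items():
    ranking = sorted(models, key=lambda name: role_weighted_score(models[name], weights), reverse=True)
    print(role, ranking)
```

With these made-up numbers, the advertiser role ranks `model_a` first and the illustrator role ranks `model_b` first, which is the kind of reordering the abstract reports.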
[75] Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Zhenbang Du,Yonggan Fu,Lifu Wang,Jiayi Qian,Xiao Luo,Yingyan Lin
Main category: cs.CV
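TL;DR: This paper proposes PostDiff, a post-training framework for more efficient diffusion model deployment. It examines the trade-off between using fewer denoising steps and making each step cheaper, and introduces mixed-resolution denoising and a module caching strategy.
Details
Motivation: Diffusion models perform well on generative tasks, but their high computational cost limits deployment on resource-limited platforms, so it is necessary to study how to improve computational efficiency while maintaining generation quality. Contribution: 1. The PostDiff framework, which optimizes diffusion models in a post-training manner; 2. Mixed-resolution denoising and a module caching strategy; 3. A systematic analysis of the trade-off between reducing the number of denoising steps and reducing per-step cost.
Method: 1. Mixed-resolution denoising: lower the resolution in early denoising steps to enhance low-frequency components; 2. Module caching: reuse computations across steps to reduce redundancy.
Result: Experiments show that PostDiff significantly improves the fidelity-efficiency trade-off of diffusion models, and that reducing per-step inference cost is often more effective than reducing the number of denoising steps.
Insight: When generation quality must be maintained, making each step cheaper is often more effective than simply using fewer denoising steps.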
Abstract: Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at https://github.com/GATECH-EIC/PostDiff.
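The module-level idea of reusing computations across denoising steps can be sketched with a generic fixed-interval cache. This is not PostDiff's hybrid caching strategy, which the abstract does not describe. The class name, the `interval` parameter, and the toy module are assumptions.

```python
class IntervalCache:
    """Recompute a module every `interval` steps and reuse its output in between."""

    def __init__(self, module, interval):
        self.module = module
        self.interval = interval
        self.cached = None
        self.calls = 0  # number of real module evaluations

    def __call__(self, x, step):
        if self.cached is None or step % self.interval == 0:
            self.cached = self.module(x)
            self.calls += 1
        # Between refreshes the stale output is returned as an approximation.
        return self.cached

# Toy "module" that doubles its input, evaluated over 10 denoising steps.
block = IntervalCache(lambda x: x * 2, interval=3)
outputs = [block(step + 1, step) for step in range(10)]
print(block.calls, outputs)
```

The cache trades accuracy for compute: the more similar a module's output is across adjacent steps, the less the stale value costs, which fits the abstract's argument that keeping more denoising steps preserves redundancy worth exploiting.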
[76] Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation
Ojonugwa Oluwafemi Ejiga Peter,Akingbola Oluwapemiisin,Amalahu Chetachi,Adeniran Opeyemi,Fahmi Khalifa,Md Mahmudur Rahman
Main category: cs.CV
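TL;DR: The paper proposes a multi-architecture framework that automates polyp segmentation through integrated detection and mask generation, and uses synthetic data generation to address the scarcity of medical data.
Details
Motivation: Automated segmentation of polyps in colonoscopy images is important for the early diagnosis of colorectal cancer, but scarce medical data and complex annotation are the main obstacles. Contribution: 1) A multidirectional architectural framework that incorporates synthetic data generation with Stable Diffusion; 2) Integration of Faster R-CNN and SAM for detection and segmentation; 3) An evaluation of five segmentation models.
Method: Faster R-CNN localizes polyps and SAM generates the masks; five segmentation models, including U-Net and PSPNet, are evaluated with ResNet34 as the base model.
Result: Faster R-CNN reaches a recall of 93.08%; FPN scores highest on PSNR and SSIM, U-Net leads on recall, and LinkNet is balanced on IoU and Dice score.
Insight: Synthetic data generation can ease the shortage of medical data, and a multi-architecture framework that combines detection with segmentation performs well on polyp segmentation.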
Abstract: Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. This detection approach combines Faster R-CNN for initial object localization while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08% combined with a precision of 88.97% and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while U-Net excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).
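The reported detection metrics can be checked for internal consistency, since the F1 score is the harmonic mean of precision and recall. The short calculation below uses only the precision and recall quoted in the abstract; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in percent) reported for the Faster R-CNN detector.
f1 = f1_score(precision=88.97, recall=93.08)
print(round(f1, 2))  # 90.98, matching the reported F1 score
```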
[77] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration
Cheng Liu,Daou Zhang,Tingxu Liu,Yuhan Wang,Jinyang Chen,Yuexuan Li,Xinying Xiao,Chenbo Xin,Ziru Wang,Weichao Wu
Main category: cs.CV
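TL;DR: MA-CBP is a criminal behavior prediction framework based on multi-agent asynchronous collaboration. It converts real-time video streams into semantic descriptions and reasons jointly over long- and short-term context to provide early warning of potential criminal behavior.
Details
Motivation: Accelerating urbanization has brought more criminal behavior in public places. Traditional methods struggle to capture high-level behavioral semantics or to meet real-time requirements, so an efficient solution is needed. Contribution: Proposes the MA-CBP framework, which converts video streams into semantic descriptions with multi-scale language supervision, and constructs a high-quality criminal behavior dataset.
Method: Through multi-agent asynchronous collaboration, video streams are converted into semantic descriptions, and long- and short-term contexts are fused for joint reasoning.
Result: The method performs well on multiple datasets and offers an effective solution for risk warning in urban public safety.
Insight: Combining multi-agent collaboration with semantic descriptions can improve the accuracy and efficiency of real-time criminal behavior prediction.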
Abstract: With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.
[78] A Semantic Segmentation Algorithm for Pleural Effusion Based on DBIF-AUNet
Ruixiang Tang,Jianglong Qin,Mingda Zhang,Yan Song,Yi Wu,Wei Wu
Main category: cs.CV
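TL;DR: The paper proposes a semantic segmentation algorithm based on DBIF-AUNet for accurate pleural effusion segmentation; a dual-branch interactive fusion attention design with complementary multi-scale features improves segmentation accuracy.
Details
Motivation: Semantic segmentation of pleural effusion CT images faces similar gray levels, blurred edges, and variable morphology, and existing methods struggle with such complex variation because of semantic gaps. A more effective segmentation algorithm is therefore needed. Contribution: 1. The DBIF-AUNet model, which includes a Dual-Domain Feature Disentanglement module (DDFD) and a Branch Interaction Attention Fusion module (BIAF); 2. A nested deep supervision mechanism with a hybrid loss function that addresses class imbalance.
Method: 1. The DDFD module achieves multi-scale feature complementarity through orthogonal decoupling; 2. The BIAF module dynamically fuses global, local, and frequency-band features; 3. Training is optimized with a hierarchical adaptive hybrid loss.
Result: On 1,622 CT images, DBIF-AUNet achieves IoU and Dice scores of 80.1% and 89.0% respectively, outperforming U-Net++ and Swin-UNet.
Insight: Disentangling dual-domain features and dynamically fusing multi-scale information can improve segmentation of complex medical images, especially with blurred edges and variable morphology.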
Abstract: Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet). This model constructs a densely nested skip-connection network and innovatively refines the Dual-Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual-domain modules to achieve multi-scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. These results outperform state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.
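For readers less familiar with the two reported metrics, the sketch below computes IoU and Dice for a pair of binary masks. The tiny masks are made up for illustration, and the convention of returning 1.0 for two empty masks is an assumption.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection-over-Union and Dice score for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = intersection / union if union else 1.0     # both masks empty: treat as perfect
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)

# Hypothetical 2x3 predicted and ground-truth masks.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou, dice = iou_and_dice(pred, target)
print(iou, round(dice, 4))
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Applying that identity to the reported 80.1% IoU gives roughly 0.89, in line with the reported 89.0% Dice, although averages over a dataset need not satisfy the identity exactly.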
[79] LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning
Chang Che,Ziqi Wang,Pengwan Yang,Qi Wang,Hui Ma,Zenglin Shi
Main category: cs.CV
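TL;DR: This paper proposes LoRA in LoRA (LiLoRA), an efficient architecture expansion method for Continual Visual Instruction Tuning (CVIT). It shares the LoRA matrix A across tasks and applies a further low-rank decomposition to matrix B to reduce parameter redundancy, and it adds a cosine-regularized stability loss to prevent forgetting.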
Details
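Motivation: Continual Visual Instruction Tuning (CVIT) suffers from catastrophic forgetting and poor parameter efficiency. Existing methods expand entire layers to learn each new task, which incurs large parameter overhead and scales poorly.
Contribution: Proposes LiLoRA, which combines a shared LoRA matrix A, a low-rank decomposition of matrix B, and a cosine-regularized stability loss to significantly improve parameter efficiency and task-learning performance.
Method: 1. Share the LoRA matrix A across tasks to reduce redundancy; 2. Apply a low-rank decomposition to matrix B to minimize task-specific parameters; 3. Introduce a cosine-regularized stability loss to keep the shared representations consistent.
Result: On a diverse CVIT benchmark, LiLoRA achieves superior performance in sequential task learning while significantly improving parameter efficiency.
Insight: Sharing and decomposing the LoRA matrices reduces parameter overhead and mitigates forgetting, offering an efficient approach to continual learning for multimodal large language models.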
Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
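To make the parameter-efficiency argument concrete, the sketch below counts adapter parameters for one weight matrix under two schemes: a separate LoRA pair per task, and a LiLoRA-style layout with one shared A and a per-task B factored into two smaller matrices. The dimensions, the inner rank, and the function names are assumptions chosen for illustration. The paper's exact factorization may differ.

```python
def per_task_lora_params(d_in, d_out, rank, num_tasks):
    """Each task owns its own A (rank x d_in) and B (d_out x rank)."""
    return num_tasks * rank * (d_in + d_out)


def shared_a_factored_b_params(d_in, d_out, rank, inner_rank, num_tasks):
    """One shared A; each task's B (d_out x rank) is factored as
    (d_out x inner_rank) @ (inner_rank x rank). Assumed layout, for illustration."""
    shared_a = rank * d_in
    per_task_b = inner_rank * (d_out + rank)
    return shared_a + num_tasks * per_task_b


# Hypothetical sizes: a 4096x4096 projection, LoRA rank 8, inner rank 2, 10 tasks.
baseline = per_task_lora_params(4096, 4096, 8, 10)
compact = shared_a_factored_b_params(4096, 4096, 8, 2, 10)
```

Under these assumed sizes the per-task cost drops from `rank * (d_in + d_out)` to `inner_rank * (d_out + rank)`, so the saving grows with the number of tasks.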
[80] AnomalyMoE: Towards a Language-free Generalist Model for Unified Visual Anomaly Detection
Zhaopeng Gu,Bingke Zhu,Guibo Zhu,Yingying Chen,Wei Ge,Ming Tang,Jinqiao Wang
Main category: cs.CV
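TL;DR: AnomalyMoE is a general-purpose anomaly detection framework built on a Mixture-of-Experts (MoE) architecture. It decomposes anomaly detection into three semantic levels and uses a dedicated expert network for each, achieving strong cross-domain performance.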
Details
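Motivation: Existing anomaly detection methods are usually designed for specific tasks and generalize poorly, so performance drops on new domains or anomaly types. AnomalyMoE aims to remove this limitation with a single unified model.
Contribution: 1. A hierarchical MoE framework, AnomalyMoE, that simultaneously detects local structural anomalies, component-level semantic anomalies, and global logical anomalies. 2. The EIR and ESB modules, which promote expert diversity and ensure expert utilization, respectively.
Method: AnomalyMoE uses three expert networks for the local (patch), component, and global levels. The EIR module promotes expert diversity through a repulsion loss, and the ESB module balances expert selection.
Result: Experiments on 8 datasets spanning multiple domains show that AnomalyMoE significantly outperforms specialized methods and sets a new state of the art.
Insight: Hierarchical anomaly detection combined with dynamic expert selection shows the potential of generalist models for anomaly detection and suggests directions for future research.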
Abstract: Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, each specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.
[81] Interpretable Rheumatoid Arthritis Scoring via Anatomy-aware Multiple Instance Learning
Zhiyan Bo,Laura C. Coates,Bartlomiej W. Papiez
Main category: cs.CV
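TL;DR: The paper proposes an anatomy-aware multiple instance learning (MIL) method for interpretable rheumatoid arthritis (RA) scoring. A two-stage pipeline extracts disease-relevant regions from hand radiographs and predicts the SvdH score, with performance close to that of experienced radiologists.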
Details
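Motivation: The SvdH score sees limited use in clinical practice, mainly because its complexity makes scoring inefficient. The paper aims to improve scoring efficiency through automation while preserving interpretability.
Contribution: 1) A two-stage prediction pipeline; 2) two schemes for extracting disease-relevant regions; 3) ensemble learning that raises performance to near-expert level.
Method: 1) Select abnormal regions or joint regions; 2) aggregate region features with attention-based multiple instance learning; 3) refine predictions with ensemble learning.
Result: The best configuration achieves a PCC of 0.945 and an RMSE of 15.57, comparable to radiologists (PCC = 0.97, RMSE = 18.75).
Insight: Combining anatomy-aware region extraction with an attention mechanism significantly improves both the interpretability and the accuracy of scoring.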
Abstract: The Sharp/van der Heijde (SvdH) score has been widely used in clinical trials to quantify radiographic damage in Rheumatoid Arthritis (RA), but its complexity has limited its adoption in routine clinical practice. To address the inefficiency of manual scoring, this work proposes a two-stage pipeline for interpretable image-level SvdH score prediction using dual-hand radiographs. Our approach extracts disease-relevant image regions and integrates them using attention-based multiple instance learning to generate image-level features for prediction. We propose two region extraction schemes: 1) sampling image tiles most likely to contain abnormalities, and 2) cropping patches containing disease-relevant joints. With Scheme 2, our best individual score prediction model achieved a Pearson’s correlation coefficient (PCC) of 0.943 and a root mean squared error (RMSE) of 15.73. Ensemble learning further boosted prediction accuracy, yielding a PCC of 0.945 and RMSE of 15.57, achieving state-of-the-art performance that is comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, our pipeline effectively identified and made decisions based on anatomical structures which clinicians consider relevant to RA progression.
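The attention-based multiple instance learning step can be pictured as softmax-weighted pooling over region features. The sketch below assumes the per-region attention scores are already available (in practice they would come from a small learned scoring network, which is omitted here), and all names are hypothetical.

```python
import math


def attention_pool(instances, scores):
    """Softmax the per-instance scores and return (weights, weighted-sum feature).

    instances: list of equally long feature vectors, one per extracted region.
    scores:    one unnormalized attention score per instance (assumed given).
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    pooled = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return weights, pooled


# Two hypothetical region features with equal scores: pooling reduces to the mean.
weights, pooled = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

The attention weights also serve as a rough interpretability signal, since regions with larger weights contribute more to the image-level prediction.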
[82] Depth Jitter: Seeing through the Depth
Md Sazidur Rahman,David Cabecinhas,Ricard Marxer
Main category: cs.CV
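TL;DR: The paper proposes Depth-Jitter, a depth-based data augmentation method that simulates natural depth variations to improve generalization on depth-sensitive tasks.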
Details
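Motivation: Conventional augmentation methods overlook depth-aware transformations, which limits model robustness under real-world depth variation.
Contribution: Proposes Depth-Jitter, which simulates natural depth variations through adaptive depth offsetting to improve generalization in depth-sensitive environments.
Method: Depth-Jitter uses depth variance thresholds to guide adaptive depth offsetting, generating synthetic depth perturbations while preserving structural integrity.
Result: Experiments on FathomNet and UTDAC2020 show that Depth-Jitter does not always outperform conventional methods but consistently improves model stability under varying depth conditions.
Insight: Depth-aware augmentation shows promise for real-world applications and lays a foundation for further research on depth-aware learning strategies.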
Abstract: Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness in real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The proposed technique is publicly available to support advancements in depth-aware augmentation. The code is available at https://github.com/mim-team/Depth-Jitter.
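One way to read "adaptive depth offsetting, guided by depth variance thresholds" is sketched below: a depth map is shifted by a single random offset only when its variance exceeds a threshold, so relative depth structure is kept. The threshold direction, the uniform offset, and every parameter name are assumptions for illustration. The released repository is the authoritative implementation.

```python
import random


def depth_jitter(depth, var_threshold=0.5, max_offset=1.0, rng=None):
    """Return a jittered copy of a 2D depth map (nested lists of floats).

    Assumption: maps whose depth variance is below var_threshold are left
    unchanged; otherwise one global offset is added to every pixel.
    """
    rng = rng or random.Random()
    values = [d for row in depth for d in row]
    mean = sum(values) / len(values)
    variance = sum((d - mean) ** 2 for d in values) / len(values)
    if variance < var_threshold:
        return [row[:] for row in depth]
    offset = rng.uniform(-max_offset, max_offset)
    # Clamp at zero so depths stay non-negative.
    return [[max(0.0, d + offset) for d in row] for row in depth]


flat = [[5.0, 5.0], [5.0, 5.0]]      # zero variance: expected to pass through
varied = [[2.0, 4.0], [6.0, 8.0]]    # variance 5.0: expected to be shifted
flat_out = depth_jitter(flat, rng=random.Random(0))
varied_out = depth_jitter(varied, rng=random.Random(0))
```

Because the same offset is applied everywhere, differences between neighbouring pixels are unchanged as long as the clamp is not triggered.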
[83] Towards Unified Image Deblurring using a Mixture-of-Experts Decoder
Daniel Feijoo,Paula Garrido-Mellado,Jaesung Rim,Alvaro Garcia,Marcos V. Conde
Main category: cs.CV
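TL;DR: This paper proposes a unified image deblurring method built on a mixture-of-experts (MoE) decoder. It handles multiple blur types with one model, avoiding the need for a separate model per blur type.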
Details
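Motivation: Existing image deblurring methods are usually designed for a specific blur type and lack generality. The paper addresses this with a unified method that handles multiple blur types efficiently.
Contribution: An MoE-based decoding module that dynamically routes image features to handle multiple blur types in one model. The method performs on par with dedicated models and generalizes well to unseen blur scenarios.
Method: The core is an MoE decoding module that dynamically routes features to adapt to different blur types and restores images efficiently in an end-to-end manner.
Result: Experiments show performance comparable to dedicated models across multiple blur types, along with robustness and generalization on unseen blur scenarios.
Insight: The MoE decoder offers a flexible multi-task framework that may inform unified designs for other low-level vision tasks.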
Abstract: Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types and therefore lack generalization. This limitation means that multiple models are required to cover several blur types, which is not practical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also demonstrates remarkable robustness and generalization capabilities on unseen blur degradation scenarios.
[84] Deepfake Detection that Generalizes Across Benchmarks
Andrii Yermakov,Jan Cech,Jiri Matas,Mario Fritz
Main category: cs.CV
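TL;DR: This paper proposes LNCLIP-DF, a parameter-efficient method that fine-tunes only the Layer Normalization parameters (0.03% of the total) of a pre-trained CLIP vision encoder and combines L2 normalization with latent space augmentations to improve the generalization of deepfake detection across benchmarks. The method achieves state-of-the-art performance on 13 datasets and identifies key factors for avoiding shortcut learning and improving generalization.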
Details
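Motivation: Many methods add architectural complexity to improve the generalization of deepfake detection, yet cross-dataset generalization remains a problem in practical deployment. The paper aims to show that robust generalization is achievable through parameter-efficient fine-tuning of a pre-trained model.
Contribution: 1. Proposes LNCLIP-DF, which fine-tunes only the Layer Normalization parameters of CLIP for efficient, generalizable detection; 2. Identifies key factors for avoiding shortcut learning and improving generalization; 3. Demonstrates state-of-the-art performance on 13 datasets.
Method: Uses a pre-trained CLIP vision encoder, fine-tunes only the Layer Normalization parameters (0.03% of all parameters), and shapes the feature space with L2 normalization and latent space augmentations.
Result: Experiments on 13 benchmark datasets show that LNCLIP-DF outperforms more complex methods in average cross-dataset AUROC while remaining computationally efficient and reproducible.
Insight: 1. Training on paired real-fake data from the same source video mitigates shortcut learning; 2. Models trained on older datasets can still perform well on newer ones, so detection difficulty has not strictly increased with dataset release date.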
Abstract: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.
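Two ingredients of the description above lend themselves to a small sketch: projecting features onto the unit hypersphere with L2 normalization, and counting how small a fraction of parameters is trainable when only Layer Normalization weights are unfrozen. The parameter names and sizes below are invented for illustration and do not describe the actual CLIP checkpoint.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to unit Euclidean norm (a point on the hypersphere)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def trainable_fraction(param_counts, marker="layernorm"):
    """Fraction of parameters whose (hypothetical) name contains the marker."""
    total = sum(param_counts.values())
    trainable = sum(n for name, n in param_counts.items() if marker in name)
    return trainable / total


unit = l2_normalize([3.0, 4.0])

# Hypothetical toy model: names and counts are assumptions, not CLIP's layout.
toy_params = {
    "block0.attn.weight": 1000,
    "block0.layernorm.weight": 3,
    "block0.layernorm.bias": 3,
}
fraction = trainable_fraction(toy_params)
```

In a real framework the same selection would be done by freezing every parameter except those belonging to normalization layers, but the exact module names depend on the model implementation.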
[85] XAG-Net: A Cross-Slice Attention and Skip Gating Network for 2.5D Femur MRI Segmentation
Byunghyun Ko,Anning Tian,Jeongkyu Lee
Main category: cs.CV
TL;DR: XAG-Net is a new 2.5D U-Net architecture that uses pixel-wise cross-slice attention (CSA) and skip attention gating (AG) to segment femur MRI images more accurately.
Details
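Motivation: Existing 2D and 3D deep learning methods have limitations in femur MRI segmentation, and a balance between accuracy and computational efficiency is needed.
Contribution: Proposes pixel-wise cross-slice attention (CSA) and skip attention gating (AG) modules that improve 2.5D segmentation performance.
Method: A 2.5D U-Net architecture in which CSA provides fine-grained cross-slice modeling and AG refines feature fusion.
Result: XAG-Net outperforms baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while remaining computationally efficient.
Insight: Pixel-wise cross-slice attention captures inter-slice information at a finer granularity, and combining it with skip attention gating effectively refines the segmentation.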
Abstract: Accurate segmentation of femur structures from Magnetic Resonance Imaging (MRI) is critical for orthopedic diagnosis and surgical planning but remains challenging due to the limitations of existing 2D and 3D deep learning-based segmentation approaches. In this study, we propose XAG-Net, a novel 2.5D U-Net-based architecture that incorporates pixel-wise cross-slice attention (CSA) and skip attention gating (AG) mechanisms to enhance inter-slice contextual modeling and intra-slice feature refinement. Unlike previous CSA-based models, XAG-Net applies pixel-wise softmax attention across adjacent slices at each spatial location for fine-grained inter-slice modeling. Extensive evaluations demonstrate that XAG-Net surpasses baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while maintaining computational efficiency. Ablation studies further validate the critical role of the CSA and AG modules, establishing XAG-Net as a promising framework for efficient and accurate femur MRI segmentation.
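The idea of applying softmax attention across adjacent slices at each spatial location can be illustrated with a tiny, framework-free sketch. It assumes dot-product scores between the centre slice and every slice in the stack, with no learned projections. The real CSA module presumably includes learned layers that are omitted here.

```python
import math


def cross_slice_attention(slices, center):
    """Fuse a stack of slices into one feature map, pixel by pixel.

    slices[s][y][x] is a channel vector (list of floats) for slice s.
    At each (y, x), the centre slice's vector is the query and the vectors of
    all slices at that location are both keys and values (assumed design).
    """
    height, width = len(slices[center]), len(slices[center][0])
    channels = len(slices[center][0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            query = slices[center][y][x]
            scores = [sum(q * k for q, k in zip(query, s[y][x])) for s in slices]
            top = max(scores)
            exps = [math.exp(v - top) for v in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            row.append([sum(w * s[y][x][c] for w, s in zip(weights, slices))
                        for c in range(channels)])
        fused.append(row)
    return fused


# Three identical 1x2 slices with 2 channels: fusion should return the same features.
one_slice = [[[1.0, 2.0], [0.5, -1.0]]]
fused = cross_slice_attention([one_slice, one_slice, one_slice], center=1)
```

Since the softmax is computed independently at every pixel, each location can weight the neighbouring slices differently, which is the fine-grained behaviour the abstract contrasts with earlier CSA-based models.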
[86] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen,Ruihui Zhao,Chuwei Luo,Mingze Sun,Xinlei Yu,Yangyang Kang,Ruqi Huang
Main category: cs.CV
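TL;DR: The paper proposes SIFThinker, a spatially-aware visual reasoning framework that interleaves depth-enhanced bounding boxes with natural language to correct attention and focus on image regions.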
Details
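Motivation: Current multimodal large language models (MLLMs) perform poorly on complex visual tasks, especially spatial understanding and fine-grained perception. Existing methods fail to use spatial cues for attention correction and iterative focusing.
Contribution: 1. A reverse-expansion-forward-inference strategy that generates interleaved image-text chains of thought; 2. GRPO-SIF, a training paradigm that integrates depth information into visual grounding.
Method: SIFThinker interleaves depth-enhanced bounding boxes with natural language to dynamically correct attention and focus on relevant regions.
Result: Experiments show that SIFThinker outperforms existing methods on spatial understanding and fine-grained perception tasks while retaining general capabilities.
Insight: A framework that combines spatial cues with dynamic attention correction can significantly improve visual reasoning.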
Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold. First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.
[87] Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding
Jian Hu,Zixu Cheng,Shaogang Gong,Isabel Guan,Jianye Hao,Jun Wang,Kun Shao
Main category: cs.CV
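TL;DR: The paper proposes URPA, a data-efficient method for unlabelled cross-domain video temporal grounding. It uses GRPO rollouts from reinforcement learning to generate multiple candidate predictions and estimate confidence, enabling cross-domain knowledge transfer without target-domain annotations.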
Details
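Motivation: Current video temporal grounding (TG) methods rely on large amounts of labelled data for cross-domain tasks and incur heavy computational and storage costs, which makes real-time deployment difficult. URPA addresses these problems by adapting across domains with only a small number of unlabelled videos.
Contribution: Proposes URPA for unlabelled cross-domain temporal grounding, using GRPO to generate pseudo labels and confidence estimates, which reduces both the dependence on target-domain annotation and the computational overhead.
Method: URPA generates multiple candidate predictions with GRPO, averages them into a pseudo label, and estimates confidence from the variance of the predictions to guide training. The approach needs no target-domain labels and is computationally efficient.
Result: Across six cross-domain settings on three datasets, URPA generalizes well using only a few unlabelled videos.
Insight: URPA shows that confidence-weighted training is effective for unlabelled cross-domain tasks and offers a practical route toward real-time video temporal grounding.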
Abstract: Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain, then adapted to a target domain using only a small number of unlabelled videos from the target domain. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Code will be released upon publication.
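The pseudo-labelling step described for URPA (average the rollouts, derive confidence from their variance, weight the reward) can be sketched as follows. The exponential mapping from variance to confidence and the `scale` parameter are assumptions made for illustration. The abstract does not specify the exact mapping.

```python
import math


def pseudo_label_with_confidence(rollouts, scale=1.0):
    """rollouts: list of (start, end) segment predictions, e.g. in seconds.

    Returns the averaged segment and a confidence in (0, 1] that shrinks as
    the rollouts disagree. exp(-variance / scale) is an assumed mapping.
    """
    n = len(rollouts)
    start = sum(s for s, _ in rollouts) / n
    end = sum(e for _, e in rollouts) / n
    variance = (sum((s - start) ** 2 for s, _ in rollouts)
                + sum((e - end) ** 2 for _, e in rollouts)) / (2 * n)
    return (start, end), math.exp(-variance / scale)


def weighted_reward(reward, confidence):
    """Down-weight rewards computed against unreliable pseudo labels."""
    return confidence * reward


# Hypothetical rollouts: three that agree, and two that disagree.
agree_label, agree_conf = pseudo_label_with_confidence([(2.0, 5.0)] * 3)
split_label, split_conf = pseudo_label_with_confidence([(1.0, 4.0), (3.0, 6.0)])
```

Both toy cases yield the same averaged segment, but the disagreeing rollouts receive a lower confidence, so their reward contributes less to training.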
[88] Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Giacomo D’Amicantonio,Snehashis Majhi,Quan Kong,Lorenzo Garattoni,Gianpiero Francesca,François Bremond,Egor Bondarev
Main category: cs.CV
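TL;DR: The paper proposes Gaussian Splatting-guided Mixture of Experts (GS-MoE), a framework for weakly-supervised video anomaly detection that strengthens the weak supervision signal through specialized expert models and temporal consistency, yielding significant performance gains.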
Details
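Motivation: Conventional weakly-supervised video anomaly detection methods cannot handle the diversity of anomaly types and lack precise temporal information, so they detect complex anomalous events poorly.
Contribution: 1. Proposes the GS-MoE framework, which combines a mixture-of-experts mechanism with a temporal Gaussian splatting loss; 2. Strengthens the weak supervision signal through specialized experts and temporal consistency; 3. Achieves state-of-the-art results on multiple datasets.
Method: 1. Use multiple expert models, each focused on a specific anomaly type; 2. Guide temporal consistency with a Gaussian splatting loss; 3. Integrate the experts' predictions through a mixture-of-experts mechanism.
Result: Achieves 91.58% AUC on UCF-Crime and strong results on the XD-Violence and MSAD datasets.
Insight: 1. Category-specific modeling and temporal consistency are key to improving weakly-supervised anomaly detection; 2. The Gaussian splatting loss effectively strengthens the spatio-temporal representation available from weak supervision.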
Abstract: Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
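As a rough intuition for how Gaussian kernels placed along the time axis could turn coarse video-level supervision into smoother frame-level targets, consider the sketch below. It is only one plausible reading of the temporal Gaussian splatting idea: the peak positions, the shared `sigma`, and the max-combination rule are all assumptions, and the paper's actual loss may be constructed differently.

```python
import math


def gaussian_frame_targets(num_frames, peaks, sigma):
    """Render soft per-frame targets from assumed anomaly peak positions.

    Each peak contributes exp(-(t - p)^2 / (2 * sigma^2)); overlapping kernels
    are combined with max so targets stay in [0, 1].
    """
    targets = []
    for t in range(num_frames):
        value = max(
            (math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2)) for p in peaks),
            default=0.0,
        )
        targets.append(value)
    return targets


# Hypothetical 10-frame clip with one suspected anomaly peak at frame 3.
targets = gaussian_frame_targets(10, peaks=[3], sigma=1.5)
```

Frames near a peak receive targets close to 1 and the value decays smoothly with temporal distance, which is one way to encode temporal consistency around likely abnormal segments.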
[89] ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction
Patrick Takenaka,Johannes Maucher,Marco F. Huber
Main category: cs.CV
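TL;DR: ViPro-2 improves unsupervised state estimation through integrated dynamics. It removes the earlier ViPro model's dependence on a ground-truth initial state, improves the ability to infer states from observations, and extends the dataset to 3D to move closer to real-world scenarios.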
Details
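Motivation: The earlier ViPro model relied on a ground-truth initial state, which led it to learn a shortcut and left it unable to infer states accurately from noisy observations. ViPro-2 aims to fix this and achieve unsupervised state estimation.
Contribution: 1. Improves ViPro so that states can be inferred without a ground-truth initial state; 2. Shows that this can be done in an unsupervised manner; 3. Extends the Orbits dataset with a 3D variant that is closer to real-world scenarios.
Method: Integrates the dynamics into the model so that it estimates states directly from observations without relying on a ground-truth initial state.
Result: The model infers states from noisy observations more robustly and is shown to be effective on the extended dataset.
Insight: The key to unsupervised state estimation is coupling the model tightly with the dynamics and avoiding reliance on ground-truth initial states, which brings the approach closer to practical settings.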
Abstract: Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground truth initial symbolic state. We show that this approach led to the model learning a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in the inability to estimate states given an observation if previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without providing a full ground truth state in the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real world scenarios.
[90] Street View Sociability: Interpretable Analysis of Urban Social Behavior Across 15 Cities
Kieran Elrod,Katherine Flanigan,Mario Bergés
Main category: cs.CV
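TL;DR: The paper uses a multimodal large language model to analyze street view images, extracts information about social interaction, links it to urban planning theory, and tests the relationship between types of social interaction and built environment variables.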
Details
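Motivation: Existing research mostly measures pedestrian volume rather than the quality of social interaction. Street view imagery, an inexpensive data source with global coverage, may contain latent social information useful for urban planning.
Contribution: 1. Proposes street view imagery as a new data source for analyzing social behavior. 2. Uses Mehta's taxonomy of sociability to test the association between social interaction and the built environment. 3. Provides a scalable tool for cross-cultural urban planning.
Method: 1. Analyze street view images with a multimodal large language model. 2. Use linear regression models, controlling for factors such as weather and time of day, to test how sociability types relate to place attachment and to environmental variables (e.g., green view index, sky view index).
Result: The results support urban planning theory: the sky view index is associated with all sociability types, the green view index predicts enduring sociability, and place attachment is positively associated with fleeting sociability.
Insight: Street view imagery can serve as a privacy-preserving tool that provides quantitative evidence for urban planning and supports cross-cultural theory testing.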
Abstract: Designing socially active streets has long been a goal of urban planning, yet existing quantitative research largely measures pedestrian volume rather than the quality of social interactions. We hypothesize that street view imagery – an inexpensive data source with global coverage – contains latent social information that can be extracted and interpreted through established social science theory. As a proof of concept, we analyzed 2,998 street view images from 15 cities using a multimodal large language model guided by Mehta’s taxonomy of passive, fleeting, and enduring sociability – one illustrative example of a theory grounded in urban design that could be substituted or complemented by other sociological frameworks. We then used linear regression models, controlling for factors like weather, time of day, and pedestrian counts, to test whether the inferred sociability measures correlate with city-level place attachment scores from the World Values Survey and with environmental predictors (e.g., green, sky, and water view indices) derived from individual street view images. Results aligned with long-standing urban planning theory: the sky view index was associated with all three sociability types, the green view index predicted enduring sociability, and place attachment was positively associated with fleeting sociability. These results provide preliminary evidence that street view images can be used to infer relationships between specific types of social interactions and built environment variables. Further research could establish street view imagery as a scalable, privacy-preserving tool for studying urban sociability, enabling cross-cultural theory testing and evidence-based design of socially vibrant cities.
[91] Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen,Jiahui Liu,Ruifan Di,Yanwei Li,Chirui Chang,Shizhen Zhao,Wilton W. T. Fok,Xiaojuan Qi,Yik-Chung Wu
Main category: cs.CV
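TL;DR: The paper proposes VA-GPT, a multimodal large language model for efficiently summarizing and localizing abnormal events in videos. Spatial and temporal modules capture the key information and significantly improve performance.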
Details
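Motivation: Analyzing abnormal events in videos is an important and challenging task. Existing MLLMs handle anomalies poorly, mainly because abnormal events are sparse in space and time and redundant information interferes.
Contribution: Proposes the VA-GPT model with the SETS and TETG modules, which efficiently align effective tokens between the visual and language modalities, and constructs an instruction dataset for video anomalies along with a cross-domain evaluation benchmark.
Method: The SETS module selects spatially effective tokens and the TETG module generates temporally effective tokens. Combined with a visual encoder and a large language model, they enable efficient analysis of abnormal events.
Result: The model outperforms existing methods on multiple benchmarks, confirming its effectiveness.
Insight: A modular design addresses the spatio-temporal sparsity of abnormal events, and the cross-domain evaluation indicates the model's potential in practical applications.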
Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
[92] Are you In or Out (of gallery)? Wisdom from the Same-Identity Crowd
Aman Bhatta,Maria Dhakal,Michael C. King,Kevin W. Bowyer
Main category: cs.CV
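TL;DR: The paper proposes a new method that uses the additional enrolled images of the rank-one identity to predict whether a rank-one identification result is In-gallery or Out-of-gallery, with the aim of reducing false identifications and wasted investigative time.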
Details
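Motivation: Conventional methods mainly threshold the similarity score to decide between In-gallery and Out-of-gallery, which has limitations. The paper uses additional information from other images of the same identity to improve this decision and reduce false identifications.
Contribution: Proposes a classifier based on same-identity information to predict the In-gallery / Out-of-gallery status in one-to-many facial identification, and validates it on different datasets and matchers.
Method: Build a feature vector from the additional enrolled images of the rank-one identity and train a classifier to predict the In-gallery / Out-of-gallery status. Experiments cover several kinds of image degradation and different demographic groups.
Result: The method classifies In-gallery versus Out-of-gallery well, remains effective when image quality is degraded, and performs similarly across demographic groups.
Insight: The method's effectiveness depends on modern deep CNN matchers trained with margin-based loss functions, which indicates that how the matcher is trained strongly affects the result.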
Abstract: A central problem in one-to-many facial identification is that the person in the probe image may or may not have enrolled image(s) in the gallery; that is, may be In-gallery or Out-of-gallery. Past approaches to detect when a rank-one result is Out-of-gallery have mostly focused on finding a suitable threshold on the similarity score. We take a new approach, using the additional enrolled images of the identity with the rank-one result to predict if the rank-one result is In-gallery / Out-of-gallery. Given a gallery of identities and images, we generate In-gallery and Out-of-gallery training data by extracting the ranks of additional enrolled images corresponding to the rank-one identity. We then train a classifier to utilize this feature vector to predict whether a rank-one result is In-gallery or Out-of-gallery. Using two different datasets and four different matchers, we present experimental results showing that our approach is viable for mugshot quality probe images, and also, importantly, for probes degraded by blur, reduced resolution, atmospheric turbulence and sunglasses. We also analyze results across demographic groups, and show that In-gallery / Out-of-gallery classification accuracy is similar across demographics. Our approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, and thereby to reduce false positive identifications, wrongful arrests, and wasted investigative time. Interestingly, comparing the results of older deep CNN-based face matchers with newer ones suggests that the effectiveness of our Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions.
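The feature described in the abstract, the ranks of the additional enrolled images belonging to the rank-one identity, can be extracted as in the sketch below. The input format (a list of identity and similarity pairs for one probe) and the function name are assumptions. The downstream classifier is not shown.

```python
def rank_features(matches):
    """Return the rank-one identity and the ranks of its other enrolled images.

    matches: list of (identity, similarity) pairs, one per gallery image,
             for a single probe. Higher similarity means a better match.
    """
    ordered = sorted(matches, key=lambda m: m[1], reverse=True)
    rank_one_identity = ordered[0][0]
    ranks = [
        rank
        for rank, (identity, _) in enumerate(ordered, start=1)
        if identity == rank_one_identity and rank > 1
    ]
    return rank_one_identity, ranks


# Hypothetical probe compared against five gallery images of identities A, B, C.
identity, ranks = rank_features(
    [("A", 0.9), ("B", 0.8), ("A", 0.7), ("C", 0.6), ("A", 0.2)]
)
```

A plausible intuition, which the paper tests empirically, is that when the probe truly belongs to the rank-one identity, that identity's other images tend to appear near the top of the list as well, whereas scattered ranks hint at an Out-of-gallery probe.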
[93] Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning
Xiangyu Wu,Feng Yu,Yang Yang,Jianfeng Lu
Main category: cs.CV
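TL;DR: TaAM-CPT is a cross-modal representation learning framework trained on text data alone. Using modality prompt pools and modality-aligned text encoders, it performs zero-shot classification without any modality-specific labelled data.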
Details
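Motivation: Existing methods depend on large amounts of modality-specific labelled data or are designed for a single modality, which limits the generality and scalability of cross-modal learning.
Contribution: Proposes the TaAM-CPT framework, which uses text data and consistent prompt tuning, can be extended to an unlimited number of modalities, and achieves leading zero-shot classification performance.
Method: Combines modality prompt pools, text construction, and modality-aligned text encoders, with intra- and inter-modal learning objectives that enforce consistency across modalities.
Result: Achieves leading results on video, image, and audio classification without any modality-specific labelled data.
Insight: Text data can act as a universal bridge for cross-modal learning, and prompt pools together with semantic consistency objectives can effectively handle heterogeneity between modalities.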
Abstract: The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.
[94] FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
Wenbin Teng,Gonglin Chen,Haiwei Chen,Yajie Zhao
Main category: cs.CV
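TL;DR: FVGen is a new framework that distills a multi-step denoising teacher model into a few-step denoising student model, enabling fast novel-view synthesis with substantially better time efficiency.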
Details
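Motivation: Existing methods that use Video Diffusion Models (VDMs) for 3D reconstruction from sparse views sample slowly, which hurts practical efficiency.
Contribution: Proposes the FVGen framework, which combines GANs with softened reverse KL-divergence minimization to distill multi-step VDMs into few-step models, cutting sampling time by more than 90%.
Method: Uses Generative Adversarial Networks and softened reverse KL-divergence minimization to distill a multi-step denoising teacher model into a few-step denoising student model.
Result: At similar visual quality, FVGen reduces sampling time by more than 90%, significantly improving the efficiency of sparse-view reconstruction.
Insight: Distillation and optimization can substantially improve the efficiency of complex generative tasks while preserving high-quality outputs.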
Abstract: Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.
[95] Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification
Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Oktay Karakus
Main category: cs.CV
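TL;DR: This paper studies class imbalance in SAR ship classification, proposes two feature-space oversampling methods, M2m_f and M2m_u, and shows on two datasets that they outperform baseline methods.
Details
Motivation: Long-tailed data distributions in SAR ship classification make minority classes hard to classify, and oversampling methods developed for optical data still need to be validated and improved in feature space.
Contribution: Proposes two novel feature-space oversampling methods, M2m_f and M2m_u, that improve classification of minority classes.
Method: Building on the Major-to-minor (M2m) method, the authors design two feature-space oversampling algorithms and evaluate them with ViT, VGG16, and ResNet50 as feature extractors.
Result: The average F1-score increases by 8.82% on FuSARShip and by 4.44% on OpenSARShip.
Insight: Feature-space oversampling is an effective way to address class imbalance in SAR ship classification, and its effect on classes of different sizes merits further study.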
Abstract: SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluate the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. Additionally, we analyze the impact of oversampling methods on different class sizes. The results demonstrate the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.
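As a rough illustration of what oversampling in feature space means, the sketch below balances classes by interpolating between feature vectors of the same class (a SMOTE-style scheme). This is not the paper's M2m$_f$ or M2m$_u$ algorithm, whose details the abstract does not give; the function name `oversample_features` and the toy data are assumptions made for the example.

```python
import random

def oversample_features(features, labels, seed=0):
    """Balance classes by interpolating between same-class feature vectors.

    Generic SMOTE-style sketch, NOT the M2m_f / M2m_u methods from the paper.
    """
    rng = random.Random(seed)
    by_class = {}
    for vec, lab in zip(features, labels):
        by_class.setdefault(lab, []).append(vec)

    # Every class is grown to the size of the largest class.
    target = max(len(vecs) for vecs in by_class.values())
    out_x, out_y = list(features), list(labels)
    for lab, vecs in by_class.items():
        for _ in range(target - len(vecs)):
            a, b = rng.choice(vecs), rng.choice(vecs)
            t = rng.random()
            # New point on the segment between two real samples of this class.
            out_x.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            out_y.append(lab)
    return out_x, out_y

# Toy features as they might come out of a frozen extractor (hypothetical values).
x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
y = [0, 0, 0, 1]
new_x, new_y = oversample_features(x, y)
print(new_y.count(0), new_y.count(1))
```

Working on extracted features rather than raw SAR images keeps synthetic samples cheap to generate, at the cost of depending on how well the extractor separates classes.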
[96] MotionSwap
Om Patil,Jinesh Modi,Suryabha Mukhopadhyay,Meghaditya Giri,Chhavi Malhotra
Main category: cs.CV
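TL;DR: This paper presents improvements to the SimSwap framework: integrating self- and cross-attention mechanisms, dynamic loss weighting, and cosine annealing learning rate scheduling, which together improve face-swap quality and identity consistency.
Details
Motivation: Face swapping has drawn wide attention in academic and commercial applications, but existing methods leave room for improvement in identity preservation and visual quality.
Contribution: Proposes several enhancements, including integrated attention mechanisms, dynamic loss weighting, and learning rate scheduling, that improve SimSwap's performance.
Method: Augments the generator architecture with self- and cross-attention mechanisms, and optimizes training with dynamic loss weighting and cosine annealing learning rate scheduling.
Result: Over 400,000 training iterations, the enhanced model achieves higher identity similarity, lower FID scores, and visibly better visual quality than the baseline.
Insight: Attention mechanisms and dynamic training strategies are key to face-swap quality; future work could integrate StyleGAN3, 3D facial modeling, and temporal consistency.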
Abstract: Face swapping technology has gained significant attention in both academic research and commercial applications. This paper presents our implementation and enhancement of SimSwap, an efficient framework for high fidelity face swapping. We introduce several improvements to the original model, including the integration of self and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. These enhancements lead to significant improvements in identity preservation, attribute consistency, and overall visual quality. Our experimental results, spanning 400,000 training iterations, demonstrate progressive improvements in generator and discriminator performance. The enhanced model achieves better identity similarity, lower FID scores, and visibly superior qualitative results compared to the baseline. Ablation studies confirm the importance of each architectural and training improvement. We conclude by identifying key future directions, such as integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and introducing temporal consistency for video-based applications.
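For readers unfamiliar with cosine annealing, the following sketch computes the standard single-cycle schedule, in which the learning rate decays from a maximum to a minimum along half a cosine wave. Only the 400,000-iteration horizon comes from the abstract; the learning-rate bounds are hypothetical, and the authors' actual schedule (for example, whether it uses warm restarts) is not stated.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under a single-cycle cosine annealing schedule."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

TOTAL_STEPS = 400_000          # iteration count reported in the abstract
LR_MAX, LR_MIN = 2e-4, 1e-6    # hypothetical bounds, not taken from the paper

for s in (0, 100_000, 200_000, 400_000):
    print(s, cosine_annealing_lr(s, TOTAL_STEPS, LR_MAX, LR_MIN))
```

The schedule starts at `LR_MAX`, passes through the midpoint of the two bounds halfway through training, and ends at `LR_MIN`.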
[97] CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Shengzhu Yang,Jiawei Du,Shuai Lu,Weihang Zhang,Ningli Wang,Huiqi Li
Main category: cs.CV
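TL;DR: CLIPin is a non-contrastive plug-in that integrates seamlessly into CLIP-style architectures to improve multimodal semantic alignment. Its shared pre-projector design combines contrastive and non-contrastive learning, improving robustness and generalization.
Details
Motivation: Large-scale natural image-text datasets have weak semantic alignment, while medical datasets have high cross-modal correlation but low diversity; both hurt CLIP performance, so an improved method is needed.
Contribution: Proposes the CLIPin plug-in, a unified non-contrastive method that strengthens multimodal semantic alignment, and designs shared pre-projectors that combine contrastive and non-contrastive learning in a parameter-compromise manner.
Method: Two shared pre-projectors handle the image and text modalities respectively; non-contrastive learning strengthens semantic alignment while staying compatible with existing CLIP architectures.
Result: Experiments on diverse downstream tasks validate CLIPin's effectiveness and generality; it works as a plug-and-play component that improves various contrastive frameworks.
Insight: Through non-contrastive learning and parameter sharing, CLIPin offers an efficient and flexible enhancement for multimodal representation learning that suits datasets with different characteristics.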
Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model’s ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
[98] TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation
Mattia Litrico,Mario Valerio Giuffrida,Sebastiano Battiato,Devis Tuia
Main category: cs.CV
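TL;DR: TRUST is a novel unsupervised domain adaptation (UDA) method that uses the robustness of the language modality to guide adaptation of a vision model. It combines caption-based pseudo-label generation, uncertainty estimation, and a multimodal contrastive learning loss, improving performance under complex domain shifts.
Details
Motivation: Conventional UDA methods perform poorly under complex domain shifts (e.g., geographical shift), while the language modality has been found to be more robust during adaptation. TRUST therefore aims to exploit that robustness to improve the domain adaptation of vision models.
Contribution: 1. Caption-based pseudo-labels with an uncertainty estimation strategy that reduces the harm from low-quality pseudo-labels; 2. A multimodal soft-contrastive learning loss that avoids the conventional split into positive and negative pairs and uses caption similarity to guide feature alignment.
Method: 1. Generate pseudo-labels from the captions of target samples and estimate their uncertainty from normalised CLIP similarity scores; 2. Apply a multimodal soft-contrastive learning loss in which caption similarity sets the strength of feature alignment.
Result: On classical (DomainNet) and complex (GeoNet) domain shifts, TRUST outperforms previous methods and sets a new state of the art.
Insight: The language modality is more robust under complex domain shifts, and multimodal alignment can improve the domain adaptation of vision models. Uncertainty estimation and the soft-contrastive mechanism are the key innovations.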
Abstract: Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shift), where both the background and object appearances differ significantly across domains. Prior works showed that the language modality can help in the adaptation process, exhibiting more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. Such estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need to assign hard positive and negative pairs, which is critical in the UDA setting. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
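One plausible reading of the uncertainty-based reweighting can be sketched as follows: caption-to-class similarity scores are normalised with a softmax, the top class becomes the pseudo-label, and its probability weights the classification loss. The abstract does not specify the normalisation or the weighting rule, so the softmax, the `temperature` value, and both function names are assumptions made for illustration.

```python
import math

def pseudo_label_with_weight(similarities, temperature=0.01):
    """Return (pseudo_label, confidence) from caption-to-class similarity scores.

    Assumption: scores are softmax-normalised and the top probability is used
    as the confidence; the paper's exact estimator may differ.
    """
    top = max(similarities)
    exps = [math.exp((s - top) / temperature) for s in similarities]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

def weighted_cross_entropy(model_probs, label, weight):
    """Cross-entropy on the pseudo-label, scaled by its confidence."""
    return -weight * math.log(model_probs[label])

# Hypothetical similarity scores between one caption and three class prompts.
label, weight = pseudo_label_with_weight([0.30, 0.21, 0.20])
loss = weighted_cross_entropy([0.7, 0.2, 0.1], label, weight)
print(label, weight, loss)
```

An ambiguous caption yields near-uniform probabilities, so its pseudo-label contributes little to the loss, which is the intended damping of wrong labels from low-quality captions.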
[99] WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Main category: cs.CV
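TL;DR: WGAST is a weakly supervised generative network that fuses multi-source satellite data in space and time to estimate daily land surface temperature (LST) at 10 m resolution, outperforming existing methods.
Details
Motivation: Urbanization, climate change, and agricultural stress increase the demand for fine-grained environmental monitoring, yet remote sensing systems trade off spatial against temporal resolution, and methods for estimating daily LST at high resolution are lacking.
Contribution: 1. WGAST, the first end-to-end deep learning framework for daily LST estimation at 10 m resolution; 2. A conditional generative adversarial architecture with a weakly supervised training strategy based on physical averaging principles.
Method: 1. A four-stage generator: feature extraction (encoders), spatio-temporal fusion (cosine similarity, normalization, temporal attention), LST reconstruction (decoding), and noise suppression (Gaussian filter); 2. A PatchGAN discriminator with weakly supervised training.
Result: Compared with the best-performing baseline, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%; it is also robust to cloud interference and captures fine-scale thermal patterns.
Insight: Through multi-level feature fusion and physically constrained weak supervision, WGAST addresses the challenge of high-resolution LST estimation and offers a new tool for environmental monitoring.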
Abstract: Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.
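The physical averaging principle can be read as a consistency constraint: a coarse-resolution LST pixel should equal the mean of the fine-resolution pixels it covers. The sketch below expresses that constraint as a loss; it is an interpretation rather than the paper's training objective, and `block_average`, `averaging_loss`, and the toy grids are hypothetical.

```python
def block_average(grid, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be a multiple of factor")
    pooled = []
    for r in range(0, rows, factor):
        row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled

def averaging_loss(fine_pred, coarse_obs, factor):
    """Mean squared error between the pooled fine prediction and the coarse observation."""
    pooled = block_average(fine_pred, factor)
    diffs = [(p - o) ** 2
             for p_row, o_row in zip(pooled, coarse_obs)
             for p, o in zip(p_row, o_row)]
    return sum(diffs) / len(diffs)

# Toy example: a 2x2 fine prediction (kelvin) against a single coarse pixel.
fine = [[300.0, 302.0], [304.0, 306.0]]
print(averaging_loss(fine, [[303.0]], factor=2))
```

Such a loss needs no fine-resolution ground truth, which is what makes the supervision weak: only the coarse observation constrains the high-resolution output.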
cs.SE [Back]
[100] Position: Intelligent Coding Systems Should Write Programs with Justifications
Xiangzhe Xu,Shiwei Feng,Zian Su,Chengpeng Wang,Xiangyu Zhang
Main category: cs.SE
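TL;DR: This position paper argues that intelligent coding systems should produce clear justifications alongside the code they generate, to improve user trust and usability. It identifies two key properties, cognitive alignment and semantic faithfulness, and discusses the potential of neuro-symbolic approaches for justification generation.
Details
Motivation: Intelligent coding systems can generate code from natural-language descriptions, but their decision-making is opaque, so users (especially non-experts) struggle to understand and trust the output.
Contribution: Argues that intelligent coding systems should generate justifications, defines the two key properties of cognitive alignment and semantic faithfulness, and examines neuro-symbolic approaches as a solution.
Method: Generates justifications with neuro-symbolic approaches, in which symbolic constraints guide model behavior during training and neural representations enrich program semantics, enabling automated consistency checks at inference time.
Result: Existing methods (formal verification, static analysis, and post-hoc explainability) have limitations for producing justifications; neuro-symbolic approaches may be a better solution.
Insight: Justification generation is essential for the transparency of intelligent coding systems and for user trust; neuro-symbolic approaches combine the strengths of symbolic logic and neural networks and are a promising research direction.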
Abstract: Intelligent coding systems are transforming software development by enabling users to specify code behavior in natural language. However, the opaque decision-making of AI-driven coders raises trust and usability concerns, particularly for non-expert users who cannot inspect low-level implementations. We argue that these systems should not only generate code but also produce clear, consistent justifications that bridge model reasoning and user understanding. To this end, we identify two critical justification properties, cognitive alignment and semantic faithfulness, and highlight the limitations of existing methods, including formal verification, static analysis, and post-hoc explainability. We advocate exploring neuro-symbolic approaches for justification generation, where symbolic constraints guide model behavior during training and program semantics are enriched through neural representations, enabling automated consistency checks at inference time.
cs.RO [Back]
[101] Integrating Vision Foundation Models with Reinforcement Learning for Enhanced Object Interaction
Ahmad Farooq,Kamran Iqbal
Main category: cs.RO
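TL;DR: This paper proposes combining vision foundation models with reinforcement learning to improve object interaction in simulated environments. Pairing the Segment Anything Model (SAM) and YOLOv5 with a PPO-based agent improves interaction success rate and navigation efficiency in the AI2-THOR simulator.
Details
Motivation: Reinforcement learning agents have limited perception when interacting with objects in complex environments. The paper explores how vision foundation models can improve an agent's perception and interaction abilities.
Contribution: Combines vision foundation models (SAM and YOLOv5) with reinforcement learning (PPO), improving the agent's interaction success rate and navigation efficiency in simulation.
Method: SAM and YOLOv5 provide object segmentation and detection, and a PPO agent uses them to interact in the AI2-THOR environment. Experiments validate the combination.
Result: Relative to the baseline, average cumulative reward increases by 68%, object interaction success rate improves by 52.5%, and navigation efficiency increases by 33%.
Insight: Combining vision foundation models with reinforcement learning offers a new approach to complex robotic tasks; future work could explore more complex scenes and tasks.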
Abstract: This paper presents a novel approach that integrates vision foundation models with reinforcement learning to enhance object interaction capabilities in simulated environments. By combining the Segment Anything Model (SAM) and YOLOv5 with a Proximal Policy Optimization (PPO) agent operating in the AI2-THOR simulation environment, we enable the agent to perceive and interact with objects more effectively. Our comprehensive experiments, conducted across four diverse indoor kitchen settings, demonstrate significant improvements in object interaction success rates and navigation efficiency compared to a baseline agent without advanced perception. The results show a 68% increase in average cumulative reward, a 52.5% improvement in object interaction success rate, and a 33% increase in navigation efficiency. These findings highlight the potential of integrating foundation models with reinforcement learning for complex robotic tasks, paving the way for more sophisticated and capable autonomous agents.
[102] Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model
Hanqing Wang,Shaoyang Wang,Yiming Zhong,Zemin Yang,Jiamin Wang,Zhiqing Cui,Jiahao Yuan,Yifan Han,Mingyu Liu,Yuexin Ma
Main category: cs.RO
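TL;DR: Affordance-R1 is a unified affordance grounding framework that combines CoT reasoning with GRPO reinforcement learning. It is the first to integrate reinforcement learning with reasoning in affordance reasoning, achieving zero-shot generalization and explicit reasoning.
Details
Motivation: Existing models lack Chain-of-Thought reasoning abilities, so they fail to exploit affordances shared among different objects, which limits generalization and explicit reasoning.
Contribution: 1. Affordance-R1, the first unified affordance grounding framework; 2. A sophisticated affordance function with format, perception, and cognition rewards; 3. A high-quality dataset, ReasonAff; 4. The first integration of GRPO-based reinforcement learning with reasoning.
Method: 1. Combine CoT reasoning with GRPO reinforcement learning; 2. Use the affordance function to guide optimization; 3. Train on the ReasonAff dataset; 4. Train purely with reinforcement learning to obtain generalization and reasoning.
Result: Affordance-R1 outperforms existing methods in generalization and reasoning and exhibits open-world generalization.
Insight: Combining reinforcement learning with reasoning can improve both the generalization and the explicit reasoning of affordance reasoning models.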
Abstract: Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released at https://github.com/hq-King/Affordance-R1.
[103] Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation
Youguang Xing,Xu Luo,Junlin Xie,Lianli Gao,Hengtao Shen,Jingkuan Song
Main category: cs.RO
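TL;DR: This paper studies shortcut learning in generalist robot policies, attributes their limited generalization to dataset diversity and fragmentation problems, and proposes data augmentation strategies to improve performance.
Details
Motivation: Generalist robot policies trained on large-scale datasets such as OXE perform well but generalize poorly. The paper aims to uncover the underlying cause and propose remedies.
Contribution: Identifies shortcut learning as a key reason for limited generalization and analyzes the effects of dataset diversity and fragmentation. It also proposes data augmentation strategies that reduce shortcut learning.
Method: Theoretical and empirical analysis identifies how dataset diversity and fragmentation drive shortcut learning, and experiments validate the data augmentation strategies.
Result: Data augmentation reduces shortcut learning and improves the generalization of generalist robot policies in both simulation and real-world environments.
Insight: Limited diversity within sub-datasets and distributional disparities across sub-datasets of large-scale datasets induce shortcut learning, and data augmentation is an effective way to improve generalization.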
Abstract: Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning – the reliance on task-irrelevant features – as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.
cs.DC [Back]
[104] KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training
Kai Zhang,Peng Wang,Sai Bi,Jianming Zhang,Yuanjun Xiong
Main category: cs.DC
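TL;DR: KnapFormer is an efficient load-balancing framework for distributed training that combines sequence parallelism with global scheduling to optimize workload distribution in Diffusion Transformer training, yielding a substantial speedup.
Details
Motivation: In distributed training of Diffusion Transformers, variable-length text inputs and uneven visual token counts produce unbalanced workloads that reduce training efficiency. KnapFormer targets this problem.
Contribution: Proposes the KnapFormer framework, which combines sequence parallelism with global load balancing and solves a knapsack problem to distribute work efficiently, reducing communication overhead and workload discrepancy.
Method: 1) Gather sequence-length metadata; 2) Solve a global knapsack problem to minimize workload variance; 3) Integrate DeepSpeed-Ulysses sequence parallelism and a semi-empirical workload model.
Result: In real-world training workloads, workload discrepancy is below 1%, straggler effects are eliminated, and training is 2x to 3x faster.
Insight: The synergy between sequence parallelism and load balancing is key to optimizing Diffusion Transformer training, and KnapFormer provides an efficient solution for similar settings.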
Abstract: We present KnapFormer, an efficient and versatile framework to combine workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and solving a global knapsack problem. The solver aims to minimize the variance of the total workload per GPU, while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process and utilizing a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence length varying from a few hundred to tens of thousands. It eliminates straggler effects and achieves 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at https://github.com/Kai-46/KnapFormer/
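To make the balancing idea concrete, here is a minimal sketch that assigns variable-length sequences to GPUs with a greedy longest-first heuristic. It is not KnapFormer's solver and it ignores sequence parallelism (splitting one sequence across GPUs); the `workload` cost model and its coefficients are hypothetical stand-ins for the semi-empirical model the abstract mentions.

```python
import heapq

def workload(seq_len, linear=1.0, quadratic=1e-4):
    """Hypothetical cost model: a linear term plus an attention-like quadratic term."""
    return linear * seq_len + quadratic * seq_len ** 2

def balance(seq_lens, num_gpus):
    """Greedy longest-first assignment of sequences to the least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (current load, gpu index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    for length in sorted(seq_lens, reverse=True):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(length)
        heapq.heappush(heap, (load + workload(length), gpu))
    return assignment

# Sequence lengths spanning a few hundred to tens of thousands of tokens.
lengths = [300, 512, 1024, 4096, 8192, 16384, 20000, 700]
for gpu, seqs in enumerate(balance(lengths, num_gpus=4)):
    print(gpu, seqs, round(sum(workload(s) for s in seqs)))
```

With a quadratic cost term, one very long sequence can dominate a GPU's load no matter how the rest are placed, which suggests why splitting long sequences across ranks and balancing are treated jointly in the paper.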
cs.AI [Back]
[105] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
Yuhang Liu,Zeyu Liu,Shuanghe Zhu,Pengxiang Li,Congkai Xie,Jiasheng Wang,Xueyu Hu,Xiaotian Han,Jianbo Yuan,Xinyao Wang,Shengyu Zhang,Hongxia Yang,Fei Wu
Main category: cs.AI
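TL;DR: The paper proposes Adaptive Exploration Policy Optimization (AEPO), a framework that uses a multi-answer generation strategy and a theoretically grounded Adaptive Exploration Reward (AER) function to improve semantic alignment in GUI tasks, setting new state-of-the-art results on multiple benchmarks.
Details
Motivation: Multimodal Large Language Models (MLLMs) handle spatial alignment in graphical user interface (GUI) tasks reasonably well, but inefficient exploration limits semantic alignment and prevents models from learning difficult semantic associations.
Contribution: Proposes the AEPO framework, which improves semantic alignment through a multi-answer generation strategy and an AER function derived from efficiency principles.
Method: AEPO has two parts: (1) a multi-answer generation strategy that forces broader exploration; (2) a theoretically grounded AER function that guides exploration. Experiments use the InfiGUI-G1-3B and InfiGUI-G1-7B models.
Result: Sets new state-of-the-art results on multiple GUI benchmarks, with relative improvements of up to 9.0% over the naive RLVR baseline, most notably on generalization and semantic understanding tasks.
Insight: Exploration efficiency is a key factor in semantic alignment; AEPO addresses the exploration bottleneck with a theoretically grounded reward function and multi-answer generation.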
Abstract: The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, $\eta = U/C$. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
cs.IR [Back]
[106] Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
Jin Khye Tan,En Jun Choong,Ethan Jeremiah Chitty,Yan Pheng Choo,John Hsin Yang Wong,Chern Eu Cheah
Main category: cs.IR
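TL;DR: This paper presents a vision-language model fine-tuned from Qwen2.5-VL-7B that converts tables in Malaysian audited financial reports to Markdown, improving conversion accuracy and efficiency.
Details
Motivation: Accurately extracting tabular data from financial documents and representing its structure is a key challenge in document understanding, especially for tables with complex layouts.
Contribution: 1) A dataset of 2,152 image-text pairs; 2) A LoRA-based supervised fine-tuning strategy; 3) A new Markdown TEDS metric for evaluating structural fidelity.
Method: Uses Qwen2.5-VL-7B as the base model and improves it with data augmentation and LoRA fine-tuning.
Result: The model reaches 92.20% accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score, outperforming its base model and other larger VLMs.
Insight: A domain-specific fine-tuned model can outperform general-purpose models while reducing computational overhead.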
Abstract: Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI’s GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
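For context, the commonly used definition of Tree-Edit-Distance-based Similarity compares two tables represented as trees. The formula below is that standard form, given as background; the paper's Markdown variant may define the trees or the edit costs differently, and the symbols $T_a$ and $T_b$ are notation introduced here.

```latex
% T_a, T_b: tree representations of the predicted and reference tables
% EditDist: tree edit distance; |T|: number of nodes in tree T
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical, and the score decreases as more node insertions, deletions, or substitutions are needed to turn one tree into the other.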
cs.CR [Back]
[107] DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing
Ko-Wei Chuang,Hen-Hsen Huang,Tsai-Yen Li
Main category: cs.CR
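TL;DR: The paper proposes DINA, a framework with a dual defense strategy against internal label noise and external adversarial attacks in NLP, improving model robustness and accuracy.
Details
Motivation: As large language models are increasingly used in customer service and content moderation, internal label noise and external adversarial attacks form a dual threat that calls for a unified defense framework.
Contribution: Proposes the DINA framework, which combines noisy-label learning with adversarial training to give NLP systems a dual defense.
Method: Adapts noisy-label learning methods from computer vision and integrates them with adversarial training in a unified defense framework.
Result: On a real-world dataset from an online gaming service, DINA outperforms baseline models in robustness and accuracy.
Insight: Defending against both threats is essential for NLP systems, and the work offers practical strategies for fair and responsible AI deployment.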
Abstract: As large language models (LLMs) and generative AI become increasingly integrated into customer service and moderation applications, adversarial threats emerge from both external manipulations and internal label corruption. In this work, we identify and systematically address these dual adversarial threats by introducing DINA (Dual Defense Against Internal Noise and Adversarial Attacks), a novel unified framework tailored specifically for NLP. Our approach adapts advanced noisy-label learning methods from computer vision and integrates them with adversarial training to simultaneously mitigate internal label sabotage and external adversarial perturbations. Extensive experiments conducted on a real-world dataset from an online gaming service demonstrate that DINA significantly improves model robustness and accuracy compared to baseline models. Our findings not only highlight the critical necessity of dual-threat defenses but also offer practical strategies for safeguarding NLP systems in realistic adversarial scenarios, underscoring broader implications for fair and responsible AI deployment.
[108] DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection
Kaichuan Kong,Dongjie Liu,Xiaobo Jin,Guanggang Geng,Zhiying Li,Jian Weng
Main category: cs.CR
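TL;DR: DMFI is a dual-modality framework that combines semantic inference with behavior-aware fine-tuning to improve insider threat detection (ITD), outperforming existing methods.
Details
Motivation: Insider threats are subtle, long-term, and context-dependent, so traditional models struggle to capture semantic intent and complex behavior dynamics, while existing LLM-based methods are limited in prompt adaptability and modality coverage.
Contribution: Proposes the DMFI framework, comprising a dual-modality transformation, independently fine-tuned LoRA-enhanced LLMs, and a lightweight decision module; further introduces the DMFI-B strategy, which improves robustness under class imbalance.
Method: Raw logs are converted into a semantic view, processed with instruction-formatted prompts, and a behavioral abstraction, built with a 4W transformation. Two LoRA-enhanced LLMs are fine-tuned independently and their outputs are fused by an MLP module.
Result: On the CERT r4.2 and r5.2 datasets, DMFI achieves higher detection accuracy than state-of-the-art methods.
Insight: Combining the semantic reasoning of LLMs with structured behavior modeling improves insider threat detection and offers a scalable solution for real-world deployment.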
Abstract: Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity due to the subtle, long-term, and context-dependent nature of malicious insider behaviors. Traditional models often struggle to capture semantic intent and complex behavior dynamics, while existing LLM-based solutions face limitations in prompt adaptability and modality coverage. To bridge this gap, we propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning. DMFI converts raw logs into two structured views: (1) a semantic view that processes content-rich artifacts (e.g., emails, https) using instruction-formatted prompts; and (2) a behavioral abstraction, constructed via a 4W-guided (When-Where-What-Which) transformation to encode contextual action sequences. Two LoRA-enhanced LLMs are fine-tuned independently, and their outputs are fused via a lightweight MLP-based decision module. We further introduce DMFI-B, a discriminative adaptation strategy that separates normal and abnormal behavior representations, improving robustness under severe class imbalance. Experiments on CERT r4.2 and r5.2 datasets demonstrate that DMFI outperforms state-of-the-art methods in detection accuracy. Our approach combines the semantic reasoning power of LLMs with structured behavior modeling, offering a scalable and effective solution for real-world insider threat detection. Our work demonstrates the effectiveness of combining LLM reasoning with structured behavioral modeling, offering a scalable and deployable solution for modern insider threat detection.
[109] ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
Sanket Badhe
Main category: cs.CR
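TL;DR: The paper presents ScamAgent, an autonomous multi-turn conversational agent built on LLMs that generates highly realistic scam call scripts, and shows that current LLM safety guardrails are ineffective against it.
Details
Motivation: Prior work focuses on single-shot prompt misuse and overlooks the threat posed by LLMs in multi-turn conversations. By building ScamAgent, the author aims to expose how LLMs can be misused in dynamic dialogue.
Contribution: 1) Develops ScamAgent, demonstrating that LLMs can generate realistic scam scripts across multiple turns; 2) Exposes the limitations of current LLM safety guardrails; 3) Calls for multi-turn safety auditing.
Method: Builds ScamAgent on LLMs, using dialogue memory and dynamic adaptation to simulated user responses to reproduce real-world fraud scenarios, and converts the scripts into voice calls with text-to-speech (TTS) systems.
Result: ScamAgent bypasses existing LLM safety mechanisms and produces realistic scam calls, demonstrating the severity of multi-turn threats.
Insight: The misuse risk of LLMs in multi-turn conversations is underestimated; safeguards against agent-level threats and multi-turn auditing frameworks are urgently needed.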
Abstract: Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.
[110] Universally Unfiltered and Unseen: Input-Agnostic Multimodal Jailbreaks against Text-to-Image Model Safeguards
Song Yan,Hui Wei,Jinlong Fei,Guoliang Yang,Zhengyu Zhao,Zheng Wamg
Main category: cs.CR
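TL;DR: This paper proposes U3-Attack, a multimodal adversarial attack that bypasses the safeguards of text-to-image (T2I) models. It optimizes an adversarial patch on the image background and a set of safe paraphrases for sensitive words, yielding a universal and efficient attack.
Details
Motivation: Existing multimodal attacks are tied to specific text prompts or image perturbations and lack universality and scalability. U3-Attack addresses this by using a universal attack to expose potential vulnerabilities in T2I safeguards.
Contribution: U3-Attack proposes two optimization strategies: 1) optimizing an adversarial patch on the image background to bypass safety checkers; 2) optimizing a safe paraphrase set for a sensitive word to bypass prompt filters. Together they raise attack success rate and efficiency.
Method: U3-Attack combines adversarial optimization over images and text. At the image level it optimizes a background patch; at the text level it generates a safe paraphrase set. The method avoids redundant computation, improving scalability.
Result: Experiments show that U3-Attack performs strongly on open-source and commercial T2I models; for example, on the Runway-inpainting model its success rate is about 4x that of the state-of-the-art method.
Insight: The study shows that T2I safeguards are vulnerable to universal multimodal attacks and underscores the need for more robust protection mechanisms.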
Abstract: Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both prompt filter and safety checker, our U3-Attack achieves $\sim 4\times$ higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion. Content Warning: This paper includes examples of NSFW content.
eess.IV [Back]
[111] Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Shuo Chen,Yijin Li,Xi Zheng,Guofeng Zhang
Main category: eess.IV
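TL;DR: The paper proposes NFH-SEM, a neural-field-based method that achieves high-fidelity 3D reconstruction of complex microstructures from multi-view, multi-detector SEM images, addressing the calibration and shadow problems of existing methods.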
Details
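Motivation: Conventional SEM images do not directly reveal the 3D topography of micro samples, and existing 3D reconstruction methods perform poorly on complex microstructures because of discrete representations, reliance on calibration samples, and shadow-induced gradient errors. Contribution: Proposes NFH-SEM, a neural-field-based hybrid method with end-to-end self-calibration and automatic shadow separation that accurately reconstructs complex microstructures.
Method: Takes multi-view, multi-detector SEM images as input, fuses geometric and photometric information into a continuous neural field representation, and automatically separates out shadows during training.
Result: NFH-SEM is validated on real and simulated datasets, where it reconstructs a variety of complex samples and demonstrates high accuracy and broad applicability.
Insight: Neural field representations offer a continuous and flexible solution for SEM 3D reconstruction; end-to-end self-calibration and shadow handling markedly improve reconstruction quality on complex samples.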
Abstract: The scanning electron microscope (SEM) is a widely used imaging device in scientific research and industrial applications. Conventional two-dimensional (2D) SEM images do not directly reveal the three-dimensional (3D) topography of micro samples, motivating the development of SEM 3D surface reconstruction methods. However, reconstruction of complex microstructures remains challenging for existing methods due to the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. Here, we introduce NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input and fuses geometric and photometric information into a continuous neural field representation. NFH-SEM eliminates the manual calibration procedures through end-to-end self-calibration and automatically disentangles shadows from SEM images during training, enabling accurate reconstruction of intricate microstructures. We validate the effectiveness of NFH-SEM on real and simulated datasets. Our experiments show high-fidelity reconstructions of diverse, challenging samples, including two-photon lithography microstructures, peach pollen, and silicon carbide particle surfaces, demonstrating precise detail and broad applicability.
[112] Transformer-Based Explainable Deep Learning for Breast Cancer Detection in Mammography: The MammoFormer Framework
Ojonugwa Oluwafemi Ejiga Peter,Daniel Emakporuena,Bamidele Dayo Tunde,Maryam Abdulkarim,Abdullahi Bn Umar
Main category: eess.IV
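TL;DR: The MammoFormer framework combines a transformer architecture, multi-feature enhancement techniques, and explainable AI functionality, markedly improving the performance and clinical utility of breast cancer detection.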
Details
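Motivation: In breast cancer detection, experts struggle to identify subtle abnormalities and interpretations vary between readers, while CNNs for medical image analysis cannot adequately handle both local and global information and lack explainability. Contribution: 1. Optimizes transformer architectures through feature enhancement, with performance gains of up to 13%; 2. Integrates multi-perspective explainable AI; 3. Provides a clinically deployable system combining CNNs and transformers.
Method: Tests seven architectures spanning CNNs, ViT, Swin Transformer, and ConvNext, together with four feature enhancement techniques (original images, negative transformation, adaptive histogram equalization, HOG).
Result: ViT combined with AHE reaches 98.3% accuracy, and Swin Transformer gains 13% from HOG enhancement.
Insight: Transformers combined with feature enhancement can match or surpass CNNs while providing explainability, which helps the clinical acceptance of AI.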
Abstract: Breast cancer detection through mammography interpretation remains difficult because of the minimal nature of abnormalities that experts need to identify alongside the variable interpretations between readers. The potential of CNNs for medical image analysis faces two limitations: they fail to process both local information and wide contextual data adequately, and do not provide explainable AI (XAI) operations that doctors need to accept them in clinics. The researcher developed the MammoFormer framework, which unites transformer-based architecture with multi-feature enhancement components and XAI functionalities within one framework. Seven different architectures consisting of CNNs, Vision Transformer, Swin Transformer, and ConvNext were tested alongside four enhancement techniques, including original images, negative transformation, adaptive histogram equalization, and histogram of oriented gradients. The MammoFormer framework addresses critical clinical adoption barriers of AI mammography systems through: (1) systematic optimization of transformer architectures via architecture-specific feature enhancement, achieving up to 13% performance improvement, (2) comprehensive explainable AI integration providing multi-perspective diagnostic interpretability, and (3) a clinically deployable ensemble system combining CNN reliability with transformer global context modeling. The combination of transformer models with suitable feature enhancements enables them to achieve equal or better results than CNN approaches. ViT achieves 98.3% accuracy alongside AHE while Swin Transformer gains a 13.0% advantage through HOG enhancements.
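To make two of the enhancement techniques named in the abstract concrete, here is a minimal pure-Python sketch. It is not the MammoFormer implementation: the `negative` and `equalize` helpers are hypothetical names, images are assumed to be 8-bit grayscale lists of rows, and `equalize` performs plain global histogram equalization as a simplified stand-in for the adaptive variant (AHE) that the paper uses.

```python
from typing import List

Image = List[List[int]]  # 8-bit grayscale image stored as rows of pixel values


def negative(img: Image, max_val: int = 255) -> Image:
    """Negative transformation: invert every pixel intensity."""
    return [[max_val - p for p in row] for row in img]


def equalize(img: Image, levels: int = 256) -> Image:
    """Global histogram equalization (assumption: a non-adaptive stand-in for AHE)."""
    # Count how often each intensity occurs.
    hist = [0] * levels
    for row in img:
        for p in row:
            hist[p] += 1

    # Build the cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == cdf_min:  # empty or constant image: nothing to stretch
        return [row[:] for row in img]

    # Map each intensity so the output histogram is roughly uniform.
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in img]
```

For a small image such as `[[50, 50], [100, 150]]`, `equalize` stretches the intensities to cover the full 0 to 255 range, whereas an adaptive method would do the same per local tile rather than over the whole image.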
[113] Clinically-guided Data Synthesis for Laryngeal Lesion Detection
Chiara Baldini,Kaisar Kushibar,Richard Osuala,Simone Balocco,Oliver Diaz,Karim Lekadir,Leonardo S. Mattos
Main category: eess.IV
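TL;DR: The paper proposes a method based on a latent diffusion model (LDM) with a ControlNet adapter to generate high-quality laryngeal endoscopic image-annotation pairs, addressing data scarcity. Experiments show that the synthetic data markedly improves laryngeal lesion detection.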
Details
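Motivation: Existing computer-aided diagnosis/detection systems for laryngeal endoscopy are limited by data scarcity and highly heterogeneous lesions; they depend on expert annotation and generalize poorly. Contribution: The main contribution is a synthetic data generation method guided by clinical observations, which combines an LDM with a ControlNet adapter to generate diverse, clinically relevant image data.
Method: The authors use an LDM with a ControlNet adapter to generate realistic laryngeal endoscopic images through a conditioned diffusion process, with clinical observations guiding the diversity of the synthetic data.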
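Result: Adding only 10% synthetic data improved the lesion detection rate (by 9% on internal data and 22.1% on out-of-domain external data). Image realism was assessed by asking five expert otorhinolaryngologists to rate their confidence in distinguishing synthetic from real images.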
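Insight: Using synthetic data to address data scarcity is feasible in medical imaging, and clinical relevance is key to the practical usefulness of synthetic data.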
Abstract: Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, during a downstream task of detection, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% when the model was internally tested and 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking 5 expert otorhinolaryngologists with varying expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.
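The downstream experiment above adds a small share of synthetic samples to the real training set. A minimal sketch of that mixing step follows; the `augment_with_synthetic` helper is a hypothetical name, and interpreting "10%" as 10% of the real training-set size is an assumption, since the abstract does not specify how the ratio is measured.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def augment_with_synthetic(
    real: Sequence[T],
    synthetic: Sequence[T],
    ratio: float = 0.10,
    seed: int = 0,
) -> List[T]:
    """Return the real samples plus a random subset of synthetic ones.

    Assumption: `ratio` is measured relative to the number of real samples.
    """
    # Never request more synthetic samples than are available.
    k = min(len(synthetic), round(ratio * len(real)))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = list(real) + rng.sample(list(synthetic), k)
    rng.shuffle(mixed)
    return mixed
```

With 100 real and 50 synthetic samples, this yields a training set of 110 items, 10 of which are synthetic.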
cs.LG [Back]
[114] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models
Xuan Lin,Long Chen,Yile Wang
Main category: cs.LG
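TL;DR: AttriLens-Mol is an attribute-guided reinforcement learning framework for molecular property prediction with large language models; it improves predictions through structured-output rewards and relevance verification.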
Details
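Motivation: LLMs applied to molecular property prediction rely on human-crafted prompt templates, and the reasoning they generate can be verbose and irrelevant, so a more effective way to guide them is needed. Contribution: Proposes the AttriLens-Mol framework, which uses three rewards (format, count, and rationality) to steer the model toward more relevant molecular attributes, improving prediction performance and interpretability.
Method: A reinforcement learning framework combining a format reward (structured output), a count reward (avoiding enumeration of irrelevant attributes), and a rationality reward (verifying relatedness with advanced LLMs and RDKit).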
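Result: On in-distribution and out-of-distribution datasets, the method achieves results comparable to or better than supervised fine-tuning models and advanced models (such as GPT-4o). The extracted attributes also perform better as features for a decision tree model than attributes obtained by prompting LLMs.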
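Insight: Guiding an LLM's reasoning with reinforcement learning elicits its inherent knowledge of molecular attributes more effectively, improving prediction performance while enhancing interpretability.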
Abstract: Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended "thinking" process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning, enabling more effective predictions of the molecular property. Experiments on both in-distribution and out-of-distribution datasets show that training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts performance, yielding comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code at https://github.com/szu-tera/AttriLens-Mol.
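The three rewards listed in the abstract can be illustrated with a small rule-based sketch. Everything specific here is an assumption rather than the paper's design: the `<attributes>`/`<answer>` output template, the semicolon-separated attribute list, the `min_attrs`/`max_attrs` bounds, and the unweighted sum. The rationality reward, which the paper computes with advanced LLMs and RDKit, is left as a caller-supplied number.

```python
import re
from typing import List

# Hypothetical output template; the paper's actual format may differ.
PATTERN = re.compile(
    r"^<attributes>(?P<attrs>.+?)</attributes>\s*<answer>(?P<answer>.+?)</answer>$",
    re.DOTALL,
)


def parse_attributes(output: str) -> List[str]:
    """Extract the semicolon-separated attribute list, or [] if malformed."""
    match = PATTERN.match(output.strip())
    if match is None:
        return []
    return [a.strip() for a in match.group("attrs").split(";") if a.strip()]


def format_reward(output: str) -> float:
    """1.0 when the output follows the structured template, else 0.0."""
    return 1.0 if PATTERN.match(output.strip()) else 0.0


def count_reward(output: str, min_attrs: int = 3, max_attrs: int = 10) -> float:
    """1.0 when the number of listed attributes stays within assumed bounds."""
    n = len(parse_attributes(output))
    return 1.0 if min_attrs <= n <= max_attrs else 0.0


def total_reward(output: str, rationality: float = 0.0) -> float:
    """Unweighted sum of the rewards (assumption); `rationality` is an
    externally computed score standing in for the LLM/RDKit check."""
    return format_reward(output) + count_reward(output) + rationality
```

A well-formed completion such as `<attributes>LogP: 2.1; TPSA: 40.5; HBD: 1</attributes><answer>Yes</answer>` earns both rule-based rewards, while free-form text earns neither.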
[115] Sample-efficient LLM Optimization with Reset Replay
Zichuan Liu,Jinyu Wang,Lei Song,Jiang Bian
Main category: cs.LG
TL;DR: The paper introduces LoRR, a plugin to improve sample efficiency in preference-based LLM optimization by combining high replay training with periodic resets and hybrid objectives.
Details
Motivation: Addressing low sample efficiency and primacy bias in post-training LLM optimization methods, which degrade policy quality. Contribution: Proposes LoRR, a sample-efficient plugin for preference-based LLM optimization frameworks.
Method: LoRR uses high replay training with periodic resets and hybrid objectives (SFT + preference-based loss) to maximize data utility.
Result: LoRR boosts performance on math and reasoning tasks, matching or outperforming complex RL-based methods.
Insight: LoRR provides a practical and efficient paradigm for LLM finetuning with limited data.
Abstract: Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
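The ingredients named in the abstract (high replay, periodic resets, and a hybrid SFT plus preference objective) can be sketched on scalar log-probabilities. This is an illustration under stated assumptions, not the LoRR algorithm: the preference term is the standard DPO loss, while the `sft_weight` mixing coefficient, the fixed `reset_every` interval, and the function names are hypothetical.

```python
import math
from typing import Iterator, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)


def hybrid_loss(logp_w: float, logp_l: float,
                ref_logp_w: float, ref_logp_l: float,
                sft_weight: float = 1.0, beta: float = 0.1) -> float:
    """Preference loss plus an SFT term on the preferred response.

    Assumption: the two terms are combined as a weighted sum.
    """
    sft = -logp_w  # negative log-likelihood of the preferred response
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) + sft_weight * sft


def replay_schedule(num_batches: int, replay: int,
                    reset_every: int) -> Iterator[Tuple[int, int, bool]]:
    """Yield (batch_index, replay_index, reset_after_update) for each update.

    Each collected batch is reused `replay` times; a reset is signalled after
    every `reset_every` gradient updates (assumed fixed-interval schedule).
    """
    step = 0
    for batch_index in range(num_batches):
        for replay_index in range(replay):
            step += 1
            yield batch_index, replay_index, step % reset_every == 0
```

In a real training loop, the reset flag would trigger re-initialization of the policy weights before training continues on previously collected data; how exactly the reset and data reuse are performed is specified in the paper, not here.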
[116] FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields
Junhyeog Yun,Minui Hong,Gunhee Kim
Main category: cs.LG
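TL;DR: FedMeNF is a privacy-preserving federated meta-learning method for neural fields; a new loss function reduces privacy leakage while supporting efficient optimization on few-shot and non-IID data.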
Details
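Motivation: Learning neural fields efficiently requires large amounts of data and computation, and traditional FML approaches suffer from privacy leakage; FedMeNF aims to address these challenges. Contribution: Proposes FedMeNF, with a privacy-preserving loss function for secure and efficient federated meta-learning.
Method: A privacy-preserving loss function regulates privacy leakage in the local meta-optimization, so the local meta-learner optimizes quickly without retaining the client's private data.
Result: Experiments show that FedMeNF optimizes quickly and maintains reconstruction performance even with few-shot and non-IID data, while preserving privacy.
Insight: Privacy protection and efficient learning can be combined; FedMeNF offers a workable solution, particularly for resource-constrained edge devices.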
Abstract: Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computations, which can be limited on resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client’s private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.