Table of Contents
- cs.CL [Total: 32]
- cs.CV [Total: 63]
- cs.GR [Total: 1]
- cs.AI [Total: 2]
- cs.RO [Total: 2]
- cs.IR [Total: 1]
- cs.HC [Total: 1]
- eess.AS [Total: 1]
- cs.DC [Total: 1]
- eess.IV [Total: 5]
- cs.LG [Total: 6]
cs.CL
[1] User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs
Sougata Saha,Monojit Choudhury
Main category: cs.CL
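TL;DR: The paper proposes user behavior prediction as a generic, robust, scalable, and low-cost way to measure the generalization ability of large language models, avoiding the limitations of knowledge-retrieval and reasoning tasks, and validates it on several models.
Details
Motivation: Measuring the generalization ability of large language models (LLMs) is difficult because of data contamination. Conventional tasks such as knowledge retrieval and reasoning are poorly suited to measuring generalization, because LLMs are not trained for specific tasks.
Contribution: A new framework that uses user behavior prediction to evaluate LLM generalization, presented as theoretically sound, scalable, and robust.
Method: The authors introduce a general framework built on user behavior prediction and test GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct on movie and music recommendation datasets.
Result: The experiments match the framework's predictions: GPT-4o outperforms the other models, but all models, especially Llama, leave much room for improvement.
Insight: User behavior prediction is a more general and practical evaluation strategy for measuring LLM generalization, and it sidesteps the data contamination problem.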
Abstract: Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework’s predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.
[2] Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion
Miloud Mihoubi,Meriem Zerkouk,Belkacem Chikhaoui
Main category: cs.CL
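TL;DR: The paper proposes an AI framework for predicting student dropout in distance learning that combines RAG, prompt engineering, and cross-modal fusion, improving both prediction accuracy and interpretability.
Details
Motivation: Student dropout in distance learning has far-reaching social and economic consequences. Classical machine learning models struggle to capture the emotional and contextual factors in student interactions, so a more capable approach is needed.
Contribution: An AI framework that combines RAG-enhanced sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion, improving the accuracy and interpretability of dropout prediction.
Method: 1. RAG over a domain knowledge base for sentiment analysis; 2. prompt engineering to identify indicators of academic stress; 3. cross-modal attention fusion to integrate multiple data types.
Result: On a dataset of 4,423 students, the framework reaches 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%.
Insight: Beyond better predictive performance, the framework generates interpretable intervention strategies, offering a scalable way to address dropout in education systems worldwide.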
Abstract: Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.
[3] LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review
Cheng Yuan,Xinkai Rui,Yongqi Fan,Yawei Fan,Boyang Zhong,Jiacheng Wang,Weiyan Zhang,Tong Ruan
Main category: cs.CL
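TL;DR: The paper presents LCDS, a system that uses logic control and a source mapping table to address LLM hallucination in automated discharge summary generation, with support for expert review and feedback.
Details
Motivation: LLMs hallucinate and produce inaccurate content when generating discharge summaries, and it is hard to attribute generated content to its sources in long-form electronic medical records.
Contribution: LCDS, which combines logical rules with a source mapping table to reduce hallucination, supports source attribution and expert review, and produces high-quality discharge summaries for fine-tuning LLMs.
Method: The system builds a source mapping table by computing textual similarity between electronic medical records and discharge summaries, applies logical rules to generate reliable content, and supports expert review and source attribution.
Result: LCDS generates more reliable discharge summaries, lets experts review and give feedback efficiently, and feeds the corrected summaries into incremental fine-tuning of LLMs.
Insight: Logic control and a source mapping table are effective against hallucination in medical text generation, and expert involvement improves the reliability of the generated content.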
Abstract: Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
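The source mapping table described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure LCDS uses, so `difflib.SequenceMatcher` is swapped in here as a stand-in; the function name `build_source_map` and the `threshold` parameter are hypothetical and are not taken from the LCDS repository.

```python
from difflib import SequenceMatcher


def build_source_map(emr_segments, summary_sentences, threshold=0.5):
    """Map each summary sentence index to EMR segment indices that resemble it.

    Assumption: character-level similarity from difflib stands in for whatever
    textual similarity the LCDS authors actually compute.
    """
    mapping = {}
    for i, sentence in enumerate(summary_sentences):
        scored = []
        for j, segment in enumerate(emr_segments):
            score = SequenceMatcher(None, sentence.lower(), segment.lower()).ratio()
            if score >= threshold:
                scored.append((score, j))
        # Most similar EMR segments first.
        mapping[i] = [j for _, j in sorted(scored, reverse=True)]
    return mapping


emr = ["Patient admitted with chest pain.", "Discharged on aspirin 81 mg daily."]
summary = ["Patient was admitted with chest pain."]
source_map = build_source_map(emr, summary)
```

A table of this shape is what would let a reviewer jump from a summary sentence back to candidate source passages; sentences that map to no segment are natural candidates for expert scrutiny.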
[4] MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents
Ming Gong,Xucheng Huang,Chenghan Yang,Xianhan Peng,Haoxin Wang,Yang Liu,Ling Jiang
Main category: cs.CL
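TL;DR: MindFlow is a multimodal LLM agent for e-commerce customer service that integrates memory, decision-making, and action modules, improving complex query handling, user satisfaction, and operational cost.
Details
Motivation: Existing LLMs remain limited in complex, multimodal e-commerce scenarios, so a stronger solution is needed to improve customer service efficiency and user experience.
Contribution: MindFlow, described as the first open-source multimodal LLM agent tailored for e-commerce, with a modular design that integrates several functional modules.
Method: Built on the CoALA framework, MindFlow adopts an "MLLM-as-Tool" strategy for visual-textual reasoning and is evaluated through online A/B testing and simulation-based ablation.
Result: In real-world deployment, MindFlow shows substantial gains, with a 93.53% relative improvement.
Insight: Modularity and multimodal reasoning are key to improving LLM performance in complex scenarios.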
Abstract: Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular “MLLM-as-Tool” strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.
[5] LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks
William Fleshman,Benjamin Van Durme
Main category: cs.CL
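TL;DR: The paper proposes LoRA-Augmented Generation (LAG), a method for efficiently selecting and combining task-specific LoRA adapters without additional training or data, and shows that it outperforms existing data-free methods on knowledge-intensive tasks.
Details
Motivation: As fine-tuned language model experts for specific tasks and domains proliferate, efficient methods for selecting and combining them are needed. LAG is proposed as one such method.
Contribution: LAG, which efficiently filters, retrieves, and applies task-specific LoRA adapters without additional training or access to data.
Method: LAG draws on large libraries of knowledge and task-specific LoRA adapters, selecting and combining experts on a per-token and per-layer basis.
Result: LAG outperforms existing data-free methods on knowledge-intensive tasks and is compatible with alternative solutions such as RAG.
Insight: LAG offers a flexible and efficient way to combine multiple task-specific experts, broadening the range of applications for language models.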
Abstract: The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG’s compatibility with alternative solutions such as retrieval-augmented generation (RAG).
[6] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Riccardo Alberghi,Elizaveta Demyanenko,Luca Biggio,Luca Saglietti
Main category: cs.CL
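TL;DR: Using shortest-path tasks, the paper studies why next-token predictors are systematically biased toward inefficient reasoning traces. It finds that training on longer but coherent traces improves generalization more than training on optimal traces.
Details
Motivation: LLMs reason well, but extra test-time compute tends to introduce redundancy, while compute helps most when reasoning is systematic and incremental. The paper uses controlled shortest-path experiments to study how redundant reasoning traces affect generalization.
Contribution: The finding that training on longer but coherent reasoning traces improves generalization more than training on optimal traces, and that this effect correlates with the model's confidence in next-token prediction.
Method: Controlled shortest-path experiments compare models trained on optimal traces with models trained on inefficient traces, and examine the roles of trace length, redundancy, and coherence.
Result: Models trained on inefficient but coherent traces generalize better than models trained on optimal traces, whereas arbitrary redundancy alone does not help and can hurt performance.
Insight: Coherence and local incrementality of the reasoning trace matter more for the training signal than optimality of the trace, which suggests new ways to improve reasoning in language models.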
Abstract: Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
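For readers unfamiliar with the task, the "optimal bottom-up dynamic programming" baseline can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' trace format: the graph encoding `weights[l][i][j]` and the function name `shortest_path_layered` are hypothetical.

```python
def shortest_path_layered(weights, source=0):
    """Bottom-up DP for the shortest path through a layered graph.

    weights[l][i][j] is the cost of the edge from node i in layer l to
    node j in layer l + 1 (an assumed encoding, for illustration only).
    Returns (total cost, list of node indices, one per layer).
    """
    n_last = len(weights[-1][0])
    best = [0] * n_last  # cost-to-go from any node in the final layer is zero
    choices = []
    # Sweep from the last transition back to the first.
    for layer in reversed(weights):
        new_best, new_choice = [], []
        for row in layer:
            costs = [w + best[j] for j, w in enumerate(row)]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min])
            new_choice.append(j_min)
        best = new_best
        choices.append(new_choice)
    choices.reverse()
    # Follow the stored argmin choices forward from the source node.
    path = [source]
    for layer_choice in choices:
        path.append(layer_choice[path[-1]])
    return best[source], path


# Three layers with 2, 2, and 1 nodes.
weights = [[[1, 4], [2, 1]], [[5], [1]]]
cost, path = shortest_path_layered(weights)
```

The DP visits every edge exactly once and never revisits a decision, which is what makes its trace "optimal"; the longer traces in the paper instead include backtracking over partial paths.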
[7] EduCoder: An Open-Source Annotation System for Education Transcript Data
Guanzhong Pan,Mei Tan,Hyunji Nam,Lucía Langlois,James Malamut,Liliana Deonizio,Dorottya Demszky
Main category: cs.CL
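TL;DR: EduCoder is an open-source tool built specifically for annotating educational dialogue data. It addresses the complexity of coding education transcripts by supporting collaborative definition of complex codebooks and multiple annotation types.
Details
Motivation: General-purpose text annotation tools do not meet the needs of educational dialogue data, such as defining codebooks for complex pedagogical features, supporting multiple annotation types, and integrating contextual information.
Contribution: EduCoder, an open-source tool for fine-grained annotation of educational dialogue, with collaborative codebook definition, multiple annotation types, and annotation calibration.
Method: The platform lets researchers and domain experts collaboratively define codebooks from observed data, supports both categorical and open-ended annotation, and provides contextual materials and side-by-side annotation comparison.
Result: EduCoder is a usable open-source system that supports reliable annotation of educational dialogue and improves data quality through annotation calibration.
Insight: Dialogue annotation in education needs dedicated tooling for describing complex features and for collaborative coding; EduCoder fills this gap.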
Abstract: We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.
[8] The Generalization Ridge: Information Flow in Natural Language Generation
Ruidi Chang,Chunyuan Deng,Hanjie Chen
Main category: cs.CL
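TL;DR: The paper proposes InfoRidge, a framework for studying how task-relevant information flows across the layers of Transformer models. Predictive information peaks in the upper-middle layers, forming a "generalization ridge", which points to the key role of intermediate layers in generalization.
Details
Motivation: Transformer-based language models perform well on natural language generation, but how they synthesize information internally remains unclear, especially how task-relevant information flows across layers.
Contribution: InfoRidge, an information-theoretic framework that traces how predictive information is distributed across layers, reveals the key role of intermediate layers in generalization, and introduces residual scaling coefficients as functional probes.
Method: The authors estimate the mutual information between hidden representations and target outputs (predictive information) and analyze the importance of each layer through experiments and residual scaling coefficients.
Result: Predictive information peaks in the upper-middle layers (the "generalization ridge"), and under distribution shift models rely more heavily on these layers, highlighting their role in generalization.
Insight: Intermediate Transformer layers play a key role in generalization, while the final layers lean toward memorization, which offers a new perspective for model design.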
Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients (trainable scalar parameters applied to each residual block), which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
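One way to read the residual scaling coefficients is as a scalar gate on each block's contribution to the residual stream. The sketch below shows only the forward pass and assumes the scalar multiplies the block output, i.e. h ← h + α·f(h); the abstract does not state where the scalar is applied, and the names `forward`, `blocks`, and `alphas` are hypothetical. In the paper the coefficients are trainable; here they are fixed numbers for illustration.

```python
def forward(h, blocks, alphas):
    """Run a toy residual stack where block l contributes alphas[l] * block(h).

    Assumption: the scalar scales the block output, not the skip connection.
    An alpha near zero means the layer is effectively switched off.
    """
    for block, alpha in zip(blocks, alphas):
        update = block(h)
        h = [x + alpha * dx for x, dx in zip(h, update)]
    return h


# Two toy "layers" acting on a 2-dimensional hidden state.
blocks = [
    lambda h: [1.0 for _ in h],      # adds a constant
    lambda h: [2.0 * x for x in h],  # scales the current state
]
h_off = forward([0.0, 1.0], blocks, alphas=[1.0, 0.0])   # final layer gated off
h_half = forward([0.0, 1.0], blocks, alphas=[1.0, 0.5])  # final layer half weight
```

Read this way, a learned coefficient that shrinks for the final layers under distribution shift is direct evidence that the model leans less on those layers.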
[9] Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning
Jaedong Hwang,Kumar Tanmay,Seok-Jin Lee,Ayush Agrawal,Hamid Palangi,Kumar Ayush,Ila Fiete,Paul Pu Liang
Main category: cs.CL
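TL;DR: The paper introduces the GeoFact-X benchmark and the BRIDGE training method to address language bias in multilingual reasoning, using a language-consistency reward to improve reasoning.
Details
Motivation: Current multilingual LLMs reason poorly in low-resource languages and tend to default to high-resource languages such as English, which undermines factual accuracy and trust.
Contribution: 1. GeoFact-X, a multilingual geography reasoning benchmark with annotated reasoning traces in five languages; 2. BRIDGE, a training method that optimizes reasoning with a language-consistency reward; 3. an automatic LLM-based evaluation protocol.
Method: 1. Build the GeoFact-X benchmark covering five languages; 2. design BRIDGE, which guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward; 3. use an LLM as a judge to assess reasoning quality.
Result: BRIDGE significantly improves multilingual reasoning fidelity, showing that reasoning-aware multilingual reinforcement learning is crucial for cross-lingual generalization.
Insight: A language-consistency reward is key to improving multilingual reasoning, and the automatic evaluation protocol enables finer-grained analysis of multilingual tasks.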
Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE
[10] “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models
Yufei Tao,Adam Hiatt,Rahul Seetharaman,Ameeta Agrawal
Main category: cs.CL
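TL;DR: The paper proposes CoPE, a framework for evaluating how large language models (LLMs) integrate contextual and parametric knowledge. It identifies a "lost-in-the-later" phenomenon, in which models overlook or deprioritize information that appears later in the context. Experiments show that chain-of-thought (CoT) prompting reduces context use, and the authors propose prompt-based remedies.
Details
Motivation: LLMs can combine contextual and parametric knowledge, but how they integrate the two is unclear. The authors aim to quantify the use of both knowledge sources with a systematic evaluation framework and to expose positional biases in how models process context.
Contribution: 1. CoPE, a framework that systematically measures the use of contextual and parametric knowledge.
2. The "lost-in-the-later" phenomenon, which reveals a positional bias in how LLMs process context.
3. The finding that CoT prompting degrades context use, together with methods to improve it.
Method: 1. Build the MultiWikiAtomic dataset (English, Spanish, Danish) for open-ended question answering.
2. Use the CoPE framework to analyze how models integrate contextual and parametric knowledge.
3. Improve context use through prompt engineering.
Result: 1. LLMs exhibit the "lost-in-the-later" effect, overlooking information later in the context.
2. CoT prompting leads to lower recall and shorter responses, degrading contextual grounding.
3. Context-informed prompting improves factual grounding and reduces hallucination in a summarization case study.
Insight: 1. Context use is positionally biased; models need further work to treat information evenly across positions.
2. CoT prompting is not always beneficial and should be adapted to the task.
3. Prompt engineering is an effective way to improve model behavior.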
Abstract: Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
[11] Gendered Divides in Online Discussions about Reproductive Rights
Ashwin Rao,Sze Yuh Nina Wang,Kristina Lerman
Main category: cs.CL
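TL;DR: The paper studies gender differences in discussions of reproductive rights on X (formerly Twitter) after the U.S. Supreme Court's 2022 Dobbs ruling, showing how gender and local political context shape public discourse.
Details
Motivation: The authors set out to examine how gender and local political context jointly shape attitudes and emotional expression about abortion, addressing a gap in research on how gender and place interact.
Contribution: The finding that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions and independently of ideology, revealing the structural role of gender in abortion discourse.
Method: The authors analyze nearly 10 million posts on X from users with inferred gender, ideology, and location, using quantitative methods to study the interaction of gender, region, and ideology.
Result: The gender gap in abortion attitudes is more pronounced in conservative regions, and the leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing women who support abortion rights.
Insight: Abortion discourse is not only ideologically polarized but also deeply shaped by gender and place, underscoring the central role of identity in political expression during moments of institutional disruption.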
Abstract: The U.S. Supreme Court’s 2022 ruling in Dobbs v. Jackson Women’s Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.
[12] Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS
Alex ZH Dou,Zhongwei Wan,Dongfei Cui,Xin Wang,Jing Xiong,Haokun Lin,Chaofan Tao,Shen Yan,Mi Zhang
Main category: cs.CL
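TL;DR: The paper proposes R2-LLMs, a hierarchical retrieval-augmented reasoning framework that combines coarse- and fine-grained retrieval-based in-context learning with Monte Carlo Tree Search and a process reward model to improve LLM performance at inference time.
Details
Motivation: Test-time scaling uses additional compute at inference time to improve language model performance. Existing methods, however, usually need chain-of-thought (CoT) training data distilled from more advanced models. This work aims to remove that requirement.
Contribution: The R2-LLMs framework, comprising: 1) hierarchical retrieval-augmented in-context learning at coarse and fine levels; 2) integration with Monte Carlo Tree Search (MCTS) and a process reward model (PRM); 3) relative improvements of up to 16% on several complex reasoning tasks.
Method: The method works at two levels. At the coarse level, it retrieves similar problem-answer pairs for high-level in-context learning. At the fine level, it retrieves intermediate solution steps during MCTS and uses PRM scores to refine step-wise reasoning.
Result: On MATH500, GSM8K, and OlympiadBench-TO, the method improves over the baselines by up to 16% in relative terms with LLaMA-3.1-8B.
Insight: The framework shows the potential of combining hierarchical retrieval with tree search for reasoning tasks, and offers a way to improve reasoning without distilled training data.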
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.
[13] Self-Review Framework for Enhancing Instruction Following Capability of LLM
Sihyun Park
Main category: cs.CL
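TL;DR: Re5 is a self-evaluation and revision framework that improves the instruction-following ability of LLMs while preserving the quality of generated content. Through task and constraint extraction, structural evaluation, and selective revision, Re5 achieves results comparable to a high-performance model with little data and limited external supervision.
Details
Motivation: Iterative revision methods improve instruction following, but their cost rises sharply with the number of data points and revision rounds. Open-source LLMs have limited self-evaluation ability, and excessive revision degrades output quality. An efficient, quality-preserving method is therefore needed.
Contribution: 1. The Re5 framework, which improves instruction following through self-evaluation and selective revision. 2. Structural evaluation and fine-grained constraint checks that prevent error accumulation and quality degradation. 3. Experiments showing that Re5 performs well with little data and outperforms the non-revised initial responses.
Method: Re5 extracts task and constraint components from user instructions, performs structural evaluation and fine-grained content evaluation, and revises selectively to improve quality and instruction adherence. The resulting high-quality data is used for alignment tuning, enabling long-term improvement.
Result: With a small amount of data, Re5 reaches instruction-following performance comparable to models trained on data generated by the high-performance model GPT-4o-mini, and achieves a 64.24% win rate over the non-revised responses.
Insight: Combining self-evaluation, structural analysis, and selective revision can markedly improve instruction following with little data and at low cost while avoiding quality degradation, offering a new route to efficient LLM optimization.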
Abstract: Various techniques have been proposed to improve large language models' (LLMs') adherence to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data, while maintaining response quality with a 64.24% win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.
[14] Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching
Mingzhe Li,Jing Xiang,Qishen Zhang,Kaiyang Wan,Xiuying Chen
Main category: cs.CL
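TL;DR: The paper proposes flipped knowledge distillation, in which a small language model (SLM) transfers knowledge to a large language model (LLM), using the SLM's expertise in text matching to improve the LLM.
Details
Motivation: Knowledge distillation usually transfers knowledge from an LLM to an SLM, but in text matching, fine-tuned small models are often better at domain-specific representation learning. Flipped knowledge distillation is proposed to combine the strengths of both.
Contribution: 1. A flipped knowledge distillation paradigm in which the LLM learns from the SLM; 2. a reinterpretation of the LLM as an encoder-decoder architecture using LoRA; 3. Margin-aware Contrastive Learning (MCL) for aligning similarity scores.
Method: 1. Use LoRA to reinterpret the LLM as an encoder-decoder architecture; 2. have the encoder produce compressed representations and their similarity scores; 3. use MCL to align the LLM's similarity scores with those of the SLM.
Result: Effectiveness is confirmed on financial and healthcare benchmarks and in real-world applications, and the model has been deployed in an online environment.
Insight: Small models can outperform large models on specific tasks; flipped knowledge distillation exploits this and offers a new way to improve LLM performance.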
Abstract: Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.
[15] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
Haoxin Wang,Xianhan Peng,Xucheng Huang,Yizhe Huang,Ming Gong,Chenghan Yang,Yang Liu,Ling Jiang
Main category: cs.CL
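TL;DR: ECom-Bench is presented as the first benchmark framework for evaluating multimodal LLM agents in e-commerce customer support. It is built on real user dialogues and dynamic user simulation, its tasks cover complex scenarios, and even GPT-4o performs poorly on it.
Details
Motivation: There is no benchmark for evaluating LLM agents in e-commerce customer support. Real scenarios are highly complex, and multimodal capabilities need systematic testing.
Contribution: The ECom-Bench benchmark, with dynamic user simulation and a realistic task dataset covering a wide range of business scenarios; the code and data will be released to support further research.
Method: The authors build a dataset from real user dialogues, design tasks with dynamic user simulation, and evaluate LLM agents in complex e-commerce scenarios.
Result: Advanced models such as GPT-4o achieve only 10-20% on the pass^3 metric, underlining how challenging e-commerce scenarios are.
Insight: E-commerce customer support places high demands on the multimodal abilities of LLM agents and on their handling of complex scenarios; current performance leaves considerable room for improvement.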
Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
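The abstract does not define pass^3. A plausible reading, assumed here rather than confirmed by the source, is the pass^k convention from agent benchmarks such as τ-bench: the probability that an agent solves a task in all k of k independent trials, estimated from n trials with c successes as C(c, k) / C(n, k) and averaged over tasks. The function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k independent trials ALL succeed.

    Assumption: this follows the pass^k estimator used in tau-bench;
    ECom-Bench may define its metric differently.
    successes_per_task[t] is the number of successful trials (out of
    n_trials) observed for task t.
    """
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Two tasks, three trials each: one always solved, one never solved.
score = pass_hat_k([3, 0], n_trials=3, k=3)
```

Under this reading, pass^3 rewards consistency rather than a single lucky run: a task solved in two of three trials contributes nothing, which helps explain why scores can be much lower than single-attempt success rates.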
[16] Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs
SeungWon Ji,Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
Main category: cs.CL
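TL;DR: The paper proposes Smoothie-Qwen, a lightweight post-hoc method that reduces language bias in multilingual LLMs without retraining. It selectively adjusts token-level output probabilities to suppress generation in undesired languages.
Details
Motivation: Multilingual LLMs often exhibit language confusion, responding in a dominant language regardless of the language of the prompt. This limits their reliability and controllability in global applications.
Contribution: Smoothie-Qwen, a post-hoc technique that needs no retraining and reduces unintended Chinese output by over 95% while preserving task accuracy.
Method: The method selectively adjusts token-level output probabilities to suppress undesired language generation while preserving task performance.
Result: Applied to the Qwen model, Smoothie-Qwen reduces unintended Chinese output by over 95% and preserves accuracy on multilingual benchmarks.
Insight: The method is an efficient and practical way to improve the language controllability of multilingual LLMs, making them better suited to global applications.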
Abstract: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt’s language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.
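The idea of adjusting token-level output probabilities can be illustrated with a minimal sketch. This is not the authors' implementation: it simply down-weights a set of flagged token ids in a next-token distribution and renormalizes. The function name `smooth_probs`, the `suppressed_ids` set, and the `factor` value are all hypothetical.

```python
def smooth_probs(probs, suppressed_ids, factor=0.05):
    """Down-weight flagged tokens in a next-token distribution and renormalize.

    probs: list of probabilities over the vocabulary (sums to 1).
    suppressed_ids: token ids assumed to belong to the undesired language.
    factor: multiplier in [0, 1]; 0 removes the tokens, 1 leaves them unchanged.
    """
    scaled = [p * factor if i in suppressed_ids else p for i, p in enumerate(probs)]
    total = sum(scaled)
    return [p / total for p in scaled]


# Toy vocabulary of two tokens; token 1 is flagged as undesired.
removed = smooth_probs([0.5, 0.5], suppressed_ids={1}, factor=0.0)
halved = smooth_probs([0.5, 0.5], suppressed_ids={1}, factor=0.5)
```

Because the adjustment touches only the output distribution, it needs no gradient updates, which is consistent with the paper's claim that the method works without retraining.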
[17] Agentic-R1: Distilled Dual-Strategy Reasoning
Weihua Du,Pranjal Aggarwal,Sean Welleck,Yiming Yang
Main category: cs.CL
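TL;DR: The DualDistill framework distills complementary reasoning strategies from multiple teacher models into a student model, Agentic-R1, which dynamically chooses between tool execution and text-based reasoning to improve task accuracy.
Details
Motivation: Current models excel at mathematical reasoning but rely on slow, error-prone natural language traces, while tool-augmented agents often falter on complex logical tasks.
Contribution: DualDistill, a framework that distills the strategies of multiple teachers into a unified student model capable of dynamic strategy selection.
Method: Agentic-R1 is trained by distilling complementary strategies from multiple teacher models and dynamically selects tool execution or text-based reasoning for each query.
Result: Accuracy improves across a range of tasks, including both computation-intensive tasks and standard benchmarks.
Insight: Multi-strategy distillation yields efficient and robust reasoning, and dynamic strategy selection outperforms single-strategy models.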
Abstract: Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill
[18] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
YiHan Jiao,ZheHao Tan,Dan Yang,DuoLin Sun,Jie Feng,Jian Wang,Peng Wei
Main category: cs.CL
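TL;DR: HIRAG is a new instruction fine-tuning method for retrieval-augmented generation (RAG). By introducing multi-level progressive chain-of-thought and hierarchical abilities, it improves the model's handling of real-time information and domain-specific problems.
Details
Motivation: Traditional RAG systems rely on the in-context learning ability of the LLM itself, and the specific abilities a RAG generation model needs have not been studied in depth, which leaves problems with inconsistent document quality and imperfect retrieval.
Contribution: HIRAG, which argues that RAG models should have three hierarchical abilities (filtering, combination, and RAG-specific reasoning) and improves performance through multi-level progressive chain-of-thought.
Method: A hierarchical-thought instruction-tuning strategy ("think before answering") that uses multi-level progressive chain-of-thought to strengthen the model's open-book examination ability.
Result: HIRAG significantly improves model performance on datasets including RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
Insight: Hierarchical ability design combined with progressive chain-of-thought handles information processing and reasoning in RAG tasks more effectively.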
Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on the RAG task or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
[19] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly
Main category: cs.CL
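TL;DR: The paper proposes the Omni-Router Transformer, which shares routing decisions across layers in a sparse mixture-of-experts (MoE) architecture to improve cooperation and specialization among experts, yielding better automatic speech recognition performance.
Details
Motivation: In conventional MoE architectures each layer routes independently, so experts in different layers do not cooperate, which limits performance. The authors share routing decisions to strengthen cross-layer cooperation and improve robustness and performance.
Contribution: Proposes the Omni-Router Transformer, which shares routing decisions across layers to achieve better expert cooperation and specialization and significantly lower word error rates.
Method: Replaces the independent per-layer routers of conventional MoE with a shared routing mechanism and validates it experimentally in a sparse MoE architecture.
Result: On a large-scale pseudo-labeled dataset and 10 diverse ASR benchmarks, the Omni-Router Transformer reaches lower training loss and reduces average word error rate by 11.2% compared with dense models and by 8.2% compared with the Switch Transformer.
Insight: Sharing routing decisions strengthens cooperation among experts and improves robustness and generalization, suggesting a new direction for MoE architecture design.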
Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
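The difference between per-layer routing and a shared router can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' model: the linear top-1 router, the dimensions, and the random weights are all hypothetical, and the hidden state is held fixed across layers for clarity. In a real Transformer the hidden state changes from layer to layer, so a shared router would correlate expert choices rather than make them identical.

```python
import random

def make_router(num_experts, dim, rng):
    """Random linear router: one weight row per expert (illustrative values)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]

def route(router, hidden):
    """Top-1 routing: pick the expert whose weight row scores highest."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
dim, num_experts, num_layers = 8, 4, 3
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Switch-style: each MoE layer owns its own router.
independent_routers = [make_router(num_experts, dim, rng) for _ in range(num_layers)]
independent_choices = [route(r, hidden) for r in independent_routers]

# Shared-router style: one router object is reused by every MoE layer.
shared_router = make_router(num_experts, dim, rng)
shared_choices = [route(shared_router, hidden) for _ in range(num_layers)]

print("independent:", independent_choices)
print("shared:     ", shared_choices)
```

With independent routers nothing ties the expert chosen in one layer to the expert chosen in the next, which is the weak correlation the abstract reports.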
[20] GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge
Yujia Hu,Tuan-Phong Nguyen,Shrestha Ghosh,Moritz Müller,Simon Razniewski
Main category: cs.CL
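TL;DR: GPTKB v1.5 is a densely interlinked knowledge base of 100 million triples built from GPT-4.1 for systematically analyzing and exploring the factual knowledge of LLMs. It supports link traversal, SPARQL queries, and comparative study of the strengths and weaknesses of LLM knowledge.
Details
Motivation: The factual knowledge of language models is still poorly understood and is hard to access through ad-hoc browsing or scalable statistical analysis. GPTKB v1.5 aims to fill this gap.
Contribution: Presents GPTKB v1.5, a 100-million-triple knowledge base built from GPT-4.1 that supports systematic exploration and analysis of LLM knowledge.
Method: Builds the knowledge base with massive-recursive LLM knowledge materialization and exposes it through link traversal and SPARQL queries.
Result: Produces a densely interlinked knowledge base that serves as a practical tool for studying LLM knowledge, demonstrated through three use cases.
Insight: Massive-recursive LLM knowledge materialization is a significant opportunity both for the systematic analysis of LLM knowledge and for automated knowledge base construction.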
Abstract: Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at https://gptkb.org.
[21] DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities
Jing Yang Lee,Hamed Bonab,Nasser Zalmout,Ming Zeng,Sanket Lokegaonkar,Colin Lockard,Binxuan Huang,Ritesh Sarkhel,Haodong Wang
Main category: cs.CL
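TL;DR: The paper proposes DocTalk, a method that converts text documents into multi-turn dialogue data to strengthen the conversational abilities of LLMs. Experiments show that pre-training with DocTalk markedly improves multi-turn dialogue performance.
Details
Motivation: LLM pre-training data consists mostly of continuous prose, whereas conversational tasks require multi-turn interaction, which creates a mismatch between training data and task demands.
Contribution: Proposes DocTalk, a graph-based dialogue synthesis method that generates multi-turn dialogue data from text documents, and builds a large-scale pre-training corpus of over 730k conversations.
Method: Uses document clustering and a graph-based approach to turn related documents into multi-turn, multi-topic information-seeking dialogues.
Result: Incorporating DocTalk during pre-training yields gains of up to 40% on multi-turn capabilities such as context memory and understanding.
Insight: Structured synthesis of dialogue data can close the gap between pre-training data and the demands of conversational tasks and improve the conversational abilities of LLMs.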
Abstract: Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.
[22] Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports
Amane Watahiki,Tomoki Doi,Taiga Shinozaki,Satoshi Nishida,Takuya Niikawa,Katsunori Miyahara,Hitomi Yanaka
Main category: cs.CL
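TL;DR: The paper builds a benchmark grounded in Basic Formal Ontology to systematically evaluate how well large vision-language models (LVLMs) understand descriptions of amodal completion, and finds that some models perform poorly on certain object categories under Japanese prompting.
Details
Motivation: The study aims to fill the gap in knowledge about LVLMs' ability to understand and reason over text about amodal completion, and to examine their cross-lingual behavior.
Contribution: 1) Proposes the first benchmark that systematically classifies amodal completion; 2) reveals performance differences of LVLMs across languages.
Method: Builds a classification benchmark on Basic Formal Ontology, tests multiple LVLMs on amodal completion tasks, and further probes them with Japanese prompts.
Result: Some LVLMs (such as LLaVA-NeXT variants and Claude 3.5 Sonnet) show lower accuracy for certain object categories under Japanese prompting, in some cases scoring lower on original images than on blank stimuli with no visual content.
Insight: Some LVLMs have weaknesses in Japanese comprehension, and linguistic competence may affect their performance on multimodal tasks.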
Abstract: One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
[23] Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors
Bing Wang,Ximing Li,Mengzhe Ye,Changchun Li,Bo Fu,Jianfeng Qu,Lin Yuanbo Wu
Main category: cs.CL
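TL;DR: The paper proposes DAEDCMD, a method for continual multimodal misinformation detection (MMD) that isolates event-specific parameters and learns a continuous-time dynamics model, addressing forgetting of past knowledge and weak generalization to future data.
Details
Motivation: New events keep emerging in the real world, which degrades MMD models trained offline, and existing methods cannot cope with knowledge forgetting and environmental change in continual learning.
Contribution: Proposes DAEDCMD, which isolates interference with a Dirichlet process-based mixture-of-experts structure and predicts future environmental distributions with a continuous-time dynamics model, significantly improving continual MMD.
Method: Combines a Dirichlet process-based mixture-of-experts structure (to isolate event-specific parameters) with a learned continuous-time dynamics model (to predict future distributions), retaining past knowledge while improving generalization to future data.
Result: In experiments, DAEDCMD significantly outperforms six MMD baselines and three continual learning methods.
Insight: Dynamically isolating event parameters and predicting environmental change mitigates knowledge forgetting and poor future generalization in continual learning, and may inform other continual learning tasks.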
Abstract: Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.
[24] DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
Nicholas Popovič,Ashish Kangen,Tim Schopf,Michael Färber
Main category: cs.CL
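TL;DR: The paper presents a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction, removing the need for manual annotation, and evaluates it in a zero-shot setting.
Details
Motivation: High-quality annotated data is scarce for document-level entity and relation extraction in zero-shot or few-shot settings, and existing approaches depend on manually annotated demonstrations or direct zero-shot inference, which limits scalability and performance.
Contribution: 1. Proposes a fully automatic pipeline for synthetic data generation and retrieval-based in-context learning; 2. builds a synthetic dataset of over 5k abstracts with about 59k entities and 30k relation triples; 3. evaluates zero-shot performance on the DocIE shared task.
Method: Combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model to build a demonstration database without manual annotation and to retrieve relevant examples dynamically at inference time.
Result: In-context joint entity and relation extraction at document level remains challenging even for state-of-the-art LLMs.
Insight: Fully automatic synthetic data generation is a viable direction for zero-shot and few-shot information extraction, but complex document-level tasks still require stronger models.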
Abstract: Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach, we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
[25] Conditional Multi-Stage Failure Recovery for Embodied Agents
Youmna Farag,Svetlana Stoyanchev,Mohan Li,Simon Keizer,Rama Doddipatla
Main category: cs.CL
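TL;DR: The paper proposes a conditional multi-stage failure recovery framework based on zero-shot chain prompting to make embodied agents more robust when executing complex tasks. The framework has four error-handling stages and uses LLM reasoning to analyze execution challenges in their environmental context and propose solutions. It achieves state-of-the-art results on the TfD benchmark.
Details
Motivation: Embodied agents are prone to failure when executing complex tasks and need effective error recovery mechanisms.
Contribution: Proposes a four-stage conditional failure recovery framework that combines zero-shot chain prompting with LLM reasoning and significantly raises task success rates.
Method: Uses four error-handling stages (three during task execution and one post-execution reflection stage) and relies on zero-shot chain prompting of an LLM to analyze the environmental context and devise solutions.
Result: On the TfD benchmark of the TEACH dataset, the method outperforms a baseline without error recovery by 11.5% and surpasses the strongest existing model by 19%.
Insight: Combining multi-stage error handling with LLM reasoning markedly improves the task robustness of embodied agents, and the flexibility of zero-shot prompting makes this practical.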
Abstract: Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.
[26] Coding Triangle: How Does Large Language Model Understand Code?
Taolin Zhang,Zihan Ma,Maosong Cao,Junnan Liu,Songyang Zhang,Kai Chen
Main category: cs.CL
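TL;DR: The paper proposes the Code Triangle framework, which systematically evaluates the programming ability of LLMs along three dimensions (editorial analysis, code implementation, and test case generation), reveals shortfalls in diversity and robustness, and shows that incorporating human-generated content and model mixtures improves performance.
Details
Motivation: Although LLMs have made remarkable progress in code generation, their true programming competence remains underexplored. The paper aims to systematically evaluate how LLMs perform on programming tasks and how far they are from human experts.
Contribution: 1. Proposes the Code Triangle evaluation framework; 2. reveals the shortfalls of LLMs in diversity and robustness; 3. proposes improvements that combine human-generated content with model mixtures; 4. identifies both consistency and inconsistency in LLM cognition.
Method: Uses the Code Triangle framework to evaluate LLMs on editorial analysis, code implementation, and test case generation, and improves them by incorporating human-generated content and model mixtures.
Result: LLMs can form a self-consistent system across the three dimensions, but their solutions are less diverse and less robust than those of human programmers. The proposed improvements substantially raise performance and robustness.
Insight: There is a significant distribution shift between model cognition and human expertise, and model errors stem largely from training data biases and limited reasoning transfer. The study points to directions for building stronger coding models.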
Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
[27] Skywork-R1V3 Technical Report
Wei Shen,Jiangbo Pei,Yi Peng,Xuchen Song,Yang Liu,Jian Peng,Haofeng Sun,Yunzhuo Hao,Peiyu Wang,Yahui Zhou
Main category: cs.CL
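TL;DR: Skywork-R1V3 is an advanced open-source vision-language model (VLM) that transfers reasoning skills from text-only LLMs to visual tasks. Its key innovation is a post-training reinforcement learning (RL) framework that activates the model's reasoning ability without additional continued pre-training. The model reaches 76.0% on the MMMU benchmark, matching entry-level human capability.
Details
Motivation: Conventional vision-language models perform poorly on cross-modal reasoning tasks. The paper aims to combine the text reasoning ability of LLMs with an RL framework to improve reasoning on visual tasks and advance open-source VLMs.
Contribution: 1. Proposes an RL post-training framework that activates reasoning ability without continued pre-training; 2. reveals the central role of the connector module in cross-modal alignment; 3. proposes a quantitative indicator of reasoning capability (the entropy of critical reasoning tokens) for checkpoint selection during RL training; 4. achieves state-of-the-art results on MMMU and transfers mathematical reasoning to other subjects.
Method: 1. Post-trains the model with an RL framework; 2. designs a connector module for cross-modal alignment; 3. introduces the entropy of critical reasoning tokens as a checkpoint selection indicator during RL training.
Result: 1. Skywork-R1V3 improves from 64.3% to 76.0% on MMMU, matching entry-level human capability; 2. the 38B-parameter model rivals closed-source VLMs; 3. mathematical reasoning transfers to other subjects.
Insight: 1. Reinforcement learning is a powerful tool for improving open-source VLMs; 2. cross-modal alignment is essential for multimodal reasoning; 3. the entropy of critical reasoning tokens is an effective quantitative indicator of reasoning capability.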
Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
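The entropy indicator mentioned in the abstract above can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the abstract does not say how critical reasoning tokens are identified or how the entropy is aggregated, so the `critical_positions` list, the averaging, and the toy distributions below are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions, critical_positions):
    """Average entropy over the positions flagged as critical (assumed aggregation)."""
    values = [token_entropy(distributions[i]) for i in critical_positions]
    return sum(values) / len(values)

# Toy next-token distributions over a 4-token vocabulary, one per position.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximum entropy
    [0.70, 0.10, 0.10, 0.10],  # moderately confident prediction
]
critical_positions = [1, 2]  # hypothetical choice of "critical" positions

score = mean_entropy(distributions, critical_positions)
print(f"mean entropy at critical positions: {score:.3f} nats")
```

A score of this kind could then be compared across saved checkpoints; which direction counts as better is not stated in the abstract, so the sketch leaves that open.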
[28] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
Zhongyuan Peng,Yifan Yao,Kaijing Ma,Shuyue Guo,Yizhe Li,Yichi Zhang,Chenchen Zhang,Yifan Zhang,Zhouliang Yu,Luming Li,Minghao Liu,Yihang Xia,Jiawei Shen,Yuchen Wu,Yixin Cao,Zhaoxiang Zhang,Wenhao Huang,Jiaheng Liu,Ge Zhang
Main category: cs.CL
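TL;DR: CriticLean proposes a critic-guided reinforcement learning framework that turns the critic from a passive validator into an active learning component, significantly improving the semantic fidelity of mathematical formalization. The approach outperforms other baselines on the benchmark.
Details
Motivation: Prior work focuses on generation and compilation success in mathematical formalization and neglects the critic phase, that is, checking whether a generated formalization truly captures the semantic intent of the original problem.
Contribution: 1. Proposes the CriticLean framework, which treats the critic as an active learning component; 2. develops the CriticLeanGPT models, trained with supervised fine-tuning and reinforcement learning; 3. builds the CriticLeanBench benchmark and the FineLeanCorpus dataset.
Method: Trains CriticLeanGPT with supervised fine-tuning and reinforcement learning, and uses CriticLeanBench to measure how well models distinguish semantically correct from incorrect formalizations.
Result: CriticLeanGPT significantly outperforms open- and closed-source baselines on the benchmark, and the authors construct FineLeanCorpus, a dataset of over 285K problems.
Insight: Optimizing the critic phase is essential for producing reliable formalizations and opens a new direction for formal mathematical reasoning.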
Abstract: Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase, that is, the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
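The gap between compilation success and semantic fidelity that the abstract describes can be seen in a small Lean 4 example. This is an illustrative sketch, not taken from CriticLeanBench or FineLeanCorpus: the informal statement and both theorem names are hypothetical, and it assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Informal statement: "If x + 2 = 5 for a natural number x, then x = 3."

-- Faithful formalization: the conclusion matches the informal claim.
theorem faithful (x : Nat) (h : x + 2 = 5) : x = 3 := by
  omega

-- This version also compiles and is provable, but it only states the
-- weaker claim x ≤ 3, so it does not capture the intent of the statement.
theorem unfaithful (x : Nat) (h : x + 2 = 5) : x ≤ 3 := by
  omega
```

Both theorems pass the compiler, so compilation alone cannot separate them; telling them apart requires comparing each formal statement against the informal one, which is the task the critic is trained for.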
[29] DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation
Maximilian Heil,Dionne Bang
Main category: cs.CL
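TL;DR: This paper describes a submission to Task 1 of CheckThat! 2025 that uses transfer learning and stylistic data augmentation to improve classification of subjective and objective sentences in news text. Transfer learning of specialized encoders outperforms fine-tuning general-purpose ones, and carefully designed augmentation significantly improves model robustness. The official submission ranked 16th.
Details
Motivation: Improve classification of subjective and objective sentences in news text and explore the potential of transfer learning and data augmentation for the task.
Contribution: Compares fine-tuning of pre-trained encoders with transfer learning, and proposes a controlled augmentation pipeline based on GPT-4o in which the same model corrects generated samples to keep labels and style consistent.
Method: Trains specialized encoders with transfer learning, and uses a GPT-4o augmentation pipeline to generate paraphrases in predefined styles, followed by model-based correction to improve sample quality.
Result: Transfer learning of specialized encoders outperforms fine-tuning general-purpose ones, and carefully designed augmentation significantly improves robustness in detecting subjective content. The official submission ranked 16th of 24 teams.
Insight: Combining encoder specialization with label-consistent augmentation is valuable for improving subjectivity detection.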
Abstract: This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders with transfer-learning from transformers already fine-tuned on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specialized encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us 16th of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.
[30] DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification
Maximilian Heil,Aleksandar Pramov
Main category: cs.CL
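TL;DR: The study evaluates context and tokenization strategies for numerical fact verification and finds that right-to-left (R2L) tokenization does not help natural language inference, that a longer context window does not improve performance either, and that evidence quality is the main bottleneck.
Details
Motivation: Numerical claims (such as quantities and comparisons) pose unique challenges for automated fact-checking systems, and the study explores modeling strategies to improve accuracy on them.
Contribution: 1. Evaluates the effect of context window length and R2L tokenization on numerical fact verification; 2. finds that evidence quality is the main performance bottleneck.
Method: Using the QuanTemp dataset and ModernBERT, studies the effect of (1) context window length, (2) R2L tokenization, and (3) the two combined.
Result: R2L tokenization does not help on the task, and a longer context window does not improve performance either. The best system reaches a macro-average F1 of 0.57, placing in the top 4 of Task 3 at CheckThat! 2025.
Insight: In numerical fact verification, evidence quality matters more than tokenization direction or context length.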
Abstract: Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) on numerical tasks. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
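The idea behind right-to-left (R2L) tokenization of numbers can be shown with a small example. This is a minimal sketch assuming digits are grouped in chunks of three; the chunk size and both function names are illustrative assumptions, and the abstract does not describe how the authors implement R2L tokenization on top of their tokenizer.

```python
def chunk_digits_l2r(digits: str, size: int = 3):
    """Group digits left to right, as a default tokenizer pass might."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_digits_r2l(digits: str, size: int = 3):
    """Group digits right to left so chunks align with thousands boundaries."""
    chunks = []
    end = len(digits)
    while end > 0:
        start = max(0, end - size)
        chunks.append(digits[start:end])
        end = start
    return chunks[::-1]

print(chunk_digits_l2r("1234567"))  # ['123', '456', '7']
print(chunk_digits_r2l("1234567"))  # ['1', '234', '567']
```

Right-to-left grouping keeps units, thousands, and millions in separate chunks, which is why it has been reported to help arithmetic; the abstract above finds that this benefit does not carry over to veracity prediction.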
[31] UQLM: A Python Package for Uncertainty Quantification in Large Language Models
Dylan Bouchard,Mohit Singh Chauhan,David Skarbrevik,Ho-Kyeong Ra,Viren Bajaj,Zeya Ahmad
Main category: cs.CL
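TL;DR: UQLM is a Python package that detects hallucinations in large language models (LLMs) with uncertainty quantification (UQ) techniques, producing confidence scores from 0 to 1 to improve the reliability of LLM outputs.
Details
Motivation: Hallucinated content (false or misleading information) generated by LLMs threatens the safety and trustworthiness of downstream applications, so efficient detection methods are needed.
Contribution: Develops the UQLM package, which integrates state-of-the-art UQ techniques into an off-the-shelf solution for LLM hallucination detection.
Method: Applies uncertainty quantification through a suite of UQ-based scorers that compute a confidence score for each response.
Result: UQLM provides confidence scoring that is easy to integrate and helps make LLM outputs more trustworthy.
Insight: Uncertainty quantification can be used to detect LLM hallucinations and offers a new tool for assessing model trustworthiness.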
Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
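One simple way to obtain a response-level confidence score between 0 and 1 is to sample several answers to the same prompt and measure how often they agree. The sketch below illustrates that idea only; it does not use or reproduce the UQLM API, the function names are hypothetical, and exact-match agreement is a deliberately crude stand-in for the scorers the package provides.

```python
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())

def consistency_confidence(responses):
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(normalize(r) for r in responses)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(responses)

# Hypothetical samples, standing in for repeated LLM generations of one prompt.
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
answer, confidence = consistency_confidence(samples)
print(answer, confidence)  # paris 0.8
```

Low agreement across samples suggests the model is uncertain, which is the signal a hallucination detector of this kind would flag.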
[32] A Survey on Latent Reasoning
Rui-Jie Zhu,Tianhao Peng,Tianhao Cheng,Xingwei Qu,Jinfa Huang,Dawei Zhu,Hao Wang,Kaiwen Xue,Xuanliang Zhang,Yong Shan,Tianle Cai,Taylor Kergan,Assel Kembay,Andrew Smith,Chenghua Lin,Binh Nguyen,Yuqi Pan,Yuhong Chou,Zefan Cai,Zhenhe Wu,Yongchi Zhao,Tianyu Liu,Jian Yang,Wangchunshu Zhou,Chujie Zheng,Chongxuan Li,Yuyin Zhou,Zhoujun Li,Zhaoxiang Zhang,Jiaheng Liu,Ge Zhang,Wenhao Huang,Jason Eshraghian
Main category: cs.CL
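TL;DR: This survey covers the emerging field of latent reasoning, which performs multi-step reasoning in the model's continuous hidden state and so avoids the limitation of explicit chain-of-thought (CoT) reasoning, namely its dependence on natural language. It analyzes the foundational role of neural network layers in reasoning and reviews a range of latent reasoning methods and frontier paradigms.
Details
Motivation: Explicit CoT improves interpretability and accuracy, but its reliance on intermediate steps expressed in natural language limits the model's expressive bandwidth. Latent reasoning aims to overcome this bottleneck by reasoning in hidden states.
Contribution: Systematically surveys the state of latent reasoning research, analyzes neural network layers as the computational substrate for reasoning, and introduces several latent reasoning methods and frontier paradigms, such as infinite-depth reasoning via masked diffusion models.
Method: First analyzes the role of neural network layers in reasoning, then examines latent reasoning methods, including activation-based recurrence, hidden state propagation, and fine-tuning strategies, as well as advanced paradigms such as masked diffusion models.
Result: By performing multi-step reasoning in hidden states, latent reasoning can achieve higher expressive efficiency and global consistency while removing the dependence on explicit reasoning traces.
Insight: Latent reasoning opens a new research direction for LLM reasoning, and reasoning through continuous hidden states may play an important role in future models of cognition.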
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
cs.CV [Back]
[33] CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection
Binjia Zhou,Hengrui Lou,Lizhe Chen,Haoyuan Li,Dawei Luo,Shuai Chen,Jie Lei,Zunlei Feng,Yijun Bei
Main category: cs.CV
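TL;DR: The paper proposes CorrDetail, a visual detail enhanced self-correction framework for interpretable face forgery detection that improves detection by rectifying authentic forgery details and enhancing fine-grained visual details.
Details
Motivation: With the rapid progress of image generation, the spread of facial deepfakes poses a major security challenge. Existing detection methods either lack clear explanations of forgery details or are prone to hallucination, so a more effective and interpretable method is needed.
Contribution: 1) Introduces a self-correction mechanism to rectify forgery details; 2) designs a visual fine-grained detail enhancement module to make details more precise; 3) proposes a fusion decision strategy that strengthens discrimination on extreme samples.
Method: 1) Trains self-correction with error-guided questioning; 2) supplies more precise forgery details through the visual detail enhancement module; 3) applies a decision strategy that fuses visual information compensation with model bias reduction.
Result: CorrDetail achieves state-of-the-art performance compared with the latest methods and excels at accurately identifying forged details and at generalization.
Insight: Enhancing visual details and adding a self-correction mechanism improves both the interpretability and the performance of forgery detection, which matters for security applications.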
Abstract: With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model’s discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.
[34] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Sajjad Ghiasvand,Mahnoosh Alizadeh,Ramtin Pedarsani
Main category: cs.CV
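TL;DR: pFedMMA is a personalized federated learning framework that uses multi-modal adapters to adapt vision-language models, balancing local adaptation against global generalization and achieving communication efficiency by exchanging only a shared projection layer.
Details
Motivation: Existing federated learning methods struggle to balance personalization and generalization and perform poorly on unseen classes or domains in particular. pFedMMA aims to address this.
Contribution: Proposes pFedMMA, the first personalized federated learning framework based on multi-modal adapters, which jointly optimizes local adaptation and global generalization through an asymmetric optimization strategy.
Method: Uses multi-modal adapters with modality-specific up- and down-projection layers and a globally shared projection for cross-modal alignment; only the shared parameters are exchanged, which reduces communication cost.
Result: Experiments on 11 datasets show that pFedMMA achieves a better trade-off between personalization and generalization than existing federated prompt tuning methods.
Insight: The shared projection layer is key to communication efficiency and global generalization, and the asymmetric optimization strategy helps reconcile local and global performance.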
Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
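The communication pattern described in the abstract above, where only the shared component is exchanged, can be sketched as follows. This is a toy illustration, not the pFedMMA code: the adapter is reduced to three small parameter vectors per client, the values are made up, plain averaging stands in for the server-side aggregation, and local training is omitted.

```python
def average(vectors):
    """Element-wise mean of equally sized parameter vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Each client holds private down/up projections and a copy of the shared projection.
# All values are illustrative assumptions.
clients = [
    {"down": [0.1, 0.2], "shared": [1.0, 2.0], "up": [0.3, 0.4]},
    {"down": [0.5, 0.6], "shared": [3.0, 4.0], "up": [0.7, 0.8]},
]

def communication_round(clients):
    """Aggregate only the shared projection; private layers never leave the client."""
    global_shared = average([c["shared"] for c in clients])
    for c in clients:
        c["shared"] = list(global_shared)
    return global_shared

global_shared = communication_round(clients)
print(global_shared)  # [2.0, 3.0]
```

Keeping the up- and down-projections local is what allows each client to personalize, while the averaged shared projection carries the knowledge that generalizes across clients.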
[35] Neural-Driven Image Editing
Pengfei Zhou,Jie Xia,Xiaopeng Peng,Wangbo Zhao,Zilong Ye,Zekai Li,Suorong Yang,Jiadong Pan,Yuanxiang Chen,Ziqiao Wang,Kai Wang,Qian Zheng,Xiaojun Chang,Gang Pan,Shurong Dong,Kaipeng Zhang,Yang You
Main category: cs.CV
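TL;DR: LoongX is a hands-free image editing approach driven by multimodal neurophysiological signals (such as EEG, fNIRS, and PPG). It uses diffusion models and contrastive learning to align user intent with edit semantics, performs on par with text-driven methods, and shows further potential when combined with speech.
Details
Motivation: Conventional image editing requires manually entered prompts, which is unfriendly to people with limited motor control or language ability. Combining brain-computer interfaces with generative models offers a more intuitive and accessible way to edit.
Contribution: 1. Proposes LoongX, the first hands-free image editing framework based on multimodal neural signals.
2. Designs the CS3 and DGF modules to handle signal heterogeneity.
3. Uses contrastive pre-training to align cognitive states with semantic intent.
Method: 1. Uses a diffusion transformer (DiT) together with multimodal neural signals (EEG, fNIRS, and others).
2. Introduces the CS3 module to encode modality-specific features and the DGF module to fuse them into a unified space.
3. Pre-trains with contrastive learning to align intent with semantics.
Result: LoongX performs comparably to text-driven methods (CLIP-I: 0.6605 vs. 0.6558) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549).
Insight: Neural-driven generative models open new directions for accessible image editing and cognition-driven technology, and fusing multimodal signals improves how accurately intent is understood.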
Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
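To make the idea of gated fusion concrete, here is a generic sketch of how per-modality features might be weighted and summed into one vector. It is not the DGF module from the paper, whose internals the abstract does not specify: the softmax gate, the `gate_weights` parameters, and the function names are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    features: Dict[str, List[float]],
    gate_weights: Dict[str, List[float]],
) -> List[float]:
    """Fuse same-sized modality features with input-dependent gates (hypothetical design).

    Each modality gets a scalar score (dot product of its feature with a gate
    vector); a softmax turns the scores into weights that sum to 1.
    """
    names = sorted(features)
    scores = [
        sum(f * w for f, w in zip(features[n], gate_weights[n])) for n in names
    ]
    gates = softmax(scores)
    dim = len(features[names[0]])
    return [
        sum(g * features[n][d] for g, n in zip(gates, names)) for d in range(dim)
    ]
```

The gates depend on the input features themselves, so a noisy modality can be down-weighted per sample rather than by a fixed global weight.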
[36] Motion Generation: A Survey of Generative Approaches and Benchmarks
Aliasghar Khani,Arianna Rampini,Bruno Roy,Larasika Nadela,Noa Kaplan,Evan Atherton,Derek Cheung,Jacky Bibliowicz
Main category: cs.CV
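TL;DR: A survey of motion generation that categorizes generative methods, focusing on papers published in top-tier venues since 2023, and summarizes architectural principles, evaluation metrics, and datasets to give researchers a reference and identify open challenges.
Details
Motivation: Motion generation has important applications in computer vision, graphics, and robotics, but the diversity of existing methods makes comprehensive review and comparison difficult, so a systematic survey of recent progress is needed.
Contribution: Provides a categorization of motion generation methods by generative strategy, analyzes architectural principles, conditioning mechanisms, evaluation metrics, and datasets, and identifies open challenges.
Method: Systematically categorizes and compares generative approaches, including GANs, autoencoders, autoregressive models, and diffusion-based techniques.
Result: A structured overview of motion generation methods that notes the advantages and limitations of each generative approach and points to directions for future research.
Insight: The rapid growth of the field calls for more standardized evaluation metrics and datasets to support comparison across methods.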
Abstract: Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation.
[37] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Shiting Xiao,Rishabh Kabra,Yuhang Li,Donghyun Lee,Joao Carreira,Priyadarshini Panda
Main category: cs.CV
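TL;DR: OpenWorldSAM extends SAM2 with multi-modal embeddings from a lightweight vision-language model (VLM) to support universal image segmentation from open-vocabulary language prompts. Its key strengths are unified prompting, efficiency, instance awareness, and strong generalization, and it performs well on multiple benchmarks.
Details
Motivation: Existing image segmentation models have limited open-vocabulary capability, and segmentation from language prompts in particular still needs improvement. OpenWorldSAM addresses this by combining multi-modal embeddings with pre-trained models for more flexible, general segmentation.
Contribution: 1. Proposes the OpenWorldSAM framework, which extends SAM2 to segmentation from open-vocabulary language prompts. 2. Achieves efficient training and zero-shot generalization through multi-modal embeddings from a lightweight VLM. 3. Introduces positional tie-breaker embeddings and cross-attention layers to strengthen instance awareness.
Method: 1. Freezes the pre-trained components of SAM2 and the VLM and trains only 4.5 million parameters on COCO-stuff. 2. Uses a unified prompting mechanism that supports category-level and sentence-level language descriptions. 3. Uses positional tie-breaker embeddings and cross-attention layers to segment multiple instances.
Result: OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation on benchmarks including ADE20k, PASCAL, ScanNet, and SUN-RGBD.
Insight: With efficient multi-modal embeddings and a lightweight design, OpenWorldSAM shows that precise open-vocabulary segmentation and zero-shot generalization are achievable while keeping the model simple.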
Abstract: The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model’s spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks, including ADE20k, PASCAL, ScanNet, and SUN-RGBD.
[38] Robotic System with AI for Real Time Weed Detection, Canopy Aware Spraying, and Droplet Pattern Evaluation
Inayat Rasool,Pappu Kumar Yadav,Amee Parmar,Hasan Mirzakhaninafchi,Rikesh Budhathoki,Zain Ul Abideen Usmani,Supriya Paudel,Ivan Perez Olivera,Eric Jone
Main category: cs.CV
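TL;DR: The paper presents an AI-based real-time weed detection and variable-rate spraying system that runs lightweight YOLO11n models on embedded hardware to detect weeds and control nozzles in real time, with the goal of reducing herbicide waste.
Details
Motivation: Uniform and excessive herbicide application in modern agriculture raises input costs, pollutes the environment, and drives the emergence of herbicide-resistant weeds, so a more selective solution is needed.
Contribution: Develops a real-time weed detection and spraying system that integrates YOLO11n and YOLO11n-seg models and adjusts spraying dynamically according to canopy size.
Method: Runs lightweight YOLO11n and YOLO11n-seg models on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid-actuated nozzles.
Result: The YOLO11n detector reached mAP@50 of 0.98 (precision 0.99, recall close to 1.0), while the YOLO11n-seg model reached mAP@50 of 0.48. Average spray coverage was 24.22% in zones where canopy was present, and mean coverage rose from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies.
Insight: Combining real-time deep learning with low-cost embedded hardware is promising for selective herbicide application. Future work will extend detection to more weed species and validate the system in field trials.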
Abstract: Uniform and excessive herbicide application in modern agriculture contributes to increased input costs, environmental pollution, and the emergence of herbicide resistant weeds. To address these challenges, we developed a vision guided, AI-driven variable rate sprayer system capable of detecting weed presence, estimating canopy size, and dynamically adjusting nozzle activation in real time. The system integrates lightweight YOLO11n and YOLO11n-seg deep learning models, deployed on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid actuated nozzles based on canopy segmentation results. Indoor trials were conducted using 15 potted Hibiscus rosa sinensis plants of varying canopy sizes to simulate a range of weed patch scenarios. The YOLO11n model achieved a mean average precision (mAP@50) of 0.98, with a precision of 0.99 and a recall close to 1.0. The YOLO11n-seg segmentation model achieved a mAP@50 of 0.48, precision of 0.55, and recall of 0.52. System performance was validated using water sensitive paper, which showed an average spray coverage of 24.22% in zones where canopy was present. An upward trend in mean spray coverage from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies, respectively, demonstrated the system’s capability to adjust spray output based on canopy size in real time. These results highlight the potential of combining real time deep learning with low-cost embedded hardware for selective herbicide application. Future work will focus on expanding the detection capabilities to include three common weed species in South Dakota: water hemp (Amaranthus tuberculatus), kochia (Bassia scoparia), and foxtail (Setaria spp.), followed by further validation in both indoor and field trials within soybean and corn production systems.
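The step from a canopy segmentation mask to nozzle activation can be illustrated with a small sketch. The abstract does not give the control rule, so everything here is assumed: the mask is split into equal-width vertical zones, one per nozzle, and a nozzle opens when the canopy fraction in its zone reaches a hypothetical `min_fraction` threshold.

```python
from typing import List


def nozzle_commands(
    mask: List[List[int]],
    num_nozzles: int,
    min_fraction: float = 0.05,  # hypothetical threshold, not from the paper
) -> List[bool]:
    """Return one on/off command per nozzle from a binary canopy mask.

    mask[r][c] is 1 where the segmentation model labelled canopy, else 0.
    The image width is divided into num_nozzles vertical zones; the last
    zone absorbs any leftover columns.
    """
    height, width = len(mask), len(mask[0])
    zone_width = width // num_nozzles
    commands = []
    for n in range(num_nozzles):
        start = n * zone_width
        end = width if n == num_nozzles - 1 else start + zone_width
        canopy = sum(mask[r][c] for r in range(height) for c in range(start, end))
        area = height * (end - start)
        commands.append(canopy / area >= min_fraction)
    return commands
```

In a deployed system the resulting booleans would be sent to the relay interface; that serial protocol is outside the scope of this sketch.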
[39] Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers From Driving Video
Md Zahid Hasan,Guillermo Basulto-Elias,Jun Ha Chang,Sahuna Hallmark,Matthew Rizzo,Anuj Sharma,Soumik Sarkar
Main category: cs.CV
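TL;DR: The paper proposes scenario-based cognitive status identification from naturalistic driving videos and large vision models. By analyzing the everyday driving behavior of older drivers, it aims to detect cognitive decline (such as Alzheimer's disease and mild cognitive impairment) early and support proactive intervention strategies.
Details
Motivation: Current methods for diagnosing cognitive decline are time-consuming and costly, so many cases go undetected. Driving behavior, as an observable indicator of cognitive status, can serve as the basis for a non-invasive, scalable early detection tool.
Contribution: 1. Uses driving videos and large vision models to extract "digital fingerprints" that correlate with cognitive decline; 2. Develops a framework to classify cognitive status and predict disease progression; 3. Treats the vehicle as a "diagnostic tool" for monitoring early cognitive decline.
Method: Analyzes driver behavior in naturalistic driving videos with large vision models, extracts information related to cognitive decline, and builds classification and prediction models.
Result: The method identifies early warning signs of functional impairment and supports the development of early intervention strategies.
Insight: Driving behavior is an informative observation of cognitive status. Combined with large vision models, it enables non-invasive, scalable early detection that could reduce the societal and economic burden of cognitive decline in an aging population.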
Abstract: We introduce scenario-based cognitive status identification in older drivers from Naturalistic driving videos and large vision models. In recent times, cognitive decline, including Alzheimer’s disease (AD) and mild cognitive impairment (MCI), is often underdiagnosed due to the time-consuming and costly nature of current diagnostic methods. By analyzing real-world driving behavior captured through in-vehicle systems, this research aims to extract “digital fingerprints” that correlate with functional decline and clinical features of MCI and AD. Moreover, modern large vision models can draw meaningful insights from everyday driving patterns of older patients to early detect cognitive decline. We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status and predict disease progression. We leverage the strong relationship between real-world driving behavior as an observation of the current cognitive status of the drivers where the vehicle can be utilized as a “diagnostic tool”. Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.
[40] Cloud Diffusion Part 1: Theory and Motivation
Andrew Randono
Main category: cs.CV
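TL;DR: The paper proposes a "Cloud Diffusion Model" that replaces the usual white noise with a scale-invariant noise distribution, with the aim of improving inference speed, high-frequency detail, and controllability in diffusion models.
Details
Motivation: Conventional diffusion models use white noise, but the low-order statistics of natural images exhibit scale invariance. The paper argues that a scale-invariant noise distribution matches natural images better and can therefore improve model performance.
Contribution: Proposes the Cloud Diffusion Model, which uses a scale-invariant noise profile in place of white noise, and provides the theoretical motivation for follow-up work.
Method: Analyzes the advantages of scale-invariant noise distributions theoretically, contrasts them with white noise, and describes how to incorporate them into diffusion models.
Result: The paper argues that Cloud Diffusion Models can offer faster inference, better high-frequency detail, and greater controllability than white-noise diffusion models. Experimental results are deferred to a follow-up paper.
Insight: By exploiting the scale invariance of natural images, the noise distribution is in theory closer to the true image distribution, which may translate into better generation.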
Abstract: Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise – that is, noise based on independent normal distributions at each point whose mean and variance are independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a "Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
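One common way to obtain noise with power-law statistics is to shape the spectrum of white noise. The sketch below does this in one dimension with a naive DFT so that it needs only the standard library. It is an illustration of the general idea, not the construction used in the paper: the exponent `alpha`, the 1D setting, and the function names are assumptions.

```python
import cmath
import math
import random
from typing import List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def power_law_noise(n: int, alpha: float, seed: int = 0) -> List[float]:
    """Gaussian noise whose power spectrum falls off as 1 / f**alpha.

    alpha = 0 gives white noise with the mean removed; larger alpha puts more
    energy into large-scale (low-frequency) structure.
    """
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    spectrum = dft(white)
    shaped = [0j]  # drop the DC term so the result has zero mean
    for k in range(1, n):
        freq = min(k, n - k)  # symmetric index keeps the spectrum Hermitian
        shaped.append(spectrum[k] * freq ** (-alpha / 2.0))
    return [z.real for z in dft(shaped, inverse=True)]
```

For images, the same shaping would be applied to a 2D spectrum using the radial frequency, typically with an FFT library instead of the naive transform used here.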
[41] LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving
Giulio Federico,Fabio Carrara,Claudio Gennaro,Giuseppe Amato,Marco Di Benedetto
Main category: cs.CV
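TL;DR: LoomNet is a multi-view diffusion architecture that generates 16 consistent views through a shared latent space, improving both multi-view image quality and 3D reconstruction.
Details
Motivation: Generating consistent multi-view images from a single image is challenging, and a lack of spatial consistency degrades 3D mesh quality in surface reconstruction.
Contribution: Proposes LoomNet, a multi-view diffusion architecture that applies the same diffusion model multiple times in parallel and achieves view consistency through a shared latent space.
Method: Each viewpoint-specific inference produces an encoding that is projected onto three orthogonal planes. For each plane, the encodings from all views are fused into one aggregated plane, and the aggregated planes are processed to propagate information and interpolate missing regions, yielding a unified latent space.
Result: LoomNet generates 16 high-quality, coherent views in 15 seconds. In experiments it outperforms existing methods on image quality and reconstruction metrics and produces diverse, plausible novel views.
Insight: A shared latent space built collaboratively across parallel inferences gives higher spatial consistency and efficiency in multi-view generation.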
Abstract: Generating consistent multi-view images from a single image remains challenging. Lack of spatial consistency often degrades 3D mesh quality in surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality and coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, also showing creativity by producing diverse, plausible novel views from the same input.
[42] Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model
Mengyao Xu,Gabriel Moreira,Ronay Ak,Radek Osmulski,Yauhen Babakhin,Zhiding Yu,Benedikt Schifferer,Even Oldridge
Main category: cs.CV
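TL;DR: The paper introduces Llama Nemoretriever Colembed, a multimodal retrieval model that modifies the attention mechanism of the NVIDIA Eagle2 VLM and adds a ColBERT-style interaction mechanism to reach top performance in text-image retrieval.
Details
Motivation: With growing demand for cross-modal retrieval systems, the authors set out to build a unified text-image retrieval model that performs best across multiple benchmarks.
Contribution: 1. Proposes a model architecture that combines bidirectional attention with a ColBERT-style interaction mechanism; 2. Releases 1B and 3B model variants, with the 3B model achieving state-of-the-art results on ViDoRe V1 and V2; 3. Provides a comprehensive analysis of the storage and efficiency trade-offs.
Method: 1. Modifies the NVIDIA Eagle2 VLM by replacing causal attention with bidirectional attention; 2. Integrates a ColBERT-style late interaction mechanism; 3. Adopts a two-stage training strategy to improve retrieval capability.
Result: The 3B model scores NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025.
Insight: Bidirectional attention and late interaction improve retrieval accuracy, but they must be weighed against storage and compute costs.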
Abstract: Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model’s retrieval capabilities.
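ColBERT-style late interaction scores a query against a document by keeping one embedding per token, taking for each query token its best-matching document token, and summing those maxima (often called MaxSim). The sketch below shows that scoring rule on plain Python lists. The function names are illustrative and the toy vectors stand in for real model embeddings, so this is not the released model's code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: for each query token take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens: List[Vector], docs: List[List[Vector]]) -> List[int]:
    """Return document indices sorted from highest to lowest score."""
    scores = [late_interaction_score(query_tokens, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The storage trade-off mentioned in the abstract follows directly from this design: each document keeps one vector per token (or image patch) rather than a single pooled vector, so index size grows with document length.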
[43] ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models
Jiaxu Tian,Xuehui Yu,Yaoxing Wang,Pan Wang,Guangqian Guo,Shan Gao
Main category: cs.CV
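TL;DR: ReLayout is a content-aware layout generation method based on relation reasoning. It introduces explicit definitions of relations between elements and a layout prototype rebalance sampler to address the weak spatial-relation understanding of existing LLM-based methods.
Details
Motivation: Existing LLM-based layout generation methods do not adequately interpret the spatial relationships between visual themes and design elements, so the generated layouts lack structure and diversity.
Contribution: 1. Introduces explicit relation definitions (such as region, salient, and margin); 2. Proposes a layout prototype rebalance sampler; 3. Uses relation-CoT to generate more structured and diverse layouts.
Method: 1. Relation-CoT decomposes a layout into smaller, structured, recursive layouts; 2. The layout prototype rebalance sampler quantifies layout styles and counters data bias.
Result: Experiments show that ReLayout outperforms baseline methods and generates layouts that are better aligned with human aesthetics and more explainable.
Insight: Relation reasoning and prototype rebalancing are key to improving the structure and diversity of generated layouts.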
Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structural and diverse problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by fundamentally originating from design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.
[44] Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions
Jun-Xiong Chong,Fang-Yu Hsu,Ming-Tsung Hsu,Yi-Ting Lin,Kai-Heng Chien,Chiou-Ting Hsu,Pei-Kai Huang
Main category: cs.CV
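TL;DR: The paper proposes a Cross-modal Transition-guided Network (CTNet) for multi-modal face anti-spoofing (FAS) that addresses domain discrepancy and missing modalities by learning how cross-modal feature transitions differ between live and spoof samples.
Details
Motivation: Multi-modal FAS suffers from distribution discrepancies across modalities and domains and from missing modalities at inference, which makes performance unstable. The paper builds on the observation that feature transitions behave differently for live and spoof samples.
Contribution: 1) Proposes CTNet, which learns the differences in cross-modal feature transitions between live and spoof samples to make multi-modal FAS more robust; 2) Proposes learning complementary infrared (IR) and depth features from the RGB modality to handle missing modalities.
Method: 1) Learns consistent cross-modal feature transitions among live samples to build a generalized feature space; 2) Learns the inconsistent feature transitions between live and spoof samples to detect out-of-distribution attacks; 3) Generates complementary IR and depth features from the RGB modality.
Result: Experiments show that CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
Insight: The difference in cross-modal feature transitions between live and spoof samples is a useful cue for multi-modal FAS, and generating complementary features from RGB mitigates missing modalities.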
Abstract: Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle the challenges in the multi-modal FAS task. Our motivation stems from that, within a single modality, the visual differences between live faces are typically much smaller than those of spoof faces. Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Upon this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.
[45] GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field
Zhizhuo Pang,Zhihui Ke,Xiaobo Zhou,Tie Qiu
Main category: cs.CV
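TL;DR: GSVR is a 2D Gaussian-based video representation that combines a hybrid deformation field with a Dynamic-aware Time Slicing strategy, reaching 800+ FPS decoding and 35+ PSNR on Bunny with only 2 seconds of training per frame.
Details
Motivation: Existing implicit neural video representations rely mainly on convolutional networks, which decode slowly and take a long time to train. GSVR targets both problems with an efficient representation and decoder.
Contribution: 1. Proposes GSVR, a 2D Gaussian-based video representation with much faster decoding (800+ FPS) and training (2 seconds per frame). 2. Designs a hybrid deformation field that combines tri-plane motion and polynomial motion to handle camera and object motion. 3. Proposes a Dynamic-aware Time Slicing strategy that adaptively divides the video into groups of pictures (GOPs). 4. Introduces quantization-aware fine-tuning to avoid performance loss after quantization.
Method: 1. Models video dynamics with a hybrid deformation field that combines tri-plane motion and polynomial motion. 2. Divides the video into GOPs according to its dynamic level using Dynamic-aware Time Slicing. 3. Applies quantization-aware fine-tuning and compresses the Gaussians with image codecs for a compact representation.
Result: GSVR reaches 800+ FPS and 35+ PSNR on Bunny with 2 seconds of training per frame. On the Bunny and UVG datasets it converges much faster than existing methods and decodes 10x faster. It is comparable to the state of the art on video interpolation and achieves better compression than NeRV.
Insight: Combining 2D Gaussians with a hybrid deformation field improves the efficiency and decoding speed of video representation, which is promising for real-time video processing.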
Abstract: Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay more attention to improving video reconstruction quality but little attention to the decoding speed. However, the high computation of convolutional network used in existing methods leads to low decoding speed. Moreover, these convolution-based video representation methods also suffer from long training time, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve the above problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny, only needing a training time of 2 seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely the tri-plane motion and the polynomial motion, to deal with the coupling of camera motion and object motion in the video. Furthermore, we propose a Dynamic-aware Time Slicing strategy to adaptively divide the video into multiple groups of pictures (GOPs) based on the dynamic level of the video in order to handle large camera motion and non-rigid movements. Finally, we propose quantization-aware fine-tuning to avoid performance reduction after quantization and utilize image codecs to compress Gaussians to achieve a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and also has 10x faster decoding speed compared to other methods. Our method has comparable performance in the video interpolation task to SOTA and attains better video compression performance than NeRV.
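As an illustration of what a polynomial motion term can look like, the sketch below moves the center of a single 2D Gaussian along a polynomial in normalized time. The abstract does not give the paper's parameterization, so the form mu(t) = mu0 + sum_k c_k t^k, the coefficient layout, and the function names are assumptions. The tri-plane component is not modelled here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polynomial_offset(coeffs: List[Point], t: float) -> Point:
    """Displacement sum_k c_k * t**k, with coeffs[k - 1] holding c_k for k = 1..K."""
    dx = sum(c[0] * t ** k for k, c in enumerate(coeffs, start=1))
    dy = sum(c[1] * t ** k for k, c in enumerate(coeffs, start=1))
    return (dx, dy)


def gaussian_center_at(mu0: Point, coeffs: List[Point], t: float) -> Point:
    """Center of one Gaussian at normalized time t in [0, 1] (hypothetical form)."""
    dx, dy = polynomial_offset(coeffs, t)
    return (mu0[0] + dx, mu0[1] + dy)
```

Evaluating a low-degree polynomial per Gaussian is a handful of multiply-adds, which is consistent with the decoding speeds the abstract reports, although the actual decoder design is not described there.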
[46] PaddleOCR 3.0 Technical Report
Cheng Cui,Ting Sun,Manhui Lin,Tingquan Gao,Yubo Zhang,Jiaxuan Liu,Xueqing Wang,Zelun Zhang,Changda Zhou,Hongen Liu,Yue Zhang,Wenyu Lv,Kui Huang,Yichao Zhang,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma
Main category: cs.CV
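TL;DR: PaddleOCR 3.0 is an open-source toolkit for OCR and document parsing. To meet document understanding needs in the era of large language models, it offers three main solutions (multilingual text recognition, hierarchical document parsing, and key information extraction) while staying efficient and lightweight.
Details
Motivation: Demand for document understanding is growing in the era of large language models, and PaddleOCR 3.0 aims to provide an efficient, lightweight, and versatile toolkit for OCR and document parsing.
Contribution: 1. PP-OCRv5 for multilingual text recognition; 2. PP-StructureV3 for hierarchical document parsing; 3. PP-ChatOCRv4 for key information extraction; 4. Efficient tools for training, inference, and deployment.
Method: Through lightweight design and support for heterogeneous hardware acceleration, models with fewer than 100 million parameters compete in accuracy and efficiency with billion-parameter vision-language models.
Result: The PaddleOCR 3.0 models remain efficient while reaching accuracy competitive with mainstream vision-language models.
Insight: With careful design and hardware acceleration, lightweight models can match much larger models on document understanding tasks and are better suited to practical deployment.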
Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
[47] Rethinking Layered Graphic Design Generation with a Top-Down Approach
Jingye Chen,Zhaowen Wang,Nanxuan Zhao,Li Zhang,Difan Liu,Jimei Yang,Qifeng Chen
Main category: cs.CV
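TL;DR: Accordion is a graphic design generation framework that makes a first attempt at converting AI-generated images into editable layered designs while replacing nonsensical generated text under the guidance of user prompts. It works top-down, using a visually harmonious reference image as global guidance for generating the layers.
Details
Motivation: AI-generated designs offer high-quality pixel images but lack editability. Non-layered designs are hard to edit, yet they still inspire designers' choices of layout and text style. Accordion aims to combine both strengths by converting AI-generated designs into editable layered designs.
Contribution: 1. Proposes the first framework for converting AI-generated designs into editable layered designs. 2. Uses a vision language model (VLM) that performs different tasks across three stages. 3. Takes a top-down approach that combines a reference image with multiple vision experts (such as SAM) to generate layers. 4. Refines generated text using user prompts.
Method: 1. A VLM carries out stage-specific tasks across three stages, such as decomposing layers and refining text. 2. A top-down process uses the reference image as global guidance to decompose each layer. 3. Vision experts such as SAM and element removal models assist in creating the layers. 4. The method is trained on the in-house Design39K dataset, augmented with AI-generated design images paired with refined ground truth created by a customized inpainting model.
Result: Experiments and user studies show that Accordion produces favorable results on the DesignIntention benchmark, including text-to-template, adding text to background, and text de-rendering, and that it is effective at creating design variations.
Insight: A top-down approach gives better global coherence in graphic design generation, and combining a VLM with multiple expert models improves the editability and practicality of the output. User prompts further improve the quality of the generated content.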
Abstract: Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.
[48] OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval
Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Xuemeng Song,Liqiang Nie
Main category: cs.CV
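TL;DR: OFFSET is a segmentation-based focus shift revision method for composed image retrieval that tackles visual noise interference and the neglected priority of text. It extracts features through dominant portion segmentation and dual focus mapping, and adds a textually guided focus revision module to improve retrieval performance.
Details
Motivation: Existing composed image retrieval methods ignore the inhomogeneity between dominant and noisy portions of visual data, which degrades query features. They also overlook the priority of text in the image modification process, which causes a visual focus bias. OFFSET aims to solve both problems.
Contribution: 1) Designs a focus mapping-based feature extractor that consists of dominant portion segmentation and dual focus mapping modules; 2) Proposes a textually guided focus revision module that adaptively adjusts the focus on the reference image according to the modification requirements implied by the text.
Method: The method has two parts: 1) dominant portion segmentation and dual focus mapping modules, which guide the extraction of visual and textual features; 2) a textually guided focus revision module, which adaptively revises the focus on the reference image.
Result: Experiments on four benchmark datasets demonstrate the superiority of OFFSET on composed image retrieval.
Insight: Segmentation effectively reduces noise interference, and textually guided focus revision improves how well the modification requirements are understood and captured.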
Abstract: Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users’ intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/
[49] Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain
Junfei Shi,Yu Cheng,Haiyan Jin,Junhuai Li,Zhaolin Xiao,Maoguo Gong,Weisi Lin
Main category: cs.CV
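TL;DR: The paper proposes a knowledge-guided complex-valued diffusion model in the Contourlet domain for PolSAR image classification. By combining multiscale and multidirectional information, it improves classification accuracy and edge preservation.
Details
Motivation: Traditional real-valued diffusion models have difficulty capturing complex-valued phase information in PolSAR data and tend to lose fine structural detail. The Contourlet transform provides rich multiscale and multidirectional representations that suit PolSAR imagery.
Contribution: 1. Builds a complex-valued diffusion model in the Contourlet domain; 2. Uses structural information from high-frequency coefficients to guide the diffusion process; 3. Jointly learns multiscale and multidirectional features to improve classification.
Method: 1. Decomposes the data into low- and high-frequency subbands with the complex Contourlet transform; 2. Designs a knowledge-guided complex diffusion network that models the statistical properties of the low-frequency components; 3. Combines high-frequency features to refine classification.
Result: On three real-world PolSAR datasets, the method surpasses existing methods in classification accuracy, particularly in preserving edge details and maintaining region homogeneity.
Insight: Combining the Contourlet transform with a complex-valued diffusion model is an effective way to handle PolSAR data, and structural guidance improves the model's ability to retain detail.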
Abstract: Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.
[50] Dynamic Rank Adaptation for Vision-Language Models
Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo
Main category: cs.CV
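TL;DR: The paper proposes Dynamic Rank Adaptation (DRA), a new adapter method that dynamically allocates adaptation rank according to feature importance to improve the generalization of pre-trained vision-language models (VLMs) to new classes.
Details
Motivation: Existing prompt-based and adapter-based methods treat all tokens of the image and text encoders equally when fine-tuning VLMs, which leads to overfitting on uninformative features and hurts recognition of novel concepts.
Contribution: Proposes Dynamic Rank Adaptation (DRA), which groups tokens by importance and dynamically assigns feature ranks, preserving general knowledge and improving new-class generalization.
Method: 1. Token importance grouping; 2. Dynamic allocation of feature ranks; 3. A channel response mechanism; 4. An L1 regularization term to stabilize training.
Result: Experiments show that DRA outperforms existing methods on several benchmarks, including base-to-new classes, cross-dataset evaluation, and domain generalization.
Insight: Dynamically adjusting adaptation capacity by feature importance helps avoid overfitting and improves generalization to new classes.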
Abstract: Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenges of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant method, designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. Then, we adopt rank adaptation according to the importance of each token group dynamically by assigning higher feature ranks to the more important tokens. Also, we design a new channel response mechanism to prioritize the preservation and adaptation of feature channels identified as the most informative for each instance. In addition, an L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially on enhancing the performance of new classes on various benchmarks, including base-new classes, cross-dataset evaluation and domain generalization. The source code will be published after the paper is received.
[51] Modeling and Reversing Brain Lesions Using Diffusion Models
Omar Zamzam,Haleh Akrami,Anand Joshi,Richard Leahy
Main category: cs.CV
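TL;DR: The paper proposes a diffusion model-based framework for analyzing and reversing the brain lesion process: it segments abnormal regions, estimates and reverses tissue deformation, and finally inpaints the core lesion area to estimate the pre-lesion healthy brain.
Details
Motivation: Existing brain lesion segmentation methods do not distinguish damaged tissue from deformed tissue, which makes the analysis inaccurate. The study uses diffusion models to address this and to provide more precise lesion analysis and reversal.
Contribution: 1. Proposes a diffusion model framework that segments lesions, reverses deformation, and inpaints the lesion area; 2. Validates the reverse process by simulating a forward model; 3. Outperforms traditional methods on segmentation and labeling tasks.
Method: 1. Segment abnormal regions; 2. Estimate and reverse tissue deformation; 3. Inpaint the core lesion area; 4. Simulate the lesion process with a forward model to validate the method.
Result: Compared with traditional methods, the approach is more accurate on lesion segmentation, lesion characterization, and brain labeling.
Insight: Reversing the lesion process improves segmentation accuracy and gives clinicians and researchers a reliable tool for lesion analysis. Because real pre-lesion data are unavailable, simulating a forward model offers a new route to validation.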
Abstract: Brain lesions are abnormalities or injuries in brain tissue that are often detectable using magnetic resonance imaging (MRI), which reveals structural changes in the affected areas. This broad definition of brain lesions includes areas of the brain that are irreversibly damaged, as well as areas of brain tissue that are deformed as a result of lesion growth or swelling. Despite the importance of differentiating between damaged and deformed tissue, existing lesion segmentation methods overlook this distinction, labeling both of them as a single anomaly. In this work, we introduce a diffusion model-based framework for analyzing and reversing the brain lesion process. Our pipeline first segments abnormal regions in the brain, then estimates and reverses tissue deformations by restoring displaced tissue to its original position, isolating the core lesion area representing the initial damage. Finally, we inpaint the core lesion area to arrive at an estimation of the pre-lesion healthy brain. This proposed framework reverses a forward lesion growth process model that is well-established in biomechanical studies that model brain lesions. Our results demonstrate improved accuracy in lesion segmentation, characterization, and brain labeling compared to traditional methods, offering a robust tool for clinical and research applications in brain lesion analysis. Since pre-lesion healthy versions of abnormal brains are not available in any public dataset for validation of the reverse process, we simulate a forward model to synthesize multiple lesioned brain images.
[52] R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
Joonhyung Park,Peng Tang,Sagnik Das,Srikar Appalaraju,Kunwar Yashraj Singh,R. Manmatha,Shabnam Ghadar
Main category: cs.CV
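TL;DR: R-VLM is a region-aware vision language model that grounds GUI elements precisely using zoomed-in region proposals and an IoU-aware loss function, improving the accuracy and generalization of GUI automation tasks.
Details
Motivation: In GUI automation, existing vision-only models ground elements directly from large, cluttered screenshots, which limits accuracy, and the cross-entropy loss they use does not measure grounding quality well.
Contribution: Proposes R-VLM, which introduces zoomed-in region proposals and an IoU-aware loss function and significantly improves precise grounding of GUI elements.
Method: Combines region proposals with a vision language model and proposes an IoU-aware loss that optimizes grounding quality in place of the conventional cross-entropy loss.
Result: Improves grounding accuracy by 13% on the ScreenSpot and AgentStudio benchmarks and achieves 3.2-9.7% absolute gains on the AITW and Mind2Web navigation tasks.
Insight: Combining vision language models with object detection techniques addresses GUI element grounding more effectively and suggests a new direction for GUI automation.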
Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
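The Intersection-over-Union metric that the abstract contrasts with cross-entropy can be illustrated with a minimal sketch. The function names `box_iou` and `iou_loss`, the `(x1, y1, x2, y2)` box format, and the `1 - IoU` loss are assumptions made for illustration; the abstract does not give the exact form of R-VLM's IoU-aware objective.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) tuples."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target):
    """Generic 1 - IoU loss (an assumed stand-in, not the paper's objective)."""
    return 1.0 - box_iou(pred, target)
```

Unlike a token-level cross-entropy over coordinate strings, a loss of this kind decreases smoothly as the predicted box overlaps the target more, which is the property the abstract appeals to.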
[53] MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
Rongsheng Wang,Junying Chen,Ke Ji,Zhenyang Cai,Shunian Chen,Yunjin Yang,Benyou Wang
Main category: cs.CV
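TL;DR: The paper presents MedVideoCap-55K, the first large-scale, high-quality dataset for medical video generation, and builds the MedGen model on it, achieving leading visual quality and medical accuracy in medical video generation.
Details
Motivation: Medical video generation is valuable for clinical training, education, and simulation, but existing generative models lack high-quality datasets in the medical domain, so their output is inaccurate or unrealistic.
Contribution: 1. Releases MedVideoCap-55K, the first large-scale, diverse, and caption-rich medical video dataset; 2. Develops MedGen, which performs strongly in both visual quality and medical accuracy.
Method: Builds the MedVideoCap-55K dataset, covering more than 55,000 video clips from real-world medical scenarios, and trains MedGen on it.
Result: MedGen outperforms open-source models on multiple benchmarks and is comparable to commercial systems.
Insight: High-quality datasets are key to improving generative models in the medical domain, and MedVideoCap-55K is an important resource for medical video generation research.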
Abstract: Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen
[54] Integrated Structural Prompt Learning for Vision-Language Models
Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo
Main category: cs.CV
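TL;DR: The paper proposes Integrated Structural Prompt (ISP) learning to strengthen information interaction between the text and image modalities in vision-language models (VLMs). Self-structural and cross-structural prompt modules model the relationships between learnable prompts and frozen tokens, and a sample probing module dynamically adjusts loss coefficients to improve generalization to new classes.
Details
Motivation: Existing methods do not fully exploit the structural relationships between learnable prompts and tokens within and across modalities, and they struggle to balance performance on base and new classes. A more effective way to strengthen cross-modal interaction and generalization is needed.
Contribution: Proposes the Integrated Structural Prompt (ISP), with self-structural and cross-structural prompt modules that model prompt-token relationships, and a sample probing module that dynamically adjusts loss coefficients, improving generalization.
Method: ISP uses self-structural and cross-structural prompt modules to strengthen information interaction within and across modalities, and a sample probing module to adjust per-sample loss coefficients dynamically so that the model does not overfit to easy samples.
Result: ISP performs well in three settings (base-to-new generalization, cross-dataset evaluation, and domain generalization) and outperforms existing methods.
Insight: Structural relationships within and across modalities matter for model performance, and dynamically adjusting loss coefficients helps balance learning on base and new classes and improves generalization.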
Abstract: Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) like CLIP for various downstream tasks. These methods adopt handcrafted templates or learnable vectors to provide text or image instructions in fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing the performance of base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while preserving feature stability. Additionally, we propose a sample probing module that dynamically adjusts loss coefficients based on sample difficulty, preventing the model from overfitting to simple samples and improving generalization ability to new classes. Extensive experiments on three widely used settings: base-to-new generalization, cross-dataset evaluation, and domain generalization demonstrate that the proposed ISP achieves competitive performance against state-of-the-art methods.
[55] LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
Yisu Zhang,Chenjie Cao,Chaohui Yu,Jianke Zhu
Main category: cs.CV
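TL;DR: LiON-LoRA is a new framework that rethinks LoRA fusion through linear scalability, orthogonality, and norm consistency in order to unify control of spatial and temporal generation in video diffusion models.
Details
Motivation: Existing LoRA methods struggle to control camera trajectories and object motion precisely at the same time in video diffusion models, mainly because of unstable fusion and non-linear scalability.
Contribution: Proposes the LiON-LoRA framework, which improves LoRA fusion through linear scalability, orthogonality, and norm consistency and achieves unified control over spatial and temporal generation.
Method: 1. Analyze the orthogonality of LoRA features in shallow VDM layers. 2. Stabilize fusion for complex camera motion combinations through norm consistency. 3. Integrate a controllable token into the diffusion transformer to adjust motion amplitude linearly.
Result: LiON-LoRA outperforms existing methods in trajectory control accuracy and motion strength adjustment, and generalizes well with little training data.
Insight: Orthogonality and norm consistency of LoRA features are key to improving spatial and temporal control in video diffusion models.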
Abstract: Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to the unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: https://fuchengsu.github.io/lionlora.github.io/
[56] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Mohsi Jawaid,Marcus Märtens,Tat-Jun Chin
Main category: cs.CV
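TL;DR: The paper proposes a fusion method that combines RGB and event sensors to make spacecraft pose estimation more robust under harsh lighting. It uses a beam-splitter prism for precise alignment, develops a RANSAC-based fusion technique, and will release its dataset publicly to support community research.
Details
Motivation: Spacecraft pose estimation is crucial for autonomous in-space operations, but conventional RGB sensors perform poorly under harsh lighting, while event sensors offer high dynamic range but suffer from lower resolution and reduced signal-to-noise ratio. The paper fuses the two sensors so that their strengths complement each other.
Contribution: 1. Proposes a RANSAC-based fusion method for RGB and event sensors; 2. Uses a beam-splitter prism for precise alignment; 3. Will release a spacecraft pose estimation dataset covering a variety of lighting conditions.
Method: Aligns RGB and event sensor data with a beam-splitter prism, develops a RANSAC-based fusion technique that combines the two modalities, and uses dropout uncertainty estimation to detect extreme conditions.
Result: Experiments show that the fusion method makes pose estimation more robust under harsh lighting and support the use of event sensors for spacecraft pose estimation.
Insight: Event sensors are promising under extreme lighting, and fusing them with RGB sensors improves performance further, which informs sensor selection for future space missions.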
Abstract: Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, are a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further support the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset will be released publicly.
[57] Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study
Aayushma Pant,Arbind Agrahari Baniya,Tsz-Kwan Lee,Sunil Aryal
Main category: cs.CV
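TL;DR: The paper surveys hyperspectral anomaly detection (HAD) methods, comparing statistical models, representation-based methods, classical machine learning, and deep learning, and evaluates them on 17 benchmark datasets. Deep models achieve the highest detection accuracy, while statistical models are the fastest.
Details
Motivation: Hyperspectral images are widely used in agriculture, defence, and other fields, but existing anomaly detection methods suffer from high computational complexity and sensitivity to noise, so a systematic comparison and evaluation is needed.
Contribution: Provides a comprehensive categorization and comparison of HAD techniques, evaluates their performance on 17 datasets, and outlines directions for future research.
Method: The authors group HAD methods into four categories (statistical models, representation-based methods, classical machine learning, and deep learning) and evaluate them on multiple datasets using metrics such as ROC and AUC.
Result: Deep learning models achieve the highest detection accuracy, while statistical models are the fastest to compute.
Insight: Future work could combine the accuracy of deep learning with the speed of statistical models, and still needs to address noise sensitivity and generalization.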
Abstract: Hyperspectral images are high-dimensional datasets consisting of hundreds of contiguous spectral bands, enabling detailed material and surface analysis. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability map to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future research. The research shows that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This study aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.
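As a rough illustration of the AUC metric mentioned in the abstract, the sketch below scores a detector by the probability that a randomly chosen anomalous pixel receives a higher anomaly score than a randomly chosen background pixel. The function name `roc_auc` and the flat score/label lists are assumptions for illustration; the survey's actual evaluation pipeline is not described here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly score per pixel (higher means more anomalous).
    labels: 1 for an anomalous pixel, 0 for background.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one background pixel")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomaly ranked above background
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of pixels, so it is meant for small examples; a rank-based implementation would be preferable for full hyperspectral scenes.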
[58] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
Yegyu Han,Taegyoon Yoon,Dayeon Woo,Sojeong Kim,Hyung-Sin Kim
Main category: cs.CV
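TL;DR: SenseShift6D is the first RGB-D dataset designed to test how variations in illumination and sensor settings affect 6D pose estimation. It provides many sensor configurations and lighting conditions and shows the advantage of test-time sensor control.
Details
Motivation: Existing 6D pose estimation datasets (such as LM-O, YCB-V, and T-Less) were captured under fixed illumination and camera settings and do not reflect real-world variation in lighting and sensors. The authors introduce SenseShift6D to fill this gap.
Contribution: 1. Introduces the SenseShift6D dataset, covering multiple RGB exposures, gains, depth-capture modes, and illumination levels. 2. Shows that test-time sensor control improves performance more than data augmentation. 3. Shows that jointly adapting multimodal RGB-D configurations improves performance further.
Method: The authors physically capture 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels, producing 101.9k RGB and 10k depth images. Experiments test how different sensor configurations affect 6D pose estimation models.
Result: Test-time sensor control is more effective than data augmentation, and jointly adapting the RGB and depth sensor configurations improves performance further.
Insight: The work shifts the 6D pose evaluation paradigm from data-centered to sensor-aware robustness and lays a foundation for adaptive perception systems in real-world environments.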
Abstract: Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode - and the potential of test-time sensor control to mitigate such variations - largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control during test-time induces greater performance improvement over digital data augmentation, achieving performance comparable to or better than costly increases in real-world training data quantity and diversity. Adapting either RGB or depth sensors individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at huggingface.co/datasets/Yegyu/SenseShift6D and the associated scripts can be found at github.com/yegyu-han/SenseShift6D
[59] Normal Patch Retinex Robust Alghoritm for White Balancing in Digital Microscopy
Radoslaw Roszczyk,Artur Krupa,Izabella Antoniuk
Main category: cs.CV
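TL;DR: The paper proposes an automatic white balance algorithm called Normal Patch Retinex, designed specifically for colour correction of microscope images, and validates it experimentally.
Details
Motivation: Achieving correct colour balance during microscope image acquisition is difficult, especially for the stained specimens commonly used in pathology.
Contribution: Proposes a new automatic white balance algorithm for microscope images, in particular specimens with hematoxylin-phloxine-saffron (HPS) staining and immunohistochemical staining.
Method: Uses the Normal Patch Retinex algorithm to correct the colour balance of microscope images through a fully automatic mechanism.
Result: The algorithm is validated on 200 microscope images and outperforms classical white balance algorithms from digital photography.
Insight: The method is particularly suited to stained images in pathology and offers a more effective approach to microscope image processing.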
Abstract: The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.
[60] DreamArt: Generating Interactable Articulated Objects from a Single Image
Ruijie Lu,Yu Liu,Jiaxiang Tang,Junfeng Ni,Yuxiang Wang,Diwen Wan,Gang Zeng,Yixin Chen,Siyuan Huang
Main category: cs.CV
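TL;DR: DreamArt is a new framework for generating interactable articulated objects from a single image, using a three-stage pipeline to produce high-quality articulated 3D assets.
Details
Motivation: Existing methods focus on surface geometry and texture and neglect part decomposition and articulation modeling, while neural reconstruction methods depend on multi-view or interaction data and are hard to scale. DreamArt aims to generate high-fidelity, interactable articulated assets from single-view images.
Contribution: 1. Proposes the DreamArt framework, which generates high-quality articulated 3D assets from a single image; 2. Combines part segmentation, a video diffusion model, and articulation optimization; 3. Shows experimentally that the generated objects have accurate part shapes, faithful appearance, and plausible articulation.
Method: 1. Reconstruct part-segmented, complete 3D meshes through image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion; 2. Fine-tune a video diffusion model to learn part-level articulation priors; 3. Optimize the articulation motion, represented by a dual quaternion, and perform global texture refinement.
Result: Experiments show that DreamArt generates high-quality articulated objects with accurate part shapes, faithful appearance, and plausible motion.
Insight: DreamArt shows how combining generative models with optimization can produce complex articulated objects from a single image, offering a scalable solution for AR/VR and embodied AI.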
Abstract: Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: firstly, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompt and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.
[61] TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model
Yujie Hu,Xuanyu Zhang,Weiqi Li,Jian Zhang
Main category: cs.CV
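TL;DR: TalkFashion is an intelligent virtual try-on assistant based on a multimodal large language model. It performs multifunctional virtual try-on from text instructions, including full outfit change and local editing, addressing the inflexibility of earlier methods.
Details
Motivation: Traditional virtual try-on methods rely mainly on end-to-end networks that perform a single task and lack versatility and flexibility. The paper uses the comprehension ability of large language models to achieve multifunctional virtual try-on guided solely by text instructions.
Contribution: 1. Proposes TalkFashion, which uses a large language model to analyze user instructions and activate different processing pipelines, enabling multifunctional virtual try-on; 2. Introduces an instruction-based local repainting model that requires no manually provided masks and makes local editing fully automatic.
Method: 1. Parse the text instruction with a multimodal large language model to determine the task type; 2. Design an instruction-driven local repainting model that avoids manual masks; 3. Route to the appropriate processing pipeline for full outfit change or local editing.
Result: Experiments show better semantic consistency and visual quality than existing methods.
Insight: Multimodal large language models can substantially increase the flexibility and automation of virtual try-on while reducing manual work for the user.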
Abstract: Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit change and local editing. Previous methods primarily relied on end-to-end networks to perform single try-on tasks, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, thereby activating different processing pipelines accordingly. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multi-modal models, this approach achieves fully automated local editing, enhancing the flexibility of editing tasks. The experimental results demonstrate better semantic consistency and visual quality compared to the current methods.
[62] SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning
Xin Hu,Ke Qin,Guiduo Duan,Ming Li,Yuan-Fang Li,Tao He
Main category: cs.CV
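TL;DR: SPADE is a spatial-aware denoising network that combines long-range and local context reasoning to improve open-vocabulary panoptic scene graph generation.
Details
Motivation: Existing open-vocabulary panoptic scene graph generation methods based on vision-language models are limited in spatial relation reasoning, which leads to poor relation prediction.
Contribution: Proposes the SPADE framework, comprising inversion-guided UNet calibration and spatial-aware context reasoning, and introduces a denoising diffusion model into panoptic scene graph generation.
Method: 1. Calibrate a pre-trained diffusion model with a lightweight LoRA-based fine-tuning strategy; 2. Design a spatial-aware relation graph transformer that captures both local and long-range context.
Result: On the PSG and Visual Genome datasets, SPADE outperforms existing methods in both closed-set and open-set scenarios, especially for spatial relationship prediction.
Insight: The inversion process of diffusion models preserves spatial structure well, and combining it with a transformer's long-range and local reasoning can significantly improve open-vocabulary relation prediction.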
Abstract: Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model’s inversion process in preserving the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork) framework – a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.
[63] DREAM: Document Reconstruction via End-to-end Autoregressive Model
Xin Li,Mingming Gong,Yunfei Wu,Jianxin Dai,Antai Guo,Xinghua Jiang,Haoyu Cao,Yinsong Liu,Deqiang Jiang,Xing Sun
Main category: cs.CV
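TL;DR: The paper proposes DREAM, an end-to-end autoregressive model for document reconstruction that addresses the error propagation and missing layout information of existing methods and performs well on several subtasks.
Details
Motivation: Document reconstruction is an important task in document analysis and recognition, but current multi-stage methods suffer from error propagation, and existing end-to-end methods do not preserve layout information. This motivates a new solution.
Contribution: 1. Proposes the DREAM model for end-to-end document reconstruction; 2. Defines a standardized document reconstruction task; 3. Introduces a new evaluation metric, DSM, and the DocRec1K dataset; 4. Shows that the model is competitive on multiple subtasks.
Method: An autoregressive model converts a document image into a reconstruction sequence that carries rich element information and preserves layout, and end-to-end training reduces error propagation.
Result: Experiments show that DREAM achieves the best performance on document reconstruction and performs well on subtasks such as layout analysis and text recognition.
Insight: An end-to-end autoregressive model can integrate document element information effectively, and a standardized task definition and evaluation metric help the field progress.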
Abstract: Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. A multitude of these researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these aforementioned limitations, we in this paper present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM transmutes the text image into a sequence of document reconstruction in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.
[64] Towards Solar Altitude Guided Scene Illumination
Samed Doğan,Maximilian Hoh,Nico Leuze,Nicolas R. -Peña,Alfred Schöttl
Main category: cs.CV
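TL;DR: The paper proposes using the solar altitude angle to guide scene illumination, generating synthetic camera sensor data under this global condition to address the scarcity of labeled data for daytime lighting variation.
Details
Motivation: Real-world data acquisition is costly and constrained, and labeled data covering daytime lighting variation is scarce, so synthetic data is needed to fill the gap.
Contribution: Proposes the solar altitude angle as a global conditioning variable that yields accurate lighting without extensive labeling.
Method: Computes the solar altitude from latitude-longitude coordinates and local time, and designs a tailored normalization that accounts for the sensitivity of daylight to small numeric changes in altitude.
Result: The method accurately captures lighting characteristics and illumination-dependent image noise in the context of diffusion models.
Insight: Solar altitude is a simple global conditioning variable that needs no extra labeling and can effectively guide the lighting of synthetic data.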
Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as a global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
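The claim that solar altitude is readily computable from coordinates and time can be illustrated with a minimal sketch. It uses a standard textbook approximation (a cosine model of solar declination, with no equation-of-time or refraction correction), takes the time in UTC rather than local time, and the function name `solar_altitude_deg` is an assumption. The paper's own computation and its tailored normalization are not specified in the abstract and are not reproduced here.

```python
import math
from datetime import datetime, timezone


def solar_altitude_deg(lat_deg, lon_deg, when_utc):
    """Approximate solar altitude in degrees above the horizon.

    Assumptions: approximate declination formula, no equation-of-time
    correction, no atmospheric refraction; longitude is positive east.
    """
    day = when_utc.timetuple().tm_yday
    # Approximate solar declination for the given day of the year.
    decl = math.radians(-23.44 * math.cos(math.radians(360.0 / 365.0 * (day + 10))))
    # Approximate local solar time from UTC and longitude (15 degrees per hour).
    utc_hours = when_utc.hour + when_utc.minute / 60.0 + when_utc.second / 3600.0
    solar_hours = (utc_hours + lon_deg / 15.0) % 24.0
    hour_angle = math.radians(15.0 * (solar_hours - 12.0))
    lat = math.radians(lat_deg)
    sin_alt = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # Clamp against rounding error before taking the arcsine.
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))
```

For example, `solar_altitude_deg(0.0, 0.0, datetime(2023, 3, 21, 12, 0, tzinfo=timezone.utc))` evaluates to a value close to 90 degrees, as expected at the equator near the March equinox. Because altitude changes of a few degrees around the horizon correspond to large changes in daylight, a raw value like this would likely need rescaling before being used as a conditioning signal, which is the role the abstract assigns to its normalization step.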
[65] Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework
Wang Wang,Mingyu Shi,Jun Jiang,Wenqian Ma,Chong Liu,Yasutaka Narazaki,Xuguang Wang
Main category: cs.CV
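TL;DR: The paper proposes a systematic framework for generating bridge digital twin data with complete point clouds, supporting the training of semantic segmentation and point cloud completion models, and reports strong results on real bridge analysis.
Details
Motivation: Bridges are critical transportation infrastructure facing aging and deterioration. Traditional manual inspection is inefficient, and existing 3D point cloud techniques are limited by missing labels and scanning occlusions, so a way to generate complete, richly annotated data is needed.
Contribution: Proposes a unified synthesis framework that automatically generates complete point clouds with component-level instance annotations, high-fidelity colour, and precise normal vectors, and that can be extended to produce diverse incomplete point clouds for training segmentation and completion networks.
Method: A systematic framework generates 3D bridge data, including complete point clouds and simulated incomplete point clouds, to support segmentation and completion tasks. Experiments use PointNet++ and KT-Net for validation.
Result: PointNet++ reaches an mIoU of 84.2% on real-world bridge semantic segmentation, and KT-Net performs well on component completion.
Insight: The work provides a new methodology and a foundational dataset for 3D visual analysis of bridge structures, advancing automated infrastructure management and maintenance.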
Abstract: As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.
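The mean Intersection over Union (mIoU) figure quoted above is a per-class average over point labels. The following is a minimal sketch of how such a score is commonly computed; the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions, since the abstract does not describe the evaluation protocol in detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for per-point integer labels.

    pred, target: equal-length sequences of class indices, one per point.
    Classes that appear in neither sequence are skipped (an assumption).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Points labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.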
[66] Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models
Léa Dubois,Klaus Schmidt,Chengyu Wang,Ji-Hoon Park,Lin Wang,Santiago Munoz
Main category: cs.CV
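TL;DR: The paper proposes a framework that fuses a Vision Foundation Model (VFM) with a Large Language Model (LLM) to handle high-level video cognition tasks such as causal reasoning and future prediction, addressing the lack of commonsense world knowledge in current models.
Details
Motivation: Current video understanding models are good at recognizing "what" is happening but fall short on high-level cognitive tasks such as causal reasoning and future prediction, mainly because they lack commonsense world knowledge.
Contribution: A framework that fuses a VFM with an LLM serving as a knowledge-driven reasoning core, together with a fusion module inspired by the Q-Former architecture.
Method: A two-stage training strategy: alignment pre-training on large-scale video-text data, followed by instruction fine-tuning on a curated dataset designed to elicit reasoning and prediction skills.
Result: The model achieves state-of-the-art performance on multiple challenging benchmarks and shows strong zero-shot generalization to unseen reasoning tasks.
Insight: The work moves machine perception from simple recognition toward cognitive understanding, with potential applications in robotics and human-computer interaction.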
Abstract: Current video understanding models excel at recognizing “what” is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.
[67] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos
Wenkang Zhang,Yan Zhao,Qiang Wang,Li Song,Zhengxue Cheng
Main category: cs.CV
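TL;DR: D-FCGS is a feedforward framework for compressing dynamic Gaussian Splatting. It extracts inter-frame motion with I-P frame coding and sparse control points, and uses a dual prior-aware entropy model to compress efficiently without per-scene optimization.
Details
Motivation: Free-viewpoint video (FVV) requires efficient compression of dynamic 3D representations, but existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability.
Contribution: The D-FCGS framework, which combines a Group-of-Frames (GoF) structure with I-P frame coding, motion extraction via sparse control points, and a dual prior-aware entropy model to compress dynamic Gaussian Splatting efficiently.
Method: Frames are organized in a GoF structure with I-P frame coding, and inter-frame motion is extracted via sparse control points. The motion tensors are compressed with a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors. Control-point-guided motion compensation and a refinement network improve reconstruction quality.
Result: The method achieves over 40 times compression in under 2 seconds while preserving visual quality across viewpoints, matching the rate-distortion performance of optimization-based methods.
Insight: Feedforward methods are promising for compressing dynamic 3D representations and offer a scalable path to FVV transmission and storage.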
Abstract: Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.
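To illustrate the Group-of-Frames idea mentioned in this entry, here is a minimal sketch that labels frames as I or P. It assumes the simplest layout, in which the first frame of each group is an I-frame and the rest are P-frames coded relative to an earlier frame; the function name `assign_frame_types` and the group size are hypothetical, and the paper's actual grouping may differ.

```python
def assign_frame_types(num_frames, gof_size):
    """Label each frame 'I' (first in its group) or 'P' (coded from inter-frame motion)."""
    if gof_size < 1:
        raise ValueError("gof_size must be at least 1")
    return ["I" if i % gof_size == 0 else "P" for i in range(num_frames)]


# Hypothetical sequence of 7 frames with groups of 3.
print(assign_frame_types(7, 3))  # ['I', 'P', 'P', 'I', 'P', 'P', 'I']
```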
[68] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing
Xianzhi Ma,Jianhui Li,Changhua Pei,Hao Liu
Main category: cs.CV
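TL;DR: GeoMag is an end-to-end, general-purpose vision-language framework for multi-granularity parsing of remote sensing images. It dynamically adjusts its attention scope and crops adaptively to improve small-object recognition and reduce computational cost.
Details
Motivation: Existing remote sensing vision-language models (RS-VLMs) perform poorly on pixel-level tasks and small-object recognition, and they are computationally expensive on high-resolution images, so a more efficient approach is needed.
Contribution: The GeoMag framework, which introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC) to improve pixel-level performance while reducing computational overhead.
Method: TMRA and PSC adaptively reduce the spatial resolution of task-irrelevant regions and enhance the visual representation of task-relevant areas, focusing the model's attention and lowering compute.
Result: Across 10 benchmarks, GeoMag excels on pixel-level tasks and remains competitive on tasks of other granularities.
Insight: Dynamic attention scope and adaptive cropping let the model process high-resolution remote sensing imagery more efficiently while improving small-object recognition.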
Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model’s perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.
[69] What You Have is What You Track: Adaptive and Robust Multimodal Tracking
Yuedong Tan,Jiawei Shao,Eduard Zamfir,Ruanjun Li,Zhaochong An,Chao Ma,Danda Paudel,Luc Van Gool,Radu Timofte,Zongwei Wu
Main category: cs.CV
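TL;DR: The paper studies the role of multimodal data in visual tracking and proposes a flexible framework for handling missing data. A Heterogeneous Mixture-of-Experts mechanism with adaptive complexity and a video-level masking strategy yield robust multimodal tracking.
Details
Motivation: Multimodal data improves robustness in visual tracking, but sensor synchronization problems often leave data missing. Existing trackers have rigid architectures that cannot adapt to missing modalities, so their performance degrades significantly.
Contribution: 1) The first comprehensive study of tracking performance with temporally incomplete multimodal data; 2) a flexible framework that combines a Heterogeneous Mixture-of-Experts mechanism with adaptive complexity and a video-level masking strategy, adapting to both missing rates and scene complexity.
Method: 1) A Heterogeneous Mixture-of-Experts mechanism dynamically activates computational units; 2) a video-level masking strategy ensures temporal consistency and spatial completeness.
Result: The model achieves SOTA performance across 9 benchmarks, in both complete and missing-modality settings.
Insight: A tracker should adapt not only to missing data but also to scene complexity; combining the mixture mechanism with the masking strategy is key.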
Abstract: Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
[70] On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification
Jonas Klotz,Tom Burgert,Begüm Demir
Main category: cs.CV
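TL;DR: The paper investigates how effective explainable AI (xAI) methods and evaluation metrics are for remote sensing image scene classification, analyzes the limitations of five feature attribution methods and ten evaluation metrics, and provides selection guidelines for remote sensing scenes.
Details
Motivation: Most xAI methods and evaluation metrics used in remote sensing scene classification are borrowed directly from the natural-image domain, and their suitability has not been verified. The paper aims to fill this gap by analyzing their effectiveness on remote sensing images.
Contribution: (1) A systematic analysis of five feature attribution methods and ten evaluation metrics on remote sensing images; (2) an account of the limitations of these methods and metrics; (3) recommendations for selecting xAI methods in remote sensing scenarios.
Method: The paper combines methodological and experimental analysis. It evaluates five feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) and ten metrics spanning five categories (faithfulness, robustness, localization, complexity, and randomization) on three remote sensing datasets.
Result: Perturbation-based methods (such as Occlusion and LIME) depend heavily on the perturbation baseline and the spatial characteristics of the scene. Gradient-based methods (such as GradCAM) struggle in multi-label scenes. Localization and complexity metrics are unreliable for classes with a large spatial extent, whereas robustness and randomization metrics are more stable.
Insight: Directly transferring xAI methods and metrics from natural images may not suit remote sensing scenes; methods should be chosen according to remote sensing characteristics such as spatial distribution and multiple labels. Robustness and randomization metrics are the more reliable choices.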
Abstract: The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
[71] High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Xinyu Huang,Yuhao Dong,Weiwei Tian,Bo Li,Rui Feng,Ziwei Liu
Main category: cs.CV
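TL;DR: To address visual redundancy when large multi-modal models (LMMs) process high-resolution images, the paper proposes MGPO, a reinforcement learning method built on a multi-turn conversation framework. It improves reasoning by automatically cropping key visual regions and needs no costly grounding annotations.
Details
Motivation: Existing LMMs are inefficient on high-resolution images because the inputs become a very large number of visual tokens, many of them irrelevant to the task. Supervised fine-tuning also requires costly grounding annotations, which limits scalability.
Contribution: 1. The MGPO framework, which uses reinforcement learning to focus automatically on key visual regions across multiple conversation turns; 2. evidence that robust grounding abilities can emerge in LMMs during RL training with only a binary reward; 3. a multi-turn conversational template that addresses the cold start problem.
Method: MGPO is an RL framework that crops sub-images from model-predicted grounding coordinates to focus progressively on key regions. It uses a multi-turn conversational template and restricts policy loss computation to model outputs generated across the dialogue rounds to stabilize optimization.
Result: Trained on standard visual question answering data without grounding annotations, MGPO elicits stronger grounding than GRPO, improving by 5.4% on in-distribution MME-Realworld and by 5.2% on the out-of-distribution (OOD) V* Bench. MGPO post-training of Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o on the OOD V* Bench.
Insight: Reinforcement learning can improve the visual grounding of LMMs without additional annotations, and a multi-turn conversation framework effectively addresses the cold start problem.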
Abstract: State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4% improvement on in-distribution MME-Realworld and 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
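Two mechanical pieces of the approach described in this entry, cropping a sub-image from predicted coordinates and scoring the final answer with a binary reward, can be sketched as follows. This is a minimal illustration under stated assumptions, not the MGPO implementation: images are plain nested lists, boxes are `(x1, y1, x2, y2)` in pixels, and the reward uses case-insensitive exact match, which is an assumed criterion. All function names are hypothetical.

```python
def clamp_box(box, width, height):
    """Clamp an (x1, y1, x2, y2) box to the image bounds and fix corner ordering."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(width, x1)), max(0, min(width, x2))))
    y1, y2 = sorted((max(0, min(height, y1)), max(0, min(height, y2))))
    return x1, y1, x2, y2


def crop(image, box):
    """Crop a row-major nested-list image to the clamped box."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = clamp_box(box, width, height)
    return [row[x1:x2] for row in image[y1:y2]]


def binary_reward(predicted_answer, reference_answer):
    """Return 1.0 when the final answer matches the reference, else 0.0 (assumed exact match)."""
    same = predicted_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0


# Hypothetical 4x4 image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop(image, (1, 1, 3, 5)))  # the box is clamped to the image height first
print(binary_reward(" Red ", "red"))
```

Clamping before cropping matters because model-predicted coordinates are not guaranteed to lie inside the image.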
[72] Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation
Quanzhu Niu,Yikang Zhou,Shihao Chen,Tao Zhang,Shunping Ji
Main category: cs.CV
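TL;DR: The paper improves the robustness of video instance segmentation (VIS) by introducing geometric awareness through depth estimation. It studies three depth integration methods, two of which (EDC and SV) bring significant gains; EDC sets a new state of the art of 56.2 AP on the OVIS benchmark.
Details
Motivation: VIS struggles with occlusion, motion blur, and appearance variation. The paper aims to improve segmentation robustness using geometric information from depth estimation.
Contribution: 1) A systematic study of three ways to integrate depth into VIS (EDC, SV, and DS); 2) evidence that EDC and SV significantly improve VIS performance; 3) a new state-of-the-art result on the OVIS benchmark (56.2 AP).
Method: Three depth integration methods are studied: 1) Expanding Depth Channel (EDC) concatenates the depth map as an input channel to the segmentation network; 2) Sharing ViT (SV) uses a single ViT backbone shared between the depth estimation and segmentation branches; 3) Depth Supervision (DS) uses depth prediction as an auxiliary supervision signal for feature learning. DS shows limited effectiveness, while EDC and SV are effective.
Result: EDC and SV significantly improve VIS robustness. With a Swin-L backbone, EDC reaches 56.2 AP on the OVIS benchmark, a new state-of-the-art result.
Insight: Depth information (geometric cues) is a key factor for robust video understanding, especially under occlusion and motion blur.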
Abstract: Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. Expanding Depth Channel (EDC) method concatenates the depth map as input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone, shared between depth estimation and segmentation branches; Depth Supervision (DS) makes use of depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. When with Swin-L backbone, our EDC method gets 56.2 AP, which sets a new state-of-the-art result on OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
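The Expanding Depth Channel (EDC) idea in this entry, concatenating a depth map to the input as an extra channel, comes down to a single array operation. The sketch below assumes channel-last `(H, W, 3)` frames and an `(H, W)` depth map already aligned to the frame; the function name `expand_depth_channel` is hypothetical, and the paper's preprocessing (for example depth normalization) is not shown.

```python
import numpy as np


def expand_depth_channel(rgb, depth):
    """Stack an (H, W) depth map onto an (H, W, 3) RGB frame, giving (H, W, 4)."""
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    # depth[..., None] adds a trailing channel axis so the arrays can be concatenated.
    return np.concatenate([rgb, depth[..., None]], axis=-1)


# Hypothetical 4x6 frame with a constant depth map.
rgb = np.zeros((4, 6, 3), dtype=np.float32)
depth = np.ones((4, 6), dtype=np.float32)
rgbd = expand_depth_channel(rgb, depth)
print(rgbd.shape)  # (4, 6, 4)
```

A network consuming this input needs a first layer that accepts four channels instead of three.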
[73] High-Fidelity and Generalizable Neural Surface Reconstruction with Sparse Feature Volumes
Aoxiang Fan,Corentin Dumery,Nicolas Talabot,Hieu Le,Pascal Fua
Main category: cs.CV
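TL;DR: The paper proposes a neural surface reconstruction method based on sparse feature volumes. It substantially raises reconstruction resolution and memory efficiency, needs no per-scene optimization, and is more accurate than existing methods on public datasets.
Details
Motivation: Neural surface reconstruction methods built on dense 3D feature volumes hit memory and compute bottlenecks as voxel resolution increases, which limits reconstruction quality. The paper addresses this with a sparse representation.
Contribution: 1. A sparse feature volume representation that cuts storage requirements by more than 50 times. 2. High-resolution reconstruction at $512^3$, compared to the typical $128^3$. 3. New algorithms for efficient sampling, feature aggregation, and querying that support sparse volumes.
Method: A two-stage approach: 1. train a network to predict voxel occupancies from posed images and associated depth maps; 2. compute features and perform volume rendering only in voxels with sufficiently high occupancy estimates. Custom algorithms support operations on the sparse volume.
Result: On public datasets, the method reduces storage requirements by more than 50 times without performance degradation, supports reconstruction at $512^3$ resolution, and is more accurate than existing methods.
Insight: Sparse representation is an effective route to high-resolution neural surface reconstruction, and custom algorithms can overcome the dense-volume assumptions of existing work to improve memory and compute efficiency.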
Abstract: Generalizable neural surface reconstruction has become a compelling technique to reconstruct from few images without per-scene optimization, where dense 3D feature volume has proven effective as a global representation of scenes. However, the dense representation does not scale well to increasing voxel resolutions, severely limiting the reconstruction quality. We thus present a sparse representation method, that maximizes memory efficiency and enables significantly higher resolution reconstructions on standard hardware. We implement this through a two-stage approach: First training a network to predict voxel occupancies from posed images and associated depth maps, then computing features and performing volume rendering only in voxels with sufficiently high occupancy estimates. To support this sparse representation, we developed custom algorithms for efficient sampling, feature aggregation, and querying from sparse volumes-overcoming the dense-volume assumptions inherent in existing works. Experiments on public datasets demonstrate that our approach reduces storage requirements by more than 50 times without performance degradation, enabling reconstructions at $512^3$ resolution compared to the typical $128^3$ on similar hardware, and achieving superior reconstruction accuracy over current state-of-the-art methods.
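Back-of-the-envelope arithmetic shows why dense volumes scale poorly with resolution, as this entry notes. The sketch below is illustrative only: the channel count, occupancy fraction, and per-voxel index overhead are hypothetical values, not figures from the paper, and the result is not meant to reproduce its reported 50-times reduction.

```python
def dense_volume_bytes(resolution, channels, bytes_per_value=4):
    """Memory for a dense feature volume of resolution^3 voxels."""
    return resolution ** 3 * channels * bytes_per_value


def sparse_volume_bytes(resolution, channels, occupancy,
                        bytes_per_value=4, index_bytes=12):
    """Memory when only an `occupancy` fraction of voxels stores features.

    index_bytes is an assumed per-voxel cost for storing integer coordinates.
    """
    occupied = int(resolution ** 3 * occupancy)
    return occupied * (channels * bytes_per_value + index_bytes)


CHANNELS = 16      # hypothetical feature width
OCCUPANCY = 0.01   # hypothetical fraction of voxels near the surface

dense_128 = dense_volume_bytes(128, CHANNELS)
dense_512 = dense_volume_bytes(512, CHANNELS)
sparse_512 = sparse_volume_bytes(512, CHANNELS, OCCUPANCY)

print(dense_512 // dense_128)   # 64: four times the resolution costs 4^3 times the memory
print(dense_512 / 2 ** 30)      # dense 512^3 volume in GiB
print(sparse_512 / 2 ** 20)     # sparse 512^3 volume in MiB
```

Because surfaces occupy a thin shell of the volume, the occupied fraction tends to shrink as resolution grows, which is what makes the sparse layout attractive.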
[74] Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
Zhenghao Zhang,Junchao Liao,Xiangyu Meng,Long Qin,Weizhi Wang
Main category: cs.CV
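TL;DR: Tora2 is an enhanced version of Tora that uses a decoupled personalization extractor and a gated self-attention mechanism to customize both appearance and motion in multi-entity video generation.
Details
Motivation: Existing video generation methods preserve too little detail in multi-entity customization and align multimodal conditions imprecisely. Tora2 targets these problems.
Contribution: 1. A decoupled personalization extractor that generates personalization embeddings for multiple entities; 2. a gated self-attention mechanism that improves multimodal condition alignment; 3. a contrastive loss that jointly optimizes trajectory dynamics and entity consistency.
Method: A decoupled personalization extractor captures entity details, gated self-attention fuses trajectory, text, and visual information, and a contrastive loss optimizes the mapping between motion and personalization embeddings.
Result: Tora2 performs well in multi-entity video generation, supports advanced motion control, and is competitive with state-of-the-art customization methods.
Insight: The decoupled extractor and gating mechanism are the key innovations for multi-entity customized video generation, and the joint loss further improves generation quality.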
Abstract: Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://github.com/alibaba/Tora .
[75] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval
Haiwen Li,Delong Liu,Zhaohui Hou,Zhicheng Zhao,Fei Su
Main category: cs.CV
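TL;DR: The paper proposes a pipeline for automatically generating high-quality triplet data, builds the synthetic dataset CIRHS, and introduces a new CIR framework, CoAlign. It shows for the first time that CIR models can be trained on a fully synthetic dataset, with strong results in both zero-shot and supervised settings.
Details
Motivation: Existing Composed Image Retrieval (CIR) methods rely on costly, manually labeled triplets, which limits scalability and zero-shot capability. The paper addresses this by generating triplets automatically.
Contribution: 1. A pipeline for automatic triplet generation and the resulting synthetic dataset CIRHS; 2. a new CIR framework, CoAlign, that combines global alignment with local reasoning; 3. the first demonstration that CIR models can be trained on a fully synthetic dataset.
Method: 1. A large language model (LLM) generates diverse prompts that control a text-to-image generative model to produce image pairs; 2. the pairs are filtered and reorganized to form the CIRHS dataset; 3. the Hybrid Contextual Alignment (CoAlign) framework performs global alignment and local reasoning.
Result: CoAlign achieves strong zero-shot performance on three commonly used benchmarks and, under supervised training, outperforms state-of-the-art supervised CIR methods.
Insight: Synthetic datasets can replace manually labeled data for efficient and scalable CIR training, and combining global alignment with local reasoning yields more robust and informative representations.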
Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
[76] Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence Knowledge
Xin Wu,Fei Teng,Yue Feng,Kaibo Shi,Zhuosheng Lin,Ji Zhang,James Wang
Main category: cs.CV
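TL;DR: The paper proposes Semantic Co-occurrence Insight Network (SCINet), a partial multi-label learning framework that uses semantic co-occurrence knowledge and a cross-modality fusion module to handle incompletely annotated data.
Details
Motivation: Partial multi-label learning works with incompletely annotated data containing known correct labels, known incorrect labels, and unknown labels. The core challenge is accurately identifying the ambiguous relationships between labels and instances.
Contribution: 1) The SCINet framework, which uses a bi-dominant prompter module and a cross-modality fusion module to strengthen semantic alignment; 2) an intrinsic semantic augmentation strategy that improves the model's understanding of data semantics.
Method: 1) An off-the-shelf multimodal model captures text-image correlations; 2) a cross-modality fusion module jointly models inter-label correlations, inter-instance relationships, and instance-label co-occurrence patterns; 3) diverse image transformations augment the data semantics.
Result: Experiments on four benchmark datasets show that SCINet surpasses state-of-the-art methods.
Insight: Semantic co-occurrence patterns between labels and instances are key to partial multi-label learning, and cross-modality fusion together with semantic augmentation improves model performance.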
Abstract: Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model’s understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.
[77] Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation
Haroon Wahab,Hassan Ugail,Lujain Jaleel
Main category: cs.CV
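TL;DR: The paper proposes an ensemble approach that combines the prediction probabilities of several state-of-the-art deepfake detection models to improve generalization across diverse datasets.
Details
Motivation: Existing deepfake detection models perform well on benchmark datasets but degrade significantly on out-of-distribution data. The authors study ensembling as a way to improve cross-dataset generalization.
Contribution: An asymmetric ensembling approach that combines prediction probabilities from several state-of-the-art models and delivers more stable and reliable deepfake detection in real-world settings.
Method: Building on an open-source benchmark, the authors combine prediction probabilities from several asymmetric models proposed at top venues and evaluate them across datasets.
Result: No single model performs best in every setting, whereas the ensemble provides more stable and reliable performance in all tested scenarios.
Insight: Asymmetric ensembling is a scalable and robust solution, particularly for real-world cases where the forgery type or quality is unknown.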
Abstract: Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets, yet their performance often deteriorates significantly when evaluated on out-of-distribution data. In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems across diverse datasets. Building on a recent open-source benchmark, we combine prediction probabilities from several state-of-the-art asymmetric models proposed at top venues. Our experiments span two distinct out-of-domain datasets and demonstrate that no single model consistently outperforms others across settings. In contrast, ensemble-based predictions provide more stable and reliable performance in all scenarios. Our results suggest that asymmetric ensembling offers a robust and scalable solution for real-world deepfake detection where prior knowledge of forgery type or quality is often unavailable.
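The abstract states only that prediction probabilities from several detectors are combined. One simple way to do that is a (possibly weighted) average, sketched below. Averaging is an assumption chosen for illustration, not the paper's confirmed combination rule, and the function name `ensemble_probability` and the sample scores are hypothetical.

```python
def ensemble_probability(probabilities, weights=None):
    """Combine per-model fake probabilities for one sample by weighted averaging."""
    if not probabilities:
        raise ValueError("at least one model probability is required")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Hypothetical scores from three detectors for one video.
scores = [0.9, 0.6, 0.3]
print(ensemble_probability(scores))                  # plain average
print(ensemble_probability(scores, [2.0, 1.0, 1.0])) # trust the first model more
```

Averaging dampens the effect of any single detector failing on an unfamiliar forgery type, which is consistent with the stability the entry reports.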
[78] TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision
Syeda Anshrah Gillani,Mirza Samad Ahmed Baig,Osama Ahmed Khan,Shahid Munir Shah,Umema Mujeeb,Maheen Ali
Main category: cs.CV
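TL;DR: The paper proposes Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), a diffusion framework that addresses unreadable and misspelled text in text-to-image generation. With a dual-stream text encoder, a character-aware attention mechanism, and OCR-guided supervision, GCDA sets a new state of the art in text rendering.
Details
Motivation: Existing text-to-image diffusion models cannot generate readable, correctly spelled text in images, which limits practical uses such as advertising, education, and creative design.
Contribution: 1. The GCDA framework, which extends diffusion models so they can generate legible text; 2. a dual-stream text encoder that combines semantic and glyph information; 3. a character-aware attention mechanism with an attention segregation loss; 4. OCR-guided supervision for fine-tuning.
Method: 1. Dual-stream text encoder: encodes both semantic context and explicit glyph representations; 2. character-aware attention: an attention segregation loss constrains each character's attention distribution to avoid distortion artifacts; 3. OCR-in-the-loop fine-tuning: a full-text perceptual loss optimizes for legibility and correct spelling.
Result: On benchmark datasets including MARIO-10M and T2I-CompBench, GCDA sets a new state of the art in text rendering (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25) and human perception, with comparable image synthesis quality (FID: 14.3).
Insight: 1. Explicit modeling of glyph information is essential for text generation; 2. character-aware attention helps avoid text distortion; 3. OCR supervision can serve as an effective optimization target for generative models.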
Abstract: The modern text-to-image diffusion models boom has opened a new era in digital content production as it has proven the previously unseen ability to produce photorealistic and stylistically diverse imagery based on the semantics of natural-language descriptions. However, the consistent disadvantage of these models is that they cannot generate readable, meaningful, and correctly spelled text in generated images, which significantly limits the use of practical purposes like advertising, learning, and creative design. This paper introduces a new framework, namely Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), using which a typical diffusion backbone is extended by three well-designed modules. To begin with, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, resulting in a character-aware representation of the input text that is rich in nature. Second, an attention mechanism that is aware of the character is proposed with a new attention segregation loss that aims to limit the attention distribution of each character independently in order to avoid distortion artifacts. Lastly, GCDA has an OCR-in-the-loop fine-tuning phase, where a full text perceptual loss, directly optimises models to be legible and accurately spell. Large scale experiments to benchmark datasets, such as MARIO-10M and T2I-CompBench, reveal that GCDA sets a new state-of-the-art on all metrics, with better character based metrics on text rendering (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25), human perception, and comparable image synthesis quality on high-fidelity (FID: 14.3).
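The Character Error Rate and Word Error Rate quoted in this entry are conventionally computed as edit distance divided by reference length. The sketch below shows that standard definition; it is not the paper's evaluation script, and the sample strings are hypothetical OCR output.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical target text and OCR reading of the generated image.
print(character_error_rate("OPEN DAILY", "OPEN DALLY"))  # one wrong character out of ten
print(word_error_rate("OPEN DAILY", "OPEN DALLY"))       # one wrong word out of two
```

A single wrong character makes the whole word wrong, which is why WER is usually higher than CER, as in the figures reported for this entry.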
[79] VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
Alexandre Symeonidis-Herzig,Özge Mercanoğlu Sincan,Richard Bowden
Main category: cs.CV
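TL;DR: VisualSpeaker is a method that uses photorealistic differentiable rendering supervised by visual speech recognition to improve the quality and perceptual realism of 3D facial animation.
Details
Motivation: Existing 3D facial animation methods work mainly in the mesh domain and cannot fully exploit rapid advances in 2D computer vision and graphics.
Contribution: A perceptual lip-reading loss based on photorealistic 3D Gaussian Splatting renders and visual speech recognition, which improves animation quality.
Method: Photorealistic 3D Gaussian Splatting avatar renders are passed through a pre-trained Visual Automatic Speech Recognition model during training, and the resulting loss supervises the 3D facial animation.
Result: On the MEAD dataset, the Lip Vertex Error metric improves by 56.1% while the controllability of mesh-driven animation is retained.
Insight: A perception-driven lip-reading loss produces more accurate mouthings, the cues that disambiguate similar manual signs in sign language avatars.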
Abstract: Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.
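The entry does not define Lip Vertex Error. A common convention in speech-driven animation work is to take the maximum L2 error over lip vertices in each frame and average over frames. The sketch below follows that assumed convention (some implementations use squared distances instead); the function name `lip_vertex_error` and the toy coordinates are hypothetical.

```python
import math


def lip_vertex_error(pred_frames, gt_frames):
    """Mean over frames of the maximum L2 distance across lip vertices.

    Each frame is a list of (x, y, z) lip-vertex coordinates.
    """
    if len(pred_frames) != len(gt_frames) or not pred_frames:
        raise ValueError("need the same non-zero number of frames")
    per_frame_max = [
        max(math.dist(p, g) for p, g in zip(pred, gt))
        for pred, gt in zip(pred_frames, gt_frames)
    ]
    return sum(per_frame_max) / len(per_frame_max)


# Hypothetical two-frame sequence with two lip vertices per frame.
pred = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
gt = [[(0, 0, 0), (1, 0, 2)], [(0, 3, 4), (0, 0, 0)]]
print(lip_vertex_error(pred, gt))  # per-frame maxima are 2.0 and 5.0
```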
[80] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
Chang Liu,Ye Pan,Chenyang Ding,Susanto Rahardja,Xiaokang Yang
Main category: cs.CV
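TL;DR: MEDTalk is a multimodally controlled 3D facial animation framework that disentangles content and emotion embedding spaces to control dynamic emotions and precise lip synchronization.
Details
Motivation: Existing methods usually rely on static, predefined emotion labels, which limits the diversity and naturalness of generated expressions. MEDTalk aims to provide finer-grained dynamic emotion control and multimodal input.
Contribution: 1. Disentangled emotion and content embedding spaces that allow independent control; 2. dynamic adjustment of emotional expression from audio and speech text; 3. support for multimodal inputs (such as text descriptions and reference images) for more personalized control.
Method: 1. A cross-reconstruction process disentangles emotion and content; 2. audio and speech text are combined to predict frame-wise emotion intensity variations; 3. multimodal inputs guide the final generation.
Result: The generated 3D facial animations express dynamic emotions naturally with precise lip synchronization and can be integrated into industrial production pipelines (for example, MetaHuman).
Insight: Through disentanglement and multimodal input, MEDTalk makes emotional expression more dynamic and gives users more control, offering a new direction for 3D animation generation.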
Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline.
[81] MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
Tongtong Cheng,Rongzhen Li,Yixin Xiong,Tao Zhang,Jing Wang,Kai Liu
Main category: cs.CV
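TL;DR: MCAM is a multimodal causal analysis model for autonomous driving video understanding. It addresses three weaknesses of existing methods: shallow causality, spurious cross-modal correlations, and neglect of ego-vehicle-level causal modeling.
Details
Motivation: Autonomous driving video understanding requires accurate behavior recognition and reasoning, but existing methods struggle with spurious correlations across modalities and with modeling causality at the ego-vehicle level.
Contribution: 1. A multi-level feature extractor that captures long-range dependencies; 2. a causal analysis module that dynamically models driving scenarios as a directed acyclic graph (DAG) of driving states; 3. a vision-language transformer that aligns critical visual features with their linguistic expressions.
Method: The model combines a multi-level feature extractor, a causal analysis module (DAG modeling), and a vision-language transformer to learn and reason about multimodal causal relationships.
Result: MCAM achieves SOTA performance on the BDD-X and CoVLA datasets, demonstrating its effectiveness for autonomous driving applications.
Insight: With dynamic causal modeling and cross-modal alignment, MCAM captures causal relationships in video sequences more effectively and improves understanding of driving scenes.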
Abstract: Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often tend to dig out the shallow causal, fail to address spurious correlations across modalities, and ignore the ego-vehicle level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X, and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM.
[82] Discontinuity-aware Normal Integration for Generic Central Camera Models
Francesco Milano,Manuel López-Antequera,Naina Dhingra,Roland Siegwart,Robert Thiel
Main category: cs.CV
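TL;DR: This paper proposes a new normal integration method that explicitly handles depth discontinuities and supports generic central camera models; built on a local planarity assumption, it yields more accurate 3D surface reconstruction.
Details
Motivation: Existing normal integration methods typically handle depth discontinuities only implicitly and are limited to orthographic or ideal pinhole camera models, which restricts their use in complex scenes and with generic cameras.
Contribution: A normal integration method for generic central cameras that models depth discontinuities explicitly, using a local planarity assumption expressed as constraints between surface normals and ray directions, which improves both accuracy and range of application.
Method: Based on a local planarity assumption, the method builds constraints between surface normals and ray directions, explicitly models depth discontinuities, and supports generic central camera models.
Result: It achieves SOTA results on the standard normal integration benchmark and is the first method to directly handle generic central camera models.
Insight: Constraining the relation between surface normals and ray directions, while modeling discontinuities explicitly, approximates the relation between depth and normals more accurately and broadens the settings in which normal integration applies.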
Abstract: Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle the presence of depth discontinuities only implicitly and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, which we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
[83] CAST-Phys: Contactless Affective States Through Physiological signals Database
Joaquim Comas,Alexander Joel Vera,Xavier Vives,Eleonora De Filippi,Alexandre Pereda,Federico Sukno
Main category: cs.CV
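TL;DR: The paper presents CAST-Phys, a new multimodal dataset for contactless emotion recognition, aimed at the scarcity of genuine emotion data in affective computing and the interference caused by contact-based devices; it demonstrates the potential of physiological signals for contactless emotion recognition.
Details
Motivation: In affective computing, most existing datasets are collected with contact-based devices, which can interfere with genuine emotional responses, and the shortage of multimodal data limits the accuracy of emotion recognition systems.
Contribution: CAST-Phys, a dataset explicitly designed for contactless multimodal physiological emotion recognition, containing high-quality facial video and several physiological signals (PPG, EDA, RR).
Method: The dataset pairs high-resolution facial video with physiological signals so that the signals can be recovered remotely, and the authors evaluate emotion recognition with individual and fused modalities.
Result: Physiological signals prove crucial for emotion recognition in realistic scenarios, and multimodal fusion improves the accuracy of contactless emotion recognition.
Insight: Combining physiological signals with facial features compensates for the weaknesses of any single modality and points to new research directions for contactless emotion recognition.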
Abstract: In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.
[84] Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification
Murilo Gustineli,Anthony Miyaguchi,Adrian Cheung,Divyansh Khattak
Main category: cs.CV
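TL;DR: This paper presents a ViT-based zero-shot method for multi-species plant identification that combines a tiling strategy with visual-cluster priors and placed second in the PlantCLEF 2025 challenge.
Details
Motivation: To address multi-species plant identification in a zero-shot setting, improving classification without additional training by relying only on visual clustering and geolocation filtering.
Contribution: An efficient zero-shot plant identification method that combines tile-based ViT inference with visual-cluster priors.
Method: (1) Tile-level inference with ViTD2PC24All; (2) a 4x4 tiling strategy; (3) visual clustering with PaCMAP + K-Means, plus geolocation filtering.
Result: A macro-averaged F1 score of 0.348 on the private leaderboard.
Insight: Visual clustering and geographic information can improve zero-shot classification, and tiling makes effective use of the ViT's receptive field.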
Abstract: We describe DS@GT’s second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network’s 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
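The tiling and aggregation steps in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `tile_boxes` and `aggregate_votes`, the per-tile top-1 voting, and the multiplicative use of the prior are all assumptions made for illustration, and the actual pipeline in the linked repository may differ.

```python
from collections import Counter


def tile_boxes(width, height, grid=4):
    """Return (left, top, right, bottom) boxes that cover the image in a grid x grid layout."""
    boxes = []
    for row in range(grid):
        for col in range(grid):
            left = col * width // grid
            top = row * height // grid
            right = (col + 1) * width // grid
            bottom = (row + 1) * height // grid
            boxes.append((left, top, right, bottom))
    return boxes


def aggregate_votes(tile_labels, prior=None):
    """Count per-tile top-1 labels, optionally re-weight by a prior, and normalize.

    `prior` is a hypothetical dict mapping label -> prior weight (for example,
    derived from a visual cluster); labels missing from it get weight 0.
    """
    votes = Counter(tile_labels)
    scores = {}
    for label, count in votes.items():
        weight = 1.0 if prior is None else prior.get(label, 0.0)
        scores[label] = count * weight
    total = sum(scores.values())
    if total == 0:
        return {}
    return {label: score / total for label, score in scores.items()}
```

With a 2072x2072 image, `tile_boxes(2072, 2072)` yields sixteen 518x518 tiles, which matches the input size quoted in the abstract. The prior simply scales the vote counts before normalization, so a species the cluster prior rules out receives a score of zero.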
[85] Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering
Jiayi Song,Zihan Ye,Qingyuan Zhou,Weidong Yang,Ben Fei,Jingyi Xu,Ying He,Wanli Ouyang
Main category: cs.CV
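TL;DR: The paper proposes Ref-Unlock, a geometry-aware reflection disentanglement framework based on 3D Gaussian Splatting that improves geometric consistency in scenes with complex reflections and significantly outperforms classical GS-based reflection methods.
Details
Motivation: Existing methods such as NeRF and 3DGS tend to mistake reflections for physical geometry when handling reflective surfaces, which degrades reconstruction quality. Earlier geometric constraints are incomplete and generalize poorly, which makes the problem worse.
Contribution: (1) The Ref-Unlock framework, which explicitly disentangles transmitted and reflected components; (2) high-order spherical harmonics and a reflection removal module to refine detail; (3) pseudo-depth maps and a geometry-aware smoothness constraint to improve geometric consistency.
Method: Built on 3D Gaussian Splatting, the method uses a dual-branch representation (transmission and reflection) with high-order spherical harmonics and a reflection removal module, and strengthens geometric consistency through pseudo-depth maps and a bilateral smoothness constraint.
Result: In experiments, Ref-Unlock significantly outperforms classical GS-based reflection methods, is competitive with NeRF-based models, and supports flexible reflection editing.
Insight: Explicitly disentangling the reflected component and adding geometric constraints are key to better reconstruction of scenes with complex reflections.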
Abstract: Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible reflection editing driven by vision foundation models (VFMs). Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at https://ref-unlock.github.io/.
[86] Omni-Video: Democratizing Unified Video Understanding and Generation
Zhiyu Tan,Hao Yang,Luozheng Qin,Jia Gong,Mengping Yang,Hao Li
Main category: cs.CV
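TL;DR: The paper presents Omni-Video, an efficient unified framework for video understanding, generation, and instruction-based editing, in which multimodal large language models (MLLMs) produce visual clues that serve as input to diffusion decoders.
Details
Motivation: Current foundation models focus mainly on images, while unified models for video understanding and generation lag behind; this work aims to fill that gap.
Contribution: (1) The Omni-Video framework, which couples MLLMs with diffusion decoders for video tasks; (2) a lightweight architecture and an efficient multi-stage training scheme.
Method: (1) A vision head attached to the MLLM produces visual tokens; (2) an adapter maps these tokens into the conditional input space of the diffusion decoder; (3) a multi-stage training scheme connects the MLLM and the diffusion decoder.
Result: The model shows satisfactory generalization across video generation, editing, and understanding tasks.
Insight: Building on the capabilities of existing MLLMs enables efficient unified modeling of video tasks while reducing data and compute requirements.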
Abstract: Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the former produces visual tokens and the latter adapts these visual tokens to the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
[87] Prompt-Free Conditional Diffusion for Multi-object Image Augmentation
Haoyu Wang,Lei Zhang,Wei Wei,Chen Ding,Yanning Zhang
Main category: cs.CV
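TL;DR: The paper proposes a prompt-free conditional diffusion framework for multi-object image augmentation; using local-global semantic fusion and LoRA-based knowledge injection, it addresses the limited diversity and category deviation of existing methods.
Details
Motivation: Existing multi-object image generation methods either rely too heavily on text conditions, causing category deviation, or rely too heavily on the original images, causing a lack of diversity. This paper aims to solve both problems at once.
Contribution: A prompt-free conditional diffusion framework, with a local-global semantic fusion strategy and a counting-based reward model, that improves the diversity of generated images and their consistency with the original data.
Method: A local-global semantic fusion strategy extracts image semantics to replace text, and knowledge is injected through LoRA; a counting-based reward model assists training.
Result: Experiments show that the method outperforms existing baselines in diversity and downstream task performance and generalizes well out of domain.
Insight: Semantic fusion and counting constraints make it possible to generate diverse images that match the original data distribution without relying on text prompts.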
Abstract: Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images that resemble real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. Constraining the object counts of each category, instead of imposing pixel-by-pixel constraints, bridges the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.
[88] SoftReMish: A Novel Activation Function for Enhanced Convolutional Neural Networks for Visual Recognition Performance
Mustafa Bayram Gücen
Main category: cs.CV
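TL;DR: SoftReMish is a new activation function intended to improve CNN performance on image classification; in experiments on the MNIST dataset it outperforms ReLU, Tanh, and Mish.
Details
Motivation: Existing activation functions (such as ReLU, Tanh, and Mish) still leave room for improvement in CNNs, which motivates the development of more effective activation functions.
Contribution: The SoftReMish activation function, shown experimentally to beat current mainstream activation functions on training loss and validation accuracy.
Method: The activation function in a standard CNN architecture is swapped out, and SoftReMish is compared against ReLU, Tanh, and Mish.
Result: On MNIST, SoftReMish achieves the lowest training loss (3.14e-8) and the highest validation accuracy (99.41%).
Insight: SoftReMish shows better convergence behavior and generalization in these experiments, which makes it a candidate for visual recognition tasks.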
Abstract: In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions including ReLU, Tanh, and Mish by replacing the activation function in all trainable layers. The model performance was assessed in terms of minimum training loss and maximum validation accuracy. Results showed that SoftReMish achieved a minimum loss (3.14e-8) and a validation accuracy (99.41%), outperforming all other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.
[89] Normalizing Diffusion Kernels with Optimal Transport
Nathan Kessler,Robin Magnet,Jean Feydy
Main category: cs.CV
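TL;DR: The paper proposes normalizing general similarity or adjacency matrices into diffusion-like operators with a symmetric variant of the Sinkhorn algorithm, enabling Laplacian-like smoothing on unstructured data such as point clouds and sparse voxel grids.
Details
Motivation: Smoothing signals is a core operation in machine learning and geometry processing, but the classical Laplace operator requires a carefully defined domain structure, while simple convolution kernels and message-passing layers are biased at domain boundaries. This paper aims to bridge that gap.
Contribution: A broad class of smoothing operators that are normalized into diffusion-like operators by a symmetric Sinkhorn algorithm, inherit desirable properties of Laplacians, and apply to irregular data.
Method: A symmetric Sinkhorn algorithm rescales positive smoothing operators so that they match the structural behavior of heat diffusion, yielding Laplacian-like smoothing operators.
Result: The normalized operators not only approximate heat diffusion but also retain spectral information from the Laplacian, with applications to shape analysis and matching.
Insight: Optimal transport makes Laplacian-like smoothing possible on domains without a strict structure, extending the reach of traditional methods.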
Abstract: Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
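As a rough illustration of what a symmetric Sinkhorn rescaling does, the sketch below finds a positive vector d such that diag(d) K diag(d) has unit row sums for a symmetric kernel K with positive entries. The Gaussian kernel, the function names, and the averaged update rule are assumptions chosen for this example; the paper's operators (for instance with mass weights on points) may be defined differently.

```python
import math


def gaussian_kernel(points, sigma=1.0):
    """Symmetric positive kernel K[i][j] = exp(-|x_i - x_j|^2 / (2 sigma^2)) on 1D points."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * sigma**2)) for b in points]
        for a in points
    ]


def symmetric_sinkhorn(K, iters=200):
    """Return d > 0 such that diag(d) K diag(d) has (approximately) unit row sums."""
    n = len(K)
    d = [1.0] * n
    for _ in range(iters):
        Kd = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
        # Averaged (geometric-mean) update. The plain update d <- 1 / (K d)
        # can oscillate between two scalings, so it is damped here.
        d = [math.sqrt(d[i] / Kd[i]) for i in range(n)]
    return d


def rescale(K, d):
    """Apply the symmetric scaling: S[i][j] = d[i] * K[i][j] * d[j]."""
    n = len(K)
    return [[d[i] * K[i][j] * d[j] for j in range(n)] for i in range(n)]
```

At a fixed point of the update, d_i * (K d)_i = 1 for every i, which is exactly the unit-row-sum condition. Because the same vector scales rows and columns, the result stays symmetric, unlike plain row normalization.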
[90] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
Yunhan Yang,Yufan Zhou,Yuan-Chen Guo,Zi-Xin Zou,Yukun Huang,Ying-Tian Liu,Hao Xu,Ding Liang,Yan-Pei Cao,Xihui Liu
Main category: cs.CV
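TL;DR: OmniPart is a novel framework for generating 3D assets with explicit, editable parts; its two-stage approach achieves semantic decoupling and structural cohesion and supports user-defined part granularity.
Details
Motivation: Most current generative methods produce only monolithic shapes without editable part structure, which limits interactive applications. A method that generates 3D objects with clear part structure is therefore needed.
Contribution: (1) The OmniPart framework, which generates 3D objects with semantic decoupling and structural cohesion; (2) a two-stage decomposition (structure planning and part synthesis) for flexible, controllable generation; (3) the use of 2D part masks to guide 3D part decomposition.
Method: (1) An autoregressive structure planning module generates a controllable sequence of 3D part bounding boxes, guided by 2D part masks; (2) a spatially conditioned rectified flow model, adapted from a pre-trained 3D generator, synthesizes all 3D parts simultaneously.
Result: Experiments show that OmniPart reaches state-of-the-art performance and supports diverse downstream applications.
Insight: Combining 2D guidance with 3D generation gives OmniPart high semantic decoupling and structural cohesion, opening a path toward editable and interpretable 3D content generation.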
Abstract: The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
[91] Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
Prahitha Movva,Naga Harshita Marupaka
Main category: cs.CV
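TL;DR: The paper proposes improving scientific visual question answering (VQA) through multimodal reasoning and ensemble modeling, and reports strong results on the SciVQA 2025 shared task.
Details
Motivation: Charts and figures in scientific literature demand high-precision interpretation, yet existing VQA methods perform poorly on numerical values and reasoning consistency.
Contribution: A VQA approach based on multimodal reasoning and ensemble modeling that improves question answering over scientific data.
Method: Models with 5B to 8B parameters (such as InternVL3) are combined with an ensemble strategy, with prompt optimization and chain-of-thought reasoning used to improve performance.
Result: InternVL3 reaches ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. The ensemble improved on most individual models, though InternVL3 remained the strongest single model.
Insight: Prompt optimization, multi-step reasoning, and ensembling are important for scientific VQA.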
Abstract: Technical reports and articles often contain valuable information in the form of semi-structured data like charts and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model’s ability in visual question answering.
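For readers unfamiliar with the ROUGE-1 F1 metric quoted above, here is a minimal sketch of how it is commonly computed from unigram overlap. The lowercase whitespace tokenization and the function name `rouge1_f1` are simplifying assumptions; the shared task's official scorer may tokenize or stem differently.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1 from clipped unigram overlap, using lowercase whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the candidate "the cat sat" against the reference "the cat sat down", precision is 3/3 and recall is 3/4, giving an F1 of 6/7.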
[92] CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
Yuchen Huang,Zhiyuan Fan,Zhitao He,Sandeep Polisetty,Wenyan Li,Yi R. Fung
Main category: cs.CV
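TL;DR: CultureCLIP strengthens CLIP's cultural awareness using the synthetic cultural dataset CulTwin and customized contrastive learning, improving recognition of fine-grained cultural concepts while preserving generalization.
Details
Motivation: Existing vision-language models such as CLIP perform poorly on culture-related tasks, mainly because of the lack of high-quality cultural datasets, insufficient contextual knowledge, and the difficulty of distinguishing concepts that look alike but differ culturally.
Contribution: (1) CulTwin, a synthetic cultural dataset; (2) a customized contrastive learning method that trains CultureCLIP on contextually enhanced captions and synthetic images; (3) improved performance on culture-related benchmarks.
Method: (1) CulTwin is synthesized with open-source vision-language models and text-to-image diffusion models; (2) CLIP is fine-tuned with customized contrastive learning so that it can distinguish subtle cultural differences.
Result: CultureCLIP improves fine-grained concept recognition by up to 5.49% over the base CLIP on certain culture-related tasks while preserving generalization.
Insight: Combining synthetic data with contextually enhanced captions improves sensitivity to subtle cultural differences without sacrificing generalization.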
Abstract: Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
[93] Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
Aleksandar Jevtić,Christoph Reich,Felix Wimbauer,Oliver Hahn,Christian Rupprecht,Stefan Roth,Daniel Cremers
Main category: cs.CV
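TL;DR: SceneDINO is an unsupervised semantic scene completion method that uses self-supervised learning and 2D scene understanding techniques to infer 3D geometry and semantics without annotated data, reaching SOTA performance.
Details
Motivation: Existing semantic scene completion methods depend on expensive annotations; this paper explores an unsupervised alternative.
Contribution: SceneDINO, which adapts self-supervised representation learning and 2D unsupervised scene understanding to unsupervised semantic scene completion and infers high-quality 3D geometry and semantics.
Method: Training uses only multi-view consistency self-supervision; a 3D feature distillation step extracts unsupervised 3D semantics; 3D geometry and DINO features are inferred in a feed-forward manner.
Result: SceneDINO reaches SOTA segmentation accuracy in both 3D and 2D unsupervised scene understanding, and linear probing of its 3D features matches a current supervised SSC approach.
Insight: SceneDINO shows the potential of unsupervised methods for 3D scene understanding and takes first steps toward a foundation for single-image 3D scene understanding.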
Abstract: Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
[94] RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models
Keyan Chen,Chenyang Liu,Bowen Chen,Jiafan Zhang,Zhengxia Zou,Zhenwei Shi
Main category: cs.CV
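TL;DR: RSRefSeg 2 proposes a decoupling paradigm that combines the foundation models CLIP and SAM to improve segmentation accuracy and semantic understanding in referring remote sensing image segmentation.
Details
Motivation: Existing methods are limited in handling complex semantic relationships and cross-modal alignment, largely because their coupled processing conflates target localization with boundary delineation.
Contribution: (1) A decoupled dual-stage framework (coarse localization followed by fine segmentation); (2) a combination of the strengths of CLIP and SAM; (3) a cascaded second-order prompter that mitigates CLIP's misactivation in multi-entity scenarios.
Method: (1) CLIP performs dual-modal encoding and coarse localization; (2) text embeddings are decomposed to refine the semantic prompts; (3) SAM performs fine segmentation.
Result: Experiments show that RSRefSeg 2 surpasses existing methods in segmentation accuracy (about +3% gIoU) and in complex semantic understanding.
Insight: The decoupled design reduces error propagation, and collaboration between foundation models improves generalization and interpretability.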
Abstract: Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP’s cross-modal alignment strength with SAM’s segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. To mitigate CLIP’s misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.
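The gIoU figure quoted above can be read with the following sketch in mind. It assumes the convention common in referring segmentation work, where gIoU is the mean of per-sample IoUs (as opposed to cIoU, which accumulates intersections and unions over the whole dataset); the abstract does not define the metric, so this is an assumption. Masks are represented here as flat lists of 0/1 values for simplicity.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention, not a given).
    return inter / union if union else 1.0


def giou(preds, targets):
    """Mean of per-sample IoUs over a dataset (assumed meaning of gIoU)."""
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


def ciou(preds, targets):
    """Cumulative IoU: total intersection divided by total union over the dataset."""
    inter = sum(1 for p, t in zip(preds, targets) for a, b in zip(p, t) if a and b)
    union = sum(1 for p, t in zip(preds, targets) for a, b in zip(p, t) if a or b)
    return inter / union if union else 1.0
```

The two metrics can diverge: averaging per sample gives small targets the same weight as large ones, whereas the cumulative form is dominated by large regions.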
[95] Learning to Track Any Points from Human Motion
Inès Hyeonsu Kim,Seokju Cho,Jahyeok Koo,Junghyun Park,Jiahui Huang,Joon-Young Lee,Seungryong Kim
Main category: cs.CV
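TL;DR: The paper presents AnthroTAP, a pipeline that uses the SMPL model to automatically generate pseudo-labeled data for training point trackers. The resulting model achieves SOTA performance on the TAP-Vid benchmark with far less data and compute.
Details
Motivation: Human motion data is well suited to training point trackers, but large-scale annotation is costly. The paper addresses this by generating pseudo-labels automatically.
Contribution: 1. The AnthroTAP pipeline, which uses the SMPL model to generate pseudo-labeled data automatically;
2. Occlusion handling via ray-casting and filtering of unreliable tracks via optical flow consistency;
3. SOTA performance on the TAP-Vid benchmark with greatly reduced data and compute.
Method: 1. Fit the SMPL model to humans detected in video frames;
2. Project the 3D mesh vertices onto the 2D image plane to generate pseudo-trajectories;
3. Handle occlusions using ray-casting;
4. Filter out unreliable tracks based on optical flow consistency.
Result: On the TAP-Vid benchmark, a point tracker trained with AnthroTAP surpasses other models trained on real videos while using far less data and only 1 day of training on 4 GPUs.
Insight: Pseudo-labeled data can substantially cut the cost of training point trackers while matching or exceeding the performance of models trained on real videos.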
Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this by automatically generating pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to the 256 GPUs used in a recent state-of-the-art method.
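Two of the pipeline steps described above, projecting 3D vertices to 2D and filtering tracks by optical flow consistency, can be sketched as follows. This is a simplified illustration under stated assumptions: an ideal pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, flow displacements already sampled at the track positions, and a hypothetical pixel threshold `max_error`. The abstract does not specify the authors' exact consistency criterion.

```python
import math


def project_points(vertices, fx, fy, cx, cy):
    """Pinhole projection of camera-space 3D points (X, Y, Z) to pixel coordinates (u, v)."""
    pixels = []
    for x, y, z in vertices:
        if z <= 0:
            pixels.append(None)  # behind the camera: no valid projection
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def is_consistent(track, flow_steps, max_error=2.0):
    """Keep a track only if every frame-to-frame step agrees with the optical flow.

    `track` is a list of (u, v) positions, one per frame; `flow_steps[k]` is the
    flow displacement (du, dv) sampled at track[k] between frames k and k + 1.
    """
    steps = zip(track, track[1:])
    for (p0, p1), (du, dv) in zip(steps, flow_steps):
        err = math.hypot(p1[0] - p0[0] - du, p1[1] - p0[1] - dv)
        if err > max_error:
            return False
    return True
```

A pseudo-trajectory is then the sequence of projections of one mesh vertex across frames, and tracks that disagree with the observed image motion are dropped before training. Occlusion handling via ray-casting is a separate step not covered by this sketch.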
cs.GR [Back]
[96] Self-Attention Based Multi-Scale Graph Auto-Encoder Network of 3D Meshes
Saqib Nazir,Olivier Lézoray,Sébastien Bougleux
Main category: cs.GR
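TL;DR: The paper proposes 3DGeoMeshNet, a new framework based on graph convolutional networks (GCNs) for reconstructing 3D mesh data, which uses anisotropic convolution layers to learn global and local features directly in the spatial domain.
Details
Motivation: 3D mesh data is non-Euclidean, so conventional convolutional neural networks (CNNs) cannot process it directly. Many existing graph convolution methods rely on isotropic filters or spectral decomposition, which limits their ability to capture both local and global features.
Contribution: (1) The 3DGeoMeshNet framework, which keeps the original 3D mesh format and avoids conversion to intermediate representations; (2) anisotropic convolution layers that learn features directly in the spatial domain; (3) a multi-scale encoder-decoder structure that captures geometric detail.
Method: (1) A GCN-based network with anisotropic convolution layers; (2) a multi-scale encoder-decoder structure with separate global and local pathways; (3) operation directly on the original mesh, without conversion.
Result: Experiments on the COMA human face dataset show that 3DGeoMeshNet achieves strong reconstruction accuracy.
Insight: Operating directly on the 3D mesh rather than on an intermediate representation preserves geometric detail more accurately; anisotropic convolution combined with a multi-scale design is effective for non-Euclidean data.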
Abstract: 3D meshes are fundamental data representations for capturing complex geometric shapes in computer vision and graphics applications. While Convolutional Neural Networks (CNNs) have excelled in structured data like images, extending them to irregular 3D meshes is challenging due to the non-Euclidean nature of the data. Graph Convolutional Networks (GCNs) offer a solution by applying convolutions to graph-structured data, but many existing methods rely on isotropic filters or spectral decomposition, limiting their ability to capture both local and global mesh features. In this paper, we introduce 3D Geometric Mesh Network (3DGeoMeshNet), a novel GCN-based framework that uses anisotropic convolution layers to effectively learn both global and local features directly in the spatial domain. Unlike previous approaches that convert meshes into intermediate representations like voxel grids or point clouds, our method preserves the original polygonal mesh format throughout the reconstruction process, enabling more accurate shape reconstruction. Our architecture features a multi-scale encoder-decoder structure, where separate global and local pathways capture both large-scale geometric structures and fine-grained local details. Extensive experiments on the COMA dataset containing human faces demonstrate the efficiency of 3DGeoMeshNet in terms of reconstruction accuracy.
cs.AI [Back]
[97] Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality
Haochen Huang,Jiahuan Pei,Mohammad Aliannejadi,Xin Sun,Moonisa Ahsan,Pablo Cesar,Chuang Yu,Zhaochun Ren,Junxiao Wang
Main category: cs.AI
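TL;DR: This paper explores fine-grained vision-language models (VLMs) for augmented reality (AR) training, introduces a dataset tailored to AR training, evaluates nine state-of-the-art VLMs, and reveals the limits of existing models on fine-grained tasks.
Details
Motivation: AR training requires AI assistants with multimodal understanding, but existing VLMs still fall short on fine-grained tasks. The paper aims to fill this gap and to give blind and visually impaired users equitable access to learning opportunities.
Contribution: (1) A dataset of systematized vision-language tasks for AR training; (2) an evaluation of nine state-of-the-art VLMs; (3) evidence of the shortcomings of existing models on fine-grained tasks; (4) release of the dataset and source code to support community research.
Method: The paper introduces a comprehensive dataset for evaluating VLMs in AR training and tests nine models, including GPT-4o. Tasks cover scenarios such as fine-grained assembly and state detection.
Result: Even advanced models such as GPT-4o struggle with fine-grained tasks, with a maximum F1 score of just 40.54% on state detection, which indicates the need for better datasets and benchmarks.
Insight: The study underlines the importance of fine-grained vision-language alignment and the social value of equitable AI-driven learning. The released resources support future research.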
Abstract: Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.
[98] MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation
Fathinah Izzati,Xinyue Li,Yuxuan Wu,Gus Xia
Main category: cs.AI
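TL;DR: The paper proposes MusiScene, which fine-tunes MU-LLaMA for Music Scene Imagination (MSI) and uses the generated captions to enhance Video Background Music Generation (VBMG).
Details
Motivation: Humans associate music with scenes, but existing music captioning models focus only on musical elements and lack cross-modal associations.
Contribution: (1) A large-scale video-audio caption dataset; (2) MusiScene, obtained by fine-tuning MU-LLaMA for the MSI task; (3) captions that are more contextually relevant and can improve VBMG.
Method: (1) Build a cross-modal dataset; (2) fine-tune Music Understanding LLaMA (MU-LLaMA) into MusiScene; (3) evaluate MSI capability and apply it to VBMG.
Result: MusiScene generates scene captions that are more contextually relevant than those of MU-LLaMA, which relies on music alone.
Insight: Cross-modal information, such as video-music associations, improves a music captioning model's ability to imagine scenes, which benefits downstream tasks.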
Abstract: Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models, which focus solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each piece of music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and show that MusiScene is more capable of generating contextually relevant captions than MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.
cs.RO [Back]
[99] Evaluation of Habitat Robotics using Large Language Models
William Li,Lei Hamilton,Kaise Al-natour,Sanjeev Mohindra
Main category: cs.RO
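TL;DR: The paper evaluates large language models on embodied robotic tasks and finds that reasoning models (such as OpenAI o3-mini) outperform non-reasoning models (such as GPT-4o and Llama 3) on the Meta PARTNR benchmark.
Details
Motivation: To explore how large language models perform on cooperative robotic tasks, particularly in simplified but randomized indoor kitchen environments, and to open new research directions for embodied robotics development.
Contribution: Benchmarks several frontier language models on cooperative robotic tasks using Meta PARTNR and demonstrates the advantage of reasoning models.
Method: Uses the Meta PARTNR benchmark to generate randomized kitchen scenes with cooperative tasks, and evaluates several language models under centralized, decentralized, fully observable, and partially observable configurations.
Result: OpenAI o3-mini outperforms GPT-4o and Llama 3 in every configuration, suggesting a promising avenue for embodied robotics research.
Insight: Reasoning ability matters for cooperative robotic tasks, which points the way toward more effective language models for robotics.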
Abstract: This paper focuses on evaluating the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNR benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is given a task that two robotic agents work together to solve. We evaluated multiple frontier models on Meta PARTNR environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR’s robotic embodied environments. o3-mini outperformed them across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.
[100] 3DGS_LSR:Large_Scale Relocation for Autonomous Driving Based on 3D Gaussian Splatting
Haitao Lu,Haijier Chen,Haoze Liu,Shoujian Zhang,Bo Xu,Ziao Liu
Main category: cs.RO
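TL;DR: 3DGS-LSR is a large-scale relocalization framework based on 3D Gaussian Splatting that achieves centimeter-level positioning from a single monocular RGB image for autonomous driving. On the KITTI dataset it is significantly more accurate than other methods.
Details
Motivation: In complex urban environments, GNSS positioning is often unreliable because of signal occlusion and multipath effects, while traditional mapping approaches are hard to deploy on resource-constrained robotic platforms because of storage and computational costs. An efficient and precise localization solution is therefore needed.
Contribution: Proposes the 3DGS-LSR framework, which uses 3D Gaussian Splatting for precise relocalization in large-scale scenes and reaches centimeter-level accuracy from monocular RGB input alone.
Method: Builds high-accuracy 3DGS maps from multi-sensor data, uses SuperPoint and SuperGlue for feature extraction and matching, and refines the localization result step by step with an iterative optimization strategy.
Result: On KITTI, 3DGS-LSR achieves average positioning accuracies of 0.026 m, 0.029 m, and 0.081 m on town roads, boulevard roads, and traffic-dense highways, respectively.
Insight: 3D Gaussian Splatting with monocular RGB input is enough for high-accuracy localization, addressing unreliable GNSS and offering a dependable positioning option for autonomous driving.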
Abstract: In autonomous robotic systems, precise localization is a prerequisite for safe navigation. However, in complex urban environments, GNSS positioning often suffers from signal occlusion and multipath effects, leading to unreliable absolute positioning. Traditional mapping approaches are constrained by storage requirements and computational inefficiency, limiting their applicability to resource-constrained robotic platforms. To address these challenges, we propose 3DGS-LSR: a large-scale relocalization framework leveraging 3D Gaussian Splatting (3DGS), enabling centimeter-level positioning using only a single monocular RGB image on the client side. We combine multi-sensor data to construct high-accuracy 3DGS maps in large outdoor scenes, while the robot-side localization requires just a standard camera input. Using SuperPoint and SuperGlue for feature extraction and matching, our core innovation is an iterative optimization strategy that refines localization results through step-by-step rendering, making it suitable for real-time autonomous navigation. Experimental validation on the KITTI dataset demonstrates our 3DGS-LSR achieves average positioning accuracies of 0.026m, 0.029m, and 0.081m in town roads, boulevard roads, and traffic-dense highways respectively, significantly outperforming other representative methods while requiring only monocular RGB input. This approach provides autonomous robots with reliable localization capabilities even in challenging urban environments where GNSS fails.
cs.IR [Back]
[101] A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models
Shuliang Liu,Hongyi Liu,Aiwei Liu,Bingchen Duan,Qi Zheng,Yibo Yan,He Geng,Peijie Jiang,Jia Liu,Xuming Hu
Main category: cs.IR
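TL;DR: The paper examines proactive defense strategies against misinformation generated by large language models (LLMs) and proposes a three-pillar framework of knowledge credibility, inference reliability, and input robustness, reporting up to 63% improvement.
Details
Motivation: Widespread LLM deployment amplifies the societal risk of algorithmically generated misinformation. Traditional detection methods struggle with its self-reinforcing, highly plausible, and multilingual nature, so a shift to proactive defense is needed.
Contribution: Proposes a Three Pillars framework for proactive defense (Knowledge Credibility, Inference Reliability, Input Robustness) and shows through comparative analysis that it clearly outperforms conventional methods at preventing misinformation.
Method: Combines knowledge credibility (data integrity), inference reliability (self-corrective mechanisms), and input robustness (defense against adversarial attacks) in one framework, supported by a survey of existing techniques and a meta-analysis.
Result: Proactive defense strategies improve misinformation prevention by up to 63% over conventional methods, at the cost of computational overhead and generalization challenges.
Insight: Future work should co-design knowledge foundations, reasoning certification, and attack-resistant interfaces to make LLMs more resistant to misinformation across domains.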
Abstract: The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.
cs.HC [Back]
[102] NRXR-ID: Two-Factor Authentication (2FA) in VR Using Near-Range Extended Reality and Smartphones
Aiur Nanzatov,Lourdes Peña-Castillo,Oscar Meruvia-Pastor
Main category: cs.HC
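TL;DR: NRXR-ID combines near-range extended reality (XR) with smartphones to provide two-factor authentication (2FA) in VR, letting users complete an authentication challenge without removing the headset.
Details
Motivation: VR users wearing a head-mounted display cannot see their real-world surroundings, which makes conventional 2FA hard to use. NRXR-ID aims to solve this.
Contribution: Proposes NRXR-ID, which completes authentication challenges through the smartphone and improves the feasibility and user experience of 2FA in VR. Introduces a novel checkers-style challenge alongside other challenge types.
Method: Designs four types of authentication challenge, including checkers-style visual matching and digital PIN entry, and evaluates their performance and user experience in a user study with a 4X3 within-subjects design.
Result: Checkers-style visual matching suits VR best, followed by the digital PIN challenge, which is submitted via the smartphone and answered within the VR environment.
Insight: Combining smartphones with XR offers a new approach to secure authentication in VR; the checkers-style challenge performs best thanks to its intuitiveness.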
Abstract: Two-factor authentication (2FA) has become widely adopted as an efficient and secure way to validate someone’s identity online. Two-factor authentication is difficult in virtual reality (VR) because users are usually wearing a head-mounted display (HMD) which does not allow them to see their real-world surroundings. We present NRXR-ID, a technique to implement two-factor authentication while using extended reality systems and smartphones. The proposed method allows users to complete an authentication challenge using their smartphones without removing their HMD. We performed a user study where we explored four types of challenges for users, including a novel checkers-style challenge. Users responded to these challenges under three different configurations, including a technique that uses the smartphone to support gaze-based selection without the use of VR controllers. A 4X3 within-subjects design allowed us to study all the variations proposed. We collected performance metrics and performed user experience questionnaires to collect subjective impressions from 30 participants. Results suggest that the checkers-style visual matching challenge was the most appropriate option, followed by entering a digital PIN challenge submitted via the smartphone and answered within the VR environment.
eess.AS [Back]
[103] ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark
He Wang,Linhan Ma,Dake Guo,Xiong Wang,Lei Xie,Jin Xu,Junyang Lin
Main category: eess.AS
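TL;DR: The paper presents ContextASR-Bench, a large-scale benchmark for contextual speech recognition that fills a gap in evaluating the context modeling and world-knowledge reasoning abilities of ASR systems.
Details
Motivation: Traditional ASR evaluation is limited to contextless settings, while recent progress in LLMs and Large Audio Language Models (LALMs) makes it pressing to evaluate the generality and intelligence of ASR systems.
Contribution: Proposes a contextual speech recognition benchmark with up to 40,000 entries across more than 10 domains, supporting evaluation with coarse-grained and fine-grained contextual information.
Method: Uses the large-scale dataset to analyze model performance with and without contextual information, and adds an evaluation of named-entity recognition in the audio input.
Result: LALMs with world knowledge and in-context learning ability outperform conventional ASR models by a large margin.
Insight: The advantage of LALMs on contextual speech recognition highlights the importance of context modeling and world knowledge, and gives direction for future ASR system design.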
Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.
cs.DC [Back]
[104] ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Daghash K. Alqahtani,Maria A. Rodriguez,Muhammad Aamir Cheema,Hamid Rezatofighi,Adel N. Toosi
Main category: cs.DC
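TL;DR: The paper proposes ECORE, a framework that uses dynamic routing strategies to optimize deep learning inference on edge devices, substantially reducing energy consumption and latency while largely preserving detection accuracy.
Details
Motivation: Edge devices are resource-constrained in real-time vision analytics such as object detection, so energy consumption must be balanced against detection accuracy.
Contribution: Proposes the ECORE framework, whose dynamic routing strategies jointly optimize energy and accuracy, reducing energy consumption and latency by 45% and 49%, respectively, compared with baseline methods.
Method: Integrates dynamic routing strategies (estimation-based techniques and a greedy selection algorithm) that choose the most suitable edge device-model pair based on object characteristics.
Result: With YOLO, SSD, and other models on several edge platforms, ECORE markedly reduces energy consumption and latency at a cost of only 2% in accuracy.
Insight: Dynamic routing is an effective answer to edge resource constraints and suits real-time vision analytics.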
Abstract: Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including estimation-based techniques and a greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our approach through extensive experiments on real-world datasets, comparing the proposed routers against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 45% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
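Illustration (not from the paper): the abstract does not spell out ECORE's routers, so the following is only a minimal sketch of what a greedy energy-aware selection could look like. It assumes each device-model pair comes with a profiled energy cost and an estimated accuracy for the incoming image; the `Candidate` type, the `greedy_route` function, the tolerance rule, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One edge device-model pair with hypothetical profiled numbers."""
    device: str
    model: str
    energy_mj: float  # assumed energy per request, in millijoules
    accuracy: float   # assumed detection accuracy for this kind of image


def greedy_route(candidates, tolerance=0.02):
    """Pick the lowest-energy pair whose accuracy is within `tolerance`
    of the best available accuracy (an assumed rule, not ECORE's)."""
    if not candidates:
        raise ValueError("no candidate device-model pairs")
    best_accuracy = max(c.accuracy for c in candidates)
    eligible = [c for c in candidates if c.accuracy >= best_accuracy - tolerance]
    return min(eligible, key=lambda c: c.energy_mj)


# Made-up profile table for illustration only.
candidates = [
    Candidate("jetson-orin-nano", "yolo", 900.0, 0.62),
    Candidate("raspberry-pi-5", "ssd", 400.0, 0.61),
    Candidate("raspberry-pi-4", "efficientdet", 300.0, 0.50),
]
choice = greedy_route(candidates)
print(choice.device, choice.model)
```

Widening `tolerance` trades accuracy for energy: with a tolerance of 0 the router always picks the most accurate pair, and with a large tolerance it always picks the cheapest one.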
eess.IV [Back]
[105] Learning Segmentation from Radiology Reports
Pedro R. A. S. Bassi,Wenxuan Li,Jieneng Chen,Zheren Zhu,Tianyu Lin,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
Main category: eess.IV
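TL;DR: The paper proposes a report-supervision loss (R-Super) that uses radiology reports to provide voxel-wise supervision for tumor segmentation AI, strongly improving segmentation, including when annotated masks are scarce.
Details
Motivation: Tumor segmentation in CT scans is important, but annotated masks are scarce and time-consuming to produce. Radiology reports are plentiful yet underused, so a method is needed to turn them into supervision signals for segmentation models.
Contribution: Proposes the R-Super loss, which converts radiology reports into voxel-wise supervision and gives tumor segmentation AI additional training signal.
Method: Trains segmentation models with the R-Super loss on CT-Report pairs combined with public CT-Mask datasets, improving performance in internal and external validation.
Result: R-Super strongly improves tumor segmentation, raising the F1 score by up to 16%, with gains both when few masks (e.g., 50) and when many masks (e.g., 1.7K) are available.
Insight: Radiology reports can serve as an effective supervision signal that supplements scarce annotations, suggesting a new way to use existing data for medical image segmentation.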
Abstract: Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets contain from dozens to a couple of thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-Report pairs (from the UCSF Hospital), and merged it with public CT-Mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation: the F1 score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50), and when many masks are available (e.g., 1.7K). Project: https://github.com/MrGiovanni/R-Super
[106] Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions
Jiaqi Guo,Santiago López-Tapia
Main category: eess.IV
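TL;DR: The paper proposes a diffusion-based method for limited-angle CT reconstruction. Using an MR-SDE framework and the RNSD⁺ noise-aware rectification mechanism, it handles reconstruction under noisy conditions and improves data consistency and perceptual quality.
Details
Motivation: Limited-angle CT (LACT) suffers severe artifacts in reconstructed images because angular projections are missing. Most existing methods assume ideal noise-free conditions and ignore real measurement noise.
Contribution: (1) Formulates LACT as a sinogram inpainting task and proposes a diffusion framework based on MR-SDE; (2) introduces RNSD⁺, a noise-aware rectification mechanism that explicitly models inference-time uncertainty and improves robustness under noise.
Method: (1) Complete the missing angular views with MR-SDE; (2) use RNSD⁺ at inference time to dynamically adjust for the influence of noise and keep the reconstruction reliable.
Result: The method outperforms baseline models in data consistency and perceptual quality and generalizes well across noise intensities and acquisition scenarios.
Insight: Combining diffusion models with noise-aware mechanisms offers a new route to robust solutions of complex inverse problems.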
Abstract: Limited-Angle Computed Tomography (LACT) is a challenging inverse problem where missing angular projections lead to incomplete sinograms and severe artifacts in the reconstructed images. While recent learning-based methods have demonstrated effectiveness, most of them assume ideal, noise-free measurements and fail to address the impact of measurement noise. To overcome this limitation, we treat LACT as a sinogram inpainting task and propose a diffusion-based framework that completes missing angular views using a Mean-Reverting Stochastic Differential Equation (MR-SDE) formulation. To improve robustness under realistic noise, we propose RNSD$^+$, a novel noise-aware rectification mechanism that explicitly models inference-time uncertainty, enabling reliable and robust reconstruction. Extensive experiments demonstrate that our method consistently surpasses baseline models in data consistency and perceptual quality, and generalizes well across varying noise intensity and acquisition scenarios.
[107] A novel framework for fully-automated co-registration of intravascular ultrasound and optical coherence tomography imaging data
Xingwei He,Kit Mills Bransby,Ahmet Emir Ulutas,Thamil Kumaran,Nathan Angelo Lecaros Yap,Gonul Zeren,Hesong Zeng,Yaojun Zhang,Andreas Baumbach,James Moon,Anthony Mathur,Jouke Dijkstra,Qianni Zhang,Lorenz Raber,Christos V Bourantas
Main category: eess.IV
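TL;DR: The paper proposes a deep-learning framework for fully automated longitudinal and circumferential co-registration of intravascular ultrasound (IVUS) and optical coherence tomography (OCT) images, with performance comparable to expert analysts and fast processing.
Details
Motivation: In multimodality imaging studies, IVUS-OCT co-registration usually requires manual work, which is slow and inefficient. The paper aims to develop an automated method that improves efficiency and accuracy.
Contribution: Proposes a fully automated deep-learning framework for longitudinal and circumferential co-registration of IVUS and OCT images, removing the dependence on manual analysis.
Method: Trains deep-learning models on 61,655 NIRS-IVUS and 62,334 OCT frames to extract features, then uses a dynamic time warping algorithm for longitudinal co-registration and dynamic programming for circumferential registration. The test set contains 77 vessels.
Result: Concordance correlation coefficient >0.99 for longitudinal and >0.90 for circumferential co-registration; Williams Index of 0.96 and 0.97, respectively; processing time <90 s per vessel.
Insight: Deep learning performs well for multimodal medical image co-registration and can greatly speed up the analysis of large-scale data, supporting research on plaque composition.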
Abstract: Aims: To develop a deep-learning (DL) framework that will allow fully automated longitudinal and circumferential co-registration of intravascular ultrasound (IVUS) and optical coherence tomography (OCT) images. Methods and results: Data from 230 patients (714 vessels) with acute coronary syndrome that underwent near-infrared spectroscopy (NIRS)-IVUS and OCT imaging in their non-culprit vessels were included in the present analysis. The lumen borders annotated by expert analysts in 61,655 NIRS-IVUS and 62,334 OCT frames, and the side branches and calcific tissue identified in 10,000 NIRS-IVUS frames and 10,000 OCT frames, were used to train DL solutions for the automated extraction of these features. The trained DL solutions were used to process NIRS-IVUS and OCT images and their output was used by a dynamic time warping algorithm to co-register longitudinally the NIRS-IVUS and OCT images, while the circumferential registration of the IVUS and OCT was optimized through dynamic programming. On a test set of 77 vessels from 22 patients, the DL method showed high concordance with the expert analysts for the longitudinal and circumferential co-registration of the two imaging sets (concordance correlation coefficient >0.99 for the longitudinal and >0.90 for the circumferential co-registration). The Williams Index was 0.96 for longitudinal and 0.97 for circumferential co-registration, indicating a comparable performance to the analysts. The time needed for the DL pipeline to process imaging data from a vessel was <90 s. Conclusion: The fully automated, DL-based framework introduced in this study for the co-registration of IVUS and OCT is fast and provides estimations that compare favorably to the expert analysts. These features render it useful in research for the analysis of large-scale data collected in studies that incorporate multimodality imaging to characterize plaque composition.
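Illustration (not from the paper): the abstract names dynamic time warping (DTW) as the longitudinal alignment step but gives no implementation details. The sketch below is textbook DTW over two 1-D per-frame feature sequences; treating the feature as something like a lumen area per frame is an assumption, as are the function name `dtw_align` and the toy values.

```python
def dtw_align(a, b):
    """Classic dynamic time warping between two non-empty 1-D sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching a[i] to b[j], in order from the first frame to the last.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy per-frame features: the second pullback has one extra frame.
ivus_feature = [1.0, 2.0, 3.0]
oct_feature = [1.0, 2.0, 2.0, 3.0]
total, path = dtw_align(ivus_feature, oct_feature)
print(total, path)
```

In this toy case the middle frame of the first sequence is matched to both repeated frames of the second, which is the kind of non-linear stretching that lets two pullbacks with different frame counts be aligned.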
[108] Enhancing Synthetic CT from CBCT via Multimodal Fusion and End-To-End Registration
Maximilian Tschuchnig,Lukas Lamminger,Philipp Steininger,Michael Gadermayr
Main category: eess.IV
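TL;DR: Using multimodal fusion and end-to-end registration, the paper improves the quality of synthetic CT generated from CBCT.
Details
Motivation: CBCT is widely used for intraoperative imaging because of its rapid acquisition and low radiation dose, but it suffers from artifacts and lower visual quality. Conventional synthetic CT generation does not fully exploit multimodal data and leaves the misalignment between modalities unresolved.
Contribution: (1) Jointly leverages intraoperative CBCT and preoperative CT in a multimodal learning framework; (2) introduces an end-to-end learnable registration module into the sCT generation pipeline.
Method: (1) Build an end-to-end learning framework with the registration module embedded in the sCT pipeline; (2) evaluate sensitivity to data quality and alignment parameters on a synthetic dataset; (3) test robustness and generalizability on real-world clinical datasets.
Result: Combining multimodal fusion with registration outperforms baseline methods in 79 of 90 evaluation settings, with the largest gains when CBCT quality is low and the preoperative CT is moderately misaligned.
Insight: The registration module is essential in multimodal sCT generation and improves image quality most when data quality is poor.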
Abstract: Cone-Beam Computed Tomography (CBCT) is widely used for intraoperative imaging due to its rapid acquisition and low radiation dose. However, CBCT images typically suffer from artifacts and lower visual quality compared to conventional Computed Tomography (CT). A promising solution is synthetic CT (sCT) generation, where CBCT volumes are translated into the CT domain. In this work, we enhance sCT generation through multimodal learning by jointly leveraging intraoperative CBCT and preoperative CT data. To overcome the inherent misalignment between modalities, we introduce an end-to-end learnable registration module within the sCT pipeline. This model is evaluated on a controlled synthetic dataset, allowing precise manipulation of data quality and alignment parameters. Further, we validate its robustness and generalizability on two real-world clinical datasets. Experimental results demonstrate that integrating registration in multimodal sCT generation improves sCT quality, outperforming baseline multimodal methods in 79 out of 90 evaluation settings. Notably, the improvement is most significant in cases where CBCT quality is low and the preoperative CT is moderately misaligned.
[109] LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models
Zhihao Chen,Tao Chen,Chenhui Wang,Qi Gao,Huidong Xie,Chuang Niu,Ge Wang,Hongming Shan
Main category: eess.IV
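TL;DR: LangMamba is a framework that combines vision-language models (VLMs) with the efficient Mamba mechanism for low-dose CT (LDCT) denoising, improving image quality and explainability.
Details
Motivation: Low-dose CT reduces radiation exposure but degrades image quality, and conventional deep learning methods overlook the potential of high-level semantic information. Advances in VLMs create an opportunity to use language as a supervisory signal.
Contribution: Proposes LangMamba, which combines language-driven semantic information with the efficient Mamba mechanism, and designs the LangAE pre-training module, the SEED denoiser, and the LangDA loss, improving LDCT denoising performance.
Method: Two-stage learning strategy: (1) pre-train LangAE, which uses frozen VLMs to map NDCT images into a semantically rich space; (2) guide LDCT denoising with the SEED denoiser (local semantic enhancement plus the global Mamba mechanism) and the LangDA loss (dual-space alignment).
Result: Outperforms existing methods on two public datasets, improving detail preservation and visual fidelity. LangAE generalizes well to unseen datasets, and the LangDA loss improves explainability.
Insight: Language can act as an effective supervisory signal; combining semantic information with an efficient modeling mechanism such as Mamba can improve the performance and generality of medical image reconstruction.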
Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantics while capturing global features with an efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, the LangDA loss improves explainability by integrating language-guided insights into image reconstruction and can be used in a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available at https://github.com/hao1635/LangMamba.
cs.LG [Back]
[110] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Song Lai,Haohan Zhao,Rong Feng,Changyi Ma,Wenzhuo Liu,Hongbo Zhao,Xi Lin,Dong Yi,Min Xie,Qingfu Zhang,Hongbin Liu,Gaofeng Meng,Fei Zhu
Main category: cs.LG
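TL;DR: The paper compares supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) in continual post-training (CPT) in terms of knowledge retention. RFT mitigates forgetting and preserves model performance, whereas SFT causes catastrophic forgetting.
Details
Motivation: To study how different learning paradigms (SFT and RFT) affect knowledge retention in continual post-training, and how to adapt more effectively to ever-evolving downstream tasks.
Contribution: (1) Finds that RFT preserves prior knowledge in CPT and matches multi-task training; (2) proposes a rollout-based instance filtering algorithm that improves the stability and efficiency of RFT.
Method: Runs experiments on a benchmark of seven multimodal tasks with Qwen2.5-VL-7B-Instruct as the base model, comparing the behavior and mechanisms of SFT and RFT.
Result: RFT protects and even enhances the model's general knowledge, while SFT leads to catastrophic forgetting and degraded capabilities. Further analysis indicates that RFT's implicit regularization is a key factor in mitigating forgetting.
Insight: The implicit regularization of RFT plays an important role in continual learning, making RFT a robust paradigm for CPT.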
Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
[111] AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
Shangzhan Li,Zefan Wang,Ye He,Yuxuan Li,Qi Shi,Jianling Li,Yonggang Hu,Wanxiang Che,Xu Han,Zhiyuan Liu,Maosong Sun
Main category: cs.LG
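TL;DR: AutoTriton uses reinforcement learning to automate Triton programming. It combines supervised fine-tuning with the GRPO algorithm, using rule-based and execution-based rewards, to improve its Triton programming ability.
Details
Motivation: Deep learning kernel development requires manual tuning of critical parameters such as tile sizes and memory access patterns, which makes performance optimization difficult and slow. AutoTriton aims to reduce manual effort through automation.
Contribution: Introduces AutoTriton, described as the first model dedicated to Triton programming powered by reinforcement learning, which combines supervised fine-tuning with GRPO.
Method: (1) A supervised fine-tuning (SFT) stage to acquire Triton programming skills; (2) reinforcement learning with GRPO, combining a rule-based reward and an execution-based reward.
Result: On TritonBench and KernelBench, the 8B model performs comparably to mainstream large models such as Claude-4-Sonnet. Experiments confirm the importance of each module (SFT, RL, and reward design).
Insight: Reinforcement learning shows strong promise for automatically generating high-performance kernels, which are core components of efficient AI systems.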
Abstract: Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline, and conducts RL with Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve Triton programming ability, sequentially. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.
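Illustration (not from the paper): the core idea of Group Relative Policy Optimization is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below computes such group-normalized advantages. How AutoTriton actually combines its rule-based and execution-based rewards is not stated in the abstract, so the simple sum in `combined_reward` is an assumption, as are the function names and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def combined_reward(passes_rules: bool, runs_correctly: bool) -> float:
    """Hypothetical reward: one point for each check that passes."""
    return float(passes_rules) + float(runs_correctly)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps keeps
    the result finite when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled kernels for one prompt, with made-up check outcomes.
rewards = [
    combined_reward(True, True),
    combined_reward(False, False),
    combined_reward(True, False),
    combined_reward(True, False),
]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat the group average get a positive advantage and are reinforced; samples below it are pushed down. When all samples in a group score the same, every advantage is zero and the group contributes no gradient signal.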
[112] MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
Yucheng Shi,Wenhao Yu,Zaitang Li,Yonglin Wang,Hongming Zhang,Ninghao Liu,Haitao Mi,Dong Yu
Main category: cs.LG
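TL;DR: MobileGUI-RL is a reinforcement learning framework that trains GUI agents in online environments. It generates tasks through self-exploration and curriculum learning and adapts GRPO to improve navigation efficiency.
Details
Motivation: Most existing GUI agents are trained offline, which limits scalability, causes overfitting to specific UI templates, and yields brittle policies. MobileGUI-RL addresses these problems through online training.
Contribution: (1) Proposes a framework for training GUI agents online; (2) generates tasks through self-exploration and curriculum learning; (3) adapts GRPO with composite rewards that balance task success and execution efficiency.
Method: (1) Synthesize a curriculum of learnable tasks through self-exploration and filtering; (2) adapt GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
Result: Shows consistent gains on three online mobile-agent benchmarks, supporting the effectiveness of the approach.
Insight: Online training and curriculum learning can improve the generality and robustness of GUI agents, and the composite reward design is important for balancing task success against efficiency.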
Abstract: Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods train GUI agents in offline environments using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environments. We present MobileGUI-RL, a scalable framework that trains GUI agents in online environments. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.
[113] Conditional Graph Neural Network for Predicting Soft Tissue Deformation and Forces
Madina Kojanazarova,Florentin Bieder,Robin Sandkühler,Philippe C. Cattin
Main category: cs.LG
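TL;DR: The paper proposes a conditional graph neural network (cGNN) that predicts soft tissue deformation and forces in virtual environments, addressing high deformability and data scarcity, and improves the model by fine-tuning on experimental data.
Details
Motivation: Soft tissue simulation in virtual environments matters for medical applications, but high deformability and the need for precise force feedback make it challenging. Existing methods rely on segmentation, meshing, and stiffness estimation and struggle to meet these needs.
Contribution: Proposes a data-driven cGNN that predicts soft tissue deformation and interaction forces, and uses transfer learning from simulated to experimental data to overcome data scarcity.
Method: The cGNN takes surface points and the location of applied forces as input and predicts deformations and forces. It is pre-trained on mass-spring simulations and fine-tuned on experimental data to improve generalization.
Result: Deformation distance error of 0.35 ± 0.03 mm (deformations up to 30 mm) and absolute force error of 0.37 ± 0.05 N (forces up to 7.5 N).
Insight: Data-driven methods combined with transfer learning can handle the complexity of soft tissue simulation and apply to medicine and other fields that need realistic soft tissue simulation.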
Abstract: Soft tissue simulation in virtual environments is becoming increasingly important for medical applications. However, the high deformability of soft tissue poses significant challenges. Existing methods rely on segmentation, meshing and estimation of stiffness properties of tissues. In addition, the integration of haptic feedback requires precise force estimation to enable a more immersive experience. We introduce a novel data-driven model, a conditional graph neural network (cGNN) to tackle this complexity. Our model takes surface points and the location of applied forces, and is specifically designed to predict the deformation of the points and the forces exerted on them. We trained our model on experimentally collected surface tracking data of a soft tissue phantom and used transfer learning to overcome the data scarcity by initially training it with mass-spring simulations and fine-tuning it with the experimental data. This approach improves the generalisation capability of the model and enables accurate predictions of tissue deformations and corresponding interaction forces. The results demonstrate that the model can predict deformations with a distance error of 0.35$\pm$0.03 mm for deformations up to 30 mm and the force with an absolute error of 0.37$\pm$0.05 N for forces up to 7.5 N. Our data-driven approach presents a promising solution to the intricate challenge of simulating soft tissues within virtual environments. Beyond its applicability in medical simulations, this approach holds the potential to benefit various fields where realistic soft tissue simulations are required.
[114] Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs
Sofiia Chorna,Kateryna Tarelkina,Eloïse Berthier,Gianni Franchi
Main category: cs.LG
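TL;DR: The paper proposes a mechanistic interpretability framework based on concepts and structured knowledge graphs for global analysis of model behavior, revealing how concepts are represented, interact, and propagate inside a model.
Details
Motivation: Concept-based interpretability methods have traditionally focused on local explanations. This work extends them to mechanistic interpretability, analyzing how high-level semantic concepts are represented and interact inside a model and revealing latent circuits and information flow.
Contribution: Proposes a model-agnostic, scalable framework and the visualization tool BAGEL, which presents concept-class relationships in a structured knowledge graph so users can explore model behavior and improve trustworthiness.
Method: Quantifies how semantic concepts are represented across model layers, identifies latent circuits and information flow, and supports interactive analysis through a knowledge-graph visualization platform.
Result: Delivers the interactive tool BAGEL, which can reveal spurious correlations and information flow in a model and improves understanding of how deep learning models generalize.
Insight: Analyzing concept interactions from a global perspective helps identify a model's decision mechanisms and shows how dataset biases shape its behavior.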
Abstract: While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic interpretability. Our approach enables a global dissection of model behavior by analyzing how high-level semantic attributes (referred to as concepts) emerge, interact, and propagate through internal model components. Unlike prior work that isolates individual neurons or predictions, our framework systematically quantifies how semantic concepts are represented across layers, revealing latent circuits and information flow that underlie model decision-making. A key innovation is our visualization platform that we named BAGEL (for Bias Analysis with a Graph for global Explanation Layers), which presents these insights in a structured knowledge graph, allowing users to explore concept-class relationships, identify spurious correlations, and enhance model trustworthiness. Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases. The demonstration is available at https://knowledge-graph-ui-4a7cb5.gitlab.io/.
[115] Fair Domain Generalization: An Information-Theoretic View
Tangzheng Lian,Guanyu Hu,Dimitrios Kollias,Xinyu Yang,Oya Celiktutan
Main category: cs.LG
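TL;DR: The paper studies the combination of domain generalization (DG) and algorithmic fairness, formulates the FairDG problem, derives information-theoretic upper bounds on risk and fairness violations, and proposes the PAFDG framework, which it validates on real-world datasets.
Details
Motivation: DG methods usually focus only on expected risk in the target domain and ignore algorithmic fairness, while fairness methods do not account for domain shift. A unified framework is needed to address both.
Contribution: (1) Formulates the FairDG problem; (2) derives mutual information-based upper bounds on expected risk and fairness violations for multi-class classification with multi-group sensitive attributes; (3) proposes PAFDG, which models the utility-fairness trade-off through Pareto optimization.
Method: Analyzes upper bounds on risk and fairness violations from an information-theoretic perspective and designs the PAFDG framework, which balances the two through Pareto optimization.
Result: On real-world vision and language datasets, PAFDG achieves better utility-fairness trade-offs than existing methods.
Insight: Information theory provides a theoretical basis for unifying domain generalization and fairness, and Pareto optimization is an effective way to balance them.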
Abstract: Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.
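Illustration (not from the paper): a utility-fairness trade-off is usually read off a Pareto front, the set of models that no other model beats on both objectives at once. The sketch below only filters a list of already-evaluated models down to that front; it is not PAFDG's optimization procedure, and the function name `pareto_front` and the numbers are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (utility, violation) pairs.

    Utility is to be maximized and fairness violation minimized. A point is
    dominated if another point is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for i, (u, v) in enumerate(points):
        dominated = any(
            (u2 >= u and v2 <= v) and (u2 > u or v2 < v)
            for j, (u2, v2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((u, v))
    return front


# Made-up (accuracy, fairness violation) results for four candidate models.
results = [(0.90, 0.20), (0.85, 0.10), (0.80, 0.15), (0.70, 0.05)]
front = pareto_front(results)
print(front)
```

Here the third model drops out because the second is both more accurate and fairer; the remaining three are each a defensible choice depending on how much utility one is willing to give up for fairness.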