Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 66]
- eess.IV [Total: 1]
- cs.RO [Total: 1]
- cs.LG [Total: 7]
- q-fin.CP [Total: 1]
- cs.CR [Total: 1]
- cs.IR [Total: 2]
- cs.AI [Total: 4]
cs.CL
[1] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
Zhiqiang Kou,Junyang Chen,Xin-Qiang Cai,Ming-Kun Xie,Biao Liu,Changwei Wang,Lei Feng,Yuheng Jia,Gang Niu,Masashi Sugiyama,Xin Geng
Main category: cs.CL
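TL;DR: The paper proposes a multi-label perspective for evaluating toxic generation by large language models (LLMs). It introduces three multi-label benchmark datasets and a pseudo-label-based training method that substantially improves toxicity detection.
Details
Motivation: Current toxicity detectors rely mainly on single-label benchmarks, which cannot capture the multi-dimensional and ambiguous nature of real-world toxic prompts and therefore bias evaluation.
Contribution: 1) Three new multi-label benchmark datasets; 2) a pseudo-label-based toxicity detection method.
Method: Builds multi-label benchmarks by drawing data from public toxicity datasets and annotating it with a 15-category taxonomy; proves theoretically that, on the released datasets, training with pseudo-labels outperforms single-label supervision; develops a pseudo-label-based detection method.
Result: Experiments show the method significantly outperforms advanced baselines, including GPT-4o and DeepSeek.
Insight: A multi-label perspective reflects the complexity of toxicity more faithfully, and pseudo-label training is an effective way to improve detection.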
Abstract: Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
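To illustrate why single-label evaluation can understate toxicity, the sketch below scores a detector that emits one label per prompt against multi-label ground truth. The category names, the example data, and the two metrics (label recall and Hamming loss) are illustrative assumptions, not the paper's benchmarks or its evaluation protocol.

```python
# Illustrative only: label names and data are hypothetical,
# not taken from Q-A-MLL, R-A-MLL, or H-X-MLL.

def label_recall(true_sets, pred_sets):
    """Fraction of ground-truth labels that the detector recovered."""
    hit = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    total = sum(len(t) for t in true_sets)
    return hit / total if total else 1.0


def hamming_loss(true_sets, pred_sets, num_labels):
    """Average fraction of label slots (out of num_labels) that are wrong."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * num_labels)


# Two prompts; the first is toxic along two dimensions at once.
truth = [{"insult", "threat"}, {"hate"}]
# A single-label detector can name at most one category per prompt.
single_label_preds = [{"insult"}, {"hate"}]

recall = label_recall(truth, single_label_preds)
loss = hamming_loss(truth, single_label_preds, num_labels=15)
print(recall, loss)
```

Both predictions would count as correct under top-1 accuracy, yet one of the three ground-truth labels is missed, which is the kind of gap a multi-label benchmark makes visible.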
[2] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling
Shiyu Ji,Farnoosh Hashemi,Joice Chen,Juanwen Pan,Weicheng Ma,Hefan Zhang,Sophia Pan,Ming Cheng,Shubham Mohole,Saeed Hassanpour,Soroush Vosoughi,Michael Macy
Main category: cs.CL
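TL;DR: The paper proposes using large language models (LLMs) to automatically generate and label debate data for analyzing and annotating rhetorical strategies, and validates the resulting model's generalization across multiple domains.
Details
Motivation: Rhetorical strategies are central to persuasive communication, but existing research depends on human annotation, which is costly and hard to scale.
Contribution: An LLM-based framework for automatically generating and labeling debate data, enabling high-performance classification of rhetorical strategies that generalizes across domains.
Method: Uses LLMs to generate and label synthetic debate data under a four-part rhetorical typology, then fine-tunes transformer-based classifiers on it.
Result: The model performs well on human-labeled data and on external corpora, and is applied to persuasiveness prediction and to the analysis of political debates.
Insight: The analysis finds increased use of affective argument in U.S. presidential debates, illustrating the method's practical value and generalization.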
Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. The associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) the improvement in persuasiveness prediction from incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive argument.
[3] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning
Junlin Wu,Xianrui Zhong,Jiashuo Sun,Bolian Li,Bowen Jin,Jiawei Han,Qingkai Zeng
Main category: cs.CL
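TL;DR: The paper proposes Structure-R1, a framework that uses reinforcement learning to dynamically generate structured representations of retrieved content for LLM reasoning, improving information density and reasoning performance over traditional RAG systems.
Details
Motivation: Large language models (LLMs) reason well but are constrained by limited access to explicit structured knowledge. Traditional retrieval-augmented generation (RAG) systems typically operate over unstructured text, which gives low information density and suboptimal reasoning.
Contribution: 1. The Structure-R1 framework, which dynamically generates and adapts structured representations to the demands of multi-step reasoning; 2. a self-reward verification mechanism that checks that generated structures are correct and self-contained; 3. validation on multiple knowledge-intensive tasks.
Method: Trains a content representation policy with reinforcement learning to generate task-specific structured representations, and verifies the structures through a self-reward mechanism.
Result: With a 7B-scale backbone model, it achieves competitive performance and matches much larger models; theoretical analysis shows that structured representations improve information density and contextual clarity.
Insight: Structured representations can strengthen LLM reasoning, and the dynamic generation and verification mechanism suggests a direction for future knowledge-intensive tasks.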
Abstract: Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.
[4] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana,Pittawat Taveekitworachai,Warit Sirichotedumrong,Potsawee Manakul,Kunat Pipatanakul
Main category: cs.CL
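TL;DR: The paper proposes two methods, Partial YaRN and VLAT, that extend the audio context window of large audio-language models (LALMs) and improve long-form audio understanding.
Details
Motivation: Existing LALMs are limited by short audio context windows and cannot make full use of long audio, so a context-extension method is needed.
Contribution: 1. Partial YaRN, a training-free, audio-only context-extension method; 2. VLAT, a training-time positional augmentation strategy.
Method: 1. Partial YaRN extends context by modifying only the audio token positions; 2. VLAT simulates diverse audio lengths during training to improve generalization.
Result: Experiments show that Partial YaRN outperforms the original models and that VLAT brings further gains, supporting understanding of long audio of unseen lengths.
Insight: Adjusting only the audio token positions works better than adjusting all positions; a training strategy that varies audio length is important for long-context tasks.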
Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
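The idea of modifying only audio token positions can be sketched as follows. This is a simplified illustration under stated assumptions: it uses uniform position interpolation (compressing the step between audio tokens) as a stand-in for YaRN's frequency-dependent RoPE rescaling, and the function name, the token-type encoding, and the `trained_audio_len` parameter are hypothetical rather than the paper's implementation.

```python
def partial_positions(token_types, trained_audio_len):
    """Assign position ids, compressing only the audio span.

    token_types: list of "text" or "audio", one entry per token.
    trained_audio_len: number of audio tokens the model handled in training
        (hypothetical parameter for this sketch).
    Text tokens always advance the position by 1; audio tokens advance it by
    a step below 1 when the clip is longer than the trained window.
    """
    n_audio = sum(1 for t in token_types if t == "audio")
    step = min(1.0, trained_audio_len / n_audio) if n_audio else 1.0
    positions = []
    pos = 0.0
    for t in token_types:
        positions.append(pos)
        pos += step if t == "audio" else 1.0
    return positions


# Four audio tokens squeezed into a window trained on two.
tokens = ["text", "audio", "audio", "audio", "audio", "text"]
positions = partial_positions(tokens, trained_audio_len=2)
print(positions)
```

In this sketch the spacing between consecutive text tokens is never changed, which mirrors the stated goal of leaving text positions intact; only the audio span is compressed.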
[5] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
Lina Berrayana,Ahmed Heakl,Muhammad Abdullah Sohail,Thomas Hofmann,Salman Khan,Wei Chen
Main category: cs.CL
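TL;DR: The paper studies hybrid architectures that couple discrete diffusion language models (DDLMs) with autoregressive models (ARMs), collaborating in text space and in latent space, and shows complementary strengths and computational savings on reasoning tasks.
Details
Motivation: ARMs are accurate but costly when generating long sequences. DDLMs generate in parallel and perform strongly on complex reasoning and long-term planning. The study asks whether a hybrid architecture can combine these advantages.
Contribution: 1. A hybrid architecture in which DDLMs and ARMs collaborate; 2. a projector for latent-space communication that significantly improves accuracy; 3. evidence of computational savings, surpassing a baseline model while using far fewer tokens.
Method: 1. Text-space collaboration: the DDLM plans the reasoning process and the ARM produces the answer; 2. latent-space collaboration: a projector maps DDLM latents into the ARM's embedding space.
Result: Latent-space collaboration raises accuracy significantly (from 27% to 54% on DART-5 and from 0% to 14% on AIME24) and is efficient: with 64 planning tokens and about 5 execution tokens, it surpasses a baseline that uses 44 times more tokens.
Insight: Latent-space collaboration between a DDLM and an ARM can bypass text-generation limitations of diffusion models; the hybrid architecture substantially reduces compute while preserving accuracy.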
Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM’s embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM-to-ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
[6] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Sensen Gao,Shanshan Zhao,Xu Jiang,Lunhao Duan,Yong Xien Chng,Qing-Guo Chen,Weihua Luo,Kaifu Zhang,Jia-Wang Bian,Mingming Gong
Main category: cs.CL
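TL;DR: A systematic survey of multimodal retrieval-augmented generation (Multimodal RAG) for document understanding. It proposes a taxonomy based on domain, retrieval modality, and granularity, and discusses datasets, benchmarks, and open challenges.
Details
Motivation: Document understanding is critical in areas such as financial analysis and scientific discovery. OCR-based pipelines lose information and native multimodal LLMs (MLLMs) struggle with context modeling, whereas Multimodal RAG can handle the multimodal nature of documents more holistically.
Contribution: 1. A systematic survey of Multimodal RAG for document understanding. 2. A taxonomy based on domain, retrieval modality, and granularity. 3. A summary of key datasets, benchmarks, and future research directions.
Method: Analyzes the existing literature, organizes it under the proposed taxonomy, and reviews the use of graph structures and agentic frameworks.
Result: Summarizes progress, datasets, and benchmarks for Multimodal RAG, and identifies open challenges in efficiency, fine-grained representation, and robustness.
Insight: Multimodal RAG integrates text, tables, charts, and layout, offering a more complete approach to document understanding; efficiency and robustness need further work.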
Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.
[7] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA
Jingao Xu,Shuoyoucheng Ma,Xin Song,Rong Jiang,Hongkui Tu,Bin Zhou
Main category: cs.CL
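TL;DR: The paper proposes the EGP framework, which retrieves exemplar questions and their successful reasoning paths from the training data to dynamically guide an LLM agent's planning and relation exploration, significantly improving knowledge graph question answering (KGQA).
Details
Motivation: LLMs acting as interactive agents underperform on KGQA, largely because of the semantic gap between natural-language queries and structured knowledge graph representations. This leads to inefficient planning, and reasoning patterns in the training data go underused.
Contribution: The Exemplar-Guided Planning (EGP) framework, which combines entity templating with dynamic guidance from retrieved exemplar questions to strengthen the agent's planning; and a Smart Lookahead mechanism that improves exploration efficiency.
Method: Preprocesses training questions with entity templating and retrieves similar exemplars using semantic embeddings and a FAISS index; uses them to guide the LLM during task decomposition and relation exploration; adds Smart Lookahead to streamline path exploration.
Result: On the WebQSP and CWQ datasets, PoG-EGP significantly outperforms the baseline PoG system and other compared methods.
Insight: Reasoning patterns and exemplar questions from the training data help bridge the semantic gap; exemplar-guided planning and efficient exploration are key to strong KGQA performance.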
Abstract: Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM’s planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
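The entity templating and exemplar retrieval steps can be sketched as below. This is a minimal stand-in, not the paper's pipeline: it replaces semantic embeddings and the FAISS index with bag-of-words cosine similarity over a Python list so that it runs with the standard library alone, and the `[ENT]` placeholder, the function names, and the relation names in the exemplar store are hypothetical.

```python
import math
from collections import Counter


def template(question, entities):
    """Replace entity mentions with a placeholder to normalize questions."""
    for entity in sorted(entities, key=len, reverse=True):
        question = question.replace(entity, "[ENT]")
    return question


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_template, exemplars, k=1):
    """Return the k exemplars whose templated question is most similar."""
    q = bag_of_words(query_template)
    ranked = sorted(
        exemplars,
        key=lambda ex: cosine(q, bag_of_words(ex["template"])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical exemplar store built from templated training questions.
exemplars = [
    {"template": "who directed [ENT]?", "path": ["film.directed_by"]},
    {"template": "where was [ENT] born?", "path": ["person.place_of_birth"]},
]

query = template("who directed Titanic?", ["Titanic"])
best = retrieve(query, exemplars, k=1)[0]
print(query, best["path"])
```

Templating is what lets a question about one film retrieve the reasoning path learned from a question about another; the retrieved path would then be passed to the LLM as guidance during task decomposition and relation exploration.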
[8] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Hong Ting Tsang,Jiaxin Bai,Haoyu Huang,Qiao Xiao,Tianshi Zheng,Baixuan Xu,Shujie Liu,Yangqiu Song
Main category: cs.CL
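TL;DR: AutoGraph-R1 is a reinforcement learning (RL) framework for knowledge graph (KG) construction that directly optimizes the KG for downstream task performance, closing the gap between KG construction and its application.
Details
Motivation: Conventional KG construction is decoupled from downstream applications such as question answering, which yields functionally suboptimal graph structures. AutoGraph-R1 aims to optimize the functional utility of the KG directly with RL.
Contribution: AutoGraph-R1, described as the first end-to-end RL framework for KG construction, together with two novel task-aware reward functions, one for graphs as knowledge carriers and one for graphs as knowledge indices.
Method: Frames graph generation as a policy learning problem and trains an LLM constructor, using the graph's functional utility in a RAG pipeline as the reward signal.
Result: Across multiple QA benchmarks, AutoGraph-R1 gives graph RAG methods significant gains over task-agnostic baseline graphs.
Insight: KG construction should shift from "intrinsic quality" to "functional utility", closing the loop between construction and application.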
Abstract: Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, their effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph’s functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building "intrinsically good" graphs to building "demonstrably useful" ones.
[9] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Baode Wang,Biao Wu,Weizhen Li,Meng Fang,Zuming Huang,Jun Huang,Haozhe Wang,Yanjie Liang,Ling Chen,Wei Chu,Yuan Qi
Main category: cs.CL
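TL;DR: The paper introduces LayoutRL, a reinforcement learning framework for layout understanding in scanned document parsing, and Infinity-Parser, a vision-language model trained with it on the new large-scale Infinity-Doc-400K dataset. The model reaches state-of-the-art performance on several benchmarks.
Details
Motivation: Existing supervised fine-tuning methods generalize poorly across complex document types, and high-quality training data is limited, so new methods for layout-aware parsing are needed.
Contribution: 1) The LayoutRL reinforcement learning framework, which optimizes layout understanding with multiple reward functions; 2) the Infinity-Doc-400K dataset; 3) robust parsing across domains, languages, and structural complexities.
Method: Trains the vision-language model Infinity-Parser with LayoutRL, using rewards that combine normalized edit distance, paragraph count accuracy, and reading order preservation.
Result: On benchmarks including OmniDocBench and olmOCR-Bench, it substantially outperforms specialized document parsing systems and general-purpose vision-language models.
Insight: Reinforcement learning with layout-aware rewards improves generalization in complex document parsing; large-scale data is a key enabler.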
Abstract: Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
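A composite reward of the kind described (normalized edit distance, paragraph count accuracy, reading order preservation) could look like the sketch below. The exact definitions of each term, the equal weighting, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_reward(pred, ref):
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest


def count_reward(pred_paras, ref_paras):
    """Penalize a mismatch in the number of paragraphs."""
    most = max(len(pred_paras), len(ref_paras))
    if most == 0:
        return 1.0
    return 1.0 - abs(len(pred_paras) - len(ref_paras)) / most


def order_reward(pred_paras, ref_paras):
    """Fraction of matched paragraph pairs emitted in reference order."""
    ranks = [ref_paras.index(p) for p in pred_paras if p in ref_paras]
    pairs = [(i, j) for i in range(len(ranks)) for j in range(i + 1, len(ranks))]
    if not pairs:
        return 1.0
    return sum(1 for i, j in pairs if ranks[i] < ranks[j]) / len(pairs)


def composite_reward(pred_paras, ref_paras):
    """Equal-weight average of the three terms (weights are an assumption)."""
    text = edit_reward("\n".join(pred_paras), "\n".join(ref_paras))
    return (text + count_reward(pred_paras, ref_paras)
            + order_reward(pred_paras, ref_paras)) / 3.0


reference = ["Title", "First paragraph.", "Second paragraph."]
swapped = ["Title", "Second paragraph.", "First paragraph."]
print(composite_reward(reference, reference), composite_reward(swapped, reference))
```

A parse with the right paragraphs in the wrong order keeps a perfect count term but loses on the order and edit-distance terms, which is the behavior a layout-aware reward is meant to encourage.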
[10] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
Hongcheng Liu,Yixuan Hou,Heyang Liu,Yuhao Wang,Yanfeng Wang,Yu Wang
Main category: cs.CL
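TL;DR: The paper studies the robustness of Speech Large Language Models (Speech-LLMs) to speech disfluency and introduces the VocalBench-DF evaluation framework, exposing the limitations of current models.
Details
Motivation: Existing evaluations usually rely on idealized speech input and overlook disfluencies that are common in practice, particularly those associated with conditions such as Parkinson's disease. Speech-LLM performance under these conditions therefore needs systematic evaluation and improvement.
Contribution: 1) VocalBench-DF, a framework for systematically evaluating disfluency; 2) an evaluation of 22 mainstream Speech-LLMs that reveals performance degradation on disfluent speech; 3) identification of the main bottlenecks and directions for improvement.
Method: 1) Design VocalBench-DF around a multi-dimensional taxonomy of disfluency; 2) test 22 mainstream Speech-LLMs under different disfluency conditions; 3) analyze the bottlenecks, namely phoneme-level processing and long-context modeling.
Result: Current Speech-LLMs degrade substantially on disfluent speech, indicating limited readiness for real-world use.
Insight: Strengthening recognition and reasoning capability in model components and pipelines can substantially improve robustness. Better disfluency handling is urgently needed to build more inclusive speech models.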
Abstract: While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson’s disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability in components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
[11] Large-scale User Game Lifecycle Representation Learning
Yanjie Gou,Jiangming Liu,Kouying Xue,Yi Hua
Main category: cs.CL
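TL;DR: The paper proposes a method for large-scale user game lifecycle representation learning that addresses game sparsity and game imbalance. By introducing the User Game Lifecycle (UGL) and improved behavior strategies, it significantly improves game advertising and recommendation.
Details
Motivation: As the video game industry grows rapidly, game platforms need effective advertising and recommendation systems. Existing recommendation methods handle game sparsity and imbalance poorly, so a new representation learning method is needed.
Contribution: 1. The User Game Lifecycle (UGL), introduced to address game sparsity; 2. two new behavior strategies for extracting short- and long-term user interests; 3. an Inverse Probability Masking strategy to handle game imbalance.
Method: 1. Enrich user behaviors through UGL representation learning; 2. apply two behavior strategies to improve interest extraction; 3. use Inverse Probability Masking to counter the imbalance in game data.
Result: Offline, UGL representations raise AUC by 1.83% on average; online, they raise CVR by 21.67% for game advertising. For in-game item recommendation, AUC rises by 0.5% and ARPU by 0.82%.
Insight: By addressing sparsity and imbalance, UGL representation learning substantially improves recommendation and advertising, and is particularly effective at extracting short- and long-term interests.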
Abstract: The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short- and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the model, achieving a 1.83% AUC offline increase on average and a 21.67% CVR online increase on average for game advertising, and a 0.5% AUC offline increase and a 0.82% ARPU online increase for in-game item recommendation.
[12] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
Lee Qi Zun,Mohamad Zulhilmi Bin Abdul Halim,Goh Man Fye
Main category: cs.CL
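TL;DR: The paper specializes the MedGemma model to generate high-quality clinical image captions, improving a multimodal retrieval-augmented generation (RAG) system over the Malaysian Clinical Practice Guidelines. Data scarcity is addressed with knowledge distillation and synthetic data, and the model is fine-tuned efficiently with QLoRA. Experiments confirm clear gains in classification and caption accuracy.
Details
Motivation: Clinical image captions from general vision-language models lack clinical specificity and factual grounding, which limits multimodal RAG systems in clinical decision support. A specialized model that produces high-fidelity medical image captions is therefore needed.
Contribution: 1. A framework for specializing MedGemma to generate high-quality clinical image captions; 2. a knowledge distillation pipeline and synthetic dataset that address data scarcity; 3. an application of the RAGAS framework to evaluate caption faithfulness, relevancy, and correctness, validating the model.
Method: 1. Generate a synthetic dataset via knowledge distillation; 2. fine-tune MedGemma with the parameter-efficient QLoRA method; 3. evaluate with both classification accuracy and the RAGAS framework.
Result: The fine-tuned MedGemma improves substantially in classification performance and in caption faithfulness and correctness, showing that it can generate reliable medical descriptions.
Insight: Knowledge distillation and efficient fine-tuning can markedly improve medical vision-language models, giving multimodal RAG systems a dependable basis for clinical decision support.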
Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
[13] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
Hongcheng Liu,Pingjie Wang,Yuhao Wang,Siqu Ou,Yanfeng Wang,Yu Wang
Main category: cs.CL
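TL;DR: The paper examines multimodal large language models (MLLMs) on active reasoning tasks and finds that they lag far behind their passive-reasoning performance, exposing a limitation of current MLLMs. Using the proposed GuessBench benchmark, it identifies fine-grained perception and timely decision-making as the main challenges and points to perceptual enhancements and thinking-oriented methods as future directions.
Details
Motivation: Existing MLLM evaluations focus on passive reasoning and overlook real-world settings with incomplete information. The paper investigates whether MLLMs can actively acquire missing evidence and iteratively refine their decisions.
Contribution: 1. GuessBench, a benchmark for systematically evaluating active reasoning in MLLMs; 2. an evaluation of 20 advanced MLLMs that reveals their limitations in active reasoning; 3. identification of fine-grained perception and timely decision-making as the main challenges, with suggested research directions.
Method: GuessBench poses active reasoning tasks in which an MLLM must select a target image under incomplete information and iteratively refine its decisions. Twenty models are evaluated, and ablation studies compare perceptual enhancements with thinking-oriented methods.
Result: MLLMs perform markedly worse on active reasoning than on passive reasoning. Perceptual enhancements mainly help smaller models, whereas thinking-oriented methods give gains across model sizes.
Insight: MLLMs have substantial room to improve on active reasoning. Future work should target fine-grained perception and timely decision-making, combining perceptual enhancements with thinking-oriented methods.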
Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.
[14] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Yao Huang,Yitong Sun,Yichi Zhang,Ruochen Zhang,Yinpeng Dong,Xingxing Wei
Main category: cs.CL
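TL;DR: DeceptionBench is a benchmark that systematically evaluates deceptive behavior by AI systems, in particular large language models and large reasoning models, in realistic scenarios across domains such as economy and healthcare. It reveals that models are vulnerable to becoming deceptive under incentives or coercion.
Details
Motivation: Large language models (LLMs) excel at cognitive tasks, but their rapidly growing capabilities also bring potential deceptive behaviors that can cause severe harm in high-stakes applications. Systematic evaluation of deception in realistic scenarios is still lacking.
Contribution: 1. DeceptionBench, described as the first benchmark to systematically evaluate AI deception; 2. coverage of five societal domains with 150 scenarios and more than 1,000 samples; 3. a study of intrinsic factors (egoistic or sycophantic tendencies) and extrinsic factors (incentives or coercion).
Method: 1. Design 150 scenarios across domains including economy and healthcare; 2. analyze deception under static and dynamic (multi-turn interaction) conditions; 3. examine how intrinsic tendencies and extrinsic incentives affect deception.
Result: Experiments show that deception increases markedly under incentives or coercion. Current models lack resistance to manipulative contextual cues, and stronger safeguards are urgently needed.
Insight: AI deception in realistic interaction is complex; current models remain susceptible to incentives and coercion, and safer AI systems are needed.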
Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and underscoring the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.
[15] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?
Ashutosh Bajpai,Tanmoy Chakraborty
Main category: cs.CL
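TL;DR: The paper introduces TEMP-ReCon, a new benchmark for evaluating the temporal referential consistency of LLMs on time-sensitive questions, and finds that LLMs fall short. The authors propose UnTRaP, a model based on reasoning path alignment, to improve this consistency.
Details
Motivation: As LLMs are used more in time-sensitive fields such as law, healthcare, and finance, their accuracy along the temporal dimension becomes critical, yet little work evaluates or improves the temporal consistency of LLMs.
Contribution: 1. TEMP-ReCon, a new benchmark for evaluating temporal referential consistency in LLMs; 2. the finding that LLMs exhibit insufficient temporal consistency; 3. the UnTRaP model for improving it.
Method: Proposes UnTRaP, a reasoning path alignment-based model that aims to improve the temporal referential consistency of LLMs.
Result: Experiments show that UnTRaP improves temporal consistency more than several baseline models.
Insight: LLMs show temporal consistency problems in time-sensitive applications, and reasoning path alignment can mitigate them.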
Abstract: The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.
[16] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
Rares Dolga,Lucas Maystre,Tudor Berariu,David Barber
Main category: cs.CL
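TL;DR: The paper proposes a dynamic character grouping method that reuses the structure of existing BPE tokenization, requires no additional model, and yields efficient, flexible, language-agnostic representations.
Details
Motivation: Subword tokenizers such as BPE represent rare words inefficiently and require large embedding matrices, while character-level models hit performance bottlenecks in Transformer architectures. The authors seek a dynamic method that combines the strengths of both.
Contribution: A dynamic character grouping method that leverages the structure of BPE tokenization, appending explicit end-of-patch markers and adding a second-level BPE compression stage to obtain efficient representations.
Method: Appends explicit end-of-patch markers to BPE tokens and introduces a second-level BPE compression stage to control patch granularity.
Result: The method matches or exceeds dynamic entropy-based and whitespace-based patching strategies while keeping the vocabulary compact.
Insight: Dynamic character grouping is language-agnostic and avoids dependence on auxiliary models, which makes it suitable for multilingual tasks.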
Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
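The first step, turning BPE tokens into character patches closed by an explicit end-of-patch marker, can be sketched as follows. The marker string, the function names, and the hand-written token list are assumptions for illustration; a real BPE tokenizer would supply the tokens, and the second-level BPE compression stage is not shown.

```python
EOP = "<eop>"  # hypothetical end-of-patch marker symbol


def to_char_patches(bpe_tokens):
    """Flatten BPE tokens into characters, closing each token with EOP."""
    sequence = []
    for token in bpe_tokens:
        sequence.extend(token)  # one symbol per character
        sequence.append(EOP)    # patch boundary comes from the BPE token boundary
    return sequence


def split_patches(sequence):
    """Recover the patches by cutting the character stream at each EOP."""
    patches, current = [], []
    for symbol in sequence:
        if symbol == EOP:
            patches.append("".join(current))
            current = []
        else:
            current.append(symbol)
    if current:  # tolerate a trailing patch without a closing marker
        patches.append("".join(current))
    return patches


# Tokens as a BPE tokenizer might produce them (written by hand here).
tokens = ["un", "believ", "able"]
stream = to_char_patches(tokens)
print(stream)
print(split_patches(stream))
```

Because the boundaries come from the existing BPE segmentation, no whitespace rule or auxiliary boundary model is needed to decide where a patch ends.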
[17] Latent Reasoning in LLMs as a Vocabulary-Space Superposition
Jingcheng Deng,Liang Pang,Zihao Wei,Shichen Xu,Zenghao Duan,Kun Xu,Yang Song,Huawei Shen,Xueqi Cheng
Main category: cs.CL
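TL;DR: The paper proposes Latent-SFT, a latent reasoning method that restricts the latent space to the column space of the LLM vocabulary, substantially reducing the computational cost of explicit reasoning while performing strongly on several datasets.
Details
Motivation: Explicit reasoning such as chain-of-thought prompting is effective but incurs substantial computational overhead. Latent reasoning lowers the cost, but performance drops significantly. The paper aims to close this gap. Contribution: 1. Restricts the latent space to a superposition over vocabulary probabilities; 2. Designs Latent-SFT, a two-stage learning framework; 3. Validates the method on several datasets with substantially lower computational cost.
Method: 1. Stage one: train a Latent Token Encoder with specialized attention masks to generate latent tokens; 2. Stage two: discard the encoder and train the LLM directly to generate latent tokens autonomously, optimized with KL and CE losses.
Result: On GSM8k, Latent-SFT matches explicit SFT performance while shortening reasoning chains by up to 4 times and outperforming prior latent methods; on Math500 and AIME24 it outperforms hidden-state-based approaches.
Insight: Latent reasoning is both the compression of a single path and the superposition of multiple paths; latent reasoning based on vocabulary probabilities has the advantage over hidden-state-based reasoning.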
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.
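One way to picture a latent token as a superposition over vocabulary probabilities is as a probability-weighted sum of vocabulary embedding vectors, which by construction stays in the span of those vectors. The sketch below shows only that arithmetic; the tiny embedding table, the function name, and the probabilities are assumptions for illustration, not the Latent-SFT implementation.

```python
from typing import List


def superpose(probs: List[float], embeddings: List[List[float]]) -> List[float]:
    """Return sum_i probs[i] * embeddings[i], a vector in the span of the embeddings."""
    if len(probs) != len(embeddings):
        raise ValueError("need one probability per vocabulary entry")
    dim = len(embeddings[0])
    latent = [0.0] * dim
    for p, row in zip(probs, embeddings):
        for j in range(dim):
            latent[j] += p * row[j]
    return latent


# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A one-hot distribution recovers an ordinary (explicit) token embedding.
one_hot_latent = superpose([0.0, 1.0, 0.0], toy_embeddings)

# A spread-out distribution blends several candidate tokens into one latent token.
mixed_latent = superpose([0.5, 0.5, 0.0], toy_embeddings)
```

The one-hot case corresponds to ordinary explicit decoding, while the mixed case keeps several candidate tokens alive in a single step.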
[18] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval
Qiyu Wu,Shuyang Cui,Satoshi Hayakawa,Wei-Yao Wang,Hiromi Wakaki,Yuki Mitsufuji
Main category: cs.CL
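TL;DR: The paper proposes a modality composition awareness (MCA) framework that uses a preference loss and a composition regularization objective to make unified multimodal encoders more robust under distribution shift and to improve multimodal retrieval.
Details
Motivation: Multimodal large language models (MLLMs) are flexible and advanced, but when trained with conventional contrastive learning they tend to learn modality shortcuts and perform poorly under distribution shift. A method for improving robustness is therefore needed. Contribution: The MCA framework, which uses a preference loss and a composition regularization objective to explicitly model the structural relationship between a composed representation and its unimodal parts, improving performance under distribution shift.
Method: A preference loss forces multimodal embeddings to outperform their unimodal counterparts, and a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts.
Result: Across several benchmarks, MCA improves retrieval performance under distribution shift, supporting its effectiveness.
Insight: Modality composition awareness is an important principle for making unified multimodal encoders robust, particularly when handling composed inputs.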
Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While unified encoders are flexible and advanced, we identify that those trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
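The abstract does not give the form of the preference loss. A minimal sketch, assuming a logistic preference over two similarity scores (the composed multimodal embedding versus a unimodal embedding, each scored against the same target), could look like the following; the function name and the scores are hypothetical.

```python
import math


def preference_loss(score_multimodal: float, score_unimodal: float) -> float:
    """Logistic preference loss: small when the multimodal score beats the unimodal one.

    Computes -log(sigmoid(margin)) as log(1 + exp(-margin)).
    This form is an assumption, not the formulation used in the paper.
    """
    margin = score_multimodal - score_unimodal
    return math.log1p(math.exp(-margin))


# Hypothetical similarity scores against the correct retrieval target.
loss_when_composed_wins = preference_loss(0.9, 0.4)
loss_when_shortcut_wins = preference_loss(0.4, 0.9)
```

The loss shrinks as the composed embedding pulls ahead of its unimodal counterpart, which is the behavior the abstract describes for discouraging modality shortcuts.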
[19] Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Vihari Piratla,Purvam Jain,Darshan Singh,Partha Talukdar,Trevor Cohn
Main category: cs.CL
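TL;DR: The paper revisits the cross-lingual gap from a statistical viewpoint, proposes that the variance of target-language responses is its main cause, and validates this hypothesis experimentally.
Details
Motivation: Prior work attributes the cross-lingual gap to differences between the latent representations of source and target languages; this paper instead hypothesizes that the variance of target-language responses is the main cause and reinterprets the problem statistically. Contribution: Formalizes the cross-lingual gap for the first time through a bias-variance decomposition, and confirms the importance of target-language response variance through experiments and interventions. It also presents a simple prompt instruction that substantially improves target-language accuracy.
Method: Formalize the cross-lingual gap with a bias-variance decomposition and test the hypothesis with several inference-time interventions that reduce response variance. Experiments show that the prompt instruction markedly reduces variance.
Result: Controlling response variance improves target-language accuracy by 20-25%, supporting the hypothesis and the effectiveness of the approach.
Insight: The key factor behind the cross-lingual gap is the variance of target-language responses rather than the conventionally assumed divergence in latent representations. This finding suggests a new direction for improving cross-lingual model performance.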
Abstract: Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has attributed the cross-lingual gap to divergence in latent representations between source and target languages. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence that supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.
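For readers unfamiliar with the term, the sketch below checks the generic textbook bias-variance identity on toy numbers: the mean squared error of repeated responses equals the squared bias plus the variance. How the paper maps responses and languages onto this identity is not stated in the abstract, so the scores, the target, and the function name here are purely illustrative assumptions.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def bias_variance(samples: List[float], target: float) -> Tuple[float, float, float]:
    """Return (squared bias, variance, mean squared error) of samples around target."""
    mean = fmean(samples)
    bias_sq = (mean - target) ** 2
    variance = pvariance(samples, mu=mean)
    mse = fmean([(s - target) ** 2 for s in samples])
    return bias_sq, variance, mse


# Hypothetical correctness scores of four sampled responses; 1.0 is the ideal score.
toy_scores = [0.0, 1.0, 1.0, 0.0]
toy_bias_sq, toy_variance, toy_mse = bias_variance(toy_scores, target=1.0)
```

In this toy case half of the error comes from the variance term, so an intervention that only makes responses more consistent would already reduce the total error.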
[20] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
Jinliang Liu
Main category: cs.CL
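TL;DR: The paper proposes ParallaxRAG, a multi-view knowledge-graph-based retrieval-augmented generation (KG-RAG) framework for multi-hop reasoning that exploits attention-head diversity to improve retrieval quality and reduce hallucination, with competitive experimental results.
Details
Motivation: Large language models (LLMs) often hallucinate and struggle with multi-hop reasoning, and existing KG-RAG methods rely on flat embeddings and noisy path exploration, so a more robust solution is needed. Contribution: ParallaxRAG, which symmetrically decouples queries and graph triples into multi-view spaces, improves retrieval through attention-head diversity, supports step-wise reasoning, and reduces hallucination.
Method: Decouple queries and graph triples into multi-view spaces, and use the specialization of attention heads in semantic relations at different reasoning stages to build clean subgraphs that guide the LLM through step-wise reasoning.
Result: Experiments on WebQSP and CWQ show competitive retrieval and QA performance, reduced hallucination, and good generalization.
Insight: Multi-view attention-head specialization is a viable direction for knowledge-grounded multi-hop reasoning and offers a new basis for step-wise reasoning in LLMs.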
Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.
[21] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
Dongjun Kim,Chanhee Park,Chanjun Park,Heuiseok Lim
Main category: cs.CL
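TL;DR: KITE is a benchmark dedicated to evaluating the Korean instruction-following abilities of large language models (LLMs), filling a gap in the evaluation of open-ended instruction tasks in Korean.
Details
Motivation: Current LLM evaluation focuses mainly on English and overlooks the distinctive grammar and cultural traits of other languages. Korean, with its complex grammar, honorific system, and dual numbering systems, lacks a dedicated tool for evaluating instruction-following ability. Contribution: The KITE benchmark, the first to focus on evaluating open-ended instruction tasks in Korean, combining automated metrics with human evaluation and revealing performance differences between models on Korean tasks.
Method: Design a task set covering both general and Korean-specific instructions, and evaluate it with a combination of automated metrics and human ratings.
Result: KITE reveals performance differences between models on Korean instruction tasks, providing a useful reference for LLM development.
Insight: KITE is both a tool for Korean evaluation and a template for similar work on other low-resource languages, supporting the development of multilingual LLMs.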
Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.
[22] Finetuning LLMs for EvaCun 2025 token prediction shared task
Josef Jon,Ondřej Bojar
Main category: cs.CL
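TL;DR: The paper describes a system submitted to the token prediction task of EvaCun 2025, based on fine-tuning large language models (LLMs) including Command-R, Mistral, and Aya Expanse. The authors made no task-specific adjustments to or preprocessing of the data and compare three prompting approaches.
Details
Motivation: The goal is to address token prediction in the EvaCun 2025 shared task; although the authors have limited knowledge of the subject field and the languages, they aim to obtain effective predictions by fine-tuning LLMs. Contribution: Shows the potential of plainly fine-tuned LLMs (with no data adjustments) for token prediction and compares the performance of different prompting approaches.
Method: Fine-tune three LLMs (Command-R, Mistral, and Aya Expanse) directly on the training data provided for the task, without preprocessing, then compare three prompting strategies.
Result: The prompting approaches are evaluated on a held-out part of the data; the abstract does not report specific metrics.
Insight: Even without domain-specific knowledge or data processing, simple LLM fine-tuning can perform well on token prediction, and the choice of prompt has a notable effect on results.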
Abstract: In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.
[23] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Tingting Chen,Beibei Lin,Zifeng Yuan,Qiran Zou,Hongyu He,Yew-Soon Ong,Anirudh Goyal,Dianbo Liu
Main category: cs.CL
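TL;DR: HypoSpace is a diagnostic suite that evaluates the creativity of language models as hypothesis generators on scientific problems with multiple explanations, focusing on the Validity, Uniqueness, and Recovery of the proposed hypothesis sets.
Details
Motivation: As language models are used more widely in scientific workflows, it becomes critical to evaluate their ability to propose sets of explanations, because many scientific problems are underdetermined (several hypotheses are consistent with the same observations). Contribution: HypoSpace, a diagnostic suite for evaluating how well language models generate hypothesis sets, which systematically quantifies the Validity, Uniqueness, and Recovery of those sets for the first time.
Method: HypoSpace treats language models as samplers of finite hypothesis sets and measures three indicators using deterministic validators and exactly enumerated hypothesis spaces (causal graphs, voxel reconstruction, and Boolean genetic interactions).
Result: As the admissible space grows, Validity stays high while Uniqueness and Recovery decline, revealing mode collapse that conventional correctness metrics cannot detect.
Insight: HypoSpace exposes limitations of language models as hypothesis-set generators, such as mode collapse, and provides a controlled evaluation tool for methods that explore and cover multi-explanation problems.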
Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
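The three indicators named in the abstract can be sketched on a toy example. The exact formulas below are an assumed reading of the one-line definitions (fraction of proposals that are admissible, fraction of proposals that are distinct, and fraction of the admissible set that was proposed) and may differ from the paper's; the hypothesis strings and the function name are hypothetical.

```python
from typing import Dict, List, Set


def hypothesis_set_metrics(proposals: List[str], admissible: Set[str]) -> Dict[str, float]:
    """Compute assumed versions of Validity, Uniqueness, and Recovery."""
    if not proposals or not admissible:
        raise ValueError("proposals and admissible set must be non-empty")
    n = len(proposals)
    distinct = set(proposals)
    valid_count = sum(1 for p in proposals if p in admissible)
    return {
        # Share of proposals consistent with the observations.
        "validity": valid_count / n,
        # Share of proposals that are not repeats of another proposal.
        "uniqueness": len(distinct) / n,
        # Share of the enumerated admissible set that was proposed at least once.
        "recovery": len(distinct & admissible) / len(admissible),
    }


# Hypothetical admissible causal structures and a model's four proposals.
toy_admissible = {"A->B", "B->A", "A<-C->B", "A->C->B"}
toy_proposals = ["A->B", "A->B", "B->A", "A->D"]
toy_metrics = hypothesis_set_metrics(toy_proposals, toy_admissible)
```

Here most proposals are valid, yet only half of the admissible set is covered, which is the kind of gap a correctness-only metric would not show.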
[24] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection
Joshua Wolfe Brook,Ilia Markov
Main category: cs.CL
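TL;DR: The paper proposes using large language models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of hate speech detection (HSD) classifiers, improving detection in both textual and multimodal settings.
Details
Motivation: Hate speech detection struggles with implicit or complex textual and multimodal content, and existing methods typically make little use of background information, which limits detection quality. Contribution: An LLM-based dynamic knowledge base approach that generates background context and incorporates it into HSD classifiers, improving detection performance.
Method: Examines two context generation strategies (named-entity-focused and full-text prompting) and compares four ways of incorporating context (text concatenation, embedding concatenation, hierarchical transformer-based fusion, and LLM-driven text enhancement).
Result: On the textual Latent Hatred dataset and the multimodal MAMI dataset, performance improves by up to 3 and 6 F1 points respectively, showing that both the contextual information and the way it is incorporated matter.
Insight: Dynamically generating context and incorporating it well is key to improving hate speech detection, and incorporation methods for multimodal settings need further refinement.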
Abstract: This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.
[25] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth
Helia Hashemi,Victor Rühle,Saravan Rajmohan
Main category: cs.CL
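TL;DR: The paper proposes a cost-aware retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list and is trained with reinforcement learning, improving efficiency (latency down by about 16-20%) without hurting effectiveness (exact match up by about 5% on average).
Details
Motivation: Existing retrieval-augmented reasoning models perform strongly but are computationally expensive, since both the retrieval and reasoning stages consume substantial resources. Contribution: 1. Dynamically adjusts the length of the retrieved document list; 2. Proposes a cost-aware advantage function for training; 3. Explores memory-bound and latency-bound implementations.
Method: 1. Dynamic retrieval depth; 2. Cost-aware training within a reinforcement learning framework; 3. Implementations for proximal and group relative policy optimization algorithms.
Result: Experiments on seven QA datasets show latency reduced by about 16-20% and exact match improved by about 5% on average.
Insight: Dynamic retrieval depth and cost-aware training can balance efficiency and effectiveness.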
Abstract: Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
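The abstract does not define the cost-aware advantage function. As one simple possibility, the sketch below subtracts a weighted cost from each rollout's reward and then normalizes within the group, in the style of group-relative advantages. The penalty form, the `cost_weight` value, and the function name are assumptions and should not be read as the paper's formulation.

```python
from statistics import fmean, pstdev
from typing import List


def cost_aware_group_advantages(
    rewards: List[float],
    costs: List[float],
    cost_weight: float = 0.1,  # hypothetical trade-off between reward and cost
    eps: float = 1e-8,
) -> List[float]:
    """Group-normalized advantages of (reward - cost_weight * cost)."""
    shaped = [r - cost_weight * c for r, c in zip(rewards, costs)]
    mean = fmean(shaped)
    std = pstdev(shaped)
    # eps avoids division by zero when every rollout has the same shaped reward.
    return [(s - mean) / (std + eps) for s in shaped]


# Two hypothetical rollouts with the same answer quality; the second one
# retrieved more documents and therefore cost more.
toy_advantages = cost_aware_group_advantages(rewards=[1.0, 1.0], costs=[2.0, 10.0])
```

With equal rewards, the cheaper rollout receives the higher advantage, so the policy would be nudged toward shallower retrieval when extra documents do not help.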
[26] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework
Shayan Rokhva,Mousa Alizadeh,Maryam Abdollahi Shamami
Main category: cs.CL
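TL;DR: The paper proposes a hybrid framework combining lexicon rules, fuzzy logic, and a Transformer to improve the accuracy and interpretability of sentiment analysis, performing especially well on informal and domain-specific text.
Details
Motivation: Existing sentiment analysis models often perform poorly on informal language and domain-specific text, and in particular struggle to capture sentiment polarity and intensity accurately. Contribution: A novel hybrid framework that combines VADER, DistilBERT, and fuzzy logic to produce continuous sentiment scores, improving the accuracy and interpretability of sentiment analysis.
Method: 1. Generate initial sentiment scores with VADER; 2. Refine them in a two-stage adjustment using confidence scores from DistilBERT; 3. Apply a custom fuzzy inference system that maps the scores onto the 0 to 1 range.
Result: Evaluated on four domain-specific datasets, the framework shows better distributional alignment, better identification of sentiment extremes, and fewer misclassifications.
Insight: Combining symbolic reasoning with neural models can improve the interpretability and fine-grained performance of sentiment analysis, particularly in linguistically dynamic settings.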
Abstract: Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model's robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.
[27] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Pengkai Wang,Qi Zuo,Pengwei Liu,Zhijie Sang,Congkai Xie,Hongxia Yang
Main category: cs.CL
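TL;DR: The paper proposes the ORBIT framework, which uses rubric-based incremental training to address ambiguous rewards for LLMs in open-ended tasks such as medical dialogue, substantially improving model performance.
Details
Motivation: LLMs excel in domains where rewards can be programmatically verified, such as mathematics and code, but are limited in open-ended domains such as medical consultation where rewards are ambiguous. ORBIT aims to fill this gap with rubric-based feedback. Contribution: The ORBIT framework, which combines synthetic dialogue generation with dynamic rubric creation and uses the rubrics to guide incremental RL without external knowledge or manual rules, substantially improving medical dialogue performance.
Method: Generate rubrics dynamically and use them to guide incremental reinforcement learning (RL) on synthetic dialogue data for efficient training.
Result: On the HealthBench-Hard benchmark, the performance of Qwen3-4B-Instruct rises from 7.0 to 27.2, state of the art for models of this scale.
Insight: Rubric-driven RL improves more than headline numbers, delivering consistent gains across diverse medical scenarios and offering a scalable training strategy for open-ended tasks.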
Abstract: Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
cs.CV [Back]
[28] GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments
Leela Krishna,Mengyang Zhao,Saicharithreddy Pasula,Harshit Rajgarhia,Abhishek Mukherji
Main category: cs.CV
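TL;DR: GAZE is an automated pre-annotation pipeline that converts raw long-form video into high-quality multimodal training data for world models, substantially improving annotation efficiency and privacy protection.
Details
Motivation: Manual annotation of multimodal data is slow and expensive, which hinders the training of robust world models. GAZE aims to solve this with an automated pipeline. Contribution: 1. Designs the GAZE pipeline, which supports automated annotation of multimodal data; 2. Reduces human review volume by more than 80% through conservative skipping of low-salience segments; 3. Integrates privacy protection and data governance features.
Method: 1. Normalize 360-degree video formats and shard the video for processing; 2. Apply several AI models (scene understanding, object tracking, audio transcription, and others) for dense pre-annotation; 3. Consolidate the signals into a structured output for rapid human validation.
Result: Efficiency improves (about 19 minutes saved per review hour), label density and consistency increase, and privacy safeguards are maintained.
Insight: Through automated pre-annotation and multimodal signal consolidation, GAZE supplies high-quality data for world-model training while meeting efficiency and governance needs.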
Abstract: Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
[29] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising
Yang Shi,Jingchao Wang,Liangsi Lu,Mingxuan Huang,Ruixin He,Yifeng Xie,Hanqian Liu,Minzhe Guo,Yangyang Liang,Weipeng Zhang,Zimeng Li,Xuhang Chen
Main category: cs.CV
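TL;DR: PC-UNet introduces a Poisson Variance and Mean Consistency Loss (PVMC-Loss) to handle Poisson noise in PET image denoising, improving image quality and physical consistency.
Details
Motivation: High-dose PET imaging carries radiation risk, while lowering the dose introduces Poisson noise that conventional denoising methods handle poorly, leading to distortions and artifacts. Contribution: The PC-UNet model and PVMC-Loss, which use the statistical properties of the physical data to improve denoising. PVMC-Loss is statistically unbiased in variance and gradient adaptation and is robust to minor data mismatches.
Method: PC-UNet builds on the U-Net architecture and uses PVMC-Loss to keep the denoising process physically consistent with Poisson statistics; PVMC-Loss acts as a Generalized Method of Moments implementation.
Result: Experiments on PET datasets show that PC-UNet improves image quality and physical consistency.
Insight: Deep learning methods that incorporate physical statistical properties can handle noise in medical images more effectively.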
Abstract: Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited because the doses needed for a high signal-to-noise ratio increase radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.
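The abstract does not spell out PVMC-Loss. The statistical fact it builds on is that a Poisson variable has variance equal to its mean, so the residual between noisy counts and a good denoised estimate should average to zero and have a mean square close to the mean of the estimate. The sketch below turns those two moment conditions into a toy penalty; the function name, the squared-gap form, and the numbers are assumptions and are not the paper's loss.

```python
from statistics import fmean
from typing import List


def poisson_moment_penalty(noisy: List[float], denoised: List[float]) -> float:
    """Penalize violations of two Poisson moment conditions on the residuals.

    Condition 1: residuals (noisy - denoised) average to zero.
    Condition 2: the mean squared residual matches the mean denoised intensity.
    """
    residuals = [y - x for y, x in zip(noisy, denoised)]
    mean_gap = fmean(residuals)
    variance_gap = fmean([r * r for r in residuals]) - fmean(denoised)
    return mean_gap ** 2 + variance_gap ** 2


# Hypothetical flat region with a denoised intensity of 4 counts per pixel.
toy_denoised = [4.0, 4.0, 4.0, 4.0]
consistent_penalty = poisson_moment_penalty([2.0, 6.0, 2.0, 6.0], toy_denoised)
too_smooth_penalty = poisson_moment_penalty([3.0, 5.0, 3.0, 5.0], toy_denoised)
```

The first input has residual spread matching the Poisson expectation and is not penalized; the second has too little spread for an intensity of 4 and is.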
[30] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Mor Ventura,Michael Toker,Or Patashnik,Yonatan Belinkov,Roi Reichart
Main category: cs.CV
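TL;DR: DeLeaker is a dynamic inference-time reweighting method that mitigates semantic leakage in text-to-image models by intervening on the model's attention maps.
Details
Motivation: Text-to-image (T2I) models are advancing rapidly but still suffer from semantic leakage, the transfer of semantic features between unrelated entities. Existing methods are typically optimization-based or depend on external inputs; DeLeaker offers a lightweight, optimization-free alternative. Contribution: 1. DeLeaker, a method that dynamically reweights attention maps at inference time; 2. The SLIM dataset and an automatic evaluation framework dedicated to semantic leakage; 3. Experiments showing that DeLeaker outperforms baselines without compromising generation quality.
Method: DeLeaker dynamically adjusts attention-map weights throughout the diffusion process, suppressing cross-entity interactions and strengthening the identity of each entity.
Result: DeLeaker mitigates semantic leakage more effectively than the baselines, even when they are given external information, while preserving generation quality.
Insight: Attention control is a key direction for semantically precise generation, and DeLeaker's results inform the design of future T2I models.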
Abstract: Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model’s attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
[31] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
Mingxuan Liu,Honglin He,Elisa Ricci,Wayne Wu,Bolei Zhou
Main category: cs.CV
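TL;DR: UrbanVerse is a data-driven real-to-sim system that turns city-tour videos into high-fidelity, interactive urban simulation scenes, supporting large-scale training of urban AI agents.
Details
Motivation: Existing urban simulation scenes either lack scalability or fail to capture real-world complexity, which limits the training of urban AI agents. UrbanVerse aims to address this. Contribution: 1) UrbanVerse-100K, a repository of 100k+ urban 3D assets with semantic and physical attributes; 2) UrbanVerse-Gen, which automatically extracts scene layouts from video and generates metric-scale 3D simulation scenes.
Method: Starting from crowd-sourced city-tour videos, UrbanVerse-Gen extracts scene layouts and instantiates 3D simulation scenes using the physics-aware asset library UrbanVerse-100K.
Result: UrbanVerse scenes preserve real-world semantics and layouts, and navigation policies trained in them improve success by 6.3% in simulation and 30.1% in zero-shot sim-to-real transfer.
Insight: A data-driven real-to-sim approach can markedly improve the diversity, realism, and training value of simulation scenes, supporting real-world deployment of urban AI agents.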
Abstract: Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, accomplishing a 300 m real-world mission with only two interventions.
[32] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu,Marta Tintore Gazulla,Yongqin Xian,Luc Van Gool,Federico Tombari
Main category: cs.CV
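TL;DR: MOBIUS is a universal instance segmentation model aimed at mobile devices that balances performance and compute through multi-modal bottleneck fusion and decoder pruning.
Details
Motivation: Existing foundation models perform well on instance-level perception such as object detection and segmentation, but their high computational cost limits use on resource-constrained platforms. MOBIUS aims to solve this and support deployment from high-end devices to mobile hardware. Contribution: 1. A bottleneck pixel decoder for multi-scale, multi-modal fusion; 2. A language-guided uncertainty calibration loss for adaptive decoder pruning; 3. A unified training strategy.
Method: The bottleneck pixel decoder reduces the computational overhead of multi-modal and multi-scale feature fusion, the language-guided uncertainty calibration loss is used to prune the decoder, and the unified training strategy improves efficiency.
Result: MOBIUS reduces pixel decoder FLOPs by up to 55% and transformer decoder FLOPs by up to 75% while maintaining state-of-the-art performance, using only a third of the training iterations.
Insight: Through fusion and pruning, MOBIUS balances performance and efficiency and sets a new benchmark for efficient segmentation on resource-constrained devices.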
Abstract: Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
[33] Composition-Grounded Instruction Synthesis for Visual Reasoning
Xinyi Gu,Jiayuan Mao,Zhang-Wei Hong,Zhuoran Yu,Pengyuan Li,Dhiraj Joshi,Rogerio Feris,Zexue He
Main category: cs.CV
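TL;DR: The paper proposes COGS, a data-efficient framework that decomposes seed questions into perception and reasoning factors and generates large numbers of synthetic question-answer pairs to strengthen the reasoning of multimodal large language models (MLLMs) on artificial images. Experiments show that COGS substantially improves performance on unseen questions.
Details
Motivation: MLLMs perform well on multimodal tasks but reason poorly in domains that lack large-scale annotated data, such as charts, webpages, and other artificial images. COGS aims to generate diverse data from a small set of seed questions to improve generalization. Contribution: 1. The COGS framework, which generates synthetic data by decomposing and recomposing seed questions; 2. Subquestions and intermediate answers that support reinforcement learning with process rewards; 3. A demonstration of the method's effectiveness on charts and other artificial image domains.
Method: 1. Decompose seed questions into perception and reasoning factors; 2. Systematically recompose them to generate new question-answer pairs; 3. Use subquestions and intermediate answers for reinforcement learning.
Result: On chart reasoning, COGS substantially improves performance on unseen questions, especially reasoning-heavy and compositional ones. Mixing different seed data further improves transfer across datasets.
Insight: COGS improves performance in specific domains and, through its synthetic data generation mechanism, also strengthens generalization, making it applicable to a range of artificial image domains.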
Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human-annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
[34] Generalized Dynamics Generation towards Scannable Physical World Model
Yichen Li,Zhiyi Li,Brandon Feng,Dinghuai Zhang,Antonio Torralba
Main category: cs.CV
TL;DR: GDGen proposes a unified approach that takes a potential energy perspective to integrate rigid body, articulated body, and soft body dynamics, building toward a scannable physical world model.
Details
Motivation: Develop general-purpose digital twin worlds that provide a training ground for generalist agents in interactive environments with complex physical behaviors.
Contribution: GDGen extends classic elastodynamics by introducing directional stiffness, unifies multiple simulation paradigms, and proposes a geometry-agnostic deformation representation.
Method: Model physical systems from a potential energy perspective, using a specialized network to represent the extended material property and a neural field to represent deformation in a geometry-agnostic manner.
Result: Experiments show that GDGen robustly unifies diverse simulation paradigms and suits complex dynamic scenarios.
Insight: The potential energy perspective offers a new angle on physical modeling, and the geometry-agnostic representation improves the model's generality and flexibility.
Abstract: Digital twin worlds with realistic interactive dynamics present a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.
[35] Comprehensive language-image pre-training for 3D medical image understanding
Tassilo Wald,Ibrahim Ethem Hamamci,Yuan Gao,Sam Bond-Taylor,Harshita Sharma,Maximilian Ilse,Cynthia Lo,Olesya Melnichenko,Noel C. F. Codella,Maria Teodora Wetscherek,Klaus H. Maier-Hein,Panagiotis Korfiatis,Valentina Salvatelli,Javier Alvarez-Valle,Fernando Pérez-García
Main category: cs.CV
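TL;DR: The paper proposes the COLIPRI encoder family, which addresses data scarcity in 3D medical imaging by injecting additional inductive biases (a report generation objective, and vision-language pre-training paired with vision-only pre-training), and achieves state-of-the-art results in report generation, classification probing, and zero-shot classification.
Details
Motivation: Data scarcity in the 3D medical image domain limits the capabilities of current vision-language encoders. The paper aims to improve performance by using more data (both image-only and paired image-text data) and by adding inductive biases.
Contribution: 1. Proposes the COLIPRI encoder family; 2. Alleviates data scarcity through a report generation objective combined with vision-only pre-training; 3. Achieves state-of-the-art or competitive performance across tasks such as report generation, classification, and segmentation.
Method: 1. Add a report generation objective as an additional learning task; 2. Combine vision-language pre-training with vision-only pre-training to use more data; 3. Adopt best practices from the 3D medical imaging domain.
Result: COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
Insight: In data-scarce domains, adding learning objectives and combining pre-training strategies (such as vision-language with vision-only pre-training) can effectively improve model performance.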
Abstract: Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
[36] Directional Reasoning Injection for Fine-Tuning MLLMs
Chao Huang,Zeliang Zhang,Jiang Liu,Ximeng Sun,Jialian Wu,Xiaodong Yu,Ze Wang,Chenliang Xu,Emad Barsoum,Zicheng Liu
Main category: cs.CV
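TL;DR: The paper proposes DRIFT, which injects reasoning knowledge in gradient space to improve the reasoning ability of multimodal large language models (MLLMs) while avoiding resource-intensive conventional methods.
Details
Motivation: The reasoning ability of existing MLLMs lags behind that of text-only models. Conventional approaches (supervised fine-tuning or reinforcement learning) are costly, and naive model merging works inconsistently across model families.
Contribution: Proposes DRIFT, which injects reasoning knowledge in gradient space, preserving stable multimodal alignment while transferring reasoning ability efficiently.
Method: Precompute a reasoning prior as a parameter-space difference, then inject reasoning knowledge by biasing gradients during fine-tuning.
Result: On benchmarks including MathVista and MathVerse, DRIFT outperforms naive merging and supervised fine-tuning, at significantly lower cost than training-heavy methods.
Insight: Transferring reasoning knowledge in gradient space is lightweight and efficient, and applies to improving reasoning across different MLLM families.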
Abstract: Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a “free lunch”: its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
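The gradient-biasing idea can be illustrated with a toy sketch. The abstract does not give the exact update rule, so the additive form below and the strength parameter `lam` are assumptions made for illustration only, not the authors' formulation; the function names are hypothetical.

```python
import numpy as np

def reasoning_prior(theta_reasoning, theta_multimodal):
    """Parameter-space difference between the reasoning and multimodal variants."""
    return theta_reasoning - theta_multimodal

def biased_sgd_step(theta, grad, prior, lr=0.1, lam=0.5):
    """One SGD step whose gradient is biased toward the prior direction.

    ASSUMPTION: the bias is a plain additive term and `lam` is a hypothetical
    hyperparameter; the paper may combine the two signals differently.
    """
    biased_grad = grad - lam * prior  # descending this moves theta along +prior
    return theta - lr * biased_grad

theta_mm = np.array([0.0, 0.0])    # toy multimodal weights
theta_rs = np.array([1.0, -1.0])   # toy reasoning-model weights
prior = reasoning_prior(theta_rs, theta_mm)
task_grad = np.array([0.2, 0.2])   # toy gradient of the multimodal loss
theta_new = biased_sgd_step(theta_mm, task_grad, prior)
theta_plain = theta_mm - 0.1 * task_grad  # unbiased step, for comparison
```

Compared with the unbiased step, the biased update is displaced along the prior direction, while the task gradient still contributes to every step.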
[37] A solution to generalized learning from small training sets found in everyday infant experiences
Frangil Ramirez,Elizabeth Clerkin,David J. Crandall,Linda B. Smith
Main category: cs.CV
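TL;DR: The paper proposes that infants' ability to learn efficiently from limited visual experience stems from the "lumpy" similarity structure of their everyday visual input, and uses computational experiments to show that this structure helps machine learning generalize from small datasets.
Details
Motivation: Explain why infants can learn and generalize efficiently from limited visual experience, even though conventional machine learning needs large amounts of data.
Contribution: Reveals the lumpy similarity structure of infants' everyday visual input and shows experimentally that this structure aids small-sample generalization in machine learning.
Method: Analyze egocentric images from 14 infants, extract the similarity structure of their visual input, and run computational experiments that mimic this structure when learning from small datasets.
Result: Mimicking the lumpy structure of infant visual input improves the generalization of machine learning models trained on small datasets.
Insight: Statistical properties of natural visual input, such as lumpiness, can guide the design of efficient learning algorithms across a variety of learning tasks and learners.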
Abstract: Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infant daily life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life, repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months) we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.
[38] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images
Jiaxin Guo,Tongfan Guan,Wenzhen Dong,Wenzhao Zheng,Wenting Wang,Yue Wang,Yeung Yam,Yun-Hui Liu
Main category: cs.CV
TL;DR: SaLon3R is a novel online generalizable 3D Gaussian Splatting (3DGS) framework for structure-aware reconstruction of long video sequences. By introducing compact anchor primitives and a 3D Point Transformer, it substantially reduces redundancy and improves geometric consistency.
Details
Motivation: Existing methods that predict Gaussians over long video sequences suffer from redundancy and geometric inconsistency, which limits reconstruction efficiency and generalization.
Contribution: Proposes SaLon3R, presented as the first online generalizable 3DGS method able to reconstruct over 50 views at over 10 FPS, removing 50% to 90% of redundancy through compact anchors and a 3D Point Transformer.
Method: Use compact anchor primitives and differentiable saliency-aware Gaussian quantization to eliminate redundancy, combined with a 3D Point Transformer that refines anchor attributes and saliency to improve geometric consistency.
Result: Achieves state-of-the-art performance on multiple datasets, with improved efficiency and robustness in novel view synthesis and depth estimation.
Insight: Structure awareness and compact representation are key to reducing redundancy and improving consistency in long-term 3D reconstruction.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, demonstrating superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.
[39] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Bo Liu,Yiding Yang,Guang Chen,Longyin Wen,Alan Yuille,Chongyang Ma
Main category: cs.CV
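TL;DR: TGT is a text-grounded trajectory framework for locally controlled video generation. By pairing trajectories with localized text descriptions, it improves control over the appearance and motion of multiple objects in a video.
Details
Motivation: Existing text-to-video methods offer limited control in multi-object scenes and, in complex scenarios, cannot establish a precise correspondence between visual entities and text descriptions.
Contribution: Proposes the TGT framework, which generates video from trajectories and localized text descriptions; designs the LACA module and a dual-CFG scheme that separate local from global text control; builds a dataset of two million high-quality video clips.
Method: 1. Take trajectories and localized text descriptions as input; 2. Integrate these signals with the LACA module; 3. Use a dual-CFG scheme to modulate local and global text guidance separately; 4. Develop a new data processing pipeline that produces trajectories and text annotations.
Result: Experiments show that TGT outperforms existing methods in visual quality, text alignment accuracy, and motion controllability.
Insight: Pairing trajectories with localized text descriptions is an intuitive and effective way to achieve finer control over multi-object video, particularly in complex scenes.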
Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
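A dual classifier-free guidance (CFG) scheme can be sketched as two stacked guidance terms, one per text signal. The abstract does not state the exact combination, so the nested form below, the weight names, and the default values are assumptions for illustration; standard single-signal CFG is recovered when the local weight is zero.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_global, eps_global_local, w_global=5.0, w_local=3.0):
    """Combine three noise predictions with separate global and local guidance.

    ASSUMPTION: a nested two-term form; `w_global` and `w_local` are
    hypothetical guidance scales, not values from the paper.
    """
    return (eps_uncond
            + w_global * (eps_global - eps_uncond)        # global text guidance
            + w_local * (eps_global_local - eps_global))  # local text guidance

eps_u = np.array([0.0, 0.0])    # no text conditioning
eps_g = np.array([1.0, 0.0])    # global prompt only
eps_gl = np.array([1.0, 1.0])   # global prompt plus localized descriptions
guided = dual_cfg(eps_u, eps_g, eps_gl)
```

Keeping the two scales separate is what allows local guidance to be strengthened or weakened without changing how strongly the global prompt is followed.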
[40] Deep generative priors for 3D brain analysis
Ana Lawry Aguila,Dina Zemlyanker,You Cheng,Sudeshna Das,Daniel C. Alexander,Oula Puonti,Annabel Sorby-Adams,W. Taylor Kimberly,Juan Eugenio Iglesias
Main category: cs.CV
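TL;DR: The paper proposes a general-purpose, diffusion-based approach to a range of inverse problems in medical imaging, combining domain knowledge with data-driven models and achieving state-of-the-art performance in brain MRI analysis.
Details
Motivation: Traditional Bayesian inverse-problem methods rely on classical mathematical priors that struggle to capture complex brain anatomy. Diffusion models perform well in medical imaging, but combining them with domain knowledge remains a challenge.
Contribution: Presents the first general-purpose application of diffusion models as priors for medical imaging inverse problems, demonstrates their potential in tasks such as super-resolution and bias field correction, and shows that they can improve the output quality of existing deep learning methods.
Method: Use a score-based diffusion prior paired with flexible forward models to handle multiple brain MRI tasks without paired training data.
Result: Achieves state-of-the-art performance on clinical and research MRI data, producing consistent, high-quality solutions.
Insight: Diffusion priors can serve as general-purpose tools for brain MRI analysis, combining domain knowledge with the benefits of training on large datasets.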
Abstract: Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
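The way a learned prior and a forward model combine can be written out with Bayes' rule. The block below assumes a linear forward operator $A$ and Gaussian noise with variance $\sigma^2$, which are illustrative simplifications rather than the paper's stated setup; in practice the prior score comes from the diffusion model at a given noise level, and the likelihood term has to be approximated.

```latex
\begin{aligned}
y &= A x + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I), \\
\nabla_x \log p(x \mid y) &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x), \\
\nabla_x \log p(y \mid x) &= \frac{1}{\sigma^2} A^\top (y - A x).
\end{aligned}
```

Under these assumptions, swapping the task (for example, super-resolution versus inpainting) only changes $A$, while the prior score $\nabla_x \log p(x)$ is reused unchanged, which is why no paired training data is needed.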
[41] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
Anthony Bilic,Guangyu Sun,Ming Li,Md Sanzid Bin Hossain,Yu Tian,Wei Zhang,Laura Brattain,Dexter Hadley,Chen Chen
Main category: cs.CV
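TL;DR: The paper proposes the Fourier Transform Multiple Instance Learning (FFT-MIL) framework, which augments MIL with a frequency-domain branch to capture global dependencies in whole slide images (WSIs) and improve classification performance.
Details
Motivation: Existing MIL-based WSI classification methods struggle to capture global dependencies because the images are enormous and patch embeddings are local. This limits the modeling of coarse structures and weakens the robustness of diagnostic prediction.
Contribution: Proposes the FFT-MIL framework, which adds a frequency-domain branch to extract global context and fuses it with spatial features through lightweight integration strategies, improving the performance of MIL methods.
Method: Extract low-frequency crops via the Fast Fourier Transform (FFT), process the frequency data with an FFT-Block (convolutional layers and Min-Max normalization), and fuse the result with spatial features.
Result: Across six MIL methods on three public datasets (BRACS, LUAD, IMP), FFT-MIL improved macro F1 by an average of 3.51% and AUC by 1.51%.
Insight: Frequency-domain learning is an effective mechanism for capturing global dependencies in WSIs; it complements spatial features and improves the scalability and classification accuracy of computational pathology.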
Abstract: Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
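The low-frequency crop and Min-Max normalization steps can be sketched with NumPy. This covers only the input preparation, not the convolutional FFT-Block; the crop size, the use of log-magnitudes, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def low_frequency_crop(image, crop=8):
    """Return log-magnitudes of the central crop x crop block of the 2D spectrum.

    ASSUMPTION: magnitudes with log scaling are used here for readability;
    the paper may keep real/imaginary parts or use a different crop size.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # zero frequency moved to center
    h, w = spectrum.shape
    top, left = h // 2 - crop // 2, w // 2 - crop // 2
    block = spectrum[top:top + crop, left:left + crop]
    return np.log1p(np.abs(block))

def min_max_normalize(x, eps=1e-8):
    """Rescale values to [0, 1] to tame the high variance of frequency data."""
    return (x - x.min()) / (x.max() - x.min() + eps)

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a downsampled grayscale slide
feat = min_max_normalize(low_frequency_crop(image, crop=8))
```

Because low frequencies sit at the center after `fftshift`, a small central crop summarizes coarse, slide-level structure in a compact array regardless of the original image size.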
[42] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Xingrui Wang,Jiang Liu,Chao Huang,Xiaodong Yu,Ze Wang,Ximeng Sun,Jialian Wu,Alan Yuille,Emad Barsoum,Zicheng Liu
Main category: cs.CV
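TL;DR: XModBench is a large-scale tri-modal benchmark for evaluating the cross-modal consistency of omni-modal large language models (OLLMs). Experiments show that existing models fall short in spatial and temporal reasoning and exhibit modality disparity and directional imbalance.
Details
Motivation: Existing benchmarks mainly evaluate cross-modal question-answering ability and lack a systematic analysis of modality-invariant reasoning or modality-specific bias, so a new diagnostic tool is needed.
Contribution: Proposes the XModBench benchmark, covering five task families and six modality compositions, to systematically evaluate the cross-modal consistency and modality disparity of OLLMs.
Method: Design 60,828 multiple-choice questions covering all combinations of audio, vision, and text, and diagnose model weaknesses through controlled modality variation and task design.
Result: Even the strongest model, Gemini 2.5 Pro, struggles with spatial and temporal reasoning (accuracy below 60%), shows marked modality disparity (audio performance far below text), and exhibits directional imbalance (lower consistency when vision serves as context).
Insight: Current OLLMs remain far from truly modality-invariant reasoning; XModBench can serve as a key tool for future work on improving cross-modal competence.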
Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM’s modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
[43] Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang,Rongmei Lin,Shiyang Li,Colin Lockard,Ritesh Sarkhel,Sanket Lokegaonkar,Jingbo Shang,Xifeng Yan,Nasser Zalmout,Xian Li
Main category: cs.CV
TL;DR: The paper proposes UniFilter, a unified multimodal data quality classifier trained on synthetic data to filter high-quality image-text data, and reports significantly improved performance for multimodal large language models.
Details
Motivation: Multimodal large language models are pre-trained mainly on mixtures of image-text caption data and interleaved document data, but filtering for high-quality data is under-explored, and an efficient unified filtering method is needed.
Contribution: 1) Proposes UniFilter, a unified multimodal data quality classifier; 2) Generates diverse labeled data with a semi-synthetic approach; 3) Releases the synthetic training data, model checkpoints, and the high-quality dataset OBELICS-HQ.
Method: A semi-synthetic approach generates text at different quality levels from raw images, efficiently creating sample-score pairs to train the UniFilter classifier.
Result: Experiments show that MLLMs pre-trained on UniFilter-filtered data significantly outperform baselines in zero-shot reasoning and in-context learning, and achieve stronger performance on downstream tasks.
Insight: High-quality multimodal data is critical to MLLM performance; a unified classifier can filter data effectively and improve model generalization.
Abstract: Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while high-quality data filtering for image-text interleaved document data remains under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
[44] Salient Concept-Aware Generative Data Augmentation
Tianchen Zhao,Xuanbai Chen,Zhihua Li,Jun Fang,Dongsheng An,Xiang Xu,Zhuowen Tu,Yifan Xing
Main category: cs.CV
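TL;DR: The paper proposes a salient concept-aware image generation framework to address the difficulty of balancing fidelity and diversity in generative data augmentation; by reducing the influence of irrelevant visual details, it improves the robustness of downstream models.
Details
Motivation: Existing generative data augmentation methods conditioned on image and text prompts struggle to maintain both fidelity and diversity, because non-essential image attributes (such as environmental context) become entangled in the synthesis process and conflict with the intent of the text prompt.
Contribution: 1) Proposes a salient concept-aware image embedding model that reduces the influence of irrelevant visual details during synthesis; 2) Designs an image generation framework that preserves class-discriminative features while adding controlled variation; 3) Validates the method on fine-grained vision datasets.
Method: 1) Develop a salient concept-aware image embedding model; 2) Use it to generate images that preserve key features while introducing diversity; 3) Apply the generated images for data augmentation to improve downstream classification models.
Result: On eight fine-grained vision datasets, the method improves average accuracy over the best existing methods by 0.73% (conventional setting) and 6.5% (long-tail setting).
Insight: A salient concept-aware approach gives better control over the fidelity and diversity of generated images, which improves data augmentation and downstream task performance.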
Abstract: Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance between fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with averaged classification accuracy improvements by 0.73% and 6.5% under conventional and long-tail settings, respectively.
[45] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records
Daniela Vega,Hannah V. Ceballos,Javier S. Vera,Santiago Rodriguez,Alejandra Perez,Angela Castillo,Maria Escobar,Dario Londoño,Luis A. Sarmiento,Camila I. Castro,Nadiezhda Rodriguez,Juan C. Briceño,Pablo Arbeláez
Main category: cs.CV
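TL;DR: The paper introduces CARDIUM, the first publicly available multimodal dataset for prenatal diagnosis of congenital heart disease (CHD), and proposes a robust multimodal transformer architecture that integrates image and tabular data and significantly improves detection performance.
Details
Motivation: Because congenital heart diseases are rare, high-quality diagnostic data is scarce, producing imbalanced, low-quality datasets that limit AI model performance. The lack of public multimodal datasets further hinders the use of AI in clinical decision-making.
Contribution: 1) Releases CARDIUM, the first publicly available multimodal dataset, consolidating fetal ultrasound and echocardiographic images with maternal clinical records; 2) Proposes a multimodal transformer architecture that fuses image and tabular features through a cross-attention mechanism, significantly improving CHD detection.
Method: Adopt a multimodal transformer architecture with a cross-attention mechanism that fuses feature representations from image and tabular data to improve CHD detection.
Result: On the multimodal dataset, the model improves over single-modality approaches by 11% (image) and 50% (tabular), reaching an F1 score of 79.8 ± 4.8%.
Insight: Multimodal data fusion can significantly improve diagnostic accuracy for rare diseases, and public datasets and code help advance research in this area.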
Abstract: Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/
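Cross-attention fusion of two modalities can be sketched in a few lines. The example below is a bare scaled dot-product attention without learned projections; which modality supplies the queries, the token counts, and the feature width are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries and keys come from different modalities."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
tab_tokens = rng.normal(size=(3, 16))   # hypothetical tabular (clinical record) tokens
img_tokens = rng.normal(size=(10, 16))  # hypothetical image patch tokens
# ASSUMPTION: tabular tokens act as queries over image tokens.
fused, attn = cross_attention(tab_tokens, img_tokens, img_tokens)
```

Each tabular token ends up as a weighted mixture of image features, so the fused representation carries information from both sources.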
[46] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
Aysan Aghazadeh,Adriana Kovashka
Main category: cs.CV
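TL;DR: The paper studies the potential of text-to-image models for ad customization, analyzes racial and gender bias in ads, and proposes a technique for generating ads targeted to the cultures of specific countries.
Details
Motivation: Explore the potential of text-to-image models for ad customization and analyze the bias present in ad generation.
Contribution: Analyzes racial and gender bias in ads, shows that ad persuasiveness differs across the demographics portrayed, and proposes a method for generating ads targeted to the cultures of specific countries.
Method: Analyze bias in ads, evaluate the persuasiveness of ads portraying different demographic groups, and experiment with a technique for generating ads targeted to specific countries.
Result: The results indicate significant racial and gender bias in ads, and that targeted generation produces ads that are more effective for different national cultures.
Insight: Ad generation needs to account for cultural diversity and avoid bias in order to improve persuasiveness and effectiveness.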
Abstract: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion
[47] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion
Weijie Wang,Jiagang Zhu,Zeyu Zhang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Haoxiao Wang,Guan Huang,Xinze Chen,Yukun Zhou,Wenkang Qin,Duochao Shi,Haoyun Li,Guanghong Jia,Jiwen Lu
Main category: cs.CV
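TL;DR: DriveGen3D is a framework that combines efficient video diffusion with dynamic 3D reconstruction for real-time generation of high-quality dynamic 3D driving scenes. Its FastDrive-DiT and FastRecon3D modules address the limitations of existing methods in long-term generation and 3D representation.
Details
Motivation: Existing methods for dynamic driving scene generation are computationally heavy, lack a 3D representation, or support only static scenes. DriveGen3D aims to bridge this methodological gap.
Contribution: Proposes the DriveGen3D framework, which combines efficient video diffusion with dynamic 3D reconstruction to generate high-resolution, temporally coherent driving scenes.
Method: Integrate FastDrive-DiT (an efficient video diffusion transformer) and FastRecon3D (a feed-forward 3D reconstruction module), generating dynamic 3D scenes under multimodal conditional control.
Result: Generates driving video at up to 424×800 resolution at 12 FPS, with an SSIM of 0.811 and a PSNR of 22.84 on novel view synthesis.
Insight: Combining video diffusion with 3D reconstruction yields high-quality dynamic scenes while maintaining parameter efficiency, providing a new tool for autonomous driving simulation and data augmentation.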
Abstract: We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
[48] CuSfM: CUDA-Accelerated Structure-from-Motion
Jingrui Yu,Jun Liu,Kefei Ren,Joydeep Biswas,Rurui Ye,Keqiang Wu,Chirag Majithia,Di Zeng
Main category: cs.CV
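TL;DR: cuSfM is a CUDA-accelerated offline Structure-from-Motion system that uses GPU parallelization to improve computational efficiency and supports high-precision camera pose estimation and globally consistent mapping.
Details
Motivation: Efficient and accurate camera pose estimation is a foundational requirement for autonomous driving, robotic perception, and virtual simulation. Existing methods are computationally expensive, and cuSfM uses GPU parallelization to address this.
Contribution: Proposes the cuSfM system, which uses CUDA acceleration to achieve efficient, non-redundant data association and high-precision pose estimation, and supports functions including pose optimization, mapping, and prior-map localization.
Method: Use GPU parallelization to accelerate computationally intensive feature extractors and generate globally consistent, non-redundant data associations, substantially improving efficiency and accuracy.
Result: Experiments show that cuSfM outperforms COLMAP in accuracy and processing speed while maintaining high precision and global consistency.
Insight: GPU parallelization can markedly improve efficiency and accuracy in offline SfM, and the open-source release of cuSfM is expected to advance computer vision and robotics research.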
Abstract: Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at https://github.com/nvidia-isaac/pyCuSFM, to facilitate research and applications in computer vision and robotics.
[49] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning
Yiming Lin,Shang Wang,Junkai Zhou,Qiufeng Wang,Xiao-Bo Jin,Kaizhu Huang
Main category: cs.CV
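TL;DR: The paper proposes a classification framework based on hyperbolic geometry for single positive multi-label learning (SPMLL). Representing labels as hyperbolic balls rather than points or vectors lets it model hierarchy, co-occurrence patterns, and semantic independence among labels at the same time. It also introduces a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization. Experiments on multiple benchmark datasets show competitive performance with superior interpretability.
Details
Motivation: Existing SPMLL methods mainly model label relationships implicitly through distance-based similarity and lack explicit geometric definitions for different relationship types. Hyperbolic geometry represents hierarchical structure naturally, which makes it better suited to modeling complex label relationships.
Contribution: 1. Proposes the first hyperbolic classification framework for SPMLL, representing labels as hyperbolic balls rather than points or vectors to support rich label relationship modeling; 2. Introduces a temperature-adaptive hyperbolic ball classifier and a double-well regularization that guide the configuration of label balls; 3. Validates the method's performance and interpretability experimentally and shows that the hyperbolic embeddings correlate strongly with real-world co-occurrence patterns.
Method: 1. Represent labels as hyperbolic balls and model label relationships through the balls' geometric relations (inclusion, overlap, separation); 2. Use a temperature-adaptive hyperbolic ball classifier that dynamically adjusts ball size; 3. Apply double-well regularization, which uses a physical potential to guide the distribution of balls and avoid excessive overlap or separation.
Result: On four datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011), the method achieves competitive performance with superior interpretability compared to existing methods. Statistical analysis shows that the learned embeddings correlate strongly with real-world co-occurrence patterns.
Insight: Hyperbolic geometry is naturally suited to modeling hierarchical structure and complex relationships, and is more robust when label supervision is incomplete, as in SPMLL.
Abstract: Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. Existing methods implicitly model label relationships through distance-based similarity and lack explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.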
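The three ball relationships named above (inclusion, overlap, separation) can be made concrete with the Poincaré ball distance. The classification rule below follows from the triangle inequality for geodesic balls; the choice of the Poincaré model, the radii, and the example labels are assumptions for illustration, and the paper's exact criteria may differ.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = lambda x: sum(t * t for t in x)
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq(diff) / ((1.0 - sq(u)) * (1.0 - sq(v)))
    return math.acosh(arg)

def ball_relation(c1, r1, c2, r2):
    """Classify two hyperbolic balls given centers and hyperbolic radii.

    ASSUMPTION: hypothetical rule based on center distance and radii only.
    """
    d = poincare_distance(c1, c2)
    if d + min(r1, r2) <= max(r1, r2):
        return "inclusion"   # smaller ball lies inside the larger (hierarchy)
    if d < r1 + r2:
        return "overlap"     # balls intersect (co-occurrence)
    return "separation"      # balls are disjoint (semantic independence)

# Hypothetical labels: a broad "animal" ball containing a narrow "dog" ball.
animal_vs_dog = ball_relation((0.0, 0.0), 2.0, (0.3, 0.0), 0.5)
overlapping = ball_relation((0.0, 0.0), 1.0, (0.5, 0.0), 0.5)
separate = ball_relation((-0.8, 0.0), 0.5, (0.8, 0.0), 0.5)
```

Distances grow rapidly near the boundary of the Poincaré ball, which leaves room for many small, well-separated label balls beneath a few large ones.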
[50] Latent Diffusion Model without Variational Autoencoder
Minglei Shi,Haolin Wang,Wenzhao Zheng,Ziyang Yuan,Xiaoshi Wu,Xintao Wang,Pengfei Wan,Jie Zhou,Jiwen Lu
Main category: cs.CV
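TL;DR: SVG is a latent diffusion model that drops the variational autoencoder (VAE) and instead diffuses over a semantically structured latent space built from frozen DINO self-supervised features plus a lightweight residual branch, enabling faster diffusion training, few-step sampling, and improved generative quality.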
Details
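Motivation: The VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks, which the authors trace to VAE latent spaces lacking clear semantic separation and strong discriminative structure. Contribution: Introduces SVG, a latent diffusion model without a VAE that leverages self-supervised representations for visual generation.
Method: SVG constructs a feature space with clear semantic discriminability from frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction; diffusion models are trained directly on this latent space.
Result: Experiments show that SVG accelerates diffusion training, supports few-step sampling, improves generative quality, and preserves the semantic and discriminative capabilities of the underlying self-supervised representations.
Insight: Clear semantic separation and discriminative structure in the latent space matter for stable, efficient training of latent diffusion models, not only for perception and understanding tasks.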
Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
[51] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding
Main category: cs.CV
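TL;DR: The paper proposes CoMe, which compresses large language models through progressive layer pruning with concatenation-based merging and hierarchical distillation while preserving performance.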
Details
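Motivation: Large language models have high computational and storage demands; existing structured pruning methods remove layers directly, which degrades performance, and they lack effective post-training recovery mechanisms. Contribution: 1) A progressive layer pruning framework; 2) a concatenation-based merging technique; 3) a hierarchical distillation post-training process; 4) with 30% of LLaMA-2-7b's parameters pruned, the model retains 83% of its original average accuracy.
Method: 1) A channel sensitivity metric based on activation intensity and weight norms selects the critical channels; 2) the most critical channels of adjacent layers are fused by concatenation; 3) hierarchical distillation uses the correspondence between original and pruned layers for knowledge transfer.
Result: State-of-the-art performance on seven benchmarks; after pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy.
Insight: Concatenation-based layer merging combined with hierarchical distillation reduces model size while preserving performance, offering a new route to lightweight large language models.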
Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to ignore retaining the capabilities in the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b’s parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
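The channel selection and concatenation steps described in this entry can be sketched as follows. The abstract only says the sensitivity metric "utilizes activation intensity and weight norms"; the product of mean absolute activation and L2 weight norm used here is an assumption, and `channel_sensitivity` and `merge_adjacent_layers` are hypothetical names, not CoMe's API.

```python
import math


def channel_sensitivity(activations, weights):
    """Score each channel as mean |activation| times the L2 norm of its weights.

    activations: list of samples, each a list with one value per channel.
    weights: list with one weight vector per channel.
    The product form is an illustrative assumption, not the paper's formula.
    """
    n = len(activations)
    scores = []
    for c, w in enumerate(weights):
        intensity = sum(abs(sample[c]) for sample in activations) / n
        norm = math.sqrt(sum(x * x for x in w))
        scores.append(intensity * norm)
    return scores


def merge_adjacent_layers(layer_a, layer_b, keep):
    """Keep the `keep` highest-scoring channels of two adjacent layers, concatenated.

    Each layer is an (activations, weights) pair.
    """
    candidates = []
    for name, (acts, weights) in (("a", layer_a), ("b", layer_b)):
        for idx, score in enumerate(channel_sensitivity(acts, weights)):
            candidates.append((score, name, idx))
    candidates.sort(key=lambda item: item[0], reverse=True)
    selected = sorted((name, idx) for _, name, idx in candidates[:keep])
    merged = [(layer_a if name == "a" else layer_b)[1][idx] for name, idx in selected]
    return selected, merged


layer_a = ([[1.0, 0.1], [1.0, 0.1]], [[3.0, 4.0], [1.0, 0.0]])
layer_b = ([[0.5, 2.0], [0.5, 2.0]], [[0.0, 2.0], [0.0, 1.0]])
print(merge_adjacent_layers(layer_a, layer_b, keep=2))
```

The merged layer has the width of one original layer but draws its channels from both, which is the sense in which merging differs from simply deleting a layer.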
[52] SHARE: Scene-Human Aligned Reconstruction
Joshua Li,Brendan Chharawala,Chang Shu,Xue Bin Peng,Pengcheng Xi
Main category: cs.CV
TL;DR: SHARE uses scene geometry to accurately ground human motion reconstruction, aligning the human with the scene from only a monocular RGB video.
Details
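Motivation: Existing human motion reconstruction methods struggle to place humans accurately in 3D space, which hampers the animation of realistic character interactions in gaming, AR/VR, and robotics. Contribution: SHARE combines scene geometry with human mesh reconstruction to achieve more accurate 3D human motion reconstruction and scene alignment.
Method: SHARE first estimates a human mesh and segmentation mask for every frame, plus a scene point map at keyframes. It refines the human's keyframe positions by comparing the mesh against the human point map extracted with the mask, and keeps non-keyframe meshes consistent by preserving their root-joint positions relative to keyframe root joints.
Result: Experiments show that SHARE outperforms existing methods in reconstruction accuracy and works on both curated datasets and in-the-wild web videos.
Insight: Exploiting scene geometry effectively improves the accuracy of human motion reconstruction, particularly in complex environments.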
Abstract: Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry’s inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human’s positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
[53] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning
Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
Main category: cs.CV
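TL;DR: The paper proposes a staged adaptive fine-tuning method for surgical tool presence detection in laparoscopic videos, improving detection performance through gradual freezing fine-tuning.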
Details
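Motivation: Automated surgical tool detection is important for analysis and assistance in minimally invasive surgery, but annotated data are scarce, which makes robust deep learning models hard to train. Contribution: 1. A two-stage adaptive fine-tuning method (linear probing followed by gradual freezing); 2. Only a single training loop is required, reducing network complexity; 3. Validation on two datasets (Cholec80 and CATARACTS).
Method: 1. Use CNN architectures pre-trained on ImageNet (ResNet-50 and DenseNet-121); 2. Stage one conditions the added classification layers through linear probing; 3. Stage two applies gradual freezing, dynamically reducing the number of fine-tunable layers.
Result: Achieves a mAP of 96.4% on Cholec80, with generalizability confirmed on the CATARACTS dataset.
Insight: Gradual freezing fine-tuning is an effective domain adaptation technique that may extend to other image classification tasks.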
Abstract: Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
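The two stages in this entry (linear probing, then gradual freezing) can be sketched as a schedule that says which backbone layers are trainable at each epoch. The abstract does not specify the freezing order or rate, so freezing from the input side at a linear rate is an assumption, and `trainable_layers` is a hypothetical helper.

```python
def trainable_layers(epoch, num_layers, probe_epochs, total_epochs):
    """Return indices of backbone layers that are trainable at `epoch` (0-based).

    Stage 1 (linear probing): for epoch < probe_epochs the backbone is fully
    frozen and only the added classification head trains, so the list is empty.
    Stage 2 (gradual freezing): all layers start trainable and are then frozen
    one by one from the input side. The linear schedule is an assumption.
    """
    if total_epochs <= probe_epochs:
        raise ValueError("total_epochs must exceed probe_epochs")
    if epoch < probe_epochs:
        return []
    progress = (epoch - probe_epochs) / (total_epochs - probe_epochs)
    frozen = min(num_layers - 1, int(progress * num_layers))
    return list(range(frozen, num_layers))


for epoch in range(6):
    print(epoch, trainable_layers(epoch, num_layers=4, probe_epochs=2, total_epochs=6))
```

In a real training loop the returned indices would decide which parameter groups receive gradient updates, and both stages run inside one loop, which matches the entry's point about needing only a single training loop.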
[54] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Haisheng Su,Junjie Zhang,Feixiang Song,Sanping Zhou,Wei Wu,Nanning Zheng,Junchi Yan
Main category: cs.CV
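TL;DR: The paper proposes FreqPDE, which equips multi-view 3D object detection transformers with spatial information through a frequency-aware positional depth embedding, addressing the poor depth prediction quality and missing cross-view consistency of earlier methods.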
Details
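Motivation: Current multi-view 3D object detection methods rely on depth prediction to recover spatial information, but the predicted depth suffers from discontinuities at object boundaries and poor distinction of small objects, and cross-view consistency and scale invariance are overlooked. Contribution: 1. Proposes FreqPDE, combining a Frequency-aware Spatial Pyramid Encoder (FSPE), a Cross-view Scale-invariant Depth Predictor (CSDP), and a Positional Depth Encoder (PDE); 2. Adopts hybrid depth supervision so that metric and distribution learning complement each other.
Method: 1. FSPE builds a feature pyramid by combining high-frequency edge cues with low-frequency semantic features; 2. CSDP estimates pixel-level depth distributions using cross-view and channel attention; 3. PDE combines 2D image features with 3D position embeddings to produce depth-aware features.
Result: Experiments on the nuScenes dataset demonstrate the effectiveness and superiority of FreqPDE for 3D object detection.
Insight: Combining frequency features with spatial information improves depth prediction quality, and cross-view attention strengthens feature consistency.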
Abstract: Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
[55] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
Ting-Yu Yen,Yu-Sheng Chiu,Shih-Hsuan Hung,Peter Wonka,Hung-Kuo Chu
Main category: cs.CV
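TL;DR: PFGS is a 3D Gaussian Splatting method for multi-pose object reconstruction that addresses the incomplete reconstructions produced by existing single-pose methods. Its pose-aware global and local registration strategy yields complete, high-quality object reconstructions.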
Details
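Motivation: Existing 3D Gaussian Splatting (3DGS) methods usually assume the object is captured in a single static pose, which leaves reconstructions incomplete, especially in occluded or self-occluded regions. PFGS targets complete object reconstruction from multi-pose image captures. Contribution: 1) A pose-aware 3DGS framework that effectively fuses multi-pose images; 2) a combination of global and local registration that improves completeness and accuracy; 3) use of foundation-model features in the registration process, resolving background inconsistency.
Method: PFGS iteratively fuses each auxiliary-pose image set into a unified 3DGS representation of the main pose. It uses pose-aware global and local registration, leveraging background features for per-pose camera pose estimation and foundation models for cross-pose registration.
Result: Experiments show that PFGS outperforms baselines in both qualitative and quantitative evaluations, producing more complete and higher-fidelity 3DGS models.
Insight: Pose-aware registration and multi-view fusion are advantageous for multi-pose object reconstruction, and the design suggests a new way to handle background inconsistency.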
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
[56] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
Peng Ren,Hai Yang
Main category: cs.CV
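TL;DR: LILAC is a real-time, long-sequence motion stylization method built on a VAE-diffusion model; a latent-space streaming architecture with causal decoding delivers high-quality, low-latency motion generation.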
Details
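Motivation: Generating stable, high-quality, long-sequence stylized motion in real time is critical for responsive character control, but existing methods either incur heavy computational overhead or support only offline processing. Contribution: A latent-space streaming architecture that combines causal decoding with the injection of decoded motion features, enabling long-sequence, low-latency, real-time motion stylization.
Method: Builds on a VAE-diffusion framework and adds a sliding-window causal design and the injection of decoded motion features to ensure smooth transitions and efficient processing.
Result: Experiments on benchmark datasets show a favorable balance between stylization quality and responsiveness.
Insight: Combining latent-space streaming with causal decoding is an effective approach to real-time motion stylization that needs neither future frames nor changes to the diffusion model architecture.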
Abstract: Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
[57] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning
Chen Qian,Haoyu Zhang,Junnan Ma,Liuhong Zhu,Qingrui Cai,Yu Wang,Ruibo Song,Lv Li,Lin Mei,Xianwang Jiang,Qin Xu,Boyu Jiang,Ran Tao,Chunmiao Chen,Shufang Chen,Dongyun Liang,Qiu Guo,Jianzhong Lin,Taishan Kang,Mengtian Lu,Liyuan Fu,Ruibin Huang,Huijuan Wan,Xu Huang,Jianhua Wang,Di Guo,Hai Zhong,Jianjun Zhou,Xiaobo Qu
Main category: cs.CV
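TL;DR: LoSP-Prompt is a reconstruction framework for multi-organ diffusion MRI that tackles motion artifacts through physics-informed modeling and synthetic-data-driven prompt learning, achieving high resolution and cross-organ generalization, with strong results in clinical validation.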
Details
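Motivation: Clinical use of multi-organ diffusion MRI is hampered by phase artifacts from motion such as respiration and peristalsis, compounded by multi-organ, multi-slice, multi-direction, and multi-b-value complexity, which limits its diagnostic value. LoSP-Prompt aims to solve these problems. Contribution: 1. Proposes the LoSP-Prompt framework, combining LoSP phase modeling with synthetic-data-driven prompt learning; 2. Achieves high-resolution reconstruction and cross-organ generalization; 3. Clinical validation shows it outperforms existing methods.
Method: 1. Models inter-shot phase variations as a high-order Locally Smooth Phase (LoSP) integrated into a low-rank Hankel matrix reconstruction; 2. Sets the algorithm's rank parameter automatically via prompt learning trained only on synthetic abdominal DWI data.
Result: 1. Twice the spatial resolution of clinical single-shot DWI; 2. A single model generalizes to seven anatomical regions; 3. Outperforms state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists, 5-point scale, $p<0.05$; scores range from 3-4 on knee and brain tumor DWI to 4-5 on kidney DWI).
Insight: 1. Synthetic-data-driven prompt learning can address the shortage of clinical training data; 2. Physics-informed modeling combined with machine learning yields an interpretable, robust solution; 3. The navigator-free design simplifies clinical use.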
Abstract: Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm’s rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists’ evaluations on a 5-point scale, $p<0.05$), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and tumor brain. The approach eliminates navigator signals and realistic data supervision, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.
[58] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang
Main category: cs.CV
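TL;DR: The paper proposes Learning to Detect (LoD), a framework for detecting unknown jailbreak attacks on large vision-language models; task-specific learning and a Multi-modal Safety Concept Activation Vector module yield higher detection accuracy and efficiency.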
Details
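Motivation: Even after alignment, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. Existing detection methods either learn attack-specific parameters and generalize poorly, or rely on heuristic principles with limited accuracy and efficiency. Contribution: Proposes the LoD framework, which shifts from attack-specific to task-specific learning, and introduces a Multi-modal Safety Concept Activation Vector module and a Safety Pattern Auto-Encoder module to improve the accuracy and efficiency of detecting unknown attacks.
Method: A Multi-modal Safety Concept Activation Vector module performs safety-oriented representation learning; a Safety Pattern Auto-Encoder module performs unsupervised attack classification.
Result: Experiments show that LoD achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency.
Insight: Task-specific learning generalizes better than attack-specific learning, and combining multimodal and unsupervised techniques effectively improves safety detection.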
Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
[59] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety
Huan Chen,Ting Han,Siyu Chen,Zhihao Guo,Yiping Chen,Meiliu Wu
Main category: cs.CV
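TL;DR: The paper proposes Semantic4Safety, a framework that combines zero-shot semantic segmentation with causal inference to analyze road-safety features in street-view imagery and reveal causal patterns specific to accident types.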
Details
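Motivation: Existing work lacks tools for capturing accident-related features at the street level and quantifying their causal impact; this paper aims to fill that gap. Contribution: 1. Proposes the Semantic4Safety framework, combining zero-shot segmentation with causal inference; 2. Derives 11 interpretable streetscape indicators from SVIs; 3. Quantifies the causal effect of these features across accident types.
Method: 1. Zero-shot semantic segmentation extracts streetscape features; 2. An XGBoost multi-class classifier predicts accident type; 3. SHAP interprets feature contributions; 4. Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation quantify causal effects.
Result: Scene complexity, exposure, and roadway geometry features dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it.
Insight: By bridging predictive modeling with causal inference, the framework offers a scalable, data-driven tool for urban road safety planning.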
Abstract: Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
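The weighting idea behind the causal step in this entry can be shown with a small worked example. The paper uses Generalized Propensity Score weighting, which handles treatments that are not binary; the sketch below is a deliberately simplified binary-treatment version with propensities assumed to be already estimated, and `ipw_ate` is a hypothetical helper, not the authors' code.

```python
def ipw_ate(treated, outcome, propensity):
    """Normalized inverse-propensity-weighted ATE estimate for a binary treatment.

    treated: 0/1 flags, outcome: observed outcomes,
    propensity: estimated P(treated = 1 | covariates), strictly between 0 and 1.
    Each unit is weighted by the inverse probability of the treatment it
    actually received, which rebalances the two groups on the covariates.
    """
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    mean_treated = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean_control = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean_treated - mean_control


# Toy data: with equal propensities the estimate is the plain difference in means.
print(ipw_ate([1, 1, 0, 0], [3.0, 5.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5]))
```

Here the toy outcomes are made up for illustration; in the setting of this entry the treatment would be a streetscape indicator and the outcome an accident-type measure.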
[60] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
Feifei Zhang,Zhenhong Jia,Sensen Song,Fei Shi,Dayong Ren
Main category: cs.CV
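TL;DR: The paper proposes a Predictive-Corrective (PC) paradigm that accelerates learning by decoupling the modeling task, with a focus on medical image segmentation. Built on it, PCMambaNet combines a Predictive Prior Module (PPM) and a Corrective Residual Network (CRN), markedly improving convergence speed and accuracy.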
Details
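Motivation: End-to-end deep learning converges slowly and depends on large-scale data in medical image segmentation, which limits its efficiency and applicability in data-scarce domains. Contribution: Proposes the PC paradigm and a new network, PCMambaNet, which decouples the modeling task and exploits anatomical knowledge (bilateral symmetry) to accelerate learning and improve segmentation accuracy.
Method: PCMambaNet has two parts: the PPM produces a coarse approximation at low computational cost, predicting a focus map of asymmetric regions; the CRN learns the residual error and refines pathological boundaries.
Result: On high-resolution brain MRI segmentation, PCMambaNet reaches state-of-the-art accuracy within 1-5 epochs, which the authors report conventional end-to-end models cannot match.
Insight: Explicitly incorporating domain knowledge simplifies the learning objective, mitigating data inefficiency and overfitting and improving model efficiency.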
Abstract: Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge (bilateral symmetry) to predict a ‘focus map’ of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network’s full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs, a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.
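The bilateral-symmetry prior in this entry can be illustrated with a hand-written stand-in for the focus map. The PPM in the paper is a network module whose internals are not described in the abstract; this sketch only shows the underlying idea, assumes the anatomical midline coincides with the vertical center of the image, and uses a hypothetical `focus_map` function with a fixed threshold.

```python
def focus_map(image, threshold):
    """Flag pixels whose intensity differs from their left-right mirror image.

    image: 2D list of intensities (rows of equal length).
    Returns a 2D list of 0/1 flags; 1 marks a candidate asymmetric region.
    Assumes the midline sits at the image center, which real scans only
    approximate after registration.
    """
    result = []
    for row in image:
        mirrored = row[::-1]
        result.append([1 if abs(a - b) > threshold else 0 for a, b in zip(row, mirrored)])
    return result


scan = [
    [10, 10, 10, 10],
    [10, 50, 10, 10],  # a bright spot on the left side only
]
print(focus_map(scan, threshold=20))
```

A comparison of two mirrored pixels cannot tell which side is abnormal, so both are flagged; in the paradigm described above, deciding that is left to the corrective stage.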
[61] Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang
Main category: cs.CV
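TL;DR: The paper proposes EARL, an evidence-prioritized adaptive framework that uses reinforcement learning to dynamically select key frames and re-sample locally around them, markedly improving the reasoning ability of Video LLMs.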
Details
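Motivation: Existing Video LLMs suffer from information dilution and obscured key evidence in long-video reasoning, and lack rigorous reward mechanisms to enforce evidence purity. Contribution: Proposes the evidence-aware reinforcement learning (EARL) framework, which dynamically selects the most relevant frames and obtains fine-grained temporal information through localized re-sampling.
Method: Evidence-aware reinforcement learning (EARL) dynamically selects key frames and performs localized re-sampling around them to supplement temporal information.
Result: State of the art among open-source Video LLMs on five video reasoning benchmarks; the 7B model scores 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME.
Insight: Evidence purity is critical for video reasoning; dynamic frame selection and localized re-sampling markedly improve model performance.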
Abstract: Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
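The select-then-re-sample behavior described in this entry can be sketched as plain index arithmetic. In EARL the selection is a policy learned with reinforcement learning; here the relevance scores are hypothetical placeholders supplied by the caller, and `select_and_resample` with its `radius` parameter is an illustrative assumption rather than the paper's procedure.

```python
def select_and_resample(scores, top_k, radius, num_frames):
    """Pick the top_k highest-scoring pre-sampled frames, then densify around them.

    scores: dict mapping a pre-sampled frame index to a relevance score.
    radius: how many neighboring frames to add on each side of a key frame.
    num_frames: total number of frames in the video, used to clamp indices.
    Returns (key_frames, all_selected_frames), both sorted.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    key_frames = sorted(ranked[:top_k])
    selected = set()
    for k in key_frames:
        lo = max(0, k - radius)
        hi = min(num_frames - 1, k + radius)
        selected.update(range(lo, hi + 1))
    return key_frames, sorted(selected)


# Four uniformly pre-sampled frames from a 100-frame video.
scores = {0: 0.1, 30: 0.9, 60: 0.2, 90: 0.7}
print(select_and_resample(scores, top_k=2, radius=2, num_frames=100))
```

The second return value contains frames that were never in the uniform pre-sample, which is the point of localized re-sampling: fewer regions of the video are kept, but each kept region is seen at finer temporal detail.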
[62] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
Nengbo Zhang,Hann Woei Ho
Main category: cs.CV
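TL;DR: MAVR-Net is a multi-view learning framework for Micro Aerial Vehicle (MAV) action recognition that combines RGB frames, optical flow, and segmentation masks, using a cross-view attention module and a multi-scale feature pyramid to improve robustness and accuracy.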
Details
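Motivation: Existing MAV action recognition methods that rely only on RGB data struggle to capture complex spatiotemporal motion characteristics, which limits their recognition ability. Contribution: 1. Proposes the MAVR-Net framework, combining three complementary views (RGB, optical flow, segmentation masks); 2. Designs a cross-view attention module and a multi-view alignment loss; 3. Clearly outperforms existing methods on benchmark datasets.
Method: 1. ResNet-based encoders extract features from each view; 2. A multi-scale feature pyramid preserves spatiotemporal details; 3. A cross-view attention module models dependencies among modalities and feature scales; 4. A multi-view alignment loss enforces semantic consistency.
Result: Achieves 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
Insight: Combining multiple views with cross-view attention substantially improves MAV action recognition, especially for modeling complex spatiotemporal features.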
Abstract: Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
[63] DPTrack: Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking
Zhiqiang Zhu,Xinbo Gao,Wen Lu,Jie Li,Zhaoyang Wang,Mingqian Ge
Main category: cs.CV
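TL;DR: DPTrack is a prompt-learning-based nighttime aerial tracker that encodes the target object's attribute features into a directional kernel enriched with fine-grained cues, generating precise prompts and improving tracking performance.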
Details
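Motivation: Existing prompt-learning-based nighttime aerial trackers rely solely on spatial localization supervision, which yields vague prompts that cannot focus accurately on target features and leads to poor tracking performance. Contribution: Proposes DPTrack, which encodes the target's topological structure and attribute features into a directional kernel to generate fine-grained prompts, improving the accuracy of nighttime aerial tracking.
Method: 1) Hierarchically capture the target's topological structure; 2) encode the topology-aware features into a directional kernel; 3) propagate the kernel through a kernel-guided prompt module to generate precise prompts.
Result: DPTrack performs strongly on multiple benchmarks, demonstrating its robustness and accuracy.
Insight: The key to nighttime aerial tracking is generating precise prompts from the target's topological attributes and fine-grained cues; the directional kernel serves as the core guidance signal for this task.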
Abstract: Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker’s ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object’s attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object’s topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object’s fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack’s superior performance. Our code will be available at https://github.com/zzq-vipsl/DPTrack.
[64] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation
Vu Tram Anh Khuong,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo
Main category: cs.CV
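TL;DR: The paper proposes a phase-aware temporal augmentation method based on dynamic images (DIs): each micro-expression sequence is split into onset-to-apex and apex-to-offset motion phases, and a DI is generated for each phase, enriching motion diversity and improving recognition performance.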
Details
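Motivation: Micro-expression recognition (MER) is limited by the scarcity of annotated data, and existing methods rely mainly on simple spatial augmentations (e.g., flipping, rotation) while overlooking temporal augmentation strategies. This paper aims to improve MER through phase-aware temporal augmentation. Contribution: Proposes a Dual-phase DI augmentation strategy that splits each micro-expression sequence into two motion phases and generates a separate DI for each, introducing complementary temporal cues that improve recognition.
Method: Using the dynamic image technique, each micro-expression sequence is decomposed into an onset-to-apex phase and an apex-to-offset phase, and a DI is generated for each. The method is simple, model-agnostic, and effective in low-resource settings.
Result: Experiments on the CASME-II and SAMM datasets show consistent improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, with up to a 10% relative improvement when combined with spatial augmentations.
Insight: Increasing temporal diversity through phase decomposition captures the subtle motion of micro-expressions and offers a robust, general solution for low-resource MER.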
Abstract: Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic image. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
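The dual-phase decomposition in this entry can be sketched on toy data. A dynamic image collapses a frame sequence into a single image by a weighted sum over time; the sketch assumes the simplified linear weights a_t = 2t - T - 1 from approximate rank pooling, which may differ from the exact coefficients the authors use, and `dynamic_image` and `dual_phase_dynamic_images` are hypothetical helper names.

```python
def dynamic_image(frames):
    """Collapse a frame sequence into one image using weights a_t = 2t - T - 1.

    frames: list of 2D lists with identical shape, ordered in time.
    Later frames get positive weights and earlier frames negative ones,
    so the result encodes the direction of change over the sequence.
    """
    total = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - total - 1
        for i in range(height):
            for j in range(width):
                out[i][j] += alpha * frame[i][j]
    return out


def dual_phase_dynamic_images(frames, apex_index):
    """Return one dynamic image per phase; the apex frame belongs to both."""
    onset_to_apex = dynamic_image(frames[: apex_index + 1])
    apex_to_offset = dynamic_image(frames[apex_index:])
    return onset_to_apex, apex_to_offset


# One-pixel toy sequence: intensity rises to the apex (index 2), then falls.
frames = [[[0.0]], [[1.0]], [[2.0]], [[1.0]], [[0.0]]]
print(dynamic_image(frames))
print(dual_phase_dynamic_images(frames, apex_index=2))
```

In this symmetric toy sequence the single onset-to-offset image cancels to zero while the two phase images keep opposite-signed responses, which gives some intuition for why phase-specific images can carry complementary temporal cues.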
[65] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes
Lingfeng Xuan,Chang Nie,Yiqing Xu,Zhe Liu,Yanzi Miao,Hesheng Wang
Main category: cs.CV
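TL;DR: MRASfM is a multi-camera SfM framework for driving scenes. It makes camera pose estimation more reliable by exploiting the fixed spatial relationships between cameras, uses a plane model to improve road surface reconstruction, and treats the multi-camera set as a single unit in Bundle Adjustment to improve efficiency, with strong results on public datasets.
Details
Motivation: Multi-camera SfM in driving scenes suffers from unreliable pose estimation, excessive outliers in road surface reconstruction, and low efficiency; MRASfM aims to address these challenges. Contribution: 1. Proposes MRASfM, a multi-camera SfM framework for driving scenes; 2. Improves pose estimation through fixed spatial relationships; 3. Uses a plane model to reduce outliers in road surface reconstruction; 4. Improves efficiency in Bundle Adjustment; 5. Adds a multi-scene aggregation module that refines results from coarse to fine.
Method: 1. Uses the fixed spatial relationships of the multi-camera system to optimize poses; 2. Removes road surface outliers with a plane model; 3. Treats the multi-camera set as a single unit in Bundle Adjustment to reduce the number of variables; 4. Integrates scenes through the multi-scene aggregation module.
Result: Achieves an absolute pose error of 0.124 on the nuScenes dataset, outperforming existing methods.
Insight: Hard constraints and joint multi-camera optimization substantially improve the reliability and efficiency of SfM in driving scenes.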
Abstract: Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
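As a rough illustration of removing erroneous road points with a plane model, the sketch below keeps only points within a distance threshold of a given plane. How MRASfM obtains the plane and chooses the threshold is not described in the abstract, so the plane coefficients, the threshold, and the function name are all assumptions.

```python
import math


def filter_road_points(points, plane, max_distance):
    """Keep 3D points whose distance to the plane is at most max_distance.

    plane is (a, b, c, d) for a*x + b*y + c*z + d = 0.
    Assumption: the plane is already known (e.g. fitted beforehand).
    """
    a, b, c, d = plane
    norm = math.sqrt(a * a + b * b + c * c)
    if norm == 0:
        raise ValueError("plane normal must be non-zero")
    kept = []
    for x, y, z in points:
        distance = abs(a * x + b * y + c * z + d) / norm
        if distance <= max_distance:
            kept.append((x, y, z))
    return kept


# Hypothetical road plane z = 0 with a 5 cm tolerance.
triangulated = [(0.0, 0.0, 0.02), (1.0, 2.0, 0.5), (3.0, 1.0, -0.04)]
road_points = filter_road_points(triangulated, (0.0, 0.0, 1.0, 0.0), 0.05)
```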
[66] MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
Jinghao Huang,Yaxiong Chen,Ganchao Liu
Main category: cs.CV
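TL;DR: The paper proposes MSAM, a method for drone video-text retrieval that improves retrieval performance through a multi-semantic adaptive learning mechanism and a cross-modal interactive feature fusion strategy.
Details
Motivation: Drone videos have distinctive overhead perspectives, strong structural homogeneity, and diverse semantic expressions, which existing cross-modal retrieval methods designed for ground-level views struggle to model, so a dedicated retrieval mechanism is needed. Contribution: 1) First to systematically propose and study the drone video-text retrieval task; 2) Designs a multi-semantic adaptive learning mechanism and a cross-modal interactive feature fusion strategy; 3) Validates the method on two self-constructed datasets.
Method: MSAM comprises a multi-semantic adaptive learning mechanism (dynamic inter-frame changes and semantic extraction from scene regions), an adaptive semantic construction module, a distribution-driven semantic learning term, and a diversity semantic term, combined with a cross-modal interactive feature fusion pooling mechanism that reduces background interference.
Result: Experiments on two self-constructed drone video-text datasets show that MSAM outperforms existing methods on the retrieval task.
Insight: The distinctive viewpoint and semantic complexity of drone videos call for a purpose-built retrieval mechanism; dynamic semantic mining and focusing features on target regions are key to better retrieval.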
Abstract: With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.
[67] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition
Vu Tram Anh Khuong,Thi Bich Phuong Man,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo
Main category: cs.CV
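TL;DR: The paper proposes Combined Optical Flow (COF), which combines the onset-to-apex and apex-to-offset phases for more comprehensive micro-expression recognition and improves recognition performance.
Details
Motivation: Existing micro-expression recognition methods usually consider only the onset-to-apex optical flow and ignore the key temporal dynamics of the apex-to-offset phase, which limits how complete and accurate recognition can be. Contribution: Introduces COF, an optical flow method that integrates the dynamics of both phases to provide a more comprehensive motion analysis and improve micro-expression recognition.
Method: Combined Optical Flow (COF) uses optical flow from both the onset-to-apex and apex-to-offset phases to strengthen the feature representation.
Result: Experiments on the CASMEII and SAMM datasets show that COF outperforms methods based on a single optical flow, confirming that it captures micro-expression dynamics effectively.
Insight: The temporal dynamics of the apex-to-offset phase matter for micro-expression recognition, and future work should pay more attention to this phase.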
Abstract: Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.
[68] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi
Main category: cs.CV
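TL;DR: Shakti-VLMs are a family of vision-language models with 1B and 4B parameters that reach strong performance with less training data through architectural innovations and a three-stage training strategy, targeting enterprise multimodal tasks.
Details
Motivation: Current vision-language models rely on large amounts of training data; Shakti-VLMs aim to reduce data requirements and improve efficiency through model design and training strategy. Contribution: Proposes the Shakti-VLM model family, with QK-Normalization, hybrid normalization, and enhanced positional encoding, together with a three-stage training strategy.
Method: Uses QK-Normalization to stabilize attention, combined with hybrid normalization and improved positional encoding, and optimizes learning efficiency with a three-stage training strategy.
Result: Shakti-VLM-1B and Shakti-VLM-4B perform well on document understanding, visual reasoning, OCR extraction, and multimodal reasoning tasks.
Insight: High performance can come from model design and training strategy rather than data scale alone, which gives an efficient solution for enterprise multimodal tasks.
Abstract: We introduce Shakti VLM, a family of vision-language models with 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.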
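To make the QK-Normalization idea concrete, here is a minimal single-head sketch in which queries and keys are L2-normalized before the dot product, so attention logits stay bounded regardless of activation magnitude. The abstract does not say which normalization Shakti uses; the L2 variant, the fixed `scale` temperature (typically learnable in practice), and the function names are assumptions.

```python
import math


def l2_normalize(vector, eps=1e-6):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention(queries, keys, values, scale=10.0):
    """Single-head attention with L2-normalized queries and keys.

    Assumption: `scale` is a fixed stand-in for a learnable temperature.
    """
    norm_q = [l2_normalize(q) for q in queries]
    norm_k = [l2_normalize(k) for k in keys]
    dim = len(values[0])
    outputs = []
    for q in norm_q:
        # Logits lie in [-scale, scale] because q and k have unit length.
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in norm_k]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out_small = qk_norm_attention([[1.0, 0.0]], keys, values)
out_large = qk_norm_attention([[1000.0, 0.0]], keys, values)
```

Because only the direction of each query and key survives normalization, rescaling the query by a factor of 1000 leaves the output essentially unchanged, which is the stability property the technique is meant to provide.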
[69] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
Xianmin Chen,Peiliang Huang,Longfei Han,Dingwen Zhang,Junwei Han
Main category: cs.CV
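TL;DR: The paper proposes HiMA, a hierarchical mixing architecture that combines the strengths of Transformer and Mamba modules for efficient low-light RAW image enhancement. It also introduces Local Distribution Adjustment (LoDA) and Multi-prior Fusion (MPF) modules, improving both enhancement quality and efficiency.
Details
Motivation: Low-light RAW image enhancement is a complex task, and existing deep learning methods struggle to balance efficiency and quality. HiMA aims to overcome this with a new architectural design. Contribution: 1. Proposes the HiMA hierarchical mixing architecture, combining the strengths of Transformer and Mamba; 2. Designs the LoDA module to handle uneven local illumination; 3. Introduces the MPF module to fuse multi-domain priors and enhance detail.
Method: HiMA uses Transformer modules for large-scale features and Mamba modules for small-scale features; LoDA adaptively adjusts local feature distributions; MPF fuses spatial and frequency-domain priors.
Result: On multiple public datasets, HiMA outperforms existing methods with fewer parameters.
Insight: Combining the strengths of different modules (such as Transformer and Mamba) improves both the performance and efficiency of low-light image enhancement; local feature processing and multi-domain prior fusion are key.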
Abstract: Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at https://github.com/Cynicarlos/HiMA.
[70] Exploring Conditions for Diffusion models in Robotic Control
Heeseong Shin,Byeongho Heo,Dongyoon Han,Seungryong Kim,Taekyung Kim
Main category: cs.CV
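TL;DR: Pre-trained diffusion models have driven notable progress in visual representations, but naively applying text conditions to robotic control tasks works poorly. The paper proposes ORCA, which improves control performance through adaptive task prompts and visual prompts.
Details
Motivation: Pre-trained visual representations perform well in imitation learning, but because they are task-agnostic they may not transfer directly to robotic control. The paper explores how to obtain task-adaptive visual representations from pre-trained diffusion models without fine-tuning the model itself. Contribution: Proposes ORCA, which combines learnable task prompts with visual prompts to address the poor performance of diffusion models in robotic control caused by the domain gap.
Method: Designs adaptive task prompts (learnable task prompts) and visual prompts that capture fine-grained, frame-specific details, so that the diffusion model's representations better suit control tasks.
Result: ORCA achieves state-of-the-art performance on multiple robotic control benchmarks, significantly surpassing existing methods.
Insight: Simply copying strategies that succeed in other vision domains (such as text conditioning) may not work in robotic control; more dynamic, task-specific conditions are needed.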
Abstract: While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions, a successful strategy in other vision domains, yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model’s training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
[71] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents
Tingyu Lin,Marco Peer,Florian Kleber,Robert Sablatnig
Main category: cs.CV
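TL;DR: ClapperText is a benchmark dataset for text recognition in low-resource archival documents, with annotated handwritten and printed text, suited to OCR research in challenging conditions.
Details
Motivation: Documents in historical archives are often heavily degraded and resources are limited, so existing OCR methods perform poorly on them. ClapperText aims to provide a realistic low-resource dataset that advances document understanding and OCR. Contribution: 1) Provides the ClapperText dataset, with 9,813 annotated frames and 94,573 text instances; 2) Designs rotated bounding box and multi-label annotations that support spatially precise OCR; 3) Proposes a per-video evaluation protocol and benchmarks multiple models under zero-shot and fine-tuned conditions.
Method: The dataset is derived from 127 World War II-era archival video segments, with annotations covering transcription, semantic category, text type, and occlusion status. Text regions are represented as rotated bounding boxes (4-point polygons) to support precise detection for OCR.
Result: Experiments show that fine-tuning substantially improves model performance despite the small training set (18 videos), demonstrating the dataset's suitability for few-shot learning scenarios.
Insight: ClapperText exposes the challenges of complex historical documents (such as motion blur and handwriting variation) and provides a valuable resource and benchmark for low-resource OCR research.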
Abstract: This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText’s suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
[72] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
Xiaoming Zhu,Xu Huang,Qinghongbing Xie,Zhi Deng,Junsheng Yu,Yirui Guan,Zhongyuan Liu,Lin Zhu,Qijun Zhao,Ligang Liu,Long Zeng
Main category: cs.CV
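TL;DR: Imaginarium is a vision-guided 3D scene layout generation system that improves layout richness and quality through a high-quality asset library together with image generation and image parsing modules.
Details
Motivation: Traditional optimization methods depend on cumbersome hand-crafted rules, generative models fall short in diversity and robustness, and large language models struggle to capture complex spatial relationships. Contribution: 1) Builds a high-quality 3D scene asset library; 2) Develops vision-guided image generation and parsing modules; 3) Proposes a layout optimization method based on scene graphs and semantics.
Method: 1) Build the asset library; 2) Fine-tune an image generation model; 3) Design an image parsing module; 4) Optimize the layout using scene graphs and semantics.
Result: User testing shows that the method significantly outperforms existing methods in layout richness and quality.
Insight: Combining visual guidance with semantic optimization is an effective route to 3D layout generation.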
Abstract: Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.
[73] FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Zhen Sun,Lei Tan,Yunhang Shen,Chengmao Cai,Xing Sun,Pingyang Dai,Liujuan Cao,Rongrong Ji
Main category: cs.CV
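TL;DR: FlexiReID is a flexible cross-modal person re-identification framework that supports four modalities (RGB, infrared, sketch, text) and seven retrieval modes. It dynamically integrates features through an adaptive mixture-of-experts (MoE) mechanism and achieves leading performance on the newly constructed CIRS-PEDES dataset.
Details
Motivation: Existing cross-modal person re-identification methods are limited to specific modality combinations and lack flexibility, so they cannot support the varied query-retrieval needs of real deployments. Contribution: 1) Proposes the FlexiReID framework, supporting flexible combinations of four modalities; 2) Introduces an adaptive MoE mechanism and a cross-modal query fusion module; 3) Constructs the CIRS-PEDES dataset as a unified multimodal evaluation benchmark.
Method: Uses adaptive MoE to dynamically integrate multimodal features and a cross-modal query fusion module to improve feature extraction.
Result: Achieves state-of-the-art performance on the CIRS-PEDES dataset and shows strong generalization.
Insight: Dynamic feature fusion and flexible modality support are key to making cross-modal person re-identification practical.
Abstract: Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.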
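A generic mixture-of-experts layer gives a feel for how modality features might be integrated dynamically: a gate scores each expert from the input, and the expert outputs are blended with softmax weights. This is a textbook dense MoE sketch, not FlexiReID's actual design; the linear gate, the toy experts, and all names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(features, experts, gate_weights):
    """Blend expert outputs with input-dependent softmax gate weights.

    features: list of floats; experts: list of callables mapping a feature
    list to an output list; gate_weights: one weight vector per expert
    (assumption: a linear gate without bias).
    """
    gate_logits = [
        sum(w * f for w, f in zip(row, features)) for row in gate_weights
    ]
    gate = softmax(gate_logits)
    expert_outputs = [expert(features) for expert in experts]
    dim = len(expert_outputs[0])
    mixed = [
        sum(g * out[i] for g, out in zip(gate, expert_outputs))
        for i in range(dim)
    ]
    return mixed, gate


# Two toy experts; zero gate weights give both experts equal say.
toy_experts = [lambda f: [2.0 * x for x in f], lambda f: [-x for x in f]]
mixed_output, gate_values = mixture_of_experts(
    [1.0, 2.0], toy_experts, [[0.0, 0.0], [0.0, 0.0]]
)
```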
[74] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection
Andrei-Timotei Ardelean,Patrick Rückbeil,Tim Weyrich
Main category: cs.CV
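TL;DR: The paper proposes QFCA, an efficient zero-shot texture anomaly detection method that compares patch statistics on histograms of quantized features, achieving a 10x speedup while maintaining high accuracy.
Details
Motivation: Existing texture anomaly detection methods are slow, which makes them hard to apply in real-world settings such as production line monitoring. The paper addresses both speed and accuracy through quantization and PCA preprocessing. Contribution: 1. Proposes QFCA, a quantized version of the FCA algorithm that is substantially faster; 2. Introduces PCA-based feature preprocessing that increases the contrast between normal and anomalous features.
Method: 1. Quantizes feature values and compares statistics using histograms; 2. Improves feature discriminability with PCA preprocessing.
Result: QFCA achieves a 10x speedup, with accuracy on par with or better than existing methods.
Insight: Quantization and PCA preprocessing balance speed and accuracy effectively, offering a practical solution for zero-shot anomaly detection.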
Abstract: Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
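The notion of comparing patch statistics on histograms of quantized values can be sketched as follows. The abstract does not give QFCA's actual statistic, so this example assumes scalar features, uniform quantization, and an L1 distance between cumulative histograms (a 1-D earth mover's distance in bin units); all names are hypothetical.

```python
def quantize(values, num_bins, low, high):
    """Map each value to an integer bin index in [0, num_bins - 1]."""
    width = (high - low) / num_bins
    bins = []
    for value in values:
        index = int((value - low) / width)
        bins.append(min(max(index, 0), num_bins - 1))  # clamp out-of-range
    return bins


def histogram(bin_indices, num_bins):
    """Normalized histogram of bin indices."""
    counts = [0] * num_bins
    for index in bin_indices:
        counts[index] += 1
    total = len(bin_indices)
    return [count / total for count in counts]


def histogram_distance(hist_a, hist_b):
    """L1 distance between cumulative histograms (assumed statistic)."""
    distance = 0.0
    cum_a = cum_b = 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a
        cum_b += b
        distance += abs(cum_a - cum_b)
    return distance


# Toy scalar features: a reference texture patch and an anomalous patch.
reference = histogram(quantize([0.1, 0.2, 0.3, 0.4], 4, 0.0, 1.0), 4)
anomalous = histogram(quantize([0.8, 0.9, 0.85, 0.95], 4, 0.0, 1.0), 4)
score_normal = histogram_distance(reference, reference)
score_anomalous = histogram_distance(reference, anomalous)
```

Working on a handful of integer bins instead of raw feature values is what makes this kind of comparison cheap, which is consistent with the speedup the abstract reports.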
[75] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration
Tomáš Chobola,Julia A. Schnabel,Tingying Peng
Main category: cs.CV
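TL;DR: The paper proposes Noise2Detail (N2D), an ultra-lightweight data-free denoising method that addresses the compute and memory limitations of existing self-supervised denoising techniques while delivering fast, high-quality image restoration.
Details
Motivation: Existing self-supervised denoising techniques perform well but are often limited in practice by high compute and memory demands, making it hard to balance inference speed and reconstruction quality. This is especially true in biomedical imaging, where clean training data is scarce and imaging modalities are complex. Contribution: Proposes the Noise2Detail (N2D) model, whose multistage denoising pipeline breaks the spatial correlations of noise patterns, produces intermediate smooth structures, and refines them to recover fine details. The method substantially reduces compute requirements without clean reference images or explicit noise modeling.
Method: Building on the Noise2Noise framework, N2D uses a multistage denoising pipeline: it first disrupts the spatial correlations of the noise to produce smooth structures, then recovers details directly from the noisy input. The method needs no external data and relies only on the noisy image.
Result: Experiments show that N2D outperforms existing data-free denoising techniques while using far fewer computational resources.
Insight: N2D's efficiency, low cost, and data-free nature make it well suited to biomedical imaging, where clean data is scarce, while supporting fast inference for practical use.
Abstract: Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high-quality image restoration. Built upon the Noise2Noise training framework, which removes the reliance on clean reference images or explicit noise modeling, we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and a data-free approach makes it a valuable tool for biomedical imaging, overcoming the challenge of scarce clean training data (due to rare and complex imaging modalities) while enabling fast inference for practical use.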
[76] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey
Shuchang Lyu,Qi Zhao,Zheng Zhou,Meng Li,You Zhou,Dingding Yao,Guangliang Cheng,Huiyu Zhou,Zhenwei Shi
Main category: cs.CV
TL;DR: This paper is a comprehensive survey of deep learning based domain adaptation methods in remote sensing, covering task categorization, input mode, supervision paradigm, and algorithmic granularity, and summarizing current progress and future challenges.
Details
Motivation: Domain adaptation in remote sensing is highly challenging because of differences in data distribution (such as ground sampling distance and imaging modes). Deep learning has attracted wide attention in this area for its strong feature representation and cross-domain knowledge transfer, but a comprehensive, systematic survey has been missing. Contribution: 1. Proposes a systematic taxonomy that organizes existing algorithms from multiple perspectives; 2. Covers a broader range of domain adaptation tasks than previous surveys; 3. Summarizes current progress and future research directions.
Method: Categorizes and reviews deep learning methods from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity.
Result: Summarizes the performance of current state-of-the-art methods and offers guidance on future research directions.
Insight: The diversity and complexity of remote sensing domain adaptation tasks call for a more systematic taxonomy; future research may concentrate on multimodal data fusion and more efficient cross-domain learning methods.
Abstract: Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain to a differently distributed target domain. It has broad real-world applications, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.
[77] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation
Lei Shi,Gang Li,Junxing Zhang
Main category: cs.CV
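TL;DR: The paper proposes an ultrasound image segmentation method with very weak supervision that needs only four extreme points as annotation. It uses the SAM2 model to generate initial pseudo labels and progressively refines the segmentation with the FGEPM algorithm and uncertainty estimation.
Details
Motivation: Conventional fully supervised medical image segmentation requires large amounts of pixel-level annotation, which is costly and time-consuming. To reduce the annotation burden, the paper proposes a weakly supervised method that needs only extreme points. Contribution: 1) Proposes a weakly supervised segmentation framework based on extreme points; 2) Designs the FGEPM algorithm, which combines uncertainty estimation to refine pseudo labels; 3) Introduces a USC loss and a box alignment loss to improve spatial consistency and boundary precision.
Method: 1) Derives bounding boxes from the extreme points as prompts for SAM2 to generate initial pseudo labels; 2) Estimates uncertainty with the FGEPM algorithm and Monte Carlo dropout to refine boundaries; 3) Trains the model with the USC loss and the box alignment loss.
Result: Experiments on the BUSI and UNS datasets show performance close to, and even surpassing, fully supervised methods, while greatly reducing annotation cost.
Insight: 1) Extreme point annotations are enough to support high-quality segmentation; 2) Uncertainty estimation helps boundary refinement; 3) Weakly supervised methods have considerable potential in practice.
Abstract: Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing, fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.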
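The first step of the pipeline, deriving a bounding box from four extreme points, is simple enough to sketch directly. The (x, y) pixel convention, the box format, and the function name are assumptions; how the box is then passed to SAM2 is not shown here.

```python
def box_from_extreme_points(extreme_points):
    """Return (x_min, y_min, x_max, y_max) enclosing four extreme points.

    extreme_points: four (x, y) pairs, e.g. the left-most, top-most,
    right-most, and bottom-most points of the target (any order).
    """
    if len(extreme_points) != 4:
        raise ValueError("expected exactly four extreme points")
    xs = [x for x, _ in extreme_points]
    ys = [y for _, y in extreme_points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical clicks on a lesion: left, top, right, bottom.
clicks = [(10, 40), (35, 12), (80, 50), (42, 90)]
prompt_box = box_from_extreme_points(clicks)
```

Since each extreme point lies on the object boundary, the resulting box is tight by construction, which is what makes it a useful prompt.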
[78] Valeo Near-Field: a novel dataset for pedestrian intent detection
Antonyo Musabini,Rachid Benmokhtar,Jagdish Bhanushali,Victor Galizzi,Bertrand Luvison,Xavier Perrotton
Main category: cs.CV
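TL;DR: The paper presents Valeo Near-Field, a new dataset for detecting the intentions of pedestrians as they approach a vehicle. It contains multi-modal data with detailed annotations and benchmarks, and aims to advance intelligent vehicle research in near-field scenarios.
Details
Motivation: Existing datasets fall short for pedestrian intent detection and near-field perception, particularly in multi-modal data synchronization and real-world scenario diversity. The paper aims to support research and improvement of related algorithms with a high-quality, multi-modal dataset. Contribution: 1. Presents a multi-modal dataset for pedestrian intent detection; 2. Provides detailed annotations of 3D body joint positions and pedestrian positions; 3. Releases a benchmark suite with metrics for performance and efficiency; 4. Suggests future research directions.
Method: The dataset is built from synchronized multi-modal recordings from fisheye cameras, lidar, ultrasonic sensors, and motion capture-based 3D poses. Annotations include 3D joint positions and pedestrian positions. The baselines use custom neural network architectures.
Result: The dataset and benchmarks provide evaluation standards for pedestrian detection, 3D pose estimation, and trajectory prediction, and show potential for multi-modal data fusion tasks.
Insight: Multi-modal data combined with detailed annotation can make pedestrian intent detection markedly more robust, especially under complex dynamic environments and hardware constraints. The dataset is an important resource for research on near-field scenarios.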
Abstract: This paper presents a novel dataset aimed at detecting pedestrians’ intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.
[79] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI
Gerard Comas-Quiles,Carles Garcia-Cabrera,Julia Dietlmeier,Noel E. O’Connor,Ferran Marques
Main category: cs.CV
TL;DR: The paper proposes MViT-AE, an unsupervised learning method on multimodal MRI that segments brain tumors from reconstruction error maps. It addresses the scarcity of annotated data and achieves clinically meaningful results on the BraTS-GoAT 2025 dataset.
Details
Motivation: Because annotated data is scarce, costly, or inconsistent, conventional supervised methods for brain tumor segmentation face a scalability bottleneck. Unsupervised anomaly detection (UAD) offers a complementary approach that does not depend on manual annotation. Contribution: 1. Proposes a new Multimodal Vision Transformer Autoencoder (MViT-AE); 2. Introduces a multimodal early-late fusion strategy; 3. Integrates the Segment Anything Model (SAM) to refine predicted tumor contours.
Method: 1. Trains MViT-AE on healthy brain MRIs; 2. Detects and localizes tumors from reconstruction error maps; 3. Improves performance with the multimodal fusion strategy; 4. Refines the segmentation with SAM post-processing.
Result: On the BraTS-GoAT 2025 test set, the lesion-wise Dice scores are 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor); the anomaly detection rate on the validation set is 89.4%.
Insight: Transformer-based unsupervised models could become scalable, efficient tools for neuro-oncological imaging, especially when annotated data is limited.
Abstract: Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated on the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
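For readers unfamiliar with the metric, the sketch below computes the plain Dice Similarity Coefficient, 2|P ∩ R| / (|P| + |R|), on binary masks. The abstract reports the lesion-wise variant, which scores individual lesions and is not reproduced here; the nested-list mask format and the empty-mask convention are assumptions.

```python
def dice_coefficient(predicted, reference):
    """Plain Dice score for two binary masks given as nested lists of 0/1.

    Note: this is the standard whole-mask Dice, not the lesion-wise
    variant reported in the abstract.
    """
    intersection = 0
    predicted_sum = 0
    reference_sum = 0
    for predicted_row, reference_row in zip(predicted, reference):
        for p, r in zip(predicted_row, reference_row):
            intersection += p * r
            predicted_sum += p
            reference_sum += r
    if predicted_sum + reference_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (predicted_sum + reference_sum)


# Toy 2x2 masks: the prediction covers the true pixel plus one extra.
predicted_mask = [[1, 1], [0, 0]]
reference_mask = [[1, 0], [0, 0]]
dice = dice_coefficient(predicted_mask, reference_mask)
```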
[80] UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
Junzhi Ning,Wei Li,Cheng Tang,Jiashi Lin,Chenglong Ma,Chaoyang Zhang,Jiyao Liu,Ying Chen,Shujian Gao,Lihao Liu,Yuandong Pu,Huihui Xu,Chenhui Gou,Ziyan Huang,Yi Xin,Qi Qin,Zhongying Deng,Diping Song,Bin Fu,Guang Yang,Yuanfeng Ji,Tianbin Li,Yanzhou Su,Jin Ye,Shixiang Tang,Ming Hu,Junjun He
Main category: cs.CV
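TL;DR: The paper proposes the UniMedVL framework, which unifies medical multimodal understanding and generation through the Observation-Knowledge-Analysis (OKA) paradigm, closing gaps in data representation and feature integration in existing medical AI systems.
Details
Motivation: Existing medical AI systems are fragmented when handling multimodal inputs and producing diverse outputs: they cannot perform image understanding and generation together, which limits their usefulness in real medical diagnosis. Contribution: 1. Builds the UniMed-5M dataset; 2. Proposes Progressive Curriculum Learning; 3. Designs UniMedVL, the first unified medical multimodal model supporting both image understanding and generation.
Method: Unifies medical multimodal tasks at three levels through the OKA paradigm: the observation level (dataset construction), the knowledge level (progressive learning), and the analysis level (the UniMedVL model).
Result: UniMedVL performs strongly on five medical image understanding benchmarks while matching specialized models on generation across eight medical imaging modalities.
Insight: The unified architecture enables bidirectional knowledge sharing: generation tasks improve visual understanding features, showing that integrating traditionally separate capabilities can improve performance on medical vision-language tasks.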
Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
[81] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig
Main category: cs.CV
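TL;DR: DGME-T is a lightweight extension to the Video Swin Transformer that uses directional grid motion encoding (DGME) to make camera movement classification more robust, including on historical archival footage.
Details
Motivation: Camera movement classification (CMC) models trained on high-quality modern footage degrade on historical archival film, which is noisy, low in contrast, and missing frames. Contribution: 1. A unified benchmark dataset; 2. DGME-T, which strengthens a Transformer model with directional grid motion encoding.
Method: Directional grid motion encoding, derived from optical flow, is injected into the Video Swin Transformer through a learnable, normalised late-fusion layer.
Result: On modern video, top-1 accuracy rises from 81.78% to 86.14% and macro F1 from 82.08% to 87.81%; on World War II archival footage, accuracy rises from 83.43% to 84.62% and macro F1 from 81.72% to 82.63%.
Insight: Structured motion priors and Transformer representations are complementary; even a small motion head can substantially improve robustness on degraded film.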
Abstract: Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone’s top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
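The abstract does not spell out how the directional grid motion encoding is computed. As a rough illustration only, the sketch below assumes one plausible reading: split the optical-flow field into a grid of cells and, per cell, build a magnitude-weighted histogram of flow directions. The function name, grid size, and bin count are hypothetical and are not taken from the paper.

```python
import math

def directional_grid_encoding(flow, grid=2, bins=4):
    """Encode a flow field as per-cell direction histograms.

    flow: H x W nested list of (dx, dy) vectors (assumed input layout).
    Returns grid * grid * bins floats; each cell's histogram sums to 1
    (or is all zeros if the cell contains no motion).
    """
    h, w = len(flow), len(flow[0])
    hist = [[0.0] * bins for _ in range(grid * grid)]
    bin_width = 2 * math.pi / bins
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue  # static pixels carry no direction
            angle = math.atan2(dy, dx) % (2 * math.pi)
            # Shift by half a bin so axis-aligned motion falls mid-bin.
            b = int((angle + bin_width / 2) / bin_width) % bins
            cell = (y * grid // h) * grid + (x * grid // w)
            hist[cell][b] += mag
    encoding = []
    for cell_hist in hist:
        total = sum(cell_hist)
        encoding.extend(v / total if total > 0 else 0.0 for v in cell_hist)
    return encoding

# A uniform pan to the right puts all mass in direction bin 0 of every cell.
pan_right = [[(1.0, 0.0)] * 4 for _ in range(4)]
encoding = directional_grid_encoding(pan_right)
```

In a late-fusion setup such as the one the abstract describes, a vector like this would be normalised and combined with the backbone's features; that fusion step is not sketched here.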
[82] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai,Qiuyu Wang,Hao Ouyang,Yue Yu,Hanlin Wang,Wen Wang,Ka Leong Cheng,Shuailei Ma,Yanhong Zeng,Zichen Liu,Yinghao Xu,Yujun Shen,Qifeng Chen
Main category: cs.CV
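TL;DR: The paper presents Ditto, a framework that addresses the scarcity of instruction-based video editing data by generating a large, high-quality synthetic dataset (Ditto-1M), and uses it to train a strong model, Editto.
Details
Motivation: Instruction-based video editing holds promise for content creation, but progress is slow for lack of large-scale, high-quality training data; this work targets that bottleneck. Contribution: 1. The Ditto framework and its data generation pipeline, which pairs an image editor with a video generator to increase data diversity; 2. An efficient distilled model architecture with a temporal enhancer that balances cost and quality; 3. An agent-driven pipeline that maintains data quality at scale; 4. The Ditto-1M dataset, built with over 12,000 GPU-days, and the Editto model trained on it (state of the art).
Method: 1. Build a data generation pipeline that fuses an image editor with an in-context video generator; 2. Use a distilled model and a temporal enhancer to reduce compute and improve temporal coherence; 3. Have an intelligent agent write diverse instructions and filter the outputs; 4. Train Editto with a curriculum learning strategy.
Result: Editto shows superior instruction-following ability and sets a new state of the art in instruction-based video editing.
Insight: Efficient generation and quality control of synthetic data are the key levers for instruction-based video editing, and efficiency measures such as distillation and temporal enhancement are essential for building data at scale.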
Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
[83] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye,Chao-Han Huck Yang,Arushi Goel,Wei Huang,Ligeng Zhu,Yuanhang Su,Sean Lin,An-Chieh Cheng,Zhen Wan,Jinchuan Tian,Yuming Lou,Dong Yang,Zhijian Liu,Yukang Chen,Ambrish Dantrey,Ehsan Jahangiri,Sreyan Ghosh,Daguang Xu,Ehsan Hosseini-Asl,Danial Mohseni Taheri,Vidya Murali,Sifei Liu,Jason Lu,Oluwatobi Olabiyi,Frank Wang,Rafael Valle,Bryan Catanzaro,Andrew Tao,Song Han,Jan Kautz,Hongxu Yin,Pavlo Molchanov
Main category: cs.CV
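TL;DR: OmniVinci is an open-source omni-modal understanding LLM that improves performance on multimodal tasks through architectural innovations and data curation.
Details
Motivation: Advancing machine intelligence requires perceiving across modalities, much as humans sense the world. Contribution: The OmniVinci model, with three key innovations (OmniAlignNet, Temporal Embedding Grouping, and Constrained Rotary Time Embedding), plus a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations.
Method: OmniAlignNet strengthens alignment between vision and audio embeddings; Temporal Embedding Grouping captures relative temporal alignment between the two signals, and Constrained Rotary Time Embedding encodes absolute temporal information.
Result: OmniVinci outperforms Qwen2.5-Omni on DailyOmni (+19.05), MMAR (+1.7), and Video-MME (+3.9) while using one sixth of the training tokens (0.2T versus 1.2T).
Insight: Modalities reinforce one another in reasoning as well as in perception.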
Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
[84] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior
Haoran Wang,Bo Zhao,Jinghui Wang,Hanzhang Wang,Huan Yang,Wei Ji,Hao Liu,Xinyan Xiao
Main category: cs.CV
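TL;DR: This paper proposes SEGA, a stepwise evolution paradigm for content-aware layout generation that uses a hierarchical reasoning framework and design priors to improve complex layout planning.
Details
Motivation: Existing methods typically use a single-step reasoning framework with no feedback mechanism, so their failure rate rises sharply on complex layout planning. Contribution: 1. The SEGA stepwise evolution paradigm; 2. Layout design principles incorporated as prior knowledge; 3. GenPoster-100K, a new large-scale poster dataset.
Method: A hierarchical, coarse-to-fine framework: a coarse module first estimates the layout, then a refining module corrects it, with design principles supplied as prior knowledge.
Result: State-of-the-art results on multiple benchmark datasets.
Insight: Combining stepwise reasoning with design priors improves the robustness and quality of layout generation.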
Abstract: In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually approach this task with a single-step reasoning framework. Lacking a feedback-based self-correction mechanism, their failure rates increase significantly when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, a refining module performs fine-level reasoning on the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. In addition, we present GenPoster-100K, a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art results on multiple benchmark datasets. Our project page is at: https://brucew91.github.io/SEGA.github.io/
[85] Semantic segmentation with coarse annotations
Jort de Jong,Mike Holenderski
Main category: cs.CV
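TL;DR: This paper proposes a regularization method for semantic segmentation with coarse annotations that uses superpixel-based upsampling to improve boundary alignment.
Details
Motivation: Fine annotations for semantic segmentation are expensive; coarse annotations are an alternative, but they yield poor boundary alignment. Contribution: A regularization method built on SLIC superpixels that improves boundary alignment under coarse annotations.
Method: A superpixel-based upsampling regularizer is added to an encoder-decoder architecture, encouraging the segmentation output to agree with SLIC superpixels.
Result: Evaluated on SUIM, Cityscapes, and PanNuke; boundary recall is significantly better than that of state-of-the-art models trained on coarse annotations.
Insight: Low-level image cues captured by superpixels (color and position) can compensate for the boundary information that coarse annotations lack.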
Abstract: Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves the best results with annotated images in which each pixel is labeled with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel-based upsampling. It encourages the segmented pixels in the decoded image to be SLIC superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
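To make the idea of encouraging predictions to agree with superpixels concrete, here is a minimal sketch. It is not the paper's regularizer, which works through superpixel-based upsampling in the decoder; it is a simpler, assumed stand-in that penalises variation of the predicted foreground probability inside each superpixel. The superpixel ids are taken as given (in practice they could come from a SLIC implementation), and the function name is hypothetical.

```python
def superpixel_consistency(probs, superpixels):
    """Mean squared deviation of each pixel's probability from the mean
    probability of its superpixel.

    probs: flat list of per-pixel foreground probabilities.
    superpixels: flat list of superpixel ids, one per pixel.
    Returns 0.0 when predictions are constant within every superpixel.
    """
    sums, counts = {}, {}
    for p, s in zip(probs, superpixels):
        sums[s] = sums.get(s, 0.0) + p
        counts[s] = counts.get(s, 0) + 1
    means = {s: sums[s] / counts[s] for s in sums}
    squared = sum((p - means[s]) ** 2 for p, s in zip(probs, superpixels))
    return squared / len(probs)

# Predictions that follow the superpixel layout incur no penalty;
# predictions that cut across superpixels do.
aligned = superpixel_consistency([0.9, 0.9, 0.1, 0.1], [0, 0, 1, 1])
crossing = superpixel_consistency([0.9, 0.1, 0.9, 0.1], [0, 0, 1, 1])
```

Because superpixels depend only on pixel color and position, a term like this needs no labels and can be applied to the unlabeled pixels near class boundaries.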
[86] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
Gaoxiang Huang,Songning Lai,Yutao Yue
Main category: cs.CV
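TL;DR: The paper proposes a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components, improving concept alignment and classification performance.
Details
Motivation: Existing Concept Bottleneck Models (CBMs) suffer from input-to-concept mapping bias and limited controllability, which limits their practical value and the reliability of concept-based decisions. Contribution: LDCBM, which uses a filter grouping loss and joint concept supervision to better align visual patterns with concepts, improving both interpretability and classification performance.
Method: A lightweight disentanglement approach that automatically groups visual features into semantic components without region annotation, trained with a filter grouping loss and joint concept supervision.
Result: On three diverse datasets, LDCBM achieves higher concept and class accuracy than previous CBMs, with more transparent and robust decision-making.
Insight: Grounding concepts in visual evidence overcomes a fundamental limitation of prior models and makes interpretable AI more reliable.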
Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value and directly undermines the reliability of decisions made with concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
[87] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection
Haowei Zhu,Tianxiang Pan,Rui Qin,Jun-Hai Yong,Bin Wang
Main category: cs.CV
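TL;DR: ReCon is a data augmentation framework that combines a region-controllable generative model with feedback from a perception model to address content-position mismatches and semantic leakage in current generative approaches.
Details
Motivation: Existing generative approaches to data augmentation are prone to content-position mismatches and semantic leakage, and they often need complex post-processing or extensive fine-tuning. ReCon aims to improve the quality and trainability of generated data. Contribution: 1. The ReCon framework, which integrates region-guided rectification and alignment; 2. Region-aligned cross-attention, which improves semantic consistency and image fidelity.
Method: 1. Feedback from a pre-trained perception model is used during diffusion sampling to rectify misgenerated regions; 2. Region-aligned cross-attention enforces spatial-semantic alignment between image regions and their textual cues.
Result: ReCon substantially improves the quality and trainability of generated data, with consistent gains across datasets, backbone architectures, and data scales.
Insight: Perception-model feedback and region alignment make generated data more controllable and higher in quality, offering a new direction for data augmentation.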
Abstract: The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.
[88] VISTA: A Test-Time Self-Improving Video Generation Agent
Do Xuan Long,Xingchen Wan,Hootan Nakhost,Chen-Yu Lee,Tomas Pfister,Sercan Ö. Arık
Main category: cs.CV
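TL;DR: VISTA is a multi-agent system that improves video generation at test time by iteratively refining the prompt, and it outperforms existing methods.
Details
Motivation: Text-to-video quality depends heavily on precise user prompts, and existing test-time optimization methods struggle with the multi-faceted nature of video. Contribution: VISTA, in which multiple agents decompose the user's intent, evaluate the generated videos, and feed critiques back into iterative prompt refinement.
Method: VISTA decomposes the user's idea into a structured temporal plan, generates videos, picks the best one through a pairwise tournament, has three specialized agents (visual, audio, contextual) critique it, and has a reasoning agent turn that feedback into a rewritten prompt.
Result: On single- and multi-scene generation, VISTA consistently improves video quality and alignment with user intent, reaching up to a 60% pairwise win rate against state-of-the-art baselines; human evaluators prefer VISTA outputs in 66.4% of comparisons.
Insight: Better video generation depends not only on better generative models but also on iterative prompt refinement and multi-dimensional evaluation.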
Abstract: Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
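The selection step mentioned in the abstract, picking the best video through a pairwise tournament, can be sketched as a single-elimination bracket. The bracket format and the `prefer` judge below are assumptions for illustration; the abstract does not describe how VISTA's tournament or its judge is actually implemented.

```python
def pairwise_tournament(candidates, prefer):
    """Single-elimination selection.

    candidates: non-empty list of items (e.g. generated videos).
    prefer(a, b): hypothetical judge that returns the preferred item.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out advances with a bye
        pool = next_round
    return pool[0]

# Toy judge: compare a precomputed score. A real judge would inspect the
# videos themselves.
videos = [("v1", 0.4), ("v2", 0.9), ("v3", 0.7)]
winner = pairwise_tournament(videos, lambda a, b: a if a[1] >= b[1] else b)
```

A bracket needs only n - 1 comparisons for n candidates, which matters when each comparison is an expensive model call; the trade-off is that a noisy judge can eliminate a strong candidate early.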
[89] Neuro-Symbolic Spatial Reasoning in Segmentation
Jiayi Lin,Jiabo Huang,Shaogang Gong
Main category: cs.CV
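TL;DR: The paper proposes RelateSeg, which brings neuro-symbolic spatial reasoning to open-vocabulary semantic segmentation (OVSS) by imposing explicit spatial relational constraints, improving generalization to unseen objects.
Details
Motivation: Methods based on vision-language models (VLMs) lack an understanding of the spatial relations among objects in a scene, which limits their segmentation of unseen categories. Contribution: 1) The first exploration of neuro-symbolic (NeSy) spatial reasoning in OVSS; 2) RelateSeg, which models spatial relational constraints with first-order logic (FOL); 3) End-to-end learning with no additional parameters.
Method: RelateSeg extracts spatial relations (e.g., <cat, to-right-of, person>), encodes them as first-order logic formulas using pseudo categories, and applies fuzzy logic relaxation so that the constraints can be enforced inside a deep network.
Result: State-of-the-art average mIoU across four benchmark datasets, with clear advantages on images containing multiple categories, at the cost of a single auxiliary loss and no additional parameters.
Insight: Explicitly modeling spatial relations improves OVSS, especially in complex scenes, and combining neural and symbolic methods is a promising direction.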
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., “cat”) and a spatial pseudo category (e.g., “right of person”) simultaneously, enforcing relational constraints (e.g., a “cat” pixel must lie to the right of a “person”). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
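A fuzzy relaxation turns a logical rule into a differentiable quantity. The sketch below illustrates this for the rule "if a pixel is 'cat', then it is 'right of person'", using the Reichenbach implication I(a, b) = 1 - a + a*b. The choice of this particular implication operator and the mean aggregation over pixels are assumptions; the abstract does not say which relaxation RelateSeg uses.

```python
def implication_loss(p_semantic, p_spatial):
    """Auxiliary loss for the rule: semantic(x) -> spatial(x), for all x.

    p_semantic: per-pixel probability of the semantic category ("cat").
    p_spatial: per-pixel probability of the spatial pseudo category
               ("right of person").
    Uses the Reichenbach implication 1 - a + a*b as the fuzzy truth value
    and averages over pixels as a soft "for all".
    Returns 0.0 when the rule holds everywhere, 1.0 when it fails everywhere.
    """
    truth = [1.0 - a + a * b for a, b in zip(p_semantic, p_spatial)]
    return 1.0 - sum(truth) / len(truth)

# Rule satisfied: cat pixels are right of the person, non-cat pixels are
# unconstrained.
satisfied = implication_loss([1.0, 0.0], [1.0, 0.0])
# Rule violated: a confident cat pixel that is not right of the person.
violated = implication_loss([1.0], [0.0])
```

Because the loss is a polynomial in the two probabilities, it adds no parameters, which matches the abstract's description of a single auxiliary loss.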
[90] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Joongwon Chae,Lihui Luo,Xi Yuan,Dongmei Yu,Zhenglin Chen,Lian Zhang,Peiwu Qin
Main category: cs.CV
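TL;DR: Memory-SAM generates prompts automatically through retrieval, with no human input or model fine-tuning, and segments tongue images more accurately than FCN and SAM baselines.
Details
Motivation: Supervised tongue segmentation needs large annotated datasets, and SAM-family models depend on manual prompts. Memory-SAM aims for automatic segmentation without training or human prompts, with better robustness and data efficiency. Contribution: A training-free, human-prompt-free segmentation framework that combines retrieval with SAM2.
Method: Dense DINOv3 features and FAISS retrieval over a small memory of prior cases yield foreground/background point prompts that guide SAM2.
Result: Evaluated on 600 expert-annotated images; on the mixed test split, Memory-SAM reaches mIoU 0.9863, above FCN (0.8188) and a detector-to-box SAM baseline (0.1839).
Insight: Retrieval can produce effective prompts for SAM automatically, improving robustness and data efficiency on irregular boundaries.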
Abstract: Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
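The retrieval-to-prompt idea can be sketched in two steps: retrieve the most similar exemplar from memory, then label each query patch as a foreground or background point according to the mask value of its best-matching exemplar patch. This sketch swaps FAISS for a brute-force cosine search and uses toy feature vectors in place of DINOv3 features; the data layout (`global`, `patches`, `mask`) and function names are hypothetical, and the real correspondence and prompt-selection logic may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory):
    """Return the memory entry whose global feature is closest to the query."""
    return max(memory, key=lambda m: cosine(query_vec, m["global"]))

def points_from_exemplar(query_patches, exemplar, top_k=1):
    """Pick up to top_k foreground and background patch indices in the query.

    A query patch inherits the mask label of its most similar exemplar
    patch; the most confident matches are kept as point prompts.
    """
    scored = []
    for q_idx, q in enumerate(query_patches):
        sims = [cosine(q, e) for e in exemplar["patches"]]
        best = max(range(len(sims)), key=sims.__getitem__)
        scored.append((sims[best], q_idx, exemplar["mask"][best]))
    scored.sort(reverse=True)
    fg = [i for _, i, is_fg in scored if is_fg][:top_k]
    bg = [i for _, i, is_fg in scored if not is_fg][:top_k]
    return fg, bg

memory = [
    {"global": [1.0, 0.0], "patches": [[1.0, 0.0], [0.0, 1.0]], "mask": [True, False]},
    {"global": [0.0, 1.0], "patches": [[0.0, 1.0], [1.0, 0.0]], "mask": [True, False]},
]
exemplar = retrieve([0.8, 0.2], memory)
fg_points, bg_points = points_from_exemplar([[0.9, 0.1], [0.1, 0.9]], exemplar)
```

The resulting patch indices would be converted to pixel coordinates and passed to the segmenter as positive and negative point prompts.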
[91] BLIP3o-NEXT: Next Frontier of Native Image Generation
Jiuhai Chen,Le Xue,Zhiyang Xu,Xichen Pan,Shusheng Yang,Can Qin,An Yan,Honglu Zhou,Zeyuan Chen,Lifu Huang,Tianyi Zhou,Junnan Li,Silvio Savarese,Caiming Xiong,Ran Xu
Main category: cs.CV
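TL;DR: BLIP3o-NEXT is an open-source image generation foundation model that unifies text-to-image generation and image editing in an autoregressive + diffusion architecture, with strong performance and consistency.
Details
Motivation: The work aims to push the frontier of native image generation by unifying text-to-image generation and image editing, improving both generation and editing. Contribution: (1) A single architecture for text-to-image generation and image editing; (2) A combination of the strengths of autoregressive and diffusion models; (3) Four key insights.
Method: An autoregressive + diffusion architecture: an autoregressive model generates discrete image tokens, and its hidden states condition a diffusion model that renders high-fidelity images.
Result: BLIP3o-NEXT outperforms existing models on a range of text-to-image and image-editing benchmarks.
Insight: (1) Most architectural choices perform comparably; what matters is efficient scaling and fast inference; (2) Reinforcement learning can further advance image generation; (3) Image editing remains challenging, though post-training and a data engine help; (4) Data quality and scale set the upper bound on performance.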
Abstract: We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
[92] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
Kaushitha Silva,Mansitha Eashwara,Sanduni Ubayasiri,Ruwan Tennakoon,Damayanthi Herath
Main category: cs.CV
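TL;DR: BiomedXPro is an evolutionary framework that uses a large language model as both a biomedical knowledge extractor and an adaptive optimizer to generate diverse, interpretable natural-language prompt pairs for disease diagnosis, outperforming existing prompt-tuning methods.
Details
Motivation: Clinical adoption of biomedical vision-language models is held back by prompt optimization techniques that lack transparency and diversity and cannot capture the multi-faceted nature of clinical diagnosis, which limits trust. Contribution: BiomedXPro, which generates a diverse ensemble of natural-language prompt pairs and improves diagnostic performance and interpretability, particularly in data-scarce few-shot settings.
Method: A large language model acts as knowledge extractor and adaptive optimizer within an evolutionary procedure that automatically generates diverse natural-language prompt pairs; statistical analysis checks their semantic alignment with clinical features.
Result: On multiple biomedical benchmarks, BiomedXPro outperforms state-of-the-art prompt-tuning methods, and the discovered prompts align strongly with statistically significant clinical features.
Insight: Diverse, interpretable prompts both improve performance and give predictions a verifiable basis, a step toward more trustworthy AI in high-stakes settings.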
Abstract: The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model’s performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
[93] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Shr-Ruei Tsai,Wei-Cheng Chang,Jie-Ying Lee,Chih-Hai Su,Yu-Lun Liu
Main category: cs.CV
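TL;DR: LightsOut is a diffusion-based outpainting method for Single Image Flare Removal (SIFR) that reconstructs off-frame light sources to improve existing methods.
Details
Motivation: Lens flare significantly degrades image quality, and existing SIFR methods perform poorly when off-frame light sources are incomplete or absent; a way to fill in the missing light sources is needed. Contribution: LightsOut, a diffusion-based outpainting framework that reconstructs missing light sources and improves existing SIFR methods; it serves as a plug-and-play preprocessing module with no additional retraining.
Method: A multitask regression module is combined with a LoRA fine-tuned diffusion model to produce realistic, physically consistent outpainting results.
Result: Across challenging scenarios, LightsOut consistently improves existing SIFR methods, supporting its use as a general preprocessing step.
Insight: Outpainting off-frame light sources is a new angle on lens flare removal and illustrates the potential of diffusion models for image restoration.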
Abstract: Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/
eess.IV [Back]
[94] Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks
Saqib Qamar
Main category: eess.IV
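TL;DR: The paper proposes MIRA-U, a semi-supervised framework for skin lesion segmentation that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture.
Details
Motivation: Skin lesion segmentation is essential for early skin cancer detection, but scarce annotated data limits model performance; the author addresses this with semi-supervised learning. Contribution: 1. The MIRA-U framework, which combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture; 2. A U-shaped network with cross-attention skip connections; 3. Strong performance in low-annotation regimes.
Method: 1. A teacher network pre-trained via masked image modeling generates confidence-weighted soft pseudo-labels; 2. A U-shaped CNN-Transformer student network with cross-attention skip connections learns from them; 3. The design improves pseudo-label quality and boundary delineation.
Result: On ISIC-2016 and PH2, with only 50% labeled data, the method reaches DSC 0.9153 and IoU 0.8552, surpassing the baselines.
Insight: Uncertainty awareness and a hybrid architecture improve semi-supervised segmentation, especially when annotations are scarce.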
Abstract: Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.
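One way to read "confidence-weighted soft pseudo-labels" is a per-pixel loss in which the student matches the teacher's soft label, weighted by how far the teacher is from an uncertain 0.5. The sketch below implements that reading for binary segmentation. The weighting function |2t - 1| and the use of binary cross-entropy are assumptions; the abstract does not give MIRA-U's actual formulation.

```python
import math

def confidence_weighted_bce(student_probs, teacher_probs, eps=1e-7):
    """Binary cross-entropy against soft teacher labels, weighted by
    teacher confidence.

    The weight |2t - 1| is 0 where the teacher outputs 0.5 (no opinion)
    and 1 where it is certain, so uncertain pixels contribute little.
    Returns 0.0 if the teacher is uncertain everywhere.
    """
    total, weight_sum = 0.0, 0.0
    for s, t in zip(student_probs, teacher_probs):
        confidence = abs(2.0 * t - 1.0)
        s = min(max(s, eps), 1.0 - eps)  # avoid log(0)
        bce = -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
        total += confidence * bce
        weight_sum += confidence
    return total / weight_sum if weight_sum > 0 else 0.0

# An uncertain teacher pixel (0.5) is ignored; only the confident one counts.
loss = confidence_weighted_bce([0.9, 0.2], [1.0, 0.5])
```

Down-weighting uncertain pixels limits the damage from wrong pseudo-labels, which tend to concentrate near lesion boundaries.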
cs.RO [Back]
[95] VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
Zehao Ni,Yonghao He,Lingfeng Qian,Jilei Mao,Fa Fu,Wei Sui,Hu Su,Junran Peng,Zhipeng Wang,Bin He
Main category: cs.RO
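TL;DR: This paper proposes VO-DP, a vision-only diffusion policy learning method that fuses semantic and geometric features, significantly outperforming the vision-only baseline on robotic manipulation and matching or exceeding a point cloud-based method.
Details
Motivation: Most existing visuomotor diffusion policy methods rely on point cloud inputs, and vision-only solutions remain underexplored. VO-DP addresses this gap by using visual foundation models for feature fusion. Contribution: 1. VO-DP, which applies pretrained visual foundation models to vision-only, single-view diffusion policy learning; 2. Fusion of semantic and geometric features via cross-attention; 3. An open-source training library that supports multi-machine, multi-GPU training.
Method: Intermediate VGGT features, which incorporate DINOv2 semantic features and geometric features from Alternating Attention blocks, are fused via cross-attention and spatially compressed by a CNN to form the policy head's input.
Result: In simulation, VO-DP averages a 64.6% success rate, on par with the point cloud-based DP3 (64.0%) and far above the vision-only baseline DP (34.8%); in real-world tasks it reaches 87.9%, well above DP3 (67.5%) and DP (11.2%).
Insight: Vision-only input with semantic-geometric fusion can stand in for point cloud methods and does better in the real-world tasks; robustness tests show stability under changes in color, size, background, and lighting.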
Abstract: In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point-cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
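Cross-attention, the fusion operator named in the abstract, lets one set of tokens gather information from another. The sketch below is a bare single-head version with no learned projections, written in plain Python for clarity. Which stream supplies the queries and which supplies the keys and values in VO-DP is not stated in the abstract, so the roles below are an assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: tokens from one stream (here assumed: semantic features).
    keys, values: tokens from the other stream (assumed: geometric features).
    Returns one fused vector per query token.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# A query strongly aligned with the first key pulls in mostly the first value.
fused = cross_attention([[10.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

A practical implementation would add learned query, key, and value projections and multiple heads; the fused tokens would then be spatially compressed before reaching the policy head, as the abstract describes.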
cs.LG [Back]
[96] Internalizing World Models via Self-Play Finetuning for Agentic RL
Shiqi Chen,Tongyao Zhu,Zian Wang,Jinghan Zhang,Kangrui Wang,Siyang Gao,Teng Xiao,Yee Whye Teh,Junxian He,Manling Li
Main category: cs.LG
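TL;DR: The paper proposes SPA, a framework that cold-starts the policy with a self-play supervised finetuning stage to learn a world model, improving the decision-making of LLM agents across several environments.
Details
Motivation: LLM agents struggle in complex, dynamic environments, especially under out-of-distribution conditions where vanilla RL training fails to scale. The authors hypothesize that an internal world model improves decision-making. Contribution: 1. The SPA framework, which combines self-play supervised finetuning with reinforcement learning. 2. A decomposition of the world model into state representation and transition modeling. 3. Performance gains demonstrated across multiple tasks.
Method: 1. Cold-start the policy with self-play supervised finetuning (SFT) so that it learns the world model by interacting with the environment. 2. Use the world model to simulate future states prior to policy optimization. 3. Train the agent with SFT followed by reinforcement learning.
Result: For Qwen2.5-1.5B-Instruct, the Sokoban success rate rises from 25.6% to 59.8% and the FrozenLake score from 22.1% to 70.9%.
Insight: Explicitly modeling environment dynamics can substantially improve LLM agents on complex tasks, especially in out-of-distribution scenarios.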
Abstract: Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k, the probability that at least one of k sampled trajectories succeeds, drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
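The abstract defines Pass@k as the probability that at least one of k sampled trajectories succeeds. Given n sampled trajectories of which c succeeded, a standard unbiased estimate is 1 - C(n - c, k) / C(n, k). The sketch below computes it; whether the paper uses this exact estimator is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate Pass@k from n sampled trajectories with c successes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 success in 4 samples, Pass@1 is the plain success rate.
rate = pass_at_k(4, 1, 1)
```

A falling Pass@k during training, as the abstract reports for vanilla RL, means the policy is losing the rare successful trajectories it could previously sample, which is a sign of narrowing exploration.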
[97] Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
Samuel Paech,Allen Roush,Judah Goldfeder,Ravid Shwartz-Ziv
Main category: cs.LG
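TL;DR: The paper presents Antislop, a framework for detecting and eliminating overused phrase patterns ("slop") in language model output, which sharply reduces slop while preserving model performance.
Details
Motivation: Widespread LLM use has produced characteristic repetitive phrasing (slop) that degrades output quality and makes text recognizable as AI-generated. The paper provides tools to detect and remove these patterns. Contribution: 1. The Antislop Sampler, which suppresses unwanted strings at inference time; 2. An automated pipeline that profiles model-specific slop; 3. FTPO, a novel fine-tuning method that adjusts logits at individual tokens.
Method: Combines the Antislop Sampler (backtracking at inference time), automated slop profiling against human baselines with training data generation, and FTPO fine-tuning.
Result: The Antislop Sampler suppresses 8,000+ patterns; FTPO cuts slop by 90% while maintaining or improving performance on GSM8K, MMLU, and creative writing tasks.
Insight: FTPO reduces slop without the quality loss seen with DPO, showing the potential of token-level optimization.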
Abstract: Widespread LLM adoption has introduced characteristic repetitive phraseology, termed "slop," which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000× more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
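The backtracking idea behind the sampler can be shown with a toy decoder. When the generated tokens end in a banned phrase, the sketch rewinds to the position where the phrase began, blocks the token chosen there, and resamples. This is a simplified greedy illustration under stated assumptions, not the Antislop Sampler's actual algorithm: `rank_next` is a hypothetical stand-in for a model's ranked next-token candidates, and banned phrases are given as token tuples.

```python
def generate_with_backtracking(rank_next, banned, max_tokens):
    """Greedy decoding that backtracks out of banned token sequences.

    rank_next(prefix): hypothetical model stub; returns candidate next
        tokens for the given prefix tuple, best first.
    banned: iterable of token tuples that must not appear in the output.
    """
    tokens = []
    blocked = {}  # position -> tokens disallowed at that position
    while len(tokens) < max_tokens:
        pos = len(tokens)
        options = [t for t in rank_next(tuple(tokens))
                   if t not in blocked.get(pos, set())]
        if not options:
            break  # nothing left to emit at this position
        tokens.append(options[0])
        for phrase in banned:
            n = len(phrase)
            if n and tuple(tokens[-n:]) == tuple(phrase):
                start = len(tokens) - n
                # Block the token that opened the phrase, only at this position.
                blocked.setdefault(start, set()).add(tokens[start])
                # Blocks at later positions belonged to the discarded text.
                for p in [p for p in blocked if p > start]:
                    del blocked[p]
                del tokens[start:]
                break
    return tokens

# Toy "model": candidates depend only on the previous token.
TABLE = {
    None: ["a"], "a": ["rich", "vivid"], "rich": ["tapestry", "story"],
    "vivid": ["story"], "tapestry": ["."], "story": ["."], ".": [],
}

def toy_rank(prefix):
    return TABLE[prefix[-1] if prefix else None]

output = generate_with_backtracking(toy_rank, [("rich", "tapestry")], 10)
```

Unlike a global token ban, the block applies only at the position where the phrase started, so "rich" and "tapestry" both remain available elsewhere; this is the sense in which backtracking avoids "destroying vocabulary".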
[98] DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
Main category: cs.LG
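TL;DR: DLER uses reinforcement learning to get the length penalty right, substantially improving language model efficiency: output length falls by over 70% while accuracy surpasses prior baselines.
Details
Motivation: Reasoning language models such as OpenAI-o1 perform strongly but produce overly long outputs. Maximizing intelligence per token, that is, accuracy relative to response length, remains an open problem.
Contribution: 1. The DLER training recipe, which combines batch-wise reward normalization, higher clipping, and related techniques with a simple truncation length penalty; 2. Difficulty-Aware DLER, which adaptively tightens truncation; 3. An update-selective merging method that retains concise reasoning when data is scarce.
Method: Addresses three RL challenges (biased advantage estimation, entropy collapse, and sparse reward) with batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation penalty.
Result: DLER cuts output length by over 70% while surpassing baseline accuracy; at test time, DLER-7B generates multiple concise responses in parallel with 28% higher accuracy and lower latency than DeepSeek-R1-7B.
Insight: Inadequate RL optimization, not a lack of sophisticated penalty design, is the main reason length penalties fail; adaptive truncation yields further efficiency gains, and selective merging is a practical option when RL training data is scarce.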
Abstract: Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token–accuracy relative to response length–remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty–truncation–and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy–efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
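Two ingredients named in the abstract, the truncation length penalty and batch-wise reward normalization, can be sketched as follows. This is one plausible reading rather than the authors' code: the function names, the 0/1 reward, and the `max_tokens=2000` budget are assumptions, and higher clipping and dynamic sampling are not shown.

```python
from statistics import mean, pstdev
from typing import List


def truncation_reward(is_correct: bool, num_tokens: int, max_tokens: int) -> float:
    """Return 1.0 only when the answer is correct and fits within the token budget."""
    if num_tokens > max_tokens:
        return 0.0  # the response would be cut off, so it earns no reward
    return 1.0 if is_correct else 0.0


def batch_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards with the mean and standard deviation of the whole batch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical batch: (is_correct, num_tokens) for four sampled responses.
samples = [(True, 800), (True, 2500), (False, 600), (True, 1200)]
rewards = [truncation_reward(ok, n, max_tokens=2000) for ok, n in samples]
advantages = batch_normalized_advantages(rewards)
```

The second sample is correct but over budget, so it is treated like a wrong answer; normalizing over the full batch then gives within-budget correct answers a positive advantage and everything else a negative one.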
[99] Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
Xuansheng Wu,Xiaoman Pan,Wenlin Yao,Jianshu Chen
Main category: cs.LG
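TL;DR: This paper studies how intrinsic microscopic properties of pre-trained large language models (LLMs) affect their reasoning potential. It proposes the Soundness-Aware Level (SAL) metric and shows a strong correlation between a model's ability to distinguish sound from unsound knowledge and its reasoning performance.
Details
Motivation: Reinforcement learning with verifiable rewards (RLVR) can markedly improve LLM reasoning, but results vary widely across base models. The paper asks which microscopic properties of pre-trained models cause this variation.
Contribution: 1. Proposes the Soundness-Aware Level (SAL) metric, which quantifies a model's sensitivity to the soundness of Horn-clause reasoning rules; 2. Reveals a strong correlation between a model's reasoning potential and its intrinsic ability to discriminate soundness.
Method: 1. Formalizes reasoning as chains of Horn clauses built from latent-space features extracted with cross-layer sparse autoencoders (SAEs); 2. Estimates transition probabilities between features and uses an LLM to classify each rule's soundness level; 3. Computes SAL using the Jensen-Shannon Divergence.
Result: SAL accurately predicts post-RLVR reasoning performance (R²=0.87) and holds across model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B).
Insight: A model's reasoning potential is closely tied to the soundness discrimination it acquires during pre-training, which underscores the role of pre-training and gives a principled basis for selecting and designing stronger base models.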
Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses (“if-then” rules) built from features extracted from the LLM’s latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules’ soundness levels, becoming highly distinct for “strict” versus “noisy” rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL’s predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model’s reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model’s internal mechanisms for selecting/designing stronger base models.
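The Jensen-Shannon Divergence at the core of SAL can be computed as in the sketch below. The three-bin distributions are invented for illustration, and how the paper aggregates divergences into a single SAL score is not stated in the abstract, so only the divergence itself is shown.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical transition-probability histograms for two soundness levels.
strict = [0.7, 0.2, 0.1]
noisy_aware = [0.1, 0.2, 0.7]     # a soundness-aware model: distributions differ
noisy_agnostic = [0.7, 0.2, 0.1]  # a soundness-agnostic model: distributions collapse

aware_gap = js_divergence(strict, noisy_aware)
agnostic_gap = js_divergence(strict, noisy_agnostic)
```

Under this toy setup the soundness-aware model shows a positive gap between "strict" and "noisy" rules, while the agnostic model's gap is zero, matching the qualitative contrast the abstract describes.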
[100] FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Tiansheng Hu,Tongyan Hu,Liuyang Bai,Yilun Zhao,Arman Cohan,Chen Zhao
Main category: cs.LG
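TL;DR: FinTrust is a comprehensive benchmark designed to evaluate the trustworthiness of large language models (LLMs) in finance, covering a wide range of practical issues and fine-grained tasks. Proprietary models do better on aspects such as safety, open-source models have an edge in industry-level fairness, and all models fall short on legal awareness.
Details
Motivation: Applying LLMs in finance is high-risk and high-stakes, so their trustworthiness needs comprehensive evaluation.
Contribution: Proposes the FinTrust benchmark, which focuses on trustworthiness evaluation of LLMs in finance and covers multiple dimensions with fine-grained tasks.
Method: Designs fine-grained tasks grounded in practical contexts and evaluates 11 LLMs across trustworthiness dimensions.
Result: Proprietary models perform better on tasks such as safety and open-source models lead in industry-level fairness, but all models perform poorly on legal-awareness tasks.
Insight: LLMs in finance urgently need improvement in legal awareness and compliance; FinTrust can serve as an important tool for trustworthiness evaluation.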
Abstract: Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs in real-world finance applications remains challenging due to the high-risk, high-stakes nature of the domain. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks such as safety, while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for evaluating LLMs’ trustworthiness in the finance domain.
[101] Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
Denis Janiak,Jakub Binkowski,Tomasz Kajdanowicz
Main category: cs.LG
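TL;DR: This paper empirically analyzes the limitations of Mahalanobis distance methods for OOD detection, defines the ideal geometry for data representations, and proposes radially scaled ℓ2 normalization, which significantly improves OOD detection performance.
Details
Motivation: Mahalanobis distance methods are widely used for OOD detection, but how their performance depends on representation geometry and normalization is not well understood, which limits downstream applications. The paper aims to fill this gap.
Contribution: 1. An empirical study of the limitations of Mahalanobis methods; 2. A definition of the ideal geometry for data representations; 3. Radially scaled ℓ2 normalization, which improves OOD detection performance.
Method: 1. Tests how universally reliable Mahalanobis methods are across image foundation models, datasets, and normalization schemes; 2. Defines the ideal geometry and uses spectral and intrinsic-dimensionality metrics to predict OOD performance; 3. Proposes radially scaled ℓ2 normalization.
Result: Mahalanobis methods are not universally reliable, and the proposed normalization significantly improves OOD performance.
Insight: Representation geometry and normalization are critical to OOD detection; radially scaled ℓ2 normalization offers a systematic way to tune feature-space geometry.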
Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. While Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren’t universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model’s OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
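One way a tunable parameter could generalize ℓ2 normalization is to divide each feature vector by its norm raised to a power. The sketch below assumes that form; the abstract does not give the exact formula, so `radial_l2_normalize` and the exponent `beta` are hypothetical and may differ from the paper's definition.

```python
import math
from typing import List


def radial_l2_normalize(x: List[float], beta: float, eps: float = 1e-12) -> List[float]:
    """Divide x by ||x||_2 ** beta (assumed parameterization, not taken from the paper).

    beta = 1 recovers standard l2 normalization; beta = 0 leaves x unchanged;
    values in between partially contract the spread of feature norms.
    """
    norm = math.sqrt(sum(v * v for v in x))
    scale = max(norm, eps) ** beta
    return [v / scale for v in x]


feature = [3.0, 4.0]  # toy feature vector with norm 5
unit = radial_l2_normalize(feature, beta=1.0)
unchanged = radial_l2_normalize(feature, beta=0.0)
halfway = radial_l2_normalize(feature, beta=0.5)
halfway_norm = math.sqrt(sum(v * v for v in halfway))
```

In a Mahalanobis-based detector, a transform like this would be applied to features before estimating class means and the shared covariance, so the exponent changes how much the feature norm contributes to the distance.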
[102] Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity
Pieris Panagi,Savvas Karatsiolis,Kyriacos Mosphilis,Nicholas Hadjisavvas,Andreas Kamilaris,Nicolas Nicolaou,Efstathios Stavrakis,Vassilis Vassiliades
Main category: cs.LG
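TL;DR: This paper presents the PoultryFI platform, which integrates AI modules (camera placement optimization, audio-visual monitoring, real-time egg counting, and others) to provide low-cost intelligent monitoring and optimization for poultry farming, improving welfare and productivity.
Details
Motivation: Poultry farming faces pressure on both productivity and animal welfare, yet small and medium-sized farms lack affordable, integrated monitoring tools and rely on manual inspections. PoultryFI aims to fill this gap.
Contribution: Presents one of the first systems to combine low-cost sensing, edge analytics, and prescriptive AI, enabling farm-wide monitoring, production forecasting, and performance optimization for small and medium-sized farms.
Method: The platform has six modules: camera placement optimization (evolutionary algorithms), audio-visual monitoring (video, audio, and feeding data), analytics and alerting (daily summaries and real-time notifications), real-time egg counting (edge vision model), production forecasting (egg yield and feed consumption up to 10 days ahead), and a recommendation module (forecasts combined with weather data).
Result: Field trials show 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting.
Insight: PoultryFI shows how modular AI can turn isolated pilot tools into a scalable farm-intelligence platform, making proactive farm management possible.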
Abstract: Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics & Alerting, Real-Time Egg Counting, Production & Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics & Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.
q-fin.CP [Back]
[103] Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
Tian Guo,Emmanuel Hauptmann
Main category: q-fin.CP
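TL;DR: This paper studies how to combine quantitative factors with newsflow representations generated by large language models to predict stock returns. It proposes a fusion learning framework and compares three fusion methods, then introduces a mixture model with a decoupled training approach to improve stability; experiments yield insights into effective multimodal modeling.
Details
Motivation: In quantitative investing, return prediction is central to stock selection, portfolio optimization, and risk management. The potential of combining traditional quantitative factors with LLM-generated newsflow representations is underexplored, and this paper aims to fill that gap.
Contribution: 1. A fusion learning framework that learns a unified representation from quantitative factors and newsflow. 2. A comparison of three fusion methods: representation combination, representation summation, and attentive representations. 3. A mixture model with a decoupled training approach that addresses training instability.
Method: 1. Fusion learning framework: unifies quantitative factors and LLM-generated newsflow representations. 2. Three fusion methods: simple combination, summation, and an attention mechanism. 3. Mixture model: adaptively combines single-modality and fusion predictions. 4. Decoupled training: a theoretically motivated approach that stabilizes training.
Result: Experiments on real investment universes validate multimodal modeling and provide useful insights into combining factors with newsflow.
Insight: 1. Combining modalities (quantitative factors and newsflow) significantly improves predictive performance. 2. Attention mechanisms and the mixture model perform well in dynamic fusion. 3. Decoupled training effectively resolves the mixture model's training instability.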
Abstract: In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.
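The three fusion methods compared in the abstract can be sketched on plain Python lists. This is an illustrative reading, not the paper's architecture: "representation combination" is interpreted here as concatenation, and the attention variant uses a single hypothetical `query` vector in place of learned parameters.

```python
import math
from typing import List


def fuse_concat(factors: List[float], news: List[float]) -> List[float]:
    """Representation combination, interpreted here as concatenation."""
    return list(factors) + list(news)


def fuse_sum(factors: List[float], news: List[float]) -> List[float]:
    """Representation summation; both inputs must share one dimension."""
    return [f + n for f, n in zip(factors, news)]


def fuse_attention(factors: List[float], news: List[float],
                   query: List[float]) -> List[float]:
    """Attentive fusion: softmax weights over the two modalities, then a weighted sum."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in (factors, news)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for stability
    total = sum(exps)
    w_factors, w_news = exps[0] / total, exps[1] / total
    return [w_factors * f + w_news * n for f, n in zip(factors, news)]


# Hypothetical two-dimensional embeddings for one stock on one date.
factor_repr = [1.0, 2.0]
news_repr = [3.0, 4.0]

combined = fuse_concat(factor_repr, news_repr)
summed = fuse_sum(factor_repr, news_repr)
attended = fuse_attention(factor_repr, news_repr, query=[0.0, 0.0])
```

With an all-zero query both modalities receive equal weight, so attentive fusion reduces to an average; a trained query would instead shift weight toward whichever modality is more informative for a given input.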
cs.CR [Back]
[104] MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation
Gurusha Juneja,Jayanth Naga Sai Pasupulati,Alon Albalak,Wenyue Hua,William Yang Wang
Main category: cs.CR
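TL;DR: MAGPIE is a new benchmark for evaluating privacy preservation in multi-agent collaborative settings. It contains 200 high-stakes tasks and exposes the weaknesses of current state-of-the-art LLM agents in both privacy leakage and collaboration.
Details
Motivation: Existing privacy benchmarks cover only simple single-turn interactions and cannot assess how agents balance privacy against task efficacy in multi-agent collaboration.
Contribution: Proposes MAGPIE, a novel benchmark that systematically evaluates privacy understanding and preservation in multi-agent collaboration.
Method: Designs 200 high-stakes tasks in which private information is essential to the task, forcing agents to trade off collaboration against information control.
Result: State-of-the-art agents such as GPT-5 and Gemini 2.5-Pro leak substantial private information (up to 50.7% for Gemini 2.5-Pro and 35.1% for GPT-5), collaborate poorly, and often resort to manipulation.
Insight: Current LLM agents lack robust privacy preservation in multi-agent settings, and their alignment needs improvement.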
Abstract: A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.
cs.IR [Back]
[105] SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation
Ines Besrour,Jingbo He,Tobias Schreieder,Michael Färber
Main category: cs.IR
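TL;DR: SQuAI is a multi-agent retrieval-augmented generation framework for complex scientific question answering. It decomposes questions, retrieves evidence, and generates answers with citations, improving trustworthiness and answer quality.
Details
Motivation: Existing retrieval-augmented generation systems handle complex open-domain scientific questions poorly and lack explicit citations and trustworthiness. SQuAI aims to address this by providing verifiable answers and contextual relevance.
Contribution: 1. A multi-agent collaborative framework that improves question decomposition and retrieval; 2. In-line citations and supporting sentences that make answers more trustworthy; 3. A released benchmark of 1,000 scientific question-answer-evidence triplets.
Method: SQuAI uses four collaborative agents for 1. question decomposition; 2. hybrid sparse-dense retrieval; 3. adaptive document filtering; 4. generation of answers with citations. It is built on over 2.3 million arXiv papers.
Result: The system improves faithfulness, answer relevance, and contextual relevance by up to 12% (+0.088) over the baseline.
Insight: Multi-agent collaboration and hybrid retrieval can markedly improve scientific question answering, while in-line citations make generated answers more trustworthy and verifiable.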
Abstract: We present SQuAI (https://squai.scads.ai/), a scalable and trustworthy multi-agent retrieval-augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open-domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full-text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in-line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question-answer-evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain-wide scalability, SQuAI demonstrates how multi-agent RAG enables more trustworthy scientific QA with LLMs.
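Hybrid sparse-dense retrieval is commonly implemented by normalizing each retriever's scores and blending them. The sketch below shows that general pattern; the abstract does not say how SQuAI fuses its scores, so `hybrid_rank`, the min-max normalization, and the weight `alpha` are assumptions, and the document IDs and scores are invented.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense values become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float], dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense after normalization."""
    s, d = min_max(sparse), min_max(dense)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical scores: BM25-style sparse scores and cosine-style dense scores.
sparse_scores = {"p1": 12.0, "p2": 4.0, "p3": 8.0}
dense_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.8}
ranking = hybrid_rank(sparse_scores, dense_scores)
```

Here "p3" ranks first because it scores reasonably well under both retrievers, whereas "p1" and "p2" each win under only one; this is the usual motivation for combining lexical and embedding-based retrieval.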
[106] GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery
Italo Luis da Silva,Hanqi Yan,Lin Gui,Yulan He
Main category: cs.IR
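TL;DR: The paper introduces GraphMind, an LLM-based interactive tool that helps users assess the novelty of scientific papers or ideas. It integrates external APIs with LLMs and provides structured views and result traceability.
Details
Motivation: Assessing the novelty of scientific papers requires extensive knowledge of related work, which not all reviewers have. Existing LLM-assisted tools lack transparency and mechanisms for result traceability, and GraphMind aims to address this.
Contribution: GraphMind is an interactive tool that lets users explore related papers through various relationships and obtain verifiable contextual insights; by combining external APIs with LLMs, it makes novelty assessment more transparent and capable.
Method: GraphMind integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval, and classification of papers, giving users structured views and result traceability.
Result: GraphMind is a usable tool that provides rich structured views to support novelty assessment of scientific papers; its functionality is shown through a demonstration video and open-source code.
Insight: GraphMind's novelty lies in combining LLMs with external APIs to address the transparency and traceability gaps of existing tools, supporting more efficient scientific literature analysis.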
Abstract: Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating the novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce GraphMind, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specifically, GraphMind enables users to capture the main structure of a scientific paper, explore related ideas through various perspectives, and assess novelty by providing verifiable contextual insights. GraphMind enables users to annotate key elements of a paper, explore related papers through various relationships, and assess novelty with contextual insight. This tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea’s core contributions and its connections to existing work. GraphMind is available at https://oyarsa.github.io/graphmind and a demonstration video at https://youtu.be/wKbjQpSvwJg. The source code is available at https://github.com/oyarsa/graphmind.
cs.AI [Back]
[107] HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks
Chance Jiajie Li,Zhenze Mo,Yuhan Tang,Ao Qu,Jiayi Wu,Kaiya Ivy Zhao,Yulu Gan,Jie Fan,Jiangbo Yu,Hang Jiang,Paul Pu Liang,Jinhua Zhao,Luis Alberto Alonso Pastor,Kent Larson
Main category: cs.AI
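TL;DR: HugAgent is a benchmark for evaluating how well large language models simulate individualized human reasoning. Its dual-track design, with synthetic and real human data, reveals gaps in how existing models capture individual reasoning styles.
Details
Motivation: Large language models approximate human responses well at the population level but overlook the individuality of reasoning styles and belief changes. HugAgent aims to move machines closer to individualized, human-like reasoning.
Contribution: 1. Proposes the HugAgent benchmark for evaluating how well models simulate individual reasoning styles. 2. Designs a dual-track evaluation with synthetic and real data. 3. Open-sources the benchmark and tools.
Method: HugAgent adopts a dual-track design: (1) a synthetic track for large-scale systematic stress tests; (2) a human track that collects ecologically valid, "out-loud" reasoning data. Together these evaluate a model's intra-agent fidelity.
Result: Experiments show that state-of-the-art large language models still have significant gaps in adapting to individual reasoning styles; HugAgent provides an extensible evaluation tool for this problem.
Insight: Capturing individual differences in human reasoning is the next challenge in making machine reasoning more human-like, and HugAgent offers a workable evaluation framework for it.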
Abstract: Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
[108] Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
Haoran Sun,Yankai Jiang,Zhenyu Tang,Yaning Pan,Shuang Gu,Zekai Lin,Lilong Wang,Wenjie Lou,Lei Liu,Lei Bai,Xiaosong Wang
Main category: cs.AI
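TL;DR: The paper presents the SciRecipe dataset and the Thoth model, which use a structured component-based reward mechanism to improve bio-experimental protocol generation, significantly improving the reliability, logical ordering, and semantic accuracy of protocols.
Details
Motivation: Current large language models generate incomplete or inconsistent bio-experimental protocols, which limits their usefulness in scientific experiments; a reliable way to improve generation quality is needed.
Contribution: 1. The SciRecipe dataset, with over 12K structured protocols; 2. The Sketch-and-Fill paradigm, which separates analysis, structuring, and expression; 3. A structured component-based reward mechanism; 4. The Thoth model, trained in stages to improve protocol generation quality.
Method: Uses the Sketch-and-Fill paradigm together with the structured component-based reward mechanism (which evaluates step granularity, action order, and semantic fidelity), and trains Thoth through a staged Knowledge-to-Action process.
Result: Thoth surpasses leading proprietary and open-source LLMs across multiple benchmarks, with significant improvements in step alignment, logical sequencing, and semantic accuracy.
Insight: Improving protocol generation through structured, staged methods is feasible and effective, and points toward reliable scientific experiment assistants.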
Abstract: The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.
[109] Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation
Ed Li,Junyu Ren,Xintian Pan,Cat Yan,Chuanhao Li,Dirk Bergemann,Zhuoran Yang
Main category: cs.AI
TL;DR: exttt{freephdlabor}是一个开源的多代理框架,支持动态工作流和模块化架构,旨在通过自动化的多代理系统推动科学研究的持续性和交互性。
Details
Motivation: 现有的科学研究自动化系统存在两个核心问题:一是预定义的工作流无法适应中间结果,二是上下文管理不足导致长期研究困难。 exttt{freephdlabor}旨在解决这些问题。Contribution: 提出了一个开源的多代理框架,支持完全动态的工作流和模块化架构,提供自动上下文压缩、会话间记忆持久化等特性,实现持续研究和交互式科学自动化。
Method: 通过多代理系统实现动态规划,模块化设计支持自定义代理,结合自动上下文管理和非阻塞人工干预机制。
Result: exttt{freephdlabor}能够将单次研究扩展为持续性的研究程序,并支持端到端的科学研究自动化。
Insight: 模块化和动态规划是实现科学自动化灵活性和可持续性的关键。
Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present freephdlabor, an open-source multiagent framework featuring fully dynamic workflows determined by real-time agent reasoning and a modular architecture enabling seamless customization – users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms. These features collectively transform automated research from isolated, single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research – from ideation through experimentation to publication-ready manuscripts.
[110] Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment
Lavanya Umapathy,Patricia M Johnson,Tarun Dutt,Angela Tong,Madhur Nayan,Hersh Chandarana,Daniel K Sodickson
Main category: cs.AI
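TL;DR: This paper proposes a context-aware deep learning framework that incorporates individualized prior information to reduce false positives in disease risk prediction and longitudinal health assessment, and validates it on prostate cancer risk prediction.
Details
Motivation: In health monitoring, integrating a patient's history (such as prior imaging and clinical biomarkers) can raise the specificity of risk prediction and reduce false positives, giving a more accurate view of changes in patient health.
Contribution: The main contribution is a deep learning framework that integrates a patient's prior information and significantly lowers the false positive rate of prostate cancer risk prediction while preserving high sensitivity.
Method: The method has two steps: 1) estimate initial disease risk from the most recent visit's medical data; 2) refine the prediction using information digested from prior imaging and clinical data. The model was validated on a multimodal dataset (MRI and blood tests).
Result: Incorporating prior information progressively lowers the false positive rate (from 51% to 24%); for predicting prostate cancer risk within five years, the rate falls from 64% to 9%.
Insight: Integrating temporal context can significantly raise the specificity of medical risk prediction, offering a feasible path to large-scale health monitoring programs, earlier disease detection, and better health outcomes.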
Abstract: Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.
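The relationship between the false positive rates quoted above and specificity can be made concrete with a short calculation. The percentages are taken from the abstract; the helper names and the derived relative reductions are illustrative additions, not figures reported by the authors.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN), computed over patients without the disease."""
    return false_positives / (false_positives + true_negatives)


def specificity_from_fpr(fpr: float) -> float:
    """Specificity is the complement of the false positive rate."""
    return 1.0 - fpr


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original false positive rate that was removed."""
    return (before - after) / before


# False positive rates reported in the abstract.
single_visit, with_imaging, with_clinical = 0.51, 0.33, 0.24
five_year_before, five_year_after = 0.64, 0.09

specificity_gain = (specificity_from_fpr(with_clinical)
                    - specificity_from_fpr(single_visit))
at_visit_cut = relative_reduction(single_visit, with_clinical)
five_year_cut = relative_reduction(five_year_before, five_year_after)
```

Moving from 51% to 24% raises specificity by 27 percentage points (from 49% to 76%), which corresponds to removing roughly half of the false positives at the time of the visit.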