Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 51]
- cs.NI [Total: 1]
- cs.IR [Total: 1]
- cs.AI [Total: 2]
- eess.IV [Total: 6]
- cs.GR [Total: 2]
- q-bio.QM [Total: 1]
- cs.LG [Total: 6]
- eess.AS [Total: 1]
- cs.CR [Total: 1]
cs.CL
[1] From Image Captioning to Visual Storytelling
Admitos Passadakis,Yingjin Song,Albert Gatt
Main category: cs.CL
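TL;DR: The paper proposes a framework that combines Image Captioning with Visual Storytelling: it first generates captions for the images and then turns them into a coherent story, which improves story quality and shortens training. The authors also propose a new metric/tool, 'ideality', that simulates how far results are from an oracle model.
Details
Motivation: Visual Storytelling is a multimodal task that requires a story that is both grounded in an image sequence and coherent. Most prior work generates the story directly and overlooks the link to image captioning. This work combines the two tasks to improve story quality and efficiency.
Contribution: 1. Treats Visual Storytelling as a superset of Image Captioning and generates stories in two steps; 2. Designs a unified framework that improves story quality and accelerates training; 3. Proposes a new metric/tool, 'ideality', applied to emulate human-likeness in visual storytelling.
Method: 1. A vision-to-language model generates captions for the input images; 2. Language-to-language methods transform the captions into a coherent narrative; 3. The 'ideality' metric simulates how far results are from an oracle model.
Result: The evaluation indicates that integrating captioning and storytelling in a unified framework improves story quality, and that training is faster than in many previous studies. The 'ideality' metric is applied to emulate human-likeness.
Insight: Decomposing a complex task into subtasks (caption first, then narrate) can improve results, and a unified framework aids reusability and reproducibility.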
Abstract: Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be grounded in the image sequence while also being narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior relevant studies. This means that we first employ a vision-to-language model to obtain captions of the input images, and then these captions are transformed into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.
[2] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach
Yiran Rex Ma
Main category: cs.CL
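TL;DR: Using comparable English and Chinese news corpora annotated by a large language model, the paper contrasts the ordering of adverbial functional chunks in the two languages and finds both systematic preferences and dynamic adaptability.
Details
Motivation: To study how the order of adverbial functional chunks differs between English and Chinese news, and thereby characterize differences in information structure between the two languages.
Contribution: Provides new empirical support for contrastive research on English-Chinese constituent order, describing the positional preferences of adverbial functional chunks and how their order is adjusted.
Method: Analyzes the order distribution and positional preferences of functional chunks in comparable English-Chinese news corpora annotated by a large language model.
Result: English news tends to put core information first, with functional chunks mostly post-positioned; Chinese news tends to put background first, with functional chunks often pre-positioned. In SVO structures the two languages differ in distribution: the Chinese tendency toward pre-positioning is more pronounced, while the English tendency toward post-positioning is relatively mild.
Insight: Word order reflects systematic preferences but also adapts dynamically, driven by information and pragmatic purposes.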
Abstract: Based on comparable English-Chinese news corpora annotated by a Large Language Model (LLM), this paper explores the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles, and analyzes their typical positional preferences and distribution patterns. It is found that: (1) English news prefers a linear narrative of core information first, and functional chunks are mostly post-positioned, while Chinese news prefers an overall presentation mode of background first, and functional chunks are often pre-positioned; (2) In SVO structure, both English and Chinese news show differences in the distribution of functional chunks, but the tendency of Chinese pre-positioning is more significant, while that of English post-positioning is relatively mild; (3) When functional chunks co-occur, both English and Chinese news show high flexibility, and the order adjustment is driven by information and pragmatic purposes. The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for contrastive study of English-Chinese information structure.
[3] T-REX: Table – Refute or Entail eXplainer
Tim Luka Horstmann,Baptiste Geisenberger,Mehwish Alam
Main category: cs.CL
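TL;DR: T-REX is an interactive tool for verifying textual claims against multimodal, multilingual tables. It is built on instruction-tuned large language models (LLMs) and aims to make advanced fact-checking technology easy for non-experts to use.
Details
Motivation: LLMs have made progress on table fact-checking, but the resulting systems remain hard for non-experts to access. The authors therefore built T-REX as a transparent, easy-to-use interactive tool.
Contribution: T-REX is presented as the first live, interactive tool based on instruction-tuned LLMs that supports claim verification over multimodal, multilingual tables, giving non-experts access to advanced fact-checking.
Method: T-REX uses state-of-the-art instruction-tuned reasoning LLMs to verify claims against tabular data, and is designed as an interactive tool with an emphasis on accuracy and transparency.
Result: T-REX is openly available online and offers a transparent solution for table fact-checking.
Insight: An interactive design with a non-expert-friendly interface can lower the technical barrier of a complex task while aiming to preserve accuracy.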
Abstract: Verifying textual claims against structured tabular data is a critical yet challenging task in Natural Language Processing with broad real-world impact. While recent advances in Large Language Models (LLMs) have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. We introduce T-REX (T-REX: Table – Refute or Entail eXplainer), the first live, interactive tool for claim verification over multimodal, multilingual tables using state-of-the-art instruction-tuned reasoning LLMs. Designed for accuracy and transparency, T-REX empowers non-experts by providing access to advanced fact-checking technology. The system is openly available online.
[4] Confidence Estimation for Text-to-SQL in Large Language Models
Sepideh Entezari Maleki,Mohammadreza Pourreza,Davood Rafiei
Main category: cs.CL
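TL;DR: The paper studies confidence estimation for text-to-SQL generation with large language models (LLMs), covering black-box and white-box strategies. Consistency-based and SQL-syntax-aware methods stand out, and execution-based grounding of queries improves both further.
Details
Motivation: In text-to-SQL, it is important to judge how reliable a generated SQL query is, especially when gold answers are unavailable. Access to LLM weights and gradients is often restricted, so confidence estimation methods that do not need internal parameters are required.
Contribution: 1. Studies black-box and white-box confidence estimation strategies for text-to-SQL. 2. Shows that consistency-based methods perform best among black-box approaches, while SQL-syntax-aware approaches are advantageous in white-box settings. 3. Shows that execution-based grounding is a valuable supplementary signal for confidence estimation.
Method: 1. Black-box: assess the reliability of model outputs through their consistency. 2. White-box: interpret LLM logits with SQL-syntax-aware techniques. 3. Use query execution results as an auxiliary signal.
Result: On cross-domain text-to-SQL benchmarks, consistency-based methods lead among black-box approaches and SQL-syntax-aware methods are advantageous in white-box settings; execution-based grounding improves both.
Insight: 1. Black-box methods need no access to model internals and suit restricted environments. 2. White-box methods can interpret model outputs more precisely through syntax awareness. 3. Query execution feedback adds a further verification dimension to confidence estimation.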
Abstract: Confidence estimation for text-to-SQL aims to assess the reliability of model-generated SQL queries without having access to gold answers. We study this problem in the context of large language models (LLMs), where access to model weights and gradients is often constrained. We explore both black-box and white-box confidence estimation strategies, evaluating their effectiveness on cross-domain text-to-SQL benchmarks. Our evaluation highlights the superior performance of consistency-based methods among black-box models and the advantage of SQL-syntax-aware approaches for interpreting LLM logits in white-box settings. Furthermore, we show that execution-based grounding of queries provides a valuable supplementary signal, improving the effectiveness of both approaches.
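The consistency-based, execution-grounded idea in this abstract can be illustrated with a small sketch. This is not the authors' method: it assumes that several candidate queries have already been sampled from a model, and it scores confidence as the share of candidates whose execution result matches the majority result. The table, the sample queries, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter


def execution_signature(conn, sql):
    """Run a query and return a hashable, order-insensitive summary of its rows."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # queries that fail to execute get no signature
    return tuple(sorted(map(repr, rows)))


def consistency_confidence(conn, candidates):
    """Return (best_sql, confidence): confidence is the fraction of all
    sampled candidates whose execution result agrees with the majority."""
    signatures = [execution_signature(conn, sql) for sql in candidates]
    valid = [s for s in signatures if s is not None]
    if not valid:
        return None, 0.0
    majority, votes = Counter(valid).most_common(1)[0]
    best_sql = candidates[signatures.index(majority)]
    return best_sql, votes / len(candidates)


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE singer(name TEXT, age INTEGER);"
    "INSERT INTO singer VALUES ('Ana', 31), ('Bo', 45), ('Cy', 28);"
)
# Hypothetical samples for the question "How many singers are over 30?"
samples = [
    "SELECT COUNT(*) FROM singer WHERE age > 30",
    "SELECT COUNT(name) FROM singer WHERE age > 30",
    "SELECT COUNT(*) FROM singer WHERE age > 40",   # different answer
    "SELECT COUNT(*) FROM singers WHERE age > 30",  # wrong table name, fails
]
best, confidence = consistency_confidence(conn, samples)
print(best, confidence)
```

Comparing execution results rather than SQL strings lets syntactically different but equivalent queries count as agreeing, which is one plausible reason execution helps as a supplementary signal.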
[5] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Badrinath Ramakrishnan,Akshaya Balaji
Main category: cs.CL
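TL;DR: The paper studies the privacy risk of data memorization when fine-tuning large language models (LLMs), proposes a multi-layered privacy protection framework, and shows that four methods effectively reduce data leakage.
Details
Motivation: LLMs readily memorize training data during fine-tuning, which raises the risk of privacy leakage. The paper aims to quantify this risk and propose mitigations.
Contribution: 1. An empirical analysis of data memorization risk in fine-tuned LLMs; 2. A novel multi-layered privacy protection framework; 3. Evidence that four methods reduce data leakage while preserving model utility.
Method: Controlled experiments measure memorization risk, and four protection methods are proposed and evaluated: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering.
Result: Fine-tuning with repeated sensitive data raises leakage rates from 0-5% to 60-75%. The proposed techniques can reduce leakage to 0% while retaining 94.7% of original model utility.
Insight: Repeated sensitive data in fine-tuning sets is a major source of leakage, and layered privacy protections can balance privacy against utility.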
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
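As a rough illustration of two of the four protections named above, the sketch below applies pattern-based filtering followed by an entropy-based filter to generated text. The abstract does not specify how either filter is implemented, so everything here is an assumption: the regular expressions, the use of character-level Shannon entropy, the threshold values, and the function names are illustrative only.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; a real deployment would need a broader set.
PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail address
]


def shannon_entropy(token):
    """Character-level Shannon entropy in bits per character."""
    counts = Counter(token)
    total = len(token)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def redact(text, entropy_threshold=3.5, min_length=12):
    """Mask pattern matches, then mask long high-entropy tokens (likely keys)."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    kept = []
    for token in text.split():
        if len(token) >= min_length and shannon_entropy(token) > entropy_threshold:
            kept.append("[REDACTED]")
        else:
            kept.append(token)
    return " ".join(kept)


cleaned = redact("Contact jane@example.com, SSN 123-45-6789, key sk9Xq2LmZ7pRt4Vw")
print(cleaned)
```

The two filters are complementary in this sketch: patterns catch secrets with a known shape, while the entropy check catches random-looking strings that no pattern anticipates.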
[6] Punctuation and Predicates in Language Models
Sonakshi Chauhan,Maheep Chaudhary,Koby Choy,Samuel Nellessen,Nandi Schoots
Main category: cs.CL
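TL;DR: The paper investigates the role of punctuation and other linguistic components in large language models (LLMs) and how their information propagates. Models differ in how much they depend on punctuation, and conditional statements and universal quantification are processed differently.
Details
Motivation: To study where information is collected and how it propagates in LLMs, in particular how punctuation and other components (such as subjects and adjectives) are processed across layers, and whether different reasoning rules (such as conditionals) are handled differently.
Contribution: Shows how the necessity and role of punctuation differ across LLMs (GPT-2, DeepSeek, Gemma) and analyzes how conditional statements and universal quantification are processed, offering new insight for interpretability.
Method: Intervention-based techniques (interchange intervention and layer-swapping experiments) assess the necessity and sufficiency of punctuation tokens and probe how linguistic components and reasoning rules are processed across the network.
Result: In GPT-2, punctuation is both necessary and sufficient in multiple layers; this holds far less in DeepSeek and not at all in Gemma. Conditional statements and universal quantification are processed very differently.
Insight: How LLMs handle punctuation and reasoning rules is model-specific, and information flow may change across layers instead of being fixed early, which has implications for model design and interpretability.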
Abstract: In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. Extending beyond punctuation, we investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then), and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
[7] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
Xian Gao,Jiacheng Ruan,Zongyun Zhang,Jingsheng Gao,Ting Liu,Yuzhuo Fu
Main category: cs.CL
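TL;DR: The paper proposes MMReview, a benchmark for evaluating LLM-based multimodal peer review automation, covering 17 research domains and multiple content modalities.
Details
Motivation: With academic publications growing rapidly, peer review is burdensome and time-consuming, yet existing LLM review tasks lack a unified multimodal evaluation benchmark.
Contribution: Proposes the multidisciplinary, multimodal MMReview benchmark, comprising expert-written review comments for 240 papers and 13 tasks, to evaluate LLMs and MLLMs.
Method: Designs 13 tasks in four core categories (step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation) and runs experiments on 16 open-source and 5 closed-source models.
Result: According to the authors, the experiments demonstrate the thoroughness of the benchmark, which lays groundwork for developing automated review systems.
Insight: MMReview addresses the lack of multimodal review evaluation and may help move automated peer review toward a standardized foundation.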
Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models’ ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
[8] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Cliff O’Reilly,Ernesto Jimenez-Ruiz,Tillman Weyde
Main category: cs.CL
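TL;DR: The paper proposes isolating concept semantics by averaging Sparse Autoencoder activations across languages, and finds that the averaged representation aligns with the true relationships between concepts.
Details
Motivation: Connecting large language models with formal knowledge representation is promising, but in embeddings and sparse autoencoders the semantics are entangled with syntactic and language-specific information.
Contribution: A new method that extracts concept semantics by averaging Sparse Autoencoder concept activations across languages, with results indicating strong alignment with true class relationships.
Method: Concept activations obtained from Sparse Autoencoders for texts in several languages are averaged and then correlated with a ground-truth mapping between ontology classes.
Result: The conceptual average aligns more closely with the true relationships between classes than any single language does by itself.
Insight: Combining multilingual views can disentangle concept semantics more accurately and suggests a new technique for mechanistic interpretation of internal network states.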
Abstract: Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns to the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
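The averaging step described in this abstract can be sketched with toy numbers. The vectors below are made up and far smaller than real Sparse Autoencoder outputs; the split into "shared concept" and "language-specific" features is an assumption for illustration, and cosine similarity stands in for whatever correlation measure the authors actually use.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    return [sum(values) / len(values) for values in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy activations per ontology class and language (hypothetical values).
# Features 0-1 mimic shared concept features, 2-4 language-specific ones.
activations = {
    "Dog": {
        "en": [0.9, 0.1, 0.8, 0.0, 0.0],
        "fr": [0.8, 0.2, 0.0, 0.7, 0.0],
        "zh": [0.9, 0.1, 0.0, 0.0, 0.9],
    },
    "Animal": {
        "en": [0.8, 0.3, 0.7, 0.0, 0.0],
        "fr": [0.9, 0.2, 0.0, 0.8, 0.0],
        "zh": [0.7, 0.3, 0.0, 0.0, 0.8],
    },
}

# The "conceptual average" of a class is the mean over its language versions.
conceptual = {cls: mean_vector(list(langs.values())) for cls, langs in activations.items()}

averaged_similarity = cosine(conceptual["Dog"], conceptual["Animal"])
cross_language_similarity = cosine(activations["Dog"]["en"], activations["Animal"]["zh"])
print(averaged_similarity, cross_language_similarity)
```

In this toy setup the averaged vectors of related classes are more similar than vectors taken from two different languages, because averaging gives every class the same mix of language-specific features.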
[9] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Adrian-Marius Dumitran,Alexandra-Mihaela Danila,Angela-Liliana Dumitran
Main category: cs.CL
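TL;DR: GRILE is the first open benchmark for grammar reasoning and explanation in Romanian, with 1,151 multiple-choice questions for evaluating LLMs in a low-resource language. Gemini 2.5 Pro reaches 83% accuracy, but most open-weight models perform poorly and many of their explanations are flawed.
Details
Motivation: To examine how well large language models reason about and explain grammar in a low-resource language (Romanian), where their pedagogical value remains unclear.
Contribution: 1) Provides GRILE, the first open Romanian grammar reasoning benchmark; 2) Evaluates seven multilingual and Romanian-specific LLMs; 3) Identifies systematic weaknesses in morphology and orthographic norms.
Method: Collects 1,151 multiple-choice questions from Romanian high-stakes exams, evaluates models on answer selection and explanation generation, and analyzes errors through expert review.
Result: Gemini 2.5 Pro reaches 83% accuracy; most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws.
Insight: 1) LLM performance in low-resource languages still has substantial room to improve; 2) Morphology and orthographic norms are common weak points; 3) GRILE offers a new test-bed for controllable explanation generation.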
Abstract: LLMs (Large Language Models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.
[10] Tokens with Meaning: A Hybrid Tokenization Approach for NLP
M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Sercan Karakaş,Banu Diri,Savaş Yıldırım,Demircan Çelik
Main category: cs.CL
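TL;DR: The paper proposes a hybrid tokenization approach that combines rule-based morphological analysis with statistical subword segmentation and yields more linguistically meaningful tokens for morphologically rich languages such as Turkish.
Details
Motivation: Conventional subword tokenizers (such as BPE and WordPiece) struggle with morphologically rich and agglutinative languages because they rely on frequency instead of linguistic structure.
Contribution: 1. A hybrid tokenization framework combining rule-based and statistical methods; 2. A new algorithm that balances morpheme preservation with vocabulary efficiency; 3. The highest Turkish Token Percentage and Pure Token Percentage on the TR-MMLU benchmark.
Method: Combines phonological normalization, root-affix dictionaries, and a new algorithm, and integrates BPE to cover out-of-vocabulary words without harming morphological coherence.
Result: On the TR-MMLU benchmark, the tokenizer reaches a Turkish Token Percentage of 90.29% and a Pure Token Percentage of 85.8%, and produces more linguistically meaningful tokens than the LLaMA, Gemma, and GPT tokenizers.
Insight: The approach is language-independent and can be adapted to other languages, offering a path toward more interpretable and effective multilingual NLP.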
Abstract: Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
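The shared-identifier and case-marker ideas in this abstract can be sketched as follows. This is a toy under stated assumptions, not the authors' tokenizer: the tiny dictionaries, the token names (`<UPPER>`, `<PL>`, `<ACC>`, `<LOC>`), and the greedy longest-match strategy are all hypothetical, and a plain character split stands in for the BPE fallback.

```python
UPPER = "<UPPER>"

# Altered root forms map to one shared root identifier.
ROOTS = {"kitap": "kitap", "kitab": "kitap", "ev": "ev"}

# Phonologically variant affixes map to one shared affix identifier.
AFFIXES = {
    "ler": "<PL>", "lar": "<PL>",
    "i": "<ACC>", "ı": "<ACC>",
    "de": "<LOC>", "da": "<LOC>",
}


def segment(word):
    """Split a word into a case marker, a root ID and affix IDs.
    Falls back to characters (a stand-in for BPE) when parsing fails."""
    tokens = []
    if word[:1].isupper():
        tokens.append(UPPER)  # one marker instead of a second vocabulary entry
        word = word.lower()
    # Longest dictionary root that prefixes the word.
    root = max((r for r in ROOTS if word.startswith(r)), key=len, default=None)
    if root is None:
        return tokens + list(word)
    pieces = [ROOTS[root]]
    rest = word[len(root):]
    while rest:
        affix = max((a for a in AFFIXES if rest.startswith(a)), key=len, default=None)
        if affix is None:
            return tokens + list(word)
        pieces.append(AFFIXES[affix])
        rest = rest[len(affix):]
    return tokens + pieces


for example in ["Kitaplar", "kitabı", "evlerde", "xyz"]:
    print(example, segment(example))
```

Because "-ler" and "-lar" both become `<PL>`, and "kitap" and "kitab" both become the same root, the vocabulary holds one entry per morpheme instead of one per surface form.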
[11] A Joint Multitask Model for Morpho-Syntactic Parsing
Demian Inostroza,Mel Mistica,Ekaterina Vylomova,Chris Guest,Kemal Kurniawan
Main category: cs.CL
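TL;DR: The paper presents a joint multitask model that predicts morphological and syntactic analyses together. It achieves the best overall performance in the UniDive 2025 shared task, with an average MSLAS of 78.7%.
Details
Motivation: To predict morphological and syntactic analyses jointly across typologically diverse languages, the authors propose a single joint model instead of separate single-task models.
Contribution: 1. A joint multitask model built on XLM-RoBERTa; 2. The best overall performance on the shared task leaderboard covering nine languages; 3. Ablation studies showing that matching the gold tokenization and identifying content words are crucial.
Method: A shared XLM-RoBERTa encoder feeds three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction.
Result: On the shared task, the model reaches an average MSLAS of 78.7%, LAS of 80.1%, and Feats F1 of 90.3%.
Insight: The model struggles with core grammatical cases (particularly Nom-Acc) and nominal features, which points to directions for future improvement.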
Abstract: We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following a novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task’s leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task’s gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.
[12] ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students’ Cognitive Abilities
Wenhan Dong,Zhen Sun,Yuemeng Zhao,Zifan Peng,Jun Wu,Jingyi Zheng,Yule Liu,Xinlei He,Yu Wang,Ruiming Wang,Xinyi Huang,Lei Mo
Main category: cs.CL
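TL;DR: The paper introduces ZPD-SCA, a benchmark for evaluating how well large language models (LLMs) match the difficulty of reading materials to students' cognitive abilities. Models perform poorly zero-shot but improve with in-context examples.
Details
Motivation: The study addresses a gap in evaluating whether LLMs can align reading materials with students' cognitive abilities in Chinese language education, grounded in the educational principle of the Zone of Proximal Development (ZPD).
Contribution: Introduces the ZPD-SCA benchmark, annotated by 60 Special Grade teachers, for evaluating LLMs on matching reading material difficulty to students' cognitive abilities.
Method: Experiments cover zero-shot and in-context learning settings and compare how different LLMs assess reading difficulty.
Result: In the zero-shot setting LLMs perform poorly, with Qwen-max and GLM falling below random guessing. With in-context examples performance improves substantially, but systematic directional biases and significant variation across genres remain.
Insight: LLMs show emerging abilities to assess reading difficulty, but their current training has limitations for educationally aligned judgment, and accuracy on such alignment tasks needs further improvement.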
Abstract: Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students’ developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students’ Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs’ ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLM performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
[13] Credence Calibration Game? Calibrating Large Language Models through Structured Play
Ke Fang,Tianyi Zhao,Lu Cheng
Main category: cs.CL
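TL;DR: The paper proposes a prompt-based calibration framework that uses a structured interaction loop and feedback-driven prompting to dynamically improve the confidence calibration of large language models (LLMs).
Details
Motivation: Existing calibration methods mostly rely on post-hoc adjustment or auxiliary model training, and many of them need additional supervision or parameter updates.
Contribution: A prompting framework inspired by the Credence Calibration Game that improves LLM calibration dynamically through natural-language feedback and summaries of prior performance.
Method: A structured interaction loop gives the model feedback on how well its stated confidence matches correctness, using feedback-driven prompts and performance summaries.
Result: Calibration metrics improve consistently across models and game configurations.
Insight: Game-based prompting is an effective calibration strategy for LLMs that needs no parameter updates.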
Abstract: As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
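The interaction loop described above can be sketched as a scoring function plus a natural-language summary of past rounds. The abstract does not give the paper's scoring rule or prompt wording, so this sketch assumes a symmetric logarithmic rule of the kind commonly associated with credence games; the function names, the confidence range, and the feedback text are hypothetical.

```python
import math


def credence_score(confidence, correct):
    """Logarithmic score: 0 at 50% confidence, positive when confidently
    right, steeply negative when confidently wrong."""
    if not 0.5 <= confidence < 1.0:
        raise ValueError("confidence must be in [0.5, 1.0)")
    p = confidence if correct else 1.0 - confidence
    return 100.0 * math.log2(2.0 * p)


def feedback_summary(history):
    """Turn past (confidence, correct) rounds into a short note that could
    be prepended to the next prompt."""
    total = sum(credence_score(c, ok) for c, ok in history)
    accuracy = sum(ok for _, ok in history) / len(history)
    mean_conf = sum(c for c, _ in history) / len(history)
    trend = "overconfident" if mean_conf > accuracy else "underconfident or well calibrated"
    return (
        f"Rounds: {len(history)}, score: {total:.1f}, accuracy: {accuracy:.0%}, "
        f"mean confidence: {mean_conf:.0%}. You appear {trend}."
    )


# Hypothetical history of (stated confidence, answer was correct).
history = [(0.9, True), (0.9, False), (0.6, True)]
print(feedback_summary(history))
```

The asymmetry of the logarithmic rule is the point of the game: a confident wrong answer costs more than a confident right answer earns, so stating inflated confidence lowers the expected score.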
[14] DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement
Yupei Yang,Fan Feng,Lin Yang,Wanxi Deng,Lin Qu,Biwei Huang,Shikui Tu,Lei Xu
Main category: cs.CL
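TL;DR: DEPTH is a framework that combines dependency-aware sentence simplification with two-tiered hierarchical refinement to reduce hallucinations in relation extraction, with substantial gains in performance.
Details
Motivation: LLM-based relation extraction methods tend to produce spurious predictions (hallucinations) on sentences with complex structure or intricate semantics, which harms the accuracy of knowledge graphs. DEPTH targets this problem.
Contribution: Proposes a framework with dependency-aware sentence simplification and two-tiered hierarchical refinement, introduces a causality-driven reward model that reduces spurious correlations, and validates the approach experimentally.
Method: 1. The Grounding module extracts relations for each entity pair using their shortest dependency path, simplifying the sentence to reduce syntactic noise; 2. The Refinement module revises the local predictions based on a holistic understanding of the sentence; 3. A causality-driven reward model supports fine-tuning via reinforcement learning with human feedback.
Result: Across six benchmarks, DEPTH reduces the average hallucination rate to 7.0% and improves average F1 by 17.2% over state-of-the-art baselines.
Insight: Dependency paths and hierarchical refinement effectively reduce hallucinations, and a causal reward model supports robust reinforcement learning fine-tuning.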
Abstract: Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
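The shortest-dependency-path step used by the Grounding module can be sketched as a breadth-first search over dependency arcs treated as undirected edges. The arcs below are hand-written for a made-up sentence instead of coming from a parser, and the function name is hypothetical; how DEPTH turns the path into a simplified sentence is not shown.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected view of (head, dependent) arcs.
    Returns the list of tokens on the path, or None if none exists."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hand-written arcs for: "Marie Curie, who was born in Warsaw, discovered polonium."
arcs = [
    ("discovered", "Curie"), ("Curie", "Marie"), ("Curie", "born"),
    ("born", "who"), ("born", "was"), ("born", "Warsaw"),
    ("Warsaw", "in"), ("discovered", "polonium"),
]
print(shortest_dependency_path(arcs, "Curie", "polonium"))
print(shortest_dependency_path(arcs, "Warsaw", "polonium"))
```

For the pair (Curie, polonium) the path skips the whole relative clause, which is the kind of syntactic noise reduction the abstract describes.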
[15] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs
Yinghan Zhou,Weifeng Zhu,Juan Wen,Wanli Peng,Zhengxian Wu,Yiming Xue
Main category: cs.CL
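TL;DR: The paper studies why large language models (LLMs) struggle to recognize their own generated text under the Individual Presentation Paradigm (IPP) and proposes Cognitive Surgery (CoSur), which awakens Implicit Territorial Awareness (ITA) and improves IPP performance.
Details
Motivation: LLMs reliably identify their own text under the Pair Presentation Paradigm (PPP), but performance drops sharply under IPP. The paper investigates the cause and proposes a remedy.
Contribution: 1) Confirms that LLMs struggle with self-recognition under IPP; 2) Introduces the concept of Implicit Territorial Awareness (ITA); 3) Develops the Cognitive Surgery (CoSur) framework, which improves performance under IPP.
Method: CoSur comprises four modules: representation extraction, territory construction, authorship discrimination, and cognitive editing. By awakening the model's ITA, it improves recognition accuracy in the IPP scenario.
Result: CoSur improves IPP performance on three different LLMs, with average accuracies of 83.25%, 66.19%, and 88.01%, respectively.
Insight: LLMs can implicitly distinguish their own text from text by others in representational space, but this ability is not expressed in their output behavior and has to be "awakened" by a dedicated method.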
Abstract: Large language models (LLMs) have been shown to possess a degree of self-recognition capability, that is, the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA): the model’s latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.
[16] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models
Wuyang Zhang,Yexin Tian,Xiandong Meng,Mengjie Wang,Junliang Du
Main category: cs.CL
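TL;DR: The paper proposes a fine-tuning framework based on knowledge graph injection to address missing reasoning chains and weak entity-level semantic understanding in large language models on tasks that require structured knowledge. A graph neural network builds graph-based semantic representations that are modeled jointly with language model representations, improving semantic reasoning and entity prediction.
Details
Motivation: Large language models often underperform on tasks that need structured knowledge because reasoning chains are missing and entity-level semantics are poorly understood. The paper studies how a knowledge graph can strengthen the model's reasoning and semantic representation.
Contribution: A knowledge-graph-infused fine-tuning framework that encodes entity relations with a graph neural network, uses a fusion mechanism with gating to dynamically balance linguistic semantics and structural knowledge, and optimizes a joint loss covering task performance and structural alignment.
Method: 1) A graph neural network encodes entities and their relations; 2) A fusion mechanism jointly models knowledge graph embeddings and language model representations; 3) A gating mechanism balances the contributions of linguistic semantics and structural knowledge; 4) A joint loss function accounts for task performance and structural alignment.
Result: According to the abstract, the method significantly improves the model's ability to represent complex semantic units on entity recognition, question answering, and language generation, with better semantic consistency and contextual logic modeling.
Insight: Dynamically balancing linguistic semantics and structural knowledge mitigates conflicts between representational spaces and improves reasoning and semantic understanding. The approach may extend to other tasks that need structured knowledge.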
Abstract: This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model’s ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
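The gating mechanism and joint loss described in this abstract are not given as formulas, so the sketch below shows one common form under stated assumptions: a per-dimension sigmoid gate that mixes the language-model vector with the graph vector, and a weighted sum of a task loss and an alignment loss. The parameter names, the per-dimension (diagonal) weights, and the default loss weight are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_text, h_graph, w_text, w_graph, bias):
    """Per-dimension gate g = sigmoid(w_text*h_text + w_graph*h_graph + bias);
    the fused value is g*h_text + (1 - g)*h_graph."""
    fused = []
    for ht, hg, wt, wg, b in zip(h_text, h_graph, w_text, w_graph, bias):
        gate = sigmoid(wt * ht + wg * hg + b)
        fused.append(gate * ht + (1.0 - gate) * hg)
    return fused


def joint_loss(task_loss, alignment_loss, weight=0.1):
    """Task objective plus a weighted structural-alignment objective."""
    return task_loss + weight * alignment_loss


# With zero gate parameters the gate is 0.5, so the output is a plain average.
balanced = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
# A large positive bias pushes the gate toward the language-model vector.
text_heavy = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
print(balanced, text_heavy, joint_loss(2.0, 1.0))
```

Because the gate depends on both inputs, the mix can differ for every token and dimension, which is one way to read "dynamically balance" in the abstract.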
[17] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
NVIDIA,Aarti Basant,Abhijit Khairnar,Abhijit Paithankar,Abhinav Khattar,Adi Renduchintala,Adithya Renduchintala,Aditya Malte,Akhiad Bercovich,Akshay Hazare,Alejandra Rico,Aleksander Ficek,Alex Kondratenko,Alex Shaposhnikov,Ali Taghibakhshi,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amy Shen,Andrew Tao,Ann Guan,Anna Shors,Anubhav Mandarwal,Arham Mehta,Arun Venkatesan,Ashton Sharabiani,Ashwath Aithal,Ashwin Poojary,Ayush Dattagupta,Balaram Buddharaju,Banghua Zhu,Barnaby Simkin,Bilal Kartal,Bita Darvish Rouhani,Bobby Chen,Boris Ginsburg,Brandon Norick,Brian Yu,Bryan Catanzaro,Charles Wang,Charlie Truong,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christian Munley,Christopher Parisien,Dan Su,Daniel Afrimi,Daniel Korzekwa,Daniel Rohrer,Daria Gitman,David Mosallanezhad,Deepak Narayanan,Dima Rekesh,Dina Yared,Dmytro Pykhtar,Dong Ahn,Duncan Riach,Eileen Long,Elliott Ning,Eric Chung,Erick Galinkin,Evelina Bakhturina,Gargi Prasad,Gerald Shen,Haim Elisha,Harsh Sharma,Hayley Ross,Helen Ngo,Herman Sahota,Hexin Wang,Hoo Chang Shin,Hua Huang,Iain Cunningham,Igor Gitman,Ivan Moshkov,Jaehun Jung,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jimmy Zhang,Jinze Xue,Jocelyn Huang,Joey Conway,John Kamalu,Jonathan Cohen,Joseph Jennings,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kari Briski,Katherine Cheung,Katherine Luna,Keith Wyss,Keshav Santhanam,Kezhi Kong,Krzysztof Pawelec,Kumar Anik,Kunlun Li,Kushan Ahmadian,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Luis Vega,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Marcin Chochowski,Mark Cai,Markus Kliegl,Marta Stepniewska-Dziubinska,Matvei Novikov,Mehrzad Samadi,Meredith Price,Meriem Boubdir,Michael Boone,Michael Evans,Michal Bien,Michal Zawalski,Miguel Martinez,Mike Chrzanowski,Mohammad Shoeybi,Mostofa Patwary,Namit Dhameja,Nave Assaf,Negar Habibi,Nidhi Bhatia,Nikki Pope,Nima Tajbakhsh,Nirmal Kumar Juluru,Oleg Rybakov,Oleksii Hrinchuk,Oleksii Kuchaiev,Oluwatobi Olabiyi,Pablo Ribalta,Padmavathy Subramanian,Parth Chadha,Pavlo Molchanov,Peter Dykas,Peter Jin,Piotr Bialecki,Piotr Januszewski,Pradeep Thalasta,Prashant Gaikwad,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Rabeeh Karimi Mahabadi,Rajen Patel,Ran El-Yaniv,Ranjit Rajan,Ria Cheruvu,Rima Shahbazyan,Ritika Borkar,Ritu Gala,Roger Waleffe,Ruoxi Zhang,Russell J. Hewett,Ryan Prenger,Sahil Jain,Samuel Kriman,Sanjeev Satheesh,Saori Kaji,Sarah Yurick,Saurav Muralidharan,Sean Narenthiran,Seonmyeong Bak,Sepehr Sameni,Seungju Han,Shanmugam Ramasamy,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shizhe Diao,Shreya Gopal,Shrimai Prabhumoye,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Siddhartha Jain,Somshubra Majumdar,Stefania Alborghetti,Syeda Nahida Akter,Terry Kong,Tim Moon,Tomasz Hliwiak,Tomer Asida,Tony Wang,Twinkle Vashishth,Tyler Poon,Udi Karpas,Vahid Noroozi,Venkat Srinivasan,Vijay Korthikanti,Vikram Fugro,Vineeth Kalluru,Vitaly Kurin,Vitaly Lavrukhin,Wasi Uddin Ahmad,Wei Du,Wonmin Byeon,Ximing Lu,Xin Dong,Yashaswi Karnati,Yejin Choi,Yian Zhang,Ying Lin,Yonggan Fu,Yoshi Suhara,Zhen Dong,Zhiyu Li,Zhongbo Zhu,Zijia Chen
Main category: cs.CL
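TL;DR: The paper introduces Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while reaching state-of-the-art accuracy among similarly sized models. Replacing most of the Transformer's self-attention layers with Mamba-2 layers speeds up inference when generating long reasoning traces.
Details
Motivation: Transformer models are slow at inference when a reasoning task requires generating long sequences, while pure Mamba models struggle to match Transformer accuracy. This work combines the two to improve inference speed without giving up accuracy.
Contribution: 1. A hybrid Mamba-Transformer model (Nemotron-Nano-9B-v2) with substantially higher inference throughput; 2. Efficient inference obtained through FP8 training and Minitron compression; 3. Public release of the model checkpoints and most of the training datasets.
Method: 1. Replace most self-attention layers of the Transformer with Mamba-2 layers; 2. Pre-train a 12-billion-parameter base model on 20 trillion tokens with an FP8 recipe; 3. Compress and distill the aligned model with the Minitron strategy; 4. Target inference on up to 128k tokens on a single NVIDIA A10G GPU.
Result: On reasoning benchmarks, Nemotron-Nano-9B-v2 is on par with or better than similarly sized models (e.g., Qwen3-8B) in accuracy, with up to 6x higher inference throughput.
Insight: The hybrid Mamba-Transformer architecture performs well on reasoning workloads, which suggests that combining state-space models with self-attention is a promising direction.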
Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
[18] Reasoning is about giving reasons
Krunal Shah,Dan Roth
Main category: cs.CL
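TL;DR: The paper proposes an intermediate representation called RLS (Representation of the Logical Structure) that captures the logical structure of a natural language argument and thereby supports several forms of deterministic reasoning.
Details
Motivation: Current rule-chaining approaches are limited in interpretability and extensibility: a model trained to chain rules cannot support abduction or identify contradictions.
Contribution: Proposes RLS as an intermediate representation that captures the logical atoms and rules of a natural language argument, making reasoning deterministic and supporting multiple reasoning tasks.
Method: Identify and extract the logical structure of a natural language argument (its logical atoms and the rules incorporating them) to build the RLS, then reason over that structure.
Result: On three popular reasoning datasets, the logical structure is extracted with high accuracy, which supports explanation generation and significantly extends reasoning capabilities.
Insight: An explicit representation of logical structure is key to making reasoning models more interpretable and flexible.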
Abstract: Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying “reasons” which make up the proof or disproof of the premise - as a function of the “logical atoms” in the argument. While it has been shown that transformers can “chain” rules to derive simple arguments, the challenge of articulating the “reasons” remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.
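The abstract's claim that reasoning becomes deterministic once the logical structure is known can be illustrated with a minimal forward-chaining sketch. The encoding below (atoms as strings, rules as premise-set/conclusion pairs) is an assumption made for illustration and is not the RLS format defined in the paper; the example atoms are hypothetical.

```python
def forward_chain(facts, rules):
    """Derive every atom reachable from `facts` using Horn-style rules.

    rules: list of (premises, conclusion) pairs, where premises is a set
    of atoms that must all be known before the conclusion is added.
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until no rule adds a new atom
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical atoms and rules, for illustration only.
rules = [
    (frozenset({"rains"}), "ground_wet"),
    (frozenset({"ground_wet", "cold"}), "ground_icy"),
]
derived = forward_chain({"rains", "cold"}, rules)
```

Because the loop only applies explicit rules to explicit atoms, the same inputs always yield the same derived set, and chains of any depth are handled by the same procedure.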
[19] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Junying Chen,Zhenyang Cai,Zhiheng Liu,Yunjin Yang,Rongsheng Wang,Qingying Xiao,Xiangyi Feng,Zhan Su,Jing Guo,Xiang Wan,Guangjun Yu,Haizhou Li,Benyou Wang
Main category: cs.CL
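TL;DR: The paper presents ShizhenGPT, the first multimodal large language model tailored for Traditional Chinese Medicine (TCM). It addresses data scarcity and the multimodal nature of TCM diagnostics, and performs strongly across several tasks.
Details
Motivation: TCM diagnostics rely on multimodal sensory information (looking, listening, smelling, and pulse-taking) that conventional LLMs cannot handle, and high-quality TCM data is scarce.
Contribution: 1. Curates the largest TCM dataset to date, covering text and multimodal data; 2. Proposes ShizhenGPT, the first multimodal LLM for TCM; 3. Achieves strong results on TCM qualification exams and visual diagnosis tasks.
Method: Pretraining and instruction tuning on text, images, audio, and physiological signals to acquire deep TCM knowledge and multimodal reasoning.
Result: ShizhenGPT outperforms comparable-scale LLMs, competes with larger proprietary models, and leads existing multimodal LLMs in TCM visual understanding.
Insight: Multimodal LLMs have considerable potential in TCM and can move the field toward holistic perception and diagnosis.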
Abstract: Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
[20] The Digital Sous Chef – A Comparative Study on Fine-Tuning Language Models for Recipe Generation
Shubham Pundhir,Ganesh Bagler
Main category: cs.CL
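TL;DR: The Digital Sous Chef compares a fine-tuned GPT-2 large model against GPT-2 small and traditional LSTM/RNN baselines, and proposes a tokenization strategy tailored to recipe generation that markedly improves generation quality.
Details
Motivation: Recipe generation is a fundamental natural language generation task, but generic tokenizers fail to preserve recipe structure and precise numerical quantities, which limits output quality.
Contribution: A tokenization strategy for recipe generation that adds 23 common fraction tokens and custom structural markers to the vocabulary, improving domain specificity.
Method: Using the 5-cuisine corpus from RecipeDB, compare fine-tuned GPT-2 large (774M), GPT-2 small (124M), and LSTM/RNN baselines, evaluated with seven automatic metrics.
Result: The large model yields a relative BERTScore (F1) improvement of more than 20% over the best recurrent baseline (0.92 vs 0.72) and reduces perplexity by 69.8%.
Insight: A targeted tokenization strategy can substantially improve domain specificity in recipe generation, while factual accuracy remains an open challenge for future work.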
Abstract: We established a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
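The idea of reserving single tokens for common fractions can be sketched as a small pre-tokenization step. This is a minimal illustration under stated assumptions: the token names (`<FRAC_1_2>` and so on) and the three-entry table are hypothetical, and the paper's actual 23 fraction tokens and structural markers are not listed in the abstract.

```python
import re

# Hypothetical subset of fraction tokens; not the paper's actual list.
FRACTION_TOKENS = {
    "1/2": "<FRAC_1_2>",
    "1/4": "<FRAC_1_4>",
    "3/4": "<FRAC_3_4>",
}

# Match a standalone fraction such as "3/4", not part of a longer number.
_FRACTION_RE = re.compile(r"(?<![\d/])(\d+/\d+)(?![\d/])")


def protect_fractions(text):
    """Replace known fractions with their dedicated single tokens."""
    def swap(match):
        fraction = match.group(1)
        return FRACTION_TOKENS.get(fraction, fraction)
    return _FRACTION_RE.sub(swap, text)


def tokenize(text):
    """Toy tokenizer: special tokens, then words, then punctuation."""
    return re.findall(r"<[A-Z0-9_]+>|\w+|[^\w\s]", protect_fractions(text))


tokens = tokenize("Add 1/2 cup sugar and 3/4 tsp salt.")
```

With this step, a quantity such as `1/2` survives as one unit, whereas a fraction outside the table (for example `5/7`) is still split into three pieces by the toy tokenizer.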
[21] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Guangzhan Wang,Hongyu Zhang,Beijun Shen,Xiaodong Gu
Main category: cs.CL
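TL;DR: The paper proposes LMTransplant, a new text data augmentation paradigm that uses large language models (LLMs) with a transplant-then-regenerate strategy to produce more diverse and creative text variants while preserving the core attributes of the original text.
Details
Motivation: Traditional augmentation methods such as back-translation mostly produce variants with the same semantics, while the style and structure of LLM outputs are hard to control precisely. The paper aims to use the "knowledge emergence" capability of LLMs for a more flexible augmentation method.
Contribution: Proposes the LMTransplant paradigm, which embeds the seed text in a context expanded by an LLM and regenerates a variant from that expanded context, yielding content-level diversity.
Method: A transplant-then-regenerate strategy: 1) transplant the seed text into an LLM-expanded context; 2) regenerate a variant based on the expanded context, drawing on the knowledge embedded in the LLM.
Result: Experiments show that LMTransplant outperforms existing text augmentation methods across several text-related tasks and scales well as the amount of augmented data grows.
Insight: Combining the seed text with LLM-generated context produces more creative variants while reducing reliance on meticulous prompt engineering.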
Abstract: Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
[22] Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference
Samir Abdaljalil,Erchin Serpedin,Khalid Qaraqe,Hasan Kurban
Main category: cs.CL
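TL;DR: The paper proposes a framework that uses synthetic natural language inference (NLI) tasks to evaluate the logical consistency and cross-lingual alignment of multilingual LLMs. It finds that code-switching does not degrade performance and can even improve it.
Details
Motivation: The logical consistency and cross-lingual alignment of multilingual LLMs have not been evaluated systematically, and behavior under code-switching in particular is underexplored.
Contribution: 1. An evaluation framework based on synthetic NLI tasks; 2. Evidence that code-switching can improve model performance; 3. Validation of semantic preservation through embedding similarity analyses and alignment visualizations.
Method: Generate synthetic, logic-based premise-hypothesis pairs and translate them into multiple languages, including code-switched conditions, to build a controlled evaluation framework.
Result: Code-switching does not degrade performance and can improve it; translation-induced lexical variation may act as a regularization signal.
Insight: Cross-lingual alignment remains brittle, and code-switching may be an effective lever for improving the robustness of multilingual models.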
Abstract: Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
[23] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
Jiaming Leng,Yunying Bi,Chuan Qin,Bing Yin,Yanyong Zhang,Chao Wang
Main category: cs.CL
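TL;DR: TransLLM is a unified multi-task foundation framework that combines spatiotemporal modeling with large language models (LLMs) through learnable prompt composition, addressing generalization across the diverse tasks of urban transportation systems.
Details
Motivation: Small-scale deep learning models are task-specific and data-hungry, so they generalize poorly, while LLMs struggle with structured spatiotemporal data and numerical reasoning. TransLLM targets both limitations.
Contribution: 1. A lightweight spatiotemporal encoder; 2. An instance-level prompt routing mechanism; 3. Dynamically personalized prompts that guide LLM reasoning.
Method: An encoder built from dilated temporal convolutions and dual-adjacency graph attention networks interfaces with the LLM through structured embeddings; the personalized prompt router is trained with reinforcement learning.
Result: Experiments on seven datasets and three tasks show that TransLLM is effective in both supervised and zero-shot settings and delivers competitive performance against ten baseline models.
Insight: A dynamic prompting mechanism improves adaptability and generalization in multi-task and cross-task scenarios.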
Abstract: Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.
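The dilated temporal convolution named in the abstract can be shown on a single time series. This is a minimal sketch of the general operation, assuming a causal, unpadded variant with a hypothetical two-tap kernel; it is not TransLLM's encoder, whose layer sizes and weights are not given in the abstract.

```python
def dilated_conv1d(series, kernel, dilation):
    """Causal 1-D convolution whose taps are `dilation` steps apart.

    kernel[0] weights the current step, kernel[1] the step `dilation`
    positions earlier, and so on. Outputs start at the first position
    where every tap falls inside the series (no padding).
    """
    span = (len(kernel) - 1) * dilation
    outputs = []
    for t in range(span, len(series)):
        value = sum(w * series[t - i * dilation] for i, w in enumerate(kernel))
        outputs.append(value)
    return outputs


# Hypothetical traffic counts; each output adds the current value
# and the value two steps earlier.
counts = [1, 2, 3, 4, 5, 6]
smoothed = dilated_conv1d(counts, kernel=[1, 1], dilation=2)
```

Increasing `dilation` widens the time window a fixed-size kernel can see, which is the usual reason to stack such layers for long-range temporal dependencies.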
[24] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Skatje Myers,Dmitriy Dligach,Timothy A. Miller,Samantha Barr,Yanjun Gao,Matthew Churpek,Anoop Mayampurath,Majid Afshar
Main category: cs.CL
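TL;DR: The paper compares retrieval-augmented generation (RAG) with long-context input for clinical reasoning over electronic health records (EHRs), and finds that RAG matches or approaches long-context performance while using far fewer input tokens.
Details
Motivation: EHRs are long, noisy, and redundant, which makes them hard for clinicians to navigate. LLMs offer a way to extract and reason over this text, but clinical notes often exceed even extended context windows.
Contribution: Proposes three replicable clinical tasks (extracting imaging procedures, generating timelines of antibiotic use, and identifying key diagnoses) and shows experimentally that RAG offers a favorable balance of efficiency and performance.
Method: Test three state-of-the-art LLMs with varying amounts of provided context, comparing targeted text retrieval (RAG) against using the most recent clinical notes.
Result: RAG closely matches or exceeds the performance of using recent notes and approaches the performance of using the models' full context, while requiring drastically fewer input tokens.
Insight: Even as newer models handle longer inputs, RAG remains an efficient and competitive approach.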
Abstract: Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models’ extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models’ full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
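The two context-selection strategies compared in the abstract, targeted retrieval versus the most recent notes, can be contrasted in a toy sketch. The word-overlap scoring and the example notes are assumptions for illustration; the abstract does not specify the retriever the authors used.

```python
def words(text):
    """Lowercase whitespace tokenization; a stand-in for a real tokenizer."""
    return text.lower().split()


def retrieve(notes, query, k):
    """Return the k notes sharing the most words with the query."""
    query_words = set(words(query))
    ranked = sorted(
        notes,
        key=lambda note: len(query_words & set(words(note))),
        reverse=True,
    )
    return ranked[:k]


def most_recent(notes, k):
    """Return the last k notes, assuming chronological order."""
    return notes[-k:]


# Hypothetical notes, oldest first.
notes = [
    "day 1 chest x-ray ordered for cough",
    "day 2 started vancomycin for suspected infection",
    "day 3 physical therapy consult",
    "day 4 diet advanced patient comfortable",
]
query = "antibiotic vancomycin started"

retrieved = retrieve(notes, query, k=1)
recent = most_recent(notes, k=1)
```

Here the retrieved note is the one that mentions the antibiotic, while the most recent note is unrelated to the question, which is the situation where retrieval can answer a task with fewer input tokens.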
[25] Long Chain-of-Thought Reasoning Across Languages
Josh Barua,Seun Eisape,Kayo Yin,Alane Suhr
Main category: cs.CL
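TL;DR: The paper studies long chain-of-thought (CoT) reasoning across languages. Using translated datasets and multilingually pretrained models, it shows that the value of English as a pivot language varies by language and that the trade-off between data quality and scale is language dependent.
Details
Motivation: Long CoT reasoning in LLMs is almost exclusively English-centric, and reasoning in other languages is understudied. The paper addresses this gap with a systematic experimental analysis.
Contribution: 1) Builds translated multilingual reasoning datasets; 2) Systematically studies long CoT generation in French, Japanese, Latvian, and Swahili; 3) Characterizes how data quality and scale affect multilingual reasoning.
Method: Translate two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and analyze long CoT performance across languages.
Result: 1) The benefit of English as a pivot language varies by language; 2) Extensive multilingual pretraining narrows but does not eliminate the cross-lingual performance gap; 3) The trade-off between data quality and scale depends on the language.
Insight: Multilingual reasoning ability depends on language-specific data as well as model scale: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora work better for Swahili and Latvian.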
Abstract: Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.
[26] MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Ailing Yu,Lan Yao,Jingnan Liu,Zhe Chen,Jiajun Yin,Yuan Wang,Xinhao Liao,Zhiling Ye,Ji Li,Yun Yue,Hansong Xiao,Hualei Zhou,Chunxiao Guo,Peng Wei,Jinjie Gu
Main category: cs.CL
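TL;DR: The paper presents MedResearcher-R1, a medical deep research agent built on a knowledge-informed trajectory synthesis framework. It addresses the limitations of general-purpose LLMs in medicine by combining medical knowledge graphs with a specialized retrieval tool, substantially improving medical information synthesis.
Details
Motivation: General-purpose LLM agents underperform in the medical domain, mainly because they lack dense medical knowledge and specialized retrieval tools. MedResearcher-R1 addresses both with domain-specific innovations.
Contribution: 1. A data synthesis framework based on medical knowledge graphs that generates complex multi-hop question-answer pairs; 2. Integration of a specialized medical retrieval tool alongside general-purpose tools; 3. A two-stage training paradigm for accurate medical research.
Method: 1. Extract the longest chains from subgraphs of a medical knowledge graph to generate multi-hop QA pairs; 2. Integrate a custom-built medical retrieval engine; 3. Train with supervised fine-tuning followed by online reinforcement learning.
Result: MedResearcher-R1 sets new state-of-the-art results on medical benchmarks while remaining competitive on general deep research tasks, showing that a smaller model can outperform much larger ones.
Insight: Domain-specific architecture, tool design, and training data construction are what allow smaller models to outperform larger ones in specialized domains.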
Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
[27] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Haokun Lin,Haobo Xu,Yichen Wu,Ziyu Guo,Renrui Zhang,Zhichao Lu,Ying Wei,Qingfu Zhang,Zhenan Sun
Main category: cs.CL
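TL;DR: This paper presents the first systematic study of post-training quantization (PTQ) for diffusion large language models (dLLMs), identifies the impact of activation outliers, and evaluates quantization under different configurations.
Details
Motivation: dLLMs show promise for natural language generation, but their parameter scale and resource demands hinder deployment on edge devices. PTQ is widely used to compress autoregressive LLMs, yet its applicability to dLLMs is largely unexplored.
Contribution: 1. The first systematic study of quantizing dLLMs; 2. Identification of activation outliers as a key challenge for low-bit quantization; 3. Evaluation of PTQ methods across multiple task types and model variants.
Method: Apply state-of-the-art PTQ methods and evaluate along four dimensions: bit-width, quantization method, task category, and model type.
Result: Activation outliers dominate the dynamic range, making it difficult for low-bit quantization to preserve precision for the majority of values. Quantization behavior depends strongly on task type and model configuration.
Insight: Quantizing dLLMs requires choices tuned to the task and model type; future work should focus on handling activation outliers to improve low-bit quantization.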
Abstract: Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
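Why a few large activations hurt low-bit quantization can be seen in a small numeric sketch. It assumes symmetric per-tensor uniform quantization at 4 bits and uses made-up values; it does not reproduce any of the PTQ methods evaluated in the paper.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor uniform quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                    # 7 levels per side at 4 bits
    scale = max(abs(v) for v in values) / qmax    # step size set by the largest value
    return [round(v / scale) * scale for v in values]


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Hypothetical activations in [-1, 1], then the same set plus one outlier.
typical = [0.1 * i for i in range(-10, 11)]
with_outlier = typical + [100.0]

err_clean = mean_abs_error(typical, quantize_dequantize(typical, bits=4))

dequantized = quantize_dequantize(with_outlier, bits=4)
err_outlier = mean_abs_error(typical, dequantized[:-1])
```

With the outlier present, the step size grows from about 0.14 to about 14, so every typical value rounds to zero and its information is lost. This is the sense in which outliers dominate the dynamic range.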
cs.CV [Back]
[28] LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Lianghui Zhu,Bin Ouyang,Yuxuan Zhang,Tianheng Cheng,Rui Hu,Haocheng Shen,Longjin Ran,Xiaoxin Chen,Li Yu,Wenyu Liu,Xinggang Wang
Main category: cs.CV
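TL;DR: LENS is a reinforcement learning framework that jointly optimizes the reasoning process and segmentation. Rewards spanning sentence-, box-, and segment-level cues encourage informative chain-of-thought (CoT) rationales and improve mask quality.
Details
Motivation: Existing supervised fine-tuning methods ignore explicit CoT reasoning at test time, which limits generalization to unseen prompts and domains.
Contribution: Proposes the LENS framework, which optimizes reasoning and segmentation together and uses unified reinforcement learning rewards to improve both CoT rationales and segmentation quality.
Method: A reinforcement learning framework that jointly optimizes reasoning and segmentation end to end, with rewards defined at the sentence, box, and segment levels.
Result: An average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming GLaMM by up to 5.6%.
Insight: RL-driven CoT reasoning serves as a robust prior that can make segmentation models more generalizable.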
Abstract: Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
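The cIoU metric reported above can be sketched in a few lines. This assumes cIoU denotes cumulative IoU, as is common in referring segmentation: total intersection divided by total union over all samples. The flat 0/1 masks are toy data, not benchmark outputs.

```python
def ciou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection over summed union across samples.

    Each mask is a flat sequence of 0/1 pixel labels of equal length.
    """
    intersection = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        intersection += sum(1 for p, g in zip(pred, gt) if p and g)
        union += sum(1 for p, g in zip(pred, gt) if p or g)
    return intersection / union if union else 0.0


# Two toy samples with four pixels each.
preds = [[1, 1, 0, 0], [0, 1, 1, 1]]
gts = [[1, 0, 0, 0], [0, 1, 1, 0]]
score = ciou(preds, gts)  # (1 + 2) / (2 + 3)
```

Because intersections and unions are accumulated before dividing, larger objects contribute more to the score than they would under a per-sample average.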
[29] RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang,Yuqian Yuan,Yunxuan Mao,Kehan Li,Jiangpin Liu,Zhikai Wang,Xin Li,Fan Wang,Deli Zhao
Main category: cs.CV
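TL;DR: RynnEC is a video multimodal large language model for embodied cognition. A region encoder and a mask decoder enable flexible region-level video interaction. Despite its compact architecture, it achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning.
Details
Motivation: Embodied agents need fine-grained perception of and precise interaction with the physical world, and annotated 3D data is scarce.
Contribution: 1. The RynnEC model, which supports region-level video interaction; 2. A data generation pipeline based on egocentric video; 3. RynnEC-Bench, a region-centered evaluation benchmark.
Method: Built on a general-purpose vision-language foundation model, RynnEC adds a region encoder and a mask decoder for flexible region-level video interaction, and generates embodied cognition data with an egocentric-video-based pipeline.
Result: State-of-the-art performance on object property understanding, segmentation, and spatial reasoning.
Insight: A region-centric video paradigm gives embodied agents finer-grained perception and can advance the development of general-purpose cognitive cores.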
Abstract: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
[30] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Md Ashiqur Rahman,Chiao-An Yang,Michael N. Cheng,Lim Jun Hao,Jeremiah Jiang,Teck-Yian Lim,Raymond A. Yeh
Main category: cs.CV
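TL;DR: The paper proposes a deep equilibrium canonicalizer (DEC) that improves a model's local scale equivariance, addressing scale variation in computer vision. DEC is easy to integrate into existing networks and improves both performance and local scale consistency on the ImageNet benchmark.
Details
Motivation: Objects of the same class can differ in size, and their perceived size also depends on distance from the camera. These variations are local to each object, and existing methods handle them poorly, so a method that improves local scale equivariance is needed.
Contribution: Proposes DEC, which improves local scale equivariance and can be incorporated into existing architectures or adapted to pre-trained models. DEC improves performance and local scale consistency across ViT, DeiT, Swin, and BEiT.
Method: DEC handles local scale variation as a canonicalizer built on a deep equilibrium mechanism. It can be adapted to pre-trained models and is validated on ImageNet.
Result: On the ImageNet benchmark, DEC improves both the performance and the local scale consistency of pre-trained models such as ViT, DeiT, Swin, and BEiT.
Insight: DEC offers a general way to make models more robust to local scale variation without redesigning the network, which suggests that canonicalizers are a promising route to scale equivariance.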
Abstract: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.
[31] CLIPSym: Delving into Symmetry Detection with CLIP
Tinghan Yang,Md Ashiqur Rahman,Raymond A. Yeh
Main category: cs.CV
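TL;DR: The paper proposes CLIPSym, which combines CLIP's image and language encoders with a rotation-equivariant decoder to detect rotation and reflection symmetries. A new prompting technique, SAPG, improves performance, and experiments show that CLIPSym outperforms existing methods on several datasets.
Details
Motivation: Symmetry is a fundamental geometric cue in computer vision, but detecting it remains challenging. The authors investigate whether a pre-trained CLIP model can improve detection by exploiting symmetry cues in natural image descriptions.
Contribution: 1. The CLIPSym framework, combining CLIP with a rotation-equivariant decoder; 2. The SAPG prompting technique, which integrates semantic cues; 3. State-of-the-art performance on several datasets.
Method: 1. Use CLIP's image and language encoders; 2. Add a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution; 3. Aggregate a diverse set of object-based prompts with SAPG.
Result: CLIPSym outperforms existing methods on the DENDI, SDRW, and LDRS datasets. Ablations verify the benefits of CLIP pre-training, the equivariant decoder, and SAPG.
Insight: Pre-trained vision-language models capture symmetry cues effectively, and prompt diversity matters for the performance gain.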
Abstract: Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP’s image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP’s language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP’s pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
[32] A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment
Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi
Main category: cs.CV
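TL;DR: This survey systematically organizes deep-learning-based video anomaly detection (VAD) research across supervision levels and adaptive learning methods, analyzes three application categories (human, vehicle, and environment), and identifies the contributions and limitations of current methods.
Details
Motivation: VAD is an important task in computer vision. Deep learning has driven progress, but the field remains fragmented and lacks systematic consolidation. The survey aims to give the community a structured overview that advances both theory and practical application.
Contribution: 1. Organizes the VAD literature by supervision level and adaptive learning method; 2. Analyzes VAD across human-centric, vehicle-centric, and environment-centric scenarios; 3. Identifies the contributions and limitations of current methods.
Method: A literature review that categorizes and analyzes work by supervision level (e.g., fully supervised, weakly supervised, unsupervised) and by adaptive learning method (online, active, and continual learning).
Result: Summarizes the state of VAD research in each application scenario, clarifies the strengths and weaknesses of current methods, and outlines directions for future work.
Insight: VAD research needs to draw on insights from multiple subfields and address the practical obstacles to deployment, with attention to generalization and real-time performance.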
Abstract: Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.
[33] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams
Mustafa Mohammadi Gharasuie,Luis Rueda
Main category: cs.CV
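TL;DR: The paper proposes an image classification method that combines Voronoi diagrams with graph convolutional networks (GCNs). Images are represented as graphs, and a normalized Voronoi Graph Convolution Network (NVGCN) is introduced. The authors report significant improvements in pre-processing time and classification accuracy.
Details
Motivation: Conventional convolutional neural networks (CNNs) show limited performance in complex scenes and fine-grained classification. The authors propose expressing the relations within an image as a graph and using the geometric properties of Voronoi diagrams to improve computational efficiency.
Contribution: 1. A framework based on Voronoi diagrams and GCNs; 2. NVGCN, which is faster than the regular GCN; 3. Validation of the method on several benchmark datasets.
Method: Images are represented as graphs whose vertices are pixels or regions. The graphs are simplified into the corresponding Delaunay triangulations, and the NVGCN model performs feature learning and classification.
Result: Experiments show that the method outperforms existing models in pre-processing time and classification accuracy, especially in complex scenes and fine-grained classification tasks.
Insight: Combining graph structure with geometric partitioning (such as Voronoi diagrams) offers a new direction for image classification, and the NVGCN design may extend to other tasks on non-structured data.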
Abstract: Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to perform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.
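The abstract does not say how NVGCN normalizes its graph, so as a point of reference the following is a minimal sketch of the propagation step of the regular GCN it is compared against, using the usual symmetric normalization D^{-1/2}(A + I)D^{-1/2}. The four-node graph is a hypothetical stand-in for image regions joined by Delaunay edges, and the function names are illustrative only.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists."""
    # Start from the identity so every node has a self-loop.
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for u, v in edges:  # undirected edges
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    degree = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def propagate(a_hat, features):
    """Mix each node's features with its neighbours' (no learned weights)."""
    dim = len(features[0])
    return [
        [sum(a_hat[i][j] * features[j][k] for j in range(len(features)))
         for k in range(dim)]
        for i in range(len(features))
    ]


# Hypothetical 4-region graph: one triangle (0, 1, 2) plus a dangling edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]
A_HAT = normalized_adjacency(4, EDGES)
SMOOTHED = propagate(A_HAT, [[1.0], [0.0], [0.0], [0.0]])
```

A full GCN layer would multiply the propagated features by a learned weight matrix and apply a nonlinearity; those parts are omitted here.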
[34] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Thanh-Dat Truong,Huu-Thien Tran,Tran Thai Son,Bhiksha Raj,Khoa Luu
Main category: cs.CV
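TL;DR: The paper proposes Directed-Tokens, a multi-modality alignment method that improves the robustness and generalization of large language-vision models by having them reconstruct the order of shuffled images and text.
Details
Motivation: Existing large multimodal models (LMMs) are limited in how robustly they align and correlate visual and textual features, which hurts generalization and reasoning performance.
Contribution: 1. A new learning mechanism that improves multimodal alignment through image-order and text-order reconstruction tasks; 2. A directed-token approach for capturing visual and textual knowledge; 3. An Image-to-Response Guided loss that strengthens visual understanding.
Method: Image-order and text-order reconstruction tasks are introduced into the pre-training and fine-tuning phases, directed tokens capture multimodal knowledge, and a new loss function improves the model's responses.
Result: The method achieves state-of-the-art performance on academic task-oriented and instruction-following LMM benchmarks.
Insight: The order-reconstruction tasks and the directed-token design improve multimodal alignment and visual understanding while also making the model more robust.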
Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
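To make the shuffling problem concrete, the sketch below builds one order-reconstruction example: it shuffles a sequence and records the target that undoes the shuffle. It covers only the data side; how the paper encodes this target with directed tokens or in its loss is not described in the abstract, and the function names are hypothetical.

```python
import random


def make_order_task(items, seed=0):
    """Shuffle `items` and return (shuffled, target).

    target[i] is the position in `shuffled` that holds original item i,
    so reading `shuffled` in the order given by `target` restores the input.
    """
    rng = random.Random(seed)
    perm = list(range(len(items)))
    rng.shuffle(perm)
    shuffled = [items[p] for p in perm]
    target = [perm.index(i) for i in range(len(items))]
    return shuffled, target


def restore(shuffled, target):
    """Apply a (predicted or ground-truth) target to undo the shuffle."""
    return [shuffled[t] for t in target]


# The items could be image patches or text segments; strings stand in here.
PATCHES = ["patch_0", "patch_1", "patch_2", "patch_3"]
SHUFFLED, TARGET = make_order_task(PATCHES, seed=7)
```

A model trained on this task would receive `SHUFFLED` and be supervised to predict `TARGET`.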
[35] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference
Ali Rasekh,Sepehr Kazemi Ranjbar,Simon Gottschalk
Main category: cs.CV
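TL;DR: The paper introduces a multi-rationale explainable object recognition benchmark and a contrastive conditional inference (CCI) framework, addressing the weak explanatory conditioning of existing CLIP-based methods.
Details
Motivation: Existing explainable object recognition methods built on vision-language models such as CLIP rely on prompt-based conditioning, which is limited by CLIP's text encoder and conditions only weakly on explanatory structure. In addition, prior datasets mostly provide single or noisy rationales and fail to capture the diversity of discriminative features.
Contribution: 1. A multi-rationale explainable object recognition benchmark in which each image is annotated with multiple ground-truth rationales; 2. The CCI framework, which explicitly models the probabilistic relationships among image embeddings, category labels, and rationales; 3. Strong zero-shot performance.
Method: The CCI framework requires no training. It models the relationships among embeddings, labels, and rationales so that rationales can be used effectively to predict categories.
Result: State-of-the-art results on the benchmark, with strong zero-shot performance and improvements in both classification accuracy and rationale quality.
Insight: Combining multi-rationale annotation with probabilistic modeling gives a more comprehensive evaluation standard for explainability tasks, and the training-free design makes the approach more broadly applicable.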
Abstract: Explainable object recognition using vision-language models such as CLIP involves predicting accurate category labels supported by rationales that justify the decision-making process. Existing methods typically rely on prompt-based conditioning, which suffers from limitations in CLIP’s text encoder and provides weak conditioning on explanatory structures. Additionally, prior datasets are often restricted to single, and frequently noisy, rationales that fail to capture the full diversity of discriminative image features. In this work, we introduce a multi-rationale explainable object recognition benchmark comprising datasets in which each image is annotated with multiple ground-truth rationales, along with evaluation metrics designed to offer a more comprehensive representation of the task. To overcome the limitations of previous approaches, we propose a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Without requiring any training, our framework enables more effective conditioning on rationales to predict accurate object categories. Our approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Together with the benchmark, this work provides a more complete framework for evaluating future models in explainable object recognition. The code will be made available online.
[36] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA
Anushka A. Kore,Frank G. te Nijenhuis,Matthijs van der Sluijs,Wim van Zwam,Charles Majoie,Geert Lycklama à Nijeholt,Danny Ruijters,Frans Vos,Sandra Cornelissen,Ruisheng Su,Theo van Walsum
Main category: cs.CV
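TL;DR: OccluNet is a spatio-temporal deep learning model that combines the YOLOX object detector with transformer-based temporal attention to detect vascular occlusions automatically in DSA sequences. It significantly outperforms the baseline models.
Details
Motivation: In the treatment of acute ischemic stroke, accurately detecting vascular occlusions is critical when interpreting digital subtraction angiography (DSA) sequences, but anatomical complexity and time pressure make manual detection challenging.
Contribution: Proposes OccluNet, which fuses YOLOX with temporal attention mechanisms to automate occlusion detection in DSA sequences and significantly improve performance.
Method: Combines YOLOX (a single-stage object detector) with transformer-based temporal attention. Two spatio-temporal variants are explored: pure temporal attention and divided space-time attention.
Result: On DSA images from the MR CLEAN Registry, OccluNet reaches a precision of 89.02% and a recall of 74.87%, significantly outperforming the YOLOv11 baselines. The two attention variants perform similarly.
Insight: The temporal attention mechanism captures temporally consistent features, offering an approach to detecting dynamic targets in medical images.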
Abstract: Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model’s capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git
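For readers who want a single summary number, the sketch below shows how precision and recall are computed from detection counts and how they combine into an F1 score. The abstract reports only precision (89.02%) and recall (74.87%); the F1 value is derived here by arithmetic and is not a figure from the paper. The counts in the first call are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not from the paper).
P_DEMO, R_DEMO = precision_recall(tp=8, fp=2, fn=2)

# Derived from the reported precision and recall; not reported by the authors.
F1_DERIVED = f1_score(0.8902, 0.7487)
```

With the reported values, the derived F1 comes out at roughly 0.81, which reflects the gap between the high precision and the lower recall.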
[37] Pixels to Play: A Foundation Model for 3D Gameplay
Yuguang Yue,Chris Green,Samuel Hunt,Irakli Salia,Wenzhe Shi,Jonathan J Hunt
Main category: cs.CV
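TL;DR: Pixels2Play-0.1 (P2P0.1) is a foundation model that learns from the pixel stream to play a wide range of 3D video games with recognizably human-like behavior.
Details
Motivation: Emerging consumer and developer use cases such as AI teammates, controllable NPCs, and personalized live-streamers call for an agent that relies only on the pixel stream visible to players and generalizes to new games.
Contribution: P2P0.1, a foundation model trained end-to-end with behavior cloning on labeled demonstrations supplemented by unlabeled video, designed to generalize across many games.
Method: Behavior cloning on labeled human gameplay, an inverse-dynamics model that imputes actions for unlabeled public videos, and a decoder-only transformer with auto-regressive action output.
Result: Qualitative results show competent play across simple Roblox and classic MS-DOS titles, together with ablations on the unlabeled data.
Insight: Combining labeled and unlabeled data with a latency-friendly model design lays groundwork for expert-level, text-conditioned game control.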
Abstract: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
[38] MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation
Guile Wu,David Huang,Dongfeng Bai,Bingbing Liu
Main category: cs.CV
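TL;DR: The paper proposes MoVieDrive, a multi-modal multi-view video generation method for autonomous driving scenes. A unified diffusion transformer addresses the limitation that existing methods generate only RGB video.
Details
Motivation: Existing video generation methods for autonomous driving focus on RGB video and lack support for other modalities such as depth maps and semantic maps, which are crucial for holistic scene understanding. Using a separate model per modality makes deployment harder and fails to exploit complementary cues.
Contribution: A unified multi-modal multi-view video generation framework, MoVieDrive, whose diffusion transformer combines modal-shared and modal-specific components and supports high-fidelity, controllable generation.
Method: A unified diffusion transformer composed of modal-shared and modal-specific components. Diverse conditioning inputs encode controllable scene structure and content cues for multi-modal multi-view video generation.
Result: Experiments on the nuScenes dataset show that the method generates multi-modal multi-view videos with high fidelity and controllability, surpassing existing methods.
Insight: Handling multiple modalities in one framework makes driving-scene video generation more complete and practical while reducing deployment complexity.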
Abstract: Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.
[39] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates
Dian Ning,Dong Seog Han
Main category: cs.CV
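TL;DR: The paper proposes a new loss for small object detection, the inter-class relational (ICR) loss, which uses spatial relationships between classes (such as a license plate and its car) to improve gradient updates for small objects. It also presents the SVMLP dataset. Experiments show that the ICR loss improves detection performance.
Details
Motivation: Conventional IoU-based losses update small objects poorly because their gradient is extremely flat, which leads to weak small-object detection. The paper addresses this by exploiting inter-class spatial relationships, such as the roughly fixed position of a license plate on a car.
Contribution: 1. A small vehicle multi-license plate dataset (SVMLP) with high-quality annotations; 2. A new inter-class relational (ICR) loss that uses inter-class spatial relationships to improve small-object detection.
Method: When the predicted license plate box does not lie within its car, the ICR loss adds a penalty that depends on the overlap between the car box and the predicted plate box, so the larger object guides the small-object prediction. The penalty can be added to existing IoU-based losses.
Result: The ICR loss improves mAP$^{\text{test}}_{50}$ by 10.3% on YOLOv12-T and by 1.6% on UAV-DETR, without any additional hyperparameter tuning.
Insight: Exploiting inter-class spatial relationships addresses the gradient-update problem for small objects without sacrificing the learning efficiency of other objects, and the idea may carry over to other small-object detection tasks.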
Abstract: In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects’ gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position). When the predicted car plate’s bounding box is not within its car, a loss punishment is added to guide the learning, which is inversely proportional to the overlapped area of the car’s and predicted car plate’s bounding box. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight the proposed ICR loss penalty can be easily added to existing IoU-based losses and enhance the performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
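A minimal sketch of the relational penalty idea described above. The abstract says only that the penalty applies when the predicted plate box is not within its car and that it is inversely proportional to the overlapped area; the exact formula is not given. The sketch therefore uses an assumed stand-in, one minus the fraction of the plate box covered by the car box, which likewise shrinks as overlap grows. All function names are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def intersection_area(a, b):
    """Overlapped area of two axis-aligned boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def relational_penalty(plate_box, car_box):
    """Assumed penalty: 0 when the plate lies fully inside the car, 1 when disjoint."""
    plate_area = box_area(plate_box)
    if plate_area == 0.0:
        return 1.0  # degenerate prediction: treat as fully outside
    covered = intersection_area(plate_box, car_box) / plate_area
    return 1.0 - covered


def loss_with_relation(iou_loss, plate_box, car_box):
    """Add the relational penalty on top of any existing IoU-based loss value."""
    return iou_loss + relational_penalty(plate_box, car_box)


CAR = (0.0, 0.0, 10.0, 10.0)
INSIDE = relational_penalty((2.0, 2.0, 4.0, 3.0), CAR)
HALF = relational_penalty((9.0, 0.0, 11.0, 1.0), CAR)
OUTSIDE = relational_penalty((20.0, 20.0, 22.0, 21.0), CAR)
```

Because the penalty vanishes whenever the plate prediction sits inside the car box, it leaves the base IoU loss unchanged for well-placed predictions and only steers the ones that drift away from the larger object.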
[40] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model
Sean Fletcher,Gabby Scott,Douglas Currie,Xin Zhang,Yuqi Song,Bruce MacLeod
Main category: cs.CV
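TL;DR: The paper introduces a new microscopy image dataset and an attention-based baseline model for analyzing the morphological effects of Taxol on cells. The dataset fills a gap in the field, and the ResAttention-KNN model combines ResNet-50, attention modules, and a KNN classifier.
Details
Motivation: Existing methods for detecting the cellular effects of Taxol require specialized equipment and skilled personnel, are expensive, and are unsuitable for high-throughput or real-time analysis. Deep learning can automate the analysis of cell morphology, but no public dataset or benchmark model exists.
Contribution: 1. The first publicly available microscopy image dataset of Taxol-treated cells; 2. A baseline model, ResAttention-KNN, that combines ResNet-50, attention modules, and a KNN classifier.
Method: A ResNet-50 backbone with Convolutional Block Attention Modules (CBAM) extracts features, and a k-Nearest Neighbors classifier operates in the learned embedding space. The attention modules are intended to increase sensitivity to morphological change, and the KNN step to improve robustness.
Result: ResAttention-KNN is reported to perform well on Taxol concentration classification. The dataset and implementation are publicly released to support reproduction and extension.
Insight: Attention mechanisms can capture subtle changes in cell morphology, and KNN offers a simple, effective classifier when data are limited. The public dataset and code are a useful resource for the field.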
Abstract: Monitoring the effects of the chemotherapeutic agent Taxol at the cellular level is critical for both clinical evaluation and biomedical research. However, existing detection methods require specialized equipment, skilled personnel, and extensive sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches have shown great promise in medical and biological image analysis, enabling automated, high-throughput assessment of cellular morphology. Yet, no publicly available dataset currently exists for automated morphological analysis of cellular responses to Taxol exposure. To address this gap, we introduce a new microscopy image dataset capturing C6 glioma cells treated with varying concentrations of Taxol. To provide an effective solution for Taxol concentration classification and establish a benchmark for future studies on this dataset, we propose a baseline model named ResAttention-KNN, which combines a ResNet-50 with Convolutional Block Attention Modules and uses a k-Nearest Neighbors classifier in the learned embedding space. This model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability. Both the dataset and implementation are publicly released to support reproducibility and facilitate future research in vision-based biomedical analysis.
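The classification step of ResAttention-KNN, a k-Nearest Neighbors vote in the learned embedding space, can be sketched as below. The two-dimensional embeddings and the "control"/"treated" labels are made up for illustration; in the paper the embeddings come from the ResNet-50 plus attention backbone, and the distance metric and value of k are not stated in the abstract (Euclidean distance and k=3 are assumptions).

```python
import math
from collections import Counter


def knn_predict(query, embeddings, labels, k=3):
    """Majority vote among the k stored embeddings closest to `query`."""
    ranked = sorted(
        (math.dist(query, emb), label) for emb, label in zip(embeddings, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical embeddings standing in for backbone outputs.
EMBEDDINGS = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),   # untreated cells
    (1.0, 1.0), (0.9, 1.2), (1.1, 0.8),   # Taxol-treated cells
]
LABELS = ["control", "control", "control", "treated", "treated", "treated"]

PRED_A = knn_predict((0.1, 0.1), EMBEDDINGS, LABELS)
PRED_B = knn_predict((1.0, 0.9), EMBEDDINGS, LABELS)
```

Because KNN has no trainable parameters, new labeled images can be added to the reference set without retraining the classifier, which suits small biomedical datasets.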
[41] Taming Transformer for Emotion-Controllable Talking Face Generation
Ziqi Zhang,Cheng Deng
Main category: cs.CV
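TL;DR: The paper proposes a new method for emotion-controllable talking face generation. It combines pre-training strategies, an emotion-anchor (EA) representation, and an autoregressive transformer to generate identity-preserving emotional videos.
Details
Motivation: Talking face generation currently faces two challenges: how to model the multimodal relationship tied to a specific emotion, and how to use that relationship to synthesize identity-preserving emotional videos. The paper targets both.
Contribution: Two pre-training strategies that disentangle audio into independent components and quantize videos into combinations of visual tokens; an emotion-anchor (EA) representation that integrates emotional information into the visual tokens; and an autoregressive transformer that generates emotion-controllable videos.
Method: Audio is disentangled and video is quantized through pre-training, the EA representation injects emotion into the visual tokens, and an autoregressive transformer models the global distribution of the visual tokens and predicts the index sequence used to synthesize the video.
Result: Experiments on the MEAD dataset show that the method outperforms existing methods both qualitatively and quantitatively.
Insight: Quantizing video and disentangling audio, combined with the EA representation and an autoregressive transformer, is an effective route to identity-preserving, emotionally expressive talking face video.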
Abstract: Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: One is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiorities of our method both qualitatively and quantitatively.
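The abstract says that videos are quantized into combinations of visual tokens and that the transformer predicts an index sequence, but it does not specify the quantizer. As an assumption, the sketch below uses the nearest-neighbour codebook lookup common to VQ-style models: each feature vector is replaced by the index of its closest codebook entry, and the indices map back to codebook vectors for decoding. The codebook and features are toy values.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for vec in vectors:
        best = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
        )
        indices.append(best)
    return indices


def dequantize(indices, codebook):
    """Look up the codebook vector for each predicted index."""
    return [codebook[i] for i in indices]


# Toy codebook with three "visual tokens" and four frame features.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
FEATURES = [(0.1, 0.0), (0.9, 1.1), (0.1, 0.8), (0.6, 0.9)]
TOKENS = quantize(FEATURES, CODEBOOK)
RECONSTRUCTED = dequantize(TOKENS, CODEBOOK)
```

Once video is expressed as such an index sequence, generation reduces to predicting discrete tokens under the audio and emotion conditions, which is the setting an autoregressive transformer handles directly.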
[42] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network
Runshi Zhang,Bimeng Jie,Yang He,Junchen Wang
Main category: cs.CV
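TL;DR: TCFNet is a Transformer-based coarse-to-fine point movement network that models the bidirectional transformation between facial and skeletal point clouds, addressing the limits of traditional and existing deep learning methods in computation time, accuracy, and applicability.
Details
Motivation: Traditional biomechanical simulation is slow, requires labor-intensive data processing, and has low accuracy. Existing deep learning methods cannot process large-scale point clouds, have limited receptive fields, and need complex pre- and postprocessing. A more efficient and accurate solution is needed.
Contribution: 1. TCFNet, which performs coarse-to-fine point cloud transformation with a Transformer-based first stage and a local information aggregation network (LIA-Net) second stage; 2. LIA-Net compensates for the Transformer's weakness in modeling local geometric structure; 3. An auxiliary loss that uses expert knowledge to reconstruct critical organs.
Method: 1. The first stage uses a Transformer-based network for global features; 2. The second stage uses LIA-Net to model local geometric structures; 3. A gated recurrent unit uses the earlier global features to guide local displacement; 4. An auxiliary loss improves the reconstruction of critical organs.
Result: On the gathered datasets, TCFNet outperforms existing state-of-the-art methods in both evaluation metrics and visualization results.
Insight: 1. A staged coarse-to-fine approach improves the accuracy of point cloud transformation; 2. Combining global and local information is key to dense point cloud transformation; 3. Incorporating expert knowledge can further improve medical imaging tasks.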
Abstract: Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.
[43] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation
Wenyong Zhou,Boyu Li,Jiachen Ren,Taiqiang Wu,Zhilin Ai,Zhengwu Liu,Ngai Wong
Main category: cs.CV
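TL;DR: QuadINR is a hardware-efficient implicit neural representation that uses quadratic activation functions to cut hardware overhead while improving the representation of high-frequency signals. Its efficiency is demonstrated on FPGA and ASIC.
Details
Motivation: Conventional implicit neural representations (INRs) use complex activation functions to mitigate spectral bias, which incurs significant hardware overhead. QuadINR aims for an efficient hardware implementation through quadratic activations.
Contribution: QuadINR, which uses piecewise quadratic activation functions to reach high performance at low hardware cost; a unified hardware implementation framework; and FPGA and ASIC implementations that demonstrate its efficiency.
Method: Piecewise quadratic activation functions, whose Fourier series contain rich harmonic content, with the resulting expressivity verified through Neural Tangent Kernel (NTK) analysis; and a unified $N$-stage hardware pipeline framework.
Result: Across image and video tasks, QuadINR improves PSNR by up to 2.06 dB over prior work, with an area of only 1914 μm² and a dynamic power of 6.14 mW. It reduces resource and power consumption by up to 97% and latency by up to 93%.
Insight: Quadratic activation functions strike a good balance between hardware efficiency and high-frequency expressivity, offering a practical route to deploying INRs.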
Abstract: Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified $N$-stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914$\mu$m$^2$ and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% vs existing baselines.
[44] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning
Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Juming Xiong,Chongyu Qu,Mengmeng Yin,Yu Wang,Shilin Zhao,Haichun Yang,Daguang Xu,Yucheng Tang,Yuankai Huo
Main category: cs.CV
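TL;DR: Img2ST-Net is an efficient framework for high-resolution spatial transcriptomics prediction. It uses fully convolutional image-to-image learning to generate dense gene expression maps from whole slide histology images, addressing the inefficiency and instability of existing methods.
Details
Motivation: Spatial transcriptomics (ST) data are expensive and time-consuming to acquire, and existing spot-by-spot inference methods become inefficient and unstable at very high resolution. The paper aims for an efficient, parallel prediction method.
Contribution: 1. Img2ST-Net, a fully convolutional framework that generates high-resolution ST data in parallel; 2. A super-pixel representation that turns the task into an image generation problem; 3. SSIM-ST, an evaluation metric suited to the sparsity of high-resolution data.
Method: A fully convolutional architecture models high-resolution ST data as super-pixel representations and reformulates prediction as an image generation task with hundreds or thousands of output channels, improving computational efficiency.
Result: The method is reported to be more computationally efficient and more accurate than conventional spot-by-spot inference, and generates high-resolution gene expression maps efficiently.
Insight: Recasting ST prediction as image generation, together with an evaluation metric suited to high-resolution data, points the way for next-generation spatial transcriptomics modeling.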
Abstract: Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.
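The abstract describes SSIM-ST only as a structural-similarity-based metric and gives no formula, so the sketch below shows the standard SSIM statistic it presumably builds on, computed globally over one flattened expression map rather than over sliding windows. The constants k1=0.01 and k2=0.03 are the conventional defaults; the toy maps and the function name are illustrative, and none of this should be read as the paper's SSIM-ST definition.

```python
def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Standard SSIM computed once over two equal-length flattened maps."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy sparse expression map: one expressed spot among zeros.
TRUTH = [0.0, 0.0, 1.0, 0.0]
PERFECT = ssim_global(TRUTH, TRUTH)
MISSED = ssim_global(TRUTH, [0.0, 0.0, 0.0, 0.0])
```

The toy case hints at why a structure-aware score is attractive for sparse data: an all-zero prediction differs from the truth at only one of four spots, yet its structural similarity is near zero because it misses the only expressed spot.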
[45] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities
Yue Gong,Shanyuan Liu,Liuzhuozheng Li,Jian Zhu,Bo Cheng,Liebucha Wu,Xiaoyu Wu,Yuhang Ma,Dawei Leng,Yuhui Yin
Main category: cs.CV
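TL;DR: The paper proposes CTA-Flux, an adaptation method that uses a MultiModal Diffusion Transformer (MMDiT) to bring Chinese semantics directly into Flux, an English text-to-image model. It addresses the weakness of existing methods on culturally specific semantics and improves image quality and cultural authenticity.
Details
Motivation: English text-to-image models such as Flux perform poorly on non-English prompts, Chinese in particular, mainly because of linguistic and cultural bias in the training data. Existing approaches such as translation or bilingual fine-tuning do not capture culturally specific semantics well, which degrades image quality.
Contribution: CTA-Flux, which controls the Flux backbone directly through MMDiT, significantly reduces the parameter count, strengthens the understanding of Chinese semantics, and improves generation quality and cultural authenticity while staying compatible with existing plugins.
Method: An MMDiT controls the Flux backbone directly to inject Chinese semantics, avoiding extensive retraining of the whole model and keeping compatibility with plugins such as LoRA, IP-Adapter, and ControlNet.
Result: Experiments show that CTA-Flux supports both Chinese and English prompts and outperforms existing methods in image quality, visual realism, and faithful depiction of Chinese semantics.
Insight: Controlling the backbone directly, rather than relying on translation or bilingual fine-tuning, handles multilingual and cultural diversity more efficiently while keeping the model lightweight and compatible.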
Abstract: We propose the Chinese Text Adapter-Flux (CTA-Flux), an adaptation method that fits Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on an English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-Flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model’s understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet. Empirical evaluations demonstrate that CTA-Flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.
[46] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing
Jeahun Sung,Changhyun Roh,Chanho Eom,Jihyong Oh
Main category: cs.CV
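TL;DR: MoCHA-former is a hybrid adaptive Transformer for video demoiréing. It decouples moiré from content and adds spatio-temporal adaptive processing, significantly improving moiré removal.
Details
Motivation: When portable imaging devices capture a screen, frequency aliasing between the camera's color filter array (CFA) and the display's sub-pixels produces moiré patterns that severely degrade image quality. Existing methods struggle with spatial and temporal variation, large-scale structures, and channel dependence.
Contribution: 1. The DMAD module, which decouples moiré from content and produces moiré-adaptive features through an MCB; 2. The STAD module, which combines window attention and channel attention for spatio-temporal consistency; 3. Frame alignment without any explicit alignment module.
Method: 1. DMAD separates moiré and detail with an MDB and a DDB, and an MCB generates moiré-adaptive features; 2. STAD introduces an SFB for large-scale spatial structures and FCA for channel dependence; 3. Implicit frame alignment ensures temporal consistency.
Result: On two video datasets covering the RAW and sRGB domains, MoCHA-former outperforms prior methods on PSNR, SSIM, and LPIPS.
Insight: Decoupling moiré from content and combining it with spatio-temporal adaptive processing improves demoiréing in complex scenes, and dropping the explicit alignment module simplifies the model.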
Abstract: Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera’s color filter array (CFA) and the display’s sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.
[47] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation
Bing Han,Yuhua Huang,Pan Gao
Main category: cs.CV
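TL;DR: The paper proposes HyperDiff, which combines a diffusion model with HyperGCN to address depth ambiguity and occlusion in monocular 3D human pose estimation while balancing performance and efficiency.
Details
Motivation: Monocular 3D human pose estimation suffers from depth ambiguity and occlusion, and traditional methods may overlook multi-scale skeleton features. HyperDiff combines a diffusion model with HyperGCN to improve accuracy.
Contribution: 1. HyperDiff, which integrates diffusion models with HyperGCN; 2. Multi-granularity structures in HyperGCN that model high-order correlations between joints; 3. State-of-the-art results on the Human3.6M and MPI-INF-3DHP datasets.
Method: 1. A diffusion model captures data uncertainty; 2. HyperGCN serves as the denoiser and models joint relationships through multi-granularity structures; 3. The method adapts to varying computational resources to balance performance and efficiency.
Result: HyperDiff outperforms existing methods on Human3.6M and MPI-INF-3DHP and adapts flexibly to different computational budgets.
Insight: The multi-granularity design of HyperGCN improves denoising for complex poses and offers a new direction for 3D pose estimation.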
Abstract: Monocular 3D human pose estimation (HPE) often encounters challenges such as depth ambiguity and occlusion during the 2D-to-3D lifting process. Additionally, traditional methods may overlook multi-scale skeleton features when utilizing skeleton structure information, which can negatively impact the accuracy of pose estimation. To address these challenges, this paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. The diffusion model effectively captures data uncertainty, alleviating depth ambiguity and occlusion. Meanwhile, HyperGCN, serving as a denoiser, employs multi-granularity structures to accurately model high-order correlations between joints. This improves the model’s denoising capability especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.
[48] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation
Gabriel Tjio,Jie Zhang,Xulei Yang,Yun Xing,Nhat Chung,Xiaofeng Cao,Ivor W. Tsang,Chee Keong Kwoh,Qing Guo
Main category: cs.CV
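TL;DR: FOCUS addresses catastrophic forgetting in test-time adaptation through frequency-optimized conditioning of diffusion models. It combines a lightweight Y-FPN network with FrequencyMix data augmentation and improves semantic segmentation and depth estimation performance.
Details
Motivation: In test-time adaptation, a model must balance domain adaptation against retaining task-relevant knowledge, but existing methods are prone to catastrophic forgetting. FOCUS proposes a frequency-based solution. Contribution: 1) Proposes the FOCUS framework, which preserves task semantics by conditioning a diffusion model on frequency information; 2) Designs the lightweight Y-FPN network and the FrequencyMix data augmentation method; 3) Achieves state-of-the-art performance on semantic segmentation and depth estimation.
Method: 1) Y-FPN separates the high- and low-frequency information of an image; 2) Frequency conditioning in the reverse diffusion steps protects task-relevant semantics; 3) FrequencyMix increases data diversity.
Result: Across 15 corruption types and 3 datasets, FOCUS reaches state-of-the-art averaged performance on semantic segmentation and depth estimation and mitigates catastrophic forgetting.
Insight: Frequency decomposition is an effective means of mitigating catastrophic forgetting, and diffusion-model conditioning can be combined flexibly with existing adaptation methods.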
Abstract: Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
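The idea of disentangling high- and low-frequency information can be illustrated with a toy example. The sketch below is not Y-FPN, which is a learned network operating on noisy images; it only assumes a fixed moving-average filter as the low-pass band and treats the residual as the high-frequency band. The function name `split_frequencies` and the `window` parameter are hypothetical.

```python
def split_frequencies(signal, window=3):
    """Split a 1-D signal into low- and high-frequency parts.

    Low band: centered moving average (a crude, fixed low-pass filter).
    High band: the residual, so low + high reconstructs the input.
    This is an illustrative stand-in, not the learned Y-FPN from the paper.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    low = []
    for i in range(n):
        # Shrink the window at the borders instead of padding.
        start, stop = max(0, i - half), min(n, i + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A constant offset (low frequency) plus an alternating pattern (high frequency).
signal = [10.0 + (1.0 if i % 2 == 0 else -1.0) for i in range(8)]
low, high = split_frequencies(signal, window=3)
```

Because the high band is defined as the residual, the two bands always sum back to the input, which makes it easy to perturb or condition on one band without losing the other.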
[49] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
Fei Peng,Junqiang Wu,Yan Li,Tingting Gao,Di Zhang,Huiyuan Fu
Main category: cs.CV
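TL;DR: MUSE is a unified multi-subject synthesis framework that achieves text-to-image multi-subject synthesis through explicit layout semantic expansion, addressing the difficulties existing methods have with spatial precision and identity consistency.
Details
Motivation: Existing text-to-image diffusion models struggle to satisfy both spatial control and identity preservation in multi-subject synthesis; MUSE aims to solve this problem. Contribution: Proposes the MUSE framework, which uses a concatenated cross-attention (CCA) mechanism for bidirectional modality alignment between layout and text, and a two-stage training strategy that decomposes the task.
Method: Uses the CCA mechanism to explicitly expand the semantic space of layout and text, and a two-stage training strategy to optimize the sub-tasks separately.
Result: Experiments show that MUSE outperforms existing methods in zero-shot end-to-end generation, with higher spatial accuracy and identity consistency.
Insight: Explicit semantic space expansion and task decomposition can improve controllability and generation quality in multi-subject synthesis.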
Abstract: Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
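The abstract does not spell out how concatenated cross-attention is implemented. One plausible reading, assumed here purely for illustration, is that layout tokens are appended to the text tokens along the token axis so that a single cross-attention pass attends over both. The sketch below shows that reading with plain scaled dot-product attention and no learned projections; `cross_attention` and the toy token values are hypothetical, and the actual MUSE design may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention with keys = values = context.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = len(context[0])
    outputs, weights = [], []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        w = softmax(logits)
        outputs.append([sum(wi * c[j] for wi, c in zip(w, context)) for j in range(d)])
        weights.append(w)
    return outputs, weights


text_tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings (assumed)
layout_tokens = [[1.0, 1.0]]             # toy layout embedding (assumed)

# Assumed reading of "concatenated": join both token sets along the token axis.
context = text_tokens + layout_tokens
outputs, weights = cross_attention([[2.0, 0.0]], context)
```

Under this reading, the attention weights for each image query are normalized jointly over text and layout tokens, so the two modalities share one softmax instead of being added as separate attention branches.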
[50] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting
Gyusam Chang,Tuan-Anh Vu,Vivek Alumootil,Harris Song,Deanna Pham,Sangpil Kim,M. Khalid Jawed
Main category: cs.CV
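TL;DR: The paper proposes NIRSplat, a multimodal 3D Gaussian splatting method that combines near-infrared (NIR) imagery with textual metadata to tackle 3D reconstruction in agricultural scenes. With the new NIRPlant dataset and a cross-attention mechanism, it significantly improves reconstruction in complex agricultural environments.
Details
Motivation: Agricultural scenes involve uneven illumination, occlusion, and a limited field of view, so conventional 3D reconstruction methods perform poorly. NIR imagery and vegetation-index data remain underused. Contribution: 1. Introduces NIRPlant, a new multimodal dataset containing NIR, RGB, depth, and LiDAR data; 2. Designs NIRSplat, a Gaussian splatting architecture that combines cross-attention with 3D positional encoding.
Method: Builds a multimodal 3D Gaussian splatting model from NIR imagery and textual metadata (vegetation indices such as NDVI and NDWI), using cross-attention and 3D point-based positional encoding.
Result: NIRSplat outperforms existing methods such as 3DGS, CoR-GS, and InstantSplat, especially in complex agricultural scenes.
Insight: NIR and vegetation-index data can significantly improve the robustness of 3D reconstruction, and in agricultural scenes they provide botanical information beyond the visible spectrum.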
Abstract: While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR
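For readers unfamiliar with the vegetation indices mentioned above, the following minimal sketch computes NDVI and NDWI from per-pixel reflectances and turns them into a short text line. It assumes the common definitions NDVI = (NIR - Red) / (NIR + Red) and the McFeeters form NDWI = (Green - NIR) / (Green + NIR); the abstract does not say which NDWI variant or which text format the authors use, so `describe` and its output format are hypothetical.

```python
def normalized_difference(a, b, eps=1e-12):
    """Return (a - b) / (a + b), or 0.0 when the denominator is near zero."""
    denom = a + b
    return 0.0 if abs(denom) < eps else (a - b) / denom


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized Difference Water Index, McFeeters form (assumed variant)."""
    return normalized_difference(green, nir)


def describe(nir, red, green):
    """Hypothetical text-metadata line; the paper's real format is not given."""
    return f"NDVI={ndvi(nir, red):.2f}; NDWI={ndwi(green, nir):.2f}"


# Reflectances typical of healthy vegetation: high NIR, low red.
summary = describe(nir=0.50, red=0.10, green=0.20)
```

Both indices lie in [-1, 1] for non-negative reflectances, which makes them convenient to bucket into coarse textual labels before feeding them to a text encoder.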
[51] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis
Yuhang Guo,Kaijun Deng,Siyang Song,Jindong Xie,Wenhui Ma,Linlin Shen
Main category: cs.CV
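TL;DR: D^3-Talker proposes dual-branch decoupled deformation fields for few-shot 3D talking head synthesis. By separating the prediction of general and personalized deformations, it achieves better lip synchronization and image quality.
Details
Motivation: With little training data, existing methods struggle to map audio accurately to the lip motion of the target face, resulting in poor lip synchronization and image quality. Contribution: 1) Proposes a dual-branch decoupled deformation field structure; 2) Designs a similarity contrastive loss function; 3) Introduces a Coarse-to-Fine module to improve image quality.
Method: Builds a static 3D Gaussian attribute field, uses audio and facial motion signals to control two deformation fields independently, and refines the results with the similarity contrastive loss and the Coarse-to-Fine module.
Result: Experiments show that D^3-Talker outperforms existing methods in high-fidelity rendering and lip synchronization.
Insight: Decoupling the prediction of general and personalized deformations is an effective way to improve few-shot 3D talking head synthesis.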
Abstract: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.
[52] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Shanlin Sun,Yifan Wang,Hanwen Zhang,Yifeng Xiong,Qin Ren,Ruogu Fang,Xiaohui Xie,Chenyu You
Main category: cs.CV
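TL;DR: Ouroboros proposes a single-step diffusion framework that handles forward and inverse rendering together through mutual reinforcement, achieving cycle consistency and fast inference.
Details
Motivation: Existing multi-step diffusion models usually treat forward and inverse rendering independently, which causes cycle inconsistency and slow inference. This motivated the authors to develop a unified framework. Contribution: 1) Proposes Ouroboros, a single-step diffusion framework that handles both forward and inverse rendering; 2) Extends intrinsic decomposition to indoor and outdoor scenes; 3) Introduces a cycle consistency mechanism; 4) Demonstrates fast inference and video decomposition.
Method: Ouroboros consists of two single-step diffusion models, one for forward rendering and one for inverse rendering, which reinforce each other through a cycle consistency mechanism.
Result: Experiments show state-of-the-art performance across diverse scenes, with inference substantially faster than other diffusion-based methods. On video decomposition, it reduces temporal inconsistency without additional training.
Insight: A cycle-consistent framework of single-step diffusion models can solve forward and inverse rendering efficiently and transfers to tasks such as video decomposition in a zero-shot manner.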
Abstract: While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
[53] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
Weitao Wang,Zichen Wang,Hongdeng Shen,Yulei Lu,Xirui Fan,Suhui Wu,Jun Zhang,Haoqian Wang,Hao Zhang
Main category: cs.CV
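TL;DR: DreamSwapV is a mask-guided, subject-agnostic, end-to-end framework for swapping any subject in a video. It takes a user-specified mask and a reference image, and improves results with a condition fusion module and an adaptive mask strategy.
Details
Motivation: With video generation advancing rapidly, demand for customized video editing is surging, yet subject swapping remains confined to narrow domains or relies on indirect editing paradigms, which limits practical use. Contribution: Proposes DreamSwapV, a mask-guided, subject-agnostic framework that supports subject swapping in any video; designs a multi-condition fusion module and an adaptive mask strategy; builds a two-phase dataset construction and training scheme.
Method: Takes a mask and a reference image as input, introduces the condition fusion module and the adaptive mask strategy, and trains the model with the two-phase dataset and training scheme.
Result: Outperforms existing methods on VBench indicators and the DreamSwapV-Benchmark, validating its effectiveness and generalization ability.
Insight: Mask guidance and the adaptive mask strategy are key to better subject swapping, and the condition fusion module gives stronger control in complex scenes.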
Abstract: With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains, such as human-body animation or hand-object interaction, or rely on some indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our newly introduced DreamSwapV-Benchmark.
[54] LookOut: Real-World Humanoid Egocentric Navigation
Boxiao Pan,Adam W. Harley,C. Karen Liu,Leonidas J. Guibas
Main category: cs.CV
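TL;DR: The paper proposes a framework that predicts a sequence of future 6D head poses from egocentric video, and introduces a new dataset, AND, for learning real-world navigation behavior.
Details
Motivation: Predicting collision-free future trajectories is crucial in applications such as humanoid robotics, VR/AR, and assistive navigation. However, training data and effective models for human active information-gathering behavior (such as head turning) are lacking. Contribution: 1. Poses the challenging problem of predicting future 6D head pose sequences; 2. Proposes a framework based on temporally aggregated 3D latent features; 3. Introduces the new AND dataset, containing 4 hours of real-world navigation recordings.
Method: Temporally aggregated 3D latent features model the geometric and semantic constraints of the environment, covering both its static and dynamic parts. Data are collected with Project Aria glasses.
Result: The model learns human navigation behaviors (such as waiting, rerouting, and looking around for traffic) and generalizes to unseen environments.
Insight: Modeling 3D features with geometric and semantic constraints better captures human active navigation behavior, and real-world datasets are essential for training.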
Abstract: The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.
[55] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
Haoran Bai,Xiaoxu Chen,Canqian Yang,Zongyao He,Sibin Deng,Ying Chen
Main category: cs.CV
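TL;DR: Vivid-VR is a DiT-based generative video restoration method that uses ControlNet to control the generation process and keep content consistent. To address the quality degradation caused by conventional fine-tuning, it proposes a concept distillation training strategy and an enhanced control architecture, significantly improving texture realism and temporal consistency.
Details
Motivation: Existing T2V-based controllable video restoration methods suffer distribution drift during fine-tuning because of imperfect multimodal alignment, which harms texture realism and temporal consistency. Contribution: 1. Proposes a concept distillation training strategy that uses the pretrained T2V model to synthesize training samples with embedded textual concepts; 2. Redesigns the control architecture with a control feature projector and a dual-branch ControlNet connector.
Method: 1. The concept distillation strategy preserves texture and temporal quality; 2. The control feature projector filters degradation artifacts; 3. The dual-branch ControlNet connector combines an MLP with a cross-attention mechanism.
Result: Outperforms existing methods on synthetic and real-world benchmarks, achieving high texture realism and temporal consistency.
Insight: The generative capability of a pretrained model, together with dynamic control-signal modulation, can significantly improve the quality and controllability of video restoration.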
Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
[56] PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments
Bernd Hofmann,Albert Scheck,Joerg Franke,Patrick Bruendl
Main category: cs.CV
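TL;DR: PB-IAD is an industrial anomaly detection framework built on multimodal foundation models. Using prompt templates and semantic instructions, it handles the sparse data and high adaptability demands of dynamic manufacturing environments without large annotated datasets.
Details
Motivation: Traditional industrial anomaly detection methods depend on large annotated datasets and lack flexibility, so they cannot adapt to dynamic production environments. Advances in multimodal foundation models offer a new opportunity. Contribution: 1. Proposes the PB-IAD framework, which uses the multimodal and reasoning capabilities of foundation models for anomaly detection; 2. Designs a prompt template and a pre-processing module that let users customize the system flexibly; 3. Performs well in data-sparse and low-shot scenarios.
Method: 1. Uses the perception capabilities of multimodal foundation models; 2. Designs a prompt template to inject domain knowledge; 3. Converts user inputs into system prompts. Experiments evaluate three manufacturing scenarios using GPT-4.1.
Result: PB-IAD outperforms state-of-the-art methods such as PatchCore in data-sparse and low-shot settings, achieving high performance through semantic instructions alone.
Insight: Combining prompt engineering with multimodal foundation models opens new possibilities for industrial anomaly detection, especially when annotated data are scarce.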
Abstract: The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
[57] Adversarial Generation and Collaborative Evolution of Safety-Critical Scenarios for Autonomous Vehicles
Jiangfan Liu,Yongkang Guo,Fangzhi Zhong,Tianyuan Zhang,Zonglei Jing,Siyuan Liang,Jiakai Wang,Mingchuan Zhang,Aishan Liu,Xianglong Liu
Main category: cs.CV
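TL;DR: Proposes the ScenGE framework, which generates diverse safety-critical scenarios for autonomous vehicles through adversarial generation and collaborative evolution, markedly increasing the severity of collision cases; experiments confirm its effectiveness and practicality.
Details
Motivation: Current safety evaluation of autonomous vehicles relies on predefined threat patterns or rule-based methods, which struggle to expose diverse and unforeseen failure modes, so a method is needed that generates more diverse and more severe safety-critical scenarios. Contribution: 1. Proposes ScenGE, which combines a large language model with an adversarial collaborator graph to generate diverse and severe safety-critical scenarios; 2. The framework can be deployed on multiple simulators and improves model robustness; 3. Validates the plausibility and criticality of the scenarios through real-world vehicle tests and human evaluation.
Method: 1. Meta-Scenario Generation: a large language model grounded in structured driving knowledge infers and generates challenging adversarial behavior; 2. Complex Scenario Evolution: an adversarial collaborator graph is built to optimize key agent trajectories and amplify the core threat.
Result: Experiments show that ScenGE uncovers 31.96% more severe collision cases on average than existing methods, and adversarial training on its scenarios improves model robustness. Real-world tests and human evaluation also confirm the plausibility of the scenarios.
Insight: 1. Adversarial generation and collaborative evolution can effectively produce diverse safety-critical scenarios; 2. Large language models show promise for scenario generation; 3. The generated scenarios have practical value for improving autonomous driving safety.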
Abstract: The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation in autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these limitations, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning about novel adversarial cases and then amplifying them with complex traffic flows. Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by Meta-Scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle’s maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, ScenGE can be applied to large model based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves the model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our paper is a critical step towards building public trust in autonomous vehicles and ensuring their safe deployment.
[58] WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion
Yonghan Shin,SeungKyu Kim,Won-Ki Jeong
Main category: cs.CV
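TL;DR: WISE-FUSE proposes an efficient whole slide image (WSI) encoding framework that combines knowledge from vision-language models (VLMs) and large language models (LLMs) to process diagnostically relevant regions selectively, substantially reducing computational load and encoding time.
Details
Motivation: The gigapixel scale of whole slide images (WSIs) creates a major computational challenge: conventional methods must process tens to hundreds of thousands of high-resolution patches per slide, which makes encoding too costly and too slow for efficient real-world deployment. Contribution: 1. Proposes an adaptive framework that selects diagnostically relevant regions from coarse to fine using a knowledge distillation mechanism. 2. Fuses VLM and LLM knowledge into image encoding to reinforce diagnostic context.
Method: 1. Coarsely selects informative regions by scoring the similarity between low-resolution patches and class-specific textual descriptions. 2. Selectively encodes the corresponding high-resolution patches and fuses them with text embeddings. 3. Avoids computation on irrelevant patches.
Result: Experiments show that WISE-FUSE cuts encoding time by more than a factor of three while matching or surpassing the diagnostic performance of exhaustive patch processing.
Insight: Combining multimodal knowledge (visual and textual) can reduce computation while maintaining or even improving diagnostic performance, providing a practical solution for computational pathology.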
Abstract: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks, making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
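The coarse selection step can be pictured as ranking low-resolution patch embeddings by their similarity to text embeddings and keeping only the top few. The sketch below assumes cosine similarity, a max over class descriptions, and a fixed top-k cutoff; the abstract does not specify these details, and the embeddings, the `select_patches` helper, and `k` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def select_patches(patch_embeddings, text_embeddings, k):
    """Return indices of the k patches most similar to any text description.

    Each patch is scored by its best match over all text embeddings
    (an assumed aggregation rule), then patches are ranked by that score.
    """
    scores = [max(cosine(p, t) for t in text_embeddings) for p in patch_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy 2-D embeddings standing in for VLM outputs.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
texts = [[1.0, 0.1]]
selected, scores = select_patches(patches, texts, k=2)
```

Only the patches in `selected` would then be encoded at high resolution, which is where the reported time savings would come from.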
[59] Improving OCR using internal document redundancy
Diego Belzarena,Seginus Mowlavi,Aitor Artola,Camilo Mariño,Marina Gardella,Ignacio Ramírez,Antoine Tadros,Roy He,Natalia Bottaioli,Boshra Rajaei,Gregory Randall,Jean-Michel Morel
Main category: cs.CV
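TL;DR: The paper proposes an unsupervised method that exploits the redundancy of character shapes within a document to improve the output of an OCR system, using an extended Gaussian Mixture Model (GMM) and statistical testing to raise character recognition accuracy.
Details
Motivation: Current OCR systems perform poorly on low-quality data, particularly printed documents, and do not fully exploit the redundancy within each document. Contribution: Proposes an unsupervised method that exploits intra-document redundancy, improving OCR results through an extended GMM and statistical testing.
Method: An extended Gaussian Mixture Model (GMM) alternates the EM algorithm with an intra-cluster realignment process and normality testing to refine the clustering of character shapes.
Result: Demonstrates improvements on documents with varying levels of degradation (such as Uruguayan military archives and historical European newspapers).
Insight: Redundancy within a document is an effective resource for improving OCR systems, especially on low-quality data.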
Abstract: Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document’s redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
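As background for the extended GMM, here is a minimal sketch of the plain EM loop for a one-dimensional Gaussian mixture. It covers only the standard E-step and M-step; the paper's extension alternates EM with intra-cluster realignment and normality testing, neither of which is shown. The scalar data, initial parameters, and the `em_gmm_1d` name are illustrative assumptions (real inputs would be character-shape features).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with plain EM (toy version, few safeguards)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor to avoid a collapsed component
            weights[j] = nj / len(data)
    return means, variances, weights


# Two well-separated groups of scalar "shape features".
data = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this toy data the two component means settle near the two group centers, which mirrors how repeated glyphs of the same character would gather into one cluster.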
[60] A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives
Juepeng Zheng,Zi Ye,Yibin Wen,Jianxi Huang,Zhiwei Zhang,Qingmei Li,Qiong Hu,Baodong Xu,Lingyuan Zhao,Haohuan Fu
Main category: cs.CV
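TL;DR: This review summarizes research progress on agricultural parcel and boundary delineation (APBD) from remote sensing images, categorizes traditional image processing, machine learning, and deep learning methods, and discusses future directions.
Details
Motivation: Advances in high-resolution remote sensing imagery make automated, efficient, and accurate agricultural parcel analysis possible, which drives research in the APBD field. The review aims to organize APBD methods systematically and give researchers a clear knowledge map and view of development trends. Contribution: 1. Categorizes APBD methods (traditional image processing, machine learning, deep learning); 2. Discusses the use of deep learning for APBD in depth; 3. Analyzes key issues such as multi-sensor data and multi-task learning; 4. Proposes promising directions for future research.
Method: Through a meta-data analysis (algorithm, study site, crop type, sensor type, evaluation method, and so on), the review divides APBD methods into three classes: traditional image processing (pixel-, edge-, and region-based), traditional machine learning (such as random forest), and deep learning (such as semantic segmentation and Transformer-based methods).
Result: The literature review shows that deep learning methods dominate the APBD field and perform especially well on high-resolution remote sensing imagery. The review also summarizes comparisons among methods and the scenarios each one suits.
Insight: 1. Deep learning has great potential for APBD, especially when combined with emerging techniques such as Transformers; 2. Multi-sensor data and multi-task learning are key to improving APBD performance; 3. Future work could address automated annotation, few-shot learning, and cross-domain generalization.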
Abstract: Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Many studies that aim at providing an inventory at the level of each agricultural parcel have generated many methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. With the goal of providing a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches contributing to a majority, we further discuss deep learning-based methods like semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in APBD task, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.
[61] Controllable Latent Space Augmentation for Digital Pathology
Sofiène Boutaj,Marin Scalbert,Pierre Marza,Florent Couzinie-Devy,Maria Vakalopoulou,Stergios Christodoulidis
Main category: cs.CV
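TL;DR: The paper proposes HistAug, a generative model for controllable latent-space augmentation in digital pathology, which addresses the shortcomings of traditional image augmentation and improves the performance of multiple instance learning models.
Details
Motivation: Because whole slide images (WSIs) have gigapixel resolution and dense supervision signals are scarce, traditional image augmentation struggles to increase data diversity and reduce overfitting efficiently in digital pathology. Contribution: The main contribution is HistAug, a fast and efficient generative model that produces controllable augmented features in latent space while preserving the initial semantic information.
Method: Generates augmented features in latent space by conditioning on explicit transformations (such as hue and erosion), which gives precise control over transformation semantics and allows large numbers of patches to be processed efficiently.
Result: Experiments show that HistAug outperforms existing methods across multiple organs and in low-data tasks, improving the performance of multiple instance learning models.
Insight: Learned transformations outperform noise-based perturbations, and uniform WSI-wise augmentation is important.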
Abstract: Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.
[62] Reliable Smoke Detection via Optical Flow-Guided Feature Fusion and Transformer-Based Uncertainty Modeling
Nitish Kumar Mahala,Muzammil Khan,Pushpendra Kumar
Main category: cs.CV
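TL;DR: The paper proposes a reliable smoke detection method based on optical flow-guided feature fusion and Transformer-based uncertainty modeling; a Two-Phase Uncertainty-Aware Shifted Windows Transformer improves detection robustness.
Details
Motivation: The complex spatiotemporal dynamics of smoke, illumination changes, and environmental noise make traditional detectors unreliable, so a high-accuracy early-warning method that avoids complex multi-sensor setups is needed. Contribution: 1. A smoke detection framework with optical flow-guided feature fusion and uncertainty modeling; 2. A dual-phase variational optical flow model inspired by the four-color theorem; 3. A Shifted-Windows Transformer with multi-scale uncertainty estimation.
Method: 1. Builds a smoke segmentation dataset using optical flow-based motion encoding; 2. Fuses optical flow with appearance features via a Gaussian Mixture Model; 3. Optimizes detection accuracy and uncertainty estimation through two-phase training.
Result: Experiments show that the method outperforms existing techniques on multiple metrics, with strong generalization and robustness.
Insight: By jointly modeling aleatoric and epistemic uncertainty, the model assesses prediction confidence more reliably, which suits industrial safety and surveillance settings.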
Abstract: Fire outbreaks pose critical threats to human life and infrastructure, necessitating high-fidelity early-warning systems that detect combustion precursors such as smoke. However, smoke plumes exhibit complex spatiotemporal dynamics influenced by illumination variability, flow kinematics, and environmental noise, undermining the reliability of traditional detectors. To address these challenges without the logistical complexity of multi-sensor arrays, we propose an information-fusion framework that integrates smoke feature representations extracted from monocular imagery. Specifically, we propose a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust and reliable smoke detection, leveraging a novel smoke segmentation dataset constructed via optical flow-based motion encoding. The optical flow estimation is performed with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, which preserves motion discontinuities. The resulting color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of the smoke regions. These fused representations are fed into the novel Shifted-Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. The first learning phase optimizes smoke detection accuracy, while during the second phase, the model learns to estimate plausibility confidence in its predictions by jointly modeling aleatoric and epistemic uncertainties. Extensive experiments using multiple evaluation metrics and comparative analysis with state-of-the-art approaches demonstrate superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications.
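The distinction between aleatoric and epistemic uncertainty can be made concrete with a common decomposition. The sketch below assumes several stochastic forward passes (for example, an ensemble or Monte Carlo dropout), each returning a predicted mean and a predicted variance for one pixel; aleatoric uncertainty is taken as the average predicted variance and epistemic uncertainty as the variance of the predicted means. This is a generic recipe, not the paper's multi-scale uncertainty head, and the numbers and the `decompose_uncertainty` name are made up.

```python
def decompose_uncertainty(means, variances):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    means[t], variances[t]: predicted mean and variance from the t-th
    stochastic forward pass (assumed setup, e.g. an ensemble).
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal in length")
    count = len(means)
    mean_prediction = sum(means) / count
    aleatoric = sum(variances) / count  # average data noise
    epistemic = sum((m - mean_prediction) ** 2 for m in means) / count  # model disagreement
    return mean_prediction, aleatoric, epistemic


# Three passes for one pixel's smoke probability (illustrative values).
mean_prediction, aleatoric, epistemic = decompose_uncertainty(
    means=[0.8, 0.6, 0.7], variances=[0.02, 0.04, 0.03]
)
total_uncertainty = aleatoric + epistemic
```

A detector could then flag pixels whose total uncertainty exceeds a threshold for human review instead of trusting the raw prediction.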
[63] Incremental Object Detection with Prompt-based Methods
Matthias Neuwirth-Trapp,Maarten Bieshaar,Danda Pani Paudel,Luc Van Gool
Main category: cs.CV
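TL;DR: This paper studies visual prompt-based methods for incremental object detection (IOD) and finds that they underperform in a complex domain-incremental learning setting, but achieve the best results when combined with replay of a small amount of previous data.
Details
Motivation: Investigate how well visual prompt-based incremental learning methods generalize to object detection, filling a gap in this area. Contribution: The first application of visual prompt methods to IOD, an experimental comparison of three different prompt methods, and a practical method that combines prompts with data replay.
Method: Analyzes three different visual prompt methods in a complex domain-incremental learning setting, and adds replay of a small amount of previous data to improve performance.
Result: Experiments show that pure visual prompt methods perform poorly in IOD, while adding data replay improves results markedly; experiments on prompt length and initialization provide further insight.
Insight: Prompt methods need to be combined with other techniques, such as data replay, to perform best in IOD, which points to a direction for future research.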
Abstract: Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
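The replay component can be illustrated with a short sketch. It assumes a simple scheme in which each training batch for the new domain is topped up with a small random sample from a buffer of earlier-domain examples; the function name, the 10% fraction, and the buffer layout are all hypothetical, and the paper's actual replay strategy may differ.

```python
import random


def mix_with_replay(current, replay_buffer, replay_fraction=0.1, seed=0):
    """Return a shuffled batch of current samples plus a few replayed ones."""
    rng = random.Random(seed)
    # Number of old samples to add, capped by what the buffer holds.
    n_replay = min(len(replay_buffer), round(len(current) * replay_fraction))
    batch = list(current) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(batch)
    return batch


new_domain = [f"new_{i}" for i in range(20)]
old_domains = [f"old_{i}" for i in range(50)]
batch = mix_with_replay(new_domain, old_domains)
```

Only the learnable prompt vectors would be updated on such a batch, since the prompt-based methods described in the abstract keep the detector itself frozen.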
[64] Virtual Community: An Open World for Humans, Robots, and Society
Qinhong Zhou,Hongxin Zhang,Xiangye Lin,Zheyuan Zhang,Yutian Chen,Wenjun Liu,Zunzhe Zhang,Sunli Chen,Lixing Fang,Qiushi Lyu,Xinyu Sun,Jincheng Yang,Zeyuan Wang,Bao Chi Dang,Zhehuan Chen,Daksha Ladia,Jiageng Liu,Chuang Gan
Main category: cs.CV
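TL;DR: This paper presents Virtual Community, an open-world platform for studying the coexistence of humans and robots in society. It supports multi-agent cooperation and competition, and introduces two new challenge tasks.
Details
Motivation: With rapid progress in AI and robotics, society is heading toward a transformation in which humans and robots coexist, and the resulting opportunities and challenges need to be studied. Contribution: An open-source multi-agent physics simulator platform, Virtual Community, that supports human-robot interaction in an open world, plus two new challenge tasks for evaluating cooperation and planning ability.
Method: Builds a large-scale, real-world-aligned community generation pipeline on top of a universal physics engine and real-world 3D scenes, supporting diverse indoor and outdoor scenes and multi-agent interaction.
Result: Experiments show that existing methods struggle with both high-level open-world task planning and low-level cooperative control, demonstrating the platform's usefulness.
Insight: Virtual Community provides a new platform for studying the social intelligence of humans and robots in open worlds, supporting further research in this area.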
Abstract: The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community, an open-world platform for humans, robots, and society, built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.
[65] UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Peiming Li,Ziyi Wang,Yulin Yuan,Hong Liu,Xiangming Meng,Junsong Yuan,Mengyuan Liu
Main category: cs.CV
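TL;DR: UST-SSM proposes a unified spatio-temporal state space model for point cloud videos. It reorganizes unordered points with Spatial-Temporal Selection Scanning (STSS), and strengthens spatio-temporal features and temporal dependencies with Spatio-Temporal Structure Aggregation (STSA) and Temporal Interaction Sampling (TIS), enabling efficient dynamic 3D action recognition.
Details
Motivation: Point cloud videos capture dynamic 3D motion, which gives them an advantage in recognizing continuous and fine-grained human actions, but their spatio-temporal disorder prevents Selective State Space Models (SSMs) from being applied directly, so a new approach is needed. Contribution: 1. Proposes the Unified Spatio-Temporal State Space Model (UST-SSM), extending SSMs to point cloud videos.
2. Designs the STSS, STSA, and TIS modules, which address spatio-temporal disorder, missing features, and temporal dependency enhancement, respectively.
Method: 1. STSS reorganizes unordered points into semantic-aware sequences through prompt-guided clustering.
2. STSA aggregates spatio-temporal features and compensates for missing geometric and motion details.
3. TIS enhances temporal dependencies through non-anchor frame utilization and expanded receptive fields.
Result: Experiments on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of UST-SSM.
Insight: 1. The spatio-temporal disorder of point cloud videos can be addressed through semantic reorganization and feature aggregation.
2. Stronger temporal interaction is critical for dynamic 3D action recognition.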
Abstract: Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
[66] SMTrack: End-to-End Trained Spiking Neural Networks for Multi-Object Tracking in RGB Videos
Pengzhi Zhong,Xinzhe Wang,Dan Zeng,Qihua Zhou,Feixiang He,Shuiwang Li
Main category: cs.CV
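TL;DR: SMTrack is the first framework to train a deep spiking neural network (SNN) end-to-end directly on standard RGB videos for multi-object tracking. With an adaptive, scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) and the TrackTrack identity module, it achieves performance comparable to mainstream artificial neural network (ANN)-based MOT methods.
Details
Motivation: Although spiking neural networks (SNNs) show potential for low-power computation, their use in vision has been largely confined to image classification, object detection, and event-based tracking. Direct training of SNNs for complex temporal tasks such as multi-object tracking (MOT) on standard RGB videos remains underexplored. Contribution: 1. Proposes SMTrack, the first end-to-end trained deep SNN framework for multi-object tracking on standard RGB videos.
2. Introduces an adaptive, scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) that improves detection and localization across object scales and for small objects.
3. Incorporates the TrackTrack identity module to make trajectories more robust and consistent.
Method: 1. Uses a deep SNN architecture trained end-to-end.
2. Proposes Asa-NWDLoss, which dynamically adjusts the normalization factor to adapt training to different object scales.
3. Introduces the TrackTrack identity module in the association stage to improve trajectory consistency.
Result: Experiments on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack performs on par with mainstream ANN-based MOT methods, demonstrating the capability of SNNs for multi-object tracking in complex scenes.
Insight: Training an SNN directly on RGB video for MOT demonstrates the potential of SNNs for complex temporal tasks and offers a new direction for low-power vision systems.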
Abstract: Brain-inspired Spiking Neural Networks (SNNs) exhibit significant potential for low-power computation, yet their application in visual tasks remains largely confined to image classification, object detection, and event-based tracking. In contrast, real-world vision systems still widely use conventional RGB video streams, where the potential of directly trained SNNs for complex temporal tasks such as multi-object tracking (MOT) remains underexplored. To address this challenge, we propose SMTrack, the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. SMTrack introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) to improve detection and localization performance under varying object scales and densities. Specifically, the method computes the average object size within each training batch and dynamically adjusts the normalization factor, thereby enhancing sensitivity to small objects. For the association stage, we incorporate the TrackTrack identity module to maintain robust and consistent object trajectories. Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack achieves performance on par with leading ANN-based MOT methods, advancing robust and accurate SNN-based tracking in complex scenarios.
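To make the loss more concrete, the sketch below uses the standard normalized Wasserstein distance for boxes modeled as 2D Gaussians, NWD = exp(-W2 / C), and then picks C from the batch. The abstract only says that the normalization factor is adjusted from the average object size in each batch, so the specific rule used here (C equal to the mean of sqrt(w * h) over the targets) and all function names are assumptions for illustration, not the paper's definition.

```python
import math


def nwd(box_a, box_b, c):
    """Normalized Wasserstein distance between two (cx, cy, w, h) boxes."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    # 2-Wasserstein distance between Gaussians N(center, diag(w^2/4, h^2/4)).
    w2 = math.sqrt(
        (cxa - cxb) ** 2 + (cya - cyb) ** 2
        + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    )
    return math.exp(-w2 / c)


def batch_scale(targets):
    """Assumed adaptive factor: mean object size in the batch."""
    return sum(math.sqrt(w * h) for _, _, w, h in targets) / len(targets)


def adaptive_nwd_loss(preds, targets):
    """Mean (1 - NWD) with a batch-dependent normalization factor."""
    c = batch_scale(targets)
    return sum(1.0 - nwd(p, t, c) for p, t in zip(preds, targets)) / len(targets)


# The same 2-pixel center error, once on small boxes and once on large ones.
small = adaptive_nwd_loss([(12, 10, 4, 4)], [(10, 10, 4, 4)])
large = adaptive_nwd_loss([(12, 10, 40, 40)], [(10, 10, 40, 40)])
```

With this choice of C, the same pixel offset is penalized more heavily in a batch of small objects than in a batch of large ones, which is one way to obtain the increased sensitivity to small objects that the abstract describes.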
[67] AnchorSync: Global Consistency Optimization for Long Video Editing
Zichi Liu,Yinggui Wang,Tao Wei,Chao Ma
Main category: cs.CV
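TL;DR: AnchorSync is a diffusion-based video editing framework that achieves global consistency and temporal coherence in long videos through sparse anchor frame editing and intermediate frame interpolation, outperforming existing methods.
Details
Motivation: Long video editing suffers from global structural drift and temporal inconsistency, and existing methods struggle to maintain high-quality edits over minute-long sequences. Contribution: Proposes the AnchorSync framework, which handles anchor frames and intermediate frames separately and combines progressive denoising with multimodal guidance to achieve high-quality long video editing.
Method: Uses a diffusion model and decomposes editing into sparse anchor frame editing and intermediate frame interpolation, maintaining global consistency and temporal dynamics through progressive denoising and multimodal guidance.
Result: Experiments show that AnchorSync surpasses existing methods in visual quality and temporal stability, producing coherent, high-fidelity edits.
Insight: The key to long video editing lies in decoupling the task and applying dynamic consistency constraints; combining diffusion models with multimodal guidance is an effective way to improve results.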
Abstract: Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
[68] GeMS: Efficient Gaussian Splatting for Extreme Motion Blur
Gopi Raju Matta,Trisha Reddypalli,Vemunuri Divya Madhuri,Kaushik Mitra
Main category: cs.CV
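TL;DR: GeMS is a 3D Gaussian Splatting (3DGS) framework for extremely motion-blurred images that reconstructs scenes directly from blurred inputs without relying on sharp images. GeMS-E adds event-based refinement on top of it to further improve reconstruction.
Details
Motivation: Existing deblurring methods for extreme blur (such as ExBluRF and Deblur-GS) rely on sharp images for pose estimation and point cloud generation, while COLMAP-based methods (such as BAD-Gaussians) suffer from unreliable feature correspondences under severe blur. These assumptions do not hold in practice, so a solution that reconstructs scenes directly from blurred inputs is needed. Contribution: 1. Proposes GeMS, which reconstructs 3D scenes directly from extremely blurred images; 2. Integrates VGGSfM (a deep learning-based SfM pipeline) and 3DGS-MCMC (initialization that treats Gaussians as samples from a probability distribution); 3. Proposes GeMS-E, which uses event data (EDI deblurring) to refine the reconstruction; 4. Achieves SOTA performance on synthetic and real-world datasets.
Method: 1. VGGSfM: estimates camera poses and generates point clouds from blurred inputs; 2. 3DGS-MCMC: treats Gaussians as samples from a probability distribution, avoiding heuristic densification and pruning; 3. Joint optimization of camera trajectories and Gaussian parameters; 4. GeMS-E adds EDI deblurring to produce sharper intermediate images that improve the reconstruction.
Result: GeMS and GeMS-E outperform existing methods on synthetic and real-world datasets and are, to the authors' knowledge, the first to address extreme motion blur within 3DGS directly from severely blurred inputs.
Insight: 1. Reconstructing directly from blurred inputs is feasible; 2. Event data (e.g., via EDI) can effectively improve reconstruction of blurred scenes; 3. 3DGS-MCMC provides a robust initialization method for Gaussian splatting.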
Abstract: We introduce GeMS, a framework for 3D Gaussian Splatting (3DGS) designed to handle severely motion-blurred images. State-of-the-art deblurring methods for extreme blur, such as ExBluRF, as well as Gaussian Splatting-based approaches like Deblur-GS, typically assume access to sharp images for camera pose estimation and point cloud generation, an unrealistic assumption. Methods relying on COLMAP initialization, such as BAD-Gaussians, also fail due to unreliable feature correspondences under severe blur. To address these challenges, we propose GeMS, a 3DGS framework that reconstructs scenes directly from extremely blurred images. GeMS integrates: (1) VGGSfM, a deep learning-based Structure-from-Motion pipeline that estimates poses and generates point clouds directly from blurred inputs; (2) 3DGS-MCMC, which enables robust scene initialization by treating Gaussians as samples from a probability distribution, eliminating heuristic densification and pruning; and (3) joint optimization of camera trajectories and Gaussian parameters for stable reconstruction. While this pipeline produces strong results, inaccuracies may remain when all inputs are severely blurred. To mitigate this, we propose GeMS-E, which integrates a progressive refinement step using events: (4) Event-based Double Integral (EDI) deblurring restores sharper images that are then fed into GeMS, improving pose estimation, point cloud generation, and overall reconstruction. Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets. To our knowledge, this is the first framework to address extreme motion blur within 3DGS directly from severely blurred inputs.
[69] Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models
Jiabo Huang,Chen Chen,Lingjuan Lyu
Main category: cs.CV
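TL;DR: The paper proposes a model-driven approach to training vision foundation models (VFMs). Through joint knowledge transfer and preservation, it uses the knowledge of multiple pretrained models to build a general-purpose VFM, avoiding the need to train on large-scale data.
Details
Motivation: Vision foundation models currently rely mainly on data-driven methods that require large amounts of high-quality labeled data and compute, which holds back most institutions. Meanwhile, many open-source domain-specific models already encode rich knowledge, and making effective use of them is a key challenge. Contribution: 1. Proposes a model-driven approach that combines the knowledge of multiple pretrained teacher models; 2. Addresses the distributional gaps between teacher models through a shared latent space; 3. Introduces a knowledge preservation strategy that uses a general-purpose teacher as the base for integrating domain-specific knowledge.
Method: 1. Unifies multiple teacher models in a shared latent space to balance transfer; 2. Designs an adapter module to preserve and integrate knowledge; 3. Builds a general-purpose VFM that supports multiple downstream tasks.
Result: The method outperforms existing data-driven models on four fundamental tasks: image classification, object detection, semantic segmentation, and instance segmentation.
Insight: Joint knowledge transfer and preservation makes effective use of existing pretrained models, reduces dependence on large-scale data, and improves the model's generality and performance.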
Abstract: Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data, usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. In addition, we introduce a knowledge preservation strategy that takes a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers’ expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks: image classification, object detection, semantic segmentation, and instance segmentation.
[70] Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Leila Cheshmi,Mennatullah Siam
Main category: cs.CV
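TL;DR: This paper proposes a multiscale video transformer for class-agnostic segmentation in autonomous driving. It detects unknown objects from motion cues, avoiding the limitation of relying on known classes, and introduces an efficient decoder and memory design that preserves high-resolution information and captures multiscale features.
Details
Motivation: Safe autonomous driving requires handling unknown objects and unforeseen scenarios. Existing video segmentation methods typically rely on training data for known classes and overlook novel categories. In addition, visual grounding methods based on large language models are computationally expensive and ill-suited to pixel-level output. An efficient class-agnostic segmentation method is therefore needed. Contribution: 1. Proposes a multiscale video transformer that performs class-agnostic segmentation using only motion cues.
2. Designs a multi-stage multiscale query-memory decoder and a random drop-token mechanism, improving efficiency and accuracy.
3. Preserves high-resolution information through a shared learnable memory module, avoiding the feature compression of conventional decoders.
Method: 1. A video transformer trained end-to-end, with no optical flow input.
2. A multi-stage multiscale query-memory decoder that captures spatiotemporal features.
3. A scale-specific random drop-token mechanism to improve efficiency.
Result: Experiments on DAVIS’16, KITTI, and Cityscapes show that the method consistently outperforms multiscale baselines while being efficient in GPU memory and run time, making it suitable for real-time dense prediction.
Insight: The paper proposes a memory-centric design that preserves high-resolution information, while multiscale features and motion cues enable detection of unknown objects, offering a new direction for safety-critical robotics.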
Abstract: Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS’16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
[71] Fusing Monocular RGB Images with AIS Data to Create a 6D Pose Estimation Dataset for Marine Vessels
Fabian Holst,Emre Gülsoylu,Simone Frintrop
Main category: cs.CV
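TL;DR: This paper proposes a new method for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with AIS data. It addresses the limitations of relying on AIS alone and produces a high-quality dataset without manual annotation.
Details
Motivation: Conventional approaches rely on AIS data for vessel positions, but AIS suffers from equipment reliability issues, data manipulation, and transmission delays. To overcome these limitations, the paper combines visual data with AIS data. Contribution: 1. Proposes a technique that fuses monocular RGB images with AIS data to generate a 6D pose dataset; 2. Evaluates the YOLOX-X detector and compares two coordinate transformation methods (PnP outperforms homography); 3. Releases the public dataset BONK-pose.
Method: 1. Detect vessels with YOLOX-X; 2. Align AIS data with image coordinates using the PnP method; 3. Generate a 6D pose dataset annotated with 3D bounding boxes.
Result: PnP yields a significantly lower projection error than homography; YOLOX-X reaches an mAP of 0.80 at an IoU threshold of 0.5; the released dataset contains 3753 annotated images.
Insight: Fusing visual and AIS data can generate pose datasets efficiently and reduce the need for manual annotation; PnP performs better for coordinate alignment.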
Abstract: The paper presents a novel technique for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with Automatic Identification System (AIS) data. The proposed technique addresses the limitations of relying purely on AIS for location information, caused by issues like equipment reliability, data manipulation, and transmission delays. By combining vessel detections from monocular RGB images, obtained using an object detection network (YOLOX-X), with AIS messages, the technique generates 3D bounding boxes that represent the vessels’ 6D poses, i.e. spatial and rotational dimensions. The paper evaluates different object detection models to locate vessels in image space. We also compare two transformation methods (homography and Perspective-n-Point) for aligning AIS data with image coordinates. The results of our work demonstrate that the Perspective-n-Point (PnP) method achieves a significantly lower projection error compared to homography-based approaches used before, and the YOLOX-X model achieves a mean Average Precision (mAP) of 0.80 at an Intersection over Union (IoU) threshold of 0.5 for relevant vessel classes. Our results indicate that the approach allows the creation of a 6D pose estimation dataset without needing manual annotation. Additionally, we introduce the Boats on Nordelbe Kehrwieder (BONK-pose), a publicly available dataset comprising 3753 images with 3D bounding box annotations for pose estimation, created by our data fusion approach. This dataset can be used for training and evaluating 6D pose estimation networks. In addition, we introduce a set of 1000 images with 2D bounding box annotations for ship detection from the same scene.
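The projection error mentioned in the abstract can be illustrated with a pinhole camera model. The sketch below does not solve PnP itself; it assumes a camera pose (rotation R and translation t) is already available, for example from a PnP solver, and measures how far projected 3D points land from their observed pixels. All names and the intrinsics (fx, fy, cx, cy) are hypothetical values chosen for illustration.

```python
import math


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    # Camera-frame coordinates: X_cam = R * X_world + t.
    X, Y, Z = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )
    return (fx * X / Z + cx, fy * Y / Z + cy)


def mean_reprojection_error(points_world, pixels, R, t, fx, fy, cx, cy):
    """Average pixel distance between projected and observed points."""
    errors = [
        math.dist(project(p, R, t, fx, fy, cx, cy), uv)
        for p, uv in zip(points_world, pixels)
    ]
    return sum(errors) / len(errors)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
error = mean_reprojection_error(
    points_world=[(1.0, 2.0, 10.0)],   # e.g. a vessel position from AIS
    pixels=[(743.0, 564.0)],           # e.g. a detection in the image
    R=identity, t=(0.0, 0.0, 0.0),
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
```

A comparison like the one in the paper would evaluate such an error for poses obtained with each alignment method and prefer the method with the lower value.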
[72] 6-DoF Object Tracking with Event-based Optical Flow and Frames
Zhichao Li,Arren Glover,Chiara Bartolozzi,Lorenzo Natale
Main category: cs.CV
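TL;DR: This paper proposes combining event-camera optical flow with RGB-based global pose estimation for 6-degree-of-freedom (6-DoF) pose tracking of fast-moving objects.
Details
Motivation: Conventional cameras struggle to track 6-DoF object pose in real time in highly dynamic scenes because of frame rate limits and motion blur. Event cameras offer high temporal resolution and low latency, while RGB cameras provide rich visual information. Contribution: Combining the strengths of event and RGB cameras, the paper proposes an event-based optical flow method for measuring object motion and fuses it with low-frequency global pose estimates to track the 6-DoF pose of fast-moving objects.
Method: 1. Design an event-based optical flow algorithm to measure object motion; 2. Fuse it with low-frequency global pose estimates from the RGB camera to track pose under high-speed motion.
Result: The method is validated on synthetic and real-world data and is particularly effective in high-speed motion scenarios.
Insight: The complementarity of event and RGB cameras overcomes the limitations of conventional cameras in highly dynamic scenes and provides more robust pose tracking for robotic interaction.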
Abstract: Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency, and high dynamic range, which can potentially overcome the impact of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB-based global object pose estimator for 6-DoF pose tracking of objects at high speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with the low-frequency pose estimates from the global pose estimator, the method can track pose when objects move at high speed. The proposed algorithm is tested and validated on both synthetic and real-world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
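The fusion idea can be sketched in a simplified form: integrate a high-rate velocity estimate between frames, and pull the state toward the global pose whenever a low-rate estimate arrives. The sketch below tracks translation only (no rotation) and uses a plain blending gain; the class name, the gain, and the update rates are assumptions for illustration and do not reproduce the paper's tracker.

```python
class VelocityPoseTracker:
    """Dead-reckoning on velocity with occasional absolute corrections."""

    def __init__(self, position):
        self.position = list(position)

    def integrate(self, velocity, dt):
        """High-rate step: advance the position using the measured velocity."""
        self.position = [p + v * dt for p, v in zip(self.position, velocity)]

    def correct(self, measured_position, gain=1.0):
        """Low-rate step: move toward the global pose estimate."""
        self.position = [
            p + gain * (m - p)
            for p, m in zip(self.position, measured_position)
        ]


tracker = VelocityPoseTracker((0.0, 0.0, 0.0))
for _ in range(100):                       # 100 velocity updates at 1 kHz
    tracker.integrate((1.0, 0.0, 0.0), dt=0.001)
tracker.correct((0.12, 0.0, 0.0), gain=0.5)  # one RGB-based pose update
```

The velocity integration keeps the estimate current between frames, while the correction step limits the drift that pure integration would accumulate.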
[73] MF-LPR$^2$: Multi-Frame License Plate Image Restoration and Recognition using Optical Flow
Kihyun Na,Junseok Oh,Youngkwan Cho,Bumjin Kim,Sungmin Cho,Jinyoung Choi,Injung Kim
Main category: cs.CV
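TL;DR: MF-LPR$^2$ is a multi-frame license plate image restoration and recognition framework that uses optical flow to align neighboring frames, improving restoration and recognition of low-quality images.
Details
Motivation: Existing generative models rely on pretrained priors and cannot reliably restore license plate images affected by low resolution, motion blur, and glare, often introducing severe distortions. Contribution: 1. Proposes the MF-LPR$^2$ framework, which aligns multiple frames with optical flow to improve restoration and recognition; 2. Builds the RLPR dataset, containing real-world low-quality image sequences paired with high-quality pseudo ground-truth images.
Method: Uses a state-of-the-art optical flow estimator together with algorithms that exploit spatio-temporal consistency to detect and correct optical flow errors, with filtering and refinement algorithms that improve image quality and recognition accuracy.
Result: MF-LPR$^2$ significantly outperforms eight restoration models on PSNR, SSIM, and LPIPS, and reaches a recognition accuracy of 86.44%, ahead of all baseline models.
Insight: Multi-frame fusion and optical flow error correction significantly improve license plate restoration and recognition, and the real-world RLPR dataset provides an important benchmark for future research.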
Abstract: License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images, frequently introducing severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR$^2$, which addresses ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimations by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR$^2$. The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR$^2$ outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. In recognition, MF-LPR$^2$ achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (14.04%) and the multi-frame LPR (82.55%) among the eleven baseline models. The results of ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.
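The abstract does not describe the error-detection algorithms in detail. As an illustration of the general idea only, the sketch below swaps in a standard forward-backward consistency check: a flow vector is trusted when the backward flow roughly cancels the forward flow, and only trusted frames contribute to a per-pixel median. The function names, the tolerance, and the use of a median are assumptions, not the authors' algorithm.

```python
import math
from statistics import median


def flow_is_consistent(forward, backward, tol=1.0):
    """True if the backward flow approximately undoes the forward flow."""
    return math.hypot(forward[0] + backward[0], forward[1] + backward[1]) <= tol


def aggregate_pixel(aligned_values, forward_flows, backward_flows, tol=1.0):
    """Median of one pixel over neighboring frames with reliable flow."""
    kept = [
        value
        for value, fwd, bwd in zip(aligned_values, forward_flows, backward_flows)
        if flow_is_consistent(fwd, bwd, tol)
    ]
    return median(kept) if kept else None


# Four neighboring frames; the third has a clearly wrong flow estimate.
value = aggregate_pixel(
    aligned_values=[100, 102, 250, 98],
    forward_flows=[(1, 0), (2, 0), (5, 5), (0, 1)],
    backward_flows=[(-1, 0), (-2, 0.5), (0, 0), (0, -1)],
)
```

Rejecting the inconsistent frame before aggregation keeps the outlier intensity (250) from influencing the restored pixel.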
[74] Tinker: Diffusion’s Gift to 3D–Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Canyu Zhao,Xiaoman Li,Tianjian Feng,Zhiyue Zhao,Hao Chen,Chunhua Shen
Main category: cs.CV
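TL;DR: Tinker is a 3D editing framework that requires no per-scene optimization. It uses the 3D awareness of pretrained diffusion models to produce multi-view consistent edits, generating high-quality results from only a few inputs.
Details
Motivation: Existing 3D editing techniques typically require extensive per-scene optimization or dozens of consistently edited input views, which is computationally expensive and hard to scale. Tinker aims to provide a more efficient and general 3D editing solution. Contribution: 1. Proposes Tinker, a multi-view consistent editing framework that requires no per-scene optimization; 2. Builds the first large-scale multi-view editing dataset and data pipeline; 3. Introduces two new components: a reference-driven multi-view editor and an any-view-to-video synthesizer.
Method: 1. Repurposes the 3D awareness of pretrained diffusion models; 2. Achieves precise, cross-view consistent edits with a referring multi-view editor; 3. Uses spatial-temporal priors to generate high-quality novel views and videos from sparse inputs.
Result: Tinker achieves SOTA performance on editing, novel-view synthesis, and rendering enhancement tasks, significantly lowering the barrier to generalizable 3D content creation.
Insight: By exploiting the latent 3D awareness of diffusion models, Tinker shows the potential of 3D editing without per-scene optimization and points toward scalable, zero-shot 3D editing.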
Abstract: We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
[75] Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao,Jiaxi Gu,Shicong Wang,Xing Zhang,Hang Xu,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
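TL;DR: The paper proposes a new video-text retrieval framework that uses coarse-to-fine objectives and keyword repetition to significantly improve retrieval performance while reducing training cost.
Details
Motivation: The explosive growth of video streaming makes it challenging to achieve both high accuracy and low training cost in video-text retrieval. Existing methods rely on large-scale pre-training, which is computationally expensive, and fine-grained information remains underexploited. Contribution: 1. Proposes a learning framework with coarse-to-fine objectives; 2. Introduces the Granularity-Aware Representation module to extract fine-grained features; 3. Identifies the "Repetition" phenomenon (repeated keywords) and proposes an inference pipeline that needs no additional training.
Method: 1. Combines contrastive learning and matching learning; 2. Designs the Granularity-Aware Representation module; 3. Proposes an inference pipeline based on a voting mechanism and the Matching Entropy metric.
Result: The method performs well on four benchmarks; the inference pipeline improves Recall@1 by 2.1% on MSR-VTT and 1.6% on DiDeMo.
Insight: Keyword repetition can strengthen video-text alignment, and improving the inference pipeline can raise performance substantially without additional training.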
Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as “Repetition”, can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
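The abstract names a voting mechanism and a Matching Entropy metric but does not define either. The sketch below is therefore purely hypothetical: it treats the entropy of a softmax over candidate similarity scores as a confidence signal, and takes a majority vote over the top-ranked video across several caption variants (for example, the original caption and versions with a repeated keyword). None of these definitions come from the paper.

```python
import math
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident match."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote(score_lists):
    """Index of the video ranked first by the most caption variants."""
    winners = [max(range(len(s)), key=s.__getitem__) for s in score_lists]
    return Counter(winners).most_common(1)[0][0]


# Similarity of three caption variants to three candidate videos.
variants = [[0.2, 0.9, 0.1], [0.3, 0.8, 0.2], [0.7, 0.6, 0.1]]
best_video = vote(variants)
confidences = [entropy(softmax(s)) for s in variants]
```

In a pipeline of this kind, an entropy value could be used to decide when a query is ambiguous enough to warrant the extra caption variants, without any additional training.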
[76] EventSSEG: Event-driven Self-Supervised Segmentation with Probabilistic Attention
Lakshmi Annamalai,Chetan Singh Thakur
Main category: cs.CV
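TL;DR: EventSSEG is a self-supervised learning framework for road segmentation with event cameras. Using a probabilistic attention mechanism and event-only computation, it reduces the need for labeled data and achieves low latency and low computational overhead.
Details
Motivation: Road segmentation with conventional frame-based cameras suffers from high latency and high compute requirements. Event cameras are a promising low-power alternative, but the lack of pretrained weights and labeled data limits their use. Contribution: 1. Proposes EventSSEG, an event-based, self-supervised road segmentation method; 2. Introduces a probabilistic attention mechanism to improve segmentation performance; 3. Achieves performance comparable to conventional methods with very little labeled data.
Method: EventSSEG combines event-only computation with self-supervised learning, processes event data with a probabilistic attention mechanism, and learns features directly from event data without pretrained weights from conventional cameras.
Result: Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with very little labeled data.
Insight: Self-supervised learning on event-camera data is an effective way to address the scarcity of labeled data, and probabilistic attention can efficiently handle the temporal dynamics of event data.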
Abstract: Road segmentation is pivotal for autonomous vehicles, yet achieving low-latency, low-compute solutions using frame-based cameras remains a challenge. Event cameras offer a promising alternative. To leverage their low-power sensing, we introduce EventSSEG, a method for road segmentation that uses event-only computing and a probabilistic attention mechanism. Event-only computing poses a challenge in transferring pretrained weights from the conventional camera domain, requiring abundant labeled data, which is scarce. To overcome this, EventSSEG employs event-based self-supervised learning, eliminating the need for extensive labeled data. Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with minimal labeled events. This approach maximizes event cameras' capabilities and addresses the lack of labeled events.
[77] Lifespan Pancreas Morphology for Control vs Type 2 Diabetes using AI on Largescale Clinical Imaging
Lucas W. Remedios,Chloe Cho,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Thomas A. Lasko,Alvin C. Powers,Bennett A. Landman,John Virostko
Main category: cs.CV
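TL;DR: This paper uses AI to analyze large-scale clinical imaging data, characterizing how pancreas morphology changes with age from 0 to 90 years and comparing patients with type 2 diabetes against non-diabetic controls.
Details
Motivation: Changes in pancreas morphology are important for early detection of type 2 diabetes and other pancreatic diseases, but systematic studies are lacking. Contribution: 1) Assesses which clinical imaging modalities (CT and MRI) are reliable for AI-based pancreas measurement; 2) Establishes normative age trends in pancreas morphology; 3) Finds that pancreas morphology in patients with type 2 diabetes deviates clearly from the normative trends.
Method: The study analyzes abdominal CT or MRI scans from 2533 patients, automatically segments the pancreas, extracts 13 morphological features, and uses GAMLSS regression to compare patients with and without diabetes.
Result: After adjusting for confounders, the aging trends of 10 of 13 morphological features differ significantly between patients with type 2 diabetes and controls (p < 0.05 after multiple comparisons correction). MRI and CT measurements also differ.
Insight: The pancreas is smaller in type 2 diabetes, and morphological features may serve as biomarkers for early diagnosis. The study also provides reference data on pancreas morphology in non-diabetic patients.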
Abstract: Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic disease. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes. Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. We resampled the scans to 3 mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes. Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons correction). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method. Conclusions: Using 675 control patients and 675 diabetes patients, we provide lifespan trends demonstrating that the size and shape of the pancreas are altered in type 2 diabetes. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.
[78] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects
Licheng Shen,Saining Zhang,Honghan Li,Peilin Yang,Zihao Huang,Zongzheng Zhang,Hao Zhao
Main category: cs.CV
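TL;DR: GaussianArt proposes a unified model of geometry and motion for reconstructing objects with multi-part articulation, significantly improving robustness and scalability.
Details
Motivation: Existing methods typically decouple geometry and motion, which complicates the reconstruction pipeline and makes multi-part articulated objects hard to handle. Contribution: Proposes a unified representation based on articulated 3D Gaussians that jointly models geometry and motion and supports articulated objects with up to 20 parts, along with a new benchmark dataset, MPArt-90.
Method: Jointly models geometry and motion with articulated 3D Gaussians, avoiding the separate stages of conventional methods.
Result: In experiments on 90 articulated objects, the method performs strongly in both geometry reconstruction and motion estimation.
Insight: Unified representations show more promise for complex articulated objects and apply to downstream tasks such as robotic simulation and human-scene interaction.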
Abstract: Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2–3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.
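The idea of tying motion to the same primitives that carry geometry can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the part labels, and the restriction to a hinge about a vertical axis are all assumptions made for illustration, and a fuller version would also rotate each Gaussian's covariance.

```python
import math

def rotate_about_z(point, pivot, angle_rad):
    """Rotate a 3D point about a vertical (z) axis passing through `pivot`."""
    x, y, z = (p - q for p, q in zip(point, pivot))
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (pivot[0] + c * x - s * y, pivot[1] + s * x + c * y, pivot[2] + z)

def articulate(centers, part_ids, joints):
    """Move each Gaussian center by the joint transform of its part.

    `joints` maps a part id to (pivot, angle_rad); parts without an entry stay static.
    """
    moved = []
    for center, part in zip(centers, part_ids):
        if part in joints:
            pivot, angle = joints[part]
            moved.append(rotate_about_z(center, pivot, angle))
        else:
            moved.append(tuple(center))
    return moved

# Hypothetical toy object: part 0 is a static base, part 1 is a door hinged at x = 1.
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
part_ids = [0, 1]
joints = {1: ((1.0, 0.0, 0.0), math.pi / 2)}
moved = articulate(centers, part_ids, joints)
```

Because every Gaussian carries a part label, changing one joint angle moves all primitives of that part together, which is the property a joint geometry-and-motion representation relies on.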
cs.NI [Back]
[79] OmniSense: Towards Edge-Assisted Online Analytics for 360-Degree Videos
Miao Zhang,Yifei Zhu,Linfeng Shen,Fangxin Wang,Jiangchuan Liu
Main category: cs.NI
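TL;DR: OmniSense is an edge-assisted online analytics framework for efficient processing of 360-degree videos; it achieves low latency and high accuracy through a lightweight SRoI prediction algorithm and dynamic model scaling.
Details
Motivation: As 360-degree videos become widespread, extracting useful information from them efficiently is a challenge. OmniSense targets the computation and network resource constraints to deliver low-latency, high-accuracy online video analytics. Contribution: The OmniSense framework, which introduces a lightweight spherical region of interest (SRoI) prediction algorithm and dynamic model scaling, substantially improving the efficiency and accuracy of 360-degree video analytics.
Method: Predicts SRoIs, then scales the vision models according to video content and network dynamics to optimize resource utilization.
Result: Compared with resource-agnostic baselines, accuracy improves by 19.8% to 114.6% at similar end-to-end latency; alternatively, OmniSense reaches 2.0x to 2.4x speedups while matching the highest accuracy of the baselines.
Insight: Focusing on key regions (SRoIs) and dynamically optimizing resource allocation can substantially improve 360-degree video analytics.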
Abstract: With the reduced hardware costs of omnidirectional cameras and the proliferation of various extended reality applications, more and more $360^\circ$ videos are being captured. To fully unleash their potential, advanced video analytics is expected to extract actionable insights and situational knowledge without blind spots from the videos. In this paper, we present OmniSense, a novel edge-assisted framework for online immersive video analytics. OmniSense achieves both low latency and high accuracy, combating the significant computation and network resource challenges of analyzing $360^\circ$ videos. Motivated by our measurement insights into $360^\circ$ videos, OmniSense introduces a lightweight spherical region of interest (SRoI) prediction algorithm to prune redundant information in $360^\circ$ frames. Incorporating the video content and network dynamics, it then smartly scales vision models to analyze the predicted SRoIs with optimized resource utilization. We implement a prototype of OmniSense with commodity devices and evaluate it on diverse real-world collected $360^\circ$ videos. Extensive evaluation results show that compared to resource-agnostic baselines, it improves the accuracy by $19.8\%$ – $114.6\%$ with similar end-to-end latencies. Meanwhile, it hits $2.0\times$ – $2.4\times$ speedups while keeping the accuracy on par with the highest accuracy of baselines.
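The abstract does not spell out the scaling rule, so the following is only a minimal sketch of one plausible policy: pick the most accurate model variant whose estimated transfer plus inference time fits a latency budget. The variant table, the `bits_per_pixel` compression estimate, and all numbers are hypothetical.

```python
def pick_variant(variants, sroi_pixels, bandwidth_bps, budget_s, bits_per_pixel=0.1):
    """Return the most accurate variant whose estimated latency fits the budget."""
    # Assumed cost model: time to upload the SRoI plus the variant's inference time.
    transfer_s = sroi_pixels * bits_per_pixel / bandwidth_bps
    feasible = [v for v in variants if transfer_s + v["infer_s"] <= budget_s]
    if not feasible:
        # Nothing fits the budget: fall back to the fastest variant.
        return min(variants, key=lambda v: v["infer_s"])
    return max(feasible, key=lambda v: v["accuracy"])

# Hypothetical model variants profiled offline.
variants = [
    {"name": "small", "infer_s": 0.02, "accuracy": 0.60},
    {"name": "medium", "infer_s": 0.05, "accuracy": 0.72},
    {"name": "large", "infer_s": 0.12, "accuracy": 0.80},
]
choice = pick_variant(variants, sroi_pixels=1_000_000,
                      bandwidth_bps=10_000_000, budget_s=0.1)
tight = pick_variant(variants, sroi_pixels=1_000_000,
                     bandwidth_bps=10_000_000, budget_s=0.02)
```

A smaller predicted SRoI lowers the transfer term, which leaves more of the budget for a larger model; that coupling is why region pruning and model scaling are considered together.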
cs.IR [Back]
[80] FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering
Chanyeol Choi,Jihoon Kwon,Alejandro Lopez-Lira,Chaewoon Kim,Minjae Kim,Juneha Hwang,Jaeseon Ha,Hojun Choi,Suyeol Yun,Yongjin Kim,Yongjae Lee
Main category: cs.IR
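TL;DR: FinAgentBench is the first large-scale benchmark dataset for retrieval with multi-step reasoning in finance, designed to evaluate the retrieval ability of LLM agents in financial question answering.
Details
Motivation: Traditional retrieval methods fall short in accuracy in the financial domain, lack multi-step reasoning, and have no dedicated benchmark. FinAgentBench fills this gap. Contribution: 1. FinAgentBench, the first benchmark for retrieval with multi-step reasoning in finance; 2. A stepwise evaluation framework that quantifies the retrieval behavior of LLM agents; 3. Experiments showing that targeted fine-tuning can significantly improve performance.
Method: 1. Build 3,429 expert-annotated examples; 2. Evaluate two abilities of LLM agents: selecting the document type and locating the key passage; 3. Separate the reasoning steps to cope with context limitations.
Result: A range of state-of-the-art models was evaluated, confirming that targeted fine-tuning significantly improves agentic retrieval performance.
Insight: Finance calls for agentic retrieval with multi-step reasoning, and FinAgentBench provides a foundation for studying LLM behavior in complex, domain-specific tasks.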
Abstract: Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods, whether sparse or dense, often fall short in retrieval accuracy, as the task requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance, a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design provides a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share the dataset for the full S&P 500 and beyond.
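Scoring the two steps separately can be sketched as below. This is an illustration only: the record fields, the document-type labels, and the use of top-1 accuracy are assumptions, since the abstract does not name the benchmark's actual metrics.

```python
def stepwise_accuracy(examples):
    """Score document-type selection and passage selection as separate steps."""
    n = len(examples)
    doc_acc = sum(e["pred_doc"] == e["gold_doc"] for e in examples) / n
    passage_acc = sum(e["pred_passage"] == e["gold_passage"] for e in examples) / n
    return doc_acc, passage_acc

# Hypothetical predictions; document types and passage ids are made up.
examples = [
    {"pred_doc": "10-K", "gold_doc": "10-K", "pred_passage": 4, "gold_passage": 4},
    {"pred_doc": "10-K", "gold_doc": "10-K", "pred_passage": 2, "gold_passage": 7},
    {"pred_doc": "8-K", "gold_doc": "8-K", "pred_passage": 1, "gold_passage": 1},
    {"pred_doc": "10-Q", "gold_doc": "8-K", "pred_passage": 3, "gold_passage": 5},
]
doc_acc, passage_acc = stepwise_accuracy(examples)
```

Reporting the two numbers side by side shows whether an agent fails at choosing the right kind of document or at locating the passage inside it, which a single end-to-end score would hide.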
cs.AI [Back]
[81] Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
Luca Annese,Sabrina Patania,Silvia Serino,Tom Foulsham,Silvia Rossi,Azzurra Ruggeri,Dimitri Ognibene
Main category: cs.AI
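TL;DR: The paper examines whether structured thought-action sequences can improve LLM performance on epistemic reasoning tasks, and finds that structured examples alone are insufficient for robust perspective-taking.
Details
Motivation: Current LLMs perform poorly on tasks involving active perception, collaborative reasoning, and perspective-taking; the authors aim to improve them with structured examples. Contribution: A structured solution-processing pipeline that generates three categories of examples (G-type, E-type, L-type) and prompts an LLM to articulate the reasoning behind each decision.
Method: Builds the three categories of structured examples from transformed solution graphs generated by the Fast Downward planner, and tests them within a ReAct framework.
Result: L-type examples slightly reduce clarification requests and action steps but do not yield consistent improvements; agents do well on basic attentional filtering but struggle on tasks that require reasoning about occluded spaces or weighing the costs of epistemic actions.
Insight: Structured examples have limited effect on perspective-taking; explicit belief tracking, cost modelling, and richer environments are needed for socially grounded collaboration in LLM-based agents.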
Abstract: Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into "thought-action" examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that require mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.
[82] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo,Zhiqi Shen,Wenzhuo Yang,Zirui Zhao,Prathyusha Jwalapuram,Amrita Saha,Doyen Sahoo,Silvio Savarese,Caiming Xiong,Junnan Li
Main category: cs.AI
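TL;DR: The paper presents MCP-Universe, the first comprehensive benchmark designed to evaluate large language models (LLMs) through interaction with real-world MCP servers. It covers 6 core domains and 11 MCP servers and exposes the limits of current SOTA models in long-horizon reasoning and unfamiliar tool spaces.
Details
Motivation: Existing benchmarks are overly simplistic and fail to capture the challenges LLMs face in real applications, such as long-horizon reasoning and large, unfamiliar tool spaces. The authors built MCP-Universe to fill this gap. Contribution: 1. MCP-Universe, the first comprehensive benchmark for interaction with real-world MCP servers; 2. Execution-based evaluators of several kinds (format, static, dynamic); 3. An open-source, extensible evaluation framework with UI support for integrating new agents and MCP servers.
Method: MCP-Universe evaluates LLMs as they interact with real MCP servers (6 core domains, 11 servers), tests them rigorously with execution-based evaluators (format, static, dynamic), and releases the extensible framework as open source.
Result: SOTA models such as GPT-5, Grok-4, and Claude-4.0-Sonnet perform poorly, scoring 43.72%, 33.33%, and 29.44%, respectively. The benchmark exposes the challenges LLMs face with long contexts and unfamiliar tools.
Insight: 1. Current LLMs fall short in long-horizon reasoning and unfamiliar tool spaces; 2. Enterprise-level agents such as Cursor do not outperform standard ReAct frameworks; 3. The open-source framework may help drive innovation in the MCP ecosystem.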
Abstract: The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
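The three evaluator kinds named in the abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the benchmark's code: the function names, the JSON answer shape, and the `fetch_truth` callback standing in for a live data source are all hypothetical.

```python
import json

def format_ok(raw, required_keys):
    """Format evaluator: the answer must be a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

def static_ok(raw, key, expected):
    """Static evaluator: compare against a time-invariant reference value."""
    return json.loads(raw).get(key) == expected

def dynamic_ok(raw, key, fetch_truth, tolerance):
    """Dynamic evaluator: fetch the ground truth at evaluation time, then compare."""
    return abs(json.loads(raw)[key] - fetch_truth()) <= tolerance

def evaluate(raw, checks):
    """A task passes only if every check passes; the format check should come first."""
    return all(check(raw) for check in checks)

checks = [
    lambda r: format_ok(r, ["city", "price"]),
    lambda r: static_ok(r, "city", "Paris"),
    lambda r: dynamic_ok(r, "price", lambda: 100.5, tolerance=1.0),
]
passed = evaluate('{"city": "Paris", "price": 101.0}', checks)
rejected = evaluate("not json", checks)
```

Placing the format check first lets `all` short-circuit before the later checks try to parse a malformed answer.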
eess.IV [Back]
[83] 3D Cardiac Anatomy Generation Using Mesh Latent Diffusion Models
Jolanta Mozyrska,Marcel Beetz,Luke Melas-Kyriazi,Vicente Grau,Abhirup Banerjee,Alfonso Bueno-Orovio
Main category: eess.IV
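TL;DR: The paper proposes MeshLDM, a latent diffusion model (LDM) architecture for generating 3D cardiac anatomy meshes, and validates it on left-ventricular anatomies from patients with acute myocardial infarction; the generated meshes differ from the gold standard by only 2.4% in population mean.
Details
Motivation: 3D medical image generation is still rarely applied in cardiology, although generating diverse, realistic cardiac anatomies is crucial for applications such as computer simulations and data augmentation. Contribution: MeshLDM, a new LDM architecture designed for generating 3D cardiac anatomy meshes.
Method: The authors design MeshLDM, which applies a diffusion process in latent space to learn from anatomies at the end-diastolic and end-systolic phases and generate diverse 3D meshes.
Result: MeshLDM's meshes score well on clinical and 3D mesh reconstruction metrics, with only a 2.4% difference in population mean from the gold standard.
Insight: Diffusion models show promise for 3D medical image generation, especially in cardiology, and could be extended to modeling other organs or pathological conditions.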
Abstract: Diffusion models have recently gained immense interest for their generative capabilities, specifically the high quality and diversity of the synthesized data. However, examples of their applications in 3D medical imaging are still scarce, especially in cardiology. Generating diverse realistic cardiac anatomies is crucial for applications such as in silico trials, electromechanical computer simulations, or data augmentations for machine learning models. In this work, we investigate the application of Latent Diffusion Models (LDMs) for generating 3D meshes of human cardiac anatomies. To this end, we propose a novel LDM architecture – MeshLDM. We apply the proposed model on a dataset of 3D meshes of left ventricular cardiac anatomies from patients with acute myocardial infarction and evaluate its performance in terms of both qualitative and quantitative clinical and 3D mesh reconstruction metrics. The proposed MeshLDM successfully captures characteristics of the cardiac shapes at end-diastolic (relaxation) and end-systolic (contraction) cardiac phases, generating meshes with a 2.4% difference in population mean compared to the gold standard.
[84] Automated surgical planning with nnU-Net: delineation of the anatomy in hepatobiliary phase MRI
Karin A. Olthof,Matteo Fusagli,Bianca Güttner,Tiziano Natali,Bram Westerink,Stefanie Speidel,Theo J. M. Ruers,Koert F. D. Kuhlmann,Andrey Zhylka
Main category: eess.IV
TL;DR: 该研究开发了一种基于nnU-Net的深度学习方法,用于从肝胆期MRI中自动分割肝脏解剖结构,优化术前规划流程。
Details
Motivation: 术前规划在肝脏手术中至关重要,但手动分割肝脏解剖结构耗时且主观性强。本研究旨在通过自动化方法减轻临床负担。Contribution: 提出了一种基于nnU-Net的自动化分割方法,能够准确分割肝脏、肿瘤及血管结构,显著提升临床工作效率。
Method: 使用90例患者的肝胆期MRI数据,训练nnU-Net模型,重点关注薄结构和地形保留,并通过Dice相似系数评估性能。
Result: 模型在测试集上表现出色,尤其是肝脏实质分割(DSC 0.97);临床评估中仅需少量调整,并额外检测到放射科医生遗漏的肿瘤。
Insight: 自动化分割方法在临床中有实际价值,能够补充人工检查的不足,为肝胆手术的术前规划提供标准化工具。
Abstract: Background: The aim of this study was to develop and evaluate a deep learning-based automated segmentation method for hepatic anatomy (i.e., parenchyma, tumors, portal vein, hepatic vein and biliary tree) from the hepatobiliary phase of gadoxetic acid-enhanced MRI. This method should ease the clinical workflow of preoperative planning. Methods: Manual segmentation was performed on hepatobiliary phase MRI scans from 90 consecutive patients who underwent liver surgery between January 2020 and October 2023. A deep learning network (nnU-Net v1) was trained on 72 patients with an extra focus on thin structures and topography preservation. Performance was evaluated on an 18-patient test set by comparing automated and manual segmentations using Dice similarity coefficient (DSC). Following clinical integration, 10 segmentations (assessment dataset) were generated using the network and manually refined for clinical use to quantify required adjustments using DSC. Results: In the test set, DSCs were 0.97+/-0.01 for liver parenchyma, 0.80+/-0.04 for hepatic vein, 0.79+/-0.07 for biliary tree, 0.77+/-0.17 for tumors, and 0.74+/-0.06 for portal vein. Average tumor detection rate was 76.6+/-24.1%, with a median of one false-positive per patient. The assessment dataset showed minor adjustments were required for clinical use of the 3D models, with high DSCs for parenchyma (1.00+/-0.00), portal vein (0.98+/-0.01) and hepatic vein (0.95+/-0.07). Tumor segmentation exhibited greater variability (DSC 0.80+/-0.27). During prospective clinical use, the model detected three additional tumors initially missed by radiologists. Conclusions: The proposed nnU-Net-based segmentation method enables accurate and automated delineation of hepatic anatomy. This enables 3D planning to be applied efficiently as a standard-of-care for every patient undergoing liver surgery.
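For readers unfamiliar with the metric, the Dice similarity coefficient used throughout the results is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a manual mask B. A minimal sketch, assuming masks are given as sets of voxel indices (the toy masks below are made up):

```python
def dice(pred, truth):
    """Dice similarity coefficient between two masks given as sets of voxel indices."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

# Toy masks: two of three voxels overlap.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
truth = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice(pred, truth)
```

Because the denominator is the total mask size, thin structures such as vessels are penalized heavily for a few misplaced voxels, which is consistent with the lower scores reported for the portal vein than for parenchyma.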
[85] A Systematic Study of Deep Learning Models and xAI Methods for Region-of-Interest Detection in MRI Scans
Justin Yiu,Kushank Arora,Daniel Steinberg,Rohit Ghiya
Main category: eess.IV
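TL;DR: The paper systematically evaluates several deep learning architectures combined with explainable AI (xAI) methods for automated region-of-interest (ROI) detection in knee MRI scans, and finds that ResNet50 performs best at classification and ROI identification.
Details
Motivation: Manual MRI analysis is time-consuming and prone to inter-observer variability, so automated ROI detection is needed to improve efficiency and accuracy. Contribution: 1. A systematic comparison of supervised and self-supervised deep learning models (e.g., ResNet50, ViT, U-Net variants) with xAI methods (e.g., Grad-CAM). 2. Evidence that CNN transfer learning is effective on the MRI dataset, with a discussion of the potential of transformer models.
Method: Uses several architectures (ResNet50, InceptionV3, ViT, U-Net variants) and xAI techniques (Grad-CAM, Saliency Maps), evaluated with AUC, PSNR/SSIM, and qualitative visualizations.
Result: ResNet50 performs best at classification and ROI identification; Grad-CAM gives the most clinically meaningful explanations; transformer models are limited by dataset size, so their potential is not fully realized.
Insight: 1. CNN transfer learning is the most effective approach for ROI detection on this MRI dataset. 2. Transformer models may need larger-scale pretraining to reach their potential. 3. Grad-CAM is the most suitable xAI tool here.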
Abstract: Magnetic Resonance Imaging (MRI) is an essential diagnostic tool for assessing knee injuries. However, manual interpretation of MRI slices remains time-consuming and prone to inter-observer variability. This study presents a systematic evaluation of various deep learning architectures combined with explainable AI (xAI) techniques for automated region of interest (ROI) detection in knee MRI scans. We investigate both supervised and self-supervised approaches, including ResNet50, InceptionV3, Vision Transformers (ViT), and multiple U-Net variants augmented with multi-layer perceptron (MLP) classifiers. To enhance interpretability and clinical relevance, we integrate xAI methods such as Grad-CAM and Saliency Maps. Model performance is assessed using AUC for classification and PSNR/SSIM for reconstruction quality, along with qualitative ROI visualizations. Our results demonstrate that ResNet50 consistently excels in classification and ROI identification, outperforming transformer-based models under the constraints of the MRNet dataset. While hybrid U-Net + MLP approaches show potential for leveraging spatial features in reconstruction and interpretability, their classification performance remains lower. Grad-CAM consistently provided the most clinically meaningful explanations across architectures. Overall, CNN-based transfer learning emerges as the most effective approach for this dataset, while future work with larger-scale pretraining may better unlock the potential of transformer models.
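Of the metrics listed, PSNR has a simple closed form, PSNR = 10 · log10(MAX² / MSE). The sketch below assumes flattened 8-bit intensity lists; the pixel values are invented for illustration and are not from the paper.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

reference = [100, 100, 100, 100]
reconstruction = [110, 90, 100, 100]
value = psnr(reference, reconstruction)
```

PSNR only measures pixel-wise error, which is one reason it is usually reported alongside a structural metric such as SSIM.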
[86] Fine-grained Image Quality Assessment for Perceptual Image Restoration
Xiangfei Sheng,Xiaofeng Pan,Zhichao Yang,Pengfei Chen,Leida Li
Main category: eess.IV
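TL;DR: This paper introduces FGRestore, a fine-grained image quality assessment (IQA) dataset, and FGResQ, a new IQA model designed specifically for image restoration.
Details
Motivation: Existing IQA metrics perform poorly on image restoration, particularly at distinguishing fine-grained quality differences among restored images. The authors propose a new dataset and model to address this. Contribution: 1) FGRestore, the first fine-grained IQA dataset for image restoration; 2) FGResQ, a model that combines coarse-grained score regression with fine-grained quality ranking.
Method: 1) Build FGRestore, with restored images from multiple tasks and pairwise preference annotations; 2) Design FGResQ to integrate score regression and quality ranking.
Result: Experiments show that FGResQ significantly outperforms existing IQA metrics on image restoration.
Insight: Conventional IQA metrics may be inaccurate for image restoration; fine-grained evaluation reflects restoration quality better.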
Abstract: Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weaknesses on IR tasks, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which revealed significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://pxf0429.github.io/FGResQ/
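One way to see the mismatch between scalar scores and pairwise preferences is to count how often a metric orders annotated pairs correctly. The sketch below is an illustration under assumptions: the image ids, scores, and preferences are invented, and the paper's actual evaluation protocol may differ.

```python
def pairwise_accuracy(scores, preferences):
    """Fraction of (better, worse) pairs that a scalar metric orders correctly."""
    correct = sum(scores[better] > scores[worse] for better, worse in preferences)
    return correct / len(preferences)

# Hypothetical metric scores for three restored images.
scores = {"a": 0.80, "b": 0.79, "c": 0.50}
# Hypothetical human preferences, each written as (better, worse).
preferences = [("b", "a"), ("a", "c"), ("b", "c")]
accuracy = pairwise_accuracy(scores, preferences)
```

In this toy case the metric separates the clearly worse image `c` but misorders the close pair `a` and `b`, which is the kind of fine-grained failure the benchmark is meant to surface.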
[87] From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound
Max Krähenmann,Sergio Tascon-Morales,Fabian Laumer,Julia E. Vogt,Ece Ozkan
Main category: eess.IV
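TL;DR: The paper proposes an unsupervised framework that reconstructs 3D female pelvic anatomy from freehand 2D transvaginal ultrasound (TVS) sweeps without external tracking or learned pose estimators. A slice-aware differentiable rasterizer tailored to the physics and geometry of ultrasound imaging enables 3D reconstruction with high spatial fidelity.
Details
Motivation: Conventional 3D ultrasound depends on specialized hardware and restrictive acquisition protocols, which limits its adoption. The paper aims to reconstruct 3D structure from 2D ultrasound images by purely computational means, as a scalable alternative. Contribution: (1) a slice-aware differentiable rasterizer for ultrasound imaging; (2) adaptation of Gaussian Splatting principles to the ultrasound domain; (3) a framework for 3D reconstruction without external trackers.
Method: (1) Represent anatomy as a collection of anisotropic 3D Gaussians; (2) optimize the Gaussian parameters directly from image-level supervision; (3) use sensorless probe motion estimation and domain-specific geometric priors.
Result: The method achieves 3D reconstruction with high spatial fidelity and produces a compact, flexible volumetric representation, opening new opportunities for AI-assisted analysis and diagnosis.
Insight: 3D reconstruction by purely computational means is feasible, which reduces hardware dependence and points to new directions for AI in ultrasound.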
Abstract: Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound (TVS) sweeps, without requiring external tracking or learned pose estimators. Our method adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision, leveraging sensorless probe motion estimation and domain-specific geometric priors. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.
[88] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model
Hyun-Jic Oh,Junsik Kim,Zhiyi Shi,Yichen Wu,Yu-An Chen,Peter K. Sorger,Hanspeter Pfister,Won-Ki Jeong
Main category: eess.IV
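TL;DR: The paper proposes a framework based on a marker-wise conditioned diffusion model that generates virtual multiplex images from H&E images, addressing the high cost and complexity of multiplex imaging and substantially increasing both the number of marker types generated and their accuracy.
Details
Motivation: Multiplex imaging plays an important role in pathology, but its complexity and cost limit widespread use. Most existing H&E image repositories lack corresponding multiplex images, which limits multimodal analysis. Contribution: A new virtual multiplex staining framework that uses a pretrained latent diffusion model (LDM) to generate multiplex images; marker-by-marker generation through a marker-conditioned diffusion model; and fine-tuning for single-step sampling that improves color contrast and inference efficiency.
Method: Starts from a pretrained LDM and generates multiplex images with a conditional diffusion model. The model is conditioned on each marker while sharing one architecture across markers; single-step sampling is fine-tuned with pixel-level loss functions to improve contrast and efficiency.
Result: Validated on two public datasets, the framework generates up to 18 marker types with better accuracy than previous approaches, which handled only 2-3 markers.
Insight: The method bridges H&E and multiplex imaging and may support retrospective studies and large-scale analyses of existing H&E repositories, giving pathology a new research tool.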
Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.
cs.GR [Back]
[89] A Real-world Display Inverse Rendering Dataset
Seokjun Choi,Hoon-Gyu Chung,Yujin Jeon,Giljoo Nam,Seung-Hwan Baek
Main category: cs.GR
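TL;DR: The paper presents the first real-world inverse rendering dataset captured with a display-camera system, filling a data gap in the field and supporting research on and evaluation of inverse rendering methods.
Details
Motivation: Existing inverse rendering datasets are mostly captured with setups such as light stages, and no dataset captured with a display-camera system has been publicly available. This gap has held back related methods, so the paper builds the first such dataset. Contribution: 1. The first real-world inverse rendering dataset based on a display-camera system; 2. High-quality ground-truth geometry; 3. A simple yet effective baseline that outperforms existing methods on inverse rendering.
Method: Build an imaging system consisting of an LCD display and stereo polarization cameras, capture diverse objects under OLAT (one-light-at-a-time) display patterns, and provide high-quality ground-truth geometry.
Result: Experiments show that the dataset supports synthesizing images under arbitrary display patterns and noise levels, and that the proposed baseline outperforms state-of-the-art inverse rendering methods.
Insight: Display-camera systems offer distinct advantages for inverse rendering through controllable pixel light sources and polarized light; the public dataset should help advance research in this area.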
Abstract: Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a diverse set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple, yet effective baseline for display inverse rendering, outperforming state-of-the-art inverse rendering methods. Code and dataset are available on our project page at https://michaelcsj.github.io/DIR/
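Synthesizing an image under an arbitrary display pattern from OLAT captures follows from the linearity of light transport: the image under a pattern is the weighted sum of the per-light images. A minimal sketch, assuming linear (raw) sensor response, no added noise, and invented toy values; the function name is hypothetical and this is not the dataset's tooling.

```python
def relight(olat_images, pattern):
    """Weighted sum of OLAT images; `pattern[i]` is the intensity of display light i."""
    n_pixels = len(olat_images[0])
    return [
        sum(weight * image[p] for weight, image in zip(pattern, olat_images))
        for p in range(n_pixels)
    ]

# Three OLAT captures of a two-pixel scene (toy values).
olat_images = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
# Display pattern: light 0 fully on, light 1 at half intensity, light 2 off.
pattern = [1.0, 0.5, 0.0]
image = relight(olat_images, pattern)
```

Noise at a chosen level could be added to the summed image afterwards, which matches the abstract's mention of simulating different noise levels.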
[90] MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Bingquan Dai,Li Ray Luo,Qihong Tang,Jie Wang,Xinyu Lian,Hao Xu,Minghan Qin,Xudong Xu,Bo Dai,Haoqian Wang,Zhaoyang Lyu,Jiangmiao Pang
Main category: cs.GR
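TL;DR: MeshCoder is a framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts, using a multimodal large language model for shape-to-code translation.
Details
Motivation: Existing methods rely on limited domain-specific languages and small-scale datasets, which makes complex geometry and structure hard to model; MeshCoder targets this problem. Contribution: A comprehensive set of Blender Python APIs, a large-scale paired object-code dataset, and a multimodal LLM trained to translate point clouds into scripts.
Method: Build the dataset on the Blender Python APIs, with the code for each object decomposed into semantic parts, and train a multimodal LLM to translate point clouds into code.
Result: Performs strongly on shape-to-code reconstruction and supports intuitive geometric and topological editing through code modifications.
Insight: A code-based representation improves LLM reasoning in 3D shape understanding tasks and offers a flexible solution for programmatic 3D shape reconstruction.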
Abstract: Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point cloud into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.
q-bio.QM [Back]
[91] High-Throughput Low-Cost Segmentation of Brightfield Microscopy Live Cell Images
Surajit Das,Gourav Roy,Pavel Zun
Main category: q-bio.QM
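TL;DR: The paper proposes a low-cost, high-throughput CNN pipeline for segmenting brightfield microscopy images of unstained live cells, handling low contrast, noise, and motion blur.
Details
Motivation: Segmenting live cells in brightfield microscopy is important in biomedical research, but existing methods struggle to keep both throughput and accuracy high under low contrast, noise, and motion blur. Contribution: A CNN pipeline built on a U-Net architecture with attention mechanisms, instance-aware systems, adaptive loss functions, and related techniques, which improves segmentation performance.
Method: Combines comparative analysis of frozen encoders, attention mechanisms, dynamic learning rates, progressive mechanisms, and an ensemble technique to train a robust model.
Result: Reaches 93% test accuracy and an average F1-score of 89% on a public dataset, and generalizes across imaging modalities.
Insight: The model performs well with little compute, which suits real laboratory deployment, and it generalizes despite limited exposure to phase-contrast images during training.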
Abstract: Live cell culture is crucial in biomedical studies for analyzing cell properties and dynamics in vitro. This study focuses on segmenting unstained live cells imaged with bright-field microscopy. While many segmentation approaches exist for microscopic images, none consistently address the challenges of bright-field live-cell imaging with high throughput, where temporal phenotype changes, low contrast, noise, and motion-induced blur from cellular movement remain major obstacles. We developed a low-cost CNN-based pipeline incorporating comparative analysis of frozen encoders within a unified U-Net architecture enhanced with attention mechanisms, instance-aware systems, adaptive loss functions, hard instance retraining, dynamic learning rates, progressive mechanisms to mitigate overfitting, and an ensemble technique. The model was validated on a public dataset featuring diverse live cell variants, showing consistent competitiveness with state-of-the-art methods, achieving 93% test accuracy and an average F1-score of 89% (std. 0.07) on low-contrast, noisy, and blurry images. Notably, the model was trained primarily on bright-field images with limited exposure to phase-contrast microscopy (<10%), yet it generalized effectively to the phase-contrast LIVECell dataset, demonstrating robustness across modalities and strong performance. This highlights its potential for real-world laboratory deployment across imaging conditions. The model requires minimal compute power and is adaptable using basic deep learning setups such as Google Colab, making it practical for training on other cell variants. Our pipeline outperforms existing methods in robustness and precision for bright-field microscopy segmentation. The code and dataset are available for reproducibility.
cs.LG [Back]
[92] GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Amirmohsen Sattarifard,Sepehr Lavasani,Ehsan Imani,Kunlin Zhang,Hanlin Xu,Fengyu Sun,Negar Hassanpour,Chao Gao
Main category: cs.LG
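TL;DR: The paper proposes GLASS (Global-Local Neural Importance Aggregation), which dynamically selects FFN units by combining prompt-local statistics with model-global statistics, improving LLM inference acceleration, especially on long-form generation.
Details
Motivation: Deploying LLMs on edge hardware requires efficient dynamic pruning, but static or predictor-based methods either fix a single sparsity pattern or add runtime overhead, and zero-shot methods perform poorly with short prompts or long generations. Contribution: A/I-GLASS, which combines prompt-local and model-global statistics to select FFN units dynamically, with no additional training or inference overhead.
Method: GLASS sparsifies the FFN dynamically through rank aggregation of global and local neuron importance. It comes in two variants, activation-based and impact-based, both combining prompt-local information with model-global statistics.
Result: Across multiple LLMs and benchmarks, GLASS significantly outperforms prior training-free pruning methods, particularly on long-form generation.
Insight: Dynamic pruning that combines local and global statistics adapts better to different tasks without extra training or runtime overhead, offering a route to efficient LLM deployment on edge devices.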
Abstract: Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail on short prompt and/or long generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
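The abstract describes a rank aggregation of prompt-local and model-global neuron statistics but not its exact form, so the sketch below substitutes a simple weighted mean of ranks. The `weight` parameter, the function names, and the toy scores are assumptions, not the paper's formulation.

```python
def ranks(scores):
    """Rank of each unit, where 0 is the most important (highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def select_units(local_scores, global_scores, keep, weight=0.5):
    """Keep the `keep` FFN units with the best weighted mean of local and global ranks."""
    local_ranks, global_ranks = ranks(local_scores), ranks(global_scores)
    combined = [weight * l + (1 - weight) * g
                for l, g in zip(local_ranks, global_ranks)]
    chosen = sorted(range(len(combined)), key=lambda i: combined[i])[:keep]
    return sorted(chosen)

# Toy importance scores for four FFN units.
local_scores = [0.9, 0.1, 0.5, 0.0]    # measured on the current prompt
global_scores = [0.2, 0.3, 0.7, 0.1]   # precomputed, prompt-independent
kept = select_units(local_scores, global_scores, keep=2)
```

Working on ranks rather than raw scores keeps the two statistics comparable even when they live on different scales, and the global term gives a usable fallback when a short prompt provides few local activations.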
[93] DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Shuaijie She,Yu Bao,Yu Lu,Lu Xu,Tao Li,Wenhao Zhu,Shujian Huang,Shanbo Cheng,Lu Lu,Yuxuan Wang
Main category: cs.LG
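TL;DR: DuPO is a preference optimization framework based on dual learning. It uses a generalized duality to produce annotation-free feedback for reliable LLM self-verification, improving performance on translation, mathematical reasoning, and other tasks.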
Details
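Motivation: Conventional RLVR depends on costly labels and applies only to verifiable tasks, while traditional dual learning applies only to strictly dual task pairs. DuPO aims to remove both limits with an annotation-free optimization framework that also covers tasks that are not strictly dual.
Contribution: Proposes the DuPO framework, which decomposes the task input and uses a self-supervised reward from the dual task to optimize the primal task, extending applicability to non-invertible tasks without requiring labels.
Method: The primal task's input is decomposed into known and unknown components. A dual task is constructed to reconstruct the unknown component, and the reconstruction quality serves as a self-supervised reward for optimizing the primal task. A single LLM instantiates both tasks.
Result: Average gains of 2.13 COMET over 756 translation directions and 6.4 points on three mathematical reasoning benchmarks, plus a 9.3-point gain when used as an inference-time reranker.
Insight: DuPO shows that LLMs can improve themselves through self-supervised dual tasks, offering a scalable and general approach to LLM optimization.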
Abstract: We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
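The known/unknown decomposition and the reconstruction reward can be made concrete with a toy example. In the sketch below the primal task is "add a known number to a hidden number", and the dual task recovers the hidden number from the primal output. In DuPO both tasks are carried out by an LLM; the arithmetic stand-ins and all function names here are hypothetical and serve only to show how a reconstruction reward can rank candidate outputs.

```python
def dual_reconstruct(primal_output, known):
    """Dual task for the toy primal task 'known + unknown': recover the unknown part."""
    return primal_output - known

def dual_reward(primal_output, known, unknown):
    """Self-supervised reward: negative reconstruction error of the unknown part."""
    return -abs(dual_reconstruct(primal_output, known) - unknown)

def rerank(candidates, known, unknown):
    """Pick the candidate primal output whose dual reconstruction is most faithful."""
    return max(candidates, key=lambda c: dual_reward(c, known, unknown))

# Three candidate answers to "7 + 5"; the hidden component is 5.
best = rerank([10, 12, 15], known=7, unknown=5)
```

No reference answer is consulted: the reward comes entirely from how well the hidden input component can be recovered, which is the sense in which the feedback is annotation-free.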
[94] STAS: Spatio-Temporal Adaptive Computation Time for Spiking Transformers
Donghwa Kang,Doohyun Kim,Sang-Ki Ko,Jinkyu Lee,Brent ByungHoon Kang,Hyeongboo Baek
Main category: cs.LG
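TL;DR: STAS proposes a spatio-temporal adaptive computation time framework that co-designs the static architecture and the dynamic computation policy. It addresses the high latency and computational overhead of SNNs, reducing energy consumption while improving accuracy.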
Details
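Motivation: Spiking neural networks (SNNs) are more energy-efficient than conventional artificial neural networks (ANNs), but their multi-timestep operation causes high latency and computational overhead. Existing methods do not address spatial and temporal redundancy in a unified way, so an integrated solution is needed.
Contribution: 1. Proposes the STAS framework, whose I-SPS module stabilizes the input representation across timesteps, resolving the mismatch that makes conventional ACT unsuitable for SNNs. 2. Designs the A-SSA module, which supports adaptive token pruning across both spatial and temporal dimensions to further reduce computation.
Method: 1. The I-SPS module unifies the input representation to ensure temporal stability. 2. The A-SSA module dynamically adjusts the computation path, pruning tokens along the spatial and temporal axes. 3. Energy efficiency and accuracy are validated on the CIFAR and ImageNet datasets.
Result: On CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while achieving higher accuracy than prior state-of-the-art models.
Insight: Co-designing the static architecture and the dynamic computation policy lets STAS handle spatio-temporal redundancy in SNNs efficiently, suggesting a direction for future energy-efficient computing.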
Abstract: Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead due to their multi-timestep operational nature. While various dynamic computation methods have been developed to mitigate this by targeting spatial, temporal, or architecture-specific redundancies, they remain fragmented. While the principles of adaptive computation time (ACT) offer a robust foundation for a unified approach, its application to SNN-based vision Transformers (ViTs) is hindered by two core issues: the violation of its temporal similarity prerequisite and a static architecture fundamentally unsuited for its principles. To address these challenges, we propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and dynamic computation policy. STAS introduces an integrated spike patch splitting (I-SPS) module to establish temporal stability by creating a unified input representation, thereby solving the architectural problem of temporal dissimilarity. This stability, in turn, allows our adaptive spiking self-attention (A-SSA) module to perform two-dimensional token pruning across both spatial and temporal axes. Implemented on spiking Transformer architectures and validated on CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while simultaneously improving accuracy over SOTA models.
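Token pruning driven by adaptive computation time can be sketched generically: each token accumulates a halting score layer by layer and is dropped from later layers once the total crosses a threshold. The code below is a plain ACT-style halting rule applied over a (timestep, token) grid, not the A-SSA module itself; the threshold `1 - eps`, the array layout, and the function name are assumptions.

```python
import numpy as np

def act_token_masks(halting_scores, eps=0.01):
    """Compute which tokens are still processed at each layer.

    halting_scores: array of shape (layers, timesteps, tokens) with values in [0, 1]
    Returns a boolean array of the same shape; True means the token is still active.
    """
    cumulative = np.zeros(halting_scores.shape[1:])
    masks = np.zeros(halting_scores.shape, dtype=bool)
    for layer, scores in enumerate(halting_scores):
        active = cumulative < 1.0 - eps          # tokens that have not halted yet
        masks[layer] = active
        cumulative = cumulative + np.where(active, scores, 0.0)
    return masks

# One timestep, two tokens, three layers: token 0 halts early, token 1 never halts.
scores = np.array([[[0.6, 0.1]], [[0.6, 0.1]], [[0.6, 0.1]]])
masks = act_token_masks(scores)
```

Because the mask covers both the timestep and token axes, a token can be pruned at some timesteps and kept at others, which is the two-dimensional pruning the abstract refers to.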
[95] Organ-Agents: Virtual Human Physiology Simulator via LLMs
Rihao Chang,He Jiao,Weizhi Nie,Honglin Guo,Keliang Xie,Zhenhua Wu,Lina Zhao,Yunpeng Bai,Yongtao Ma,Lanjun Wang,Yuting Su,Xi Gao,Weijie Wang,Nicu Sebe,Bruno Lepri,Bingwei Sun
Main category: cs.LG
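TL;DR: The paper proposes Organ-Agents, a multi-agent framework that uses large language models (LLMs) to simulate human physiological systems. With supervised fine-tuning on system-specific data followed by reinforcement-guided coordination, it achieves high simulation accuracy, holds up under external validation, is rated favorably by physicians, and supports clinical decision tasks.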
Details
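Motivation: Advances in large language models make it possible to simulate complex physiological systems. The authors aim to build a credible, interpretable, and generalizable digital twin for precision diagnosis and treatment simulation in critical care.
Contribution: 1. Proposes Organ-Agents, a multi-agent framework that simulates multiple human physiological systems.
2. Designs a training scheme that combines supervised fine-tuning with reinforcement learning, using dynamic reference selection and error correction.
3. Validates simulation accuracy and clinical utility on large-scale real patient data.
Method: 1. Each agent simulates a specific physiological system (e.g., cardiovascular, renal).
2. Supervised learning is combined with reinforcement learning, and coordination is optimized through dynamic reference selection and error correction.
3. Training uses high-resolution data from 7,134 sepsis patients and 7,895 controls.
Result: 1. On 4,509 held-out patients, the per-system mean squared error (MSE) is below 0.16.
2. External validation on ICU patients from two hospitals shows stable simulation with moderate degradation under distribution shift.
3. Counterfactual simulations produce trajectories consistent with matched real-world patients.
4. Fifteen critical care physicians confirmed realism and physiological plausibility (mean Likert ratings of 3.9 and 3.7).
Insight: 1. A multi-agent framework can efficiently simulate the dynamics of complex physiological systems.
2. Beyond high simulation accuracy, the model preserves decision-relevant patterns and supports downstream clinical tasks.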
Abstract: Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.
[96] Understanding Data Influence with Differential Approximation
Haoru Tan,Sitong Wu,Xiuzhe Wu,Wang Wang,Bo Zhao,Zeke Xie,Gui-Song Xia,Xiaojuan Qi
Main category: cs.LG
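TL;DR: The paper proposes Diff-In, a new method for quantifying data influence that approximates a sample's influence by accumulating its influence differences across consecutive training steps. Using second-order approximations, it improves accuracy without assuming model convexity, keeps computational complexity comparable to first-order methods, and performs well on several data-centric tasks.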
Details
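Motivation: Existing data influence analysis tools are not accurate enough, partly because of restrictive assumptions such as convexity of the loss function, which limits their usefulness for data utilization in model training. The paper aims for a more accurate and efficient way to quantify data influence.
Contribution: 1. Proposes Diff-In, which approximates sample influence by accumulating influence differences across consecutive training steps, without relying on model convexity. 2. Achieves computational complexity comparable to first-order methods while improving accuracy through second-order approximation. 3. Shows, both theoretically and empirically, that it outperforms existing influence estimators.
Method: Diff-In quantifies a sample's influence as the cumulative sum of its influence differences across consecutive training steps. The difference terms are computed with a second-order approximation that needs only the product of the Hessian and a gradient, which avoids the convexity requirement; this product is in turn approximated by finite differences of first-order gradients.
Result: Theoretical analysis shows that Diff-In's approximation error is significantly lower than that of existing methods. Experiments confirm its advantage in data cleaning, data deletion, and coreset selection, and it scales to data pruning for large-scale vision-language pre-training.
Insight: 1. Approximating influence by accumulating differences gives a more flexible and efficient solution. 2. With the right approximation, a second-order method can match the computational cost of first-order methods, which broadens its applicability. 3. Diff-In offers a practical route to data influence analysis for non-convex models.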
Abstract: Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
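The efficiency argument rests on a standard identity: a Hessian-vector product can be approximated from two gradient evaluations, Hv ≈ (∇L(θ + εv) − ∇L(θ)) / ε, so the Hessian is never formed. The sketch below shows only this building block on a quadratic loss where the exact answer is known; it is not the full Diff-In estimator, and the step size `eps` and function names are assumptions.

```python
import numpy as np

def hvp_finite_difference(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v using two first-order gradient evaluations."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

# Quadratic loss L(theta) = 0.5 * theta^T A theta, so grad = A @ theta and H = A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def grad_fn(theta):
    return A @ theta

theta = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
approx = hvp_finite_difference(grad_fn, theta, v)
exact = A @ v
```

The cost is one extra gradient evaluation per product, which is why a second-order estimator built this way can stay close to first-order cost. For non-quadratic losses the approximation error depends on `eps`.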
[97] Squeezed Diffusion Models
Jyotirmai Singh,Samar Khanna,James Burgess
Main category: cs.LG
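TL;DR: The paper proposes Squeezed Diffusion Models (SDM), which improve generation by scaling noise anisotropically in a data-dependent way instead of using conventional isotropic Gaussian noise. Experiments show that mild antisqueezing (increasing variance along the principal axis) noticeably improves generation quality.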
Details
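Motivation: Diffusion models usually inject isotropic Gaussian noise, ignoring structure in the data. Inspired by how quantum squeezed states redistribute uncertainty under the Heisenberg uncertainty principle, the authors explore data-dependent noise scaling to improve generative performance.
Contribution: Proposes Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component direction, and shows experimentally that this improves FID and recall.
Method: Two configurations are studied: (i) a Heisenberg diffusion model, which compensates scaling on the principal axis with inverse scaling on the orthogonal directions; and (ii) a standard SDM, which scales only the principal axis.
Result: On CIFAR-10/100 and CelebA-64, mild antisqueezing (increasing variance on the principal axis) improves FID by up to 15% and shifts the precision-recall frontier toward higher recall.
Insight: Simple data-aware noise shaping, with no architectural changes, can noticeably improve diffusion model generation; the benefit of antisqueezing is counterintuitive.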
Abstract: Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing - i.e. increasing variance on the principal axis - consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.
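The two configurations can be illustrated by rescaling isotropic noise along the first principal component of the data. This is a minimal sketch under stated assumptions: the abstract does not give the exact parameterization, so the multiplicative factor `scale`, the choice of `1 / scale` for the compensating orthogonal scaling, and the function names are all hypothetical.

```python
import numpy as np

def principal_axis(data):
    """Unit vector of the first principal component of data (n_samples, n_features)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def squeezed_noise(z, v, scale, compensate=False):
    """Rescale isotropic noise z (n, d) along the unit vector v.

    scale > 1 increases variance on the principal axis (antisqueezing),
    scale < 1 decreases it (squeezing). With compensate=True the orthogonal
    component is scaled by 1 / scale, an assumed form of the Heisenberg variant.
    """
    parallel = (z @ v)[:, None] * v[None, :]   # component of z along v
    orthogonal = z - parallel
    ortho_scale = 1.0 / scale if compensate else 1.0
    return scale * parallel + ortho_scale * orthogonal

v = np.array([1.0, 0.0])
z = np.array([[2.0, 3.0]])
standard = squeezed_noise(z, v, scale=2.0)                    # scales principal axis only
heisenberg = squeezed_noise(z, v, scale=2.0, compensate=True) # also shrinks orthogonal part
```

Setting `scale` slightly above 1 corresponds to the mild antisqueezing that the abstract reports as beneficial.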
eess.AS [Back]
[98] RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition
Pengcheng Wang,Sheng Li,Takahiro Shinozaki
Main category: eess.AS
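TL;DR: RAG-Boost improves an LLM-based speech recognition system with a retrieval-augmented generation module that retrieves audio-text pairs and domain terms on the fly to correct recognition errors.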
Details
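Motivation: LLM-based speech recognition systems can struggle with domain-specific terms or insufficient context. RAG-Boost addresses this by augmenting generation with on-the-fly retrieval.
Contribution: The main contribution is the RAG-Boost framework, which retrieves and fuses external knowledge in real time to improve the accuracy of LLM-based speech recognition.
Method: A retrieval-augmented generation (RAG) module lets each partial ASR hypothesis query a store of audio-text pairs and domain terms. The retrieved results are fused with the live hypotheses to correct errors, and the fused hypotheses are passed to the LLM.
Result: RAG-Boost is reported to improve recognition accuracy, particularly for domain-specific terms and limited-context scenarios.
Insight: Retrieval-augmented generation can compensate for the limits of LLMs in live speech recognition, and is best suited to scenarios that demand high accuracy and domain adaptation.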
Abstract: In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.
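The retrieve-then-fuse step can be sketched with a toy text-only store. The abstract describes a vector store of audio-text pairs and domain terms queried by partial ASR hypotheses; the character-bigram embedding, the cosine ranking, and the prompt layout below are illustrative stand-ins and not the system's actual components.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: character-bigram counts (a stand-in for a real encoder)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def retrieve(hypothesis, store, k=2):
    """Return the k store entries most similar to the partial hypothesis."""
    query = embed(hypothesis)
    return sorted(store, key=lambda entry: cosine(query, embed(entry)), reverse=True)[:k]

def fuse(hypothesis, retrieved):
    """Combine the live hypothesis with retrieved entries into one LLM input."""
    return f"Hypothesis: {hypothesis}\nRelated terms: {', '.join(retrieved)}"

store = ["retrieval augmented generation", "speech recognition", "weather forecast"]
top = retrieve("speach recognishun", store, k=1)
prompt = fuse("speach recognishun", top)
```

Here a misrecognized phrase still retrieves the correct domain term, which the LLM can then use to repair the hypothesis.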
cs.CR [Back]
[99] MultiFuzz: A Dense Retrieval-based Multi-Agent System for Network Protocol Fuzzing
Youssef Maklad,Fares Wael,Ali Hamdi,Wael Elsersy,Khaled Shaban
Main category: cs.CR
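TL;DR: MultiFuzz is a dense retrieval-based multi-agent system for network protocol fuzzing. It combines semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning to improve fuzzing effectiveness.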
Details
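Motivation: Traditional fuzzing techniques (such as AFL-based systems) are limited in their understanding of complex protocol grammars and in their seed mutation strategies. Recent work such as ChatAFL introduces large language models (LLMs) but still suffers from unreliable output, LLM hallucinations, and assumptions about the LLM's knowledge of protocol specifications.
Contribution: 1. Proposes MultiFuzz, a dense retrieval-based multi-agent fuzzing system. 2. Uses a retrieval-augmented generation (RAG) pipeline and agent collaboration to achieve semantic awareness of protocol documentation and structured output generation, improving protocol state coverage and adherence to syntactic constraints.
Method: 1. Embeddings of protocol documentation (e.g., RFC documents) are stored in a vector database to support the RAG pipeline. 2. The fuzzing process is decomposed into modular groups of agents that adapt fuzzing strategies dynamically through chain-of-thought reasoning.
Result: Experiments on the Real-Time Streaming Protocol (RTSP) show that MultiFuzz significantly improves branch coverage and protocol state exploration over existing fuzzers such as NSFuzz, AFLNet, and ChatAFL.
Insight: By combining dense retrieval, agent coordination, and language model reasoning, MultiFuzz provides a scalable and extensible foundation for autonomous protocol fuzzing and points to a direction for future research on agent-based fuzzing systems.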
Abstract: Traditional protocol fuzzing techniques, such as those employed by AFL-based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. But ChatAFL still faces issues like unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval-based multi-agent system designed to overcome these limitations by integrating semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval-augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain-of-thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real-Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state-of-the-art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic-based fuzzing systems.