Table of Contents

cs.CL

[1] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning cs.CL

Xiang Gao, Yuguang Yao, Qi Zhang, Kaiwen Dong, Avinash Baidya

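TL;DR: RIMRULE is a neuro-symbolic method based on dynamic rule injection that aims to make large language models more reliable at using domain-specific tools. It distills compact, interpretable rules from failure traces and injects them into the prompt at inference time to improve task performance. The rules are proposed by the LLM itself and consolidated with a Minimum Description Length (MDL) objective that favors generality and conciseness.

Details

Motivation: LLMs often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows, so effective adaptation to task-specific tools is needed.

Result: Experiments on tool-use benchmarks show that the method improves accuracy on both seen and unseen tools without modifying LLM weights, outperforms prompting-based adaptation methods, and complements finetuning. Rules learned from one LLM can also be reused to improve other LLMs, including long-reasoning LLMs, which highlights the portability of symbolic knowledge across architectures.

Insight: The novelty lies in a neuro-symbolic approach that uses MDL-guided rule learning to drive dynamic rule injection, improving both the adaptability and the interpretability of tool use. Viewed objectively, the reusability of the rules and their portability across architectures offer a new direction for adapting LLMs to tools.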

Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
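As a rough illustration of MDL-style consolidation, the sketch below scores a candidate rule set with a two-part cost: the length of the rules plus a penalty for every failure trace that no rule covers. It then greedily keeps the rules that lower the total. The cost function, the `miss_cost` weight, the word-count proxy for rule length, and the example rules are all assumptions made for illustration; the abstract does not specify RIMRULE's actual objective or search procedure.

```python
from typing import Dict, List, Set, Tuple


def description_length(selected: List[str],
                       coverage: Dict[str, Set[int]],
                       failures: Set[int],
                       miss_cost: float = 10.0) -> float:
    """Two-part cost: words spent on rules + penalty per uncovered failure."""
    rule_cost = sum(len(rule.split()) for rule in selected)
    covered = set().union(*(coverage[rule] for rule in selected))
    return rule_cost + miss_cost * len(failures - covered)


def consolidate(coverage: Dict[str, Set[int]],
                failures: Set[int],
                miss_cost: float = 10.0) -> Tuple[List[str], float]:
    """Greedily add the rule that most reduces the total description length."""
    selected: List[str] = []
    best = description_length(selected, coverage, failures, miss_cost)
    while True:
        scored = [
            (description_length(selected + [rule], coverage, failures, miss_cost), rule)
            for rule in coverage if rule not in selected
        ]
        if not scored:
            break
        cost, rule = min(scored)
        if cost >= best:  # no remaining rule pays for its own length
            break
        selected.append(rule)
        best = cost
    return selected, best


# Hypothetical candidate rules mapped to the failure traces they would fix.
coverage = {
    "always pass dates as ISO-8601 strings": {1, 2, 3},
    "pass the start date as an ISO-8601 string": {1},
    "call login before search": {4},
}
rules, cost = consolidate(coverage, failures={1, 2, 3, 4})
```

In this toy run the general date rule and the login rule are kept, while the narrower date rule is dropped because it adds length without covering any new failure. That is the generality-and-conciseness preference the abstract attributes to the MDL objective.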


[2] Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning cs.CL

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

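TL;DR: This paper proposes MetaJuLS, a meta-reinforcement learning method that learns universal constraint propagation policies to speed up structured inference for large language models. It formulates structured inference as adaptive constraint propagation and trains a Graph Attention Network that adapts quickly across languages and tasks without task-specific retraining.

Details

Motivation: LLMs increasingly need structured inference (e.g., JSON schema enforcement and multilingual parsing) in which outputs must satisfy complex constraints, yet existing methods often require time-consuming training for each task. The paper aims to develop a general, efficient constraint propagation method that reduces inference latency and carbon emissions.

Result: On Universal Dependencies (10 languages) and LLM-constrained generation tasks (LogicBench, GSM8K-Constrained), MetaJuLS is 1.5–2.0× faster than GPU-optimized baselines while staying within 0.2% of the accuracy of state-of-the-art parsers. The policy adapts to a new language or task in only 5–10 gradient steps (5–15 seconds), rather than hours of task-specific training.

Insight: The novelty lies in formalizing structured inference as an adaptive constraint propagation problem and training a universal policy with meta-reinforcement learning, which enables rapid cross-domain adaptation. The policy discovers human-like parsing strategies (such as easy-first) as well as novel, non-intuitive heuristics, and by reducing propagation steps it lowers the inference carbon footprint, contributing to Green AI.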

Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5–2.0$\times$ speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5–10 gradient steps (5–15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.


Yongmin Yoo, Kris W Pan

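TL;DR: This paper proposes Pat-DEVAL, a multi-dimensional evaluation framework dedicated to patent descriptions. The framework uses the LLM-as-a-judge paradigm and introduces Chain-of-Legal-Thought (CoLT), a legally constrained reasoning mechanism that enforces sequential, patent-law-specific analysis. Experiments on the Pap2Pat-EvalGold dataset show that Pat-DEVAL significantly outperforms baseline metrics and existing LLM evaluators.

Details

Motivation: Patent descriptions must provide comprehensive technical disclosure while meeting strict legal standards, yet existing evaluation methods cannot assess the long-form structural coherence and statutory compliance that such descriptions require.

Result: On the Pap2Pat-EvalGold dataset, with patent-expert validation, Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. On Legal-Professional Compliance, its correlation reaches 0.73.

Insight: The main contribution is Pat-DEVAL, the first multi-dimensional evaluation framework for patent descriptions, together with Chain-of-Legal-Thought (CoLT), a legally constrained reasoning mechanism that explicitly injects statutory constraints into the evaluation process. This injection is essential for capturing nuanced legal validity and provides a methodological foundation for the practical deployment of automated patent drafting systems.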

Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent expert on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.


[4] Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models cs.CL

Wang Xing, Wei Song, Siyu Lin, Chen Wu, Zhesi Li

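TL;DR: This paper proposes a knowledge distillation framework designed specifically for temporal knowledge graph (TKG) reasoning. It uses large language models as teachers to transfer structural and temporal reasoning capabilities to lightweight student models, addressing the large parameter counts, high computational cost, and difficulty of deployment on resource-constrained platforms that affect existing models.

Details

Motivation: Existing TKG reasoning models typically rely on large parameter sizes and intensive computation, which leads to high hardware costs and energy consumption and makes them hard to deploy on resource-constrained platforms that require real-time inference. In addition, most existing model compression and distillation techniques target static knowledge graphs and fail to capture temporal dependencies adequately, which degrades reasoning performance.

Result: Extensive experiments on multiple public benchmark datasets show that the method consistently outperforms strong baselines and achieves a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.

Insight: The novelty lies in tailoring the distillation framework to temporal knowledge graph reasoning and in using LLMs as teachers to integrate large-scale public knowledge with task-specific temporal information. This strengthens the lightweight student model's ability to model temporal dynamics while keeping its architecture compact and efficient.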

Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model’s ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.


[5] Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations cs.CL | cs.AI | cs.LG

Qianli Wang, Nils Feldhus, Pepa Atanasova, Fedor Splitt, Simon Ostermann

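TL;DR: This paper studies how quantization affects the self-explanation (SE) capability of large language models (LLMs). It finds that quantization typically causes small declines in SE quality and faithfulness, but does not fundamentally undermine its effectiveness as a model compression technique.

Details

Motivation: Quantization is widely used to accelerate LLM inference and deployment, but its effect on self-explanations (such as natural language explanations and counterfactual examples) has not been explored. Because the transparency that SEs provide is critical in high-stakes applications, it is necessary to assess whether quantization harms SE quality and faithfulness.

Result: Experiments show that quantization typically reduces SE quality by up to 4.4% and faithfulness by up to 2.38%. A user study further shows that quantization lowers the coherence and trustworthiness of SEs (by up to 8.5%). Compared with smaller models, larger models show limited resilience to quantization in SE quality but preserve faithfulness better, and no single quantization technique consistently excels across task accuracy, SE quality, and faithfulness.

Insight: The impact of quantization on SEs varies by context, so SE quality should be validated for each specific use case, especially for natural language explanations, which are more sensitive to quantization. Although quantization slightly degrades SE performance, it remains effective as a model compression technique, which offers a reference point for balancing efficiency and explainability in real deployments.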

Abstract: Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model’s own decision-making process, a capability that may exhibit particular sensitivity to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs: natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized using three common techniques at distinct bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). The user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in terms of SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Given that quantization’s impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which show greater sensitivity. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization’s effectiveness as a model compression technique.


[6] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining cs.CL

Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp

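TL;DR: This paper studies the role of mixed-language documents in multilingual LLM pretraining by comparing a standard web corpus with a monolingual-only version that removes all multilingual documents. Although bilingual data makes up only 2% of the corpus, removing it causes translation performance to drop by 56%, while performance on cross-lingual QA and reasoning tasks remains stable. Further analysis shows that parallel data (14% of the bilingual data) is critical for restoring translation performance, whereas code-switching data (72%) contributes very little.

Details

Motivation: To determine what bilingual data specifically contributes to multilingual LLM pretraining, in particular its differing effects on translation and on cross-lingual understanding tasks, and so to clarify the currently unclear picture of its role.

Result: On translation, removing bilingual data lowers BLEU by 56%. Reintroducing parallel data restores translation performance to 91% of the unfiltered baseline, while code-switching data contributes little. Cross-lingual QA and reasoning performance is largely unaffected by removing bilingual data.

Insight: Parallel data is critical for translation because it provides systematic token-level alignments, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data. This offers guidance for building multilingual pretraining corpora efficiently: prioritize parallel data over code-switching content.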

Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.


[7] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach cs.CL

Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng

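TL;DR: This paper proposes Geo-R, a retrieval-free, reinforcement-learning-based framework for vision-language geolocalization. It derives structured reasoning paths (Chain of Region) from ground-truth GPS coordinates and optimizes with a reward based on Haversine distance, yielding image geolocalization that is more accurate, more interpretable, and better at generalizing.

Details

Motivation: Existing vision-language geolocalization methods rely on synthetic reasoning annotations or external image retrieval, which limits interpretability and generalization. The paper aims to develop a retrieval-free method with direct supervision grounded in structured geographic reasoning.

Result: Experiments on multiple benchmarks show that Geo-R performs well in localization accuracy, generalization, and reasoning transparency, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization.

Insight: The contributions are: 1) the Chain of Region hierarchical reasoning paradigm, which maps GPS coordinates to geographic entities (e.g., country, province, city) and provides precise, interpretable supervision without model-generated or synthetic labels; 2) a lightweight reinforcement learning strategy that uses coordinate-aligned rewards based on Haversine distance to refine predictions through spatially meaningful feedback; 3) the combination of structured geographic reasoning with direct spatial supervision, which improves both localization performance and transparency.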

Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
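For readers unfamiliar with the distance underlying the reward, the sketch below computes the Haversine great-circle distance between two latitude/longitude pairs and turns it into a bounded reward. The Haversine formula itself is standard; the exponential reward shape and the `scale_km` parameter are assumptions for illustration, since the abstract does not give Geo-R's actual reward function.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against a rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred: tuple, target: tuple, scale_km: float = 1000.0) -> float:
    """Hypothetical reward: 1.0 at zero error, decaying smoothly with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)
```

A reward of this kind gives partial credit for near misses, which is one way a coordinate-aligned signal can be spatially meaningful rather than a binary right-or-wrong label.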


[8] Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset cs.CL | cs.AI

Alistair Plum, Laura Bernardy, Tharindu Ranasinghe

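TL;DR: This paper introduces judgeWEL, a named entity recognition (NER) dataset for Luxembourgish built with a novel pipeline that labels text automatically using Wikipedia and Wikidata as sources of weak supervision and then verifies the labels with large language models. The dataset is about five times larger than the existing Luxembourgish NER dataset and covers a broader, more balanced set of entity categories, making it an important resource for multilingual and low-resource NER research.

Details

Motivation: To address the dataset-construction bottleneck for low-resource languages such as Luxembourgish, including resource scarcity, high annotation cost, and potential inconsistency, by using weak supervision and LLM verification to reduce the need for human intervention.

Result: The resulting judgeWEL dataset is about five times the size of the existing Luxembourgish NER dataset, with broader and more balanced entity coverage. LLM verification improves label quality, and the dataset provides a substantial new resource for related research.

Insight: The novelty lies in combining Wikipedia internal links with Wikidata entries for weakly supervised labelling and using LLMs to filter noise and verify labels, forming an automated dataset construction pipeline that could be extended to other low-resource languages.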

Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.


[9] Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games cs.CL | cs.AI

Dimitris Vartziotis

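TL;DR: By contrasting the social constructivist view associated with language games with the author's proposed Semantic Field Theory, this paper explores how large language models (LLMs) provide a new empirical setting for testing theories of linguistic meaning. It formalizes lexical fields and linguistic fields as interacting structures in a continuous semantic space and analyzes how core properties of the Transformer architecture (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. The author argues that the success of LLMs in capturing semantic regularities supports the view that language has an underlying mathematical structure, while their limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use.

Details

Motivation: To use LLMs as an empirical setting for testing long-standing theories of linguistic meaning, in particular by contrasting the social constructivist language-games view with the mathematically oriented Semantic Field Theory, and thereby to clarify the scope and limits of purely statistical language models.

Result: The paper reports no quantitative experiments or benchmark results. Instead, it argues from theoretical analysis and empirical observations of LLMs that mathematical structure and social grounding are complementary in language understanding.

Insight: The novelty lies in formalizing lexical fields and linguistic fields as interacting structures in a continuous semantic space, connecting properties of the Transformer architecture to these theoretical concepts, and proposing a framework in which mathematical structure and language games are complementary perspectives. This suggests new directions for theoretically informed AI architecture design.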

Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures, such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces, relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.


[10] Rule-Based Approaches to Atomic Sentence Extraction cs.CL

Lineesha Kamana, Akshita Ananda Subramanian, Mehuli Ghosh, Suman Saha

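TL;DR: This study analyzes how complex sentence structures (such as relative clauses, adverbial clauses, coordination, and passive voice) affect the performance of rule-based atomic sentence extraction. It implements dependency-based extraction rules in spaCy on the WikiSplit dataset and evaluates their performance.

Details

Motivation: To address the lack of interpretability in existing LLM-based atomic sentence extraction methods and the difficulty of identifying which specific linguistic structures cause extraction failures.

Result: On the WikiSplit dataset, the system achieves ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment.

Insight: The novelty lies in a principled analysis of the specific clause structures and dependencies that make extraction difficult. It shows that rule-based methods are sensitive to syntactic complexity, with relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions being the challenging structures.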

Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the “split-and-rephrase” task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
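To make the ROUGE-1 F1 figure concrete, here is a simplified sketch of how a unigram-overlap F1 can be computed between an extracted sentence and a gold-standard one. It assumes lowercasing and whitespace tokenization only; standard ROUGE tooling applies its own tokenization and optional stemming, so this will not reproduce the paper's numbers exactly, and the function name is illustrative.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 with clipped counts (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, comparing "the cat sat" with "the cat slept" shares two of three tokens on each side, so precision and recall are both 2/3 and the F1 is 2/3.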


Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He

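TL;DR: This paper proposes a four-axis design framework for analyzing and comparing the retrieval-reasoning process in multi-hop question answering (QA) systems, and uses the framework to map existing systems and summarize trends.

Details

Motivation: The retrieval-reasoning process in current multi-hop QA systems (such as RAG and agentic methods) is often left implicit, and there is no explicit framework for comparison, which makes it hard to judge the procedural choices made by different model families.

Result: The paper maps representative systems and synthesizes reported ablations on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), summarizing recurring trade-offs among effectiveness, efficiency, and evidence faithfulness.

Insight: The novelty lies in taking the execution procedure as the unit of analysis and proposing a four-axis framework covering the overall execution plan, index structure, next-step control, and stop/continue criteria, which gives a structured view for system design and comparison.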

Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval–reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval–reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.


[12] InfoSynth: Information-Guided Benchmark Synthesis for LLMs cs.CL

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

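TL;DR: This paper proposes InfoSynth, a framework guided by information-theoretic principles for automatically generating and evaluating LLM reasoning benchmarks. It quantifies benchmark novelty and diversity with KL-divergence and entropy, and uses genetic algorithms and iterative code feedback to synthesize robust Python coding problems from seed datasets, producing accurate test cases and solutions 97% of the time while allowing control over problem difficulty and diversity.

Details

Motivation: Traditional benchmark creation relies on manual effort, is expensive, and is prone to contamination by LLM training data. New and diverse benchmarks need to be generated automatically in order to assess the genuine capabilities of LLMs accurately.

Result: The method generates accurate test cases and solutions for new problems 97% of the time. The synthesized benchmarks show higher novelty and diversity than their seed datasets, and the difficulty and diversity of generated problems can be controlled.

Insight: The novelty lies in using information-theoretic metrics (KL-divergence and entropy) to quantify benchmark quality automatically, combined with genetic algorithms and code feedback in an end-to-end pipeline. This provides a scalable, self-verifying way to generate high-quality benchmarks for LLM evaluation.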

Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
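The two information-theoretic quantities named in the abstract can be illustrated on a toy categorical view of a benchmark. The sketch below treats each problem as belonging to one topic label, measures diversity as the entropy of the label distribution, and measures novelty as the KL-divergence of the new benchmark from the seed set. The use of discrete topic labels, the add-one smoothing, and all names here are assumptions for illustration; the abstract does not describe how InfoSynth actually estimates these quantities.

```python
import math
from collections import Counter
from typing import Dict, List


def distribution(labels: List[str], support: List[str],
                 alpha: float = 1.0) -> Dict[str, float]:
    """Smoothed categorical distribution over `support` (alpha > 0 avoids zeros)."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {k: (counts[k] + alpha) / total for k in support}


def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy in nats; higher means a more diverse benchmark."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; larger means p departs more from q (more novel)."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Hypothetical topic labels for a seed set and a synthesized benchmark.
support = ["arrays", "strings", "graphs", "dp"]
seed = distribution(["arrays"] * 8 + ["strings"] * 2, support)
new = distribution(["arrays", "strings", "graphs", "dp"] * 3, support)
novelty = kl_divergence(new, seed)
diversity = entropy(new)
```

Because both quantities are computed from the problems alone, a measure of this kind needs no model evaluations, which matches the motivation given in the abstract.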


cs.CV

[13] TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model cs.CV

Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng

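TL;DR: TeleWorld is a real-time multimodal 4D world modeling framework. Through a closed-loop generation-reconstruction-guidance paradigm, it unifies video generation, dynamic scene reconstruction, and long-term world memory, with the goal of a world model that is dynamic, consistent, and interactive.

Details

Motivation: Current video generation models are limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, which prevents them from developing into practical world models. TeleWorld aims to address these limitations.

Result: Extensive experiments show that TeleWorld performs strongly in static and dynamic world understanding, long-horizon consistency, and real-time generation efficiency, making it a practical step toward interactive, memory-enabled world models.

Insight: The core contributions are the closed-loop generation-reconstruction-guidance paradigm and a real-time autoregressive diffusion video model that combines the hierarchical Macro-from-Micro Planning (MMPL) method with efficient Distribution Matching Distillation (DMD), which together integrate dynamic object modeling and static scene representation within a unified 4D framework.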

Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame-level to segment-level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.


[14] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark cs.CV

Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang

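TL;DR: This paper presents Spatial4D-Bench, a large-scale, multi-task benchmark for comprehensively evaluating the 4D spatial reasoning abilities of multimodal large language models (MLLMs). The benchmark contains about 40,000 question-answer pairs covering 18 well-defined tasks, systematically organized into six cognitive categories. Evaluating a range of state-of-the-art MLLMs reveals substantial limitations across many aspects of 4D spatial reasoning.

Details

Motivation: To examine how far MLLMs can reach human-level 4D spatial intelligence, and to address the small scale or limited diversity of existing spatial intelligence benchmarks.

Result: A range of state-of-the-art open-source and proprietary MLLMs were benchmarked on Spatial4D-Bench. The results show substantial shortcomings in many aspects of 4D spatial reasoning, such as route planning, action recognition, and physical plausibility reasoning.

Insight: The novelty lies in building a structured, comprehensive, large-scale benchmark for 4D spatial intelligence that systematically covers cognitive categories ranging from object understanding to spatiotemporal reasoning, offering a useful tool and insights for evaluating and developing more capable MLLMs.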

Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route plan, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.


[15] A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data cs.CV

Hyunho Lee, Wenwen Li

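TL;DR: This paper proposes SMAGNet, a Spatially Masked Adaptive Gated Network for post-flood water extent mapping using SAR and incomplete multispectral imagery (MSI). The model adaptively integrates partially available MSI data through feature fusion to improve mapping accuracy and robustness.

Details

Motivation: To address how MSI data that is incomplete or missing during the flood response stage can be fused adaptively with Synthetic Aperture Radar (SAR) data, so as to improve the accuracy and practicality of post-flood water extent mapping.

Result: Experiments on the C2S-MS Floods dataset show that SMAGNet outperforms other multimodal deep learning models in prediction performance across varying levels of MSI availability. Even when MSI data is completely missing, its performance remains comparable to that of a U-Net trained only on SAR data.

Insight: The novelty lies in a spatially masked adaptive gating mechanism that handles incomplete MSI inputs robustly, improving the robustness of multimodal deep learning to missing data and its applicability in real-world flood management scenarios.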

Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.


[16] Compressed Map Priors for 3D Perception cs.CV

Brady Zhou, Philipp Krähenbühl

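TL;DR: This paper proposes Compressed Map Priors (CMP), a framework that learns spatial priors from historic traversals to improve 3D perception for autonomous driving. It stores the map in a compressed binarized hashmap that requires only 32 KB/km², a 20× reduction compared with dense storage, and can be integrated into leading 3D perception systems at very low computational cost.

Details

Motivation: Current autonomous driving vision systems usually ignore information from past traversals and treat every observation as a first visit to the scene, even though most of the deployment area has in practice been traversed many times. Historic spatial priors should therefore be exploited to improve perception.

Result: On 3D object detection on the nuScenes dataset, the method yields significant and consistent improvements across several architectures.

Insight: The novelty lies in compressing historic traversal data into a lightweight binarized hashmap that serves as a spatial prior, strengthening 3D perception with very little storage and computational overhead and giving autonomous driving systems a reusable scene memory.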

Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.


[17] Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection cs.CVPDF

Lawrence Han

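TL;DR: This paper proposes GLASS (Global-Local Attention with Stratified Sampling), an architecture for detecting high-resolution AI-generated images. It combines a globally resized view with multiple randomly sampled local crops; the crops are original-resolution regions selected efficiently through spatially stratified sampling and aggregated with attention-based scoring. GLASS can be integrated into vision models to exploit both global and local information in images of any size. Experiments with Vision Transformer, ResNet, and ConvNeXt backbones show that GLASS achieves higher predictive performance than standard transfer learning within feasible computational constraints.

Details

Motivation: Rapid progress in generative AI is making AI-generated images increasingly realistic and high-resolution, yet existing detection architectures typically downsample images before feeding them to the model, which risks losing fine-grained details.

Result: With Vision Transformer, ResNet, and ConvNeXt backbones, GLASS achieves higher predictive performance than standard transfer learning within feasible computational constraints.

Insight: The novelty lies in combining a global view with multiple original-resolution local crops, selected by spatially stratified sampling and aggregated by attention-based scoring, which preserves the fine-grained details of high-resolution images and improves detection performance.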

Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
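
As a rough illustration of the two ingredients named in the abstract, the sketch below draws one original-resolution crop from each cell of a regular grid (one simple form of spatially stratified sampling) and then pools per-crop features with softmax attention weights. The grid layout, the mean-color stand-in for a backbone, and the scoring vector are assumptions for illustration; the paper's actual sampling and scoring modules may differ.

```python
import numpy as np

def stratified_crops(image, crop, grid, rng):
    """Draw one crop x crop patch from each cell of a grid x grid partition.

    Assumes the image is at least crop pixels high and wide.
    """
    h, w = image.shape[:2]
    crops = []
    for gy in range(grid):
        for gx in range(grid):
            # Random top-left corner inside the cell ...
            y = int(rng.integers(gy * h // grid, (gy + 1) * h // grid))
            x = int(rng.integers(gx * w // grid, (gx + 1) * w // grid))
            # ... clamped so the crop stays inside the image.
            y, x = min(y, h - crop), min(x, w - crop)
            crops.append(image[y:y + crop, x:x + crop])
    return np.stack(crops)

def attention_pool(features, score_vector):
    """Aggregate per-crop features with softmax attention weights."""
    scores = features @ score_vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ features, weights

rng = np.random.default_rng(0)
image = rng.random((1024, 768, 3))            # stand-in for a high-resolution image
crops = stratified_crops(image, crop=224, grid=4, rng=rng)
features = crops.mean(axis=(1, 2))            # mean color as a stand-in for backbone features
pooled, weights = attention_pool(features, rng.standard_normal(3))
```

Stratifying by grid cell guarantees that the crops cover the whole image instead of clustering in one area, while the attention weights let informative crops dominate the pooled representation.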


[18] FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications cs.CV | cs.AI | cs.CE | cs.MMPDF

Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu

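TL;DR: This paper introduces FCMBench-V1.0, a multimodal benchmark for the financial credit domain with 4,043 privacy-compliant images and 8,446 QA samples. The benchmark evaluates models along three dimensions (perception, reasoning, and robustness) and builds its data through a synthesis-capture pipeline to avoid privacy leakage and pre-training data contamination. Experiments on 23 advanced vision-language models show that the domain-specific model Qfin-VL-Instruct achieves the best overall performance, while every model loses robustness under real-world acquisition artifacts.

Details

Motivation: As multimodal AI becomes widely used for credit risk assessment and document review, there is an urgent need for a domain-specific benchmark that reflects the documents and workflows of financial credit, covers credit-specific understanding and real-world robustness, and balances privacy compliance with practical utility.

Result: Extensive experiments were run on 23 state-of-the-art vision-language models from 14 top AI companies and research institutes. Gemini 3 Pro achieves the best F1 score among commercial models (64.61%), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27%), and the authors' financial-credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92%). Robustness evaluations show that even the top-performing models suffer noticeable drops under acquisition artifacts.

Insight: The contribution is a large-scale, privacy-compliant financial credit multimodal benchmark aimed at real-world applications. Its core is a closed synthesis-capture pipeline (manually synthesized document templates and scenario-aware images captured in-house) that balances compliance with realism while avoiding the data leakage associated with web-sourced images. The benchmark design emphasizes decision-oriented understanding and stress testing under real acquisition artifacts, so it can separate models by both performance and robustness and offers a template for evaluating domain-specific models.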

Abstract: As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 – a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1(%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.


[19] Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions cs.CVPDF

Kaiwen Zheng, Junchen Fu, Songpei Xu, Yaoqing He, Joemon M. Jose

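TL;DR: This paper introduces a new problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, covering facial action units, emotional states, and age estimation, for arbitrarily selected face regions. The authors build a new multi-attribute description dataset and propose Focal-RegionFace, a model fine-tuned from Qwen2.5-VL that focuses on local facial features through multi-stage progressive fine-tuning and delivers interpretable age, action unit, and emotion detection. Experiments show that the model achieves the best performance on the new benchmark.

Details

Motivation: Fine-grained multi-attribute description generation and recognition for arbitrarily selected face regions is an underexplored problem in facial analysis. The work aims to improve a system's understanding and control by letting it focus on individual facial regions.

Result: On the proposed benchmark, Focal-RegionFace achieves the best performance on both traditional metrics and newly proposed metrics, supporting its effectiveness and versatility in fine-grained, multi-attribute, region-focused face analysis.

Insight: The novelty lies in applying a vision-language model (Qwen2.5-VL) to region-focused face analysis through a multi-stage progressive fine-tuning strategy, which yields interpretable multi-attribute description generation and recognition for local facial features such as action units, emotion, and age. Viewed objectively, the dataset with rich region-level annotations and the progressively focused fine-tuning method offer a new direction for fine-grained facial analysis.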

Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system’s ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose Focal-RegionFace, a vision-language model fine-tuned from Qwen2.5-VL for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.


[20] From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning cs.CV | cs.CLPDF

Omar Sharif, Eftekhar Hossain, Patrick Ng

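TL;DR: This paper studies how reinforcement learning can improve the visual reasoning of multimodal large language models. The authors find that, on tasks requiring precise visual perception (such as visual puzzles), the reasoning chains of existing models fail to integrate visual information, and that visual perception is the main bottleneck. They design six reward functions targeting different aspects of reasoning (such as image understanding, thinking steps, and answer accuracy) and use group relative policy optimization to encourage longer, structured reasoning that does not bypass visual information. Experiments on Qwen-2.5-VL-7B show gains in both in-domain and out-of-domain settings.

Details

Motivation: The reasoning chains of multimodal large language models lack integration of visual information, which hurts them on tasks that require precise visual perception, such as visual puzzles. The work aims to improve visual reasoning in these models.

Result: On Qwen-2.5-VL-7B, the method achieves a 5.56% improvement over the base model, with consistent gains in both in-domain and out-of-domain settings. In addition, converting images into textual descriptions significantly improves performance, with gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7.

Insight: The contributions are: reward-driven reinforcement learning to unlock long visual reasoning in open-source MLLMs without costly supervision; six reward functions targeting different aspects of reasoning; and group relative policy optimization to encourage structured reasoning and mitigate the bypassing of visual information. Viewed objectively, the method uses reinforcement learning to integrate visual information into the reasoning chain, offering a useful reference for improving visual perception and reasoning in multimodal models.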

Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
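
The core step of group relative policy optimization is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalization under the assumption that the per-aspect rewards are simply summed with equal weight; the paper's six reward functions and any weighting between them are not reproduced here, and the reward values are made up for illustration.

```python
import numpy as np

def group_relative_advantages(component_rewards, eps=1e-6):
    """Normalize summed rewards within one group of responses to the same prompt."""
    total = np.asarray(component_rewards, dtype=np.float64).sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)

# Four sampled responses, each scored by three hypothetical reward components
# (for example: image understanding, thinking steps, answer accuracy).
component_rewards = [
    [1.0, 0.5, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
adv = group_relative_advantages(component_rewards)
```

Responses that beat the group average receive positive advantages and are reinforced, while below-average responses are penalized, so no separate value network is needed to provide a baseline.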


[21] Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection cs.CV | cs.LG | cs.ROPDF

Chao Yang, Haoyuan Zheng, Yue Ma

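TL;DR: This paper targets the key bottleneck of infrared data scarcity in PCB infrared defect detection and proposes a cross-modal data augmentation framework that integrates CycleGAN and YOLOv8. CycleGAN translates abundant visible-light PCB images into the infrared domain to produce high-fidelity pseudo-infrared samples, which are combined with limited real infrared data to train a YOLOv8 detector, improving feature learning under low-data conditions.

Details

Motivation: Infrared data for PCB defect detection is scarce and paired supervision is hard to obtain. The work aims to overcome this bottleneck through cross-modal data augmentation.

Result: The method significantly outperforms models trained only on limited real data and approaches the performance of fully supervised training, supporting pseudo-infrared synthesis as a robust augmentation strategy for industrial inspection.

Insight: The novelty lies in using CycleGAN, which needs no paired supervision, for cross-modal image translation to generate pseudo-infrared data, and in a heterogeneous training strategy that fuses generated and real data, giving an effective solution for industrial visual inspection in low-data settings.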

Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
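
Unpaired translation with CycleGAN rests on a cycle-consistency term: an image mapped to the other domain and back should return to itself. The sketch below computes that L1 term for a visible-to-infrared generator `G` and an infrared-to-visible generator `F`; the toy lambdas stand in for trained networks and are purely illustrative, and the adversarial losses that CycleGAN also uses are omitted.

```python
import numpy as np

def cycle_consistency_loss(G, F, visible, infrared):
    """L1 cycle loss: visible -> IR -> visible and IR -> visible -> IR."""
    forward = np.abs(F(G(visible)) - visible).mean()
    backward = np.abs(G(F(infrared)) - infrared).mean()
    return forward + backward

rng = np.random.default_rng(0)
visible = rng.random((2, 8, 8))    # stand-in batch of visible-light images
infrared = rng.random((2, 8, 8))   # stand-in batch of unpaired infrared images

# Toy "generators": exact inverses give zero cycle loss.
G = lambda x: 2.0 * x
F = lambda y: y / 2.0
loss_consistent = cycle_consistency_loss(G, F, visible, infrared)

# A generator that is not inverted by F is penalized.
G_bad = lambda x: 2.0 * x + 1.0
loss_inconsistent = cycle_consistency_loss(G_bad, F, visible, infrared)
```

Because the loss only compares each image with its own reconstruction, the visible and infrared batches never need to show the same board, which is what allows training without paired data.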


[22] TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models cs.CVPDF

Kohei Yamamoto, Tomohiro Kikuchi

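TL;DR: This study proposes TotalFM, a radiological foundation model built on the concept of organ separation that efficiently learns the correspondence between 3D-CT images and linguistic expressions. By automating the creation of organ-volume and finding-sentence pairs and combining VideoMAE self-supervised pre-training with contrastive learning, the model generalizes better than CT-CLIP and Merlin in zero-shot organ-wise and finding-wise lesion classification, and performs on par with existing vision-language models in radiology report generation.

Details

Motivation: Training foundation models on 3D-CT volumetric data is computationally expensive. The work addresses this cost so that such models can support a range of clinical tasks.

Result: In zero-shot organ-wise lesion classification, the model achieves higher F1 scores than CT-CLIP in 83% of organs and higher than Merlin in 64% of organs. In zero-shot finding-wise lesion classification, it achieves higher AUROC than Merlin in 83% of finding categories. In radiology report generation, its performance is comparable to existing vision-language models.

Insight: The contributions are the organ-separated framework, automated generation of organ-volume and sentence pairs, and the combination of self-supervised pre-training with contrastive learning, which together give an efficient and well-generalizing design guideline for practical 3D-CT foundation models.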

Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
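
The abstract mentions contrastive learning on volume-text pairs without specifying the loss. A common choice for such pairs is a symmetric, CLIP-style InfoNCE objective, sketched below as an assumption rather than as the authors' exact formulation; the embedding size, batch size, and temperature are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(vol_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of vol_emb is paired with row i of txt_emb."""
    v = vol_emb / np.linalg.norm(vol_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # volume -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> volume
    return 0.5 * (loss_v2t + loss_t2v)

rng = np.random.default_rng(0)
organ_emb = rng.standard_normal((4, 16))   # stand-in organ-volume embeddings
aligned = clip_style_loss(organ_emb, organ_emb)                      # matching pairs
mismatched = clip_style_loss(organ_emb, np.roll(organ_emb, 1, axis=0))  # shuffled pairs
```

The loss is small when each organ volume is closest to its own finding sentence and large when the pairs are shuffled, which is the signal that pulls matching volume and text embeddings together.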


[23] S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding cs.CVPDF

He Wang, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan

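TL;DR: This paper introduces S1-MMAlign, a large-scale, multi-disciplinary dataset for scientific figure-text understanding, containing over 15.5 million high-quality image-text pairs extracted from 2.5 million open-access scientific papers. To address the weak alignment that is common in raw scientific captions, the paper proposes a semantic enhancement pipeline based on the Qwen-VL multimodal large models, which recaptions images by synthesizing paper abstracts and citation contexts and significantly improves data quality. The dataset is intended to advance research on scientific reasoning and cross-modal understanding.

Details

Motivation: Multimodal learning has transformed general-domain tasks, but its use in scientific discovery is held back by the deep semantic gap between complex scientific imagery and sparse textual descriptions. Existing scientific figure-text data is weakly aligned, so a high-quality dataset is needed to close this gap.

Result: Technical validation shows that the semantic enhancement pipeline significantly improves data quality: SciBERT-based pseudo-perplexity metrics indicate reduced semantic ambiguity, and CLIP scores indicate an 18.21% improvement in image-text alignment.

Insight: The main contributions are a large-scale, high-quality, cross-disciplinary scientific figure-text dataset and a semantic enhancement pipeline that uses multimodal large models (Qwen-VL) to generate richer image descriptions from paper abstracts and citation contexts, directly addressing weak figure-text alignment in the scientific domain. This provides a key foundational resource for scientific reasoning tasks in the era of AI for Science.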

Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.


[24] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering cs.CV | cs.AIPDF

Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu

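TL;DR: This paper proposes FaithSCAN, a lightweight network for detecting faithfulness hallucinations in visual question answering. It detects hallucinations by fusing internal signals of vision-language models (such as decoding uncertainty, intermediate visual representations, and cross-modal alignment features) and is trained with supervision signals generated automatically under the LLM-as-a-Judge paradigm. Experiments show that the method significantly outperforms existing approaches on multiple VQA benchmarks while being more efficient.

Details

Motivation: In VQA, vision-language models can produce fluent but visually ungrounded answers (faithfulness hallucinations). Existing detection methods suffer from high computational overhead, dependence on external resources, or a limited view of model uncertainty, which leaves them lacking in efficiency, robustness, and detection performance.

Result: Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both detection effectiveness and efficiency, reaching state-of-the-art results.

Insight: The novelty lies in single-pass detection that exploits the rich internal signals of a VLM, together with low-cost training on supervision signals generated automatically by an LLM. An objective analysis suggests that hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, that different internal signals provide complementary diagnostic cues, and that hallucination patterns vary across VLM architectures, which offers new insight into the causes of multimodal hallucination.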

Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.


[25] SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting cs.CVPDF

Jun-Jee Chao, Volkan Isler

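TL;DR: This paper proposes SV-GS, a framework for reconstructing the 4D motion of a dynamic target under sparse viewpoints and sparse temporal sampling. Taking a rough skeleton graph and an initial static reconstruction as inputs, the method optimizes a deformation field composed of a coarse skeleton joint pose estimator and a fine-grained deformation module, estimating motion while preserving geometric detail. On synthetic datasets it outperforms existing methods by up to 34% in PSNR, and on real-world datasets it matches dense monocular video methods while using far fewer frames.

Details

Motivation: In real-world scenarios such as security cameras, observations are sparse in both viewpoint and time, which makes dynamic reconstruction highly ill-posed.

Result: On synthetic datasets the method outperforms existing approaches by up to 34% in PSNR. On real-world datasets it achieves performance comparable to dense monocular video methods while using significantly fewer frames.

Insight: The novelty lies in using a rough skeleton as a motion prior to guide deformation field optimization under sparse observations, and in isolating the time dependence (only the joint pose estimator varies with time) to obtain smooth motion interpolation while preserving geometric detail. The paper also shows that the initial static reconstruction can be replaced by a diffusion-based generative prior, which makes the method more practical.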

Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
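
Because the headline result is reported in PSNR, it may help to recall how that metric is computed. The sketch below uses the standard definition, 10·log10(MAX² / MSE), and assumes images scaled to [0, 1]; the evaluation protocol in the paper (resolution, masking, averaging over frames) is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.full((4, 4), 0.1)    # stand-in ground-truth render
pred = np.zeros((4, 4))          # stand-in prediction, off by 0.1 everywhere
score = psnr(pred, target)       # MSE = 0.01, so the score is 20 dB
```

PSNR is logarithmic in the mean squared error, so a relative gain in PSNR corresponds to a much larger relative reduction in pixel error.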


[26] TimeColor: Flexible Reference Colorization via Temporal Concatenation cs.CVPDF

Bryan Constantine Sadihin, Yihao Meng, Michael Hua Wang, Matteo Jiahao Chen, Hang Su

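TL;DR: TimeColor is a sketch-based video colorization model that uses temporal concatenation to support heterogeneous, variable-count references (such as character sheets, background images, or arbitrary colorized frames). With explicit per-reference region assignment and spatiotemporal correspondence-masked attention, it improves color fidelity, identity consistency, and temporal stability.

Details

Motivation: Existing colorization models usually condition on a single reference frame (such as the first frame of a scene) and ignore other sources of conditioning data such as character sheets, background images, or arbitrary colorized frames, so the available reference information is underused.

Result: In experiments on the SAKUGA-42M dataset under both single-reference and multi-reference protocols, TimeColor outperforms prior baselines in color fidelity, identity consistency, and temporal stability.

Insight: The contributions include encoding references as additional latent frames that are concatenated temporally, so the model processes multiple references in parallel with a fixed parameter count, and combining spatiotemporal correspondence-masked attention with modality-disjoint RoPE indexing to mitigate shortcut learning and cross-identity palette leakage.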

Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model’s parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
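
The temporal-concatenation idea can be illustrated with plain arrays: reference latents are appended as extra frames along the time axis, so the same model handles any number of references without new parameters. The latent shapes and the helper name `concat_references` below are assumptions for illustration, and the masked attention and RoPE indexing described in the abstract are not modeled.

```python
import numpy as np

def concat_references(video_latents, ref_latents):
    """Append reference latents as extra frames along the time axis.

    video_latents: (T, C, H, W) latents of the sketch video
    ref_latents:   (R, C, H, W) latents of R reference images, R may vary
    Returns the (T + R, C, H, W) sequence and a boolean flag per frame
    marking which entries are references.
    """
    seq = np.concatenate([video_latents, ref_latents], axis=0)
    is_ref = np.arange(seq.shape[0]) >= video_latents.shape[0]
    return seq, is_ref

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 16, 16))
one_ref, flags_one = concat_references(video, rng.standard_normal((1, 4, 16, 16)))
three_refs, flags_three = concat_references(video, rng.standard_normal((3, 4, 16, 16)))
```

Only the sequence length changes between the one-reference and three-reference cases, which is why a model that already operates over a variable number of frames can accept a variable number of references.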


[27] ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition cs.CVPDF

Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Jinglong Guo, Qifan Cai

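TL;DR: This paper proposes ReMA, a training-free, plug-and-play data augmentation method for video behavior recognition. Through a representation alignment mechanism and a dynamic selection mechanism, it expands representations while preserving class-conditional stability, addressing representation drift under complex spatiotemporal variations.

Details

Motivation: Existing video data augmentation strategies are mostly perturbation-driven and often introduce uncontrolled variations that amplify non-discriminative factors, weaken intra-class distributional structure, and cause representation drift across temporal scales. ReMA is designed to address these problems.

Result: Extensive experiments on multiple video behavior recognition benchmarks show that ReMA consistently improves generalization and robustness across different spatiotemporal granularities.

Insight: The novelty lies in modeling mixing augmentation as a controlled replacement process, combining structured intra-class mixing under distributional alignment constraints with motion-aware spatiotemporal masks that localize perturbations, which improves representation robustness without additional supervision or trainable parameters.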

Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which ultimately weakens intra-class distributional structure and causes representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Second, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
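
To make "mixing as a controlled replacement process" concrete, the sketch below replaces only low-motion pixels of one clip with the corresponding pixels of a second clip from the same class. Treating high-motion regions as the discrimination-sensitive ones, the median threshold, and both function names are assumptions made for this illustration; the paper's RAM and DSM modules are more elaborate and are not reproduced here.

```python
import numpy as np

def low_motion_mask(clip, quantile=0.5):
    """Mark pixels whose mean absolute frame-to-frame change is at or below a quantile.

    clip: (T, H, W) grayscale video. Returns a boolean (H, W) mask.
    """
    motion = np.abs(np.diff(clip, axis=0)).mean(axis=0)
    return motion <= np.quantile(motion, quantile)

def replace_mix(clip_a, clip_b, mask):
    """Replace the masked pixels of clip_a with clip_b (same class, same shape)."""
    return np.where(mask[None], clip_b, clip_a)

# Toy clip: the left half is static, the right half changes every frame.
clip_a = np.zeros((4, 2, 4))
clip_a[:, :, 2:] = np.arange(4)[:, None, None]
clip_b = np.full((4, 2, 4), -1.0)   # stand-in for another clip of the same class

mask = low_motion_mask(clip_a)
mixed = replace_mix(clip_a, clip_b, mask)
```

In the toy clip, the moving right half is left untouched and only the static left half is replaced, so the motion cue that identifies the behavior survives the augmentation.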


[28] Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation cs.CVPDF

Bruce Mugizi, Sudi Murindanyi, Olivia Nakacwa, Andrew Katumba

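TL;DR: This study proposes a real-time intelligent traffic surveillance system suited to developing countries such as Uganda, using computer vision for vehicle detection, license plate recognition, and speed estimation. The system uses YOLOv8 for license plate detection, CNN and transformer models for character recognition, and predefined regions of interest for speed estimation. It also maintains a database and issues automated tickets by SMS through an API.

Details

Motivation: Speeding is a serious problem in developing countries where road safety infrastructure is limited. The work aims to strengthen traffic management and reduce road accidents through automated traffic surveillance.

Result: License plate detection reaches an mAP of 97.9%. For character recognition, the CNN model has a CER of 3.85%, which the transformer model reduces significantly to 1.79%. Speed estimation error is within 10 km/h.

Insight: The work tailors an end-to-end real-time surveillance system to resource-constrained environments, integrating object detection, character recognition, and speed estimation modules and using a local API to automate enforcement, which offers a feasible technical solution for similar regions.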

Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a margin of error of 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through the Africa’s Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
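
Two of the quantities reported above have simple definitions that a short sketch can make concrete. Character error rate is the edit distance between the predicted and true plate strings divided by the length of the true string, and a region-of-interest speed estimate divides a known distance by the time a vehicle takes to cover it. The plate strings, distance, frame count, and frame rate below are hypothetical, and the abstract does not say how the system measures distance or time between its source and target regions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def cer(pred: str, truth: str) -> float:
    """Character error rate of a predicted plate against the true plate."""
    return edit_distance(pred, truth) / len(truth)

def speed_kmh(distance_m: float, frames: int, fps: float) -> float:
    """Speed from the distance between two regions and the frames taken to cross it."""
    seconds = frames / fps
    return distance_m / seconds * 3.6   # m/s -> km/h

plate_cer = cer("UBA123X", "UBA128X")             # one wrong character out of seven
speed = speed_kmh(distance_m=20.0, frames=30, fps=30.0)
```

With a fixed camera, the distance between the two regions only has to be measured once, after which every vehicle's speed follows from counting frames.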


[29] OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning cs.CVPDF

Liuxiang Qiu, Hui Da, Yuzhen Niu, Tiesong Zhao, Yang Cao

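TL;DR: This paper proposes OmniVaT, a framework for single domain generalization in multimodal visual-tactile learning. A multimodal fractional Fourier adapter maps visual and tactile embeddings into a unified embedding-frequency space to mitigate the modality gap, and a discrete tree generation module produces diverse and reliable multimodal fractional representations to improve adaptation to unseen domains.

Details

Motivation: Visual-tactile learning suffers from modality discrepancies and from domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. The paper formalizes these challenges as single domain generalization for multimodal visual-tactile learning and aims to improve generalization to unseen domains.

Result: Extensive experiments show that OmniVaT achieves superior cross-domain generalization on the SDG-VTL task, although the abstract does not name the benchmarks or state whether the results are state of the art.

Insight: The contributions are the first formulation and solution of the SDG-VTL task, a multimodal fractional Fourier adapter that unifies the embedding space to mitigate the modality gap, and a discrete tree generation module whose hierarchical tree structure improves adaptation to domain shifts. These components remove the need for multi-domain training data or elaborate cross-modal fusion strategies.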

Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.


[30] BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition cs.CVPDF

Seungyeon Cho, Tae-kyun Kim

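TL;DR: This paper proposes BHaRNet, a probabilistic dual-stream framework for fine-grained skeleton action recognition. Through reliability modeling and multi-modal integration, it handles uncertainty within the skeleton and across modalities in a unified way, using three key components: calibration-free preprocessing, probabilistic Noisy-OR fusion, and a cross-modal ensemble.

Details

Motivation: Most existing skeleton action recognition methods are body-centric and overlook the subtle hand articulations that matter for fine-grained recognition. They also lack a unified treatment of uncertainty and multi-modal integration.

Result: The method shows consistent improvements and robustness on multiple benchmarks, including NTU RGB+D 60/120, PKU-MMD, and N-UCLA, as well as on a newly defined hand-centric benchmark, particularly under noisy and heterogeneous conditions.

Insight: The contributions are calibration-free preprocessing that learns directly from native coordinates, probabilistic Noisy-OR fusion that stabilizes dual-stream learning without explicit confidence supervision, and a cross-modal ensemble that couples four skeleton modalities with RGB representations.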

Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
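
The Noisy-OR rule named in the second component combines independent probabilistic evidence as p = 1 − ∏ₖ (1 − pₖ): a class is supported if at least one stream supports it. The sketch below applies the textbook rule to per-class probabilities from a body stream and a hand stream; how the paper integrates this rule into training and handles reliability is not described in the abstract, and the probability values are made up.

```python
import numpy as np

def noisy_or(*stream_probs):
    """Textbook Noisy-OR fusion of per-class probabilities from several streams."""
    probs = np.stack([np.asarray(p, dtype=np.float64) for p in stream_probs])
    return 1.0 - np.prod(1.0 - probs, axis=0)

body_stream = [0.2, 0.9]   # hypothetical per-class probabilities from the body expert
hand_stream = [0.5, 0.1]   # hypothetical per-class probabilities from the hand expert
fused = noisy_or(body_stream, hand_stream)
# class 0: 1 - 0.8 * 0.5 = 0.60, class 1: 1 - 0.1 * 0.9 = 0.91
```

A confident stream cannot be vetoed by an uncertain one under this rule, which suits cases where a fine-grained action is visible only in the hands or only in the body.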


[31] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos cs.CVPDF

Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang

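TL;DR: This paper proposes NeoVerse, a versatile 4D world model capable of 4D reconstruction, novel-trajectory video generation, and a rich set of downstream applications. Its core design philosophy is to make the full pipeline scalable to diverse in-the-wild monocular videos, overcoming the scalability limits of existing methods that depend on expensive multi-view 4D data or cumbersome training pre-processing.

Details

Motivation: Current 4D world modeling methods scale poorly because they rely on expensive, specialized multi-view 4D data or on cumbersome training pre-processing. The work aims to build a general model that can use diverse in-the-wild monocular videos.

Result: The model achieves state-of-the-art performance on standard reconstruction and generation benchmarks.

Insight: The contributions include pose-free feed-forward 4D reconstruction, online simulation of monocular degradation patterns, and other carefully designed techniques, which give the model versatility and generalization across domains. The core idea is to achieve scalable 4D modeling from readily available in-the-wild monocular videos.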

Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io


[32] RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection cs.CVPDF

Tao Wu, Qing Xu, Xiangjian He, Oakleigh Weekes, James Brown

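TL;DR: This paper introduces RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, with over 11k annotated images covering diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. The authors benchmark a range of modern detectors, from transformer-based architectures to real-time YOLO models, and analyze their strengths and weaknesses on this challenging task. CO-DETR and related transformer models achieve the best localization accuracy, while real-time models are limited by coarse feature hierarchies. The dataset is intended to establish a benchmark for extreme small-object detection in dynamic driving scenes and to support scalable, low-cost roadside litter monitoring systems.

Details

Motivation: Roadside litter monitoring relies on manual surveys and public reporting and has limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes, or aquatic environments and do not reflect dashcam footage, where litter is extremely small, sparse, and embedded in cluttered backgrounds.

Result: A range of detectors was benchmarked on RoLID-11K. CO-DETR and related transformer architectures achieve the best localization accuracy, while real-time models such as YOLO are limited by coarse feature hierarchies. The dataset itself establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes.

Insight: The novelty lies in creating the first large-scale roadside litter detection dataset from the dashcam perspective, targeting extremely small and sparse objects; its long-tail and small-object distributions pose distinctive challenges for detection algorithms. An objective analysis suggests that the work both provides a valuable benchmark resource and, through systematic evaluation, exposes the core bottleneck (the feature hierarchy) of current detectors, especially real-time models, on extreme small-object scenes, which points to directions for future algorithmic improvement.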

Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.


[33] CPPO: Contrastive Perception for Vision Language Policy Optimization cs.CVPDF

Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain

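TL;DR: This paper proposes CPPO (Contrastive Perception Policy Optimization), a method for fine-tuning vision-language models (VLMs). It identifies perception tokens by detecting changes in the model's output entropy under perturbed input images and augments the RL objective with a Contrastive Perception Loss (CPL), improving perception in multimodal reasoning without extra models.

Details

Motivation: Prior work mainly relies on explicit perception rewards to improve multimodal reasoning, but it struggles to separate perception tokens from reasoning tokens and often needs an extra LLM, ground-truth data, or forced separation, which is inefficient. CPPO addresses this by using entropy-shift detection and a contrastive loss to train perception and reasoning jointly.

Result: Experiments show that CPPO surpasses previous perception-rewarding methods without extra models and makes training more efficient and scalable. The abstract does not name the specific benchmarks.

Insight: The novelty lies in automatically detecting perception tokens through entropy shifts under perturbed input images and in a contrastive perception loss that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. This avoids dependence on external models or complex reward design and improves training efficiency.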

Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
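
The entropy-shift idea in the abstract can be illustrated with a small sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the function names, the use of the absolute entropy change, and the fixed `threshold` are hypothetical, and the per-token distributions are toy values rather than real model outputs.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def perception_token_mask(clean_dists, perturbed_dists, threshold=0.5):
    """Flag token positions whose entropy changes by more than `threshold`
    when the input image is perturbed (hypothetical decision rule)."""
    shifts = [abs(entropy(q) - entropy(p))
              for p, q in zip(clean_dists, perturbed_dists)]
    return [s > threshold for s in shifts]

# Toy example: three token positions over a four-word vocabulary.
clean = [[0.97, 0.01, 0.01, 0.01],
         [0.70, 0.10, 0.10, 0.10],
         [0.25, 0.25, 0.25, 0.25]]
perturbed = [[0.25, 0.25, 0.25, 0.25],   # confident token becomes uncertain
             [0.70, 0.10, 0.10, 0.10],   # unchanged
             [0.25, 0.25, 0.25, 0.25]]   # unchanged
mask = perception_token_mask(clean, perturbed)
```

Only the first position is flagged here, because it is the only one whose distribution depends on the (simulated) image content.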


[34] MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation cs.CV | cs.AI | cs.GRPDF

Miaowei Wang, Jakub Zadrożny, Oisin Mac Aodha, Amir Vaxman

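TL;DR: MotionPhysics is an end-to-end differentiable framework that infers plausible physical parameters from a natural language prompt to drive dynamic simulation of a 3D scene. It uses a multimodal large language model to estimate material parameters and a learnable motion distillation loss to extract motion priors from pretrained video diffusion models, producing realistic physical simulations without ground-truth trajectories or annotated videos.

Details

Motivation: Simulating 3D objects and materials traditionally requires expert knowledge and time-consuming parameter tuning. The goal is to infer physical parameters automatically from natural language prompts, lowering the barrier to simulation.

Result: Evaluated on more than thirty scenarios (real-world, human-designed, and AI-generated 3D objects) covering elastic solids, metals, foams, sand, and Newtonian and non-Newtonian fluids, the method produces visually realistic simulations, surpasses the state of the art (SOTA), and automatically determines physically plausible parameters.

Insight: The contributions are parameter estimation with a multimodal large language model, a learnable motion distillation loss that minimizes appearance and geometry inductive biases, and end-to-end language-guided physical simulation without ground-truth guidance.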

Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.


[35] All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations cs.CVPDF

Wenrui Li, Hongtao Chen, Yao Xiao, Wangmeng Zuo, Jiantao Zhou

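TL;DR: This paper proposes a video restoration method for smoothly evolving unknown weather degradations. Existing all-in-one restoration methods typically overlook the temporal continuity of real-world degradation when applied to video. The authors introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario and propose ORCANet, a recurrent conditional and adaptive prompting network that uses a Coarse Intensity Estimation Dehazing module and a Flow Prompt Generation module to handle degradation types and intensities that change smoothly over time or coexist.

Details

Motivation: Existing all-in-one video restoration methods focus mainly on frame-wise degradation variation and overlook the temporal continuity that naturally exists in real degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually, which poses a distinct challenge.

Result: Extensive experiments show that ORCANet outperforms image-based and video-based baselines in restoration quality, temporal consistency, and robustness.

Insight: The paper formally defines the SEUD scenario and provides a flexible synthesis pipeline for generating temporally coherent degraded videos. ORCANet combines a Coarse Intensity Estimation Dehazing module based on physical priors with a Flow Prompt Generation module that produces both static prompts capturing segment-level degradation types and dynamic prompts adapting to frame-level intensity changes; a label-aware supervision mechanism improves the discriminability of static prompt representations across degradations.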

Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.


[36] DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction cs.CVPDF

Jiacheng Sui, Yujie Zhou, Li Niu

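TL;DR: This paper proposes DynaDrag, a dynamic drag-style image editing method built on a predict-and-move framework. It iterates motion prediction and motion supervision to avoid the missed and ambiguous tracking of the conventional move-and-track framework, and dynamically adjusts the valid handle points to improve performance.

Details

Motivation: Existing drag-style editing methods suffer from missed and ambiguous tracking under the move-and-track framework, while methods under other frameworks suffer from a large gap between the source and target images and from unreasonable intermediate points, which lowers editability.

Result: Experiments on face and human datasets show that the method outperforms previous work; the abstract gives no quantitative metrics and does not state whether it reaches SOTA.

Insight: The novelty is the first use of a predict-and-move framework, combining iterative motion prediction with motion supervision and a mechanism that dynamically adjusts the valid handle points, which improves editing accuracy and robustness.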

Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow the move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable and challenging issues. Other methods under different frameworks suffer from various problems, such as the huge gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under the predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve the performance. Experiments on face and human datasets showcase the superiority over previous works.


[37] A Comprehensive Dataset for Human vs. AI Generated Image Detection cs.CV | cs.AIPDF

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh

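TL;DR: The paper releases MS COCOAI, a new dataset for AI-generated image detection with 96,000 real and synthetic datapoints, built on MS COCO and using five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Two tasks are defined on the dataset: classifying images as real or generated, and identifying which model produced a given synthetic image.

Details

Motivation: As multimodal generative AI systems such as Stable Diffusion, DALL-E, and MidJourney make synthetic images harder to distinguish from photographs, detecting them has become urgent in order to curb the spread of misleading content, false information, and manipulated media.

Result: The abstract reports no quantitative results or benchmark performance; it releases the dataset and defines the classification and model-identification tasks.

Insight: The contribution is MS COCOAI, a large-scale, multi-generator dataset for AI-generated image detection, together with two detection tasks, giving the research community a standardized evaluation resource for advancing generated-image detection.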

Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.


[38] AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Mulitmodal Models cs.CVPDF

Jintao Lin, Bowen Dong, Weikang Shi, Chenyang Lei, Suiyun Zhang

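TL;DR: This paper proposes AEGIS, a comprehensive multi-task benchmark for evaluating how well Unified Multimodal Models (UMMs) apply world knowledge. It covers visual understanding, generation, editing, and interleaved generation, with 1,050 manually annotated questions spanning 21 topics and 6 reasoning types. The authors also propose Deterministic Checklist-based Evaluation (DCE) to make evaluation more reliable. Experiments show that most UMMs have severe world knowledge deficits and that performance degrades significantly under complex reasoning, while simple plug-in reasoning modules partially mitigate these problems.

Details

Motivation: Existing benchmarks fall short in evaluating how UMMs apply world knowledge across tasks; they typically offer isolated single-task evaluations with limited diagnostic power. The paper fills this gap with a comprehensive multi-task benchmark for systematically evaluating the world knowledge capabilities of UMMs.

Result: Extensive experiments on AEGIS show that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly as reasoning complexity increases. Simple plug-in reasoning modules partially mitigate these weaknesses.

Insight: The paper contributes a comprehensive multi-task benchmark (AEGIS) and a new evaluation protocol (DCE) for more reliable assessment of UMMs' world knowledge. Viewed objectively, the study highlights world-knowledge-based reasoning as a critical frontier for UMMs and points to directions for future work, such as strengthening reasoning with plug-in modules.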

Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
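
To make the checklist idea concrete, the following is a minimal sketch of how atomic Y/N judgments could be turned into a score. The abstract does not say how DCE aggregates its judgments, so the unweighted fraction used here and the function name `checklist_score` are assumptions for illustration only.

```python
def checklist_score(judgments):
    """Return the fraction of checklist items judged 'Y'.

    Hypothetical aggregation: each item is an atomic 'Y' or 'N' and all
    items count equally. Any other value is rejected so that ambiguous
    free-form scores cannot slip in.
    """
    if not judgments:
        raise ValueError("checklist must contain at least one item")
    for j in judgments:
        if j not in ("Y", "N"):
            raise ValueError(f"non-atomic judgment: {j!r}")
    return sum(j == "Y" for j in judgments) / len(judgments)

# Four checklist items for one model response, three of them satisfied.
score = checklist_score(["Y", "Y", "N", "Y"])
```

Restricting inputs to two values is what makes such a score deterministic given the judgments; a weighted variant would need per-item weights that the abstract does not describe.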


[39] GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval cs.CVPDF

Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

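TL;DR: This paper proposes GranAlign, a training-free framework that addresses the mismatch in semantic granularity between text queries and video content in zero-shot video moment retrieval. Using two complementary techniques, granularity-based query rewriting and query-aware caption generation, it pairs multi-level queries with both query-agnostic and query-aware captions, bridging coarse and fine semantic representations.

Details

Motivation: The main challenge in zero-shot video moment retrieval is the mismatch in semantic granularity between textual queries and visual content. Existing methods leverage high-quality pre-trained knowledge but fail to balance the semantic granularity of the knowledge each modality provides for a given scene, which leads to inaccurate retrieval.

Result: The method sets a new state of the art on three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.

Insight: The novelty is a training-free, granularity-aware alignment framework that adjusts semantic granularity through query rewriting and query-aware caption generation, resolving cross-modal granularity mismatch. Viewed objectively, it is a zero-shot approach that uses generation to strengthen cross-modal alignment and avoids model finetuning.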

Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality’s representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.


[40] Modality Dominance-Aware Optimization for Embodied RGB-Infrared Perception cs.CVPDF

Xianhui Liu, Siqi Jiang, Yi Xie, Yuqing Lin, Siao Liu

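TL;DR: This paper proposes Modality Dominance-Aware Cross-modal Learning (MDACL) to address the optimization bias caused by asymmetric modality characteristics in RGB-Infrared (RGB-IR) multimodal perception. It introduces the Modality Dominance Index (MDI) to quantify modality dominance, and designs Hierarchical Cross-modal Guidance (HCG) and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics and improve cross-modal fusion.

Details

Motivation: RGB-IR multimodal perception is fundamental to embodied systems operating in complex physical environments, yet existing cross-modal fusion methods overlook the optimization bias caused by disparities in information density and feature quality. Training then overemphasizes the dominant modality, which hinders effective fusion.

Result: Extensive experiments on three RGB-IR benchmarks show that MDACL effectively mitigates optimization bias and achieves state-of-the-art (SOTA) performance.

Insight: The novelty is the Modality Dominance Index (MDI) for quantifying modality dominance and the MDACL framework built on it, which uses HCG to enhance feature alignment and AER to balance optimization dynamics, offering a systematic treatment of optimization bias in multimodal fusion.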

Abstract: RGB-Infrared (RGB-IR) multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments. Although recent cross-modal fusion methods have advanced RGB-IR detection, the optimization dynamics caused by asymmetric modality characteristics remain underexplored. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion. To quantify this phenomenon, we propose the Modality Dominance Index (MDI), which measures modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, we develop a Modality Dominance-Aware Cross-modal Learning (MDACL) framework that regulates cross-modal optimization. MDACL incorporates Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion. Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves SOTA performance.


[41] Noise-Robust Tiny Object Localization with Flows cs.CV | cs.AIPDF

Huixin Sun, Linlin Yang, Ronyu Chen, Kerui Gu, Baochang Zhang

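TL;DR: This paper proposes TOLF, a noise-robust tiny object localization framework that addresses the performance loss caused by annotation noise in tiny object detection. It uses normalizing flows for flexible error modeling and an uncertainty-guided optimization strategy that suppresses learning from noisy samples, improving the robustness of tiny object localization.

Details

Motivation: Despite significant advances in generic object detection, a performance gap remains between tiny and normal-scale objects. Tiny objects are highly sensitive to annotation noise, and optimizing strict localization objectives risks overfitting to that noise, so a robust method is needed for localization under noisy supervision.

Result: Extensive experiments on three datasets validate the method. On the AI-TOD dataset in particular, TOLF improves the DINO baseline by 1.2% AP.

Insight: The novelty is the use of normalizing flows to model complex, non-Gaussian localization error distributions for tiny objects, together with an uncertainty-aware gradient modulation mechanism that suppresses the gradient contribution of high-uncertainty (likely noisy) samples, giving robust learning and stable training under noisy supervision. This offers a probabilistic modeling and optimization approach to annotation noise in tiny object localization.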

Abstract: Despite significant advances in generic object detection, a persistent performance gap remains for tiny objects compared to normal-scale objects. We demonstrate that tiny objects are highly sensitive to annotation noise, where optimizing strict localization objectives risks noise overfitting. To address this, we propose Tiny Object Localization with Flows (TOLF), a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization. Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision. An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training. Extensive experiments across three datasets validate our approach’s effectiveness. Especially, TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset.


[42] RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation cs.CVPDF

Junxiao Xue, Pavel Smirnov, Ziao Li, Yunyun Shi, Shi Chen

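TL;DR: RePose is a real-time 3D human pose estimation and motion analysis method for rehabilitation training. From multi-camera RGB video it monitors and evaluates patients' rehabilitation movements in real time and provides immediate feedback and guidance. It integrates fast tracking, a modified SmoothNet for pose estimation, and real-time monitoring with muscle stress visualization on the Unity platform.

Details

Motivation: Rehabilitation training needs real-time monitoring, evaluation, and guidance of patient movements. The work uses computer vision to provide accurate, real-time pose estimation and biomechanical analysis that helps patients perform exercises correctly and regain muscle function.

Result: In medical rehabilitation scenarios with multiple-person interference, the proposed fast tracking method takes less than 1 ms per frame; the modified SmoothNet reduces pose estimation errors and yields smoother, more faithful motion; the overall framework runs in real time.

Insight: The contributions are: 1) a unified end-to-end pipeline for real-time 3D pose estimation and motion analysis designed for rehabilitation training; 2) a very fast tracking method for scenes with multiple-person interference; 3) a modified SmoothNet that improves the accuracy and smoothness of real-time pose estimation; 4) Unity integration for real-time monitoring and muscle stress visualization, which makes rehabilitation assistance more interactive.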

Abstract: We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients’ motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients’ actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1ms for tracking for a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient’s true motion state, making it visually smoother. Finally, we use Unity platform for real-time monitoring and evaluation of patients’ motion during rehabilitation, and to display the muscle stress conditions to assist patients with their rehabilitation training.


[43] HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis cs.CV | cs.LGPDF

Shuren Gabriel Yu, Sikang Ren, Yongji Tian

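TL;DR: This paper proposes HyperPriv-EPN, a hypergraph-based learning framework for preoperative prognosis of ependymoma. It treats post-operative surgical reports as privileged information and, through dual-stream distillation, teaches a Student graph (restricted to preoperative MRI data) to reproduce semantic community structures from visual features, enabling accurate prognosis without text input at inference.

Details

Motivation: Preoperative prognosis of ependymoma is critical for treatment planning, but preoperative MRI alone lacks semantic information, and existing multimodal methods cannot leverage post-operative text reports (privileged information) when they are unavailable at inference. The paper bridges this gap by transferring historical post-operative expert knowledge to preoperative diagnosis.

Result: Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art (SOTA) diagnostic accuracy and survival stratification.

Insight: The novelty is a hypergraph-based LUPI framework with a Severed Graph Strategy and dual-stream distillation, which lets the Student model "hallucinate" semantic community structures from visual features and so exploit privileged text knowledge without needing the text at inference, addressing the mismatch between preoperative and post-operative data modalities.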

Abstract: Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. We introduce a Severed Graph Strategy, utilizing a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone. Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification. This effectively transfers expert knowledge to the preoperative setting, unlocking the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference.


[44] Quality Detection of Stored Potatoes via Transfer Learning: A CNN and Vision Transformer Approach cs.CVPDF

Shrikant Kapse, Priyankkumar Dhrangdhariya, Priya Kedia, Manasi Patwardhan, Shankar Kausley

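TL;DR: Using transfer learning with pre-trained CNN architectures (ResNet, VGG, DenseNet) and Vision Transformer (ViT), this study develops image-based deep learning models for monitoring potato quality during storage, covering sprout detection, weight loss estimation, and shelf-life prediction.

Details

Motivation: The study targets non-invasive, scalable quality monitoring of stored potatoes, including sprout detection, weight loss estimation, and shelf-life prediction, to support automated sorting and inventory systems.

Result: DenseNet reaches 98.03% accuracy in sprout detection. Shelf-life prediction models exceed 89.83% accuracy with coarse class divisions (2-5 classes), but accuracy declines for finer divisions (6-8 classes) because of subtle visual differences and limited data per class.

Insight: The contributions are a transfer learning approach that draws on both CNNs and ViT and specialized models for sprout detection and shelf-life prediction. Practical takeaways are that coarse class divisions give more robust performance, and that future work should develop generalized models covering diverse potato varieties and storage conditions.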

Abstract: Image-based deep learning provides a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges such as sprout detection, weight loss estimation, and shelf-life prediction. In this study, images and corresponding weight data were collected over a 200-day period under controlled temperature and humidity conditions. Leveraging powerful pre-trained architectures of ResNet, VGG, DenseNet, and Vision Transformer (ViT), we designed two specialized models: (1) a high-precision binary classifier for sprout detection, and (2) an advanced multi-class predictor to estimate weight loss and forecast remaining shelf-life with remarkable accuracy. DenseNet achieved exceptional performance, with 98.03% accuracy in sprout detection. Shelf-life prediction models performed best with coarse class divisions (2-5 classes), achieving over 89.83% accuracy, while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data per class. These findings demonstrate the feasibility of integrating image-based models into automated sorting and inventory systems, enabling early identification of sprouted potatoes and dynamic categorization based on storage stage. Practical implications include improved inventory management, differential pricing strategies, and reduced food waste across supply chains. While predicting exact shelf-life intervals remains challenging, focusing on broader class divisions ensures robust performance. Future research should aim to develop generalized models trained on diverse potato varieties and storage conditions to enhance adaptability and scalability. Overall, this approach offers a cost-effective, non-destructive method for quality assessment, supporting efficiency and sustainability in potato storage and distribution.


[45] CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models cs.CVPDF

Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman

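TL;DR: This paper proposes CRoPS, a training-free framework for mitigating hallucination in Large Vision-Language Models (LVLMs). It builds a hallucinated model that captures hallucination effects by selectively removing key text tokens, and introduces Generalized Contrastive Decoding to integrate multiple hallucinated models, covering hallucination sources more broadly.

Details

Motivation: Existing training-free methods for mitigating LVLM hallucination have two limitations: they rely on narrow assumptions about hallucination sources, and their effectiveness declines toward the end of generation, where hallucinations are most likely. The paper aims to overcome both and make generated content more reliable.

Result: CRoPS achieves consistent gains across six benchmarks and three LVLM families, improves CHAIR scores by 20%, and outperforms state-of-the-art training-free methods.

Insight: The core novelty is a hallucinated model that captures hallucination effects by selectively removing key text tokens (rather than visual tokens), combined with Generalized Contrastive Decoding to integrate representations of multiple hallucination sources, mitigating hallucination at decoding time without extra training.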

Abstract: Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
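
As a rough illustration of contrasting the original model against several hallucinated models, the sketch below extends the usual two-model contrastive decoding rule, `(1 + a) * original - a * hallucinated`, to a weighted sum over several hallucinated models. The exact form of the paper's Generalized Contrastive Decoding is not given in the abstract, so this formula, the function name, and the weights `alphas` are assumptions.

```python
def contrastive_logits(original, hallucinated, alphas):
    """Combine next-token logits from the original model with logits from
    several hallucinated models (hypothetical weighted-sum generalization).

    original:     list of logits, one per vocabulary entry
    hallucinated: list of logit lists, one per hallucinated model
    alphas:       one non-negative contrast weight per hallucinated model
    """
    if len(hallucinated) != len(alphas):
        raise ValueError("need exactly one weight per hallucinated model")
    total = sum(alphas)
    adjusted = []
    for k, base in enumerate(original):
        penalty = sum(a * h[k] for a, h in zip(alphas, hallucinated))
        adjusted.append((1.0 + total) * base - penalty)
    return adjusted

# Toy vocabulary of three tokens. Token 1 is favoured by both hallucinated
# models (visual tokens removed, key text tokens removed), so it is penalised.
original = [2.0, 1.0, 0.0]
no_vision = [0.0, 3.0, 0.0]
no_text = [1.0, 2.0, 0.0]
adjusted = contrastive_logits(original, [no_vision, no_text], [0.5, 0.5])
best = max(range(len(adjusted)), key=adjusted.__getitem__)
```

With these toy numbers the adjusted logits are `[3.5, -0.5, 0.0]`, so token 0 is selected and the token that only the hallucinated models prefer is suppressed.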


[46] Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians cs.CVPDF

Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

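TL;DR: This paper proposes Pixel-to-4D, a framework for generating high-quality, temporally consistent video with a controllable camera path from a single image. In a single forward pass it builds a dynamic 3D Gaussian scene representation that models both scene geometry and object motion, enabling fast, camera-guided video generation.

Details

Motivation: Existing single-image video generation methods offer limited user control, such as modifying the camera path, and struggle to model camera motion accurately, maintain temporal consistency, and preserve geometric integrity. The paper targets more controllable and more consistent image-to-video generation.

Result: Extensive experiments on KITTI, Waymo, RealEstate10K, and DL3DV-10K show that the method achieves state-of-the-art video quality and inference efficiency.

Insight: The core novelty is combining a dynamic 3D Gaussian representation with a single forward pass to build a 3D scene representation that already includes object motion. This avoids the temporal inconsistency of the conventional two-step process (render from a static 3D point cloud, then inject motion) and gives efficient, high-quality camera-guided video synthesis.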

Abstract: Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.


[47] Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model cs.CV | cs.AIPDF

Hao Guan, Li Zhou

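TL;DR: This study investigates how to detect performance degradation under data shift in a pathology vision-language model. It develops DomainSAT, a lightweight toolbox for input-level data shift analysis, and proposes a label-free performance degradation indicator based on model prediction confidence. Experiments show that combining input data shift detection with the output confidence indicator monitors degradation under data shift more reliably.

Details

Motivation: After deployment in medical image analysis, vision-language models may degrade when the input data distribution differs from that seen during development, and detecting this degradation without labeled data is challenging. Effective monitoring of model reliability is therefore essential.

Result: Experiments on a large-scale pathology dataset for tumor classification show that combining input data shift detection with output confidence-based indicators detects performance degradation more reliably, providing a practical framework for monitoring the reliability of foundation models in digital pathology.

Insight: The contributions are the DomainSAT toolbox, which integrates representative shift detection algorithms, and a label-free confidence-based degradation indicator that directly captures changes in model predictions. Viewed objectively, input-level and output-level monitoring are complementary: input shift detection gives early diagnostic signals, while the confidence indicator tracks actual performance degradation more closely.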

Abstract: Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
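
A label-free, confidence-based indicator of the kind described above might look like the following sketch. The abstract does not define the indicator, so the choice of mean top-class probability and the names `mean_confidence` and `confidence_drop` are assumptions; the probability rows are toy values, not outputs of the pathology VLM.

```python
def mean_confidence(prob_rows):
    """Average top-class probability over a batch of predictions."""
    if not prob_rows:
        raise ValueError("need at least one prediction")
    return sum(max(row) for row in prob_rows) / len(prob_rows)

def confidence_drop(reference_probs, deployed_probs):
    """Hypothetical degradation indicator: how far mean confidence on
    deployment data falls below mean confidence on reference data.
    No ground-truth labels are needed for either batch."""
    return mean_confidence(reference_probs) - mean_confidence(deployed_probs)

# Two-class tumor classification, two samples per batch.
reference = [[0.9, 0.1], [0.8, 0.2]]   # development-time predictions
shifted = [[0.6, 0.4], [0.5, 0.5]]     # predictions after a data shift
drop = confidence_drop(reference, shifted)
```

A positive `drop` (about 0.30 here) suggests the model has become less certain on the new data; whether that corresponds to an actual accuracy loss still has to be checked against labels when they become available.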


[48] Grading Handwritten Engineering Exams with Multimodal Large Language Models cs.CVPDF

Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec

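TL;DR: This paper presents an end-to-end workflow that uses multimodal large language models (LLMs) to grade handwritten engineering quizzes automatically. The system preserves the standard exam process (A4 paper, unconstrained handwriting) and needs only a handwritten reference solution and a short set of grading rules from the lecturer; a multi-stage design provides reliability, and the workflow is evaluated on a real course quiz.

Details

Motivation: Manual grading of handwritten STEM exams, which contain open-ended reasoning and diagrams, is slow and difficult to scale, so a solution that grades handwritten answers automatically and reliably is needed.

Result: On a real course quiz in Slovenian that includes hand-drawn circuit schematics, the full pipeline with SOTA backends (GPT-5.2 and Gemini-3 Pro) achieves a mean absolute difference of about 8 points to lecturer grades with low bias, and an estimated manual-review trigger rate of about 17% at a maximum-difference threshold of 40 points.

Insight: The contributions are: 1) only a handwritten reference solution and short rules are required, and the reference scan is never exposed to the grader; 2) a multi-stage design (format check, ensemble of independent graders, supervisor aggregation, deterministic validation) provides reliability and auditability; 3) ablations confirm that structured prompting and reference grounding are essential for preventing systematic over-grading.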

Abstract: Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
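
The ensemble-and-review step can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define how $D_{\max}$ is applied, so treating it as a limit on the spread between independent graders is a guess, and the median stands in for the paper's supervisor aggregation, which is not described in detail. The function name `aggregate` is hypothetical.

```python
from statistics import median

def aggregate(grader_scores, d_max=40.0):
    """Combine scores (0-100) from independent graders and decide whether
    a human should review the result.

    Assumption: manual review is triggered when the spread between the
    highest and lowest grader score exceeds d_max.
    """
    if not grader_scores:
        raise ValueError("need at least one grader score")
    spread = max(grader_scores) - min(grader_scores)
    return {
        "score": median(grader_scores),   # stand-in for supervisor aggregation
        "spread": spread,
        "needs_review": spread > d_max,
    }

# Three graders, one of which strongly disagrees with the others.
report = aggregate([70, 75, 20])
```

Here the report carries a score of 70 and is flagged for manual review because the graders differ by 55 points; three graders scoring 60, 65, and 70 would not be flagged.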


[49] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction cs.CVPDF

Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu

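TL;DR: This paper proposes AdaGaR, a unified framework for reconstructing dynamic 3D scenes from monocular video. It extends Gaussian primitives with an Adaptive Gabor Representation to balance high-frequency detail capture against energy stability, and uses Cubic Hermite Splines with Temporal Curvature Regularization to keep motion temporally continuous.

Details

Motivation: Existing methods based on single Gaussian primitives are limited by their low-pass filtering nature and struggle to capture high-frequency detail, while standard Gabor functions suffer from energy instability. In addition, the lack of temporal continuity constraints often causes motion artifacts during interpolation.

Result: The method achieves SOTA performance on the Tap-Vid DAVIS benchmark (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and generalizes well to frame interpolation, depth consistency, video editing, and stereo view synthesis.

Insight: The core novelty is the combination of the Adaptive Gabor Representation (learnable frequency weights and adaptive energy compensation) with temporal continuity modeling (Cubic Hermite Splines plus curvature regularization), along with an Adaptive Initialization mechanism that combines depth estimation, point tracking, and foreground masks. Together these address the trade-off among detail, stability, and smoothness in dynamic scene reconstruction.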

Abstract: Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/
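
For readers unfamiliar with the spline used for temporal continuity, the sketch below evaluates a standard cubic Hermite segment for a single scalar coordinate. It shows only the textbook interpolation formula; how AdaGaR parameterizes keyframes and tangents, and its Temporal Curvature Regularization term, are not specified in the abstract and are not modeled here. The function name `hermite` is illustrative.

```python
def hermite(p0, m0, p1, m1, t):
    """Cubic Hermite interpolation between values p0 and p1 with tangents
    m0 and m1, for a normalized time t in [0, 1]."""
    t2 = t * t
    t3 = t2 * t
    h00 = 2 * t3 - 3 * t2 + 1    # weight of the start value
    h10 = t3 - 2 * t2 + t        # weight of the start tangent
    h01 = -2 * t3 + 3 * t2       # weight of the end value
    h11 = t3 - t2                # weight of the end tangent
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# One coordinate of a point moving from 0.0 to 2.0 with equal end tangents.
start = hermite(0.0, 1.0, 2.0, 1.0, 0.0)
mid = hermite(0.0, 1.0, 2.0, 1.0, 0.5)
end = hermite(0.0, 1.0, 2.0, 1.0, 1.0)
```

The curve passes exactly through both keyframes, and because neighbouring segments can share tangents at their common keyframe, position and velocity stay continuous across segments, which is the property that avoids abrupt motion when interpolating between frames.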


cs.AI

[50] Reasoning in Action: MCTS-Driven Knowledge Retrieval for Large Language Models cs.AI | cs.CL

Shuqi Liu, Bowei He, Chen Ma, Linqi Song

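TL;DR: This paper proposes a reasoning-aware knowledge retrieval method. A retrieval strategy driven by Monte Carlo Tree Search (MCTS) finds knowledge in the knowledge base that aligns with the logical structure of the conversation, rather than relying on semantic similarity alone, to improve large language model (LLM) performance in multi-turn dialogue.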

Details

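Motivation: Existing LLMs are typically enhanced either by retrieving semantically similar information or by improving reasoning ability, but effectively integrating retrieval and reasoning strategies to optimize LLM performance remains a major challenge.

Result: Experiments on two multi-turn dialogue datasets show that the retrieved knowledge aligns more closely with the underlying reasoning in human conversations and is significantly more diverse, yielding more informative and creative responses.

Insight: The novelty lies in a coarse-to-fine retrieval pipeline (first locating a contextually relevant sub-region, then refining to reasoning-relevant knowledge) combined with an MCTS-inspired search that navigates the knowledge base via keywords, tightly integrating reasoning and retrieval.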

Abstract: Large language models (LLMs) typically enhance their performance through either the retrieval of semantically similar information or the improvement of their reasoning capabilities. However, a significant challenge remains in effectively integrating both retrieval and reasoning strategies to optimize LLM performance. In this paper, we introduce a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned to the logical structure of conversations, moving beyond surface-level semantic similarity. We follow a coarse-to-fine approach for knowledge retrieval. First, we identify a contextually relevant sub-region of the knowledge base, ensuring that all sentences within it are relevant to the context topic. Next, we refine our search within this sub-region to extract knowledge that is specifically relevant to the reasoning process. Throughout both phases, we employ a Monte Carlo Tree Search-inspired search method to effectively navigate through knowledge sentences using common keywords. Experiments on two multi-turn dialogue datasets demonstrate that our knowledge retrieval approach not only aligns more closely with the underlying reasoning in human conversations but also significantly enhances the diversity of the retrieved knowledge, resulting in more informative and creative responses.


[51] The Illusion of Insight in Reasoning Models cs.AI | cs.CL

Liv G. d’Aliberti, Manoel Horta Ribeiro

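TL;DR: This paper studies “Aha!” moments in reasoning models. Analyzing more than 1 million reasoning traces, hundreds of training checkpoints, three reasoning domains, and multiple decoding temperatures and model architectures, it finds that mid-reasoning shifts in strategy are rare, do not become more frequent with training, and seldom improve accuracy. These shifts therefore appear to be symptoms of unstable inference rather than a genuine self-correction mechanism.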

Details

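Motivation: To determine whether reasoning models such as DeepSeek-R1-Zero truly have an intrinsic capacity for self-correction, that is, whether sudden mid-reasoning strategy shifts (“Aha!” moments) improve output accuracy, and thereby to clarify the actual effect of this phenomenon on performance.

Result: Reasoning shifts are rare, do not become more frequent with training, and seldom improve accuracy. However, when model uncertainty is high (high entropy), artificially triggering extrinsic shifts reliably improves accuracy, which challenges earlier views of intrinsic model insight.

Insight: The novelty is in showing that “Aha!” moments in reasoning models are more a sign of unstable inference than an effective self-correction mechanism. The paper also proposes artificially inducing strategy shifts under high entropy as a practical way to improve accuracy, offering a new perspective for analyzing model reasoning behavior.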

Abstract: Do reasoning models have “Aha!” moments? Prior work suggests that models like DeepSeek-R1-Zero undergo sudden mid-trace realizations that lead to accurate outputs, implying an intrinsic capacity for self-correction. Yet, it remains unclear whether such intrinsic shifts in reasoning strategy actually improve performance. Here, we study mid-reasoning shifts and instrument training runs to detect them. Our analysis spans 1M+ reasoning traces, hundreds of training checkpoints, three reasoning domains, and multiple decoding temperatures and model architectures. We find that reasoning shifts are rare, do not become more frequent with training, and seldom improve accuracy, indicating that they do not correspond to prior perceptions of model insight. However, their effect varies with model uncertainty. Building on this finding, we show that artificially triggering extrinsic shifts under high entropy reliably improves accuracy. Our results show that mid-reasoning shifts are symptoms of unstable inference behavior rather than an intrinsic mechanism for self-correction.
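
The idea of triggering a shift only under high entropy can be sketched as a simple gate on the Shannon entropy of the next-token distribution. This is an illustrative sketch, not the authors' procedure: the function names, the threshold value, and the cue text are all hypothetical, and the abstract does not say how entropy is measured or how shifts are injected.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_trigger_shift(trace, next_token_probs, threshold=1.0,
                        cue=" Wait, let me re-check this step."):
    """ASSUMPTION: append a reconsideration cue when the model's
    next-token entropy exceeds a hand-picked threshold."""
    if shannon_entropy(next_token_probs) > threshold:
        return trace + cue
    return trace

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: leave the trace alone
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy: inject the cue

print(maybe_trigger_shift("x = 3.", confident))
print(maybe_trigger_shift("x = 3.", uncertain))
```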


[52] From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers cs.AI | cs.CV

Abolhassan Pishahang, Maryam Badiei

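TL;DR: Using Iranian pigeon towers as a case study, this work examines how generative AI systems interpret the architectural intelligence embedded in vernacular forms. It tests three diffusion models, Midjourney v6, DALL-E 3, and DreamStudio (based on Stable Diffusion XL), across three prompt stages (referential, adaptive, and speculative) and evaluates how well each reconstructs five aspects: typology, materiality, environment, realism, and cultural specificity.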

Details

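Motivation: To explore how generative AI balances visual resemblance with architectural reasoning when interpreting the architectural intelligence of vernacular forms, and to reveal its capabilities and limits in perceiving, distorting, and reimagining traditional design wisdom.

Result: AI reliably reproduces geometric patterns but misreads material and climatic logic. Reference images improve realism but limit creativity, whereas reference-free generation produces inventive but culturally ambiguous results.

Insight: The study defines a boundary between visual resemblance and architectural reasoning and proposes a computational vernacular reasoning framework for analyzing how AI perceives and reinterprets traditional design intelligence, offering a methodological reference for applying AI to the interpretation of architectural heritage.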

Abstract: This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and DreamStudio based on Stable Diffusion XL (SDXL), across three prompt stages: referential, adaptive, and speculative. A five-criteria evaluation framework assesses how each system reconstructs typology, materiality, environment, realism, and cultural specificity. Results show that AI reliably reproduces geometric patterns but misreads material and climatic reasoning. Reference imagery improves realism yet limits creativity, while freedom from reference generates inventive but culturally ambiguous outcomes. The findings define a boundary between visual resemblance and architectural reasoning, positioning computational vernacular reasoning as a framework for analyzing how AI perceives, distorts, and reimagines traditional design intelligence.


[53] Explicit Abstention Knobs for Predictable Reliability in Video Question Answering cs.AI | cs.CV

Jorge Ortiz

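TL;DR: This paper studies whether confidence-based abstention can provide reliable control over error rates in video question answering, and whether that control is robust under distribution shift. In-distribution, confidence thresholding provides mechanistic control with a smooth risk-coverage trade-off; under distribution shift, this control breaks down and error rates rise unexpectedly.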

Details

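Motivation: High-stakes deployment of vision-language models requires selective prediction, that is, abstaining when uncertain to avoid costly errors. The paper asks whether confidence-based abstention provides predictable reliability control for video question answering and tests its robustness under distribution shift.

Result: Experiments use the Gemini 2.0 Flash model on the NExT-QA dataset. In-distribution, sweeping the confidence threshold ε yields a smooth risk-coverage trade-off curve and effectively reduces error rates; under distribution shift, this control fails and error rates rise unexpectedly.

Insight: Confidence-based abstention offers predictable reliability control in-distribution but has limits under distribution shift. This underscores the importance of accounting for distributional robustness in real deployments and points toward more reliable abstention strategies.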

Abstract: High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
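
The threshold sweep described in this entry can be sketched as follows. For each threshold, the model answers only when its confidence meets the threshold. Coverage is the fraction of questions answered, and risk is the error rate among those answered. The confidences and correctness labels are invented, and `risk_coverage` is a hypothetical helper rather than code from the paper.

```python
def risk_coverage(confidences, correct, thresholds):
    """Return (threshold, coverage, risk) rows for a confidence sweep.

    coverage = fraction of questions answered (confidence >= threshold)
    risk     = error rate among answered questions (None if none answered)
    """
    total = len(confidences)
    rows = []
    for eps in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= eps]
        coverage = len(kept) / total
        risk = (len(kept) - sum(kept)) / len(kept) if kept else None
        rows.append((eps, coverage, risk))
    return rows

# Invented per-question confidences and correctness flags.
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False, False]

rows = risk_coverage(confidences, correct, [0.0, 0.5, 0.85])
for eps, coverage, risk in rows:
    print(eps, coverage, risk)
```

A curve like this is only as trustworthy as the confidence scores behind it: a threshold chosen on one data distribution carries no guarantee on another, which is the failure mode the summary above describes.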


cs.LG

[54] Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning cs.LG | cs.AI | cs.CL | cs.LO

Valentin Noël

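TL;DR: This paper proposes a training-free method that detects valid mathematical reasoning by analyzing spectral features of attention patterns in large language models. Attention matrices are treated as adjacency matrices of dynamic graphs, from which four interpretable spectral diagnostics are extracted (Fiedler value, high-frequency energy ratio, graph signal smoothness, and spectral entropy); these differ with statistical significance between valid and invalid mathematical proofs. Experiments on seven transformer models from four independent architecture families show effect sizes up to Cohen’s d = 3.30 and classification accuracy of 85.0-95.6%, with no training data or fine-tuning.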

Details

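Motivation: To develop a training-free, interpretable method for verifying the validity of mathematical reasoning generated by large language models, with applications to hallucination detection and AI safety monitoring.

Result: Across seven models from Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI, the method yields effect sizes up to Cohen’s d = 3.30 ($p < 10^{-116}$) and classification accuracy of 85.0-95.6% under rigorous evaluation, with calibrated thresholds reaching 93-95% on the full dataset.

Insight: The novelty is in treating attention patterns as dynamic graphs and applying spectral analysis to extract interpretable spectral features as diagnostics of reasoning validity. The method is found to detect logical coherence rather than compiler acceptance, and attention mechanism design (such as Mistral-7B’s Sliding Window Attention) is shown to affect which spectral features capture reasoning validity, giving a principled framework for reasoning verification.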

Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen’s $d = 3.30$ ($p < 10^{-116}$), enabling 85.0–95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93–95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B’s Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
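
Two of the four diagnostics named above can be sketched with NumPy. The sketch treats one attention matrix as a weighted graph, builds the combinatorial Laplacian $L = D - W$, and reads off the Fiedler value (second-smallest eigenvalue) and a spectral entropy. Several choices here are assumptions rather than the paper's definitions: symmetrizing the attention matrix, using the unnormalized Laplacian, and defining spectral entropy over the normalized eigenvalues. HFER and smoothness are omitted because they also require a choice of graph signal.

```python
import numpy as np

def spectral_diagnostics(attn):
    """Return (fiedler_value, spectral_entropy) for an (n, n) attention matrix.

    ASSUMPTIONS: the matrix is symmetrized to get an undirected graph,
    the combinatorial Laplacian L = D - W is used, and entropy is taken
    over eigenvalues normalized to sum to one.
    """
    w = 0.5 * (attn + attn.T)              # undirected edge weights
    lap = np.diag(w.sum(axis=1)) - w       # combinatorial Laplacian
    eigvals = np.linalg.eigvalsh(lap)      # real eigenvalues, ascending
    fiedler = float(eigvals[1])            # algebraic connectivity

    spectrum = np.clip(eigvals, 0.0, None) # remove tiny negative round-off
    total = spectrum.sum()
    if total == 0:
        return fiedler, 0.0
    p = spectrum / total
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log(p)))
    return fiedler, entropy

# Toy example: three tokens that all attend to each other equally.
attn = np.ones((3, 3)) - np.eye(3)
fiedler, entropy = spectral_diagnostics(attn)
print(fiedler, entropy)
```

A Fiedler value near zero indicates that the token graph almost splits into disconnected parts, which is why it is often read as a connectivity measure.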


[55] E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models cs.LG | cs.AI | cs.CV

Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan

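TL;DR: This paper proposes E-GRPO (entropy-aware Group Relative Policy Optimization) to strengthen reinforcement learning for aligning flow matching models with human preferences. The method merges consecutive low-entropy steps into a single high-entropy step for SDE sampling and uses ODE sampling on the other steps, addressing sparse and ambiguous reward signals in multi-step denoising, and introduces a multi-step group-normalized advantage.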

Details

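Motivation: Existing methods that optimize over multiple denoising steps face sparse and ambiguous reward signals when aligning with human preferences via reinforcement learning; low-entropy steps in particular make exploration inefficient.

Result: Experiments under different reward settings show that the method improves performance, but the abstract gives no specific benchmarks or quantitative comparisons (for example, whether it reaches SOTA).

Insight: The novelty is an entropy-aware step-merging strategy (merging low-entropy steps into a high-entropy SDE sampling step) and a group-normalized advantage computation, which exploit high-entropy steps for more efficient exploration and mitigate the reward ambiguity caused by multi-step stochasticity.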

Abstract: Recent reinforcement learning methods have improved flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps result in indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity from multiple steps, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling on the other steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
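
The two ingredients named in this abstract can be sketched in a few lines. The first function computes group-relative advantages in the usual GRPO style, normalizing each reward by the mean and standard deviation of its group. The second groups consecutive low-entropy step indices so that each run could be treated as one merged step. Both are illustrative assumptions: the function names, the use of the population standard deviation, the entropy threshold, and the grouping rule are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of roll-outs.

    ASSUMPTION: population standard deviation plus a small eps; some
    implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def merge_low_entropy_steps(entropies, threshold):
    """ASSUMPTION: group runs of consecutive steps whose entropy is below
    the threshold; every other step forms its own group."""
    groups, run = [], []
    for i, h in enumerate(entropies):
        if h < threshold:
            run.append(i)
        else:
            if run:
                groups.append(run)
                run = []
            groups.append([i])
    if run:
        groups.append(run)
    return groups

advantages = group_relative_advantages([1.0, 2.0, 3.0])
groups = merge_low_entropy_steps([0.9, 0.1, 0.2, 0.8, 0.1], threshold=0.5)
print(advantages)
print(groups)
```

Normalizing within a group means a roll-out is rewarded only for beating its siblings, so a step at which all roll-outs look alike contributes almost no learning signal.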


[56] Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation cs.LG | cs.AI | cs.CV | cs.HC | cs.MM

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

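TL;DR: This paper proposes Avatar Forcing, a framework for real-time interactive head avatar generation. It models real-time user-avatar interaction through diffusion forcing and applies direct preference optimization with synthetic losing samples constructed by dropping user conditions, achieving low-latency, expressive reactions.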

Details

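Motivation: To address the shortcomings of existing talking head generation models: they lack a sense of genuine interaction, cannot respond in real time, and struggle to learn vivid reactions without additional labeled data.

Result: Achieves low latency of about 500 ms in real-time interaction, a 6.8x speedup over the baseline; the generated avatar motion is more reactive and expressive and is preferred over the baseline more than 80% of the time in preference tests.

Insight: The novelty lies in real-time motion generation under causal constraints via diffusion forcing, and in label-free direct preference optimization that uses constructed synthetic losing samples to learn expressive interaction.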

Abstract: Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user’s audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over the baseline in more than 80% of cases.


eess.SY

[57] Next Generation Intelligent Low-Altitude Economy Deployments: The O-RAN Perspective eess.SY | cs.AI | cs.CV | cs.MA | cs.NI

Aly Sabri Abdalla, Vuk Marojevic

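TL;DR: This paper proposes a low-altitude economy (LAE) deployment framework based on the open radio access network (O-RAN), using O-RAN’s disaggregated architecture, open interfaces, and RAN intelligent controllers (RICs) to enable closed-loop, AI-optimized, mission-critical LAE operations. A semantic-aware rApp acts as a terrain interpreter and provides semantic guidance to a reinforcement learning-based xApp that performs real-time trajectory planning for LAE swarm nodes.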

Details

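Motivation: To address the lack of real-time, resilient, context-aware orchestration of aerial nodes for LAE applications (such as UAV logistics and emergency response) in complex, signal-constrained environments, along with the limited integration of AI specialized for LAE missions.

Result: The feasibility and performance of the proposed architecture are evaluated through a semantic-aware rApp working with a reinforcement learning xApp, but the abstract reports no specific quantitative results or benchmark comparisons.

Insight: The novelty is in bringing the O-RAN architecture to LAE deployment, using its openness and intelligent controllers for AI-driven closed-loop optimization. Viewed objectively, the work offers a standardized, scalable solution framework for LAE and highlights testbed capabilities and open research challenges.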

Abstract: Despite the growing interest in low-altitude economy (LAE) applications, including UAV-based logistics and emergency response, fundamental challenges remain in orchestrating such missions over complex, signal-constrained environments. These include the absence of real-time, resilient, and context-aware orchestration of aerial nodes with limited integration of artificial intelligence (AI) specialized for LAE missions. This paper introduces an open radio access network (O-RAN)-enabled LAE framework that leverages seamless coordination between the disaggregated RAN architecture, open interfaces, and RAN intelligent controllers (RICs) to facilitate closed-loop, AI-optimized, and mission-critical LAE operations. We evaluate the feasibility and performance of the proposed architecture via a semantic-aware rApp that acts as a terrain interpreter, offering semantic guidance to a reinforcement learning-enabled xApp, which performs real-time trajectory planning for LAE swarm nodes. We survey the capabilities of UAV testbeds that can be leveraged for LAE research, and present critical research challenges and standardization needs.


cs.SD

[58] Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection cs.SD | cs.CV

Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall

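TL;DR: This study explores the potential of multimodal large language models (MLLMs) for audio deepfake detection. Combining audio inputs with a variety of text prompts, it evaluates Qwen2-Audio-7B-Instruct and SALMONN in zero-shot and fine-tuned modes. The experiments suggest that combining audio with a multi-prompt approach is a viable direction: the models perform well on in-domain data with minimal supervision but do poorly in zero-shot use and in cross-domain generalization.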

Details

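Motivation: Although vision-language models and multimodal large language models have shown strong generalization in image and video deepfake detection, their use for audio deepfake detection remains largely unexplored. This study aims to fill that gap.

Result: In the zero-shot setting, the models perform poorly and struggle to generalize to out-of-domain data. After fine-tuning with minimal supervision on in-domain data, they achieve good performance, indicating potential for audio deepfake detection. The models evaluated are Qwen2-Audio-7B-Instruct and SALMONN.

Insight: The novelty is the first systematic exploration of MLLMs for audio deepfake detection, together with a multi-prompt approach that combines audio input with diverse text prompts (in particular question-answer based, context-rich prompts). Viewed objectively, this feature-guided reasoning is intended to promote deeper multimodal understanding and more robust feature learning for audio deepfake detection, opening an MLLM-based line of exploration in this area.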

Abstract: While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to assess whether MLLMs can learn robust representations across modalities for audio deepfake detection. To that end, we explore text-aware and context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.


eess.SP

[59] Neural Brain Fields: A NeRF-Inspired Approach for Generating Nonexistent EEG Electrodes eess.SP | cs.AI | cs.CV | cs.LG | eess.AS

Shahar Ain Kedem, Itamar Zimerman, Eliya Nachmani

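TL;DR: This paper proposes Neural Brain Fields, a NeRF-inspired method that learns a continuous representation of neural activity from discrete EEG electrode data and can generate signals for nonexistent electrodes, improving the performance of EEG processing networks.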

Details

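Motivation: To address the difficulty of applying deep learning to EEG data, which stems from variable recording length, low signal-to-noise ratio, large inter-subject differences, temporal drift, and scarce datasets.

Result: The method enables continuous visualization of brain activity (including at ultra-high resolution), reconstructs raw EEG signals, and improves the performance of standard EEG processing networks by simulating data from nonexistent electrodes.

Insight: The novelty is in carrying over to EEG the NeRF analogy of learning a continuous 3D scene from discrete images: a neural network trained on a single sample produces a fixed-size weight vector, enabling continuous rendering of the signal in time and space and generation of electrode data.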

Abstract: Electroencephalography (EEG) data present unique modeling challenges because recordings vary in length, exhibit very low signal-to-noise ratios, differ significantly across participants, drift over time within sessions, and are rarely available in large and clean datasets. Consequently, developing deep learning methods that can effectively process EEG signals remains an open and important research problem. To tackle this problem, this work presents a new method inspired by Neural Radiance Fields (NeRF). In computer vision, NeRF techniques train a neural network to memorize the appearance of a 3D scene and then use its learned parameters to render and edit the scene from any viewpoint. We draw an analogy between the discrete images captured from different viewpoints used to learn a continuous 3D scene in NeRF, and EEG electrodes positioned at different locations on the scalp, which are used to infer the underlying representation of continuous neural activity. Building on this connection, we show that a neural network can be trained on a single EEG sample in a NeRF-style manner to produce a fixed-size and informative weight vector that encodes the entire signal. Moreover, via this representation we can render the EEG signal at previously unseen time steps and spatial electrode positions. We demonstrate that this approach enables continuous visualization of brain activity at any desired resolution, including ultra-high resolution, and reconstruction of raw EEG signals. Finally, our empirical analysis shows that this method can effectively simulate nonexistent electrode data in EEG recordings, allowing the reconstructed signal to be fed into standard EEG processing networks to improve performance.