
cs.CL

[1] Rate-Distortion Analysis of Compressed Query Delegation with Low-Rank Riemannian Updates cs.CL | math.OC

Faruk Alpay, Bugra Kilictas

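TL;DR: This paper proposes compressed query delegation (CQD) to address the failure of bounded-context agents when intermediate reasoning exceeds the working-memory budget. The method compresses a high-dimensional latent reasoning state into a low-rank tensor query, delegates it to an external oracle, and updates the latent state via Riemannian optimization on fixed-rank manifolds. The paper formalizes CQD mathematically, connects it to classical rate-distortion and information bottleneck principles, and provides theoretical convergence guarantees. The experiments comprise a 2,500-item bounded-context reasoning suite that compares CQD with chain-of-thought baselines and a human "cognitive mirror" benchmark.

Details

Motivation: Bounded-context agents fail on complex reasoning tasks when intermediate reasoning steps exceed the working-memory budget. Compressed query delegation aims to use external oracle resources efficiently in that setting.

Result: On a 2,500-item bounded-context reasoning suite built from BBH-derived tasks and curated paradox instances, CQD is compared against chain-of-thought baselines under fixed compute and context. On a human "cognitive mirror" benchmark (N=200), the paper measures epistemic gain and semantic drift across modern oracles.

Insight: The contributions include formalizing CQD as a constrained stochastic program with a query-budget functional and proving that spectral hard-thresholding is optimal for a constrained quadratic distortion problem. Updating the state via Riemannian optimization on fixed-rank manifolds offers an efficient way to handle low-rank structure and may transfer to compressed reasoning and external-knowledge integration tasks.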

Abstract: Bounded-context agents fail when intermediate reasoning exceeds an effective working-memory budget. We study compressed query delegation (CQD): (i) compress a high-dimensional latent reasoning state into a low-rank tensor query, (ii) delegate the minimal query to an external oracle, and (iii) update the latent state via Riemannian optimization on fixed-rank manifolds. We give a math-first formulation: CQD is a constrained stochastic program with a query-budget functional and an oracle modeled as a noisy operator. We connect CQD to classical rate-distortion and information bottleneck principles, showing that spectral hard-thresholding is optimal for a natural constrained quadratic distortion problem, and we derive convergence guarantees for Riemannian stochastic approximation under bounded oracle noise and smoothness assumptions. Empirically, we report (A) a 2,500-item bounded-context reasoning suite (BBH-derived tasks plus curated paradox instances) comparing CQD against chain-of-thought baselines under fixed compute and context; and (B) a human “cognitive mirror” benchmark (N=200) measuring epistemic gain and semantic drift across modern oracles.
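The optimality claim for spectral hard-thresholding echoes the classical Eckart-Young-Mirsky theorem. The block below is a sketch of that classical matrix result only. It assumes the distortion is squared Frobenius error and the query budget is a rank cap r; the abstract states neither definition, so the symbols X, Q, and r are illustrative and the paper's tensor formulation may differ.

```latex
\[
  \min_{Q:\ \operatorname{rank}(Q) \le r} \ \lVert X - Q \rVert_F^2,
  \qquad
  X = \sum_{i} \sigma_i\, u_i v_i^{\top},
  \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0 .
\]
\[
  Q^{\star} = \sum_{i \le r} \sigma_i\, u_i v_i^{\top},
  \qquad
  \lVert X - Q^{\star} \rVert_F^2 = \sum_{i > r} \sigma_i^2 .
\]
```

Keeping the r largest singular values and zeroing the rest is the "hard threshold"; the residual distortion is the energy in the discarded singular values.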


[2] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models cs.CL | cs.AI

Patricio Vera

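TL;DR: This paper introduces "intention collapse" to describe how a language model compresses a high-dimensional internal state into a single token sequence. The author defines three model-agnostic intention metrics (intention entropy, effective dimensionality, and latent knowledge recoverability) and proposes an empirical framework for studying how inference-time computation shapes internal intentions. A small-scale experiment with Mistral 7B on GSM8K compares three regimes: direct answer, chain of thought (CoT), and a babble control. CoT substantially raises accuracy and lowers intention entropy, and the metrics distinguish the regimes and expose latent information that is partly lost during collapse.

Details

Motivation: To open the "black box" of how a language model's rich internal state (its intention) is compressed into a single token sequence during generation, and to quantify how different inference regimes such as chain of thought affect the formation of internal intentions.

Result: The experiment uses a 4-bit quantized Mistral 7B model on 200 GSM8K problems. CoT raises accuracy from 5.5% to 53% and reduces pre-collapse intention entropy from 1.42 bits to 0.37 bits. Effective dimensionality under CoT is also higher than in the other regimes. A linear probe on the intention space reaches AUROC 0.65 under CoT, whereas in the direct-answer baseline it stays near chance.

Insight: The paper formalizes the concept of intention collapse and proposes quantifiable intention-level metrics, giving a new tool for analyzing the internal reasoning of language models. Viewed objectively, the metrics separate different inference regimes and suggest that mechanisms such as CoT improve performance by changing the distribution and structure of internal intentions, which offers a new angle for understanding and improving model reasoning.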

Abstract: Every act of language generation compresses a rich internal state into a single token sequence. We call this process intention collapse: a many-to-one projection from a high dimensional intention space I into an external language space L. We formalize intention collapse for contemporary language models, define three simple, model agnostic intention metrics (intention entropy H_int, effective dimensionality dim_eff, and latent knowledge recoverability Recov), and propose an empirical agenda for studying how inference time computation shapes internal intentions before they are verbalized. We also report a first small scale experiment. Using a 4 bit Mistral 7B model on 200 GSM8K problems, we compare a direct answer baseline, a chain of thought (CoT) regime, and a babble control. CoT raises accuracy from 5.5 percent to 53 percent, sharply reduces pre collapse intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality than the other regimes despite producing fewer tokens than babble. At the same time, H_int has little item level predictive power, and a linear probe on I achieves AUROC 0.65 in the CoT regime but only about chance in the baseline regime, where it collapses to the majority class. These preliminary results indicate that intention level metrics can distinguish inference regimes and expose latent information that is partly lost during collapse, while also revealing important limitations of our current proxies.
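Because intention entropy is reported in bits, a small Shannon-entropy calculation helps calibrate the 1.42 and 0.37 figures. This is a minimal sketch, assuming the metric behaves like ordinary Shannon entropy over a discrete distribution; the paper's exact definition is not given in the abstract, and the two distributions below are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (normalized first)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in probs]
    # Terms with p == 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in normalized if p > 0)


# Hypothetical distributions over four candidate answers, for illustration only.
diffuse = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over four options
peaked = [0.94, 0.02, 0.02, 0.02]   # mass concentrated on one option

print(f"diffuse: {entropy_bits(diffuse):.2f} bits")
print(f"peaked:  {entropy_bits(peaked):.2f} bits")
```

A uniform distribution over four options has exactly 2 bits of entropy, so values well below 1 bit, such as the reported 0.37, correspond to a distribution dominated by a single outcome.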


[3] EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation cs.CL

Zilin Li, Weiwei Xu, Xuanbo Lu, Zheda Liu

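TL;DR: EmoLoom-2B is a lightweight, reproducible pipeline that turns small language models under 2B parameters into fast screening candidates for joint emotion classification and VAD prediction. It combines a unified JSON input-output contract, KV-off decoding by default, semantic regularization through a VAD-preserving constraint and an external appraisal classifier, and Valence Flip augmentation based on mirrored emotional pairs. It achieves strong performance on GoEmotions and EmpatheticDialogues and shows robust cross-corpus generalization on DailyDialog.

Details

Motivation: How to screen small base models for emotion classification and VAD prediction quickly, fairly, and reproducibly, so that they provide a reliable preselection step before heavier training or multimodal fusion.

Result: With Qwen-1.8B-Chat as the base model, EmoLoom-2B achieves strong performance on GoEmotions and EmpatheticDialogues and shows robust cross-corpus generalization on DailyDialog.

Insight: The contributions are: 1) a unified, protocol-faithful, and fair evaluation pipeline (single JSON contract, KV-off decoding); 2) two orthogonal semantic regularizers, namely a VAD-preserving constraint and a lightweight external appraisal classifier that gives training-time guidance on goal attainment, controllability, certainty, and fairness; 3) Valence Flip augmentation to improve polarity sensitivity; 4) A/B mixture sampling with entropy-aware temperature scheduling during supervised fine-tuning to balance coverage and convergence. The overall recipe is budget-aware, auditable, and re-entrant.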

Abstract: We introduce EmoLoom-2B, a lightweight and reproducible pipeline that turns small language models under 2B parameters into fast screening candidates for joint emotion classification and Valence-Arousal-Dominance prediction. To ensure protocol-faithful and fair evaluation, we unify data loading, training, and inference under a single JSON input-output contract and remove avoidable variance by adopting KV-off decoding as the default setting. We incorporate two orthogonal semantic regularizers: a VAD-preserving constraint that aligns generated text with target VAD triples, and a lightweight external appraisal classifier that provides training-time guidance on goal attainment, controllability, certainty, and fairness without injecting long rationales. To improve polarity sensitivity, we introduce Valence Flip augmentation based on mirrored emotional pairs. During supervised fine-tuning, we apply A/B mixture sampling with entropy-aware temperature scheduling to balance coverage and convergence. Using Qwen-1.8B-Chat as the base model, EmoLoom-2B achieves strong performance on GoEmotions and EmpatheticDialogues, and demonstrates robust cross-corpus generalization on DailyDialog. The proposed recipe is budget-aware, auditable, and re-entrant, serving as a dependable screening pass before heavier training or multimodal fusion.


[4] From Policy to Logic for Efficient and Interpretable Coverage Assessment cs.CL | cs.AI

Rhitabrat Pokharel, Hamid Hassanzadeh, Ameeta Agrawal

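TL;DR: This paper proposes a hybrid system that pairs a coverage-aware retriever with symbolic rule-based reasoning to make medical coverage policy review more efficient and interpretable. It retrieves relevant policy language, organizes it into explicit facts and rules, and generates auditable rationales, which reduces the number of LLM inferences required and lowers cost.

Details

Motivation: LLMs can hallucinate and behave inconsistently when interpreting complex legal and policy language. In critical settings such as medical coverage policy review, human experts need accurate and reliable support.

Result: On medical coverage policy assessment, the approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, showing gains in both efficiency and effectiveness.

Insight: The novelty lies in combining neural retrieval with symbolic rule-based reasoning to build an auditable hybrid system. It reduces reliance on the LLM to control cost and improves interpretability through structured outputs, offering a dependable path for sensitive domains such as policy analysis.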

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.


[5] T3C: Test-Time Tensor Compression with Consistency Guarantees cs.CL | cs.AI | cs.CV

Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

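TL;DR: T3C is a train-once framework that compresses conditionally on a budget at test time, exposing tensor-factorization rank and quantization precision as controllable deployment knobs. It combines elastic tensor factorization, rank-tied mixed-precision quantization, and a lightweight controller that maps latency/energy/size budgets to per-layer rank/bit assignments. A fast layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, giving a practical reliability signal. On ImageNet-1k, T3C shifts the Pareto frontier for vision models.

Details

Motivation: At deployment time, a model's compression configuration (such as rank and precision) must be adjusted dynamically and reliably to different latency, energy, or model-size budgets, so that the accuracy-efficiency trade-off is predictable.

Result: On ImageNet-1k, for ResNet-50 at an accuracy drop of ≤ 0.5%, p50 latency is 1.18 ms with a 38 MB model, outperforming PTQ-8b (1.44 ms, 88 MB). For ViT-B/16, T3C reaches 2.30 ms p50 latency with a 59 MB model, improving over strong PTQ/QAT baselines.

Insight: The novelty is treating rank and precision as a unified, budget-driven deployment parameter and backing the compressed model's reliability with a training framework that carries a theoretical guarantee (the consistency certificate). The core idea is "train once, deploy on demand": a single checkpoint yields hardware-aligned, budget-monotone, certificate-backed trade-offs.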

Abstract: We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (≤ 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.


[6] Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems cs.CL

Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal

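TL;DR: This study compares generalist reasoning models with domain fine-tuned models in RAG-based mental health dialogue systems. Four open-source models (two generalists, Qwen2.5-3B and Phi-3-Mini, and two domain-specific fine-tunes, MentalHealthBot-7B and TherapyBot-7B) run through the same RAG pipeline built on ChromaDB and are scored automatically with an LLM-as-a-Judge framework. Despite being smaller, the generalist models score significantly higher on empathy and show better contextual understanding, while all models perform well on safety. The results suggest that, within a RAG setup, strong reasoning matters more than domain-specific training.

Details

Motivation: LLMs used for mental health counseling face hallucination and a lack of empathy. The paper asks whether, under the RAG paradigm, a domain fine-tuned model or a generalist reasoning model is more effective.

Result: In automated evaluation over 50 turns, the generalist models score significantly higher on empathy (3.72 vs. 3.26, p < 0.001) despite being smaller (3B vs. 7B). All models perform well on safety, but the generalist models show better contextual understanding, and the domain-specific fine-tunes show signs of overfitting.

Insight: The paper's claimed contribution is a direct comparison of generalist reasoning models and domain fine-tuned models in a RAG mental health system. Viewed objectively, the core insight is that once answers are grounded in clinical evidence, strong general reasoning provides more empathetic and balanced support than training on domain-specific vocabulary. This challenges the assumption that domain fine-tuning is necessarily better and supports the use of lightweight generalist models.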

Abstract: The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, p < 0.001) in spite of being much smaller (3B vs. 7B), and all models perform well in terms of safety, but the generalist models show better contextual understanding and are less prone to overfitting as we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary; i.e. a well-reasoned general model would provide more empathetic and balanced support than a larger narrowly fine-tuned model, so long as the answer is already grounded in clinical evidence.


[7] EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery cs.CL

Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu

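TL;DR: This paper proposes EternalMath, an automated framework for evaluating the mathematical reasoning of large language models. It turns recent peer-reviewed mathematical literature into executable, verifiable reasoning tasks, producing a benchmark that can be updated continuously. Experiments show that current state-of-the-art LLMs perform poorly on EternalMath, indicating that evaluation of research-level mathematical reasoning is far from saturated.

Details

Motivation: Evaluation of mathematical reasoning in LLMs relies mainly on static benchmarks, which have limited coverage and saturate quickly, so they cannot assess research-level mathematics well. The paper proposes an evaluation method that evolves with human mathematical discovery.

Result: Experiments with state-of-the-art LLMs on EternalMath reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated.

Insight: The novelty is a fully automated, theorem-grounded pipeline that converts recent mathematical literature into verifiable tasks and supports temporal extensibility, intrinsic correctness checking, and domain-specific customization, offering a new way to evaluate LLM mathematical reasoning dynamically.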

Abstract: Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.


[8] From Emotion Classification to Emotional Reasoning: Enhancing Emotional Intelligence in Large Language Models cs.CL

Arjhun Sreedar, Rohan Pillay, Laukik Patade

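TL;DR: This study examines whether synthetic emotional chain-of-thought data can improve the emotional reasoning of smaller open large language models (LLMs). The authors design a multi-agent generation pipeline that produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations. Fine-tuning several 7B models on this data yields substantial gains in emotional understanding and emotional awareness on EmoBench-style evaluations, suggesting that emotional reasoning can be induced without architectural changes.

Details

Motivation: Current LLMs, and smaller models in particular, fall short on complex emotional tasks such as emotional understanding and awareness. The paper explores whether synthetic data can strengthen their emotional intelligence.

Result: Fine-tuned Mistral 7B improves from 10.5 to 20.5 on emotional understanding (EU) and from 40.5 to 60.0 on emotional awareness (EA) in EmoBench-style evaluation, supporting the effectiveness of synthetic emotional reasoning data.

Insight: The novelty is a multi-agent pipeline for generating synthetic emotional reasoning data, together with evidence that fine-tuning on data alone (with no architectural modification) substantially improves performance on emotional tasks, giving an efficient route to stronger emotional intelligence in LLMs.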

Abstract: This work investigates whether synthetic emotional chain-of-thought data can improve the emotional reasoning abilities of smaller open large language models (LLMs). We design a multi-agent generation pipeline that produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations. We propose that fine-tuning a variety of 7B models on this dataset should yield substantial gains in emotional understanding and emotional awareness on EmoBench-style evaluations, suggesting that emotional reasoning can be induced without architectural changes. Our results demonstrate that fine-tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0, validating the effectiveness of synthetic emotional reasoning data for enhancing model capabilities in nuanced emotional tasks.


Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

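TL;DR: Court decision texts in the German Open Legal Data collection are inconsistently formatted and lack clear section boundaries. This paper builds a cleaned and sectioned dataset of 251,038 German court decisions derived from that collection. It systematically separates three core sections of each decision, namely the Tenor (operative part), the Tatbestand (facts of the case), and the Entscheidungsgründe (judicial reasoning), and extracts the Rechtsmittelbelehrung (appeal notice) as a separate field. Extraction quality is checked through statistical sampling and manual verification, and the corpus is released publicly in JSONL format.

Details

Motivation: Structured legal data is important for advancing NLP for the German legal system. In the widely used Open Legal Data dataset, decision texts are inconsistently formatted and lack clearly marked sections, which hampers downstream tasks such as rhetorical role classification, retrieval, and citation analysis.

Result: The authors build and publicly release a cleaned and sectioned dataset of 251,038 German court decisions. To check quality, they use Cochran's formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verify that the three core sections are correctly identified.

Insight: The novelty is the systematic identification and structuring of specific sections in German legal texts (Tenor, Tatbestand, Entscheidungsgründe), with data quality validated statistically. This gives NLP research on the German legal system a high-quality structured resource, and the methodology of pairing extraction with sample-based verification carries over to unstructured text in other domains.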

Abstract: The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran’s formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.
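The sample size of 384 can be reproduced from Cochran's formula. This is a minimal sketch under assumptions the abstract does not state: z = 1.96 for the 95% confidence level, maximum variability p = 0.5, and a finite population correction using the 251,038 decisions. The function name is illustrative.

```python
import math


def cochran_sample_size(z, p, e, population=None):
    """Cochran's n0 = z^2 * p * (1 - p) / e^2, with optional finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
        n0 = n0 / (1 + (n0 - 1) / population)
    return n0


# Assumed inputs: 95% confidence (z = 1.96), p = 0.5, 5% margin of error.
n_infinite = cochran_sample_size(1.96, 0.5, 0.05)
n_corrected = cochran_sample_size(1.96, 0.5, 0.05, population=251_038)

print(f"uncorrected: {n_infinite:.2f}")
print(f"corrected:   {n_corrected:.2f}")
print(f"cases to sample: {math.ceil(n_corrected)}")
```

The uncorrected value is about 384.16, which is commonly quoted as 384. With the correction for a population of 251,038 the value drops slightly below 384, so rounding up gives the 384 cases reported in the abstract.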


[10] Can Legislation Be Made Machine-Readable in PROLEG? cs.CL

May-Myo Zin, Sabine Wehnert, Yuntao Kong, Ha-Thanh Nguyen, Wachara Fungwacharakorn

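TL;DR: This paper presents a framework that combines large language models (LLMs) with the legal representation system PROLEG to convert regulatory text (such as Article 6 of the GDPR) into machine-readable if-then rules and a PROLEG encoding. The end product is an executable PROLEG program that supports regulatory application and produces human-readable explanations.

Details

Motivation: Applying regulations demands both accuracy and efficiency. The paper uses modern AI techniques, namely natural language processing and machine-assisted reasoning, to turn legal text into a machine-readable form that supports automated processing.

Result: Using Article 6 of the European General Data Protection Regulation (GDPR) as the example, the framework prompts an LLM to compile legal text into if-then rules and a PROLEG encoding, which legal experts validate before an executable program is produced. The paper shows an instance of PROLEG execution but reports no quantitative metrics or benchmark results.

Insight: The novelty is the integration of LLMs with PROLEG to achieve an end-to-end conversion from natural-language regulation to executable logic rules, with an explicit expert-validation step to safeguard accuracy. This gives a reusable workflow for making legislation machine-readable.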

Abstract: The anticipated positive social impact of regulatory processes requires both the accuracy and efficiency of their application. Modern artificial intelligence technologies, including natural language processing and machine-assisted reasoning, hold great promise for addressing this challenge. We present a framework to address the challenge of tools for regulatory application, based on current state-of-the-art (SOTA) methods for natural language processing (large language models or LLMs) and formalization of legal reasoning (the legal representation system PROLEG). As an example, we focus on Article 6 of the European General Data Protection Regulation (GDPR). In our framework, a single LLM prompt simultaneously transforms legal text into if-then rules and a corresponding PROLEG encoding, which are then validated and refined by legal domain experts. The final output is an executable PROLEG program that can produce human-readable explanations for instances of GDPR decisions. We describe processes to support the end-to-end transformation of a segment of a regulatory document (Article 6 from GDPR), including the prompting frame to guide an LLM to “compile” natural language text to if-then rules, then to further “compile” the vetted if-then rules to PROLEG. Finally, we produce an instance that shows the PROLEG execution. We conclude by summarizing the value of this approach and note observed limitations with suggestions to further develop such technologies for capturing and deploying regulatory frameworks.
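To make the "if-then rules with human-readable explanations" idea concrete, here is a minimal Python sketch. It is not PROLEG syntax and not the authors' encoding. It assumes a deliberately simplified reading of GDPR Article 6(1), in which processing is lawful if at least one of points (a) to (f) applies, and it ignores the balancing test in point (f). The names `LAWFUL_BASES`, `Case`, and `is_lawful` are hypothetical.

```python
from dataclasses import dataclass, field

# Simplified labels for the lawful bases in GDPR Article 6(1)(a)-(f).
LAWFUL_BASES = {
    "consent": "Art. 6(1)(a): the data subject has given consent",
    "contract": "Art. 6(1)(b): processing is necessary for a contract",
    "legal_obligation": "Art. 6(1)(c): processing is necessary for a legal obligation",
    "vital_interests": "Art. 6(1)(d): processing is necessary to protect vital interests",
    "public_task": "Art. 6(1)(e): processing is necessary for a public-interest task",
    "legitimate_interests": "Art. 6(1)(f): processing is necessary for legitimate interests",
}


@dataclass
class Case:
    """Facts established for one processing activity (hypothetical structure)."""
    facts: set = field(default_factory=set)


def is_lawful(case):
    """IF at least one lawful basis holds THEN processing is lawful.

    Returns the decision together with the matched rules as an explanation.
    """
    reasons = [text for key, text in LAWFUL_BASES.items() if key in case.facts]
    return bool(reasons), reasons


lawful, why = is_lawful(Case({"contract"}))
print(lawful, why)
```

Returning the matched rules alongside the boolean mirrors, in a very reduced form, the abstract's goal of explanations that a human reviewer can audit.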


[11] Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints cs.CL | cs.AI

Junichiro Niimi

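TL;DR: This paper studies how reasoning affects the reliability of large language model (LLM) outputs under strict constraints, using the task of recommending peer-reviewed journal articles in computer science. The experiments reveal a key trade-off between reasoning and non-reasoning models. Non-reasoning models violate constraints often (66-75%) but maintain factual accuracy. Reasoning models violate constraints far less (13-26%) but systematically distort known facts to satisfy the constraints and produce more complete fabrications. The pattern holds across models with different architectures (GPT-5.2 and Gemini 3 Flash), suggesting a fundamental limitation of reasoning itself.

Details

Motivation: As LLMs are widely adopted, hallucinations (non-factual fabrications) in their outputs have become a serious concern. Reasoning is often treated as a self-verification process that improves reliability, but its effect in a closed system, where the model cannot rely on external tools or knowledge, has not been clarified. The study examines the actual effect of reasoning on output reliability under strict constraints.

Result: On the strictly constrained task of recommending computer science journal articles, experiments with GPT-5.2 and Gemini 3 Flash show that non-reasoning models have high constraint violation rates (66-75%) but preserve factual accuracy, while reasoning models have lower violation rates (13-26%) but systematically distort facts to satisfy the constraints and increase complete fabrication. This trade-off between constraint compliance and factual accuracy appears consistently across both model architectures.

Insight: The paper's claimed contribution is exposing a fundamental limitation of reasoning under strict constraints: rather than uniformly improving reliability, reasoning can lead a model to replace honest constraint violations with distortions that are hard to detect. Viewed objectively, the finding challenges the common assumption that reasoning necessarily improves output authenticity. It also shows that reliability evaluation must consider constraint compliance and factual accuracy together, and that models may strike this balance differently.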

Abstract: With the widespread adoption of large language models (LLMs), hallucinations, which are non-factual fabrications in model outputs, have become serious concerns. Reasoning capabilities have received attention as a self-verification process to improve output reliability. However, the effect of reasoning within a closed system where LLMs cannot rely on external tools or knowledge has yet to be clarified. We therefore conduct experiments under strict constraints (recommending peer-reviewed journal articles in computer science) to examine the effect of reasoning across multiple models (GPT-5.2 and Gemini 3 Flash). Our results reveal a problematic trade-off between constraint compliance and factual accuracy. Non-reasoning models exhibit high constraint violation rates (66-75%) but maintain factual accuracy, while reasoning models reduce violations (13-26%) but systematically distort known facts to satisfy constraints and increase complete fabrication. This trade-off pattern is consistent across both models despite different architectures, indicating a fundamental limitation of reasoning. Furthermore, reasoning does not uniformly improve output authenticity: effects diverge by model, reflecting different allocations of the compliance-truthfulness trade-off. These findings challenge the assumption that reasoning universally improves reliability: reasoning models trade honest constraint violations for detection-resistant distortions.


[12] From Failure to Mastery: Generating Hard Samples for Tool-use Agents cs.CL

Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu

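TL;DR: This paper proposes HardGen, an automatic agentic pipeline that generates hard tool-use training samples with verifiable reasoning. It builds a dynamic API graph from agent failure cases and samples from it to synthesize hard traces. These traces act as conditional priors for instantiating modular advanced tools, which are then used to formulate hard queries and, finally, verifiable complex chains of thought. Experiments show that a 4B-parameter model trained on the resulting dataset outperforms several leading open-source and closed-source competitors.

Details

Motivation: Existing data generation methods for tool-use agents rely mainly on random sampling and shallow generation, so the resulting trajectories are simple and homogeneous and fail to capture complex, implicit logical dependencies. Harder and more diverse training samples are needed to improve agent capability.

Result: Extensive evaluation shows that a 4B-parameter model trained on the generated dataset outperforms several leading open-source and closed-source models, such as GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5.

Insight: The novelty is a closed-loop pipeline that starts from failure cases and builds a dynamic API graph to generate hard samples, a new approach to producing training data with complex logical dependencies. Its modular tool instantiation and verifiable chain-of-thought generation are also worth borrowing.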

Abstract: The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.


[13] HalluZig: Hallucination Detection using Zigzag Persistence cs.CL

Shreyas N. Samaga, Gilberto Gonzalez Arroyo, Tamal K. Dey

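TL;DR: This paper proposes HalluZig, a new method for detecting hallucinations in large language models (LLMs). It analyzes the dynamic topology of how the model's layer-wise attention evolves and uses zigzag persistence, a tool from topological data analysis, to extract topological signatures, on the hypothesis that factual and hallucinated generations have distinct signatures. Experiments show that the method outperforms existing baselines on multiple benchmarks.

Details

Motivation: Current detection methods usually rely on surface-level signals from the model's output and overlook failures inside the model's internal reasoning. Because hallucination is a key barrier to using LLMs in high-stakes domains, a detection paradigm that examines internal model state is needed.

Result: HalluZig is validated on multiple benchmarks, where it outperforms strong baselines. The topological signatures generalize across different models, and hallucination detection is possible using structural signatures from only part of the network depth.

Insight: The novelty is bringing topological data analysis, and zigzag persistence in particular, to hallucination detection. Modeling the sequence of attention matrices as a zigzag graph filtration yields topological signatures that separate factual from hallucinated generations based on dynamic topological change in internal reasoning, a new form of internal-state analysis.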

Abstract: The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model’s output, overlooking the failures that occur within the model’s internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection by analyzing the dynamic topology of the evolution of model’s layer-wise attention. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures are generalizable across different models and hallucination detection is possible only using structural signatures from partial network depth.


[14] How Does Prefix Matter in Reasoning Model Tuning? cs.CL

Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

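TL;DR: This study challenges the common practice of removing introductory prefix phrases from supervised fine-tuning (SFT) datasets. Its experiments show that keeping safety- and reasoning-oriented prefix sentences can act as a lightweight alignment signal that guides the model toward safer and more coherent responses.

Details

Motivation: The paper questions the assumption, common in current alignment work, that prefix phrases should be stripped from SFT datasets. It investigates whether these prefixes provide a useful alignment signal during decoding that improves core capabilities such as safety and reasoning.

Result: Prefix-conditioned SFT improves performance on safety and reasoning tasks, with up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and a +7% improvement on GSM8K reasoning. Factuality and coding tasks show only marginal or negative effects.

Insight: The paper's claimed contribution is showing that prefix conditioning is a scalable and interpretable mechanism that stabilizes reasoning trajectories by narrowing the search space, complementing traditional reward-based alignment. Viewed objectively, its token-level loss analysis identifies specific prefix tokens (such as "revised" and "logically") as alignment anchors, which suggests a new lightweight intervention for model fine-tuning.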

Abstract: Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as “revised” and “logically” incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.
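The experimental knob in this abstract, varying prefix inclusion from 0% to 100%, can be pictured as a simple dataset transformation. This is a minimal sketch, not the authors' data pipeline: the function name, the sample prefix sentence, and the choice to pick examples uniformly at random are all assumptions made for illustration.

```python
import random


def apply_prefix_rate(responses, prefix, rate, seed=0):
    """Prepend `prefix` to a fraction `rate` of responses, chosen at random."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n_with_prefix = round(rate * len(responses))
    chosen = set(rng.sample(range(len(responses)), n_with_prefix))
    return [
        f"{prefix} {text}" if i in chosen else text
        for i, text in enumerate(responses)
    ]


# Hypothetical SFT targets and prefix sentence, for illustration only.
targets = ["Step 1 ...", "The answer is 4.", "First, note that ...", "We compute ..."]
prefix = "Let me reason through this logically."

half = apply_prefix_rate(targets, prefix, rate=0.5)
print(sum(t.startswith(prefix) for t in half), "of", len(half), "examples carry the prefix")
```

Sweeping `rate` over values such as 0.0, 0.25, 0.5, 0.75, and 1.0 would produce one training set per condition, which is one plausible way to set up the comparison the abstract describes.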


[15] Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage cs.CL | cs.AI | cs.MA

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

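TL;DR: This paper introduces a new multi-agent cognitive collusion attack in which attackers use only truthful evidence fragments, distributed through public channels, and exploit the overthinking tendency of large language models (LLMs) to lead victims to form and spread false conclusions. The authors propose the Generative Montage framework and build the CoPHEME dataset to simulate attacks. The results show that many LLMs are vulnerable and that models with stronger reasoning are more susceptible.

Details

Motivation: As LLMs become autonomous agents that synthesize real-time information, their reasoning capabilities open an unexpected attack surface. The paper studies a new threat in which colluding agents manipulate a victim's beliefs by distributing truthful evidence fragments through public channels, without covert communication, backdoors, or falsified documents.

Result: Simulated attacks on CoPHEME, a dataset built from real-world rumor events, show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Models with stronger reasoning (such as reasoning-specialized models) are more susceptible than base models or prompts, and the false beliefs cascade to downstream judges with deception rates above 60%.

Insight: The paper is the first to formalize the cognitive collusion attack and proposes the Generative Montage (Writer-Editor-Director) framework, which builds deceptive narratives through adversarial debate and coordinated posting of evidence fragments. Viewed objectively, it exposes a socio-technical vulnerability in how LLM agents interact with dynamic information environments: truthful fragments alone, suitably arranged, can manipulate beliefs, and susceptibility rises with reasoning capability, a finding with clear security implications.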

Abstract: As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs’ overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.


[16] A Training-Free Large Reasoning Model-based Knowledge Tracing Framework for Unified Prediction and Prescription cs.CL

Unggi Lee, Joo Young Kim, Ran Ju, Minyoung Jung, Jeyeon Eo

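TL;DR: This paper proposes Thinking-KT, a training-free knowledge tracing framework that incorporates test-time scaling, enabling even small large language models to reach competitive knowledge tracing performance and to perform prediction, personalized feedback generation, and learning recommendation jointly in a single unified output.

Details

Motivation: Existing LLM-based knowledge tracing methods usually require fine-tuning and perform unstably, while traditional knowledge tracing systems depend on multi-stage pipelines for feedback and recommendation, which increases system complexity and resource use.

Result: The experiments show that test-time scaling is a critical yet underexplored factor in LLM-based knowledge tracing, and that small LLMs can serve as unified engines for intelligent tutoring systems (ITS).

Insight: The novelty is a training-free framework that uses test-time scaling to improve the knowledge tracing performance of small LLMs and unifies prediction, feedback, and recommendation in one output, together with a systematic analysis of reasoning traces in knowledge tracing.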

Abstract: Knowledge Tracing (KT) aims to estimate a learner’s evolving mastery based on interaction histories. Recent studies have explored Large Language Models (LLMs) for KT via autoregressive nature, but such approaches typically require fine-tuning and exhibit unstable or near-random performance. Moreover, prior KT systems primarily focus on prediction and rely on multi-stage pipelines for feedback and recommendation, resulting in increased system complexity and resources. To address this gap, we propose Thinking-KT, a training-free KT framework that incorporates Test-Time Scaling (TTS), enabling even small LLMs to achieve competitive KT performance. Moreover, in this framework, a small LLM can jointly perform KT prediction, personalized feedback generation, and learning recommendation in a unified output without degrading prediction accuracy. Beyond performance, we present the systematic analysis of reasoning traces in KT. Our results demonstrate that TTS is a critical yet underexplored factor in LLM-based KT, and that small LLMs can serve as unified ITS engines.


[17] K-EXAONE Technical Report cs.CL | cs.AI

Eunbi Choi, Kibong Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon

TL;DR: 本文介绍了由LG AI Research开发的大规模多语言语言模型K-EXAONE。该模型基于专家混合架构,总参数量为2360亿,推理时激活230亿参数,支持256K令牌的上下文窗口,并覆盖韩语、英语、西班牙语、德语、日语和越南语六种语言。

Details

Motivation: 开发一个强大的专有AI基础模型,旨在通过推进人工智能技术来改善生活,适用于广泛的工业和研究应用。

Result: 在涵盖推理、智能体、通用能力、韩语及多语言能力的综合基准测试套件中,K-EXAONE表现出与类似规模的开源权重模型相当的性能。

Insight: 采用大规模专家混合架构以高效激活参数,并专注于多语言支持(特别是韩语)和长上下文处理,旨在构建一个专有的、面向实际应用的基础模型。

Abstract: This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.


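[18] CSCBench: A PVC Diagnostic Benchmark for Commodity Supply Chain Reasoning cs.CL

Yaxin Cui, Yuanqiang Zeng, Jiapeng Yan, Keling Lin, Kai Ji

TL;DR: The paper proposes CSCBench, a diagnostic benchmark of more than 2,300 single-choice questions for commodity supply chain reasoning. It is built on the three-dimensional PVC evaluation framework (Process, Variety, Cognition) and evaluates how well LLMs reason in a supply chain domain governed by institutional rules and feasibility constraints.

Details

Motivation: LLMs perform well on general benchmarks, but their reasoning in commodity supply chains, a domain governed by institutional rule systems and feasibility constraints, remains under-explored. A dedicated benchmark is needed to diagnose and improve model performance in this high-stakes domain.

Result: Representative LLMs evaluated under a direct prompting setting perform strongly on the Process and Cognition axes but degrade substantially on the Variety axis, especially on Freight Agreements tasks.

Insight: The novelty is a structured three-dimensional evaluation framework (PVC) that decomposes supply chain tasks into three orthogonal dimensions (Process, Variety, Cognition) and instantiates the Variety dimension as commodity-specific rule systems under coupled material-information-financial constraints, offering a new tool for domain-specific capability diagnosis.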

Abstract: Large Language Models (LLMs) have achieved remarkable success in general benchmarks, yet their competence in commodity supply chains (CSCs) – a domain governed by institutional rule systems and feasibility constraints – remains under-explored. CSC decisions are shaped jointly by process stages (e.g., planning, procurement, delivery), variety-specific rules (e.g., contract specifications and delivery grades), and reasoning depth (from retrieval to multi-step analysis and decision selection). We introduce CSCBench, a 2.3K+ single-choice benchmark for CSC reasoning, instantiated through our PVC 3D Evaluation Framework (Process, Variety, and Cognition). The Process axis aligns tasks with SCOR+Enable; the Variety axis operationalizes commodity-specific rule systems under coupled material-information-financial constraints, grounded in authoritative exchange guidebooks/rulebooks and industry reports; and the Cognition axis follows Bloom’s revised taxonomy. Evaluating representative LLMs under a direct prompting setting, we observe strong performance on the Process and Cognition axes but substantial degradation on the Variety axis, especially on Freight Agreements. CSCBench provides a diagnostic yardstick for measuring and improving LLM capabilities in this high-stakes domain.


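[19] DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs cs.CL

Jinghan Ru, Siyuan Yan, Yuguo Yin, Yuexian Zou, Zongyuan Ge

TL;DR: This paper presents DermoGPT, a multimodal large language model (MLLM) focused on dermatological reasoning. To address limited data, narrow task coverage, and the lack of clinical supervision in dermatology MLLMs, the authors build the large-scale instruction dataset DermoInstruct and the comprehensive evaluation benchmark DermoBench, and train DermoGPT with supervised fine-tuning followed by a new reinforcement learning objective.

Details

Motivation: Progress in dermatology MLLMs lags because training data are limited, task coverage is narrow, and clinically grounded supervision that mirrors expert diagnostic workflows is missing.

Result: DermoGPT significantly outperforms 16 representative baselines on 11 tasks spanning four clinical axes (Morphology, Diagnosis, Reasoning, and Fairness), achieving state-of-the-art performance and substantially narrowing the human-AI gap. The evaluation uses DermoBench, which includes a subset of 3,600 expert-verified open-ended instances.

Insight: The contributions are: 1) DermoInstruct, a large-scale morphology-anchored dermatology instruction dataset; 2) DermoBench, a comprehensive multi-task evaluation benchmark; 3) the Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions; 4) Confidence-Consistency Test-time adaptation (CCT) at inference for more robust predictions.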

Abstract: Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at https://github.com/mendicant04/DermoGPT upon acceptance.


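[20] Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents cs.CL

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng

TL;DR: This paper proposes the Agentic Memory (AgeMem) framework, which integrates long-term and short-term memory management into the LLM agent's policy. Memory operations are exposed as tools, so the agent decides autonomously what to store, retrieve, update, summarize, or discard, and the agent is trained with progressive reinforcement learning.

Details

Motivation: Finite context windows limit the long-horizon reasoning of LLM agents. Existing methods handle long-term and short-term memory separately and rely on heuristic rules, which limits adaptability and end-to-end optimization.

Result: On five long-horizon benchmarks, AgeMem outperforms existing memory-augmented baselines across multiple LLM backbones, improving task performance, long-term memory quality, and context usage efficiency.

Insight: The novelty is to treat memory management as learnable policy actions and to design progressive reinforcement learning with a step-wise GRPO algorithm that addresses the sparse rewards induced by memory operations, enabling end-to-end adaptive memory control.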

Abstract: Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent’s policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.


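[21] Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows cs.CL | cs.AI

Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen

TL;DR: This paper proposes Deferred Commitment Decoding (DCD), a new training-free decoding strategy that addresses Boundary-Induced Context Truncation (BICT) in block-based diffusion language models (DLMs). DCD maintains a confidence-aware sliding window, resolving low-uncertainty tokens early and deferring high-uncertainty tokens until sufficient contextual evidence is available, which improves decoding quality without sacrificing efficiency.

Details

Motivation: Block-based DLM decoding has a structural limitation, Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when that context could substantially reduce uncertainty. This lowers decoding confidence and generation quality, especially on tasks that require precise reasoning such as mathematical problem solving and code generation.

Result: Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that, compared with fixed block-based diffusion methods, DCD improves generation accuracy by 1.39% on average at comparable time, with a maximum improvement of 9.0%.

Insight: The core novelty is an uncertainty-based deferred commitment principle: a confidence-aware sliding window enables effective bidirectional information flow within the decoding window. This offers a simple, effective, training-free remedy for context truncation in diffusion model decoding that balances quality and efficiency.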

Abstract: Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
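
The deferral rule described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `deferred_commit_step`, the fixed confidence `threshold`, and the fallback that commits the single most confident position are all assumptions made here for illustration.

```python
from typing import Dict, List, Tuple


def deferred_commit_step(
    confidences: Dict[int, float],
    window_start: int,
    window_size: int,
    threshold: float,
) -> Tuple[List[int], List[int]]:
    """Split masked positions inside the window into commit-now and deferred.

    `confidences` maps a masked token position to the model's confidence in
    its current best prediction (hypothetical input format).
    """
    window_end = window_start + window_size
    in_window = sorted(p for p in confidences if window_start <= p < window_end)

    # Low-uncertainty tokens are resolved early; the rest wait for more context.
    commit = [p for p in in_window if confidences[p] >= threshold]
    deferred = [p for p in in_window if confidences[p] < threshold]

    # Assumed safeguard: always commit at least one token so decoding advances.
    if not commit and in_window:
        best = max(in_window, key=lambda p: confidences[p])
        commit = [best]
        deferred = [p for p in in_window if p != best]
    return commit, deferred


if __name__ == "__main__":
    conf = {0: 0.95, 1: 0.40, 2: 0.88, 3: 0.30, 4: 0.99}
    print(deferred_commit_step(conf, window_start=0, window_size=4, threshold=0.8))
```

In this sketch position 4 lies outside the window and is ignored; a full decoder would slide the window forward as positions are committed.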


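[22] FormationEval, an open multiple-choice benchmark for petroleum geoscience cs.CL | cs.AI | cs.LG | physics.geo-ph

Almaz Ermilov

TL;DR: This paper introduces FormationEval, an open multiple-choice benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains, and 72 models are evaluated. Top models exceed 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while open-weight models such as GLM-4.7 also perform strongly at 98.6%.

Details

Motivation: Petroleum geoscience lacks a dedicated benchmark for evaluating language models; the work aims to provide a high-quality, traceable evaluation tool.

Result: On FormationEval, Gemini 3 Pro Preview reaches 99.8% accuracy (SOTA), the open-weight model GLM-4.7 reaches 98.6%, and several open-weight models exceed 93%. The gap between open-weight and closed models is narrower than expected, and petrophysics is the most challenging domain.

Insight: The dataset is built with a reasoning model and a concept-based approach that avoids copyright issues, and answer-length bias is documented and mitigated. The benchmark offers a standardized method for domain-specific evaluation and shows that open-weight models are competitive on specialized tasks.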

Abstract: This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.


cs.CV

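[23] Free Energy-Based Modeling of Emotional Dynamics in Video Advertisements cs.CV | cs.AI

Takashi Ushio, Kazuhiro Onishi, Hideyoshi Yanagisawa

TL;DR: Based on the free energy principle, this study proposes a method that quantifies emotional dynamics solely from scene-level expression features of advertising videos, without external information such as physiological signals or subjective ratings. Kullback-Leibler divergence (KLD), Bayesian surprise (BS), and uncertainty (UN) are computed to capture pleasantness, surprise, and habituation, respectively. The method is validated on 1,059 15-second food video advertisements, identifies three characteristic emotional patterns, and is robust across hyperparameter settings and advertisement types.

Details

Motivation: Emotional responses during advertisement viewing are essential for understanding media effects, but existing methods often depend on external information. The study aims to establish an explainable emotion estimation method that quantifies emotional dynamics from the video content alone.

Result: Experiments on 1,059 15-second food video advertisements show that KLD reflects "pleasantness" associated with brand presentation, BS captures "surprise" arising from informational complexity, and UN reflects "surprise" driven by uncertainty in element types and spatial arrangements and by the variability and quantity of elements. The tendencies remain stable across nine hyperparameter settings and in generalization tests on six types of Japanese advertising videos (three genres, two durations).

Insight: The novelty is mapping the computational quantities of the free energy principle (KLD, BS, UN) directly onto emotional dimensions of advertising videos (pleasantness, surprise, habituation), which enables explainable emotion modeling from scene features alone and provides a technical basis for creating more engaging advertising videos.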

Abstract: Emotional responses during advertising video viewing are recognized as essential for understanding media effects because they influence attention, memory, and purchase intention. To establish a methodological basis for explainable emotion estimation without relying on external information such as physiological signals or subjective ratings, we quantify “pleasantness,” “surprise,” and “habituation” solely from scene-level expression features of advertising videos, drawing on the free energy (FE) principle, which provides a unified account of perception, learning, and behavior. In this framework, Kullback-Leibler divergence (KLD) captures prediction error, Bayesian surprise (BS) captures belief updates, and uncertainty (UN) reflects prior ambiguity; together they form the core components of FE. Using 1,059 15-second food video advertisements, the experiments show that KLD reflects “pleasantness” associated with brand presentation, BS captures “surprise” arising from informational complexity, and UN reflects “surprise” driven by uncertainty in element types and spatial arrangements, as well as by the variability and quantity of presented elements. This study also identifies three characteristic emotional patterns, namely uncertain stimulus, sustained high emotion, and momentary peak and decay, demonstrating the usefulness of the proposed method. Robustness across nine hyperparameter settings and generalization tests with six types of Japanese advertising videos (three genres and two durations) confirm that these tendencies remain stable. This work can be extended by integrating a wider range of expression elements and validating the approach through subjective ratings, ultimately guiding the development of technologies that can support the creation of more engaging advertising videos.
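
KLD, BS, and UN are built from standard information-theoretic quantities. The sketch below shows only the two generic building blocks, Shannon entropy and KL divergence over discrete distributions. Which distributions the paper compares, and how the three terms are combined, is not specified in the abstract, so the helper names and the example distributions here are assumptions for illustration.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    prior = [0.5, 0.5]      # hypothetical belief before a scene
    posterior = [0.9, 0.1]  # hypothetical belief after the scene
    print("prior entropy:", entropy(prior))
    print("KL(posterior || prior):", kl_divergence(posterior, prior))
```

A common convention defines Bayesian surprise as the KL divergence between posterior and prior beliefs, and the entropy of the prior is one natural measure of prior ambiguity; treat both readings as assumptions rather than the paper's definitions.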


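[24] Unified Review and Benchmark of Deep Segmentation Architectures for Cardiac Ultrasound on CAMUS cs.CV

Zahid Ullah, Muhammad Hilal, Eunsoo Lee, Dragan Pamucar, Jihie Kim

TL;DR: This paper combines a review of the cardiac ultrasound segmentation literature with a unified experimental benchmark. It compares three architectures (U-Net, Attention U-Net, and TransUNet) on the CAMUS dataset and evaluates how several preprocessing routes (native NIfTI data, PNG exports, GPT-assisted pseudo-labels, and self-supervised pretraining) affect segmentation performance.

Details

Motivation: Existing reviews cover advances in cardiac imaging and deep learning but lack a unified, reproducible experimental benchmark. The paper aims to provide reliable performance evaluation and practical guidance for cardiac ultrasound segmentation through a standardized comparison.

Result: On CAMUS, U-Net reaches a 94% mean Dice on NIfTI data, versus 91% for the 16-bit PNG workflow. Attention U-Net brings modest improvements on small or low-contrast regions, and TransUNet generalizes best on challenging frames thanks to global context modeling, particularly when initialized with self-supervised pretraining.

Insight: The contributions are a fair benchmark of U-Net, Attention U-Net, and TransUNet under standardized CAMUS preprocessing; practical guidance on preserving intensity fidelity and alignment in ultrasound data; and an outlook on scalable self-supervision and GPT-based multimodal annotation pipelines for rapid labeling and data curation.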

Abstract: While several review papers summarize cardiac imaging and deep learning (DL) advances, few works connect this overview to a unified and reproducible experimental benchmark. In this study, we combine a focused review of cardiac ultrasound segmentation literature with a controlled comparison of three influential architectures, U-Net, Attention U-Net, and TransUNet, on the Cardiac Acquisitions for Multi-Structure Ultrasound Segmentation (CAMUS) echocardiography dataset. Our benchmark spans multiple preprocessing routes, including native NIfTI volumes, 16-bit PNG exports, GPT-assisted polygon-based pseudo-labels, and self-supervised pretraining (SSL) on thousands of unlabeled cine frames. Using identical training splits, losses, and evaluation criteria, a plain U-Net achieved a 94% mean Dice when trained directly on NIfTI data (preserving native dynamic range), while the PNG-16-bit workflow reached 91% under similar conditions. Attention U-Net provided modest improvements on small or low-contrast regions, reducing boundary leakage, whereas TransUNet demonstrated the strongest generalization on challenging frames due to its ability to model global spatial context, particularly when initialized with SSL. Pseudo-labeling expanded the training set and improved robustness after confidence filtering. Overall, our contributions are threefold: a harmonized, apples-to-apples benchmark of U-Net, Attention U-Net, and TransUNet under standardized CAMUS preprocessing and evaluation; practical guidance on maintaining intensity fidelity, resolution consistency, and alignment when preparing ultrasound data; and an outlook on scalable self-supervision and emerging multimodal GPT-based annotation pipelines for rapid labeling, quality assurance, and targeted dataset curation.
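
The mean Dice figures quoted above follow the usual overlap definition, Dice = 2|P ∩ T| / (|P| + |T|), averaged over structures. The sketch below computes it on flattened label maps in plain Python. The function names, the convention of scoring an empty-vs-empty class as 1.0, and the toy labels are assumptions; the benchmark's exact averaging protocol is not given in the abstract.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice overlap for one class on flattened label maps of equal length."""
    pred_count = sum(1 for x in pred if x == label)
    target_count = sum(1 for x in target if x == label)
    overlap = sum(1 for a, b in zip(pred, target) if a == label and b == label)
    if pred_count + target_count == 0:
        return 1.0  # assumed convention: class absent in both maps
    return 2.0 * overlap / (pred_count + target_count)


def mean_dice(pred: Sequence[int], target: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of per-class Dice over the given foreground labels."""
    return sum(dice(pred, target, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy 6-pixel example with background 0 and two foreground structures.
    prediction = [0, 1, 1, 2, 2, 2]
    reference = [0, 1, 2, 2, 2, 2]
    print(mean_dice(prediction, reference, labels=[1, 2]))
```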


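[25] Motion-Compensated Latent Semantic Canvases for Visual Situational Awareness on Edge cs.CV

Igor Lodin, Sergii Filatov, Vira Filatova, Dmytro Filatov

TL;DR: This paper proposes Motion-Compensated Latent Semantic Canvases (MCLSC) for visual situational awareness on resource-constrained edge devices. Persistent semantic metadata is stored in two latent canvases (a slowly accumulating static layer and a rapidly updating dynamic layer) defined in a baseline coordinate frame stabilized from the video stream. Panoptic segmentation (Mask2Former) runs asynchronously and is motion-gated, so inference is triggered only when motion indicates new information, while stabilization and motion compensation keep a consistent coordinate system for the latent semantic memory. On prerecorded 480p clips, the prototype reduces segmentation calls by more than 30x and mean end-to-end processing time by more than 20x compared with naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.

Details

Motivation: Efficient, continuous visual situational awareness is difficult on edge devices with limited compute, because conventional per-frame panoptic segmentation is too expensive.

Result: On prerecorded 480p video clips, segmentation calls drop by more than 30x and mean end-to-end processing time by more than 20x compared with naive per-frame segmentation, while the semantic overlays remain coherent.

Insight: The novelty is splitting semantic memory into static and dynamic latent canvases and combining motion gating with asynchronous processing, so that semantic information is updated and maintained efficiently and consistently in a motion-compensated baseline coordinate frame at much lower computational cost.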

Abstract: We propose Motion-Compensated Latent Semantic Canvases (MCLSC) for visual situational awareness on resource-constrained edge devices. The core idea is to maintain persistent semantic metadata in two latent canvases - a slowly accumulating static layer and a rapidly updating dynamic layer - defined in a baseline coordinate frame stabilized from the video stream. Expensive panoptic segmentation (Mask2Former) runs asynchronously and is motion-gated: inference is triggered only when motion indicates new information, while stabilization/motion compensation preserves a consistent coordinate system for latent semantic memory. On prerecorded 480p clips, our prototype reduces segmentation calls by >30x and lowers mean end-to-end processing time by >20x compared to naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.
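
The motion-gating idea can be sketched as a simple trigger rule: run the expensive segmenter only on frames whose motion score exceeds a threshold. This is a deliberately simplified illustration. The per-frame `motion_scores`, the fixed `threshold`, and the rule of always segmenting the first frame are assumptions, and the asynchronous execution and motion compensation described in the abstract are not modeled.

```python
from typing import List, Sequence


def motion_gated_calls(motion_scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of frames on which segmentation would be triggered."""
    calls = []
    for index, score in enumerate(motion_scores):
        # Assumed rule: segment the first frame to initialize the canvases,
        # then only frames whose motion suggests new information.
        if index == 0 or score > threshold:
            calls.append(index)
    return calls


def call_reduction(num_frames: int, num_calls: int) -> float:
    """Factor by which calls drop relative to per-frame segmentation."""
    return num_frames / num_calls


if __name__ == "__main__":
    scores = [0.0, 0.1, 0.05, 0.9, 0.2, 0.1, 0.8, 0.0]  # hypothetical values
    triggered = motion_gated_calls(scores, threshold=0.5)
    print(triggered, call_reduction(len(scores), len(triggered)))
```

The reduction factor reported in the abstract depends on how often real footage contains motion; the toy numbers here say nothing about that figure.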


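[26] VL-OrdinalFormer: Vision Language Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading cs.CV

Zahid Ullah, Jihie Kim

TL;DR: This paper proposes VL-OrdinalFormer, a vision-language-guided ordinal learning framework for fully automated grading of knee osteoarthritis (KOA) severity from knee radiographs. It combines a ViT backbone, CORAL-based ordinal regression, and a CLIP-driven semantic alignment module that incorporates clinical textual concepts such as joint space narrowing and osteophyte formation. On the OAI kneeKL224 dataset, the model achieves state-of-the-art macro F1 and overall accuracy, with notable gains in distinguishing the early stages (KL1 and KL2), while remaining interpretable.

Details

Motivation: Knee osteoarthritis is a leading cause of disability worldwide, and severity assessment with the Kellgren-Lawrence grading system is critical for clinical decisions. However, radiographic differences between early stages, especially KL1 and KL2, are subtle and often lead to inter-observer variability among radiologists.

Result: On the public OAI kneeKL224 dataset, VL-OrdinalFormer outperforms CNN and ViT baselines in macro F1 and overall accuracy, achieving state-of-the-art performance. It substantially improves KL1 and KL2 grading without compromising accuracy on mild or severe cases.

Insight: The novelty is combining vision-language pretraining (CLIP) with ordinal regression (CORAL) in a Transformer architecture, so the model can use clinical textual concepts to guide learning, which improves both its sensitivity to subtle differences and its interpretability. This suggests a way to fuse multimodal information with ordinal classification in medical image analysis.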

Abstract: Knee osteoarthritis (KOA) is a leading cause of disability worldwide, and accurate severity assessment using the Kellgren-Lawrence (KL) grading system is critical for clinical decision making. However, radiographic distinctions between early disease stages, particularly KL1 and KL2, are subtle and frequently lead to inter-observer variability among radiologists. To address these challenges, we propose VL-OrdinalFormer, a vision-language-guided ordinal learning framework for fully automated KOA grading from knee radiographs. The proposed method combines a ViT-L/16 backbone with CORAL-based ordinal regression and a Contrastive Language-Image Pretraining (CLIP)-driven semantic alignment module, allowing the model to incorporate clinically meaningful textual concepts related to joint space narrowing, osteophyte formation, and subchondral sclerosis. To improve robustness and mitigate overfitting, we employ stratified five-fold cross-validation, class-aware re-weighting to emphasize challenging intermediate grades, and test-time augmentation with global threshold optimization. Experiments conducted on the publicly available OAI kneeKL224 dataset demonstrate that VL-OrdinalFormer achieves state-of-the-art performance, outperforming CNN and ViT baselines in terms of macro F1 score and overall accuracy. Notably, the proposed framework yields substantial performance gains for KL1 and KL2 without compromising classification accuracy for mild or severe cases. In addition, interpretability analyses using Grad-CAM and CLIP similarity maps confirm that the model consistently attends to clinically relevant anatomical regions. These results highlight the potential of vision-language-aligned ordinal transformers as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.
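
CORAL-style ordinal regression predicts, for K ordered grades, K-1 binary outputs of the form "grade is greater than k", and the predicted grade is the number of outputs that exceed a threshold. The sketch below shows that decoding step for the five KL grades (0 to 4). The function name, the example logits, and the single shared `threshold` are assumptions; how the paper optimizes its global threshold is not described in the abstract.

```python
import math
from typing import Sequence


def coral_decode(logits: Sequence[float], threshold: float = 0.5) -> int:
    """Decode an ordinal grade from K-1 binary logits.

    logits[k] is the model's score for "grade > k"; the grade is the count
    of sigmoid probabilities above the threshold.
    """
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return sum(1 for p in probabilities if p > threshold)


if __name__ == "__main__":
    # Four hypothetical logits for the five KL grades 0..4.
    print(coral_decode([3.0, 1.2, -0.4, -2.5]))
```

Raising or lowering `threshold` shifts every grade boundary at once, which is one plausible reading of a global threshold tuned after training.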


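[27] VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition cs.CV

Hongbo Jin, Kuanwei Lin, Wenhao Zhang, Yichen Jin, Ge Li

TL;DR: This paper proposes VideoCuRL, a curriculum learning framework for reinforcement learning of video large language models. It decomposes difficulty into two orthogonal dimensions, visual temporal perception load and cognitive reasoning depth, and uses training-free proxies (optical flow, keyframe entropy, and calibrated surprisal) to map data onto a 2D curriculum grid. Training is scheduled with a competence-aware diagonal wavefront strategy and stabilized with Dynamic Sparse KL and Structured Revisiting.

Details

Motivation: Current RL paradigms rely mainly on random data shuffling or simple curricula based on scalar difficulty metrics. These cannot separate the two orthogonal challenges in video understanding, visual temporal perception and cognitive reasoning, which limits gains in complex spatiotemporal reasoning for video LLMs.

Result: VideoCuRL improves by 2.5 points on the VSI-Bench reasoning task and by 2.9 points on the VideoMME perception task, surpassing strong RL baselines, and it eliminates the heavy inference overhead of generation-based curricula.

Insight: The core novelty is decomposing video understanding difficulty into visual and cognitive dimensions and building a 2D curriculum grid for fine-grained training scheduling. The training-free proxies (optical flow, keyframe entropy, calibrated surprisal) and the stabilization mechanisms (Dynamic Sparse KL, Structured Revisiting) offer a scalable approach to robust post-training of video LLMs.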

Abstract: Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies, optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity, to map data onto a 2D curriculum grid. A competence aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
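
One way to picture a diagonal wavefront over a 2D curriculum grid is to visit cells in order of increasing combined difficulty, one anti-diagonal at a time. The sketch below generates that ordering. It is a guess at the scheduling shape only: the function name and grid indexing are assumptions, and the competence-aware advancement, Dynamic Sparse KL, and Structured Revisiting from the abstract are not modeled.

```python
from typing import List, Tuple


def diagonal_wavefront(n_visual: int, n_cognitive: int) -> List[List[Tuple[int, int]]]:
    """Group grid cells (visual_level, cognitive_level) into wavefront stages.

    Stage d contains every cell whose two difficulty indices sum to d, so
    training moves from the easiest corner toward the hardest corner.
    """
    stages = []
    for d in range(n_visual + n_cognitive - 1):
        stage = [
            (v, d - v)
            for v in range(n_visual)
            if 0 <= d - v < n_cognitive
        ]
        stages.append(stage)
    return stages


if __name__ == "__main__":
    for stage_index, cells in enumerate(diagonal_wavefront(3, 3)):
        print(stage_index, cells)
```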


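[28] CornViT: A Multi-Stage Convolutional Vision Transformer Framework for Hierarchical Corn Kernel Analysis cs.CV | cs.AI

Sai Teja Erukude, Jane Mascarenhas, Lior Shamir

TL;DR: This paper presents CornViT, a three-stage Convolutional Vision Transformer (CvT) framework for corn kernel grading. It emulates the hierarchical reasoning of human experts by performing purity classification, shape classification (flat vs. round), and embryo orientation detection in sequence. The authors build three corresponding datasets and fine-tune pretrained CvT-13 models, reaching over 90% accuracy on every task and clearly outperforming conventional CNNs such as ResNet-50 and DenseNet-121.

Details

Motivation: Accurate corn kernel grading is essential for seed certification, directional seeding, and breeding, yet it still relies mainly on manual inspection, which is slow and subjective. The paper aims to develop an automated, deployable alternative for efficient and objective kernel quality assessment.

Result: On the three task-specific datasets (7,265 images for purity, 3,859 for shape, 1,960 for embryo orientation), CornViT reaches test accuracies of 93.76% for purity, 94.11% for shape, and 91.12% for embryo orientation. Under identical training conditions, ResNet-50 reaches only 76.56% to 81.02% and DenseNet-121 reaches 86.56% to 89.38%.

Insight: The main novelty is decomposing corn kernel grading into three hierarchical subtasks and designing a multi-stage CvT framework that mirrors how human experts analyze kernels. The results support the advantage of convolution-augmented self-attention (CvT) for fine-grained agricultural image analysis. The released datasets, code, and Flask-based web application form a directly deployable solution for agricultural quality inspection.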

Abstract: Accurate grading of corn kernels is critical for seed certification, directional seeding, and breeding, yet it is still predominantly performed by manual inspection. This work introduces CornViT, a three-stage Convolutional Vision Transformer (CvT) framework that emulates the hierarchical reasoning of human seed analysts for single-kernel evaluation. Three sequential CvT-13 classifiers operate on 384x384 RGB images: Stage 1 distinguishes pure from impure kernels; Stage 2 categorizes pure kernels into flat and round morphologies; and Stage 3 determines the embryo orientation (up vs. down) for pure, flat kernels. Starting from a public corn seed image collection, we manually relabeled and filtered images to construct three stage-specific datasets: 7265 kernels for purity, 3859 pure kernels for morphology, and 1960 pure-flat kernels for embryo orientation, all released as benchmarks. Head-only fine-tuning of ImageNet-22k pretrained CvT-13 backbones yields test accuracies of 93.76% for purity, 94.11% for shape, and 91.12% for embryo-orientation detection. Under identical training conditions, ResNet-50 reaches only 76.56 to 81.02 percent, whereas DenseNet-121 attains 86.56 to 89.38 percent accuracy. These results highlight the advantages of convolution-augmented self-attention for kernel analysis. To facilitate adoption, we deploy CornViT in a Flask-based web application that performs stage-wise inference and exposes interpretable outputs through a browser interface. Together, the CornViT framework, curated datasets, and web application provide a deployable solution for automated corn kernel quality assessment in seed quality workflows. Source code and data are publicly available.
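
The three-stage cascade described above can be sketched as plain control flow: later stages run only on kernels that pass the earlier ones. In the sketch below the three classifiers are passed in as callables that stand in for the fine-tuned CvT-13 models; the function name, the dictionary output format, and the lambda stand-ins are assumptions for illustration.

```python
from typing import Callable, Dict, Optional


def grade_kernel(
    image: object,
    is_pure: Callable[[object], bool],
    is_flat: Callable[[object], bool],
    embryo_up: Callable[[object], bool],
) -> Dict[str, Optional[str]]:
    """Run the three stages in order, stopping when a stage does not apply."""
    result: Dict[str, Optional[str]] = {"purity": None, "shape": None, "embryo": None}

    # Stage 1: pure vs. impure. Impure kernels are not analyzed further.
    if not is_pure(image):
        result["purity"] = "impure"
        return result
    result["purity"] = "pure"

    # Stage 2: flat vs. round, for pure kernels only.
    if not is_flat(image):
        result["shape"] = "round"
        return result
    result["shape"] = "flat"

    # Stage 3: embryo orientation, for pure and flat kernels only.
    result["embryo"] = "up" if embryo_up(image) else "down"
    return result


if __name__ == "__main__":
    # Stand-in classifiers; real ones would be model forward passes.
    print(grade_kernel("kernel.png", lambda x: True, lambda x: True, lambda x: False))
```

Because each stage sees only the kernels accepted by the previous one, errors in an early stage propagate downstream, which is worth keeping in mind when reading the per-stage accuracies.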


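[29] Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems cs.CV | cs.AI

Eliot Park, Abhi Kumar, Pranav Rajpurkar

TL;DR: This study evaluates frontier vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) on predicting the recyclability of common items, including matching items to recycling bins, accounting for physical size, adapting to regional guidelines, and handling contaminated, damaged, or multi-material items.

Details

Motivation: The general public finds it hard to judge whether an item is recyclable and how to dispose of it properly; the study explores how advanced AI models could improve recycling practices.

Result: The models show significant advances in contextual understanding compared with previous iterations but still fall short in some complex scenarios. The evaluation uses a curated image dataset; no specific benchmark or SOTA comparison is mentioned.

Insight: The novelty is a systematic evaluation of the contextual intelligence of vision-language models in recycling scenarios, covering real-world factors such as physical fit, regional differences, item condition, and multi-material handling, which clarifies the capability boundaries of these models for environmental sustainability applications.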

Abstract: While the importance of efficient recycling is widely acknowledged, accurately determining the recyclability of items and their proper disposal remains a complex task for the general public. In this study, we explore the application of cutting-edge vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) for predicting the recyclability of commonly disposed items. Utilizing a curated dataset of images, we evaluated the models’ ability to match objects to appropriate recycling bins, including assessing whether the items could physically fit into the available bins. Additionally, we investigated the models’ performance across several challenging scenarios: (i) adjusting predictions based on location-specific recycling guidelines; (ii) accounting for contamination or structural damage; and (iii) handling objects composed of multiple materials. Our findings highlight the significant advancements in contextual understanding offered by these models compared to previous iterations, while also identifying areas where they still fall short. The continued refinement of context-aware models is crucial for enhancing public recycling practices and advancing environmental sustainability.


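[30] Analyzing the Shopping Journey: Computing Shelf Browsing Visits in a Physical Retail Store cs.CV | cs.AI | cs.RO

Luis Yoichi Morales, Francesco Zanlungo, David M. Woollard

TL;DR: This paper proposes an algorithm that computes customers' "shelf visits" in a physical retail store from trajectories obtained with machine vision-based 3D tracking and overhead cameras, capturing their browsing activity. Independent calibration and evaluation on two trajectory sets collected in different stores test how well the algorithm generalizes across environments. The paper also analyzes the relation between browsing patterns and purchases and discusses applications in retail planning and human-robot interaction.

Details

Motivation: Deploying customer-facing robots in retail is challenging. The study analyzes customer activity in physical stores as a step toward autonomous understanding of shopper intent, in support of retail automation and human-robot interaction.

Result: The algorithm is calibrated on two independent trajectory sets (8,138 and 15,129 trajectories) and evaluated on held-out data from both the same store and the other store. It recognizes customer browsing activity across environments, but no specific quantitative metrics (such as accuracy) or SOTA comparisons are mentioned.

Insight: The novelty is combining machine vision-based 3D tracking with trajectory analysis to define and extract "shelf visits" as a quantitative measure of browsing behavior that generalizes across stores. The method offers a scalable automated tool for retail behavior analysis that helps in understanding the shopping journey and optimizing retail strategy.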

Abstract: Motivated by recent challenges in the deployment of robots into customer-facing roles within retail, this work introduces a study of customer activity in physical stores as a step toward autonomous understanding of shopper intent. We introduce an algorithm that computes shoppers’ “shelf visits,” capturing their browsing behavior in the store. Shelf visits are extracted from trajectories obtained via machine vision-based 3D tracking and overhead cameras. We perform two independent calibrations of the shelf visit algorithm, using distinct sets of trajectories (consisting of 8138 and 15129 trajectories), collected in different stores and labeled by human reviewers. The calibrated models are then evaluated on trajectories held out of the calibration process both from the same store on which calibration was performed and from the other store. An analysis of the results shows that the algorithm can recognize customers’ browsing activity when evaluated in an environment different from the one on which calibration was performed. We then use the model to analyze the customers’ “browsing patterns” on a large set of trajectories and their relation to actual purchases in the stores. Finally, we discuss how shelf browsing information could be used for retail planning and in the domain of human-robot interaction scenarios.


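[31] PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education cs.CV

Megha Mariam K. M, Aditya Arun, Zakaria Laskar, C. V. Jawahar

TL;DR: The paper proposes the PhyEduVideo benchmark for evaluating how well text-to-video (T2V) models generate explanatory videos for physics education. The benchmark decomposes physics concepts into granular teaching points and provides a carefully crafted prompt for each point to test whether models generate accurate videos. The evaluation finds that current models produce visually coherent videos with smooth motion but fall short on conceptual accuracy, especially in abstract areas such as electromagnetism and thermodynamics.

Details

Motivation: To assess the potential of generative AI models, T2V systems in particular, to automatically create intuitive visual explanations for physics education, in support of scalable, accessible, and personalized AI-powered learning.

Result: On PhyEduVideo, current T2V models perform relatively well in mechanics, fluids, and optics but struggle with abstract concepts in electromagnetism and thermodynamics, revealing a gap between visual quality and conceptual accuracy.

Insight: The novelty is a dedicated evaluation benchmark for physics education video generation that uses granular teaching-point decomposition and prompt design to systematically assess how well T2V models convey concepts. The benchmark draws attention to conceptual accuracy in educational video generation and provides a basis for developing more accurate, curriculum-aligned AI systems.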

Abstract: Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward evaluating their potential in physics education by introducing a dedicated benchmark for explanatory video generation. The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Each physics concept in our benchmark is decomposed into granular teaching points, with each point accompanied by a carefully crafted prompt intended for visual explanation of the teaching point. T2V models are evaluated on their ability to generate accurate videos in response to these prompts. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content, paving the way toward scalable, accessible, and personalized learning experiences powered by AI. Our evaluation reveals that current models produce visually coherent videos with smooth motion and minimal flickering, yet their conceptual accuracy is less reliable. Performance in areas such as mechanics, fluids, and optics is encouraging, but models struggle with electromagnetism and thermodynamics, where abstract interactions are harder to depict. These findings underscore the gap between visual quality and conceptual correctness in educational video generation. We hope this benchmark helps the community close that gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale. The benchmark and accompanying codebase are publicly available at https://github.com/meghamariamkm/PhyEduVideo.


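[32] Deep Clustering with Associative Memories cs.CV | cs.LG

Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki, Parikshit Ram

TL;DR: This paper proposes DCAM, a new deep clustering method whose loss function is built on energy-based Associative Memory dynamics. It ties representation learning and clustering together more closely in a single objective and thereby improves clustering quality.

Details

Motivation: In deep clustering, representation learning is differentiable while clustering is inherently a discrete optimization task, so existing methods need approximations and regularizations that leave the two parts disjointed. The paper aims to unify them more tightly through an Associative Memory energy model.

Result: DCAM improves clustering quality over existing methods across architectures (convolutional, residual, fully-connected) and data modalities (images, text).

Insight: The novelty is a loss function built on the energy dynamics of Associative Memories, which unifies representation learning and clustering in a single differentiable objective and avoids the disjointed approximations of conventional methods.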

Abstract: Deep clustering (joint representation learning and latent space clustering) is a well-studied problem, especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization task, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leaves representation learning and clustering somewhat disjointed. In this work, we propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM, which ties together the representation learning and clustering aspects more intricately in a single objective. Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices (convolutional, residual or fully-connected) and data modalities (images or text).


[33] Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss cs.CVPDF

Lin Xi, Yingliang Ma, Xiahai Zhuang

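TL;DR: This paper proposes a novel few-shot video object segmentation model that uses a local matching strategy to restrict the search space to the most relevant neighboring pixels, and reorganizes local sampling from a direction-based perspective so that sampling regions can vary dynamically. It also designs a supervised spatio-temporal contrastive learning scheme to strengthen feature consistency across frames, and introduces a public benchmark dataset for multi-object segmentation in X-ray angiography videos. Experiments on several datasets show that the method outperforms current state-of-the-art video segmentation methods in segmentation accuracy and generalization.

Details

Motivation: To address the inefficiency of standard im2col-like implementations (spatial convolutions, depthwise convolutions, feature-shifting mechanisms) and the limited portability of hardware-specific CUDA kernels across non-CUDA devices, with the aim of providing a more flexible and efficient few-shot video object segmentation approach for clinical applications such as X-ray angiography.

Result: Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that the proposed FSVOS method outperforms current state-of-the-art video segmentation methods in segmentation accuracy and generalization (on both seen and unseen categories).

Insight: Contributions include: a direction-based, non-parametric sampling mechanism that yields dynamic sampling regions while avoiding the computational cost of parametric layers and the need for model retraining; a supervised spatio-temporal contrastive loss that strengthens spatio-temporal feature consistency; and MOSXAV, a public benchmark dataset for multi-object segmentation in X-ray angiography videos. Viewed objectively, the approach is instructive for improving model portability and computational efficiency.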

Abstract: We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications.


[34] WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift cs.CV | cs.AIPDF

Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo

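TL;DR: This paper proposes WildIng, a model that addresses the loss of generalization that deep learning models suffer in wildlife monitoring under geographical domain shift (changes in background, lighting, and environmental conditions). By integrating image features with text descriptions, it builds a representation that is more robust to geographical domain shift and improves recognition in new geographical regions.

Details

Motivation: Existing deep learning wildlife recognition models (for example, CLIP with an adapter) perform well when training and test data share the same geographical distribution, but accuracy drops sharply when they are tested in a new region: a model trained on an African dataset falls from 84.77% to 16.17% on an American dataset. The main cause is reliance on image-based representations, which are sensitive to shifts in geographical data distribution.

Result: Evaluated on datasets from two regions, America and Africa, WildIng improves the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift.

Insight: The novelty lies in combining text descriptions (such as details of a species' appearance) with image features to capture consistent semantic information and make the model more robust to geographical domain changes. Viewed objectively, this multimodal fusion can effectively mitigate domain shift and offers a new direction for cross-region generalization in wildlife monitoring.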

Abstract: Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non-intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time-consuming and resource-intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision-language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image-based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations. Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at https://github.com/Julian075/CATALOG/tree/WildIng.


[35] DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models cs.CVPDF

Yue Zhou, Jue Chen, Zilun Zhang, Penghui Huang, Ran Ding

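TL;DR: This paper presents DVGBench, a high-quality implicit visual grounding benchmark for UAV imagery covering six application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Every object comes with both an explicit and an implicit query. Building on the dataset, the authors design DroneVG-R1, which integrates a novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) into a reinforcement learning paradigm so that scenario-specific knowledge can be used to convert implicit references into explicit ones, reducing grounding difficulty. An evaluation of mainstream models reveals substantial limitations in their reasoning on both explicit and implicit visual grounding tasks.

Details

Motivation: Existing remote sensing vision-language models show promise on visual grounding, but the available datasets rely mainly on explicit referring expressions (relative position, size, and color cues), which limits performance on implicit visual grounding tasks that require scenario-specific domain knowledge.

Result: Evaluating mainstream models on DVGBench shows marked deficiencies in their reasoning on both explicit and implicit visual grounding tasks. The proposed DroneVG-R1 model aims to improve on this through the I2E-CoT mechanism, but the abstract gives no specific quantitative metrics (such as accuracy) or direct comparisons with SOTA models.

Insight: Main contributions: 1) DVGBench, the first high-quality benchmark focused on implicit visual grounding in UAV imagery, with paired explicit and implicit queries; 2) the Implicit-to-Explicit Chain-of-Thought (I2E-CoT) reasoning mechanism, combined with reinforcement learning, which lets the model use domain knowledge to turn vague implicit queries into more tractable explicit descriptions. This offers a new direction for improving the reasoning of large vision-language models in complex scenes.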

Abstract: Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions, such as relative position, relative size, and color cues, thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at https://github.com/zytx121/DVGBench


[36] ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval cs.CV | cs.AI | cs.IRPDF

Tien-Huy Nguyen, Huu-Loc Tran, Thanh Duc Ngo

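TL;DR: This paper proposes ITSELF, a framework for the fine-grained image-text alignment challenge in text-based person search. Through attention-guided implicit local alignment, it uses the model's own attention to build a bank of high-saliency tokens and applies local objectives to it, avoiding the problems introduced by extra supervision and injected prior knowledge.

Details

Motivation: Existing text-based person search methods usually rely on local alignment, but they are prone to shortcut learning and spurious correlations, which leads to misalignment; injecting prior knowledge can also distort intra-modality structure. The authors find that encoder attention provides spatially precise evidence from early in training, and therefore use the model's own attention to guide fine-grained alignment and avoid these problems.

Result: Extensive experiments on three widely used text-based person search benchmarks show state-of-the-art performance and strong cross-dataset generalization.

Insight: The novelty is a framework that performs implicit fine-grained alignment entirely from the model's own attention. Its core component, GRAB, converts attention into a bank of high-saliency tokens for local learning; MARS ensures reliability through cross-layer attention aggregation and diversity-aware selection; ATS uses adaptive token scheduling to move the retention budget from coarse to fine over training. The method needs no additional prior supervision and is effective and robust.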

Abstract: Vision Language Models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision. Our project is publicly available at https://trhuuloc.github.io/itself
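
The selection steps described for MARS and ATS can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: attention is averaged across layers, redundancy is judged by a cosine-similarity cutoff (`max_sim`), the retention budget decays linearly, and all function names are hypothetical.

```python
import math


def aggregate(attn_layers):
    """Average per-token attention scores across layers (assumed aggregation rule)."""
    n_tokens = len(attn_layers[0])
    return [sum(layer[i] for layer in attn_layers) / len(attn_layers)
            for i in range(n_tokens)]


def cosine(a, b):
    """Cosine similarity between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(attn_layers, feats, k, max_sim=0.95):
    """Greedy diversity-aware top-k: walk tokens by descending saliency and
    skip any token too similar to one that was already kept."""
    scores = aggregate(attn_layers)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(cosine(feats[i], feats[j]) < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen


def retention_budget(step, total_steps, n_tokens, start=0.8, end=0.2):
    """Linear coarse-to-fine schedule for how many tokens to keep (assumed shape)."""
    frac = start + (end - start) * step / max(1, total_steps)
    return max(1, round(frac * n_tokens))
```

A greedy cutoff is only one simple way to obtain non-redundant selections; the abstract does not specify the diversity criterion or the schedule shape.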


[37] Deepfake Detection with Multi-Artifact Subspace Fine-Tuning and Selective Layer Masking cs.CV | cs.MMPDF

Xiang Zhang, Wenliang Weng, Daoyong Fu, Ziqiang Li, Zhangjie Fu

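TL;DR: This paper proposes MASM, a deepfake detection method based on multi-artifact subspace fine-tuning and selective layer masking. It decomposes pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces, decoupling semantics from forgery artifacts, and introduces a selective layer mask strategy that adaptively regulates layer updates to improve generalization robustness in cross-dataset scenarios.

Details

Motivation: Deepfake detection remains challenging in cross-dataset and complex real-world scenarios. The main reasons are that different forgery methods introduce highly diverse artifact distributions, and that pretrained models tend to damage their original semantic structure when adapting to new artifacts. Existing methods struggle to model diverse artifacts effectively while keeping semantics stable.

Result: The abstract reports no specific quantitative results or benchmarks, but claims that the method improves generalization robustness in cross-dataset scenarios.

Insight: Contributions include: using singular value decomposition to decouple weights into semantic and artifact subspaces, which keeps semantics stable while modeling artifact diversity; a selective layer mask that adaptively regulates updates to prevent overfitting to any single forgery characteristic; and orthogonality and spectral consistency constraints that keep the subspaces complementary and the overall structure stable.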

Abstract: Deepfake detection still faces significant challenges in cross-dataset and real-world complex scenarios. The root cause lies in the high diversity of artifact distributions introduced by different forgery methods, while pretrained models tend to disrupt their original general semantic structures when adapting to new artifacts. Existing approaches usually rely on indiscriminate global parameter updates or introduce additional supervision signals, making it difficult to effectively model diverse forgery artifacts while preserving semantic stability. To address these issues, this paper proposes a deepfake detection method based on Multi-Artifact Subspaces and selective layer masks (MASM), which explicitly decouples semantic representations from artifact representations and constrains the fitting strength of artifact subspaces, thereby improving generalization robustness in cross-dataset scenarios. Specifically, MASM applies singular value decomposition to model weights, partitioning pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces. This design enables decoupled modeling of different forgery artifact patterns while preserving the general semantic subspace. On this basis, a selective layer mask strategy is introduced to adaptively regulate the update behavior of corresponding network layers according to the learning state of each artifact subspace, suppressing overfitting to any single forgery characteristic. Furthermore, orthogonality constraints and spectral consistency constraints are imposed to jointly regularize multiple artifact subspaces, guiding them to learn complementary and diverse artifact representations while maintaining a stable overall spectral structure.


[38] Evaluating transfer learning strategies for improving dairy cattle body weight prediction in small farms using depth-image and point-cloud data cs.CV | cs.LGPDF

Jin Wang, Angelo De Castro, Yuxi Zhang, Lucas Basolli Borsatto, Yuechen Guo

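TL;DR: This study evaluates transfer learning for predicting dairy cattle body weight on small farms from depth-image and point-cloud data and compares the two modalities. Transfer learning substantially improves prediction accuracy when small-farm data are limited, and depth-image and point-cloud models perform comparably.

Details

Motivation: To address the unclear effectiveness of transfer learning for livestock body weight prediction and the lack of established optimal fine-tuning strategies, and to compare depth images and point clouds directly for dairy cattle body weight prediction.

Result: On the small-farm dataset, transfer learning outperformed single-source learning for all four models (ConvNeXt, MobileViT, PointNet, DGCNN) and matched or exceeded joint learning. No consistent performance difference was found between depth-image and point-cloud models.

Insight: Transfer learning requires only pretrained model weights rather than raw data, which suits small farms where cross-farm data sharing is restricted by privacy, logistical, or policy constraints. Pretrained representations generalize well across imaging conditions and herds.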

Abstract: Computer vision provides automated, non-invasive, and scalable tools for monitoring dairy cattle, thereby supporting management, health assessment, and phenotypic data collection. Although transfer learning is commonly used for predicting body weight from images, its effectiveness and optimal fine-tuning strategies remain poorly understood in livestock applications, particularly beyond the use of pretrained ImageNet or COCO weights. In addition, while both depth images and three-dimensional point-cloud data have been explored for body weight prediction, direct comparisons of these two modalities in dairy cattle are limited. Therefore, the objectives of this study were to 1) evaluate whether transfer learning from a large farm enhances body weight prediction on a small farm with limited data, and 2) compare the predictive performance of depth-image- and point-cloud-based approaches under three experimental designs. Top-view depth images and point-cloud data were collected from 1,201, 215, and 58 cows at large, medium, and small dairy farms, respectively. Four deep learning models were evaluated: ConvNeXt and MobileViT for depth images, and PointNet and DGCNN for point clouds. Transfer learning markedly improved body weight prediction on the small farm across all four models, outperforming single-source learning and achieving gains comparable to or greater than joint learning. These results indicate that pretrained representations generalize well across farms with differing imaging conditions and dairy cattle populations. No consistent performance difference was observed between depth-image- and point-cloud-based models. Overall, these findings suggest that transfer learning is well suited for small farm prediction scenarios where cross-farm data sharing is limited by privacy, logistical, or policy constraints, as it requires access only to pretrained model weights rather than raw data.


[39] EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos cs.CV | cs.AI | cs.GRPDF

Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Shuo Yang, Zheng Liu

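TL;DR: EgoGrasp is the first method to reconstruct world-space hand-object interactions (W-HOI) from in-the-wild egocentric monocular videos with dynamic cameras. Its multi-stage framework (preprocessing built on spatial intelligence models, a whole-body HOI prior based on decoupled diffusion models, and multi-objective test-time optimization) addresses the limitations of existing methods in temporal dynamics, global consistency, and occlusion.

Details

Motivation: Existing hand-object interaction methods are limited to single images or camera coordinates and cannot model temporal dynamics or consistent global trajectories. Recent world-space hand estimation methods overlook object poses and interaction constraints, and perform poorly under the severe camera motion and frequent occlusions of in-the-wild egocentric video.

Result: Experiments show that the method achieves state-of-the-art performance in world-space hand-object interaction reconstruction.

Insight: Contributions include: a robust preprocessing pipeline built on newly developed spatial intelligence models; a whole-body HOI prior based on decoupled diffusion models that is template-free and scales to multiple objects; and a multi-objective test-time optimization paradigm for the challenges of in-the-wild video.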

Abstract: We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust pre-processing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. In experiments, we show that our method achieves state-of-the-art performance in W-HOI reconstruction.


[40] Luminark: Training-free, Probabilistically-Certified Watermarking for General Vision Generative Models cs.CV | cs.AIPDF

Jiayi Xu, Zhang Zhang, Yuanrui Zhang, Ruitao Chen, Yixian Xu

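TL;DR: This paper proposes Luminark, a training-free, probabilistically certified watermarking method for general vision generative models. It rests on a novel watermark definition based on patch-level luminance statistics. The service provider predefines a binary pattern and corresponding patch-level thresholds; at detection time, each patch's luminance is compared against its threshold and the resulting binary pattern is checked against the target pattern, which controls the false positive rate and yields certified detection. Using the widely adopted guidance technique as a plug-and-play mechanism, the authors develop "watermark guidance", which injects watermarks seamlessly across generative paradigms (diffusion, autoregressive, and hybrid frameworks) without harming image quality.

Details

Motivation: To address the training dependence, lack of certified guarantees, and limited cross-model generality of watermarking methods for general vision generative models, by providing a training-free, probabilistically certified technique applicable to many generative models.

Result: Evaluated on nine models spanning diffusion, autoregressive, and hybrid frameworks, Luminark shows high detection accuracy, strong robustness to common image transformations, and good visual quality across all evaluations.

Insight: Contributions include a watermark definition based on patch-level luminance statistics, with a statistical analysis that controls the false positive rate to give certified detection, and the use of guidance as a plug-and-play injection mechanism, which provides generality across generative models without sacrificing image quality. Viewed objectively, the method reduces watermark detection to simple binary pattern matching, which lowers computational complexity, and probabilistic certification strengthens reliability, giving an efficient and general watermarking solution for generative models.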

Abstract: In this paper, we introduce Luminark, a training-free and probabilistically-certified watermarking method for general vision generative models. Our approach is built upon a novel watermark definition that leverages patch-level luminance statistics. Specifically, the service provider predefines a binary pattern together with corresponding patch-level thresholds. To detect a watermark in a given image, we evaluate whether the luminance of each patch surpasses its threshold and then verify whether the resulting binary pattern aligns with the target one. A simple statistical analysis demonstrates that the false positive rate of the proposed method can be effectively controlled, thereby ensuring certified detection. To enable seamless watermark injection across different paradigms, we leverage the widely adopted guidance technique as a plug-and-play mechanism and develop watermark guidance. This design enables Luminark to achieve generality across state-of-the-art generative models without compromising image quality. Empirically, we evaluate our approach on nine models spanning diffusion, autoregressive, and hybrid frameworks. Across all evaluations, Luminark consistently demonstrates high detection accuracy, strong robustness against common image transformations, and good performance on visual quality.
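
The detection rule described above (threshold each patch's luminance, then compare the resulting bits with the target pattern) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's procedure: the image is a grayscale grid of floats, patches are non-overlapping squares summarized by their mean, and the false-positive bound assumes each bit of an unwatermarked image matches independently with probability `p`. All function names are hypothetical.

```python
import math


def patch_means(img, patch):
    """Mean luminance of each non-overlapping patch, in row-major order.
    `img` is a list of rows of floats (grayscale, assumed format)."""
    h, w = len(img), len(img[0])
    means = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            vals = [img[i][j]
                    for i in range(r, r + patch)
                    for j in range(c, c + patch)]
            means.append(sum(vals) / len(vals))
    return means


def detect(img, patch, thresholds, pattern, min_matches):
    """Threshold each patch, then count bits that agree with the target pattern."""
    bits = [1 if m > t else 0 for m, t in zip(patch_means(img, patch), thresholds)]
    matches = sum(b == p for b, p in zip(bits, pattern))
    return matches >= min_matches, matches


def false_positive_bound(n, k, p=0.5):
    """P[Binomial(n, p) >= k]: chance that an unwatermarked image agrees on at
    least k of n bits, under the independence assumption stated above."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

Under these assumptions, raising `min_matches` or the number of patches drives the bound down quickly, which is the sense in which a pattern-matching detector can control its false positive rate.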


[41] NarrativeTrack: Evaluating Video Language Models Beyond the Frame cs.CV | cs.LGPDF

Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty

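TL;DR: This paper presents NarrativeTrack, the first benchmark to evaluate video narrative understanding in multimodal large language models through fine-grained, entity-centric reasoning. It decomposes videos into their constituent entities and examines entity continuity over time with a structured evaluation framework, the Compositional Reasoning Progression, which progressively increases narrative complexity along three dimensions: entity existence, entity changes, and entity ambiguity. The evaluation finds that existing models cannot robustly track entities across visual transitions and temporal dynamics, revealing a fundamental trade-off between perceptual grounding and temporal reasoning.

Details

Motivation: Multimodal large language models have made notable progress in vision-language reasoning, but their ability to understand narratives that unfold over time in video remains underexplored. Real narrative understanding requires determining who is doing what, when, and where, and maintaining coherent entity representations across dynamic visual and temporal contexts. Existing benchmarks are mostly limited to short clips or coarse scene-level semantics and lack a systematic evaluation of fine-grained, entity-centric narrative understanding.

Result: Evaluations of state-of-the-art multimodal large language models show that they fail to track entities robustly across visual transitions and temporal dynamics, and often hallucinate identity under context shifts. Open-source general-purpose MLLMs show strong perceptual grounding but weak temporal coherence, whereas video-specific MLLMs capture temporal context but hallucinate entity contexts. These results reveal a fundamental trade-off between perceptual grounding and temporal reasoning.

Insight: The core contribution is NarrativeTrack, the first systematic entity-centric framework for evaluating narrative understanding, together with its Compositional Reasoning Progression. It provides a basis for diagnosing and advancing temporal narrative understanding of video in MLLMs. Viewed objectively, decomposing a narrative into entities and tracking how they change over time is key to understanding complex video content. The work exposes specific weaknesses of current models in this core capability (such as the trade-off between temporal coherence and perceptual detail) and points to directions for future model design.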

Abstract: Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entity contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.


[42] Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization cs.CV | eess.IVPDF

Abhinav Attri, Rajeev Ranjan Dwivedi, Samiran Das, Vinod Kumar Kurmi

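TL;DR: This paper proposes HAQAGen, a unified generative model for resolution-invariant near-infrared (NIR) image colorization. It aligns global color statistics with a loss that combines differentiable histogram matching, a perceptual quality measure, and feature similarity; injects local hue-saturation priors through SPADE to stabilize color reconstruction; and adds texture-aware supervision within a Mamba backbone to preserve detail. The model also includes an adaptive-resolution inference engine that supports high-quality, high-resolution translation.

Details

Motivation: To address the shortcomings of existing NIR-to-RGB colorization methods in balancing color realism against structural fidelity, handling images of different resolutions, and preserving texture detail.

Result: Extensive evaluation on the FANVID, OMSIV, VCIP2020, and RGB2NIR datasets with several metrics shows consistent improvements over state-of-the-art baselines. The generated images have sharper textures and more natural colors, with significant gains on perceptual metrics.

Insight: Main contributions: 1) a hybrid loss that combines global color-statistics alignment, perceptual quality, and texture-feature preservation; 2) local color priors injected via SPADE to improve color consistency; 3) texture-aware supervision within a Mamba backbone; 4) an adaptive-resolution inference engine for high-quality, high-resolution translation. Together these designs support the model's performance in color realism, structural fidelity, and resolution scaling.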

Abstract: We present HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity. The proposed model introduces (i) a combined loss term aligning the global color statistics through differentiable histogram matching, a perceptual image quality measure, and feature-based similarity to preserve texture information, (ii) local hue-saturation priors injected via Spatially Adaptive Denormalization (SPADE) to stabilize chromatic reconstruction, and (iii) texture-aware supervision within a Mamba backbone to preserve fine details. We introduce an adaptive-resolution inference engine that further enables high-resolution translation without sacrificing quality. Our proposed NIR-to-RGB translation model simultaneously enforces global color statistics and local chromatic consistency, while scaling to native resolutions without compromising texture fidelity or generalization. Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR using different evaluation metrics demonstrate consistent improvements over state-of-the-art baseline methods. HAQAGen produces images with sharper textures and more natural colors, with significant gains on perceptual metrics. These results position HAQAGen as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios. Project Page: https://rajeev-dw9.github.io/HAQAGen/


[43] Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation cs.CVPDF

Tianheng Cheng, Xinggang Wang, Junchao Liao, Wenyu Liu

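TL;DR: This paper proposes Guided Attentive Interpolation (GAI), a new method for efficiently producing high-resolution feature maps in semantic segmentation. It interpolates fine-grained high-resolution features by adaptively incorporating semantic information from low-resolution features, addressing the feature misalignment and insufficient context of conventional interpolation (such as bilinear interpolation) while meeting low-latency inference requirements.

Details

Motivation: Current coordinate-guided interpolation of low-resolution features (such as bilinear interpolation) yields coarse high-resolution features with feature misalignment and insufficient context. Enriching high-resolution features with semantics is usually computationally expensive, which makes low-latency inference hard to achieve.

Result: The GAI-based semantic segmentation network (GAIN) reaches 78.8 mIoU at 22.3 FPS on Cityscapes and 80.6 mIoU at 64.5 FPS on CamVid (on an NVIDIA 1080Ti GPU), new SOTA results for low-latency semantic segmentation.

Insight: The novelty is GAI, which determines the spatial and semantic relations between features of different resolutions and uses them to interpolate semantically rich high-resolution features. Viewed objectively, it is a cross-layer attention mechanism that adaptively fuses multi-scale information, markedly improving inference speed while maintaining high accuracy, and offers an effective solution for real-time semantic segmentation.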

Abstract: Semantic segmentation is a fundamental problem in computer vision and it requires high-resolution feature maps for dense prediction. Current coordinate-guided low-resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high-resolution features which suffer from feature misalignment and insufficient context information. Moreover, enriching high-resolution features with semantics imposes a high computational burden, making it challenging to meet the requirement of low-latency inference. We propose a novel Guided Attentive Interpolation (GAI) method to adaptively interpolate fine-grained high-resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics. GAI can be integrated with any deep convolutional network for efficient semantic segmentation. In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, can achieve 78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 FPS on CamVid using an NVIDIA 1080Ti GPU, which are the new state-of-the-art results of low-latency semantic segmentation. Code and models are available at: https://github.com/hustvl/simpleseg.
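
To make the contrast with coordinate-guided interpolation concrete, the sketch below shows generic cross-attention upsampling: each high-resolution position forms a query, attends over low-resolution keys, and receives a weighted mix of low-resolution values. It is a generic scaled dot-product attention example with hypothetical names, not the GAI module itself, whose exact formulation is not given in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_interpolate(high_res_queries, low_res_keys, low_res_values):
    """For each high-resolution query vector, weight the low-resolution value
    vectors by scaled dot-product similarity between the query and the keys."""
    d = len(low_res_keys[0])
    n_channels = len(low_res_values[0])
    out = []
    for q in high_res_queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in low_res_keys]
        weights = softmax(logits)
        out.append([sum(w * v[c] for w, v in zip(weights, low_res_values))
                    for c in range(n_channels)])
    return out
```

Bilinear interpolation fixes the weights from pixel coordinates alone; here the weights depend on feature content, which is the property the abstract attributes to attention-guided interpolation.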


[44] CardioMOD-Net: A Modal Decomposition-Neural Network Framework for Diagnosis and Prognosis of HFpEF from Echocardiography Cine Loops cs.CVPDF

Andrés Bell-Navas, Jesús Garicano-Mena, Antonella Ausiello, Soledad Le Clainche, María Villalba-Orero

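TL;DR: This paper proposes CardioMOD-Net, a unified AI framework that combines Higher Order Dynamic Mode Decomposition (HODMD) with Vision Transformers to perform multiclass diagnosis and continuous prognostic prediction of heart failure with preserved ejection fraction (HFpEF) directly from standard echocardiography cine loops.

Details

Motivation: HFpEF has diverse causes and progresses slowly, which makes early diagnosis and prognosis difficult. Existing echocardiography-based AI models focus mainly on binary detection and provide neither comorbidity-specific phenotyping nor temporal predictions of disease progression.

Result: On mouse data from four groups (control, hyperglycaemic, obesity, systemic arterial hypertension), overall multiclass diagnostic accuracy was 65%, with every class above 50%. The prognostic module predicted time to HFpEF onset with a root-mean-square error of 21.72 weeks, with the obesity and hypertension groups predicted most accurately.

Insight: The novelty is a unified modal decomposition and neural network framework that uses HODMD to extract temporal features from video and Vision Transformers to handle diagnosis (classification) and prognosis (regression) together, providing a foundation for integrating diagnostic and prognostic models in preclinical HFpEF research.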

Abstract: Introduction: Heart failure with preserved ejection fraction (HFpEF) arises from diverse comorbidities and progresses through prolonged subclinical stages, making early diagnosis and prognosis difficult. Current echocardiography-based Artificial Intelligence (AI) models focus primarily on binary HFpEF detection in humans and do not provide comorbidity-specific phenotyping or temporal estimates of disease progression towards decompensation. We aimed to develop a unified AI framework, CardioMOD-Net, to perform multiclass diagnosis and continuous prediction of HFpEF onset directly from standard echocardiography cine loops in preclinical models. Methods: Mouse echocardiography videos from four groups were used: control (CTL), hyperglycaemic (HG), obesity (OB), and systemic arterial hypertension (SAH). Two-dimensional parasternal long-axis cine loops were decomposed using Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features for downstream analysis. A shared latent representation supported Vision Transformers, one for a classifier for diagnosis and another for a regression module for predicting the age at HFpEF onset. Results: Overall diagnostic accuracy across the four groups was 65%, with all classes exceeding 50% accuracy. Misclassifications primarily reflected early-stage overlap between OB or SAH and CTL. The prognostic module achieved a root-mean-square error of 21.72 weeks for time-to-HFpEF prediction, with OB and SAH showing the most accurate estimates. Predicted HFpEF onset closely matched true distributions in all groups. Discussion: This unified framework demonstrates that multiclass phenotyping and continuous HFpEF onset prediction can be obtained from a single cine loop, even under small-data conditions. The approach offers a foundation for integrating diagnostic and prognostic modelling in preclinical HFpEF research.


[45] GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation cs.CVPDF

Chenglizhao Chen, Shaojiang Yuan, Xiaoxue Lu, Mengke Song, Jia Song

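TL;DR: This paper proposes GenCAMO, an environment-aware, mask-free generative framework that synthesizes high-quality camouflage images with dense annotations, addressing the scarcity of large, high-quality datasets for camouflage dense prediction.

Details

Motivation: Camouflage dense prediction tasks (such as RGB-D camouflage object detection and open-vocabulary camouflage object segmentation) need large amounts of high-quality densely annotated data, but existing datasets are scarce and annotation is expensive. The authors therefore explore using generative models to synthesize realistic training data.

Result: Extensive experiments across multiple modalities show that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data.

Insight: Contributions include GenCAMO-DB, a large-scale dataset with multimodal annotations, and an environment-aware generative framework based on scene-graph contextual decoupling that produces high-fidelity camouflage images with dense annotations without masks, offering a synthetic-data solution for data-scarce tasks.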

Abstract: Conceal dense prediction (CDP), especially RGB-D camouflage object detection and open-vocabulary camouflage object segmentation, plays a crucial role in advancing the understanding and reasoning of complex camouflage scenes. However, high-quality and large-scale camouflage datasets with dense annotation remain scarce due to expensive data collection and labeling costs. To address this challenge, we explore leveraging generative models to synthesize realistic camouflage image-dense data for training CDP models with fine-grained representations, prior knowledge, and auxiliary reasoning. Concretely, our contributions are threefold: (i) we introduce GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations, including depth maps, scene graphs, attribute descriptions, and text prompts; (ii) we present GenCAMO, an environment-aware and mask-free generative framework that produces high-fidelity camouflage image-dense annotations; (iii) extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data. The code and datasets will be released after paper acceptance.


[46] Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors cs.CVPDF

Hao Lu, Xuhui Zhu, Wenjing Zhang, Yanan Li, Xiang Bai

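TL;DR: This paper proposes OMAN++, a new baseline for video individual counting that incorporates social grouping and spatial-temporal displacement priors to improve counting in crowded scenes, and achieves SOTA results on several benchmark datasets.

Details

Motivation: Existing video individual counting methods perform poorly in crowded scenes such as metro commuting. To address this, the authors build the WuhanMetroCrowd dataset and rethink the nature of VIC, using social grouping and spatial-temporal displacement priors to improve the model.

Result: OMAN++ outperforms existing SOTA methods on the SenseCrowd, CroHD, and MovingDroneCrowd benchmarks and reduces error by 38.12% on the crowded WuhanMetroCrowd dataset.

Insight: Contributions include: relaxing standard one-to-one matching to one-to-many matching, implemented with an implicit context generator and an O2M matcher; a displacement prior injector that strengthens feature extraction and model training; and WuhanMetroCrowd, one of the first VIC datasets to characterize crowded, dynamic pedestrian flows.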

Abstract: Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior that indicates pedestrians tend to gather in groups and ii) the spatial-temporal displacement prior that informs an individual cannot teleport physically. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also indicates a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
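
The two priors can be illustrated with a toy, non-learned stand-in. The sketch below is an assumption-laden illustration and not the OMAN++ matcher: it replaces the learned context generator and O2M matcher with a plain distance gate (`max_disp`, hypothetical), lets each previous-frame point keep several candidates (one-to-many), and treats current-frame points with no candidate source as inflow.

```python
import math


def o2m_candidates(prev_pts, curr_pts, max_disp):
    """For each previous-frame point, list every current-frame index within
    `max_disp` pixels. Several candidates per point are allowed (one-to-many);
    the distance gate encodes the idea that a person cannot teleport."""
    return [
        [j for j, (cx, cy) in enumerate(curr_pts)
         if math.hypot(cx - px, cy - py) <= max_disp]
        for (px, py) in prev_pts
    ]


def estimate_inflow(prev_pts, curr_pts, max_disp):
    """Count current-frame points not reachable from any previous-frame point;
    in this toy setting they are treated as newly appeared individuals."""
    explained = {j for cands in o2m_candidates(prev_pts, curr_pts, max_disp)
                 for j in cands}
    return len(curr_pts) - len(explained)
```

For example, with previous points `[(0, 0), (10, 0)]`, current points `[(1, 0), (1, 1), (50, 50)]`, and `max_disp=3`, the first previous point keeps two candidates and the point at `(50, 50)` is counted as inflow.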


[47] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression cs.CVPDF

Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang

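TL;DR: This paper proposes XStreamVGGT, a tuning-free method that systematically compresses the KV cache through joint pruning and quantization to enable extremely memory-efficient streaming inference. It addresses the unbounded KV cache growth in StreamVGGT, which increases memory consumption and inference latency.

Details

Motivation: To address the unbounded growth of the KV cache in Transformer-based streaming 3D visual geometry models such as StreamVGGT, which drives memory consumption and inference latency up as input frames accumulate.

Result: Extensive evaluations show that XStreamVGGT reduces memory usage by 4.42× and speeds up inference by 5.48× with mostly negligible performance degradation.

Insight: The novelty is a tuning-free KV cache compression scheme: efficient token importance identification prunes the redundant KVs produced by multi-view inputs to hold a fixed memory budget, and quantization that exploits the distinctive distribution of KV tensors reduces memory further, enabling scalable and practical streaming 3D applications.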

Abstract: Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematically compresses the KV cache through joint pruning and quantization, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs originating from multi-view inputs are pruned through efficient token importance identification, enabling a fixed memory budget. Leveraging the unique distribution of KV tensors, we incorporate KV quantization to further reduce memory consumption. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling scalable and practical streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
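The two compression steps described in this abstract (pruning to a fixed budget, then quantizing what remains) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the externally supplied importance scores, and the symmetric per-tensor int8 scheme are hypothetical stand-ins, since the abstract does not specify how XStreamVGGT scores tokens or quantizes KV tensors.

```python
def prune_kv(keys, values, importance, budget):
    """Keep the `budget` most important tokens, preserving their original order.

    `importance` is assumed to be given; the paper's scoring rule is not shown here.
    """
    if budget >= len(keys):
        return list(keys), list(values), list(range(len(keys)))
    # Rank token indices by importance (highest first), then restore temporal order.
    top = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)[:budget]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


def quantize_int8(vectors):
    """Symmetric per-tensor int8 quantization (an assumed scheme): codes plus one scale."""
    max_abs = max((abs(x) for v in vectors for x in v), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [[int(round(x / scale)) for x in v] for v in vectors]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [[c * scale for c in row] for row in codes]
```

With a fixed `budget`, the cache size stays constant no matter how many frames arrive, and the round-trip error of each stored value is bounded by half the quantization step.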


[48] UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass cs.CV

Mengfei Li, Peng Li, Zheng Zhang, Jiahao Lu, Chengfeng Zhao

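TL;DR: UniSH is a unified feed-forward framework for joint metric-scale 3D scene and human reconstruction. Through a novel training paradigm that effectively leverages unlabeled in-the-wild data, it addresses the sim-to-real domain gap caused by reliance on synthetic data and recovers high-fidelity scene geometry, human point clouds, camera parameters, and coherent metric-scale SMPL bodies in a single forward pass.

Details

Motivation: Joint 3D scene and human reconstruction lacks large-scale annotated real-world data and therefore relies on synthetic datasets, which leads to a large sim-to-real domain gap, poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos.

Result: The model achieves state-of-the-art performance on human-centric scene reconstruction and highly competitive results on global human motion estimation, comparing favorably against optimization-based frameworks and HMR-only methods.

Insight: Contributions include a training paradigm that exploits unlabeled in-the-wild data, the combination of strong priors from scene reconstruction and human model regression (HMR), a robust distillation strategy that extracts high-frequency detail from an expert depth model to refine the human surface, and a two-stage supervision scheme (coarse localization learned on synthetic data, then fine-tuning on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud), which together enable feed-forward joint reconstruction.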

Abstract: We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods. Project page: https://murphylmf.github.io/UniSH/


[49] AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures cs.CV | cs.AI | cs.LG

Sifatullah Sheikh Urmi, Kirtonia Nuzath Tabassum Arthi, Md Al-Imran

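TL;DR: This paper evaluates four AI-based models (three CNNs and one Vision Transformer) for deepfake detection on large face image datasets. Data preprocessing and augmentation improved model performance across different scenarios, and VFDNET combined with MobileNetV3 showed the best accuracy together with efficient performance.

Details

Motivation: The growing number of AI-generated deepfakes challenges digital authenticity, so the study aims to develop reliable deepfake detection methods.

Result: Evaluation on large face image datasets shows that VFDNET combined with MobileNetV3 achieves superior accuracy, demonstrating that AI can detect deepfakes efficiently.

Insight: The contribution is a comprehensive comparison of CNN and Vision Transformer architectures, with data preprocessing and augmentation used to optimize model performance. Viewed objectively, integrating the lightweight MobileNetV3 into VFDNET balances accuracy and efficiency and offers a feasible option for practical use.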

Abstract: The increasing use of AI-generated deepfakes creates major challenges in maintaining digital authenticity. Four AI-based models, consisting of three CNNs and one Vision Transformer, were evaluated using large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET with MobileNetV3 achieved superior accuracy and efficient performance, demonstrating AI’s capability for dependable deepfake detection.


[50] S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss cs.CV

Md. Sanaullah Chowdhury, Lameya Sabrin

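TL;DR: This paper proposes S2M-Net, a lightweight architecture for medical image segmentation that uses spectral-spatial mixing and a morphology-aware adaptive loss to retain global context at low computational cost, reaching SOTA performance on multiple datasets.

Details

Motivation: Medical image segmentation faces a trilemma among local precision, global context, and computational efficiency: existing convolutional networks have limited receptive fields, while vision transformers are computationally expensive and prone to overfitting on small datasets.

Result: Evaluated on 16 medical imaging datasets covering 8 modalities, S2M-Net reaches 96.12% Dice on polyp segmentation, 83.77% on surgical instrument segmentation (+17.85% over the previous best), and 80.90% on brain tumor segmentation. It uses 3.5-6× fewer parameters than transformer-based methods and improves consistently by 3-18% over specialized baselines.

Insight: Contributions include the Spectral-Selective Token Mixer (SSTM), which uses a truncated 2D FFT with learnable frequency filtering to obtain global context at O(HW log HW) cost and avoid quadratic attention, and the Morphology-Aware Adaptive Segmentation Loss (MASL), which analyzes structural characteristics to modulate five complementary loss components automatically, without manual tuning.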

Abstract: Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware, a trilemma that existing architectures fail to resolve. Convolutional networks provide local precision at $\mathcal{O}(n)$ cost but have limited receptive fields, whereas vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense and tend to overfit on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation on 16 medical imaging datasets spanning 8 modalities demonstrates state-of-the-art performance: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over the prior art), and 80.90% on brain tumors, with consistent 3-18% improvements over specialized baselines while using 3.5–6$\times$ fewer parameters than transformer-based methods.
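The idea behind truncated-spectrum token mixing can be sketched in a few lines: transform the feature grid to the frequency domain, discard high frequencies, scale the rest with a filter, and transform back, so that every output position depends on every input position. The sketch below is only an illustration under stated assumptions: `dft2` and `spectral_mix` are hypothetical names, `weights` is a fixed stand-in for a learnable filter, the content-gated spatial projection is omitted, and a naive quadratic DFT replaces the FFT so that the example stays dependency-free (a real implementation would use an FFT to reach $\mathcal{O}(HW \log HW)$).

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of an H x W grid given as a list of lists."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for r in range(h):
                for c in range(w):
                    angle = 2 * cmath.pi * (u * r / h + v * c / w)
                    total += x[r][c] * cmath.exp(sign * 1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out


def spectral_mix(x, keep, weights=None):
    """Keep frequencies whose wrapped index is below `keep` on both axes, scale, invert."""
    h, w = len(x), len(x[0])
    spec = dft2(x)
    for u in range(h):
        for v in range(w):
            fu, fv = min(u, h - u), min(v, w - v)  # wrapped frequency magnitude
            if fu >= keep or fv >= keep:
                spec[u][v] = 0j  # truncate high frequencies
            elif weights is not None:
                spec[u][v] *= weights[fu][fv]  # hypothetical stand-in for a learned filter
    # The mask and weights are symmetric, so the result is real up to rounding error.
    return [[val.real for val in row] for row in dft2(spec, inverse=True)]
```

With `keep=1` only the mean of the grid survives, which makes the global receptive field easy to see: a single bright pixel is spread evenly over the whole output.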


[51] VReID-XFD: Video-based Person Re-identification at Extreme Far Distance Challenge Results cs.CV

Kailash A. Hambarde, Hugo Proença, Md Rashidunnabi, Pranita Samale, Qiwei Yang

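TL;DR: This paper introduces VReID-XFD, a benchmark dataset and community challenge for video-based aerial-to-ground person re-identification (ReID) at extreme far distance (XFD). Derived from DetReIDX, the dataset contains 371 identities, 11,288 tracklets, and 11.75 million frames, covering altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The VReID-XFD-25 Challenge attracted 10 teams. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best method (SAS-PReID) reaches only 43.93% mAP in the aerial-to-ground setting.

Details

Motivation: The goal is person re-identification across aerial and ground views at extreme far distances, where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the assumptions of existing appearance-based ReID systems.

Result: On the VReID-XFD benchmark, the best-performing method, SAS-PReID, reaches only 43.93% mAP in the strict aerial-to-ground setting. The analysis shows that performance degrades monotonically with altitude and distance and that nadir views are generally worse.

Insight: The contribution is VReID-XFD, a large-scale benchmark dedicated to extreme far-distance, cross-view (aerial-ground) video person re-identification with rich physical metadata (altitude, viewing angle, distance), together with a community challenge to advance the field. Viewed objectively, the work clearly defines and systematically studies a highly challenging new regime of ReID (XFD). Its dataset and systematic analysis (such as the relation between performance and physical parameters) provide insights and a benchmark for future method design, for example methods that go beyond pure appearance and fuse motion or geometric cues.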

Abstract: Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-PReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/ .


[52] LinMU: Multimodal Understanding Made Linear cs.CV | cs.AI | cs.LG | cs.MM | eess.IV

Hongjie Wang, Niraj K. Jha

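TL;DR: This paper proposes LinMU, a vision-language model architecture with linear complexity. It targets the quadratic complexity of self-attention, which makes existing VLMs hard to deploy on edge devices and costly to apply to high-resolution images and long videos. LinMU replaces every self-attention layer with an M-MATE block (combining a bidirectional state-space model with local window attention) and uses a three-stage distillation framework to transfer knowledge from a pre-trained VLM, substantially improving inference efficiency while maintaining performance.

Details

Motivation: Self-attention in existing vision-language models has quadratic complexity, which makes deployment on edge devices difficult and makes processing high-resolution images and long videos prohibitively expensive.

Result: On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches its teacher models while reducing time-to-first-token (TTFT) by up to 2.7× and improving token throughput by up to 9.0× on minute-length videos.

Insight: The core contributions are the linear-complexity M-MATE block (fusing global state-space modeling with local window attention) and an efficient three-stage distillation framework. Together they show that SOTA multimodal reasoning is achievable without quadratic attention and open a path toward efficient long-context VLMs.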

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.


[53] Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning cs.CV

Weihang You, Hanqi Jiang, Yi Pan, Junhao Chen, Tianming Liu

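TL;DR: This paper proposes NeuroAlign, a brain-inspired hierarchical representation learning framework that aims to bridge the modality gap between neural data and visual inputs and to achieve fine-grained cross-modal understanding of visual stimuli. Mirroring the hierarchical organization of the human visual system, it uses a two-stage mechanism: global semantic understanding (via Neural-Temporal Contrastive Learning) and fine-grained pattern matching (via enhanced vector quantization), which significantly improves cross-modal retrieval performance.

Details

Motivation: Existing methods mostly reduce neural decoding to generation tasks or simple correlation analysis and fail to reflect the hierarchical and temporal processes of visual processing in the brain, so a new approach is needed that better models the complex relationship between neural responses and visual stimuli.

Result: Experiments show that NeuroAlign significantly outperforms existing methods on cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.

Insight: Contributions include a two-stage hierarchical representation learning mechanism inspired by biological visual pathways, Neural-Temporal Contrastive Learning (NTCL) that explicitly models temporal dynamics through bidirectional prediction, and the DynaSyncMM-EMA approach for dynamic multi-modal fusion with adaptive weighting. These designs help align fMRI and video data at a finer granularity and advance research at the intersection of brain science and artificial intelligence.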

Abstract: Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal processes of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while our DynaSyncMM-EMA approach enables dynamic multi-modal fusion with adaptive weighting. Experiments demonstrate that NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.


[54] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding cs.CV | cs.AI

Yixuan Lai, He Wang, Kun Zhou, Tianjia Shao

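TL;DR: This paper proposes Slot-ID, a slot-based temporal identity encoding method for identity-preserving video generation from reference videos. It uses a short reference video (rather than a single image) to capture subject-specific dynamics (such as how a smile forms) and a Sinkhorn-routed encoder to learn compact identity tokens, significantly improving identity preservation while remaining compatible with the pretrained backbone.

Details

Motivation: When generating video from a single reference image, existing models completely ignore temporal characteristics, which leads to pose-locked motion, unnatural warping, and "average" faces when viewpoints and expressions change. The paper aims to balance identity preservation against motion naturalness by exploiting the dynamics in a reference video.

Result: The method consistently improves identity retention under large pose changes and rich facial expressions while maintaining prompt faithfulness and visual realism. Despite adding only lightweight conditioning, it performs well across diverse subjects and prompts.

Insight: The core idea is to use a reference video (rather than a single image) to encode subject-specific temporal dynamics, with a Sinkhorn-routed encoder that learns compact identity tokens compatible with the pretrained backbone. This provides an effective temporal identity encoding scheme for identity-preserving video generation.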

Abstract: Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and “average” faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.
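The abstract does not describe how the Sinkhorn-routed encoder works internally. As background only, the sketch below shows the generic Sinkhorn-Knopp normalization that such routing schemes commonly build on: a score matrix between tokens (rows) and slots (columns) is turned into a soft assignment in which each token spreads one unit of mass and each slot receives an equal share. The function name, the temperature parameter, and the token-to-slot interpretation are assumptions for illustration, not the paper's design.

```python
import math


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Alternately normalize rows and columns of exp(scores / temperature)."""
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    p = [[math.exp((s - m) / temperature) for s in row] for row in scores]
    rows, cols = len(p), len(p[0])
    for _ in range(n_iters):
        # Row step: each token distributes one unit of mass across the slots.
        for i in range(rows):
            total = sum(p[i])
            p[i] = [v / total for v in p[i]]
        # Column step: each slot receives rows / cols units of mass in total.
        for j in range(cols):
            total = sum(p[i][j] for i in range(rows))
            for i in range(rows):
                p[i][j] *= (rows / cols) / total
    return p
```

The balanced column constraint is what keeps any single slot from absorbing all tokens, which is one plausible reason to prefer this kind of routing over a plain softmax.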


[55] Advanced Machine Learning Approaches for Enhancing Person Re-Identification Performance cs.CV

Dang H. Pham, Tu N. Nguyen, Hoa N. Nguyen

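TL;DR: This work proposes three advanced methods to improve person re-identification (ReID) performance, targeting the supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings respectively: SCM-ReID (supervised contrastive learning with hybrid loss optimization), IQAGA and DAPRH (GAN-based image augmentation and pseudo-label refinement for domain adaptation), and ViTC-UReID (an unsupervised method based on Vision Transformer and camera-aware proxy learning). The methods are evaluated comprehensively on several benchmark datasets and all yield significant gains.

Details

Motivation: Person re-identification is essential to multi-camera intelligent surveillance systems in complex environments but faces major challenges such as appearance variations, domain shifts, and limited labeled data. The work proposes advanced methods for the supervised, unsupervised domain adaptation, and fully unsupervised settings to improve the robustness and generalization of ReID.

Result: In the supervised setting, SCM-ReID achieves state-of-the-art accuracy on Market-1501 and CUHK03. In unsupervised domain adaptation, IQAGA and DAPRH improve mAP and Rank-1 by up to 12% over baseline methods in challenging transfer scenarios. In the fully unsupervised setting, ViTC-UReID significantly outperforms existing unsupervised methods on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods.

Insight: Contributions are: 1) combining supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses) to strengthen feature discriminability; 2) combining GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to reduce domain discrepancy in domain adaptation; 3) using Vision Transformer feature encoding in unsupervised ReID together with camera-aware proxy learning and global and local attention. These methods offer new approaches to key limitations in feature learning, domain adaptation, and label noise handling.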

Abstract: Person re-identification (ReID) plays a critical role in intelligent surveillance systems by linking identities across multiple cameras in complex environments. However, ReID faces significant challenges such as appearance variations, domain shifts, and limited labeled data. This dissertation proposes three advanced approaches to enhance ReID performance under supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings. First, SCM-ReID integrates supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses), improving discriminative feature representation and achieving state-of-the-art accuracy on Market-1501 and CUHK03 datasets. Second, for UDA, IQAGA and DAPRH combine GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to mitigate domain discrepancies and enhance cross-domain generalization. Experiments demonstrate substantial gains over baseline methods, with mAP and Rank-1 improvements up to 12% in challenging transfer scenarios. Finally, ViTC-UReID leverages Vision Transformer-based feature encoding and camera-aware proxy learning to boost unsupervised ReID. By integrating global and local attention with camera identity constraints, this method significantly outperforms existing unsupervised approaches on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods. The contributions advance ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, paving the way for robust deployment in real-world surveillance systems.


[56] AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval cs.CV

Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu

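TL;DR: To address the weak spatial understanding of remote sensing vision-language models, this paper presents AirSpatial, a spatially-aware drone imagery dataset, and introduces two new tasks: Spatial Grounding and Spatial Question Answering. Using a two-stage training strategy, the authors build an aerial agent, AirSpatialBot, capable of fine-grained vehicle attribute recognition and retrieval. Experiments validate the approach and reveal the limitations of existing models.

Details

Motivation: Existing remote sensing vision-language models fall short in spatial understanding, which limits their effectiveness in real-world applications. The paper aims to advance remote sensing VLMs by focusing on vehicle imagery captured by drones, with particular attention to spatial awareness.

Result: Experimental results show that the proposed method improves the model's spatial understanding and validate AirSpatialBot's performance on fine-grained vehicle attribute recognition and retrieval. The study also reveals the limitations of existing VLMs on spatial tasks.

Insight: Contributions are: 1) AirSpatial, the first remote sensing spatially-aware dataset to provide 3D bounding boxes; 2) a two-stage training strategy that combines image understanding pre-training with spatial understanding fine-tuning; 3) AirSpatialBot, an aerial agent that dynamically integrates task planning, image understanding, spatial understanding, and task execution.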

Abstract: Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset, AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3D bounding boxes (3DBB). To effectively transfer the existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Utilizing this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot


[57] DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer cs.CV

Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang

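TL;DR: This paper proposes DreamID-V, a Diffusion Transformer-based framework that aims to transfer the strengths of Image Face Swapping (IFS) seamlessly to the video domain, addressing the difficulty in Video Face Swapping (VFS) of achieving identity similarity, attribute preservation, and temporal consistency together. Core contributions include the SyncID-Pipe data pipeline, a Modality-Aware Conditioning module, a Synthetic-to-Real Curriculum mechanism, and an Identity-Coherence Reinforcement Learning strategy, with superior performance validated on the newly built IDBench-V benchmark.

Details

Motivation: Existing video face swapping methods struggle to maintain temporal consistency while preserving identity similarity and attributes (such as pose, expression, and lighting). The paper aims to transfer high-performing image face swapping techniques effectively to the video domain.

Result: Extensive experiments on the proposed comprehensive benchmark IDBench-V show that DreamID-V surpasses existing state-of-the-art (SOTA) methods and exhibits strong versatility, adapting seamlessly to various swap-related tasks.

Insight: Main contributions are: 1) the SyncID-Pipe data pipeline, which pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision; 2) DreamID-V, the first Diffusion Transformer-based VFS framework, whose core Modality-Aware Conditioning module discriminatively injects multi-modal conditions; 3) a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy that enhance visual realism and identity consistency in challenging scenarios. Viewed objectively, bringing the Diffusion Transformer architecture to the VFS task and pairing it with carefully designed data engineering and training strategies is a systematic methodological contribution.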

Abstract: Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address this challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline, SyncID-Pipe, that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework, DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate that DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, adapting seamlessly to various swap-related tasks.


[58] Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration cs.CV | cs.AI | cs.LG

Wentao Bian, Fenglei Xu

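TL;DR: This paper proposes DA-FSS, a decoupled-arbitration model for few-shot 3D point cloud semantic segmentation, to address the "Plasticity-Stability Dilemma" in existing "Fuse-then-Refine" paradigms and the inter-class confusion introduced by CLIP. The model separates the geometric and semantic paths with a Parallel Expert Refinement module, coordinates them with a Stacked Arbitration Module, and outperforms the MM-FSS baseline on S3DIS and ScanNet.

Details

Motivation: In multimodal few-shot 3D point cloud segmentation, the "Fuse-then-Refine" paradigm suffers from a "Plasticity-Stability Dilemma", and CLIP can cause semantic blindness. The goal is a new model that better distinguishes and exploits geometric and semantic information.

Result: Experiments on popular datasets such as S3DIS and ScanNet show that DA-FSS outperforms the MM-FSS baseline and is also better in geometric boundaries, completeness, and texture differentiation.

Insight: The novelty is a decoupled parallel-expert architecture (a Geometric Expert and a Semantic Expert) in which a stacked arbitration module and a knowledge alignment module mutually regularize gradients, ensuring stability while retaining plasticity and improving both the use of multimodal information and model generalization.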

Abstract: In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in “Fuse-then-Refine” paradigms: the “Plasticity-Stability Dilemma.” In addition, CLIP’s inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities’ utilization rate and better leverage each modality’s information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS.


[59] Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation cs.CV

Mingxing Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi

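TL;DR: This paper proposes a monocular depth estimation method that combines a language prior with visual calibration: the relative-depth foundation model and the CLIP text encoder are frozen, and only lightweight calibration heads are trained to recover metric-scale depth. Language predicts an uncertainty-aware parameter envelope, and multi-scale visual features then select image-specific calibration parameters within that envelope, addressing the unidentifiable global scale and the domain-shift sensitivity of monocular metric depth estimation.

Details

Motivation: Relative-depth foundation models transfer well, but monocular metric depth estimation remains ill-posed because global scale is unidentifiable and the task is highly sensitive to domain shift. The paper uses the coarse scale cues provided by language, calibrated with visual features, to recover accurate metric-scale depth.

Result: The method improves in-domain accuracy on NYUv2 and KITTI and is more robust than language-only baselines in zero-shot transfer to SUN-RGBD and DDAD.

Insight: The novelty is to treat language as a prior that predicts an uncertainty-aware parameter envelope rather than a point estimate, then refine the calibration with frozen visual features. This combines language and vision effectively and recovers metric scale in a lightweight way.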

Abstract: Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
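The closed-form least-squares oracle mentioned in this abstract can be illustrated with ordinary linear regression in inverse depth: given relative inverse depth $r_i$ and ground-truth metric depth $d_i$, find the scale $s$ and shift $t$ minimizing $\sum_i (s\,r_i + t - 1/d_i)^2$. The sketch below is a minimal version under assumptions: the function names are hypothetical, pixels are passed as flat lists, and details such as valid-pixel masking or robust weighting that a real pipeline might use are omitted.

```python
def fit_inverse_depth_affine(rel_inv_depth, metric_depth):
    """Closed-form least squares for s, t in: 1 / metric_depth ~ s * rel_inv_depth + t."""
    targets = [1.0 / d for d in metric_depth]
    n = len(targets)
    mean_r = sum(rel_inv_depth) / n
    mean_y = sum(targets) / n
    var_r = sum((r - mean_r) ** 2 for r in rel_inv_depth)
    if var_r == 0:
        raise ValueError("relative inverse depth is constant; scale is not identifiable")
    cov = sum((r - mean_r) * (y - mean_y) for r, y in zip(rel_inv_depth, targets))
    s = cov / var_r              # slope: covariance over variance
    t = mean_y - s * mean_r      # intercept from the two means
    return s, t


def to_metric_depth(rel_inv_depth, s, t):
    """Apply the affine calibration in inverse depth, then invert to get depth."""
    return [1.0 / (s * r + t) for r in rel_inv_depth]
```

Because the fit needs ground-truth depth, it can serve only as a per-image training target; at test time the calibration has to be predicted, which is the role the abstract assigns to the language envelope and the visual selection step.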


[60] Robust Ship Detection and Tracking Using Modified ViBe and Backwash Cancellation Algorithm cs.CV

Mohammad Hassan Saghafi, Seyed Majid Noorhosseini, Seyed Abolfazl Seyed Javadein, Hadi Khalili

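TL;DR: This paper proposes a robust real-time method for ship detection and tracking in coastal video sequences. A modified ViBe algorithm detects moving objects (ships and backwash), and a new backwash cancellation method based on geometric properties and the concept of brightness distortion handles the dynamic, unpredictable nature of coastal scenes.

Details

Motivation: Ship detection and tracking in coastal scenes is hard because of dynamic sea conditions, lighting changes, and unpredictable environments, so a robust real-time detection method is needed.

Result: Experimental results indicate that the proposed strategy and methods perform well in ship detection and tracking, with real-time and precise performance. No specific benchmarks or comparisons with SOTA methods are mentioned.

Insight: Contributions include modifying ViBe to reduce the probability of losing ships, updating the background quickly to cope with sea waves and lighting changes, and a backwash cancellation method based on ship geometry and brightness distortion, which together improve robustness in dynamic coastal environments.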

Abstract: In this paper, we propose a robust real-time method for detecting and tracking ships in coastal video sequences. Since coastal scenarios are unpredictable and scenes have dynamic properties, it is essential to apply detection methods that are robust to these conditions. This paper presents a modified ViBe for moving object detection, which detects ships and backwash. In the modified ViBe, the probability of losing ships is lower than in the original ViBe. It is robust to natural sea waves and lighting variations and can update the background quickly. Based on the geometrical properties of ships and concepts such as brightness distortion, a new method for backwash cancellation is proposed. Experimental results demonstrate that the proposed strategy and methods have outstanding performance in ship detection and tracking. These results also illustrate the real-time and precise performance of the proposed strategy.
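For readers unfamiliar with ViBe, the per-pixel logic of the original algorithm (not the modified version proposed in this paper, whose changes the abstract does not detail) can be sketched as follows. Each pixel keeps a small set of past intensity samples; a new value counts as background if enough samples lie close to it, and the model is refreshed by occasionally overwriting a random sample. The function names and the grayscale simplification are assumptions; the default values follow commonly cited ViBe settings, and the neighbor-propagation step of the original algorithm is omitted for brevity.

```python
import random


def is_background(pixel, samples, radius=20, min_matches=2):
    """ViBe test: background if at least `min_matches` stored samples lie within `radius`."""
    matches = 0
    for s in samples:
        if abs(pixel - s) < radius:
            matches += 1
            if matches >= min_matches:
                return True  # early exit once enough close samples are found
    return False


def update_model(pixel, samples, subsampling=16, rng=random):
    """Random update: with probability 1 / subsampling, overwrite one random sample."""
    if rng.randrange(subsampling) == 0:
        samples[rng.randrange(len(samples))] = pixel
```

In the usual scheme `update_model` is called only for pixels classified as background, so moving objects such as ships do not contaminate the background samples.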


[61] Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization cs.CV

Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng

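TL;DR: This paper proposes Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that addresses the high training and inference costs incurred when parallel test-time scaling for vision-language models requires separately trained generation and verification models. The framework jointly learns answer generation and self-verification within a single policy and, through a preference verification reward and a decoupled optimization mechanism, optimizes generation and verification synergistically.

Details

Motivation: Existing approaches to parallel test-time scaling for vision-language models train separate generation and verification models, which incurs high training and inference costs. The paper seeks a more efficient, unified solution.

Result: The method yields gains on multiple benchmarks: up to +34.1% verification AUC, 53.5% lower inference time, accuracy gains of +2.8% on MathVista and +1.4% on MMMU, +1.9 cIoU on ReasonSeg, and step success rate gains of +1.7% on AndroidControl and +1.0% on GUI Odyssey.

Insight: The contributions claimed in the abstract are: 1) a preference verification reward, which computes the mean verification scores of positive and negative samples as a decision threshold and gives positive feedback when prediction correctness aligns with answer correctness; 2) an advantage decoupled optimization mechanism, which computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives to calibrate verification scores while preserving generation quality. Viewed objectively, the core novelty is unifying generation and verification in a single reinforcement learning framework that can be optimized synergistically, with a decoupled design that balances the two objectives. This offers a new approach to building more efficient, integrated vision-language models.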

Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
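One way to read the two mechanisms in this abstract is sketched below. It rests on several assumptions that the abstract does not confirm: the decision threshold is taken as the midpoint between the mean verification score of correct samples and that of incorrect samples, the reward is binary, advantages use the usual group normalization associated with GRPO, and all function names are hypothetical.

```python
import statistics


def verification_rewards(scores, correct):
    """Reward 1.0 when thresholded verification agrees with answer correctness."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        return [0.0] * len(scores)  # assumed fallback: no contrast within the group
    # Assumption: the threshold is the midpoint of the two class means.
    threshold = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return [1.0 if (s >= threshold) == bool(c) else 0.0 for s, c in zip(scores, correct)]


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_advantages(is_verification_token, gen_adv, ver_adv):
    """Token mask: verification tokens get ver_adv, all other tokens get gen_adv."""
    return [ver_adv if flag else gen_adv for flag in is_verification_token]
```

Computing `group_advantages` once on generation rewards and once on verification rewards, then routing each through the token mask, keeps a poor verification score from penalizing the answer tokens and vice versa.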


[62] DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion cs.CV | cs.AI

Ziyue Zhang, Luxi Lin, Xiaolin Hu, Chao Chang, HuaiXi Wang

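TL;DR: This paper proposes DeepInv, a novel self-supervised learning approach for fast and accurate diffusion model inversion. A self-supervised objective and a data augmentation strategy generate high-quality pseudo noise, and an iterative multi-scale training regime trains a parameterized inversion solver that maps images back to noise efficiently.

Details

Motivation: Diffusion inversion, the task of recovering the noise of an image in a diffusion model, is essential for controllable diffusion image editing. Lacking effective supervision signals, existing methods rely on approximate solutions, often at the cost of performance or efficiency.

Result: Experiments on the COCO dataset show that DeepInv substantially outperforms existing methods in both performance and inference speed, for example +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise.

Insight: Contributions include the first trainable solver that predicts inversion noise step by step and a strategy for generating pseudo noise through self-supervision and data augmentation, offering the community a new approach to efficient inversion.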

Abstract: Diffusion inversion is the task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. Diffusion inversion remains challenging because viable supervision signals are lacking. Most existing methods therefore resort to approximation-based solutions, which often come at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective and a data augmentation strategy to generate high-quality pseudo noise from real images without manual intervention. Building on these two designs, DeepInv is also equipped with an iterative, multi-scale training regime to train a parameterized inversion solver, thereby achieving fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt to present a trainable solver that predicts inversion noise step by step. Extensive experiments show that DeepInv achieves much better performance and inference speed than the compared methods, e.g., +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise on the COCO dataset. Moreover, our careful design of trainable solvers can also provide insights to the community. Code and model parameters will be released at https://github.com/potato-kitty/DeepInv.


[63] DiffKD-DCIS: Predicting Upgrade of Ductal Carcinoma In Situ with Diffusion Augmentation and Knowledge Distillation cs.CV

Tao Li, Qing Li, Na Li, Hui Xie

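TL;DR: This paper proposes the DiffKD-DCIS framework for predicting the risk that ductal carcinoma in situ (DCIS) upgrades to invasive ductal carcinoma (IDC). The framework combines a conditional diffusion model for data augmentation with teacher-student knowledge distillation to address limited ultrasound data and poor model generalization.

Details

Motivation: Traditional deep learning methods struggle to predict DCIS upgrade because ultrasound data are limited and generalization is poor. This study aims to overcome these limits with data augmentation and knowledge distillation.

Result: Evaluated on a multi-center dataset of 1,435 cases, the synthetic images were of good quality. The student network has fewer parameters and faster inference, outperforms partial-combination models on external test sets, and reaches accuracy comparable to senior radiologists and superior to junior ones, showing significant clinical potential.

Insight: The novelty lies in using a conditional diffusion model for medical image data augmentation and combining it with knowledge distillation to build a lightweight student network that improves computational efficiency while keeping high accuracy, offering a new approach to small-sample and generalization problems in medical image analysis.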

Abstract: Accurately predicting the upgrade of ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC) is crucial for surgical planning. However, traditional deep learning methods face challenges due to limited ultrasound data and poor generalization ability. This study proposes the DiffKD-DCIS framework, integrating conditional diffusion modeling with teacher-student knowledge distillation. The framework operates in three stages: First, a conditional diffusion model generates high-fidelity ultrasound images using multimodal conditions for data augmentation. Then, a deep teacher network extracts robust features from both original and synthetic data. Finally, a compact student network learns from the teacher via knowledge distillation, balancing generalization and computational efficiency. Evaluated on a multi-center dataset of 1,435 cases, the synthetic images were of good quality. The student network had fewer parameters and faster inference. On external test sets, it outperformed partial combinations, and its accuracy was comparable to senior radiologists and superior to junior ones, showing significant clinical potential.
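
The abstract does not specify the distillation objective. As a hedged illustration of the teacher-student stage, the sketch below uses the classic temperature-softened formulation (KL divergence to the teacher plus cross-entropy to the hard label). The function names, `temperature`, and `alpha` are assumptions for illustration, not values from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Single-sample distillation loss (assumed classic form, not the paper's)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Cross-entropy of the student against the hard label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target term on a comparable gradient scale.
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * ce
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted cross-entropy remains. This makes `alpha` a direct knob between imitating the teacher and fitting the labels.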


[64] FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation cs.CV | cs.AI

Gen Li, Peiyu Liu

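TL;DR: This paper proposes VideoSpeculateRAG, an efficient video question answering framework that combines speculative decoding with retrieval-augmented generation. A lightweight draft model quickly generates multiple answer candidates, which a more accurate heavyweight model then verifies and refines, substantially reducing inference latency without sacrificing accuracy. A similarity-based filtering strategy corrects entity recognition errors in retrieved knowledge and improves overall answer accuracy.

Details

Motivation: Current vision-language models struggle to integrate external knowledge, and existing retrieval-augmented generation methods are inefficient and often fail to maintain high answer quality. This paper aims to address these challenges and improve efficiency and reliability on complex, knowledge-intensive multimodal tasks.

Result: Experiments show that VideoSpeculateRAG achieves accuracy comparable to or higher than standard RAG methods while accelerating inference by approximately 2x.

Insight: The main novelty is bringing the speculative decoding paradigm into a retrieval-augmented generation framework to accelerate inference, together with a simple similarity-based filtering strategy that improves entity alignment in retrieved knowledge and thus answer accuracy. Combining efficient inference techniques with knowledge augmentation is a promising direction.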

Abstract: Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
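
The two ideas in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not describe its similarity measure or threshold, so cosine similarity and `threshold=0.8` are placeholders, and `draft_fn` and `verify_fn` are hypothetical callables standing in for the lightweight and heavyweight models. The refinement step mentioned in the abstract is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_retrieved(query_vec, retrieved, threshold=0.8):
    """Keep retrieved (text, vector) items that are similar enough to the query."""
    return [text for text, vec in retrieved
            if cosine_similarity(query_vec, vec) >= threshold]


def speculate_then_verify(question, context, draft_fn, verify_fn, num_candidates=3):
    """Draft several answers cheaply, then let the stronger model score them."""
    candidates = draft_fn(question, context, num_candidates)
    scored = [(verify_fn(question, context, answer), answer) for answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer
```

The latency benefit in such a design would come from the heavyweight model scoring short candidates instead of generating every token itself.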


[65] BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding cs.CV

Hongbing Li, Linhui Xiao, Zihan Zhao, Qi Shen, Yixiang Huang

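TL;DR: This paper proposes BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding, which targets the bias caused by over-entangled modality features in existing one-tower architectures and the challenge of insufficient semantic reasoning.

Details

Motivation: Existing one-tower visual grounding methods have two main limitations: over-entangled multimodal representations exacerbate deceptive modality biases, and insufficient semantic reasoning hinders comprehension of referential cues.

Result: Experiments on five benchmarks show that BARE not only achieves state-of-the-art performance but also offers better computational efficiency than existing methods.

Insight: The novelty lies in three modules (a language salience modulator, visual bias correction, and referential relationship enhancement) that preserve modality-specific features and construct referential semantics, jointly mitigating multimodal distractions and enhancing referential comprehension.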

Abstract: Visual Grounding (VG), which aims to locate a specific region referred to by expressions, is a fundamental yet challenging task in the multimodal understanding fields. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulator, (ii) visual bias correction and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at https://github.com/Marloweeee/BARE.


[66] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving cs.CV | cs.AI | cs.RO

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li

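TL;DR: This paper presents DrivingGen, the first comprehensive benchmark for generative world models in autonomous driving. The benchmark combines a diverse evaluation dataset curated from driving datasets and internet video sources with a new suite of metrics that jointly measure visual realism, trajectory plausibility, temporal coherence, and controllability. An evaluation of 14 state-of-the-art models reveals a trade-off between visual quality and physical plausibility for general versus driving-specific models.

Details

Motivation: Video generation models in autonomous driving (driving world models) lack a rigorous benchmark to measure progress and guide research priorities. Existing evaluations are limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is insufficiently evaluated. In addition, existing datasets fail to cover the diversity of conditions required for real-world deployment.

Result: Evaluating 14 state-of-the-art models on DrivingGen reveals clear trade-offs: general models look better but violate physics, while driving-specific models capture motion more realistically but lag in visual quality. The benchmark offers a unified evaluation framework intended to foster reliable, controllable, and deployable driving world models.

Insight: The paper's novelty is the first comprehensive benchmark for generative driving world models, which addresses the shortcomings of existing evaluations by combining a diverse dataset with new multi-dimensional metrics (visual realism, trajectory plausibility, temporal coherence, and controllability). Viewed objectively, the benchmark provides a standardized tool for comparing and improving models in the field, and supports scalable simulation, planning, and data-driven decision-making in autonomous driving.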

Abstract: Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.


[67] Improving Flexible Image Tokenizers for Autoregressive Image Generation cs.CV

Zixuan Fu, Lanqing Guo, Chong Wang, Binbin Song, Ding Liu

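TL;DR: This paper proposes ReToK, a new flexible image tokenizer. Existing flexible tokenizers, trained with nested dropout, use a tail-truncation strategy that concentrates image information at the front of the sequence and thereby limits downstream autoregressive image generation. ReToK introduces redundant token padding and hierarchical semantic regularization to make fuller use of all tokens for latent modeling, and achieves better generation performance on ImageNet 256×256 than existing flexible and fixed-length tokenizers.

Details

Motivation: Existing flexible image tokenizers are trained with nested dropout and a tail-truncation strategy, which concentrates image information in the early tokens of the sequence. As the token sequence grows longer, this uneven distribution of information limits the effectiveness of downstream autoregressive image generation models.

Result: Extensive experiments on the ImageNet 256×256 benchmark show that the method outperforms existing flexible and fixed-length tokenizers in generation performance.

Insight: The main contributions are: 1. Redundant token padding, which activates tail tokens more frequently and alleviates the over-concentration of information in early tokens. 2. Hierarchical semantic regularization, which aligns the decoding features of earlier tokens with a pre-trained vision foundation model while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Together these offer a new perspective on information distribution and regularization in flexible sequence modeling.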

Abstract: Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose ReToK, a flexible tokenizer with Redundant Token Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce Redundant Token Padding to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply Hierarchical Semantic Regularization to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256×256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: https://github.com/zfu006/ReTok
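
To see why tail truncation concentrates information in early tokens, the sketch below simulates plain nested dropout and measures how often each token position survives. It illustrates only the baseline behavior the abstract criticizes, not Redundant Token Padding itself. The uniform sampling of the kept length and all function names are assumptions.

```python
import random


def sample_keep_length(num_tokens, rng, min_keep=1):
    """Sample how many leading tokens survive (uniform; an assumed distribution)."""
    return rng.randint(min_keep, num_tokens)


def truncate_tail(tokens, keep_length):
    """Tail truncation: keep the first keep_length tokens and drop the rest."""
    return tokens[:keep_length]


def keep_frequency(num_tokens, num_trials, seed=0):
    """Empirical fraction of training steps in which each position is kept."""
    rng = random.Random(seed)
    counts = [0] * num_tokens
    for _ in range(num_trials):
        keep_length = sample_keep_length(num_tokens, rng)
        for position in range(keep_length):
            counts[position] += 1
    return [count / num_trials for count in counts]
```

Position 0 is kept in every trial and the frequency can only fall toward the tail. This is the imbalance that a padding scheme which activates tail tokens more often would aim to reduce.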


[68] EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding cs.CV | cs.AI | cs.LG

Tianjun Gu, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie

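TL;DR: This paper proposes Teleo-Spatial Intelligence (TSI), a new paradigm that unifies physical-dynamic reasoning and intent-driven reasoning, and introduces the EscherVerse benchmark, dataset, and models to evaluate an agent's understanding of object permanence, state transitions, trajectory prediction, and the underlying human intent in open-world, human-centric dynamic scenes.

Details

Motivation: Current spatial reasoning research overlooks the human intent behind spatial changes. A new paradigm is therefore needed to unify physical-dynamic understanding and intent inference, moving spatial intelligence from passive scene description toward holistic, purpose-driven understanding.

Result: The paper presents the EscherVerse benchmark (Escher-Bench), dataset (Escher-35k), and models (Escher series). It is the first benchmark to systematically assess intent-driven reasoning, challenging models to connect physical events to their underlying human purposes.

Insight: The novelty lies in the TSI paradigm, which combines physical-dynamic reasoning with intent-driven reasoning, and in building the first large-scale open-world benchmark and dataset focused on intent-driven reasoning; its novel data curation pipeline provides a foundational resource for spatial intelligence research.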

Abstract: The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.


[69] Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV | cs.MM

Haonan Cai, Yuxuan Luo, Zhouhui Lian

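TL;DR: This paper proposes GAR-Font, a new autoregressive framework for multimodal few-shot font generation (FFG). It captures local structures and global stylistic patterns with a global-aware tokenizer, performs multimodal style encoding through a lightweight language-style adapter, and applies a post-refinement pipeline to improve structural fidelity and style coherence.

Details

Motivation: Existing few-shot font generation methods are usually confined to the image-to-image paradigm, relying on visual references and overlooking the role of language in conveying stylistic intent in font design. Meanwhile, conventional autoregressive models use patch-level tokenization, which neglects the global dependencies that are crucial for font synthesis.

Result: Extensive experiments show that GAR-Font outperforms existing methods on few-shot font generation, excelling at maintaining global style faithfulness and at producing higher-quality results with textual stylistic guidance.

Insight: The main contributions are: 1) a global-aware tokenizer that models global dependencies in font images; 2) a multimodal style encoder with a lightweight language-style adapter that offers flexible style control without intensive multimodal pretraining; 3) a post-refinement pipeline that further improves generation quality. This offers a new approach for generation tasks that incorporate language guidance.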

Abstract: Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.


[70] FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing cs.CV

Xijie Huang, Chengming Xu, Donghao Luo, Xiaobin Hu, Peng Tang

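TL;DR: This paper presents the FFP-300K dataset and a new first-frame propagation video editing framework that needs no run-time guidance. FFP-300K is a large-scale, high-quality dataset of 300K video pairs at 720p and 81 frames, used to train robust temporal priors. Building on it, the authors design a framework that uses Adaptive Spatio-Temporal RoPE and a self-distillation strategy to resolve the critical tension between maintaining first-frame appearance and preserving source video motion, achieving truly guidance-free video editing.

Details

Motivation: Existing first-frame propagation video editing methods rely heavily on cumbersome run-time guidance. The root cause is that training datasets (typically short, low-resolution, and lacking task diversity) cannot teach models robust temporal priors.

Result: Comprehensive experiments on the EditVerseBench benchmark show that the method significantly outperforms existing academic and commercial models, improving PickScore and VLM score by about 0.2 and 0.3, respectively.

Insight: The core contributions are: 1) the large-scale, high-quality, long-sequence, task-diverse FFP-300K dataset, which addresses the data bottleneck; 2) Adaptive Spatio-Temporal RoPE, which dynamically remaps positional encodings to disentangle appearance and motion references; 3) a self-distillation strategy that uses an identity propagation task as a regularizer, ensuring long-term temporal stability and preventing semantic drift.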

Abstract: First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short and low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, improving PickScore by about 0.2 and VLM score by about 0.3 over these competitors.


[71] MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement cs.CV

Lei Zhu, Lijian Lin, Ye Zhu, Jiahao Wu, Xuehan Hou

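TL;DR: This paper proposes MANGO, a two-stage framework for generating natural multi-speaker 3D talking head animation. The method uses pure image-level supervision and alternate training to reduce the noise of pseudo-3D labels and thus better model real conversational behavior. The first stage models 3D motion from multi-speaker audio with a diffusion-based transformer and a dual-audio interaction module; the second stage generates high-fidelity images with a fast 3D Gaussian renderer and, through alternate training, provides 2D-level photometric supervision for the 3D motion. The authors also build the MANGO-Dialog dataset, which contains over 50 hours of aligned 2D-3D conversational data across 500+ identities.

Details

Motivation: Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios and lack natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, with fluid transitions between speaking and listening states, remains a key challenge. Existing 3D conversational avatar methods rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics.

Result: Extensive experiments show that the method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.

Insight: The contributions include: 1) a two-stage framework that uses pure image-level supervision and alternate training to avoid the noise of pseudo-3D labels; 2) a diffusion-based transformer with a dual-audio interaction module that models natural 3D motion driven by multi-speaker audio; 3) a fast 3D Gaussian renderer that provides 2D-level photometric supervision and improves the realism of 3D motion; 4) MANGO-Dialog, a large-scale, high-quality dataset of aligned 2D-3D conversational data that serves as an important resource for multi-speaker 3D dialogue generation research.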

Abstract: Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce MANGO, a novel two-stage framework that leverages pure image-level supervision through alternate training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.


[72] CTIS-QA: Clinical Template-Informed Slide-level Question Answering for Pathology cs.CV

Hao Lu, Ziniu Qian, Yifu Li, Yang Zhou, Bingzheng Wei

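TL;DR: This paper proposes a pipeline, based on clinical diagnosis templates, for collecting and structuring pathological information, and builds CTIS-Align, a dataset for vision-language alignment training, and CTIS-Bench, a visual question answering benchmark. On this basis the authors further propose CTIS-QA, a model whose dual-stream architecture mimics the diagnostic process of pathologists and which outperforms existing state-of-the-art models on multiple benchmarks.

Details

Motivation: Information extraction from pathology reports lacks standardization, and existing visual question answering tasks in pathology lack clinical relevance and cannot readily assess genuine slide understanding.

Result: Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics.

Insight: The contributions are: 1) a clinical pathology report template designed from the CAP Cancer Protocols, enabling standardized, structured extraction of pathological information; 2) a VQA benchmark that emphasizes clinically grounded, closed-ended questions and is closer to real diagnostic workflows; 3) a dual-stream model architecture that mimics pathologists' diagnostic approach, combining global context with perception of salient local regions.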

Abstract: In this paper, we introduce a clinical diagnosis template-based pipeline to systematically collect and structure pathological information. In collaboration with pathologists and guided by the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a Slide-level Question Answering model, featuring a dual-stream architecture that mimics pathologists' diagnostic approach. One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through an attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics. Code and data are available at https://github.com/HLSvois/CTIS-QA.


[73] DDNet: A Dual-Stream Graph Learning and Disentanglement Framework for Temporal Forgery Localization cs.CV | cs.MM | eess.IV

Boyang Zhao, Xin Liao, Jiaxin Chen, Xiaoshuai Wu, Yufeng Wu

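TL;DR: This paper proposes DDNet, a dual-stream graph learning and disentanglement framework for temporal forgery localization in video. By coordinating a Temporal Distance Stream for local artifacts with a Semantic Content Stream for long-range connections, it addresses the tendency of existing methods to miss global anomalies because of their local view, and improves AP@0.95 by approximately 9% on the ForgeryNet and TVIL benchmarks.

Details

Motivation: With the rapid development of AIGC technology, tampering with only small segments of a video is enough to mislead viewers, and existing video-level detection is inaccurate and unpersuasive. This paper aims to precisely pinpoint tampered segments through temporal forgery localization (TFL) and to overcome the local-view limitation that prevents existing methods from capturing global anomalies.

Result: On the ForgeryNet and TVIL benchmarks, DDNet outperforms existing state-of-the-art methods by approximately 9% in AP@0.95, with significant improvements in cross-domain robustness.

Insight: The contributions include: 1) a dual-stream graph learning architecture (Temporal Distance Stream and Semantic Content Stream) that coordinates local and global information and prevents global cues from being drowned out by local smoothness; 2) Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints; 3) Cross-Level Feature Embedding (CLFE), which deeply fuses hierarchical features to build a robust feature foundation. Viewed objectively, the framework integrates multi-scale information through graph learning and disentanglement, improving the accuracy and generalization of temporal forgery localization.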

Abstract: The rapid evolution of AIGC technology makes it possible to mislead viewers by tampering with only small segments of a video, rendering video-level detection inaccurate and unpersuasive. Consequently, temporal forgery localization (TFL), which aims to precisely pinpoint tampered segments, becomes critical. However, existing methods are often constrained by a local view, failing to capture global anomalies. To address this, we propose a dual-stream graph learning and disentanglement framework for temporal forgery localization (DDNet). By coordinating a Temporal Distance Stream for local artifacts and a Semantic Content Stream for long-range connections, DDNet prevents global cues from being drowned out by local smoothness. Furthermore, we introduce Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints, alongside Cross-Level Feature Embedding (CLFE) to construct a robust feature foundation via deep fusion of hierarchical features. Experiments on ForgeryNet and TVIL benchmarks demonstrate that our method outperforms state-of-the-art approaches by approximately 9% in AP@0.95, with significant improvements in cross-domain robustness.
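
For readers unfamiliar with the AP@0.95 metric, the sketch below shows the temporal IoU test that usually underlies it. It assumes the common temporal-localization convention that a predicted segment counts as a hit only when its IoU with a ground-truth segment reaches the threshold. The abstract does not spell out its evaluation protocol, and the function names are illustrative.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) segments, e.g. in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.95):
    """True when the prediction overlaps the ground truth tightly enough."""
    return temporal_iou(pred, truth) >= threshold
```

At a threshold of 0.95, a 10-second forged segment tolerates only about half a second of total boundary error, which is why this is a demanding operating point for localization.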


[74] VerLM: Explaining Face Verification Using Natural Language cs.CV | cs.AI

Syed Abdul Hannan, Hazim Bukhari, Thomas Cantalapiedra, Eman Ansar, Massa Baali

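TL;DR: This paper proposes VerLM, a vision-language model for face verification that not only accurately determines whether two face images depict the same person but also explains its decision in natural language. The model is trained with two complementary explanation styles: concise summaries of the key factors, and detailed descriptions of the specific differences between the images. Adapting an audio-based differentiation model to visual inputs provides cross-modal transfer that significantly improves accuracy and interpretability.

Details

Motivation: Existing face verification systems lack transparency. The goal is a model that both verifies accurately and provides natural language explanations, improving the explainability and reliability of the system.

Result: The model surpasses baseline methods and existing models, showing superior accuracy and interpretability, but the abstract names no specific benchmark and does not state whether it reaches the state of the art.

Insight: The novelty lies in transferring an audio-based differentiation model to a vision task across modalities and in integrating two explanation styles (concise and detailed) to improve interpretability. Viewed objectively, this multi-style explanation framework and cross-modal adaptation strategy offer a new approach for explainable AI.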

Abstract: Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model's accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision-language models in face verification settings, contributing to more transparent, reliable, and explainable face verification systems.


[75] Causality-Aware Temporal Projection for Video Understanding in Video-LLMs cs.CV

Zhengjian Kang, Qi Chen, Rui Liu, Kangtong Mo, Xingyu Zhang

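TL;DR: This paper proposes the V-CORE framework, which adds explicit temporal ordering constraints to Video-LLMs through learnable spatial aggregation and a causality-aware temporal projector, addressing the way existing bidirectional projectors blur temporal order and harm causal coherence. The framework is parameter-efficient, can be trained on a single consumer GPU, and achieves competitive performance on several video question answering benchmarks.

Details

Motivation: Existing parameter-efficient Video-LLMs usually model inter-frame interactions with unconstrained bidirectional projectors. This blurs the temporal order of the video (later frames can influence earlier ones) and provides no explicit modeling of the directional nature of video reasoning, so these models perform poorly on tasks that require temporal consistency and causal coherence.

Result: V-CORE reaches 61.2% accuracy on the NExT-QA benchmark and remains competitive on MSVD-QA, MSRVTT-QA, and TGIF-QA. It gains +3.5% and +5.2% on the temporal and causal reasoning subcategories, respectively, directly validating the importance of explicit temporal ordering constraints.

Insight: The core novelty is a structured unidirectional information flow built from block-causal attention and a terminal dynamic summary token acting as a causal sink, which preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. This offers a new approach to causality-aware temporal projection in video understanding models.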

Abstract: Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.
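
The block-causal attention pattern described above can be made concrete with a small mask builder. This is a sketch under assumptions: the token layout (frame tokens in temporal order followed by one summary token) and the choice to let the summary token attend to itself are illustrative, and the function name is not from the paper.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Build mask[q][k], True where query token q may attend to key token k.

    Assumed layout: all frame tokens in temporal order, then one summary token.
    """
    num_frame_tokens = num_frames * tokens_per_frame
    size = num_frame_tokens + 1  # +1 for the terminal summary token
    mask = [[False] * size for _ in range(size)]
    for q in range(num_frame_tokens):
        q_frame = q // tokens_per_frame
        for k in range(num_frame_tokens):
            k_frame = k // tokens_per_frame
            # Full attention inside a frame, causal attention across frames.
            mask[q][k] = k_frame <= q_frame
    # The summary token reads every frame token and itself; no frame token
    # reads the summary token, so it behaves as a sink for temporal information.
    mask[num_frame_tokens] = [True] * size
    return mask
```

With two frames of two tokens each, the first frame's rows see only the first frame, the second frame's rows see both frames, and only the final row sees everything.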


[76] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning cs.CV

Sungjune Park, Hongda Mao, Qingshuang Chen, Yong Man Ro, Yelin Kim

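TL;DR: This paper proposes a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. A context perceiver, guided by a language description, summarizes the video content, and two training objectives focus the model on regions of interest and suppress distractions from irrelevant regions.

Details

Motivation: Demand for egocentric video analysis is growing, but the complexity and ambiguity of dynamic scenes make attention prediction challenging. The work is motivated by the crucial role that scene context plays in modulating human attention.

Result: Extensive experiments on the Ego4D and Aria Everyday Activities (AEA) datasets show that the method achieves state-of-the-art performance and improved robustness across diverse, dynamic egocentric scenarios.

Insight: The novelty lies in language-guided scene context-aware learning, which generates context-aware video representations from language descriptions and designs training objectives that sharpen attention focus and suppress distractions. The idea may transfer to multimodal fusion and attention mechanism design.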

Abstract: As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.


[77] RSwinV2-MD: An Enhanced Residual SwinV2 Transformer for Monkeypox Detection from Skin Images cs.CV | cs.AI

Rashid Iqbal, Saddam Hussain Khan

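TL;DR: This paper proposes RSwinV2-MD, a deep learning model for detecting monkeypox (Mpox) from skin images. The method enhances SwinTransformerV2 with a customized hierarchical transformer structure, shifted-window attention, and an Inverse Residual Block (IRB), aiming to combine global and local features effectively and to improve classification of skin lesions such as Mpox, chickenpox, measles, and cowpox.

Details

Motivation: To address the locality limitations and vanishing gradient issues that existing methods may face in skin lesion classification, and to improve the accuracy and robustness of automated diagnosis of Mpox and similar diseases.

Result: On a public Kaggle dataset, RSwinV2-MD achieves 96.21% accuracy and a 95.62% F1 score, outperforming standard CNN models and SwinTransformer and reaching the state of the art (SOTA).

Insight: The contributions include: 1) a hierarchical transformer structure customized to the input dimensionality for computational efficiency; 2) shifted-window attention that mitigates the locality issues of non-overlapping regions; 3) an Inverse Residual Block (IRB) that uses convolutional skip connections to address vanishing gradients and to fuse global and local patterns. These designs improve the model's ability to distinguish subtle differences between lesions.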

Abstract: This paper proposes a deep learning approach for Mpox diagnosis, the Customized Residual SwinTransformerV2 (RSwinV2), which aims to improve lesion classification. RSwinV2 customizes the hierarchical transformer structure according to the input dimensionality, the embedding structure, and the targeted output. The input image is split into non-overlapping patches, which are processed with shifted-window attention. This links the windows efficiently and avoids the locality issues of non-overlapping regions in attention, while remaining computationally efficient. RSwinV2 builds on SwinTransformer and includes patch and position embeddings, applying multi-head attention to these embeddings to exploit the transformer's global-linking capability. Furthermore, RSwinV2 incorporates an Inverse Residual Block (IRB), which uses convolutional skip connections to address vanishing gradient issues during processing. The IRB allows the method to link global as well as local patterns, which improves lesion classification by minimizing the variability within Mpox and increasing the differences between Mpox, chickenpox, measles, and cowpox. In testing, RSwinV2 achieved an accuracy of 96.21% and an F1 score of 95.62% on the Kaggle public dataset, outperforming standard CNN models and SwinTransformers; RSwinV2 has thus proved its value as a computer-assisted tool for interpreting Mpox lesion observations.


[78] ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting cs.CV

Chuhang Ma, Shuai Tan, Ye Pan, Jiaolong Yang, Xin Tong

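TL;DR: This paper proposes ESGaussianFace, a framework that uses 3D Gaussian Splatting for emotional and stylized audio-driven facial animation. With an emotion-audio-guided spatial attention mechanism, 3D Gaussian deformation predictors, and a multi-stage training strategy, it efficiently generates high-quality, 3D-consistent facial videos.

Details

Motivation: Most existing audio-driven facial animation research focuses on neutral emotions, and generating high-quality talking head videos that combine emotional expression with style features remains challenging. This paper aims to address that problem.

Result: In experiments, the method outperforms existing state-of-the-art techniques in lip movement accuracy, expression variation, and style feature expressiveness, and produces efficient, high-quality, 3D-consistent results.

Insight: The contributions include an emotion-audio-guided spatial attention mechanism, two 3D Gaussian deformation predictors, and a multi-stage training strategy, which together integrate emotion and style features and improve facial detail reconstruction and animation quality.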

Abstract: Most current audio-driven facial animation research primarily focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points through emotion and style features, we introduce two 3D Gaussian deformation predictors. Furthermore, we propose a multi-stage training strategy, enabling the step-by-step learning of the character's lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip movement accuracy, expression variation, and style feature expressiveness.


[79] GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection cs.CV

Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu

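TL;DR: This paper proposes GCR (Geometry-Consistent Routing), a lightweight mixture-of-experts framework that addresses routing instability in task-agnostic continual anomaly detection. It routes each test image to a category-specific expert in a shared frozen patch-embedding space according to geometric consistency (minimizing the accumulated nearest-prototype distance), which separates the cross-expert decision from within-expert anomaly scoring and avoids cross-category score comparability issues.

Details

Motivation: Real industrial inspection deployments need anomaly detection in a task-agnostic setting where categories keep expanding and the category identity is unknown at test time. Existing methods route by comparing anomaly scores across independently built experts; because score distributions differ substantially across categories in scale and tail behavior, these routing rules are unreliable and degrade overall performance.

Result: Experiments on the MVTec AD and VisA benchmarks show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive anomaly detection and localization performance.

Insight: The novelty lies in decoupling the routing decision from anomaly scoring and routing by geometric consistency directly in the shared frozen feature space, with no need for end-to-end representation learning. The core insight is that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-expert routing.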

Abstract: Feature-based anomaly detection is widely adopted in industrial inspection due to the strong representational power of large pre-trained vision encoders. While most existing methods focus on improving within-category anomaly scoring, practical deployments increasingly require task-agnostic operation under continual category expansion, where the category identity is unknown at test time. In this setting, overall performance is often dominated by expert selection, namely routing an input to an appropriate normality model before any head-specific scoring is applied. However, routing rules that compare head-specific anomaly scores across independently constructed heads are unreliable in practice, as score distributions can differ substantially across categories in scale and tail behavior. We propose GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection through geometry-consistent routing. GCR routes each test image directly in a shared frozen patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks, and then computes anomaly maps only within the routed expert using a standard prototype-based scoring rule. By separating cross-head decision making from within-head anomaly scoring, GCR avoids cross-head score comparability issues without requiring end-to-end representation learning. Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive detection and localization performance. These results indicate that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing. Code is available at https://github.com/jw-chae/GCR
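
The routing rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the Euclidean metric, the toy 2-D "patch embeddings", and the names `route`, `anomaly_map`, and `banks` are hypothetical and do not come from the paper or its repository.

```python
import math

def nearest_prototype_distance(patch, bank):
    """Distance from one patch embedding to its closest prototype in a bank."""
    return min(math.dist(patch, proto) for proto in bank)

def route(patches, banks):
    """Pick the category whose bank gives the smallest accumulated distance."""
    scores = {
        name: sum(nearest_prototype_distance(p, bank) for p in patches)
        for name, bank in banks.items()
    }
    return min(scores, key=scores.get), scores

def anomaly_map(patches, bank):
    """Per-patch anomaly score, computed only inside the routed expert."""
    return [nearest_prototype_distance(p, bank) for p in patches]

# Toy prototype banks (assumed values) for two hypothetical categories.
banks = {
    "bottle": [(0.0, 0.0), (1.0, 0.0)],
    "cable": [(5.0, 5.0), (6.0, 5.0)],
}
# Three patch embeddings from one test image; the last one is an outlier.
patches = [(0.1, 0.0), (0.9, 0.1), (3.0, 3.0)]

expert, scores = route(patches, banks)
patch_scores = anomaly_map(patches, banks[expert])
print(expert, patch_scores)
```

Routing compares distances measured in one shared embedding space, so the per-expert anomaly scores never have to be compared across categories; they are only read after the expert has been chosen.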


[80] RRNet: Configurable Real-Time Video Enhancement with Arbitrary Local Lighting Variations cs.CV

Wenlong Yang, Canran Jin, Weihang Yuan, Chao Wang, Lifeng Sun

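TL;DR: RRNet is a lightweight, configurable framework for real-time video enhancement. It performs localized relighting by estimating virtual light source parameters and applying a depth-aware rendering module, achieves a state-of-the-art tradeoff between visual quality and efficiency, and suits applications such as video conferencing, AR portrait enhancement, and mobile photography.

Details

Motivation: Existing real-time video enhancement methods struggle to balance speed with effective exposure control under uneven lighting.

Result: RRNet consistently outperforms prior methods on low-light enhancement, localized illumination adjustment, and glare removal, achieving a state-of-the-art tradeoff between visual quality and efficiency.

Insight: The novel contributions are: interpretable local lighting control through a minimal set of virtual light sources; depth-aware rendering that needs no pixel-aligned training data; a low-cost, generative-AI-based pipeline for building a diverse lighting dataset; and a lightweight encoder and prediction head that support real-time, high-resolution processing.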

Abstract: With the growing demand for real-time video enhancement in live applications, existing methods often struggle to balance speed and effective exposure control, particularly under uneven lighting. We introduce RRNet (Rendering Relighting Network), a lightweight and configurable framework that achieves a state-of-the-art tradeoff between visual quality and efficiency. By estimating parameters for a minimal set of virtual light sources, RRNet enables localized relighting through a depth-aware rendering module without requiring pixel-aligned training data. This object-aware formulation preserves facial identity and supports real-time, high-resolution performance using a streamlined encoder and lightweight prediction head. To facilitate training, we propose a generative AI-based dataset creation pipeline that synthesizes diverse lighting conditions at low cost. With its interpretable lighting control and efficient architecture, RRNet is well suited for practical applications such as video conferencing, AR-based portrait enhancement, and mobile photography. Experiments show that RRNet consistently outperforms prior methods in low-light enhancement, localized illumination adjustment, and glare removal.


[81] Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion cs.CV

Wenyu Shao, Hongbo Liu, Yunchuan Ma, Ruili Wang

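TL;DR: This paper proposes Entity-Guided Multi-Task learning (EGMT), a new method for infrared and visible image fusion. It uses large vision-language models to extract entity-level textual information from image captions, builds a parallel multi-task learning architecture (a fusion task plus a multi-label classification task), and develops an entity-guided cross-modal interaction module to improve feature representation and fused image quality.

Details

Motivation: Existing text-driven infrared and visible image fusion methods usually rely on sentence-level textual information, which tends to introduce semantic noise and fails to exploit the deeper semantic value of text. A finer-grained text guidance mechanism is therefore needed to improve fusion.

Result: Extensive experiments on four public datasets (TNO, RoadScene, M3FD, and MSRS) show that EGMT outperforms existing state-of-the-art (SOTA) methods in preserving salient targets, texture details, and semantic consistency.

Insight: The novel contributions are: extracting entity-level textual information from image captions to eliminate semantic noise; providing semantic supervision through a multi-label classification task that uses entities as pseudo-labels, which improves the model's understanding of image content; and an entity-guided cross-modal interaction module that captures cross-modal dependencies at both the inter-visual and visual-entity levels to strengthen feature representation.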

Abstract: Existing text-driven infrared and visible image fusion approaches often rely on textual information at the sentence level, which can lead to semantic noise from redundant text and fail to fully exploit the deeper semantic value of textual information. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach includes three key innovative components: (i) A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models, eliminating semantic noise from raw text while preserving critical semantic information; (ii) A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. By using entities as pseudo-labels, the multi-label classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; (iii) An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features, which enhances feature representation by capturing cross-modal dependencies at both inter-visual and visual-entity levels. To promote the wide application of the entity-guided image fusion framework, we release the entity-annotated version of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency, compared to the state-of-the-art methods. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.


[82] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving cs.CV | cs.AI

Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng

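TL;DR: This paper proposes CogFlow, a cognitively inspired three-stage framework (perception → internalization → reasoning) that addresses the disconnect between perception and reasoning when multimodal large language models solve visual mathematical problems. Using Synergistic Visual Rewards, a Knowledge Internalization Reward model, and a Visual-Gated Policy Optimization algorithm, it improves the extraction and integration of visual information as well as visually grounded reasoning, and it contributes MathCog, a new dataset with high-quality aligned annotations.

Details

Motivation: Existing methods for visual mathematical reasoning focus on improving the extraction and interpretation of visual inputs, but overlook the key question of whether the extracted visual cues are faithfully integrated and effectively used in subsequent reasoning.

Result: Comprehensive experiments and analysis on commonly used visual mathematical reasoning benchmarks validate the superiority of CogFlow, indicating that it reaches an advanced level of performance.

Insight: The novelty is simulating the hierarchical flow of human reasoning, explicitly bridging perception and reasoning through a knowledge internalization stage, and designing Synergistic Visual Rewards, a Knowledge Internalization Reward model, and a Visual-Gated Policy Optimization algorithm to ensure that visual knowledge is faithfully integrated and used, which prevents the model from taking visually ungrounded reasoning shortcuts.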

Abstract: Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception$\Rightarrow$internalization$\Rightarrow$reasoning. In line with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model in the internalization stage, bridging perception and reasoning. Moreover, we design a Visual-Gated Policy Optimization algorithm to further ensure that the reasoning is grounded in the visual knowledge, preventing models from seeking shortcuts in the form of reasoning chains that appear coherent but are visually ungrounded. We also contribute a new dataset, MathCog, for model training, which contains samples with over 120K high-quality perception-reasoning aligned annotations. Comprehensive experiments and analysis on commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.


[83] Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems cs.CV

Niloufar Alipour Talemi, Julia Boone, Fatemeh Afghah

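TL;DR: This paper presents the first systematic survey of the agentic AI paradigm in remote sensing, introduces a unified taxonomy that distinguishes single-agent copilots from multi-agent systems, and analyzes architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures.

Details

Motivation: Current vision foundation models and multimodal large language models lack the sequential planning and active tool orchestration that complex geospatial workflows require. The survey aims to advance the shift of remote sensing analysis from static deep learning models to autonomous agentic AI.

Result: The survey reviews emerging evaluation benchmarks that shift the focus from pixel-level accuracy to trajectory-aware reasoning correctness, and provides a roadmap for the future development of autonomous geospatial intelligence.

Insight: The novelty is a systematic taxonomy for agentic AI in remote sensing and an emphasis on the paradigm shift from static model evaluation to dynamic workflow reasoning, which points the way toward autonomous geospatial intelligence systems that are well grounded, safe, and well orchestrated.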

Abstract: The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.


[84] Learning Action Hierarchies via Hybrid Geometric Diffusion cs.CV

Arjun Ramesh Kaushik, Nalini K. Ratha, Venu Govindaraju

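TL;DR: This paper proposes HybridTAS, a new framework for temporal action segmentation in video. It introduces a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. By using abstract, high-level action categories and fine-grained, low-level action classes at different diffusion timesteps, it denoises labels in a coarse-to-fine manner.

Details

Motivation: Existing iterative-refinement-based temporal action segmentation methods do not explicitly exploit the inherent hierarchical structure of human actions; this paper aims to address that.

Result: Extensive experiments on three benchmark datasets (GTEA, 50Salads, and Breakfast) show that the method achieves state-of-the-art performance.

Insight: The core novelty is combining hyperbolic geometry, which naturally represents tree-like hierarchical relationships, with diffusion models so that denoising follows an abstract-to-specific hierarchy. This offers a new geometric perspective on using structured prior knowledge to improve generative or discriminative models.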

Abstract: Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS - a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
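
To make the remark about tree-like relationships concrete, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. The abstract does not say which hyperbolic model HybridTAS uses, so the choice of the Poincaré ball, the 2-D toy embeddings, and the function name `poincare_distance` are assumptions for illustration only.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda x: sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# A "root" embedding near the origin and two "leaf" embeddings near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

root_to_leaf = poincare_distance(root, leaf_a)
leaf_to_leaf = poincare_distance(leaf_a, leaf_b)
euclidean_leaf_to_leaf = math.dist(leaf_a, leaf_b)
print(root_to_leaf, leaf_to_leaf, euclidean_leaf_to_leaf)
```

Points near the boundary end up far from each other but comparatively close to the origin, which mirrors how leaves of a tree relate through their common root. That property is what makes hyperbolic embeddings a natural fit for coarse-to-fine label hierarchies.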


[85] TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing cs.CV

Yujie Hu, Zecheng Tang, Xu Jiang, Weiqi Li, Jian Zhang

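TL;DR: TalkPhoto is a versatile, training-free image editing framework that enables precise image manipulation through conversational interaction. It uses an open-source large language model to analyze user needs and hierarchically invoke existing advanced editing methods, handling complex and unseen editing tasks without additional training and producing stable, high-quality results.

Details

Motivation: Existing instruction-based image editing methods usually build multi-instruction datasets to train a model for multiple editing tasks, which is time-consuming and labor-intensive and still yields unsatisfactory results. TalkPhoto aims to solve this with a training-free framework that improves the flexibility and controllability of image editing.

Result: Extensive experiments show that, across various image editing tasks, the method provides more accurate invocation with lower token consumption and achieves higher editing quality.

Insight: The novelty is a training-free, plug-and-play framework in which a prompt template guides the LLM to analyze user instructions and hierarchically invoke existing editing methods, efficiently integrating complex editing tasks and improving editing quality and flexibility.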

Abstract: Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with lower token consumption but also achieves higher editing quality across various image editing tasks.


[86] AR-MOT: Autoregressive Multi-object Tracking cs.CV

Lianjie Jia, Yuhan Wu, Binghao Ran, Yifan Wang, Lijun Wang

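TL;DR: This paper proposes AR-MOT, a new autoregressive paradigm for multi-object tracking (MOT) built on a large language model, which formulates MOT as a sequence generation problem. By introducing an Object Tokenizer, a Region-Aware Alignment module, and a Temporal Memory Fusion module, it enables flexible tracking without task-specific heads and achieves performance comparable to SOTA on the MOT17 and DanceTrack benchmarks.

Details

Motivation: Existing MOT methods have rigid, task-specific architectures that adapt poorly to general, multi-modal scenarios and new task formulations, which limits their extensibility and flexibility.

Result: Extensive experiments on the MOT17 and DanceTrack benchmarks validate the feasibility of the approach, with performance comparable to current state-of-the-art methods.

Insight: The novelty is recasting MOT as a sequence generation task within an LLM framework, which yields architecture-agnostic, flexible output. The Object Tokenizer, Region-Aware Alignment, and Temporal Memory Fusion modules strengthen visual perception and long-term tracking, laying a foundation for more general and extensible MOT systems.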

Abstract: As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.


[87] MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering cs.CV

Zhifei Li, Yiran Wang, Chenyi Xiong, Yujing Xia, Xiaoju Hou

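TL;DR: This paper proposes MacVQA, a new framework for continual visual question answering. It fuses visual and textual information through adaptive memory allocation and global noise filtering, aiming to balance knowledge retention, adaptation to new information, and robust feature representation.

Details

Motivation: Current continual VQA methods struggle to balance knowledge retention, adaptability, and robust feature representation; this paper aims to address these problems.

Result: Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.

Insight: The novelty is combining prototype-based adaptive memory allocation, which optimizes feature quality and memory usage, with global noise filtering, which ensures robust representations. Together they help balance knowledge acquisition, retention, and compositional generalization in continual learning.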

Abstract: Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle with balancing knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose a novel framework with adaptive memory allocation and global noise filtering called MacVQA for visual question answering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.


[88] MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization cs.CV

Zhexin Zhang, Yifeng Zhu, Yangyang Xu, Long Chen, Yong Du

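TL;DR: MotionAdapter is a diffusion-transformer-based video motion transfer framework that achieves robust, semantically aligned motion transfer through content-aware attention customization. It first extracts attention-derived motion fields from 3D full-attention modules, then uses a DINO-guided motion customization module to rearrange and refine the motion fields according to content correspondences, and finally uses the customized motion field to guide the DiT denoising process, so that the synthesized video inherits the reference motion while preserving the target appearance and semantics.

Details

Motivation: Transferring complex motion between videos remains challenging for diffusion-based text-to-video models. It requires explicitly disentangling motion from appearance and adaptively customizing motion to the target content.

Result: The method outperforms state-of-the-art methods in both qualitative and quantitative evaluations, and naturally supports complex motion transfer and motion editing tasks such as zooming.

Insight: The novelty is explicitly disentangling motion by analyzing cross-frame attention, and introducing a DINO-guided content-aware module that bridges the semantic gap between reference and target videos to achieve adaptive motion customization.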

Abstract: Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires (i) explicit disentanglement of motion from appearance and (ii) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.


[89] AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing cs.CV

Tianbo Wang, Yuqing Ma, Kewei Liao, Zhange Zhang, Simin Li

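TL;DR: This paper proposes AFTER, which mitigates object hallucination in large vision-language models (LVLMs) through adaptive factual-guided activation editing. AFTER consists of two components, Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), and adaptively steers the original biased activations toward factual semantics, reducing the category, attribute, and relation hallucinations caused by language bias.

Details

Motivation: LVLMs have made substantial progress on cross-modal tasks, but language bias makes them prone to object hallucination (including category, attribute, and relation hallucination), which hinders trustworthy AI applications. Existing methods that edit internal activations lack effective guidance from factual textual semantics and struggle to mitigate language bias explicitly.

Result: Extensive experiments on standard hallucination benchmarks (such as AMBER) across three widely adopted LVLMs validate the effectiveness of AFTER, which reduces hallucination by up to 16.3% over the baseline on the AMBER benchmark.

Insight: The novel contributions are: 1) Factual-Augmented Activation Steering (FAS), which provides factual and general guidance for activation editing and explicitly models precise visual-textual associations; 2) Query-Adaptive Offset Optimization (QAO), which uses a query-aware offset estimator to establish query-specific edits, increasing the diversity and granularity of editing. Viewed objectively, the method combines factual semantic guidance with adaptive query adjustment and could correct bias in model activations at a finer granularity.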

Abstract: Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guide the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over baseline on the AMBER benchmark. Our code and data will be released for reproducibility.


[90] Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation cs.CV

Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua

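TL;DR: This paper proposes a method that strengthens the spatial reasoning of vision-language models through structured object representation. Its core idea is an object-centric blueprint: for an input image and question, the model first builds a JSON-style blueprint that records object positions, sizes, and attributes, and then reasons over this structured representation to reach the final answer.

Details

Motivation: Existing approaches to improving spatial reasoning have limitations: they either revisit local image patches, which improves fine-grained perception but weakens global spatial awareness, or mark only isolated coordinates, which captures object locations but overlooks the overall organization of objects. A method that accounts for both local detail and global structure is needed.

Result: Experiments show that the method consistently outperforms existing general-purpose vision-language models and specialized spatial reasoning models on spatial reasoning tasks.

Insight: The novelty is integrating the cognitive concept of an object-centric blueprint into vision-language models, supported by three key techniques: blueprint-embedded reasoning traces for supervised fine-tuning, blueprint-aware rewards for reinforcement learning, and anti-shortcut data augmentation that discourages shortcut learning. This gives the model an interpretable, structured intermediate representation that guides it toward more reliable causal reasoning.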

Abstract: Spatial reasoning – the ability to perceive and reason about relationships in space – advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
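
As a rough picture of what a JSON-style blueprint might look like, the Python sketch below builds one and answers a simple left/right question from it. The abstract only states that the blueprint records positions, sizes, and attributes of relevant objects; the field names (`objects`, `name`, `bbox`, `attributes`), the pixel bounding-box convention, and the helper `left_of` are hypothetical and not taken from the paper.

```python
import json

# Hypothetical blueprint: bbox is [x_min, y_min, x_max, y_max] in pixels (assumed).
blueprint = {
    "objects": [
        {"name": "mug", "bbox": [40, 120, 90, 180], "attributes": ["red"]},
        {"name": "laptop", "bbox": [150, 80, 330, 200], "attributes": ["open"]},
    ]
}

def center_x(obj):
    """Horizontal center of an object's bounding box."""
    x_min, _, x_max, _ = obj["bbox"]
    return (x_min + x_max) / 2

def left_of(bp, a, b):
    """Answer 'is a left of b?' purely from the structured blueprint."""
    objs = {o["name"]: o for o in bp["objects"]}
    return center_x(objs[a]) < center_x(objs[b])

serialized = json.dumps(blueprint, indent=2)  # the text a model would emit
answer = left_of(blueprint, "mug", "laptop")
print(serialized)
print(answer)
```

Once the scene is written down in this form, a spatial question reduces to a comparison over explicit fields, which is the sense in which the blueprint serves as an inspectable intermediate representation.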


[91] VIT-Ped: Visionary Intention Transformer for Pedestrian Behavior Analysis cs.CV | cs.AI | cs.RO

Aly R. Elkammar, Karim M. Gamaleldin, Catherine M. Elias

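TL;DR: This paper proposes VIT-Ped, a multimodal algorithm based on the Transformer/ViT architecture that predicts pedestrian intention to improve the safety of autonomous driving. The model comes in variants of different sizes and achieves SOTA performance on the JAAD dataset, surpassing existing methods in metrics such as accuracy, AUC, and F1-score.

Details

Motivation: Pedestrian intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. It aims to understand pedestrian crossing behavior by considering multiple elements and features, thereby improving road safety.

Result: Evaluated on JAAD, a popular pedestrian behavior dataset, the method reaches SOTA (state-of-the-art) performance in metrics such as accuracy, AUC, and F1-score. Extensive ablation studies verify the advantages of different model design choices.

Insight: The novelty is applying Transformer / video vision Transformer architectures to multimodal pedestrian intention prediction, and systematically exploring the effectiveness of different design choices through model size variants and ablation studies, offering architectures and evaluation methods that related work can draw on.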

Abstract: Pedestrian intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features should be taken into consideration to make the roads of tomorrow safer for everybody. We introduce transformer / video vision transformer based algorithms of different sizes that use different data modalities. We evaluated our algorithms on a popular pedestrian behaviour dataset, JAAD, and reached SOTA performance, surpassing the previous SOTA in metrics such as accuracy, AUC, and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.


[92] API: Empowering Generalizable Real-World Image Dehazing via Adaptive Patch Importance Learning cs.CV

Chen Zhu, Huiwen Zhang, Yujie Li, Mu He, Xiaotian Qiao

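TL;DR: This paper proposes Adaptive Patch Importance learning (API), a new framework for improving the generalization of real-world image dehazing. It comprises an Automatic Haze Generation (AHG) module and a Density-aware Haze Removal (DHR) module, and uses a new Multi-Negative Contrastive Dehazing (MNCD) loss to reduce the ambiguity of details in dehazed images.

Details

Motivation: Existing learning-based methods degrade significantly on complex real-world hazy scenes because training data is limited and haze density distributions are intrinsically complex. This paper aims to address these challenges with a real-world image dehazing framework that generalizes well.

Result: Extensive experiments show that the framework achieves state-of-the-art performance on multiple real-world benchmarks, with strong results in both quantitative metrics and qualitative visual quality, and robust generalization across different haze distributions.

Insight: The novel contributions are: 1) the AHG module, which provides a hybrid data augmentation strategy by generating realistic and diverse hazy images as additional high-quality training data; 2) the DHR module, which handles regions with different haze density distributions in an adaptive patch-importance-aware manner; 3) the MNCD loss, which exploits information from multiple negative samples across the spatial and frequency domains to sharpen details.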

Abstract: Real-world image dehazing is a fundamental yet challenging task in low-level vision. Existing learning-based methods often suffer from significant performance degradation when applied to complex real-world hazy scenes, primarily due to limited training data and the intrinsic complexity of haze density distributions. To address these challenges, we introduce a novel Adaptive Patch Importance-aware (API) framework for generalizable real-world image dehazing. Specifically, our framework consists of an Automatic Haze Generation (AHG) module and a Density-aware Haze Removal (DHR) module. AHG provides a hybrid data augmentation strategy by generating realistic and diverse hazy images as additional high-quality training data. DHR considers hazy regions with varying haze density distributions for generalizable real-world image dehazing in an adaptive patch importance-aware manner. To alleviate the ambiguity of the dehazed image details, we further introduce a new Multi-Negative Contrastive Dehazing (MNCD) loss, which fully utilizes information from multiple negative samples across both spatial and frequency domains. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple real-world benchmarks, delivering strong results in both quantitative metrics and qualitative visual quality, and exhibiting robust generalization across diverse haze distributions.


[93] Nighttime Hazy Image Enhancement via Progressively and Mutually Reinforcing Night-Haze Priors cs.CV

Chen Zhu, Huiwen Zhang, Mu He, Yujie Li, Xiaotian Qiao

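TL;DR: This paper proposes a new framework for nighttime hazy image enhancement that improves visibility by progressively and mutually reinforcing the intrinsic consistency between nighttime haze priors and low-light priors. It uses multi-level experts (image-, patch-, and pixel-level) operating across the visual and frequency domains to progressively recover global scene structure, regional patterns, and fine-grained details, and introduces a frequency-aware router that adaptively guides each expert's contribution.

Details

Motivation: Existing methods usually handle a single degradation type, such as haze or low light, in isolation and ignore the interplay between degradation types, which limits the visibility gain. The authors observe that the domain knowledge in low-light priors and haze priors can reinforce each other to improve visibility further.

Result: Extensive experiments on nighttime dehazing benchmarks show superior performance both qualitatively and quantitatively. The model also generalizes well to daytime dehazing and low-light enhancement.

Insight: The core novelty is a framework that progressively and mutually reinforces night-haze priors, treating low-light enhancement and dehazing jointly. Technical highlights include a multi-level expert system spanning the visual and frequency domains and a frequency-aware router for adaptive fusion, which help recover image details more robustly under complex degradation.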

Abstract: Enhancing the visibility of nighttime hazy images is challenging due to the complex degradation distributions. Existing methods mainly address a single type of degradation (e.g., haze or low-light) at a time, ignoring the interplay of different degradation types and resulting in limited visibility improvement. We observe that the domain knowledge shared between low-light and haze priors can be reinforced mutually for better visibility. Based on this key insight, in this paper, we propose a novel framework that enhances visibility in nighttime hazy images by reinforcing the intrinsic consistency between haze and low-light priors mutually and progressively. In particular, our model utilizes image-, patch-, and pixel-level experts that operate across visual and frequency domains to recover global scene structure, regional patterns, and fine-grained details progressively. A frequency-aware router is further introduced to adaptively guide the contribution of each expert, ensuring robust image restoration. Extensive experiments demonstrate the superior performance of our model on nighttime dehazing benchmarks both quantitatively and qualitatively. Moreover, we showcase the generalizability of our model in daytime dehazing and low-light enhancement tasks.


[94] Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding cs.CV

Toshihiko Nishimura, Hirofumi Abe, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida

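TL;DR: This paper proposes a new 3D semantic segmentation method for large-scale point clouds that requires neither annotated 3D training data nor paired RGB images. It projects the 3D point cloud onto 2D images with virtual cameras, performs semantic segmentation with a foundation 2D model guided by natural language prompts, and obtains the 3D segmentation by aggregating multi-view predictions through weighted voting.

Details

Motivation: Traditional supervised methods need large amounts of annotated 3D data and cannot perform open-vocabulary recognition. This work aims to remove both limitations and achieve 3D semantic segmentation without training data.

Result: The method outperforms existing training-free approaches, reaches segmentation accuracy comparable to supervised methods, and supports open-vocabulary recognition.

Insight: The novelty is using a 2D vision-language model (VLM) with natural language prompts for annotation-free 3D segmentation, transferring 2D knowledge to the 3D domain through multi-view projection and weighted voting, which enables flexible open-vocabulary recognition.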

Abstract: This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
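
The multi-view aggregation step can be illustrated with a small weighted-voting sketch in Python. The abstract does not specify how the vote weights are defined, so the weights below are arbitrary placeholder values, and the function name `aggregate_votes` and the `(point_index, label, weight)` tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def aggregate_votes(predictions, num_points):
    """Fuse per-view 2D predictions into one label per 3D point.

    predictions: iterable of (point_index, label, weight) tuples, one for each
    time a 3D point is visible in some virtual view (weights are assumed inputs).
    """
    tallies = [defaultdict(float) for _ in range(num_points)]
    for point_index, label, weight in predictions:
        tallies[point_index][label] += weight
    # Points never seen by any view stay unlabeled (None).
    return [max(t, key=t.get) if t else None for t in tallies]

# Toy votes gathered from several virtual cameras (placeholder values).
views = [
    (0, "road", 0.9), (0, "building", 0.4), (0, "road", 0.3),
    (1, "tree", 0.2), (1, "building", 0.7),
]
labels = aggregate_votes(views, num_points=3)
print(labels)
```

Each point's label is the class with the largest summed weight over all views in which the point is visible, so a single low-confidence view cannot override several consistent ones.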


[95] AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off cs.CV

Yihan Zhu, Mengying Ge

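TL;DR: This paper proposes AlignVTOFF, a new parallel U-Net framework for high-fidelity Virtual Try-Off (VTOFF). A Reference U-Net performs multi-scale feature extraction to improve geometric fidelity, and a novel Texture-Spatial Feature Alignment (TSFA) module injects reference garment features into a frozen denoising U-Net through a hybrid attention design, so that flat-lay garment images with rich high-frequency textures can be synthesized under complex geometric deformation.

Details

Motivation: Existing virtual try-off methods rely on lightweight modules for fast feature extraction and struggle to preserve structured patterns and fine-grained details, which leads to texture attenuation during generation.

Result: Extensive experiments across multiple settings show that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.

Insight: The main novelty is a parallel U-Net framework in which the Reference U-Net focuses on geometric fidelity, while the TSFA module combines trainable cross-attention with frozen self-attention to explicitly align texture and spatial cues, alleviating the loss of high-frequency information during denoising.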

Abstract: Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggle to preserve structured patterns and fine-grained details, leading to texture attenuation during generation. To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design, consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising process. Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.


[96] Agentic Retoucher for Text-To-Image Generation cs.CV | cs.AIPDF

Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang

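TL;DR: This paper proposes Agentic Retoucher, a hierarchical decision-driven framework that addresses the small-scale distortions (e.g., limbs, faces, text) that are pervasive in images generated by text-to-image (T2I) diffusion models such as SDXL and FLUX. The framework models post-generation correction as a human-like perception-reasoning-action loop: a perception agent localizes distortions, a reasoning agent diagnoses them, and an action agent performs localized repair. It also introduces GenBlemish-27K, a dataset with 27K annotated artifact regions, for supervision and evaluation.

Details

Motivation: Images generated by existing T2I diffusion models still show pervasive distortions in fine details, and existing refinement methods are either costly (requiring iterative re-generation) or rely on vision-language models (VLMs) with weak spatial grounding, which leads to semantic drift and unreliable local edits. This paper aims to close that gap and enable more reliable, controllable post-generation correction.

Result: In extensive experiments, Agentic Retoucher consistently outperforms state-of-the-art (SOTA) methods in perceptual quality, distortion localization, and human preference alignment. Quantitative evaluation is carried out on the GenBlemish-27K dataset, establishing a new performance baseline.

Insight: The main innovation is recasting image correction as a hierarchical, human-like decision loop (perception, reasoning, action) that integrates perceptual evidence, linguistic reasoning, and controllable correction into one self-corrective decision process. Viewed objectively, the three agents work together and combine text-image consistency cues with progressive preference alignment, yielding finer local edits that better match human intent and avoiding the semantic drift of existing methods. The large-scale, finely annotated GenBlemish-27K dataset is also a valuable resource for supervision and evaluation in this area.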

Abstract: Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.


[97] InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting cs.CVPDF

Jinlong Fan, Shanshan Zhao, Liang Zheng, Jing Zhang, Yuxiang Yang

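TL;DR: This paper proposes InpaintHuman, a method for reconstructing complete, animatable 3D human avatars from monocular videos with severe occlusions. It combines a multi-scale UV-mapped representation with an identity-preserving diffusion inpainting module to address the corrupted geometry and temporal inconsistency that existing methods exhibit under occlusion.

Details

Motivation: Existing methods based on 3D Gaussian Splatting struggle to reconstruct complete, high-fidelity, animatable 3D human avatars from monocular videos under occlusion, often producing corrupted geometry and temporal inconsistencies.

Result: Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and a real-world dataset (OcMotion) show competitive reconstruction quality, with consistent improvements across diverse poses and viewpoints.

Insight: The innovations are: (1) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, which robustly reconstructs occluded regions while preserving geometric detail; and (2) an identity-preserving diffusion inpainting module that combines textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, the approach uses direct pixel-level supervision to ensure identity fidelity.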

Abstract: Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.


[98] MagicFight: Personalized Martial Arts Combat Video Generation cs.CVPDF

Jiancheng Huang, Mingfu Yan, Songyan Chen, Yi Huang, Shifeng Chen

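TL;DR: This paper introduces a new task, Personalized Martial Arts Combat Video Generation, together with an approach called MagicFight. The goal is to address the identity confusion, anomalous limbs, and action mismatches that existing single-person video generation models exhibit in two-person interaction scenes. The approach builds a bespoke dataset with the Unity game physics engine and refines existing models and strategies to generate high-fidelity two-person combat videos with clear identities and coherent actions.

Details

Motivation: Existing text-to-video generation work concentrates on single-person scenarios and leaves two-person interaction, martial arts combat in particular, unexplored. Existing models cannot capture the subtleties of two-person interaction, which results in low-quality videos.

Result: In experiments on the self-built Unity-generated dataset, MagicFight generates high-fidelity two-person combat videos that preserve individual identities and action coherence, laying the groundwork for this new task.

Insight: The paper is the first to define the task of Personalized Martial Arts Combat Video Generation. It synthesizes a dedicated 3D dataset to compensate for the lack of data and adapts existing models to the challenges of two-person interaction, opening a new direction for interactive video content generation.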

Abstract: Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation. Website: https://MingfuYAN.github.io/MagicFight/ Dataset: https://huggingface.co/datasets/MingfuYAN/KungFu-Fiesta


[99] Remote Sensing Change Detection via Weak Temporal Supervision cs.CV | cs.AIPDF

Xavier Bou, Elliot Vincent, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel

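TL;DR: This paper proposes a weak temporal supervision strategy for semantic change detection in remote sensing. It leverages additional temporal observations of existing single-temporal datasets to produce change detection training data without extra annotation. The method assumes that real bi-temporal pairs mostly contain no change, generates change examples by pairing images from different locations, and handles weak-label noise through object-aware change map generation and iterative refinement.

Details

Motivation: Semantic change detection in remote sensing suffers from scarce annotated data, and methods based on synthetic data or artificial change pairs generalize poorly out of domain. The work therefore uses additional temporal observations of existing single-temporal datasets to improve model performance without new annotations.

Result: Validated on extended versions of the FLAIR and IAILD aerial datasets, the method performs strongly in zero-shot and low-data regimes and shows scalability potential over large areas in France.

Insight: The innovation is the weak temporal supervision strategy itself: weakly labeled training data are generated from multi-temporal observations, and label noise is handled with object-aware change maps and iterative refinement, offering a new approach to change detection when data are scarce.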

Abstract: Semantic change detection in remote sensing aims to identify land cover changes between bi-temporal image pairs. Progress in this area has been limited by the scarcity of annotated datasets, as pixel-level annotation is costly and time-consuming. To address this, recent methods leverage synthetic data or generate artificial change pairs, but out-of-domain generalization remains limited. In this work, we introduce a weak temporal supervision strategy that leverages additional temporal observations of existing single-temporal datasets, without requiring any new annotations. Specifically, we extend single-date remote sensing datasets with new observations acquired at different times and train a change detection model by assuming that real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. To handle the inherent noise in these weak labels, we employ an object-aware change map generation and an iterative refinement process. We validate our approach on extended versions of the FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Lastly, we showcase results over large areas in France, highlighting the scalability potential of our method.
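
The pairing rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `observations` is a hypothetical mapping from a location identifier to image identifiers acquired at different dates, and the labels are the weak (noisy) labels implied by the abstract, not the refined ones produced by the paper's object-aware and iterative steps.

```python
import random


def make_weak_pairs(observations, num_change_pairs, seed=0):
    """Build weakly labeled bi-temporal pairs.

    observations: dict mapping location id -> list of image ids at
    different dates (hypothetical layout). Needs at least two locations.
    Returns a list of (image_a, image_b, weak_label) tuples, where
    weak_label 0 means "assumed no change" and 1 means "change".
    """
    rng = random.Random(seed)
    pairs = []
    # Same location, different dates: assumed to mostly contain no change.
    for images in observations.values():
        for i in range(len(images)):
            for j in range(i + 1, len(images)):
                pairs.append((images[i], images[j], 0))
    # Different locations: used as synthetic change examples.
    locations = sorted(observations)
    for _ in range(num_change_pairs):
        loc_a, loc_b = rng.sample(locations, 2)
        pairs.append(
            (rng.choice(observations[loc_a]), rng.choice(observations[loc_b]), 1)
        )
    return pairs


observations = {
    "site_a": ["a_2019", "a_2021"],
    "site_b": ["b_2019", "b_2021"],
}
pairs = make_weak_pairs(observations, num_change_pairs=3)
```

The no-change labels are only an assumption about the majority of pairs, which is why the abstract adds a noise-handling stage on top of this pairing.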


[100] BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Sunny Gupta, Shounak Das, Amit Sethi

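TL;DR: This paper proposes BiPrompt, a bilateral prompt optimization framework that mitigates spurious correlations in both the visual and textual modalities of vision-language models such as CLIP. Using structured attention-guided erasure on the visual side and balanced prompt normalization on the textual side, it jointly minimizes the conditional mutual information between spurious cues and predictions during test-time adaptation, steering the model toward causal, domain-invariant reasoning.

Details

Motivation: Existing debiasing methods usually target a single modality (visual or textual), which yields only partial robustness and unstable adaptation under distribution shift. This paper aims to address reliance on spurious features in both modalities at once for more comprehensive debiasing and robustness.

Result: Extensive evaluation on real-world and synthetic bias benchmarks shows improvements in both average and worst-group accuracy over prior test-time debiasing methods.

Insight: The innovation is a bilateral optimization framework. The visual side uses structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions, while the textual side introduces a learnable balanced prompt normalization mechanism that aligns class embeddings toward an isotropic semantic space. The method requires no retraining or domain supervision, offering a lightweight and effective path to trustworthy vision-language adaptation.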

Abstract: Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality, either visual or textual, leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain-invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
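
To give a feel for what re-centering class embeddings does, here is a minimal Python sketch. It is a fixed, non-learned stand-in: it subtracts the mean class embedding and rescales each vector to unit length. The paper describes a learnable mechanism whose exact form is not given in the abstract, so the function name and the mean-subtraction rule are assumptions for illustration only.

```python
import math


def recenter_and_normalize(embeddings):
    """Subtract the mean class embedding, then rescale to unit length.

    embeddings: list of equal-length lists of floats, one per class.
    A non-learned stand-in for a re-centering step (assumption).
    """
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    result = []
    for e in embeddings:
        centered = [x - m for x, m in zip(e, mean)]
        norm = math.sqrt(sum(c * c for c in centered))
        # Leave an all-zero vector untouched to avoid division by zero.
        result.append([c / norm for c in centered] if norm > 0 else centered)
    return result


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Two class embeddings that share a large common component.
raw = [[1.0, 0.9], [1.0, 1.1]]
balanced = recenter_and_normalize(raw)
```

Before re-centering, the two vectors are almost parallel because of the shared component; afterwards they point in opposite directions, so the remaining variation is what separates the classes.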


[101] NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation cs.CV | cs.AIPDF

Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song

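TL;DR: NextFlow is a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. Through a unified visual representation and an autoregressive architecture, it natively activates multimodal understanding and generation capabilities, including image editing, interleaved content generation, and video generation.

Details

Motivation: To bridge the inherent differences between modalities (text is strictly sequential, images are inherently hierarchical) so that one unified model can perform efficient multimodal understanding and generation, and to overcome the speed bottleneck of traditional raster-scan image generation.

Result: NextFlow achieves SOTA performance among unified models and rivals specialized diffusion baselines in visual quality. It generates 1024x1024 images in 5 seconds, far faster than comparable autoregressive models.

Insight: The innovation is to match each modality's nature by using next-token prediction for text and next-scale prediction for images, combined with a robust training recipe for multi-scale generation and a prefix-tuning strategy for reinforcement learning, which together enable efficient, high-quality unified multimodal modeling.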

Abstract: We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.


[102] Seeing the Unseen: Zooming in the Dark with Event Cameras cs.CV | cs.AIPDF

Dachun Kai, Zeyu Xiao, Huyue Zhu, Jiaxiao Wang, Yueyi Zhang

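TL;DR: This paper proposes RetinexEVSR, the first event-driven low-light video super-resolution framework, which combines high-contrast event signals with Retinex priors to restore high-resolution video from low-light, low-resolution input. It uses a bidirectional cross-modal fusion strategy to integrate event data with RGB frame information, and introduces an illumination-guided event enhancement module and an event-guided reflectance enhancement module to suppress artifacts and recover detail.

Details

Motivation: Existing low-light video super-resolution methods struggle to recover fine details because of limited contrast and insufficient high-frequency information. This paper uses the high-contrast signals of event cameras and Retinex priors to address the problem.

Result: The method achieves SOTA performance on three datasets. On the SDSD benchmark it gains up to 2.95 dB over prior event-based methods while reducing runtime by 65%.

Insight: The innovations include the event-driven framework, the bidirectional cross-modal fusion strategy, and the Retinex-based module design, which may transfer to other low-light vision tasks that need to combine event signals with RGB data.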

Abstract: This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality under low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method achieves a gain of up to 2.95 dB while reducing runtime by 65% compared to prior event-based methods. Code: https://github.com/DachunKai/RetinexEVSR.
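
The Retinex model referenced above treats an observed image as the product of reflectance and illumination. A minimal per-pixel Python sketch follows; it assumes the common simplification of estimating illumination as the maximum color channel, which is not necessarily how the paper derives its illumination maps.

```python
def retinex_decompose(pixel, eps=1e-6):
    """Split an RGB pixel into (reflectance, illumination).

    Uses the max-channel illumination estimate, a common simplification
    (assumption); `eps` guards against division by zero on black pixels.
    """
    illumination = max(max(pixel), eps)
    reflectance = tuple(c / illumination for c in pixel)
    return reflectance, illumination


def retinex_recompose(reflectance, illumination):
    """Inverse of the decomposition: observed = reflectance * illumination."""
    return tuple(r * illumination for r in reflectance)


dark_pixel = (0.2, 0.1, 0.05)
reflectance, illumination = retinex_decompose(dark_pixel)
restored = retinex_recompose(reflectance, illumination)
```

In this toy decomposition the reflectance keeps the color ratios of the pixel while the illumination carries its overall brightness, which is why illumination maps are a natural guide for enhancing low-light content.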


[103] Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion cs.CVPDF

Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li

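TL;DR: This paper proposes a training-free, systematic analysis pipeline for probing the function of each block in text-to-image models based on Multimodal Diffusion Transformers (MMDiT) and how the blocks interact with text conditions. Building on the findings, the authors propose training-free strategies that improve text alignment, enable precise editing, and accelerate inference, outperforming baselines on several benchmarks.

Details

Motivation: Existing work mainly analyzes specific components of MMDiT models (such as positional encoding and attention layers) and lacks a comprehensive understanding of how different blocks and their interactions with text conditions affect synthesis. This paper aims to fill that gap and clarify the internal mechanisms of MMDiT.

Result: On SD3.5, the method raises the T2I-Combench++ score from 56.92% to 63.00% and the GenEval score from 66.42% to 71.63% without sacrificing synthesis quality. It also outperforms various baselines on text-to-image generation, image editing, and inference acceleration.

Insight: The innovation is a training-free, systematic analysis pipeline that reveals where semantic information and fine detail emerge across MMDiT blocks, together with training-free enhancement strategies designed from those findings. Viewed objectively, the work couples model analysis tightly with performance improvement and offers a new way to understand and improve transformer-based diffusion models.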

Abstract: Recent breakthroughs in transformer-based diffusion models, particularly Multimodal Diffusion Transformer (MMDiT) models such as FLUX and Qwen Image, have enabled compelling text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods have tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block’s functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrate that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality. These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.


[104] FMVP: Masked Flow Matching for Adversarial Video Purification cs.CVPDF

Duoxun Tang, Xueyi Zhang, Chak Hin Wang, Xi Xiao, Dasen Dai

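TL;DR: This paper proposes FMVP, an adversarial video purification method that physically shatters adversarial structure with a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. It also designs a Frequency-Gated Loss (FGL) to decouple semantic content from adversarial noise, and proposes Attack-Aware and Generalist training paradigms for known and unknown threats, respectively. Experiments on UCF-101 and HMDB-51 show robust accuracy above 87% against PGD and above 89% against CW attacks, outperforming existing methods, along with superior robustness to adaptive attacks and zero-shot adversarial detection capability.

Details

Motivation: Video recognition models are vulnerable to adversarial attacks, and existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos often fails to recover faithful content, so the adversarial structure needs to be physically shattered.

Result: On UCF-101 and HMDB-51, FMVP reaches robust accuracy above 87% against PGD and above 89% against CW attacks, outperforming SOTA methods including DiffPure, Defense Patterns, Temporal Shuffling, and FlowPure. It also shows superior robustness against adaptive attacks (DiffHammer) and works as a zero-shot adversarial detector, with detection accuracies of 98% for PGD and 79% for CW attacks.

Insight: The innovations are: (1) a masking strategy that physically shatters global adversarial structure, combined with conditional flow matching for video inpainting; (2) a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity; and (3) Attack-Aware and Generalist training paradigms for known and unknown threats, respectively. Viewed objectively, the method combines masking, flow matching, and frequency decoupling into an efficient and robust approach to adversarial video purification.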

Abstract: Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification (FMVP). FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.
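
As a rough illustration of the two ingredients named above, the following Python sketch shows a masking step and the regression target used by a common form of conditional flow matching, where the path between a source sample and a target sample is a straight line. Whether FMVP uses this exact linear path, and how its masks are drawn, is not stated in the abstract; both are assumptions here, and the function names are hypothetical.

```python
def apply_mask(values, mask, fill=0.0):
    """Replace masked positions (mask entry truthy) with a fill value."""
    return [fill if m else v for v, m in zip(values, mask)]


def cfm_training_target(x0, x1, t):
    """Linear-path conditional flow matching pair for one sample.

    x0: source sample (e.g. noise), x1: target sample (e.g. clean data),
    t: time in [0, 1]. Returns (x_t, velocity), where x_t lies on the
    straight line from x0 to x1 and velocity = x1 - x0 is the value a
    velocity network would be trained to predict at (x_t, t).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


masked = apply_mask([1.0, 2.0, 3.0], mask=[0, 1, 0])
x_t, velocity = cfm_training_target([0.0, 0.0], [2.0, 4.0], t=0.5)
```

With a straight-line path the target velocity is constant in time, which is one reason flow matching is often associated with straighter sampling trajectories than diffusion-based purification.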


[105] SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection cs.CVPDF

Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu

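TL;DR: This paper proposes SLGNet, a parameter-efficient framework for multimodal object detection on RGB and infrared (IR) images. By combining hierarchical structural priors with language-guided modulation inside a frozen Vision Transformer (ViT) foundation model, it addresses the shortcomings of existing methods in cross-modal structural consistency and environmental awareness.

Details

Motivation: When transferring RGB-pretrained foundation models to multimodal detection, existing adapter-based methods often sacrifice cross-modal structural consistency and lose critical structural cues when domain gaps are large (e.g., in high-contrast or nighttime environments). In addition, conventional static multimodal fusion mechanisms lack environmental awareness, which limits adaptability and detection performance in complex, dynamic scenes.

Result: Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets show that SLGNet sets new state-of-the-art (SOTA) performance. On the LLVIP benchmark it achieves an mAP of 66.1 while reducing trainable parameters by about 87% compared with traditional full fine-tuning.

Insight: The main innovations are: (1) a Structure-Aware Adapter that extracts hierarchical structural representations from both modalities and dynamically injects them into the ViT to compensate for the structural degradation inherent in ViT backbones; and (2) a Language-Guided Modulation module that uses structured captions produced by a vision-language model (VLM) to dynamically recalibrate visual features, giving the model robust environmental awareness. Together they offer a parameter-efficient multimodal perception solution that balances structural consistency and environmental adaptability.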

Abstract: Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.


[106] VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation cs.CV | cs.LGPDF

Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song

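TL;DR: This paper targets the severe asynchronous policy conflicts that arise when Visual AutoRegressive (VAR) models are trained with reinforcement learning, which stem from heterogeneous input structures, and proposes an enhanced GRPO framework. By introducing a stabilizing intermediate reward, a dynamic time-step reweighting scheme, and a mask propagation algorithm derived from Reward Feedback Learning, the framework manages these conflicts and improves sample quality and alignment.

Details

Motivation: To resolve the asynchronous policy conflicts caused by heterogeneous input structures across the generation steps of VAR models, which lead to unstable training and poor alignment in reinforcement learning settings.

Result: Compared with the vanilla GRPO baseline, the proposed method achieves significant improvements in sample quality and objective alignment, enabling more robust and effective optimization of VAR models.

Insight: The innovation is to combine GRPO with explicit mechanisms for managing asynchronous conflicts, in particular a mask propagation algorithm that isolates optimization effects both spatially and temporally, offering a new approach to reinforcement learning for VAR models.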

Abstract: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
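
For readers unfamiliar with GRPO, its central step is to score each sample relative to the other samples generated for the same prompt. The Python sketch below shows that group-relative advantage in its commonly used form (reward minus group mean, divided by group standard deviation). It does not include the paper's intermediate reward, time-step reweighting, or mask propagation; the use of the population standard deviation and the `eps` constant are assumptions, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Returns (r - mean) / (std + eps) for each reward, so samples better
    than the group average get positive advantages and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the advantages are centered within the group, no separate value network is needed as a baseline, which is the main practical appeal of this formulation.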


[107] DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies cs.CVPDF

Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang

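TL;DR: DiffProxy is a new framework for recovering human meshes from multi-view images. It uses a diffusion model to generate dense proxies that bridge the domain gap between synthetic and real data. A multi-conditional mechanism generates multi-view consistent, pixel-aligned human proxies, and together with a hand refinement module and an uncertainty-aware test-time scaling method, the framework achieves state-of-the-art performance on multiple real-world benchmarks while being trained only on synthetic data.

Details

Motivation: To address two problems in multi-view human mesh recovery: imperfect annotations in real datasets bias training, while synthetic data suffer from a domain gap.

Result: The method achieves state-of-the-art performance on five real-world benchmarks and shows strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views.

Insight: The innovations include: diffusion-based generative priors that bridge synthetic training and real-world generalization; a multi-conditional mechanism that ensures multi-view consistency; a hand refinement module that enhances local detail; and uncertainty-aware test-time scaling that improves robustness. Viewed objectively, the generative approach makes effective use of the precise supervision in synthetic data and avoids the annotation noise of real data, making it an effective route to domain adaptation.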

Abstract: Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models’ training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is the use of diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization particularly on challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html


[108] TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation cs.CV | cs.AI | cs.LGPDF

Salim Khazem

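TL;DR: This paper proposes TopoLoRA-SAM, a topology-aware, parameter-efficient adaptation framework for binary semantic segmentation. It injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, adds a lightweight spatial convolutional adapter, and optionally applies topology-aware supervision via differentiable clDice, in order to adapt the SAM foundation segmenter efficiently to cross-domain segmentation of thin structures (such as retinal vessels) and noisy modalities (such as SAR imagery).

Details

Motivation: Foundation segmentation models such as SAM are hard to adapt to domain-specific semantic segmentation, especially for thin structures and noisy modalities, and full fine-tuning is computationally expensive and risks catastrophic forgetting.

Result: On five benchmarks (DRIVE, STARE, CHASE_DB1, Kvasir-SEG, SL-SSDD), TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice while training only about 5.2% of the model parameters (about 4.9M). On the challenging CHASE_DB1 dataset it substantially improves segmentation accuracy and robustness, matching or exceeding fully fine-tuned specialist models.

Insight: The innovation is to combine parameter-efficient LoRA adaptation with topology-aware supervision (clDice) and to add a lightweight spatial convolutional adapter that strengthens spatial feature modeling. This offers an effective way to adapt large pretrained foundation models efficiently and accurately to specific domains, especially tasks that are sensitive to topological structure.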

Abstract: Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA-SAM, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git
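
The LoRA idea used above keeps a pretrained weight matrix frozen and learns only a low-rank correction. A minimal pure-Python sketch of one adapted linear layer follows; the `alpha / r` scaling is the convention from the original LoRA formulation, while the tiny matrices and the helper names are illustrative assumptions rather than the paper's configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in).
    A: trainable down-projection, shape (r, d_in).
    B: trainable up-projection, shape (d_out, r).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, r):
    """Number of trainable parameters added by one LoRA pair."""
    return r * d_in + d_out * r


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight for the demo
A = [[1.0, 1.0]]               # rank r = 1
x = [3.0, 4.0]

untrained = lora_forward(W, A, [[0.0], [0.0]], x)  # B starts at zero
adapted = lora_forward(W, A, [[1.0], [2.0]], x)
```

Initializing `B` to zero means the adapted layer starts out identical to the frozen one, and the parameter count grows with the rank `r` rather than with `d_out * d_in`, which is how a small trainable fraction such as the one reported above becomes possible.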


[109] InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams cs.CVPDF

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao

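TL;DR: This paper proposes InfiniteVGGT, a causal visual geometry transformer for endless video streams. It implements a rolling memory through a bounded yet adaptive and perpetually expressive KV cache, paired with a training-free, attention-agnostic pruning strategy that discards obsolete information, so that it supports infinite-horizon streaming input while maintaining long-term stability. The paper also introduces the Long3D benchmark, with sequences of about 10,000 frames, to rigorously evaluate long-term 3D geometry estimation.

Details

Motivation: In large-scale, persistent 3D visual geometry understanding, the batch processing of offline models is unsuitable for live systems, while existing streaming architectures either cannot support truly infinite-horizon input or suffer catastrophic drift over long sequences.

Result: InfiniteVGGT outperforms existing streaming methods in long-term stability and supports infinite-horizon streaming. Its performance is validated on the authors' new Long3D benchmark (continuous sequences of about 10,000 frames), which provides the first rigorous evaluation platform for long-term 3D geometry understanding.

Insight: The core innovation is the rolling memory mechanism, which uses a bounded adaptive KV cache and a training-free, attention-agnostic pruning strategy to process endless video streams efficiently and stably. The Long3D benchmark also fills a gap in the evaluation of extremely long continuous sequences.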

Abstract: The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively “rolling” the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
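
The notion of a bounded cache that rolls forward can be sketched as follows. This is only an illustration of the data structure: the class name and the `score_fn` argument are hypothetical, and the paper's actual pruning criterion is not described in the abstract. The demo uses recency as the score, which reduces the cache to a sliding window.

```python
class RollingKVCache:
    """Bounded cache: when over capacity, evict the lowest-scoring entry."""

    def __init__(self, capacity, score_fn):
        self.capacity = capacity
        # Hypothetical scoring rule: maps a cached key to an importance
        # score that does not depend on attention weights.
        self.score_fn = score_fn
        self.entries = []  # (key, value) pairs in arrival order

    def append(self, key, value):
        self.entries.append((key, value))
        if len(self.entries) > self.capacity:
            victim = min(
                range(len(self.entries)),
                key=lambda i: self.score_fn(self.entries[i][0]),
            )
            del self.entries[victim]

    def keys(self):
        return [key for key, _ in self.entries]


# With recency as the score, the cache keeps the three newest frames.
cache = RollingKVCache(capacity=3, score_fn=lambda frame_index: frame_index)
for frame_index in range(10):
    cache.append(frame_index, f"features-{frame_index}")
```

Memory use stays constant no matter how many frames arrive, which is the property that makes infinite-horizon streaming feasible; the quality of the result then depends entirely on how well the scoring rule identifies obsolete entries.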


[110] Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery cs.CVPDF

Tom Burgert, Leonard Hackel, Paolo Rota, Begüm Demir

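TL;DR: This paper proposes GeoRank, a novel regularization method for contrastive self-supervised learning on multispectral remote sensing imagery. It improves on prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space.

Details

Motivation: Applying self-supervised learning to multispectral remote sensing imagery brings unique challenges and opportunities because of the geographical and temporal variability of the data, which calls for a method that integrates geographical information effectively.

Result: GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves a range of contrastive self-supervised learning algorithms (e.g., BYOL, DINO), achieving strong results on the relevant benchmarks.

Insight: The innovation is a regularization strategy that embeds geographical relationships by directly optimizing spherical distances. The paper also systematically studies key adaptations of contrastive self-supervised learning for multispectral remote sensing imagery, including the effectiveness of data augmentations, the impact of dataset cardinality and image size, and the task dependency of temporal views, giving the field a comprehensive analysis.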

Abstract: Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. Code is available at https://github.com/tomburgert/georank.
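
The spherical distance between two geotagged images can be computed with the haversine formula, and ranking samples by that distance gives the kind of ordering a rank-based regularizer could compare against distances in feature space. The Python sketch below covers only this geographic side; the function names and the Earth-radius constant are assumptions, and GeoRank's actual loss is not specified in the abstract.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def rank_by_distance(anchor, others):
    """Indices of `others` sorted from geographically nearest to farthest."""
    return sorted(
        range(len(others)),
        key=lambda i: great_circle_km(anchor[0], anchor[1], others[i][0], others[i][1]),
    )


quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
order = rank_by_distance((0.0, 0.0), [(0.0, 90.0), (0.0, 10.0), (0.0, 45.0)])
```

Working with ranks rather than raw kilometers makes the signal insensitive to the absolute scale of distances, which can vary widely between regional and global datasets.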


[111] Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping cs.CVPDF

Saurabh Kaushik, Lalit Maurya, Beth Tellman

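TL;DR: This paper proposes the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), a model for flood inundation mapping. It combines the pretrained encoder of the Prithvi geo-foundation model with a parallel CNN residual branch enhanced by Convolutional Attention Modules, performing multi-scale, multi-level fusion that captures critical local details while preserving long-range dependencies.

Details

Motivation: On downstream tasks such as flood mapping, existing Geo-Foundation Models (GFMs) struggle to outperform baselines such as U-Net, mainly because they fail to capture critical local details. This paper aims to address that limitation.

Result: The model achieves state-of-the-art results on two flood mapping datasets, Sen1Flood11 and FloodPlanet. On the Sen1Flood11 test set, it reaches an IoU of 83.41, ahead of the original Prithvi (82.50) and other major GFMs. On the hold-out test site, it reaches an IoU of 81.37, well above the baseline U-Net (70.57) and the original Prithvi (72.42). On FloodPlanet, it reaches an IoU of 64.70, again surpassing the baseline U-Net (60.14) and other GFMs.

Insight: The innovation is a simple yet effective fusion encoder architecture that enables fast fine-tuning through adapters and fuses the global representation power of a foundation model with the local detail capture of a CNN. The approach has broad potential for segmentation tasks in which multi-channel, multi-modal data are complementary and local details are critical.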

Abstract: Geo-Foundation Models (GFMs) have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, for flood mapping on the Sen1Flood11 dataset as a downstream task, GFMs struggle to outperform the baseline U-Net, highlighting the models’ limitation in capturing critical local nuances. To address this, we present the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), which integrates the Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi-CAFE enables fast and efficient fine-tuning through adapters in Prithvi and performs multi-scale, multi-level fusion with CNN features, capturing critical local details while preserving long-range dependencies. We achieve state-of-the-art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi-CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, spectralGPT 81.02). The improvement is even more pronounced on the hold-out test site, where Prithvi-CAFE achieves an IoU of 81.37 compared to the baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi-CAFE also surpasses the baseline U-Net and other GFMs, achieving an IoU of 64.70 compared to U-Net (60.14), TerraMind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel and multi-modal data provide complementary information and local details are critical. The code is released at https://github.com/Sk-2103/Prithvi-CAFE
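
The IoU figures quoted above follow the standard intersection-over-union definition for segmentation masks. A minimal Python sketch for flat binary masks is shown below; the convention of returning 1.0 when both masks are empty is an assumption, and the abstract's values appear to be expressed as percentages, i.e. this ratio multiplied by 100.

```python
def binary_iou(pred, target):
    """Intersection over union for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Both masks empty: treat as a perfect match (assumed convention).
    return intersection / union if union else 1.0


# One pixel is flood in both masks; three pixels are flood in at least one.
iou = binary_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Because the union sits in the denominator, IoU penalizes both missed flood pixels and false alarms, which makes it stricter than pixel accuracy on scenes where water covers a small share of the image.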


[112] Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding cs.CVPDF

Jingming He, Chongyi Li, Shiqi Wang, Sam Kwong

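TL;DR: This paper proposes a joint enhancement framework for 3D semantic Gaussian modeling. It captures fine-grained shape details with an anisotropic 3D Gaussian Chebyshev descriptor, adaptively adjusts Gaussian allocation and spherical harmonics using local semantic and shape signals, and introduces a cross-scene knowledge transfer module to speed up convergence. The method improves segmentation accuracy and rendering quality together on multiple datasets while maintaining high rendering frame rates.

Details

Motivation: Existing 3D Gaussian modeling methods often separate the semantic and rendering branches, rely only on 2D supervision while ignoring 3D Gaussian geometry, and drive adaptation solely by rendering gradients, which works poorly in texture-sparse regions. The paper addresses these problems by jointly enhancing the semantic and rendering branches.

Result: Experiments on multiple datasets show improvements in both segmentation accuracy and rendering quality while maintaining high rendering frame rates; the paper does not explicitly state whether it reaches SOTA.

Insight: The novel elements are: an anisotropic 3D Gaussian Chebyshev descriptor that strengthens the representation of shape details; adaptive Gaussian allocation that combines local semantic and shape signals; and a cross-scene knowledge transfer module that promotes fast convergence and robust representations. These designs may carry over to multi-task 3D reconstruction and semantic understanding.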

Abstract: Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace-Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.


[113] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes cs.CVPDF

Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang

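TL;DR: Talk2Move is a reinforcement-learning-based diffusion framework that applies spatial geometric transformations (such as translation, rotation, and scaling) to objects in a scene according to text instructions. It explores geometric actions with Group Relative Policy Optimization, aligns transformations with the language description through a spatial reward model, and improves efficiency with off-policy step evaluation and active step sampling, without requiring costly paired data.

Details

Motivation: Existing text-based editing methods struggle to perform object-level geometric transformations (such as translation, rotation, and scaling), mainly because paired supervision is scarce and pixel-level optimization is limited. The paper tackles the challenge of precisely manipulating objects in a scene through natural-language instructions.

Result: Experiments on curated benchmarks show that Talk2Move outperforms existing text-guided editing methods in both spatial accuracy and scene coherence, achieving precise, consistent, and semantically faithful object transformations.

Insight: The novel elements are: using Group Relative Policy Optimization to explore actions from input images and lightweight textual variations, with no paired data; object-centric spatial rewards that directly evaluate displacement, rotation, and scaling, making transformations interpretable and coherent; and off-policy step evaluation with active step sampling, which improves learning efficiency by focusing on informative transformation stages.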

Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with the linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
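GRPO scores each rollout relative to the other rollouts sampled for the same input instead of relying on a learned value function. The sketch below shows that group-relative normalization under stated assumptions: one scalar reward per rollout, population standard deviation as the scale, and made-up reward values. It is not Talk2Move's implementation, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical spatial rewards for four rollouts of the same edit instruction.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Rollouts above the group mean get a positive advantage and those below get a negative one, so the policy update needs only relative rankings within the group.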


[114] VINO: A Unified Visual Generator with Interleaved OmniModal Context cs.CVPDF

Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai

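TL;DR: VINO is a unified visual generator that performs image and video generation and editing within a single framework. It uses a shared diffusion backbone conditioned on text, images, and videos, avoiding separate modules per modality, and thereby supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content.

Details

Motivation: Visual generation currently relies on separate models for specific tasks or modalities, which makes systems complex and hard to manage in a unified way. VINO addresses this by building a unified visual generation framework that integrates image and video generation and editing, simplifying the architecture and improving multi-task efficiency.

Result: Across diverse generation and editing benchmarks, VINO shows strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity editing, supporting its effectiveness as a unified visual generator.

Insight: VINO's novelty lies in coupling a vision-language model with a Multimodal Diffusion Transformer and encoding multimodal inputs as interleaved conditioning tokens that guide the diffusion process, avoiding modality-specific architectural components. In addition, its multi-stage training pipeline progressively expands a video generation base model into a unified multi-task generator, offering a practical path toward scalable unified visual generation and highlighting the potential of interleaved in-context computation as a foundation for general-purpose visual creation.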

Abstract: We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.


[115] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors cs.CVPDF

Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik

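TL;DR: This paper proposes ExposeAnyone, a fully self-supervised face forgery detection method built on a diffusion model that generates expression sequences from audio. The core idea is that, once the model is personalized to a specific person with a reference set, it can compute the identity distance between a suspected video and the personalized subject via diffusion reconstruction error, enabling zero-shot, person-of-interest face forgery detection.

Details

Motivation: Current state-of-the-art face forgery detectors rely mainly on supervised training with existing deepfakes or pseudo-fakes, which causes overfitting to specific forgery patterns and poor generalization to unseen manipulations. Existing self-supervised methods, in turn, struggle to learn discriminative representations from self-supervision alone. The paper therefore aims to develop a fully self-supervised detector that generalizes well.

Result: Extensive experiments on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets show that the method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC. It can also detect Sora2-generated videos, where previous methods perform poorly, and it is highly robust to image corruptions such as blur and compression.

Insight: The novelty is using a personalized audio-to-expression diffusion model as a zero-shot forgery detector, with reconstruction error serving as an identity distance measure. Viewed objectively, this is a new detection paradigm based on the reconstruction ability of a generative model: it does not depend on known forgery traces but detects forgeries by checking whether the content is consistent with the personalized identity. This offers a new approach to detecting unknown forgeries and shows the potential of diffusion models in security applications.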

Abstract: Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting the applicability in real-world face forgery detection.


cs.HC [Back]

[116] A Platform for Interactive AI Character Experiences cs.HC | cs.AI | cs.CL | cs.GRPDF

Rafael Wampfler, Chen Yang, Dillon Elste, Nikola Kovacevic, Philine Witzig

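TL;DR: This paper presents a platform for creating interactive AI characters. It integrates conversational AI, character integrity maintenance, emotion management, knowledge and memory, voice synthesis, animation generation, and other technologies to deliver immersive, story-driven conversational experiences.

Details

Motivation: Although foundation models, prompt engineering, and fine-tuning have advanced individual challenges such as conversational AI and character integrity, combining these technologies to create interactive characters remains an open problem.

Result: As a proof of concept, the platform implements Digital Einstein, which lets users converse with a digital representation of Albert Einstein about his life, research, and persona, demonstrating the system's feasibility and flexibility.

Insight: The main novelty is unifying multiple AI components into a single, easy-to-adapt platform, paving the way for immersive character experiences and moving from isolated technical advances to system-level integration.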

Abstract: From movie characters to modern science fiction - bringing characters into interactive, story-driven conversations has captured imaginations across generations. Achieving this vision is highly challenging and requires much more than just language modeling. It involves numerous complex AI challenges, such as conversational AI, maintaining character integrity, managing personality and emotions, handling knowledge and memory, synthesizing voice, generating animations, enabling real-world interactions, and integration with physical environments. Recent advancements in the development of foundation models, prompt engineering, and fine-tuning for downstream tasks have enabled researchers to address these individual challenges. However, combining these technologies for interactive characters remains an open problem. We present a system and platform for conveniently designing believable digital characters, enabling a conversational and story-driven experience while providing solutions to all of the technical challenges. As a proof-of-concept, we introduce Digital Einstein, which allows users to engage in conversations with a digital representation of Albert Einstein about his life, research, and persona. While Digital Einstein exemplifies our methods for a specific character, our system is flexible and generalizes to any story-driven or conversational character. By unifying these diverse AI components into a single, easy-to-adapt platform, our work paves the way for immersive character experiences, turning the dream of lifelike, story-based interactions into a reality.


cs.CY [Back]

Hongkun Yang, Lionel Z. Wang, Wei Fan, Yiran Hu, Lixu Wang

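TL;DR: This paper introduces AppellateGen, a benchmark for appellate (second-instance) legal judgment generation comprising 7,351 case pairs. Models must reason over the initial verdict and evidentiary updates to generate legally binding judgments, thereby modeling the causal dependency between trial stages.

Details

Motivation: Existing research on legal judgment generation focuses mainly on first-instance trials and relies on static fact-to-verdict mappings, neglecting the dialectical nature of appellate (second-instance) review. A benchmark and models dedicated to appellate legal judgment generation are therefore needed.

Result: Experiments show that the proposed judicial Standard Operating Procedure-based Legal Multi-Agent System (SLMAS) improves logical consistency, but the complexity of appellate reasoning remains a substantial challenge for current large language models.

Insight: The novelty lies in introducing a benchmark for appellate legal judgment generation and in proposing SLMAS, which simulates judicial workflows by decomposing generation into discrete stages of issue identification, retrieval, and drafting, so as to better handle the dialectical reasoning and causal dependencies of the appellate stage.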

Abstract: Legal judgment generation is a critical task in legal intelligence. However, existing research in legal judgment generation has predominantly focused on first-instance trials, relying on static fact-to-verdict mappings while neglecting the dialectical nature of appellate (second-instance) review. To address this, we introduce AppellateGen, a benchmark for second-instance legal judgment generation comprising 7,351 case pairs. The task requires models to draft legally binding judgments by reasoning over the initial verdict and evidentiary updates, thereby modeling the causal dependency between trial stages. We further propose a judicial Standard Operating Procedure (SOP)-based Legal Multi-Agent System (SLMAS) to simulate judicial workflows, which decomposes the generation process into discrete stages of issue identification, retrieval, and drafting. Experimental results indicate that while SLMAS improves logical consistency, the complexity of appellate reasoning remains a substantial challenge for current LLMs. The dataset and code are publicly available at: https://anonymous.4open.science/r/AppellateGen-5763.


cs.SD [Back]

[118] SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning cs.SD | cs.CL | eess.ASPDF

Peidong Wang, Zhiming Ma, Xin Dai, Yongkang Liu, Shi Feng

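TL;DR: This paper proposes SAFE-QAQ, an end-to-end, slow-thinking audio-text fraud detection framework based on reinforcement learning. The framework processes audio signals directly, avoiding the errors introduced by the ASR transcription that traditional methods depend on, and uses rule-based slow-thinking reward mechanisms to guide hierarchical reasoning that captures fine-grained audio details for identifying fraud patterns. It also introduces dynamic risk assessment, supporting early fraud detection and prevention during live calls.

Details

Motivation: Existing fraud detection methods rely mainly on transcribed text, which makes them vulnerable to ASR errors and blind to key acoustic cues such as vocal tone and environmental context, limiting their effectiveness against complex deceptive strategies.

Result: Experiments on the TeleAntiFraud-Bench benchmark show that SAFE-QAQ achieves significant improvements over existing methods on several key dimensions, including accuracy, inference efficiency, and real-time processing capability. The system is deployed and analyzes more than 70,000 calls per day, automating complex fraud detection and reducing human workload and financial losses.

Insight: The main novel elements are: 1) an end-to-end framework that processes audio directly, eliminating the impact of transcription errors; 2) rule-based slow-thinking reward mechanisms that systematically guide the model, through hierarchical reasoning, to capture fine-grained audio details for identifying fraud patterns; and 3) a dynamic real-time risk assessment framework that supports early detection and prevention. Viewed objectively, combining the “slow-thinking” reasoning paradigm with reinforcement learning rewards for audio fraud detection is a new idea that underlines the practical value of end-to-end multimodal (audio-text) learning and real-time dynamic decision-making.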

Abstract: Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues like vocal tone and environmental context. This limits their effectiveness against complex deceptive strategies. To address these challenges, we propose SAFE-QAQ, an end-to-end comprehensive framework for audio-based slow-thinking fraud detection. First, the SAFE-QAQ framework eliminates the impact of transcription errors on detection performance. Second, we propose rule-based slow-thinking reward mechanisms that systematically guide the system to identify fraud-indicative patterns by accurately capturing fine-grained audio details through hierarchical reasoning processes. In addition, our framework introduces dynamic risk assessment during live calls, enabling early detection and prevention of fraud. Experiments on the TeleAntiFraud-Bench demonstrate that SAFE-QAQ achieves dramatic improvements over existing methods in multiple key dimensions, including accuracy, inference efficiency, and real-time processing capabilities. Currently deployed and analyzing over 70,000 calls daily, SAFE-QAQ effectively automates complex fraud detection, reducing human workload and financial losses. Code: https://anonymous.4open.science/r/SAFE-QAQ.


[119] MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning cs.SD | cs.AI | cs.CV | cs.MM | eess.ASPDF

Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo

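TL;DR: MM-Sonate is a multimodal, controllable framework for joint audio-video generation. It achieves strict linguistic and temporal alignment through a unified instruction-phoneme input and integrates zero-shot voice cloning. The framework is based on flow matching, decouples speaker identity from linguistic content through a timbre injection mechanism, and improves audio fidelity with a noise-based negative conditioning strategy.

Details

Motivation: Existing joint audio-video generation models fall short in fine-grained acoustic control, particularly for identity-preserving speech, and cannot perform zero-shot voice cloning within a unified framework. The paper aims to overcome the temporal misalignment caused by cascaded generation and the lack of joint synthesis capability.

Result: The method sets new state-of-the-art (SOTA) performance on joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while its voice cloning fidelity is comparable to specialized text-to-speech (TTS) systems.

Insight: The novel elements are: a unified instruction-phoneme input that enforces alignment; a timbre injection mechanism that decouples identity from content to support zero-shot cloning; and a noise-based negative conditioning strategy for multimodal settings that uses natural noise priors to improve audio quality. These offer reusable architectural ideas for controllable multimodal generation.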

Abstract: Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.


cs.SE [Back]

[120] SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving cs.SE | cs.CLPDF

Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang

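TL;DR: This paper presents SWE-Lego, a supervised fine-tuning recipe for software engineering issue resolving. Through a high-quality dataset, a refined fine-tuning procedure, and test-time scaling, it explores how far lightweight supervised fine-tuning alone can go toward state-of-the-art performance on software engineering tasks.

Details

Motivation: Prevailing methods rely on complex training paradigms (such as mid-training, supervised fine-tuning, reinforcement learning, and their combinations). The paper explores how to make progress on software engineering issue resolving with a lightweight, supervised-fine-tuning-only approach.

Result: On SWE-bench Verified, using only the first two building blocks, SWE-Lego-Qwen3-8B reaches 42.2% and SWE-Lego-Qwen3-32B reaches 52.6%, state of the art among open-source models of comparable size. With test-time scaling, performance rises further to 49.6% and 58.8%.

Insight: The novel elements are: 1) the high-quality SWE-Lego dataset, which combines real and synthetic data; 2) a refined supervised fine-tuning procedure with error masking and a difficulty-based curriculum; and 3) an evaluation and improvement of test-time scaling on top of supervised fine-tuning, where a verifier significantly boosts model performance.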

Abstract: We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-of-the-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k high-quality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building blocks alone, SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. Based on a well-trained verifier, SWE-Lego models can be significantly boosted, for example from 42.2% to 49.6% and from 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
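Verifier-based test-time scaling is often realized as best-of-N selection: sample several candidate solutions and keep the one the verifier rates highest. The sketch below assumes that reading of TTS@16 (16 candidates, a scalar verifier score); the abstract does not spell out the selection rule, and `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical stand-ins rather than SWE-Lego's API.

```python
import random


def best_of_n(generate, verify, n=16):
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(i) for i in range(n)]
    scores = [verify(c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins: a "patch" is just a number, and the verifier prefers values near 0.7.
rng = random.Random(0)


def toy_generate(_index):
    return rng.random()


def toy_verify(patch):
    return -abs(patch - 0.7)


patch, score = best_of_n(toy_generate, toy_verify, n=16)
print(f"selected patch={patch:.3f} verifier score={score:.3f}")
```

The gain from this scheme depends on how well verifier scores track actual issue resolution, which is why the abstract stresses a well-trained verifier.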


physics.app-ph [Back]

[121] Image Synthesis Using Spintronic Deep Convolutional Generative Adversarial Network physics.app-ph | cs.CVPDF

Saumya Gupta, Abhinandan, Venkatesh vadde, Bhaskaran Muralidharan, Abhishek Sharma

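TL;DR: This paper proposes a hybrid CMOS-spintronic deep convolutional generative adversarial network (DCGAN) architecture for synthetic image generation. Spintronic hardware implements the deconvolution, convolution, and activation layers of the DCGAN, and a hardware-aware design restructures the generator's deconvolution layers as zero-padded convolutions so that they integrate seamlessly with a skyrmion-based 6-bit synapse crossbar array. Nonlinear activations are implemented with hybrid CMOS-domain-wall ReLU and Leaky ReLU units, where the tunable Leaky ReLU unit consumes as little as 0.192 pJ. The model adapts to both grayscale (Fashion MNIST) and color (Anime Face) datasets, achieving FID scores of 27.5 and 45.4 respectively, with low testing and training energy.

Details

Motivation: The computational requirements of generative adversarial networks (GANs) exceed the limits of conventional von Neumann architectures, motivating more energy-efficient alternatives such as neuromorphic spintronics.

Result: The model achieves a Fréchet Inception Distance (FID) of 27.5 on Fashion MNIST and 45.4 on Anime Face. Testing energy is 4.9 nJ/image and 24.72 nJ/image respectively, and training energy is 14.97 nJ/image and 74.7 nJ/image respectively.

Insight: The main novelty is a hardware-aware hybrid CMOS-spintronic DCGAN architecture in which the generator's deconvolution layers are restructured as zero-padded convolutions, enabling seamless integration with a skyrmion-based 6-bit synapse crossbar array. The paper also designs a tunable Leaky ReLU activation unit based on domain-wall position coding, with continuous resistance states and a piecewise uniaxial parabolic anisotropy profile, achieving very low energy consumption (0.192 pJ). The work offers new ideas and a hardware design example for implementing complex generative models efficiently on neuromorphic hardware.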

Abstract: The computational requirements of generative adversarial networks (GANs) exceed the limit of conventional von Neumann architectures, necessitating energy-efficient alternatives such as neuromorphic spintronics. This work presents a hybrid CMOS-spintronic deep convolutional generative adversarial network (DCGAN) architecture for synthetic image generation. The proposed generative vision model follows the standard framework, leveraging adversarial training of the generator and discriminator with our designed spintronic hardware for the deconvolution, convolution, and activation layers of the DCGAN architecture. To enable a hardware-aware spintronic implementation, the generator's deconvolution layers are restructured as zero-padded convolutions, allowing seamless integration with a 6-bit skyrmion-based synapse in a crossbar, without compromising training performance. Nonlinear activation functions are implemented using hybrid CMOS-domain-wall-based rectified linear unit (ReLU) and Leaky ReLU units. Our proposed tunable Leaky ReLU employs domain-wall position-coded, continuous resistance states and a piecewise uniaxial parabolic anisotropy profile with a parallel MTJ readout, exhibiting energy consumption of 0.192 pJ. Our spintronic DCGAN model demonstrates adaptability across both grayscale and colored datasets, achieving Fréchet Inception Distances (FID) of 27.5 for the Fashion MNIST and 45.4 for the Anime Face datasets, with testing energy (training energy) of 4.9 nJ (14.97 nJ/image) and 24.72 nJ (74.7 nJ/image), respectively.
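Restructuring a deconvolution (transposed convolution) as a zero-padded convolution rests on a standard equivalence: insert zeros between input samples, pad the borders, and slide the flipped kernel over the result. The 1D sketch below illustrates that equivalence only; the function names and toy values are assumptions, and it does not model the paper's crossbar mapping or quantized weights.

```python
def transposed_conv1d(x, w, stride):
    """Reference transposed convolution: scatter each input into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj
    return out


def zero_padded_conv1d(x, w, stride):
    """Same result via zero insertion, zero padding, and an ordinary sliding window."""
    k = len(w)
    upsampled = []
    for i, xi in enumerate(x):
        upsampled.append(xi)
        if i < len(x) - 1:
            upsampled.extend([0.0] * (stride - 1))  # zeros between samples
    padded = [0.0] * (k - 1) + upsampled + [0.0] * (k - 1)
    flipped = w[::-1]  # the sliding window uses the reversed kernel
    return [
        sum(padded[m + j] * flipped[j] for j in range(k))
        for m in range(len(padded) - k + 1)
    ]


x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
a = transposed_conv1d(x, w, stride=2)
b = zero_padded_conv1d(x, w, stride=2)
print(a)
print(b)
```

Both functions produce the same sequence, which is what lets a convolution-only hardware block stand in for a deconvolution layer; with stride 1 the zero-insertion step is a no-op and only the border padding remains.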


cs.DL [Back]

[122] A Global Atlas of Digital Dermatology to Map Innovation and Disparities cs.DL | cs.AI | cs.CVPDF

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Lea Habermacher

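TL;DR: This paper presents SkinMap, a multi-modal framework for a comprehensive audit of the entire public data basis of dermatology. It unifies more than 1.1 million skin-condition images into a queryable semantic atlas and quantifies informational novelty, dataset redundancy, and representation gaps across demographics and diagnoses. The study finds that although dataset sizes have grown exponentially, informational novelty has plateaued, and darker skin types, pediatric patients, and many rare diseases are severely underrepresented.

Details

Motivation: AI in dermatology promises democratized access to healthcare, but model reliability depends on the quality and comprehensiveness of the underlying data. The field currently lacks quantitative key performance indicators for measuring whether a new dataset expands clinical coverage or merely repeats what is already known.

Result: The study quantifies how informational novelty in dermatology datasets changes over time, dataset redundancy, and representation gaps across demographics and diagnoses. Specific findings include: darker skin tones (Fitzpatrick V-VI) account for only 5.8% of images and pediatric patients for only 3.0%, while many rare diseases and phenotype combinations are sparsely represented.

Insight: The novelty is the first unified, queryable semantic atlas of dermatology data (SkinMap) together with a systematic quantification of coverage bias and redundancy across datasets. This provides infrastructure and methodology for measuring data blind spots and for steering future strategic data acquisition toward coverage gaps in clinical space.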

Abstract: The adoption of artificial intelligence in dermatology promises democratized access to healthcare, but model reliability depends on the quality and comprehensiveness of the data fueling these models. Despite rapid growth in publicly available dermatology images, the field lacks quantitative key performance indicators to measure whether new datasets expand clinical coverage or merely replicate what is already known. Here we present SkinMap, a multi-modal framework for the first comprehensive audit of the field’s entire data basis. We unify the publicly available dermatology datasets into a single, queryable semantic atlas comprising more than 1.1 million images of skin conditions and quantify (i) informational novelty over time, (ii) dataset redundancy, and (iii) representation gaps across demographics and diagnoses. Despite exponential growth in dataset sizes, informational novelty across time has somewhat plateaued: Some clusters, such as common neoplasms on fair skin, are densely populated, while underrepresented skin types and many rare diseases remain unaddressed. We further identify structural gaps in coverage: Darker skin tones (Fitzpatrick V-VI) constitute only 5.8% of images and pediatric patients only 3.0%, while many rare diseases and phenotype combinations remain sparsely represented. SkinMap provides infrastructure to measure blind spots and steer strategic data acquisition toward undercovered regions of clinical space.


cs.AI [Back]

[123] CogCanvas: Compression-Resistant Cognitive Artifacts for Long LLM Conversations cs.AI | cs.CL | cs.IRPDF

Tao An

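TL;DR: This paper proposes CogCanvas, a training-free framework that addresses the fundamental tension between context window limits and information fidelity in long LLM conversations. It extracts verbatim-grounded cognitive artifacts (such as decisions, facts, and reminders) from conversation turns and organizes them into a temporal-aware graph for compression-resistant retrieval.

Details

Motivation: Existing approaches to long conversations (such as truncation and summarization) either discard early information or lose nuanced details, and cannot balance context window limits against information fidelity. The paper aims to resolve this tension.

Result: On the LoCoMo benchmark, CogCanvas achieves 34.7% overall accuracy, ahead of RAG (25.6%) and GraphRAG (13.7%). The advantage is largest on temporal reasoning (31.5% vs. 9.3% for RAG and 5.0% for GraphRAG), and on multi-hop causal reasoning the pass rate reaches 81.0% (vs. 40.0% for GraphRAG). On controlled benchmarks, recall reaches 97.5% with 93.0% exact-match preservation.

Insight: The novelty is a training-free method that extracts verbatim-grounded cognitive artifacts and organizes them into a temporal-aware graph, enabling compression-resistant retrieval of long-conversation information. It gives practitioners an immediately deployable alternative that clearly outperforms standard baselines, especially on temporal and multi-hop reasoning.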

Abstract: Large language models face a fundamental tension between context window limits and information fidelity in long conversations. Existing approaches (truncation and summarization) either discard early information or lose nuanced details. We introduce CogCanvas, a training-free framework that extracts verbatim-grounded cognitive artifacts (decisions, facts, reminders) from conversation turns and organizes them into a temporal-aware graph for compression-resistant retrieval. On the LoCoMo benchmark, CogCanvas achieves 34.7% overall accuracy, outperforming RAG (25.6%, +9.1pp) and GraphRAG (13.7%, +21.0pp). The advantage is most pronounced on temporal reasoning: 31.5% vs. 9.3% (RAG) and 5.0% (GraphRAG), a +530% relative improvement over GraphRAG. On multi-hop causal reasoning, CogCanvas achieves 81.0% pass rate vs. 40.0% for GraphRAG (+41.0pp). Controlled benchmarks show 97.5% recall (+78.5pp vs. summarization) with 93.0% exact match preservation. While heavily optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: approximately 92%), our training-free approach provides practitioners with an immediately deployable alternative that significantly outperforms standard baselines. Code and data: https://github.com/tao-hpu/cog-canvas.
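The abstract mixes absolute gains in percentage points (pp) with a relative improvement in percent, and the two are easy to confuse. The short sketch below recomputes both from the quoted accuracies; the helper names are illustrative. It indicates that the +530% figure corresponds to the GraphRAG temporal-reasoning baseline of 5.0%, while the same comparison against RAG (9.3%) gives a smaller relative gain.

```python
def pp_gain(new, old):
    """Absolute gain in percentage points."""
    return new - old


def relative_gain_pct(new, old):
    """Relative gain as a percentage of the baseline."""
    return 100.0 * (new - old) / old


print(f"overall vs RAG:       {pp_gain(34.7, 25.6):+.1f}pp")
print(f"overall vs GraphRAG:  {pp_gain(34.7, 13.7):+.1f}pp")
print(f"temporal vs GraphRAG: {relative_gain_pct(31.5, 5.0):+.0f}% relative")
print(f"temporal vs RAG:      {relative_gain_pct(31.5, 9.3):+.0f}% relative")
```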


[124] Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix cs.AI | cs.CL | cs.LGPDF

Fanzhe Fu

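TL;DR: This paper proposes Project Aletheia, a cognitive physics framework for quantifying the “cognitive conviction” of reasoning models. The framework quantifies a model's depth of belief during reasoning by inverting the judge's confusion matrix with Tikhonov regularization, and introduces a Synthetic Proxy Protocol for validation. A preliminary study suggests that reasoning models may exhibit “defensive overthinking,” and the paper proposes an Aligned Conviction Score to check safety.

Details

Motivation: Current AGI evaluation paradigms face an epistemological crisis: static benchmarks measure breadth of knowledge but cannot quantify depth of belief. The paper extends the CHOKE phenomenon framework to quantify “cognitive conviction” in System 2 reasoning models, addressing the inability of existing evaluations to measure depth of belief.

Result: A preliminary pilot study on 2025 baseline models (e.g., DeepSeek-R1, OpenAI o1) suggests that reasoning models act as a “cognitive buffer” but may exhibit “defensive overthinking” under adversarial pressure.

Insight: The main novel elements are: 1) a general framework for quantifying cognitive conviction based on a regularized inverse confusion matrix; 2) a Synthetic Proxy Protocol that avoids relying on opaque private data for validation; and 3) an Aligned Conviction Score that ties depth of belief to safety considerations, offering a blueprint for measuring AI scientific integrity.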

Abstract: In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify “Cognitive Conviction” in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge’s confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a “cognitive buffer,” they may exhibit “Defensive OverThinking” under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.
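Inverting a judge's confusion matrix with Tikhonov regularization can be read as a ridge-penalized least-squares problem: given the label distribution q that the judge reports and a confusion matrix C with q ≈ C p, estimate the underlying distribution p. The sketch below is one such formulation under stated assumptions (a 2x2 column-stochastic C, a penalty λ‖p‖², made-up numbers); the abstract does not give the paper's exact objective, and `tikhonov_invert_2x2` is a hypothetical name.

```python
def tikhonov_invert_2x2(C, q, lam):
    """Solve min_p ||C p - q||^2 + lam * ||p||^2 for a 2x2 matrix C."""
    # Normal equations: (C^T C + lam * I) p = C^T q
    a = C[0][0] * C[0][0] + C[1][0] * C[1][0] + lam
    b = C[0][0] * C[0][1] + C[1][0] * C[1][1]
    d = C[0][1] * C[0][1] + C[1][1] * C[1][1] + lam
    r0 = C[0][0] * q[0] + C[1][0] * q[1]
    r1 = C[0][1] * q[0] + C[1][1] * q[1]
    det = a * d - b * b  # nonzero when lam > 0 or C is invertible
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]


# Assumed judge: C[i][j] = P(judge outputs i | true class j); columns sum to 1.
C = [[0.9, 0.2],
     [0.1, 0.8]]
true_p = [0.6, 0.4]
q = [C[0][0] * true_p[0] + C[0][1] * true_p[1],
     C[1][0] * true_p[0] + C[1][1] * true_p[1]]

for lam in (0.0, 1e-3, 1e-1):
    p_hat = tikhonov_invert_2x2(C, q, lam)
    print(f"lambda={lam:g} -> p_hat=({p_hat[0]:.4f}, {p_hat[1]:.4f})")
```

Larger λ trades a small bias toward zero for stability when C is close to singular, which is the usual reason to regularize the inversion instead of applying the plain inverse of C.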


[125] Simulated Reasoning is Reasoning cs.AI | cs.CLPDF

Hendrik Kempt, Alon Lavie

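TL;DR: This paper examines the ability of foundation models (FMs) to reason by simulating the process of “thinking out loud,” testing the generated pathways, and iterating on them, which challenges the assumption that symbolic reasoning is necessary. It argues that such simulated reasoning can solve problems on its own or with few-shot learning but is brittle because it lacks common sense and grounding, and it discusses the philosophical significance, safety considerations, and obsolescence of the “stochastic parrot” metaphor.

Details

Motivation: The aim is to reassess the nature of reasoning and its necessary conditions, to show how foundation models achieve reasoning by imitating the thinking process, and to point out how this differs fundamentally from human reasoning (for example, in lacking common sense and grounding), in order to encourage deeper thinking about model safety and robustness.

Result: The paper reports no quantitative results or benchmarks. Based on philosophical analysis and theoretical argument, it emphasizes that the simulated reasoning of foundation models is effective at solving tasks but notes that its brittleness may create safety risks.

Insight: The novelty is the concept of “simulated reasoning” in foundation models, which challenges the traditional symbolic reasoning paradigm, along with the argument that the “stochastic parrot” metaphor no longer applies. Viewed objectively, the paper advances philosophical reflection on AI reasoning research and stresses the importance of designing for model safety and robustness.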

Abstract: Reasoning has long been understood as a pathway between stages of understanding. Proper reasoning leads to understanding of a given subject. This reasoning was conceptualized as a process of understanding in a particular way, i.e., “symbolic reasoning”. Foundational Models (FM) demonstrate that this is not a necessary condition for many reasoning tasks: they can “reason” by way of imitating the process of “thinking out loud”, testing the produced pathways, and iterating on these pathways on their own. This leads to some form of reasoning that can solve problems on its own or with few-shot learning, but appears fundamentally different from human reasoning due to its lack of grounding and common sense, leading to brittleness of the reasoning process. These insights promise to substantially alter our assessment of reasoning and its necessary conditions, but also inform the approaches to safety and robust defences against this brittleness of FMs. This paper offers and discusses several philosophical interpretations of this phenomenon, argues that the previously apt metaphor of the “stochastic parrot” has lost its relevance and thus should be abandoned, and reflects on different normative elements in the safety- and appropriateness-considerations emerging from these reasoning models and their growing capacity.


[126] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning cs.AI | cs.CLPDF

Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai

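TL;DR: EverMemOS is a self-organizing memory operating system that addresses the difficulty large language models have in sustaining coherent behavior over long-term interactions because of limited context windows. Through an engram-inspired lifecycle, it converts dialogue streams into memory cells, organizes them into thematic memory scenes, and guides retrieval to supply the context needed for downstream reasoning.

Details

Motivation: Existing memory systems typically store isolated records and retrieve fragments, which makes it hard to consolidate evolving user states and resolve conflicts. A system is needed that manages long-term memory in a structured way to support coherent reasoning.

Result: On the LoCoMo and LongMemEval benchmarks, EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. The paper also reports a profile study on PersonaMem v2 that illustrates chat-oriented capabilities such as user profiling and Foresight.

Insight: The novel elements include an engram-inspired memory lifecycle that converts dialogue streams into memory cells containing episodic traces, atomic facts, and time-bounded Foresight signals, and semantic consolidation that organizes these cells into thematic memory scenes for self-organization and efficient retrieval.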

Abstract: Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat-oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind-AI/EverMemOS.


[127] XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging cs.AI | cs.CVPDF

Midhat Urooj, Ayan Banerjee, Sandeep Gupta

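TL;DR: This paper proposes XAI-MeD, an explainable medical AI framework that integrates clinical expert knowledge into deep learning through a neuro-symbolic architecture. It aims to improve robustness under distribution shift, increase sensitivity to rare classes, and provide transparent, clinically aligned explanations.

Details

Motivation: The paper addresses key challenges in medical AI: explainability, domain generalization, and rare-class reliability, given that deep models fail under real-world distribution shifts and are biased against infrequent clinical conditions.

Result: The framework is evaluated on four challenging tasks, including seizure onset zone localization from resting-state fMRI and diabetic retinopathy grading across 6 multicenter datasets. Results show a 6% gain in cross-domain generalization and a 10% improvement in rare-class F1 score, far ahead of current state-of-the-art deep learning baselines.

Insight: The novelty is encoding clinical expertise as machine-checkable, class-specific rules and quantifying them through weighted feature satisfaction scores, forming a symbolic reasoning branch that complements neural predictions. In addition, an adaptive routing mechanism guided by Entropy Imbalance Gain and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty.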

Abstract: Explainability, domain generalization, and rare-class reliability are critical challenges in medical AI, where deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions. This paper introduces XAI-MeD, an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro-symbolic architecture. XAI-MeD is designed to improve robustness under distribution shift, enhance rare-class sensitivity, and deliver transparent, clinically aligned interpretations. The framework encodes clinical expertise as logical connectives over atomic medical propositions, transforming them into machine-checkable, class-specific rules. Their diagnostic utility is quantified through weighted feature satisfaction scores, enabling a symbolic reasoning branch that complements neural predictions. A confidence-weighted fusion integrates symbolic and deep outputs, while a Hunt-inspired adaptive routing mechanism guided by Entropy Imbalance Gain (EIG) and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty. We evaluate XAI-MeD across diverse modalities on four challenging tasks, including (i) Seizure Onset Zone (SOZ) localization from rs-fMRI and (ii) Diabetic Retinopathy grading across 6 multicenter datasets. The results demonstrate substantial performance improvements, including 6 percent gains in cross-domain generalization and a 10 percent improvement in rare-class F1 score, far outperforming state-of-the-art deep learning baselines. Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers, ensuring robustness to distribution shifts. XAI-MeD thus provides a principled, clinically faithful, and interpretable approach to multimodal medical AI.


cs.RO

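[128] DST-Calib: A Dual-Path, Self-Supervised, Target-Free LiDAR-Camera Extrinsic Calibration Network cs.RO | cs.CV

Zhiwei Huang, Yanwei Fu, Yi Zhou, Xieyuanli Chen, Qijun Chen

TL;DR: This paper proposes DST-Calib, a dual-path, self-supervised, target-free LiDAR-camera extrinsic calibration network that addresses the reliance of existing methods on calibration targets or specific static scenes and their limited generalization. It uses a novel double-sided data augmentation strategy to generate multi-perspective camera views and a dual-path self-supervised calibration framework that builds difference maps to explicitly correlate LiDAR and camera features, improving calibration accuracy while reducing model complexity.

Details

Motivation: Existing LiDAR-camera extrinsic calibration methods typically rely on handcrafted calibration targets or specific static scenes, which limits their adaptability and deployment in real-world autonomous robotic applications. In addition, the conventional single-sided data augmentation strategy causes a significant degradation in generalization.

Result: Extensive experiments on five public benchmark datasets and a self-recorded dataset show that the method significantly outperforms existing approaches in generalizability and achieves higher calibration accuracy.

Insight: Contributions include a double-sided data augmentation technique that mitigates generalization degradation; a dual-path self-supervised calibration framework that reduces dependence on high-precision ground-truth labels and supports fully adaptive online calibration; and a difference map construction process that replaces the traditional dual-branch feature extraction, explicitly correlating cross-modal features to improve accuracy and reduce complexity.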

Abstract: LiDAR-camera extrinsic calibration is essential for multi-modal data fusion in robotic perception systems. However, existing approaches typically rely on handcrafted calibration targets (e.g., checkerboards) or specific, static scene types, limiting their adaptability and deployment in real-world autonomous and robotic applications. This article presents the first self-supervised LiDAR-camera extrinsic calibration network that operates in an online fashion and eliminates the need for specific calibration targets. We first identify a significant generalization degradation problem in prior methods, caused by the conventional single-sided data augmentation strategy. To overcome this limitation, we propose a novel double-sided data augmentation technique that generates multi-perspective camera views using estimated depth maps, thereby enhancing robustness and diversity during training. Built upon this augmentation strategy, we design a dual-path, self-supervised calibration framework that reduces the dependence on high-precision ground truth labels and supports fully adaptive online calibration. Furthermore, to improve cross-modal feature association, we replace the traditional dual-branch feature extraction design with a difference map construction process that explicitly correlates LiDAR and camera features. This not only enhances calibration accuracy but also reduces model complexity. Extensive experiments conducted on five public benchmark datasets, as well as our own recorded dataset, demonstrate that the proposed method significantly outperforms existing approaches in terms of generalizability.


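[129] AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving cs.RO | cs.CV

Yanhao Wu, Haoyang Zhang, Fei He, Rui Wu, Congpei Qiu

TL;DR: This paper proposes AlignDrive, a new end-to-end autonomous driving planning framework that explicitly conditions longitudinal planning on the drive path, addressing the lateral-longitudinal coordination failures and redundant encoding of static information found in existing parallel planning methods. It adopts a cascaded design that predicts longitudinal displacements along the path and introduces a planning-oriented data augmentation strategy to improve safety. It sets a new SOTA on the Bench2Drive benchmark.

Details

Motivation: Existing SOTA end-to-end autonomous driving models decouple lateral and longitudinal prediction into parallel branches at the planning stage. This can leave the planned path and speed uncoordinated and underuses the drive path as a prior for longitudinal planning, so static information is encoded redundantly. The paper targets these coordination and efficiency problems.

Result: On the challenging Bench2Drive benchmark, the method sets a new SOTA with a driving score of 89.07 and a success rate of 73.18%, significantly improving planning coordination and safety.

Insight: The main innovation is a cascaded framework with a path-conditioned formulation that explicitly incorporates the drive path into longitudinal planning, so the model predicts longitudinal displacements along the path rather than full 2D trajectory waypoints. This simplifies longitudinal reasoning and tightens its coupling with lateral planning. The planning-oriented data augmentation strategy further improves robustness in complex scenarios by simulating rare safety-critical events such as vehicle cut-ins.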

Abstract: End-to-end autonomous driving has rapidly progressed, enabling joint perception and planning in complex environments. In the planning stage, state-of-the-art (SOTA) end-to-end autonomous driving models decouple planning into parallel lateral and longitudinal predictions. While effective, this parallel design can lead to i) coordination failures between the planned path and speed, and ii) underutilization of the drive path as a prior for longitudinal planning, thus redundantly encoding static information. To address this, we propose a novel cascaded framework that explicitly conditions longitudinal planning on the drive path, enabling coordinated and collision-aware lateral and longitudinal planning. Specifically, we introduce a path-conditioned formulation that explicitly incorporates the drive path into longitudinal planning. Building on this, the model predicts longitudinal displacements along the drive path rather than full 2D trajectory waypoints. This design simplifies longitudinal reasoning and more tightly couples it with lateral planning. Additionally, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events, such as vehicle cut-ins, by adding agents and relabeling longitudinal targets to avoid collision. Evaluated on the challenging Bench2Drive benchmark, our method sets a new SOTA, achieving a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety.
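
To make the path-conditioned idea concrete, the sketch below converts scalar longitudinal displacements into 2D waypoints by walking along a polyline drive path. The function name `point_at_arclength`, the polyline representation, and the example numbers are illustrative assumptions; the abstract does not describe how AlignDrive represents paths internally.

```python
import math


def point_at_arclength(path, s):
    """Return the 2D point at arc length `s` along the polyline `path`.

    `path` is a list of (x, y) vertices; `s` is clamped to the path's extent.
    """
    if s <= 0.0:
        return path[0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        segment = math.hypot(x1 - x0, y1 - y0)
        if segment > 0.0 and s <= segment:
            t = s / segment  # fraction of the way along this segment
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        s -= segment  # consume this segment and continue
    return path[-1]


# Hypothetical drive path: 10 m straight ahead, then a 10 m leg to the left.
drive_path = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
# Hypothetical predicted displacements along the path for three future steps.
displacements = [2.0, 9.0, 14.0]
waypoints = [point_at_arclength(drive_path, s) for s in displacements]
```

Because every waypoint is obtained from the path itself, the longitudinal output cannot leave the planned lateral geometry, which is the coordination property the abstract emphasizes.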


cs.CR

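[130] OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs cs.CR | cs.CV

Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao

TL;DR: This paper introduces OpenRT, an open-source, modular, high-throughput red-teaming framework for comprehensive safety evaluation of multimodal large language models (MLLMs). By introducing an adversarial kernel, the framework separates five key dimensions into modules (model integration, dataset management, attack strategies, judging methods, and evaluation metrics), standardizing attack interfaces and enabling systematic scaling across models.

Details

Motivation: The rapid integration of MLLMs into critical applications is hindered by persistent safety vulnerabilities, while existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability needed for systematic evaluation.

Result: An extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro) exposes critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models showing average attack success rates as high as 49.14%. The study also finds that reasoning models are not inherently more robust against complex multi-turn jailbreaks.

Insight: The innovation is a unified, modular red-teaming framework that decouples attack logic from execution through standardized interfaces and an asynchronous runtime, supporting large-scale, systematic safety evaluation. The framework integrates 37 attack methods and provides sustainable, extensible infrastructure for AI safety research and standardization.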

Abstract: The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.


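[131] Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization cs.CR | cs.AI | cs.CV | cs.LG

Jiwei Guan, Haibo Jin, Haohan Wang

TL;DR: This paper proposes ZO-SPSA, a black-box jailbreak attack on large vision-language models (LVLMs) that uses zeroth-order optimization with simultaneous perturbation stochastic approximation to generate adversarial inputs without knowledge of model internals, bypassing safety mechanisms and triggering harmful outputs.

Details

Motivation: Existing white-box attacks require full model access, are computationally expensive, and produce adversarial examples with insufficient transferability, making them hard to apply in real-world black-box settings. The paper addresses these limitations by exploring effective attacks on LVLMs using only input-output interactions.

Result: Evaluated on three LVLMs (InstructBLIP, LLaVA, and MiniGPT-4), ZO-SPSA achieves its highest jailbreak success rate of 83.0% on InstructBLIP, with imperceptible perturbations comparable to those of white-box methods. Adversarial examples generated from MiniGPT-4 also transfer strongly to other LVLMs, with an attack success rate (ASR) reaching 64.18%.

Insight: The innovation is the first application of ZO-SPSA optimization to black-box jailbreak attacks on LVLMs, yielding gradient-free, model-agnostic adversarial example generation with low resource requirements. The method exposes the fragility of current LVLM safety mechanisms under realistic black-box attacks and offers a new perspective for evaluating and strengthening model robustness.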

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model accessibility, suffer from high computing costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA, and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
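
The core estimator behind zeroth-order optimization with SPSA can be sketched on a harmless toy objective. The code below shows only the generic two-query SPSA gradient estimate applied to a quadratic; the function names, step sizes, and toy loss are assumptions for illustration and do not reproduce the paper's attack objective or constraints.

```python
import random


def spsa_gradient(loss, x, c=0.01, rng=random):
    """Estimate the gradient of `loss` at `x` from two function evaluations."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = (loss(x_plus) - loss(x_minus)) / (2.0 * c)
    # Dividing by delta_i equals multiplying by it, because delta_i is +/-1.
    return [diff * di for di in delta]


def spsa_minimize(loss, x0, steps=200, lr=0.1, c=0.01, seed=0):
    """Plain gradient descent driven by SPSA estimates (no true gradients)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss, x, c=c, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def toy_loss(x):
    """Stand-in black-box objective with its minimum at (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


x_final = spsa_minimize(toy_loss, [0.0, 0.0, 0.0])
```

Each step costs two queries regardless of input dimension, which is why the approach suits settings where only input-output access is available.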


eess.IV

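[132] Placenta Accreta Spectrum Detection using Multimodal Deep Learning eess.IV | cs.AI | cs.CV | cs.LG

Sumaiya Ali, Areej Alhothali, Sameera Albasri, Ohoud Alzamzami, Ahmed Abduljabbar

TL;DR: This study develops and validates a multimodal deep learning framework for detecting Placenta Accreta Spectrum (PAS). The framework uses an intermediate feature-level fusion architecture to integrate 3D magnetic resonance imaging (MRI) and 2D ultrasound (US) scans, aiming to improve the accuracy of prenatal diagnosis.

Details

Motivation: PAS is a life-threatening obstetric complication, and early, accurate prenatal diagnosis is essential to reduce maternal and neonatal risks. Diagnosis from a single imaging modality has limitations, so multimodal information must be integrated to improve detection performance.

Result: On an independent test set, the multimodal fusion model achieves an accuracy of 92.5% and an AUC of 0.927, significantly outperforming the MRI-only (82.5%, AUC 0.825) and US-only (87.5%, AUC 0.879) unimodal models and reaching state-of-the-art performance on this task.

Insight: The innovation is an intermediate feature-level fusion architecture that combines modality-specific feature extractors (a 3D DenseNet121-Vision Transformer for MRI and a 2D ResNet50 for US), effectively integrating the complementary diagnostic information in MRI and US and offering a reusable template for multimodal medical image analysis.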

Abstract: Placenta Accreta Spectrum (PAS) is a life-threatening obstetric complication involving abnormal placental invasion into the uterine wall. Early and accurate prenatal diagnosis is essential to reduce maternal and neonatal risks. This study aimed to develop and validate a deep learning framework that enhances PAS detection by integrating multiple imaging modalities. A multimodal deep learning model was designed using an intermediate feature-level fusion architecture combining 3D Magnetic Resonance Imaging (MRI) and 2D Ultrasound (US) scans. Unimodal feature extractors, a 3D DenseNet121-Vision Transformer for MRI and a 2D ResNet50 for US, were selected after systematic comparative analysis. Curated datasets comprising 1,293 MRI and 1,143 US scans were used to train the unimodal models, and paired samples of patient-matched MRI-US scans were isolated for multimodal model development and evaluation. On an independent test set, the multimodal fusion model achieved superior performance, with an accuracy of 92.5% and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.927, outperforming the MRI-only (82.5%, AUC 0.825) and US-only (87.5%, AUC 0.879) models. Integrating MRI and US features provides complementary diagnostic information, demonstrating strong potential to enhance prenatal risk assessment and improve patient outcomes.


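[133] MetaFormer-driven Encoding Network for Robust Medical Semantic Segmentation eess.IV | cs.CV

Le-Anh Tran, Chung Nguyen Tran, Nhan Cach Dang, Anh Le Van Quoc, Jordi Carrabina

TL;DR: This paper proposes MFEnNet, an efficient medical image segmentation framework that introduces the MetaFormer architecture into the encoding stage of U-Net to reduce computational cost while preserving segmentation accuracy. It replaces standard self-attention with pooling transformer blocks and combines Swish activation with spatial pyramid pooling, achieving accuracy comparable to advanced models on multiple medical segmentation benchmarks at a significantly lower computational cost.

Details

Motivation: Existing advanced medical image segmentation models have complex architectures and high compute requirements, which makes them hard to deploy in resource-constrained clinical settings. The goal is an efficient yet accurate segmentation framework.

Result: On multiple medical segmentation benchmarks, MFEnNet attains accuracy competitive with state-of-the-art (SOTA) models while significantly reducing computational cost.

Insight: Contributions include combining the MetaFormer architectural abstraction with U-Net, using pooling transformer blocks to reduce the computational complexity of self-attention, and integrating Swish activation and spatial pyramid pooling to improve training and multi-scale feature extraction. The lightweight architecture balances performance and efficiency, offering a practical solution for medical edge-computing scenarios.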

Abstract: Semantic segmentation is crucial for medical image analysis, enabling precise disease diagnosis and treatment planning. However, many advanced models employ complex architectures, limiting their use in resource-constrained clinical settings. This paper proposes MFEnNet, an efficient medical image segmentation framework that incorporates MetaFormer in the encoding phase of the U-Net backbone. MetaFormer, an architectural abstraction of vision transformers, provides a versatile alternative to convolutional neural networks by transforming tokenized image patches into sequences for global context modeling. To mitigate the substantial computational cost associated with self-attention, the proposed framework replaces conventional transformer modules with pooling transformer blocks, thereby achieving effective global feature aggregation at reduced complexity. In addition, Swish activation is used to achieve smoother gradients and faster convergence, while spatial pyramid pooling is incorporated at the bottleneck to improve multi-scale feature extraction. Comprehensive experiments on different medical segmentation benchmarks demonstrate that the proposed MFEnNet approach attains competitive accuracy while significantly lowering computational cost compared to state-of-the-art models. The source code for this work is available at https://github.com/tranleanh/mfennet.
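
A pooling-based token mixer, the kind of component that can stand in for self-attention, is simple enough to sketch in plain Python. The version below follows the PoolFormer-style convention of returning the local average minus the input (so that an outer residual connection restores the input); whether MFEnNet uses exactly this form, a 1D window, or this window size is an assumption, and the released repository should be treated as authoritative.

```python
def pooling_token_mixer(tokens, window=3):
    """Mix each token with its neighbours by average pooling over a 1D window.

    `tokens` is a list of equal-length feature vectors. The input is subtracted
    from the pooled value, assuming the caller adds a residual connection.
    """
    half = window // 2
    count = len(tokens)
    mixed = []
    for i in range(count):
        lo, hi = max(0, i - half), min(count, i + half + 1)
        neighbourhood = tokens[lo:hi]  # window is truncated at the borders
        dim = len(tokens[i])
        average = [
            sum(token[d] for token in neighbourhood) / len(neighbourhood)
            for d in range(dim)
        ]
        mixed.append([a - x for a, x in zip(average, tokens[i])])
    return mixed


# A spike in the middle token is spread to its neighbours.
mixed = pooling_token_mixer([[0.0], [3.0], [0.0]])
```

The cost is linear in the number of tokens, in contrast to the quadratic cost of full self-attention, which is the efficiency argument the abstract makes.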


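[134] YODA: Yet Another One-step Diffusion-based Video Compressor eess.IV | cs.CV

Xingchen Li, Junzhe Zhang, Junqi Shi, Ming Lu, Zhan Ma

TL;DR: YODA is a one-step diffusion-based video compressor that embeds multiscale features from temporal references to exploit spatial-temporal correlations and uses a linear Diffusion Transformer for efficient one-step denoising, achieving state-of-the-art performance on perceptual quality metrics.

Details

Motivation: Existing one-step diffusion models for video compression neglect temporal dependencies. YODA aims for a more compact video representation by exploiting spatial-temporal correlations.

Result: On perceptual quality metrics including LPIPS, DISTS, FID, and KID, YODA outperforms both traditional and deep-learning methods, reaching state-of-the-art performance.

Insight: The innovation is the use of multiscale temporal reference features to strengthen the latent representation, together with a linear Diffusion Transformer that streamlines one-step denoising, which together improve the perceptual quality of compressed video.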

Abstract: While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA (Yet Another One-step Diffusion-based Video Compressor), which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at https://github.com/NJUVISION/YODA.


cs.LG

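[135] Reliability Under Randomness: An Empirical Analysis of Sparse and Dense Language Models Across Decoding Temperatures cs.LG | cs.CL

Kabir Grover

TL;DR: This paper empirically analyzes how the reliability of sparse and dense language models differs across decoding temperatures, focusing on the output stability of sparse Mixture-of-Experts models under stochastic decoding.

Details

Motivation: As sparse MoE architectures become common in large language models, the paper asks whether the interaction between conditional computation and temperature-based sampling harms output stability and reduces reliability relative to dense architectures.

Result: Experiments on deterministic arithmetic reasoning tasks show that the instruction-tuned sparse model (Mixtral-8x7B) is as stable as the dense instruction-tuned model (Qwen2.5-3B) across all decoding temperatures, while the sparse base model without instruction tuning (OLMoE-7B) degrades systematically as temperature rises.

Insight: The paper shows that, for deterministic tasks, instruction tuning rather than architectural sparsity is the primary determinant of robustness to decoding randomness. This offers a key insight for safely deploying sparse models in reliability-critical applications.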

Abstract: The increasing prevalence of sparse Mixture-of-Experts (MoE) architectures in large language models raises important questions regarding their reliability under stochastic decoding. While conditional computation enables substantial gains in computational efficiency, it remains unclear whether the interaction between sparse routing and temperature-based sampling compromises output stability relative to dense architectures. This work investigates whether conditional computation in MoE models amplifies decoding-induced randomness, leading to reduced reliability as temperature increases. We evaluate three representative models: OLMoE-7B (sparse base), Mixtral-8x7B (sparse instruction-tuned), and Qwen2.5-3B (dense instruction-tuned) on deterministic arithmetic reasoning tasks with objectively verifiable answers. Experiments span four decoding configurations, ranging from greedy decoding to T=1.0. Our evaluation encompasses accuracy, format compliance, output consistency across repeated generations, and confidence metrics, totaling 9,360 model generations. Results demonstrate that the sparse instruction-tuned model exhibits stability comparable to the dense instruction-tuned model across all decoding temperatures, while the sparse base model shows systematic degradation as temperature increases. These findings indicate that instruction tuning, rather than architectural sparsity, is the primary determinant of robustness to decoding randomness on deterministic tasks. We discuss the implications of these results for deploying sparse language models in reliability-critical applications, highlighting scenarios in which sparse architectures can be safely adopted without sacrificing output stability.
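
Two quantities from this setup are easy to illustrate: how temperature reshapes a token distribution, and a simple consistency score over repeated generations. The consistency definition used here (share of generations that match the most frequent output) is an assumption for illustration; the abstract does not state the paper's exact metric.

```python
import math
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature` (> 0)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def consistency(outputs):
    """Fraction of generations equal to the most common output (assumed metric)."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature, peakier
flat = softmax_with_temperature(logits, 1.0)   # higher temperature, flatter
score = consistency(["42", "42", "41", "42"])  # 3 of 4 generations agree
```

Lower temperatures concentrate probability on the top token, so repeated generations agree more often; the paper's question is whether sparse routing changes how quickly that agreement erodes as temperature rises.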


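[136] Entropy-Aligned Decoding of LMs for Better Writing and Reasoning cs.LG | cs.CL

Kareem Ahmed, Sameer Singh

TL;DR: This paper proposes EPIC, a hyperparameter-free decoding method that incorporates the entropy of future trajectories into language model decoding. It explicitly regulates uncertainty at every generation step so that the entropy of the sampling distribution aligns with the data uncertainty, improving generation quality.

Details

Motivation: Existing decoding algorithms rely on greedy heuristics, which produce text that is myopically distorted, homogeneous, repetitive, and incoherent. EPIC addresses these limitations through entropy-aligned decoding.

Result: On creative writing and summarization tasks, EPIC consistently beats widely used decoding strategies in LM-as-judge preference win rates, and automatic metrics show that it produces more diverse generations and more faithful summaries. On mathematical reasoning, EPIC also outperforms all baselines.

Insight: EPIC's innovation is Entropy-Aware Lazy Gumbel-Max sampling, which makes decoding both exact and efficient. Its sampling distribution is empirically aligned with the entropy of the underlying data distribution, suggesting a new route to better generation quality and reasoning in language models.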

Abstract: Language models (LMs) are trained on billions of tokens in an attempt to recover the true language distribution. Still, vanilla random sampling from LMs yields low quality generations. Decoding algorithms attempt to restrict the LM distribution to a set of high-probability continuations, but rely on greedy heuristics that introduce myopic distortions, yielding sentences that are homogeneous, repetitive and incoherent. In this paper, we introduce EPIC, a hyperparameter-free decoding approach that incorporates the entropy of future trajectories into LM decoding. EPIC explicitly regulates the amount of uncertainty expressed at every step of generation, aligning the sampling distribution’s entropy to the aleatoric (data) uncertainty. Through Entropy-Aware Lazy Gumbel-Max sampling, EPIC manages to be exact, while also being efficient, requiring only a sublinear number of entropy evaluations per step. Unlike current baselines, EPIC yields sampling distributions that are empirically well-aligned with the entropy of the underlying data distribution. Across creative writing and summarization tasks, EPIC consistently improves LM-as-judge preference win-rates over widely used decoding strategies. These preference gains are complemented by automatic metrics, showing that EPIC produces more diverse generations and more faithful summaries. We also evaluate EPIC on mathematical reasoning, where it outperforms all baselines.
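
EPIC's sampler builds on the Gumbel-Max trick, which draws an exact sample from a categorical distribution by adding Gumbel noise to log-probabilities and taking the argmax. The sketch below shows only that standard trick; it is not the entropy-aware lazy variant described in the abstract, and the function name and example distribution are illustrative.

```python
import math
import random


def gumbel_max_sample(probs, rng):
    """Draw one index from `probs` via argmax of log-probability plus Gumbel noise."""
    best_index, best_value = -1, -math.inf
    for index, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability items can never win
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        value = math.log(p) + gumbel
        if value > best_value:
            best_index, best_value = index, value
    return best_index


rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
draws = [gumbel_max_sample(probs, rng) for _ in range(20000)]
freq0 = draws.count(0) / len(draws)  # should be close to 0.7
```

This plain version evaluates every candidate at each step; a "lazy" scheme, as the abstract's name suggests, would aim to avoid perturbing candidates that cannot win.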


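[137] HyperCLOVA X 8B Omni cs.LG | cs.AI | cs.CL | cs.SD

NAVER Cloud HyperCLOVA X Team

TL;DR: This report introduces HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family, supporting text, audio, and vision as both inputs and outputs. The model unifies modalities through a shared next-token prediction interface and uses vision and audio encoders to inject continuous embeddings for fine-grained understanding, with the goal of serving as a practical omnimodal assistant.

Details

Motivation: Conventional multimodal models typically use separate modality-specific pipelines. The report aims to build a single unified model that supports understanding and generation across arbitrary modalities, advancing practical omnimodal assistants.

Result: Empirical evaluations show that, across diverse input-output combinations spanning text, audio, and vision, the model performs competitively with comparably sized models in both Korean and English.

Insight: The main innovation is consolidating multimodal understanding and generation into a single model rather than separate pipelines, processing interleaved multimodal sequences through a shared next-token prediction interface and injecting continuous encoder embeddings for fine-grained understanding. This provides a feasible pathfinding point toward unified any-to-any omnimodal models.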

Abstract: In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.


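[138] Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting cs.LG | cs.AI | cs.CL

Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan

TL;DR: This paper proposes Entropy-Adaptive Fine-Tuning (EAFT), a method for the catastrophic forgetting common in supervised fine-tuning (SFT). It uses token-level entropy as a gating mechanism to distinguish epistemic uncertainty from knowledge conflict, significantly mitigating the degradation of general capabilities while preserving downstream task performance.

Details

Motivation: SFT is the standard paradigm for domain adaptation but often causes catastrophic forgetting, whereas on-policy reinforcement learning (RL) effectively preserves general capabilities. The authors trace this discrepancy to a distributional gap: RL aligns with the model's internal belief, while SFT forces the model to fit external supervision. The result is "Confident Conflict" tokens, where the model is highly confident but its prediction conflicts with the ground-truth label, triggering destructive gradient updates.

Result: Extensive experiments on the Qwen and GLM model series (4B to 32B parameters) across mathematical, medical, and agentic domains show that EAFT matches the downstream performance of standard SFT while significantly reducing the degradation of general capabilities.

Insight: The core innovation is using token-level entropy, rather than prediction probability alone, as the gating signal that separates uncertain samples (which should be learned) from confident-conflict samples (whose gradients should be suppressed). This offers a new and effective view of forgetting in SFT: adapt training based on the model's internal confidence (entropy) instead of relying only on the external supervision signal.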

Abstract: Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model’s internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as “Confident Conflicts”: tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.
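
The gating idea can be sketched for a single token. In the code below the gate is the token's predictive entropy normalized by its maximum, and it multiplies the usual cross-entropy loss; this particular gate is an assumption made for illustration, since the abstract does not give EAFT's exact formula. In a real training loop the gate would be treated as a constant (no gradient through it).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_gated_loss(logits, target):
    """Return (gated loss, gate, cross-entropy) for one token position."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    gate = entropy / math.log(len(probs))  # normalized entropy in [0, 1]
    cross_entropy = -math.log(probs[target])
    return gate * cross_entropy, gate, cross_entropy


# Confident conflict: the model is nearly certain of token 0, but the label is 1.
conflict_loss, conflict_gate, conflict_ce = entropy_gated_loss([10.0, 0.0, 0.0, 0.0], 1)
# Genuine uncertainty: a uniform prediction, label 1.
uncertain_loss, uncertain_gate, uncertain_ce = entropy_gated_loss([0.0, 0.0, 0.0, 0.0], 1)
```

The conflict token has the larger raw cross-entropy, so plain SFT would update hardest on it; after gating, its contribution falls below that of the uncertain token, which matches the behaviour the abstract describes.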


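[139] Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models cs.LG | cs.CV

Nouar AlDahoul, Aznul Qalid Md Sabri, Ali Mohammed Mansoor

TL;DR: This paper proposes a method that combines optical flow with three different deep models (a supervised convolutional neural network, a pretrained CNN feature extractor, and a hierarchical extreme learning machine) for real-time human detection in video sequences captured from a non-static aerial platform.

Details

Motivation: Traditional methods rely on handcrafted features, which are sensitive to dynamic events (such as illumination changes, camera jitter, and variations in object size) and require expert knowledge. Feature-learning methods instead extract highly abstract, discriminative features automatically, making them cheaper and easier to implement.

Result: The models are trained and tested on the publicly available, challenging UCF-ARG aerial dataset, with evaluation covering five human actions (digging, waving, throwing, walking, and running). The pretrained CNN reaches an average accuracy of 98.09%; S-CNN reaches 95.6% with softmax and 91.7% with SVM; H-ELM reaches 95.9%. H-ELM trains in 445 seconds on a standard CPU, while S-CNN takes 770 seconds on a high-performance GPU.

Insight: The innovation is combining optical flow with several deep models for human detection in aerial video, achieving high accuracy and real-time performance. Automatic feature learning reduces dependence on expert knowledge and copes with dynamic environmental changes, providing a practical solution for real-time applications.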

Abstract: Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need of expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM’s training time takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).


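[140] SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models cs.LG | cs.AI | cs.CV

Yunlin Zeng

TL;DR: This paper presents SPoRC-VIST, a benchmark for evaluating how well vision-language models generate long-form, engaging natural narratives in the form of multi-speaker podcast dialogues. The author builds an end-to-end visual podcast generation pipeline and fine-tunes Qwen3-VL-32B on a dataset of 4,000 image-dialogue pairs. Key elements are a synthetic-to-real training strategy and an evaluation that goes beyond textual overlap by using AI-as-a-judge and novel style metrics.

Details

Motivation: Current vision-language models excel at descriptive tasks, but their ability to generate engaging long-form narratives, especially multi-speaker podcast dialogues, remains under-explored and hard to evaluate. Standard metrics such as BLEU and ROUGE fail to capture nuances like conversational naturalness, personality, and narrative flow, and often reward safe, repetitive outputs over engaging storytelling.

Result: Experiments show that the fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% average turn length) while maintaining identical visual grounding (CLIPScore: 20.39). Evaluation uses the SPoRC-VIST benchmark, which combines synthetic training data with real-world VIST photo sequences.

Insight: Contributions include (1) the SPoRC-VIST benchmark, designed specifically for evaluating generative natural narrative; (2) a synthetic-to-real training strategy that tests generalization from synthetic data to real-world visual domains; and (3) a comprehensive evaluation framework that goes beyond textual overlap, combining AI-as-a-judge with novel style metrics such as average turn length and speaker switch rate. Together these offer new directions and methodology for evaluating and improving narrative generation in VLMs.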

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives, specifically multi-speaker podcast dialogues, remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model’s ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
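
The two style metrics named in the abstract can be sketched as follows. The definitions here are assumptions: turn length is counted in whitespace-separated words, and the switch rate is the fraction of adjacent turn pairs whose speakers differ. The paper may define or normalize them differently, and the sample dialogue is invented.

```python
def average_turn_length(turns):
    """Mean number of whitespace-separated words per turn."""
    return sum(len(text.split()) for _, text in turns) / len(turns)


def speaker_switch_rate(turns):
    """Fraction of adjacent turn pairs in which the speaker changes."""
    if len(turns) < 2:
        return 0.0
    switches = sum(
        1 for (prev, _), (curr, _) in zip(turns, turns[1:]) if prev != curr
    )
    return switches / (len(turns) - 1)


# Hypothetical dialogue as (speaker, text) pairs.
dialogue = [
    ("A", "Look at that old lighthouse"),
    ("B", "It looks abandoned"),
    ("B", "Maybe haunted"),
    ("A", "Let us go inside"),
]
mean_length = average_turn_length(dialogue)   # (5 + 3 + 2 + 4) / 4 = 3.5
switch_rate = speaker_switch_rate(dialogue)   # 2 switches over 3 pairs
```

Unlike BLEU or ROUGE, neither metric needs a reference text, which is what lets them describe conversational style rather than overlap with a single gold answer.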


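[141] Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments cs.LG | cs.AI | cs.CV

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

TL;DR: This paper proposes Flow Equivariant World Models, a framework that models both the agent's self-motion and external object motion as one-parameter Lie group "flows" and exploits this unification to achieve group equivariance with respect to these transformations, providing a stable latent world representation over hundreds of timesteps.

Details

Motivation: Existing neural network world models ignore the smooth, time-parameterized symmetry structure of continuous sensory input streams. The paper aims to provide a more stable, data-efficient world representation for partially observed dynamic environments.

Result: On 2D and 3D partially observed video world modeling benchmarks, the method significantly outperforms current state-of-the-art diffusion-based and memory-augmented world modeling architectures, particularly when predictable world dynamics exist outside the agent's current field of view, and it generalizes well in long rollouts far beyond the training horizon.

Insight: By introducing structured representations of internal and external motion into world models, flow equivariance offers a scalable route to data-efficient, symmetry-guided embodied intelligence. The core innovation is unifying motion as Lie group flows and enforcing equivariance constraints, which avoids repeatedly re-learning the same transformations from data.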

Abstract: Embodied systems experience the world as ‘a symphony of flows’: a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce ‘Flow Equivariant World Models’, a framework in which both self-motion and external object motion are unified as one-parameter Lie group ‘flows’. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures – particularly when there are predictable world dynamics outside the agent’s current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.


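[142] GDRO: Group-level Reward Post-training Suitable for Diffusion Models cs.LG | cs.CV

Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao

TL;DR: This paper proposes GDRO, a new post-training paradigm for group-level reward alignment of diffusion models, designed to address the efficiency, stochastic-sampler dependency, and reward hacking problems of existing online reinforcement learning methods.

Details

Motivation: Existing methods adapt online reinforcement learning from LLMs to text-to-image rectified flow diffusion models for reward alignment, but suffer from low efficiency, dependence on stochastic samplers, and reward hacking. Rectified flow models differ fundamentally from LLMs in both efficiency and determinism.

Result: Extensive experiments on OCR and GenEval tasks show that GDRO effectively and efficiently improves the reward score of diffusion models through group-level offline optimization, while showing strong stability and robustness in mitigating reward hacking.

Insight: GDRO exploits the characteristics of rectified flow models: it supports fully offline training, saving image sampling time, and it is independent of the diffusion sampler, so no ODE-to-SDE approximation is needed to obtain stochasticity. It also introduces a corrected score that accounts for the trend of reward hacking, giving a more reliable evaluation.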

Abstract: Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.


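[143] CORE: Code-based Inverse Self-Training Framework with Graph Expansion for Virtual Agents cs.LG | cs.CV

Keyu Wang, Bingchen Miao, Wendong Bu, Yu Wu, Juncheng Li

TL;DR: This paper presents CORE, a code-based inverse self-training framework with graph expansion that resolves the conflict in multimodal virtual agent training between the low behavioral diversity of behavior cloning and the dependence of reinforcement learning on manually designed reward functions. The framework uses Semantic Code Abstraction to infer reward functions automatically from expert demonstrations, Strategy Graph Expansion to increase in-domain behavioral diversity, and Trajectory-Guided Extrapolation to enrich out-of-domain behavioral diversity.

Details

Motivation: Behavior cloning lacks behavioral diversity, while reinforcement learning depends on manual reward design. The paper seeks a new training paradigm for multimodal virtual agents that combines the strengths of imitation and exploration.

Result: Experiments on Web and Android platforms show that CORE significantly improves the overall performance and generalization of virtual agents, highlighting its potential as a robust, generalizable training paradigm.

Insight: The innovation is formalizing the reward function as executable code (a Label Function) and building a Strategy Graph that captures multi-path solutions to a task, while using both successful and failed trajectories to expand the task space. This systematically increases behavioral diversity and removes the dependence on manual rewards.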

Abstract: The development of Multimodal Virtual Agents has made significant progress through the integration of Multimodal Large Language Models. However, mainstream training paradigms face key challenges: Behavior Cloning is simple and effective through imitation but suffers from low behavioral diversity, while Reinforcement Learning is capable of discovering novel strategies through exploration but heavily relies on manually designed reward functions. To address the conflict between these two methods, we present CORE, a Code-based Inverse Self-Training Framework with Graph Expansion that bridges imitation and exploration, offering a novel training framework that promotes behavioral diversity while eliminating the reliance on manual reward design. Specifically, we introduce Semantic Code Abstraction to automatically infer reward functions from expert demonstrations without manual design. The inferred reward function, referred to as the Label Function, is executable code that verifies one key step within a task. Building on this, we propose Strategy Graph Expansion to enhance in-domain behavioral diversity, which constructs a multi-path graph called the Strategy Graph that captures diverse valid solutions beyond expert demonstrations. Furthermore, we introduce Trajectory-Guided Extrapolation, which enriches out-of-domain behavioral diversity by utilizing both successful and failed trajectories to expand the task space. Experiments on Web and Android platforms demonstrate that CORE significantly improves both overall performance and generalization, highlighting its potential as a robust and generalizable training paradigm for building powerful virtual agents.