Table of Contents
- cs.CL [Total: 24]
- cs.CV [Total: 131]
- cs.AI [Total: 7]
- cs.RO [Total: 5]
- q-bio.QM [Total: 1]
- cs.LG [Total: 8]
- cs.DL [Total: 1]
- eess.IV [Total: 2]
- cs.CR [Total: 2]
- cs.CY [Total: 2]
cs.CL
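[1] ReportLogic: Evaluating Logical Quality in Deep Research Reports cs.CL | cs.AI
Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou
TL;DR: This paper introduces ReportLogic, a benchmark for evaluating the logical quality of deep research reports. It builds a human-annotated dataset covering three levels (Macro-Logic, Expositional-Logic, and Structural-Logic), trains an open-source LogicJudge model for scalable evaluation, and tests the robustness of judge models with adversarial attacks.
Details
Motivation: Users increasingly rely on large language models to generate deep research reports, but existing evaluation frameworks overlook logical quality, that is, whether claims and arguments are explicitly supported. This undermines the reliability of such reports in downstream use.
Result: The paper constructs the human-annotated ReportLogic dataset and trains the LogicJudge model for logical-quality evaluation. Adversarial attacks show that off-the-shelf LLM judges are easily swayed by superficial cues such as verbosity, and that reasoning modes can mask broken support relations.
Insight: The novelty lies in a hierarchical taxonomy of logical quality centered on reader auditability (Macro-, Expositional-, and Structural-Logic) together with a scalable evaluation tool, offering practical guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.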
Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report’s claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim–support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.
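[2] ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification cs.CL | cs.AI
Siran Liu, Cyril Y. He
TL;DR: ConfSpec is a step-level speculative reasoning framework built on confidence-gated cascaded verification, aimed at the high latency caused by long chain-of-thought generation. It exploits an asymmetry between generation and verification: a small draft model is well calibrated within its competence range, so its high-confidence decisions on reasoning steps are accepted directly, while uncertain steps are escalated to the large target model for verification. This raises inference speed while preserving accuracy.
Details
Motivation: Chain-of-thought reasoning improves LLM performance on complex tasks, but generating long reasoning traces incurs high inference latency. Existing step-level speculative reasoning methods face a long-standing trade-off among accuracy, inference speed, and resource efficiency, so a solution that addresses all three is needed.
Result: Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24x end-to-end speedup while matching target-model accuracy. The method requires no external judge model and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
Insight: The key observation is the asymmetry between generation and verification: producing a correct reasoning step requires large-model capacity, whereas step-level verification is a constrained discriminative task on which a small draft model is well calibrated within its competence range, so a confidence gate can verify efficiently. Viewed objectively, the cascaded verification framework balances speed and accuracy without extra model resources, which makes it practical and extensible.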
Abstract: Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24$\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
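The confidence gate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the symmetric threshold `tau`, the function `confidence_gated_verify`, and the two callables `draft_verify` and `target_verify` are hypothetical names introduced here, and the sketch assumes the draft model returns a probability that a step is correct.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class GateStats:
    accepted_by_draft: int = 0   # draft was confident the step is correct
    rejected_by_draft: int = 0   # draft was confident the step is wrong
    escalated: int = 0           # draft was unsure, target model decided


def confidence_gated_verify(
    steps: Sequence[str],
    draft_verify: Callable[[str], float],   # hypothetical: P(step is correct)
    target_verify: Callable[[str], bool],   # hypothetical: expensive verifier
    tau: float = 0.9,                       # assumed symmetric confidence gate
) -> Tuple[List[bool], GateStats]:
    """Trust confident draft verdicts; escalate uncertain steps to the target."""
    if not 0.5 < tau <= 1.0:
        raise ValueError("tau must lie in (0.5, 1.0]")
    stats = GateStats()
    decisions: List[bool] = []
    for step in steps:
        p = draft_verify(step)
        if p >= tau:                 # confident accept
            decisions.append(True)
            stats.accepted_by_draft += 1
        elif p <= 1.0 - tau:         # confident reject
            decisions.append(False)
            stats.rejected_by_draft += 1
        else:                        # uncertain: pay for the large model
            decisions.append(target_verify(step))
            stats.escalated += 1
    return decisions, stats
```

The fraction of escalated steps is what drives cost in such a cascade: the fewer steps fall between `1 - tau` and `tau`, the less often the target model is called.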
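[3] DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning cs.CL
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham
TL;DR: This paper proposes DP-RFT (Differentially Private Reinforcement Fine-Tuning), an online reinforcement learning algorithm that trains large language models (LLMs) to generate high-quality differentially private synthetic text without direct access to private data. It uses DP-protected nearest-neighbor votes from a private corpus as the reward signal and iteratively optimizes the LLM with Proximal Policy Optimization (PPO) so that the generated synthetic data maximizes expected reward.
Details
Motivation: In DP synthetic data generation, direct fine-tuning requires access to the content of private data, while methods that avoid direct exposure are limited to un-finetuned models and therefore produce data with insufficient domain fidelity. The goal is to train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples.
Result: In experiments on long-form and domain-specific synthetic data (news articles, meeting transcripts, and medical article abstracts), DP-RFT narrows the gap between private evolution and DP fine-tuning methods in the fidelity and downstream utility of the generated data, while respecting the private data boundary.
Insight: The core novelty is using DP-protected nearest-neighbor votes as the reinforcement learning reward, yielding an online learning framework (DP-RFT) that needs no direct access to private content. Combined with PPO, it achieves a better trade-off between privacy protection and generated-data quality.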
Abstract: Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet they still require the raw content of private examples for model training. On the other hand, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.
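One plausible reading of the nearest-neighbor voting reward is sketched below. Each private example votes for its closest synthetic sample, and Gaussian noise is added to the vote histogram before it is used as a per-sample reward. The embedding representation, the squared Euclidean distance, the Gaussian mechanism, and the names `dp_nn_vote_rewards` and `sigma` are all assumptions made for illustration; the abstract does not specify them, and calibrating `sigma` to a privacy budget is outside this sketch.

```python
import random
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_nn_vote_rewards(
    private_embs: Sequence[Sequence[float]],    # stays on the data owner's side
    synthetic_embs: Sequence[Sequence[float]],  # on-policy samples from the LLM
    sigma: float,                               # assumed Gaussian noise scale
    rng: random.Random,
) -> List[float]:
    """Return one noisy vote count per synthetic sample."""
    if not synthetic_embs:
        raise ValueError("need at least one synthetic sample")
    votes = [0.0] * len(synthetic_embs)
    for p in private_embs:
        # Each private example contributes exactly one vote.
        nearest = min(
            range(len(synthetic_embs)),
            key=lambda j: sq_dist(p, synthetic_embs[j]),
        )
        votes[nearest] += 1.0
    # Only the noised histogram leaves the private side.
    return [v + rng.gauss(0.0, sigma) for v in votes]
```

In an RL loop the returned list would play the role of the reward vector for the batch of generated samples; the private texts themselves are never shown to the generator.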
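[4] PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation cs.CL
Nina Hosseini-Kivanani
TL;DR: This paper presents PolyFrame, a system that addresses the difficulty multimodal models have with idiomatic expressions, especially in multilingual settings. A unified pipeline handles both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). It keeps CLIP-style vision-language encoders and the multilingual BGE M3 encoder frozen and trains only lightweight modules: logistic regression, an LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion.
Details
Motivation: Multimodal models struggle with idiomatic expressions because of their non-compositional meanings, and the challenge is amplified in multilingual settings. The paper tackles multimodal idiom disambiguation with PolyFrame as an entry in the MWE-2026 AdMIRe2 shared task.
Result: Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification raises performance to 60.0% Top-1 on English and to 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, the systems average Top-1/NDCG scores of 0.35/0.73 on Subtask A and 0.32/0.71 on Subtask B across 15 languages. Ablations show that idiom-aware rewriting is the main contributor to the gains.
Insight: Novel points include: combining frozen large encoders with lightweight modules for efficient idiom disambiguation, with no fine-tuning of the multimodal encoders; idiom-aware rewriting and sentence-type prediction, which improve robustness; and a multimodal fusion strategy that improves cross-lingual performance. Together these give a scalable, resource-efficient approach to idiom understanding in multimodal NLP.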
Abstract: Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduced PolyFrame, our system for the MWE-2026 AdMIRe2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision–language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
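Borda rank fusion, one of the lightweight modules named above, can be sketched as the textbook Borda count. The abstract does not say which rankers PolyFrame fuses or how ties are broken, so the function name `borda_fuse`, the alphabetical tie-break, and the candidate labels in the usage below are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def borda_fuse(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Fuse several best-first rankings with the classic Borda count.

    A candidate at position pos in a ranking of length n earns n - 1 - pos
    points; points are summed across rankings and candidates are returned
    from highest to lowest total (ties broken alphabetically, an assumption).
    """
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - pos
    return sorted(scores, key=lambda c: (-scores[c], c))


# Three hypothetical rankers ordering the same three candidate images.
fused = borda_fuse([
    ["img2", "img1", "img3"],
    ["img1", "img2", "img3"],
    ["img1", "img3", "img2"],
])
print(fused)
```

Because Borda fusion uses only rank positions, it can combine rankers whose raw scores are on incompatible scales, which suits a pipeline built from heterogeneous frozen encoders.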
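[5] Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM cs.CL
Md Badsha Biswas, Ozlem Uzuner
TL;DR: This paper proposes a novel open-domain claim verification system that combines large language models, multi-perspective evidence retrieval, and cross-source disagreement analysis. It retrieves evidence for both the original claim and its negation, gathering supporting and contradicting information from Wikipedia, PubMed, and Google. The evidence is filtered, deduplicated, and aggregated into a unified knowledge base that an LLM then uses to verify the claim.
Details
Motivation: Existing automated claim verification systems usually rely on a single knowledge source and ignore disagreement between sources, which limits their knowledge coverage and transparency. The paper addresses this by integrating multi-source, multi-perspective evidence and analyzing inter-source disagreement.
Result: Extensive evaluation on four benchmark datasets with five LLMs shows that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
Insight: The core novelty is a dual-perspective retrieval strategy that queries both a claim and its negation, plus a framework for cross-source evidence aggregation and quantitative disagreement analysis. The work stresses that embracing diversity, contradiction, and aggregation in evidence matters for building reliable, transparent verification systems.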
Abstract: The spread of misinformation across digital platforms can pose significant societal risks. Claim verification (a.k.a. fact-checking) systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems.
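[6] ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models cs.CL | eess.AS
Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang
TL;DR: This paper proposes ReHear, a framework that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop to iteratively refine pseudo-labels in semi-supervised automatic speech recognition (ASR), addressing the confirmation bias and error accumulation that noisy supervision causes in conventional pseudo-labeling.
Details
Motivation: Conventional pseudo-labeling for semi-supervised ASR suffers from confirmation bias and error propagation. A mechanism is needed that uses the audio itself to correct severe recognition errors and produce high-quality pseudo-labels.
Result: Experiments across multiple benchmarks show that ReHear effectively mitigates error propagation and consistently outperforms both supervised and conventional pseudo-labeling baselines.
Insight: The novelty is integrating an audio-aware LLM, rather than a text-only corrector, into the iterative training loop. The LLM is conditioned jointly on the audio and the ASR hypothesis to recover phonetically accurate transcripts, producing high-fidelity pseudo-labels for fine-tuning the ASR model.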
Abstract: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
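[7] BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models cs.CL
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat
TL;DR: This paper introduces BURMESE-SAN, the first holistic benchmark for Burmese, which systematically evaluates large language models on three core competencies: natural language understanding, reasoning, and generation. It consolidates seven subtasks (question answering, sentiment analysis, toxicity detection, causal reasoning, natural language inference, abstractive summarization, and machine translation), several of which were previously unavailable for Burmese. A large-scale evaluation of open-weight and commercial LLMs finds that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone, with Southeast Asia regional fine-tuning and newer model generations giving notable gains. The benchmark is released as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages.
Details
Motivation: Burmese is a low-resource language that lacks a systematic NLP evaluation benchmark. The work fills this gap in LLM evaluation and examines the modeling challenges that arise from limited pretraining coverage, rich morphology, and syntactic variation.
Result: A large-scale evaluation of open-weight and commercial LLMs on BURMESE-SAN shows that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone; Southeast Asia regional fine-tuning and newer model generations yield substantial gains. The benchmark is released as a public leaderboard (https://leaderboard.sea-lion.ai/detailed/MY) to support ongoing evaluation.
Insight: The novelty is building the first holistic NLP benchmark for Burmese through a native-speaker-driven process that ensures linguistic naturalness and cultural authenticity while reducing translation artifacts. The analysis suggests that, for low-resource languages such as Burmese, region-specific fine-tuning and model design matter more than simply scaling up, which offers a reusable approach for evaluating other low-resource languages.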
Abstract: We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY
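[8] Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models cs.CL | cs.AI
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma
TL;DR: This paper proposes Think$^{2}$, a psychologically grounded metacognitive framework that turns Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) into a structured prompting architecture and integrates it into a lightweight dual-process MetaController that allocates reasoning effort adaptively. The framework aims to strengthen the ability of large language models (LLMs) to monitor, diagnose, and correct their own errors.
Details
Motivation: Although large language models show strong reasoning ability, their capacity to reliably monitor, diagnose, and correct their own errors remains limited. The paper addresses this with a metacognitive framework grounded in established cognitive psychology.
Result: Across several reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, TruthfulQA) with Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and triples the rate of successful self-correction. Blinded human evaluation over 580 query pairs shows an 84% aggregate preference for the method over standard and Chain-of-Thought baselines on trustworthiness and metacognitive self-awareness.
Insight: The core novelty is operationalizing the established metacognitive regulatory cycle from cognitive psychology (Planning, Monitoring, Evaluation) as a usable prompting framework and adding a lightweight MetaController for adaptive effort allocation. This offers a principled, interpretable route to better LLM self-diagnosis and self-correction instead of relying only on heuristic prompt engineering.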
Abstract: Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown’s regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.
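[9] IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning cs.CL | cs.LG
Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su
TL;DR: This paper proposes IAPO (Information-Aware Policy Optimization), a post-training framework that targets the high inference cost of overly long chains of thought in large language models. Grounded in information theory, it assigns token-level advantages based on each token's conditional mutual information with the final answer, reducing reasoning length while maintaining or improving accuracy.
Details
Motivation: Existing methods based on sequence-level reward shaping offer limited control over how reasoning effort is distributed across tokens, which leads to verbose, inefficient reasoning. The paper fills this gap with a principled mechanism for token-efficient post-training.
Result: Experiments on multiple reasoning datasets show that IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient reinforcement learning methods.
Insight: The core novelty is bringing conditional mutual information from information theory into reinforcement learning reward shaping, which gives an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. This points to a powerful and general direction for token-efficient post-training.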
Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token’s conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
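[10] Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer cs.CL
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng
TL;DR: This paper studies whether large language models (LLMs) and large vision-language models (LVLMs) share neurons during inference. It finds that more than half of the top-activated neurons are shared between the two model families, forming a modality-invariant inference subspace. Building on this, the authors propose Shared Neuron Low-Rank Fusion (SNRF), a framework that efficiently transfers the mature inference ability of LLMs to LVLMs and improves their multimodal inference performance.
Details
Motivation: Despite rapid progress, large vision-language models (LVLMs) still lag behind text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Because both share the Transformer architecture, the authors investigate whether the two families rely on common internal computation for inference.
Result: Across diverse mathematics and perception benchmarks, SNRF consistently improves LVLM inference performance while preserving perceptual capabilities. It does so with minimal parameter changes and without large-scale multimodal fine-tuning.
Insight: The novelty is the finding that LLMs and LVLMs share a large set of neurons that form an interpretable, modality-invariant inference subspace. SNRF builds on it by profiling cross-model activations, computing a low-rank approximation of inter-model weight differences, and injecting the updates selectively within the shared-neuron subspace. The result is parameter-efficient transfer of inference ability and a low-cost, interpretable way to improve multimodal inference.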
Abstract: Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons.
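The neuron-overlap measurement described above can be illustrated with a small sketch that compares the top-k most active units of two models. It assumes that units are index-aligned across the two models and that each unit is summarized by a mean activation over inference prompts; both are assumptions, as are the names `top_k_units` and `shared_fraction`, and the abstract does not state how the authors rank or match units.

```python
from typing import Sequence, Set


def top_k_units(mean_activation: Sequence[float], k: int) -> Set[int]:
    """Indices of the k units with the largest mean activation."""
    order = sorted(
        range(len(mean_activation)),
        key=lambda i: mean_activation[i],
        reverse=True,
    )
    return set(order[:k])


def shared_fraction(
    act_llm: Sequence[float],    # hypothetical per-unit profile of the LLM
    act_lvlm: Sequence[float],   # hypothetical per-unit profile of the LVLM
    k: int,
) -> float:
    """Fraction of the top-k units that appear in both models' top-k sets."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = top_k_units(act_llm, k) & top_k_units(act_lvlm, k)
    return len(shared) / k
```

A value above 0.5 from `shared_fraction` would correspond to the "more than half" overlap the abstract reports; the shared index set is also what a selective update would be restricted to.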
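[11] AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG cs.CL
Qijie You, Wenkai Yu, Wentao Zhang
TL;DR: This paper presents AgenticRAGTracer, the first Agentic RAG benchmark that is constructed primarily automatically by large language models and designed to support step-by-step validation. It spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Experiments show that even the best LLMs, such as GPT-5, perform poorly on it, highlighting a critical inability to allocate retrieval-reasoning steps in line with the task's logical structure.
Details
Motivation: Existing benchmarks for multi-hop reasoning in Agentic RAG usually provide only the final question and answer and lack the intermediate hop-level questions that connect atomic questions to the final multi-hop query. This prevents analysis of where an agent fails and blocks finer-grained capability evaluation. Most current benchmarks are also built by hand, which is slow and labor-intensive and limits scalability and generalization.
Result: Extensive experiments on AgenticRAGTracer show that even the best LLMs perform poorly; GPT-5 reaches only 22.6% EM accuracy on the hardest portion. Hop-aware diagnosis reveals that failures are driven mainly by distorted reasoning chains, which either collapse prematurely or over-extend.
Insight: The novelty is twofold: 1) the first primarily automatically constructed Agentic RAG benchmark that supports step-by-step validation, addressing the missing intermediate steps and poor scalability of existing benchmarks; 2) hop-aware diagnosis that exposes specific failure patterns (distorted reasoning chains), adding a diagnostic dimension absent from traditional evaluations and supporting finer-grained understanding and improvement of multi-step reasoning in Agentic RAG systems.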
Abstract: With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains – either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task’s logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
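[12] Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs cs.CL
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide
TL;DR: This paper proposes a contrastive Sparse AutoEncoder (SAE) framework for precise personality control of role-playing large language models (LLMs) at the level of the 30 facets of the Big Five model. A 15,000-sample leakage-controlled corpus provides supervision for learning control vectors aligned with personality traits, and a trait-activated routing module injects them dynamically into the model's residual space, giving stable and interpretable personality steering over long dialogues.
Details
Motivation: Existing personality control methods for Role-Playing Agents (RPAs) have limits. Training-free methods based on prompts or retrieval-augmented generation (RAG) see their signal diluted in long dialogues, causing persona drift and inconsistency, while supervised fine-tuning (SFT) is effective but needs persona-labeled data and is inflexible for new roles. The paper targets these shortfalls in precision, stability, and flexibility.
Result: Experiments on LLMs show that, in contextualized settings, the method outperforms Contrastive Activation Addition (CAA) and prompt-only baselines in both character fidelity and output quality. The combined SAE+Prompt configuration achieves the best overall performance.
Insight: The main novel points are: 1) a contrastive sparse autoencoder framework that learns fine-grained personality control vectors aligned with the 30 Big Five facets; 2) a new, balanced, leakage-controlled corpus that supplies supervision for each facet; 3) a trait-activated routing module that injects control vectors dynamically and interpretably. Viewed objectively, the method refines persona control from the role level to the facet level, and dynamic routing in the residual space is a promising route to fine-grained, composable behavior control in LLMs.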
Abstract: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model’s residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.
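[13] Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection cs.CL
Raihan Tanvir, Md. Golam Rabiul Alam
TL;DR: This paper proposes the Enhanced Dual Co-attention Framework (xDORA) and its retrieval-augmented variant (RAG-Fused DORA) for detecting hateful memes in Bengali. It integrates vision encoders (such as CLIP and DINOv2) with multilingual text encoders (such as XGLM and XLM-R) and learns robust cross-modal representations through weighted attention pooling. It also introduces a FAISS-based k-nearest neighbor classifier for non-parametric inference and evaluates LLaVA under zero-shot, few-shot, and retrieval-augmented prompting. On an extended Bengali hateful meme dataset, the proposed methods outperform the baseline on both hateful content identification and target entity detection.
Details
Motivation: Hateful content on social media often appears as multimodal memes that combine images and text. In low-resource languages such as Bengali, automated detection is difficult because of limited annotated data, class imbalance, and pervasive code-mixing.
Result: On the extended dataset, xDORA (CLIP + XLM-R) achieves macro-average F1 scores of 0.78 for hateful meme identification and 0.71 for target entity detection. RAG-Fused DORA raises these to 0.79 and 0.74, improving on the DORA baseline. The FAISS-based classifier is competitive and, through semantic similarity modeling, robust on rare classes. LLaVA, in contrast, shows limited effectiveness in few-shot settings, and retrieval augmentation brings only modest gains.
Insight: The main novel points are: 1) augmenting the BHM dataset with samples from MIMOSA, which improves class balance and semantic diversity; 2) the xDORA framework, which fuses vision and text encoders through weighted attention pooling to learn cross-modal representations; 3) retrieval-augmented inference (RAG-Fused DORA) and a non-parametric FAISS k-nearest neighbor classifier, which improve performance and the handling of rare classes. The study underlines the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for the linguistic and cultural complexity of hate detection in low-resource languages.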
Abstract: Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
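[14] Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering cs.CL | cs.AI | cs.IR
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani
TL;DR: This paper proposes PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates multi-step retrieval with reasoning for personalized question answering. It learns adaptive retrieval-reasoning policies that decide when to retrieve, which evidence to retrieve from the user profile, and how to incorporate it into intermediate reasoning steps, producing answers that better match the user's background and preferences.
Details
Motivation: Existing personalized QA methods based on retrieval-augmented generation (RAG) usually retrieve personal documents directly with the user's query. Personalization therefore stays at the surface and cannot combine user preferences and historical context through multi-step reasoning. The paper addresses this to produce deeper and more accurate personalized answers.
Result: Extensive experiments on the LaMP-QA benchmark with three LLMs show that PR2 consistently outperforms strong baselines, with an average relative improvement of 8.8%-12% on personalized QA.
Insight: The core novelty is combining reinforcement learning with retrieval-augmented generation: by optimizing multi-turn reasoning trajectories under a personalized reward function, the framework learns a dynamic retrieval-reasoning policy that exploits personal context for deep personalization instead of surface matching. This suggests a new approach to complex, multi-step integration of personal information.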
Abstract: Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users’ background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user’s profile. Existing methods use the user’s query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.
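[15] Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations cs.CL | cs.AI
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore
TL;DR: This paper is a systematic survey of memory systems for LLM-based agents. From both architectural and system perspectives, it analyzes the limitations of current evaluation benchmarks and system designs.
Details
Motivation: Although agentic memory architectures are developing quickly, their empirical foundations remain fragile: existing benchmarks are underscaled, evaluation metrics are misaligned with semantic utility, performance depends heavily on the backbone model, and system-level costs are often overlooked. A structured analysis is therefore needed.
Result: The paper reports no specific quantitative experimental results. Its analysis identifies key pain points: benchmark saturation effects, problems with metric validity and judge sensitivity, accuracy that depends on the backbone model, and the latency and throughput overhead introduced by memory maintenance.
Insight: The novelty is a concise taxonomy of MAG systems based on four memory structures, and a link between memory structure and empirical limitations that points toward more reliable evaluation and scalable system design.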
Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.
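[16] Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference cs.CL | cs.AI | cs.LG
Arindam Khaled
TL;DR: This paper proposes Pyramid MoA, a hierarchical Mixture-of-Agents architecture that addresses the trade-off between LLM inference cost and capability. A lightweight router escalates a query to a more powerful model only when necessary, maintaining high performance while substantially reducing compute cost.
Details
Motivation: LLMs face an inherent tension between inference cost and the ability to handle complex tasks: large "Oracle" models (e.g., Llama-3-70B) are accurate but prohibitively expensive to deploy, while small models (e.g., 8B parameters) are cheap but struggle with complex tasks. The paper aims to design a system that allocates queries intelligently, reducing overall compute while preserving accuracy.
Result: On GSM8K, the system reaches 93.0% accuracy, which the authors describe as effectively matching the Oracle baseline (98.0%), while reducing compute cost by 61%. The added latency is small (+0.82 s), and the system allows a tunable trade-off between performance and budget.
Insight: The core innovation is the hierarchical MoA architecture together with a routing mechanism based on semantic agreement and confidence calibration. By ensembling the predictions of several small models and assessing their agreement, the router identifies "hard" problems with high precision, enabling cost-aware adaptive inference. This dynamic resource-allocation paradigm offers a new approach to building efficient, scalable LLM services.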
Abstract: Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While “Oracle” models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose “Pyramid MoA”, a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies “hard” problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
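The escalation idea can be illustrated with a minimal sketch. It is not the paper's Router: exact string agreement stands in for semantic agreement, confidence calibration is omitted, and the names route, answer, and the 0.6 threshold are hypothetical.

```python
from collections import Counter

def route(small_answers, threshold=0.6):
    """Return (majority_answer, escalate) from the small models' answers."""
    votes = Counter(a.strip().lower() for a in small_answers)
    top_answer, top_count = votes.most_common(1)[0]
    agreement = top_count / len(small_answers)
    # Escalate only when too few small models agree on one answer.
    return top_answer, agreement < threshold

def answer(query, small_models, oracle, threshold=0.6):
    """Query the small ensemble first; call the oracle only on escalation."""
    candidate, escalate = route([m(query) for m in small_models], threshold)
    return oracle(query) if escalate else candidate
```

Raising the threshold sends more queries to the large model, which is one simple way to picture the tunable trade-off between accuracy and budget described in the abstract.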
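[17] How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1 cs.CL
Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie
TL;DR: This paper systematically studies the role of reinforcement learning in Deep Research agents. By analyzing three decoupled dimensions (prompt template, reward function, and policy optimization), it arrives at an improved baseline, Search-R1++, that performs better on knowledge-intensive tasks.
Details
Motivation: To understand in depth how reinforcement learning improves Deep Research agents on knowledge-intensive tasks and to identify its specific contributions.
Result: Search-R1++ improves on Search-R1 from 0.403 to 0.442 with Qwen2.5-7B and from 0.289 to 0.331 with Qwen2.5-3B.
Insight: Key findings: the Fast Thinking prompt template is more stable and performs better than Slow Thinking; the F1-based reward suffers training collapse driven by answer avoidance, which action-level penalties can mitigate; and REINFORCE outperforms PPO while using fewer search actions. These findings support more principled RL training strategies for Deep Research systems.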
Abstract: Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the exact-match (EM) reward due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
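For readers unfamiliar with the two reward signals compared above, the following sketch shows the usual exact-match and token-level F1 definitions for QA. The normalization is deliberately simplified, and shaped_reward is a hypothetical illustration of an action-level penalty for answer avoidance; the abstract does not specify the paper's actual penalty.

```python
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()

def em_reward(pred, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return 1.0 if normalize(pred) == normalize(gold) else 0.0

def f1_reward(pred, gold):
    """Token-level F1 between the predicted and gold answers."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def shaped_reward(pred, gold, answered, penalty=0.5):
    """Hypothetical shaping: penalize trajectories that never emit an answer."""
    return f1_reward(pred, gold) if answered else -penalty
```

Unlike EM, F1 gives partial credit to verbose answers that contain the gold tokens, which is why the two rewards can shape policies differently.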
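[18] Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework cs.CL | cs.CV | cs.IR
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo
TL;DR: This paper proposes Prune-then-Merge, a two-stage framework for the multi-vector paradigm in visual document retrieval, which performs well but incurs heavy computational overhead. The framework first filters low-information image patches through adaptive pruning, then compresses the remaining features through hierarchical merging, improving feature fidelity at high compression rates.
Details
Motivation: The state-of-the-art multi-vector paradigm in visual document retrieval performs well but is computationally expensive, and existing efficiency methods such as pruning and merging face a difficult trade-off between compression rate and feature fidelity.
Result: Extensive experiments on 29 visual document retrieval datasets show that the framework consistently outperforms existing methods, significantly extends the near-lossless compression range, and remains robust at high compression ratios.
Insight: The core innovation is combining the complementary pruning and merging approaches in a two-stage framework: adaptive pruning first removes noise, then hierarchical merging is applied to the refined feature set. This avoids the noise-induced feature dilution of single-stage methods and gives a better balance between efficiency and accuracy.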
Abstract: Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
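The two-stage order (prune first, merge second) can be sketched in a few lines. This is only an illustration under stated assumptions: a fixed keep ratio replaces the paper's adaptive pruning, mean pooling of consecutive groups replaces its hierarchical merging, and the per-patch scores are taken as given.

```python
def prune(embeddings, scores, keep_ratio=0.5):
    """Keep the highest-scoring patch embeddings, preserving their order."""
    k = max(1, int(len(embeddings) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [embeddings[i] for i in sorted(ranked)]

def merge(embeddings, group_size=2):
    """Average each consecutive group of embeddings into a single vector."""
    merged = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

def prune_then_merge(embeddings, scores, keep_ratio=0.5, group_size=2):
    """Merge only the vectors that survive pruning."""
    return merge(prune(embeddings, scores, keep_ratio), group_size)
```

Because low-scoring patches are dropped before averaging, they cannot dilute the merged vectors, which is the intuition the abstract gives for combining the two stages.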
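[19] Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering cs.CL | cs.AI
Wuzhenghong Wen, Bowen Zhou, Jinwen Huang, Xianjie Wu, Yuwei Sun
TL;DR: This paper proposes a new framework for temporal knowledge graph question answering (TKGQA). Through temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion, it addresses the shortcomings of existing methods in incorporating temporal constraints, performing explicit multi-hop reasoning, and fusing language and graph representations.
Details
Motivation: To address three key problems in existing TKGQA methods: reasoning bias caused by weak incorporation of temporal constraints in the question representation, limited ability to perform explicit multi-hop reasoning, and suboptimal fusion of language and graph representations.
Result: Experiments on multiple TKGQA benchmarks show consistent improvements over multiple baselines.
Insight: The contributions are: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network that performs explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. These designs strengthen the model's use of temporal information and the transparency of its reasoning.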
Abstract: Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.
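[20] Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling cs.CL | cs.LG
Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu
TL;DR: This paper proposes Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. It follows a two-stage paradigm. Stage I builds a large-scale financial news event dataset of 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return. Stage II performs decision-oriented fine-tuning that combines supervised learning with reinforcement learning, using a hierarchical gated reward model to capture trade-offs among multiple trading objectives. Experiments show that Janus-Q makes more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines.
Details
Motivation: Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous and abrupt and are hard to capture with purely numerical prediction objectives. Existing approaches face two challenges: the lack of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and the misalignment between language model reasoning and valid trading behavior under dynamic market conditions.
Result: In extensive experiments, Janus-Q makes more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% and direction accuracy by over 17.5% compared with the strongest competing strategies.
Insight: The contributions include: treating financial news events as the primary decision unit in an end-to-end event-driven trading framework; building a large-scale, finely annotated financial news event dataset; and introducing a hierarchical gated reward model that, combined with supervised and reinforcement learning for decision-oriented fine-tuning, explicitly captures multi-objective trade-offs and improves the robustness and interpretability of the trading strategy.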
Abstract: Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
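Two quantities named in the abstract, cumulative abnormal return (CAR) and the Sharpe Ratio, have standard textbook forms that the sketch below illustrates. The market-adjusted definition of abnormal return (stock return minus benchmark return) is an assumption made here for simplicity; the paper may estimate expected returns differently, and the function names are hypothetical.

```python
import statistics

def cumulative_abnormal_return(stock_returns, benchmark_returns):
    """Sum of per-period abnormal returns over the event window."""
    return sum(s - b for s, b in zip(stock_returns, benchmark_returns))

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)
```

The Sharpe Ratio here is per period and not annualized; annualizing would multiply it by the square root of the number of periods per year.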
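[21] Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval cs.CL | cs.IR
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao
TL;DR: The authors present this as the first comprehensive survey of visual document retrieval (VDR). Focusing on the era of multimodal large language models (MLLMs), it examines how multimodal embedding models, reranker models, and the integration of retrieval-augmented generation (RAG) with agentic systems address the retrieval and understanding of visual documents, and it identifies current challenges and future directions.
Details
Motivation: With the rapid growth of multimodal information, VDR has become a key frontier connecting unstructured, visually rich data with precise information acquisition. The field must cope with the distinctive challenges of visual documents: dense text, complex layouts, and fine-grained semantic dependencies.
Result: As a survey, the paper provides no quantitative experimental results. It systematically reviews VDR benchmarks and the evolution of methods (multimodal embedding, reranking, and the integration of RAG and agentic systems) and summarizes current achievements.
Insight: The paper surveys VDR from the MLLM perspective, proposes a taxonomy of methods (multimodal embedding, reranking, and RAG and agent integration), and offers a roadmap for future multimodal document intelligence, stressing the importance of combining layout and semantic understanding.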
Abstract: With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
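[22] ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting cs.CL | cs.AI
Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie
TL;DR: This paper proposes ReAttn, a post-hoc re-weighting strategy that improves attention-based re-ranking. It uses cross-document IDF weighting to down-weight attention on query-overlapping tokens, reducing lexical bias, and applies entropy-based regularization to mitigate over-concentrated attention.
Details
Motivation: Attention-based re-ranking is efficient and interpretable, but attention signals concentrate on a few tokens in a few documents, and phrases lexically similar to the query are overemphasized, which biases the ranking.
Result: Extensive experiments show that ReAttn improves attention-based zero-shot re-ranking without additional training or supervision.
Insight: The contribution is post-hoc re-weighting of existing attention weights through cross-document IDF weighting and entropy regularization, which mitigates lexical bias and attention concentration and offers a simple way to improve attention-based re-ranking.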
Abstract: The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for the zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making the others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes a cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
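A rough sketch of the two ingredients follows. It is not the ReAttn formulation: the plain log(N/df) IDF, the choice to multiply attention by IDF only on query-overlapping tokens, and the use of entropy purely as a concentration measure are all assumptions made for illustration.

```python
import math

def idf(token, docs):
    """Inverse document frequency of a token across the candidate documents."""
    df = sum(token in doc for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

def reweight(attn, doc, query, docs):
    """Scale attention on query-overlapping tokens by cross-document IDF."""
    return [a * idf(t, docs) if t in query else a for a, t in zip(attn, doc)]

def entropy(weights):
    """Shannon entropy of the normalized weights; low values mean concentration."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)
```

A query token that appears in every candidate document gets an IDF of zero in this sketch, so it no longer contributes to any document's score, while a query token found in only one document keeps most of its weight.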
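[23] QUIETT: Query-Independent Table Transformation for Robust Reasoning cs.CL
Gaurav Najpande, Tampu Ravi Kumar, Manan Roy Choudhury, Neha Valeti, Yanjie Fu
TL;DR: The paper proposes QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a canonical, SQL-ready representation to improve the robustness of downstream table reasoning and question answering.
Details
Motivation: Real-world tables often have irregular schemas, heterogeneous value formats, and implicit relational structure, which reduce the reliability of table reasoning and question answering. Most existing approaches handle these issues in a query-dependent way, coupling table cleanup with reasoning and limiting generalization.
Result: Experiments on four benchmarks (WikiTQ, HiTab, NQ-Table, and SequentialQA) show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.
Insight: The core innovation is decoupling table transformation from downstream reasoning. Lossless schema and value normalization, exposure of implicit relations, and full provenance preservation enable query-independent preprocessing that makes querying cleaner, more reliable, and more efficient without modifying downstream models.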
Abstract: Real-world tables often exhibit irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches address these issues in a query-dependent manner, entangling table cleanup with reasoning and thus limiting generalization. We introduce QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation before any test-time queries are observed. QuIeTT performs lossless schema and value normalization, exposes implicit relations, and preserves full provenance via raw table snapshots. By decoupling table transformation from reasoning, QuIeTT enables cleaner, more reliable, and highly efficient querying without modifying downstream models. Experiments on four benchmarks, WikiTQ, HiTab, NQ-Table, and SequentialQA show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.
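[24] To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering cs.CL | cs.AI
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen
TL;DR: This paper proposes Selective Chain-of-Thought (Selective CoT), an inference-time strategy for improving the efficiency of large language models on medical question answering. It predicts whether a question requires reasoning and generates a rationale only when needed, reducing computation while maintaining accuracy.
Details
Motivation: To address the inefficiency of LLM reasoning in medical QA by avoiding unnecessary chain-of-thought generation on recall-type questions that need no complex reasoning, lowering compute cost and making real-world deployment more feasible.
Result: Evaluated on four biomedical QA benchmarks (HeadQA, MedQA-USMLE, MedMCQA, PubMedQA) with Llama-3.1-8B and Qwen-2.5-7B, Selective CoT reduced inference time by 13-45% and generated tokens by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs it achieved both higher accuracy and greater efficiency than standard CoT, and compared with fixed-length CoT it reached similar or better accuracy at substantially lower computational cost.
Insight: The approach dynamically balances reasoning depth and efficiency, invoking explicit reasoning selectively according to question complexity, which reduces redundancy while preserving interpretability. It is a simple, model-agnostic, and cost-effective method that can improve the practical deployability of LLM-based clinical systems.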
Abstract: Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
cs.CV [Back]
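[25] Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models cs.CV | cs.LG
Suraj Prasad, Anubha Pant
TL;DR: This paper is a replication study of FedTPG, a federated text-driven prompt generation method. It evaluates the pre-trained model on six vision datasets and reports average accuracy on unseen classes 1.43 percentage points above that on seen classes.
Details
Motivation: To address the difficulty of adapting vision-language models such as CLIP to federated learning, particularly their limited generalization to unseen classes.
Result: On six datasets, including Caltech101 and Oxford Flowers, the replicated results are within 0.2% of the accuracies reported in the original paper, with average accuracy of 74.58% on seen classes and 76.00% on unseen classes, a reported gain of 1.43 percentage points.
Insight: A text-driven prompt generation network creates prompts dynamically from class names, enabling better cross-class generalization in federated settings. Federated training of the prompt generator maintains high performance across visual domains without sharing private data.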
Abstract: Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet their adaptation to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper [Qiu2024] addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper’s reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, demonstrating a +1.43 percentage point improvement in generalization. These results validate the original paper’s core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.
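[26] Scaling Ultrasound Volumetric Reconstruction via Mobile Augmented Reality cs.CV | cs.ET | cs.HC
Kian Wei Ng, Yujia Gao, Deborah Khoo, Ying Zhen Tan, Chengzheng Mao
TL;DR: This paper presents MARVUS (Mobile Augmented Reality Volumetric Ultrasound), a system that uses mobile augmented reality with conventional 2D ultrasound equipment to measure lesion volume accurately and reproducibly, addressing the high cost and poor portability of existing 3D ultrasound solutions.
Details
Motivation: In breast and thyroid imaging, volume estimates from 2D ultrasound show high inter-user variability, while existing 3D ultrasound solutions require specialized probes or external tracking hardware, which raises cost, reduces portability, and limits widespread clinical use.
Result: In a user study in which experienced clinicians measured breast phantoms, MARVUS substantially improved volume estimation accuracy (mean difference: 0.469 cm³) and reduced inter-user variability (mean difference: 0.417 cm³). Augmented reality visualization was also found to improve objective performance metrics and clinician-reported usability.
Insight: The innovation is combining mobile augmented reality with a foundation model to build a resource-efficient system that interoperates with conventional ultrasound systems and minimizes hardware requirements, improving the accuracy and reproducibility of volume measurement while keeping cost low and portability high. This may make ultrasound-based cancer screening and diagnostic and treatment workflows more accessible.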
Abstract: Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US suffer from high inter-user variability even among experienced clinicians. Existing 3D ultrasound (3D-US) solutions use specialized probes or external tracking hardware, but such configurations increase costs and diminish portability, constraining widespread clinical use. To address these limitations, we present Mobile Augmented Reality Volumetric Ultrasound (MARVUS), a resource-efficient system designed to increase accessibility to accurate and reproducible volumetric assessment. MARVUS is interoperable with conventional ultrasound (US) systems, using a foundation model to enhance cross-specialty generalization while minimizing hardware requirements relative to current 3D-US solutions. In a user study involving experienced clinicians performing measurements on breast phantoms, MARVUS yielded a substantial improvement in volume estimation accuracy (mean difference: 0.469 cm³) with reduced inter-user variability (mean difference: 0.417 cm³). Additionally, we show that augmented reality (AR) visualizations enhance objective performance metrics and clinician-reported usability. Collectively, our findings suggest that MARVUS can enhance US-based cancer screening, diagnostic workflows, and treatment planning in a scalable, cost-conscious, and resource-efficient manner. Usage video demonstration available (https://youtu.be/m4llYcZpqmM).
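[27] A Computer Vision Framework for Multi-Class Detection and Tracking in Soccer Broadcast Footage cs.CV | cs.AI
Daniel Tshiani
TL;DR: This paper proposes an end-to-end computer vision framework for multi-class detection and tracking in soccer matches from single-camera broadcast video, aiming to give teams with limited budgets a scalable, data-driven analysis method.
Details
Motivation: Professional clubs rely on expensive multi-camera or GPS systems for detailed data that smaller teams cannot afford. The paper explores whether player-level spatial information can be extracted from standard broadcast video instead.
Result: The pipeline detects and tracks players, referees, and goalkeepers well, with high precision, recall, and mAP50 scores, but ball detection remains the main challenge.
Insight: The contribution is combining a YOLO object detector with the ByteTrack tracking algorithm into a complete pipeline that extracts meaningful spatial information from single-camera broadcast video, reducing dependence on specialized hardware and making computer vision-based soccer analytics more widely accessible.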
Abstract: Clubs with access to expensive multi-camera setups or GPS tracking systems gain a competitive advantage through detailed data, whereas lower-budget teams are often unable to collect similar information. This paper examines whether such data can instead be extracted directly from standard broadcast footage using a single-camera computer vision pipeline. This project develops an end-to-end system that combines a YOLO object detector with the ByteTrack tracking algorithm to identify and track players, referees, goalkeepers, and the ball throughout a match. Experimental results show that the pipeline achieves high performance in detecting and tracking players and officials, with strong precision, recall, and mAP50 scores, while ball detection remains the primary challenge. Despite this limitation, our findings demonstrate that AI can extract meaningful player-level spatial information from a single broadcast camera. By reducing reliance on specialized hardware, the proposed approach enables colleges, academies, and amateur clubs to adopt scalable, data-driven analysis methods previously accessible only to professional teams, highlighting the potential for affordable computer vision-based soccer analytics.
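[28] Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams cs.CV | cs.AI
Aayam Bansal
TL;DR: This paper proposes Sketch2Feedback, a framework that provides rubric-aligned feedback on student hand-drawn diagrams in STEM education. Its grammar-in-the-loop design decomposes the problem into four stages (hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback) to limit language model hallucination. On two synthetic micro-benchmarks covering free-body diagrams and circuit schematics, results are mixed: an end-to-end large multimodal model achieves the highest micro-F1 but hallucinates heavily, while the grammar pipeline produces more actionable circuit feedback.
Details
Motivation: To address the long-standing challenge of providing timely, rubric-aligned feedback on student-drawn diagrams in STEM education, and to overcome the loss of trust caused by hallucination in existing large multimodal models (LMMs).
Result: The framework is evaluated on two synthetic benchmarks, FBD-10 and Circuit-10. The end-to-end LMM Qwen2-VL-7B achieves the highest micro-F1 (FBDs: 0.570; circuits: 0.528) but with extreme hallucination rates (0.78 and 0.98). In the LLM-as-judge evaluation, the grammar pipeline's circuit feedback scores higher on actionability (4.85/5) than the end-to-end LMM's (3.11/5). Confidence thresholding (tau=0.7) reduces the circuit hallucination rate from 0.970 to 0.880 with no F1 loss.
Insight: The main contribution is the grammar-in-the-loop decomposition: an upstream rule engine verifies constraints, and the language model verbalizes only the verified violations, combining interpretable symbolic reasoning with generation to control hallucination. The work also shows exploitable complementarity between grammar-based and end-to-end approaches and offers insight into domain-dependent robustness, for example on circuit diagrams.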
Abstract: Providing timely, rubric-aligned feedback on student-drawn diagrams is a persistent challenge in STEM education. While large multimodal models (LMMs) can jointly parse images and generate explanations, their tendency to hallucinate undermines trust in classroom deployments. We present Sketch2Feedback, a grammar-in-the-loop framework that decomposes the problem into four stages – hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback – so that the language model verbalizes only violations verified by an upstream rule engine. We evaluate on two synthetic micro-benchmarks, FBD-10 (free-body diagrams) and Circuit-10 (circuit schematics), each with 500 images spanning standard and hard noise augmentation tiers, comparing our pipeline against end-to-end LMMs (LLaVA-1.5-7B, Qwen2-VL-7B), a vision-only detector, a YOLOv8-nano learned detector, and an ensemble oracle. On n=100 test samples per benchmark with 95% bootstrap CIs, results are mixed and instructive: Qwen2-VL-7B achieves the highest micro-F1 on both FBDs (0.570) and circuits (0.528), but with extreme hallucination rates (0.78, 0.98). An ensemble oracle that selects the best prediction per sample reaches F1=0.556 with hallucination 0.320 on FBDs, demonstrating exploitable complementarity between grammar and end-to-end approaches. Confidence thresholding at tau=0.7 reduces circuit hallucination from 0.970 to 0.880 with no F1 loss. Hard noise augmentation reveals domain-dependent robustness: FBD detection is resilient while circuit detection degrades sharply. An LLM-as-judge evaluation confirms that the grammar pipeline produces more actionable circuit feedback (4.85/5) than the end-to-end LMM (3.11/5). We release all code, datasets, and evaluation scripts.
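Two of the quantities reported above, micro-F1 and confidence thresholding at tau, can be sketched as follows. The set-of-labels representation of violations and the function names are assumptions for illustration; the abstract does not define the hallucination rate precisely, so it is not reproduced here.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-sample sets of violation labels."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def apply_threshold(detections, tau=0.7):
    """Keep only labels whose detector confidence is at least tau."""
    return {label for label, confidence in detections if confidence >= tau}
```

Thresholding discards low-confidence detections before they reach the feedback stage, which lowers spurious output only as long as the discarded items were mostly wrong.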
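[29] JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments cs.CV | cs.AI | cs.SD
Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen
TL;DR: This paper proposes JAEGER, a framework that extends audio-visual large language models (AV-LLMs) from 2D perception to 3D space, enabling joint spatial grounding and reasoning by integrating RGB-D observations and multi-channel first-order ambisonics audio. A core contribution is the neural intensity vector (Neural IV), a learned spatial audio representation that improves direction estimation. The authors also build the SpatialSceneQA benchmark for training and evaluation. Experiments show that the approach outperforms 2D baselines across spatial perception and reasoning tasks.
Details
Motivation: Current AV-LLMs are largely limited to 2D perception (RGB video and monaural audio), a dimensionality mismatch that prevents reliable sound source localization and spatial reasoning in complex 3D environments. The paper aims to remove this limitation.
Result: Extensive experiments on SpatialSceneQA, a benchmark of 61k instruction-tuning samples built from simulated physical environments, show that the approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the need for explicit 3D modelling.
Insight: The main contributions are: 1) an overall framework extending AV-LLMs to 3D space; 2) the neural intensity vector (Neural IV), a new spatial audio representation that learns robust directional cues and improves direction-of-arrival estimation in adverse acoustic scenarios such as overlapping sources; and 3) the large-scale simulated dataset SpatialSceneQA for systematic evaluation. The work highlights the importance of fusing explicit 3D geometry with spatial audio in multimodal perception and suggests a direction for AI reasoning in physical environments.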
Abstract: Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
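[30] VLANeXt: Recipes for Building Strong VLA Models cs.CV | cs.AI | cs.RO
Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang
TL;DR: This paper presents VLANeXt, a Vision-Language-Action (VLA) model. By systematically analyzing the VLA design space under a unified framework and evaluation setup, it distills 12 key findings into a practical recipe for building strong VLA models. The model outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and generalizes well in real-world experiments.
Details
Motivation: Research on VLA models is fragmented and exploratory, and inconsistent training protocols and evaluation settings make it hard to tell which design choices actually matter. The paper reexamines the VLA design space under a unified framework to give the community clear structure and guidance.
Result: VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and shows strong generalization in real-world experiments.
Insight: The main contribution is a systematic dissection of the VLA design space along three dimensions (foundational components, perception essentials, and action modelling perspectives), from which 12 key findings are distilled into a reproducible, practical recipe. The authors plan to release a unified codebase as a common platform, which should help standardize and advance VLA research.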
Abstract: Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
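[31] Rodent-Bench cs.CV | cs.AI
Thomas Heap, Laurence Aitchison, Emma Cahill, Adriana Casado Rodriguez
TL;DR: This paper presents Rodent-Bench, a new benchmark for evaluating the ability of multimodal large language models (MLLMs) to annotate rodent behavior videos. It covers datasets spanning several behavioral paradigms (social interaction, grooming, scratching, and freezing), with videos from 10 to 35 minutes long. The authors evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash, and Qwen-VL-Max, and find that none performs well enough to serve as an effective assistant for this task.
Details
Motivation: To fill the gap in evaluating current MLLMs on scientific video annotation, particularly automatic labelling of rodent behavior in neuroscience research, and to provide a foundation for tracking progress in this area.
Result: The state-of-the-art MLLMs evaluated on Rodent-Bench (e.g., Gemini-2.5-Pro) perform poorly overall and cannot be used reliably as assistants for this task. Some models show modest performance on certain datasets (notably grooming detection) but face major challenges in temporal segmentation, handling long video sequences, and distinguishing subtle behavioral states. Evaluation uses standardized metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient.
Insight: The contribution is a novel MLLM benchmark dedicated to rodent behavior video annotation, provided in two versions to accommodate different model capabilities. Its systematic evaluation framework and its focus on long videos and fine-grained temporal behavior analysis give future MLLM work on scientific video understanding clear evaluation criteria and directions for improvement.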
Abstract: We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behavior footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
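Two of the listed metrics are easy to state concretely. The sketch below assumes one label per second of video and treats a single behavior as the positive class for the Matthews correlation coefficient (MCC); the benchmark's exact protocol, including how it handles multiple classes, may differ.

```python
import math

def second_wise_accuracy(pred, gold):
    """Fraction of seconds whose predicted label matches the annotation."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def matthews_corrcoef(pred, gold, positive):
    """Binary MCC for one behavior, computed from per-second labels."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    tn = sum(p != positive and g != positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is useful here because behaviors such as grooming occupy a small share of a long video, so accuracy alone can look high for a model that rarely predicts the behavior at all.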
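[32] Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification cs.CV
Massoud Dehghan, Ramona Woitek, Amirreza Mahbod
TL;DR: This paper systematically studies how the patch size of Vision Transformers (ViTs) affects fine-tuning performance in 2D and 3D medical image classification. Experiments on 12 medical imaging datasets show that smaller patch sizes (1, 2, 4) consistently improve classification performance at higher computational cost. Ensembling the predictions of models with several small patch sizes improves performance further.
Details
Motivation: Vision Transformers and their variants have become mainstream architectures for computer vision, but the effect of patch size, a key initial design parameter, remains underexplored, especially in medical imaging, where both 2D and 3D modalities exist.
Result: Experiments on 12 medical imaging datasets (seven 2D, five 3D) show that smaller patch sizes (1, 2, 4) give the most consistent improvements. On 2D datasets, patch size 2 improves balanced accuracy by up to 12.78% over patch size 28; on 3D datasets, patch size 1 improves it by up to 23.78% over patch size 14. Ensembling the predictions of models with patch sizes 1, 2, and 4 yields further gains in most cases, especially on 2D datasets.
Insight: The core contribution is systematically quantifying the effect of ViT patch size, a key hyperparameter, on 2D and 3D medical image classification and showing the advantage of small patch sizes. The study offers practical guidance for choosing this hyperparameter when fine-tuning ViTs on medical images and demonstrates that a simple ensemble strategy is effective.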
Abstract: Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT
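The computational cost mentioned above follows from how patch size sets the token count. The sketch below assumes non-overlapping patches and, hypothetically, inputs with 28 pixels or voxels per side (chosen only because every listed patch size divides 28); the abstract does not state the input resolution.

```python
def num_tokens(shape, patch):
    """Number of non-overlapping patches for an input of the given shape."""
    count = 1
    for side in shape:
        count *= side // patch
    return count

def relative_attention_cost(shape, patch, reference_patch):
    """Self-attention cost grows with the square of the token count."""
    return (num_tokens(shape, patch) / num_tokens(shape, reference_patch)) ** 2
```

Under these assumptions a 2D input yields 196 tokens at patch size 2 but a single token at patch size 28, and a 3D input yields 21,952 tokens at patch size 1 against 8 at patch size 14, which is why the smallest patches are the most expensive to fine-tune.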
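[33] Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space cs.CV
Aashish Chandra, Aashutosh A, Abhijit Das
TL;DR: This paper proposes a novel audio-visual narrating face generation method that synthesizes a realistic talking face and voice from a static image, a voice profile, and a target text. The method encodes the prompt text, the driving image, and the individual's voice profile, then feeds them into a multi-entangled latent space that supplies key-value pairs and queries to the audio and video generation pipelines.
Details
Motivation: Address the challenge of generating a synchronized, personalized talking-face video from a single static image, a voice profile, and a text prompt, with the goal of controllable, high-quality audio-visual content generation.
Result: The abstract reports no quantitative results or benchmarks; it claims the ability to generate realistic speaking and talking faces.
Insight: The novelty lies in a multi-entangled latent space that establishes spatiotemporal, person-specific feature associations across modalities, enabling multimodal fusion and joint generation from text, image, and voice profile.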
Abstract: We present a novel approach for generating realistic speaking and talking faces by synthesizing a person’s voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual and then combines them to pass them to the multi-entangled latent space to foster key-value pairs and queries for the audio and video modality generation pipeline. The multi-entangled latent space is responsible for establishing the spatiotemporal person-specific features between the modalities. Further, entangled features are passed to the respective decoder of each modality for output audio and video generation.
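[34] Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding cs.CV | cs.AI
Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen
TL;DR: This paper proposes Video-TwG, a curriculum-reinforced reasoning framework for long video understanding. Its core is the "Think-with-Grounding" paradigm, which lets a video LLM decide, during interleaved text-video reasoning, when to perform on-demand grounding of video clips, zooming into question-relevant clips only when necessary in order to mitigate hallucination.
Details
Motivation: Existing long video understanding methods rely mainly on text-only reasoning. Because long videos are temporally redundant, key details are easily overlooked under a limited video context length, which exacerbates hallucination. This paper targets that limitation of text-only reasoning under a fixed video context.
Result: Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong long video understanding baselines. Ablations validate the necessity of the two-stage reinforced curriculum strategy and show that TwG-GRPO makes better use of diverse unlabeled data, improving grounding quality and reducing redundant grounding without sacrificing QA performance.
Insight: The main contributions are: 1) the "Think-with-Grounding" paradigm of active, on-demand video grounding during reasoning; 2) a two-stage reinforced curriculum that first learns on a small short-video dataset with grounding labels and then scales to diverse general QA data with videos from diverse domains; 3) the TwG-GRPO algorithm, which combines a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism to handle complex reasoning; 4) a new TwG-51K dataset to support training.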
Abstract: Long video understanding is challenging due to rich and complicated multimodal clues spread over a long temporal range. Current methods adopt text-form reasoning to improve the model’s ability to analyze complex clues in long videos. However, text-only reasoning under a fixed video context may exacerbate hallucinations, since crucial details are often ignored under a limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos of diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism. Finally, we construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows that TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
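[35] HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing cs.CV
Ahmed Akl, Abdelwahed Khamis, Ali Cheraghian, Zhe Wang, Sara Khalifa
TL;DR: This paper proposes HIME (Hallucination Insensitivity Model Editing), a training-free method for mitigating object hallucination in large vision-language models (LVLMs). It systematically analyzes each layer's sensitivity to hallucination, introduces the Hallucination Insensitivity Score (HIS) as guidance, and applies a layer-adaptive weight-editing strategy, substantially reducing hallucination on multiple benchmarks while preserving the model's pre-trained knowledge.
Details
Motivation: LVLMs are prone to hallucination when describing objects (describing objects that do not exist or attributing incorrect information), which undermines reliable real-world deployment. Fine-tuning is computationally expensive and impractical, so training-free alternatives are needed. Model editing is a promising direction, but indiscriminate editing can damage pre-trained knowledge, which raises the question of how much to intervene at each layer to suppress hallucination while preserving knowledge.
Result: On open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, HIME reduces hallucinations by 61.8% on average without introducing additional parameters, inference latency, or computational overhead.
Insight: The novelty lies in systematically analyzing how LVLM decoder layers differ in their sensitivity to hallucination and in proposing HIS as a principled metric that quantifies this sensitivity and guides layer-adaptive weight editing. The result is a training-free, efficient, and targeted editing method that balances hallucination suppression against knowledge preservation. Viewed objectively, intervening per layer rather than globally gives fine-grained control over model behavior and has practical value.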
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones (Qwen, LLaMA, and Vicuna), revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer’s sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.
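[36] NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures cs.CV
Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu
TL;DR: This paper proposes NeXt2Former-CD, an end-to-end framework for remote sensing change detection that combines a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable-attention temporal fusion module, and a Mask2Former decoder, aiming to better handle co-registration noise, small object-level spatial shifts, and semantic ambiguity.
Details
Motivation: Explore modern convolutional and attention-based architectures as a competitive alternative to State Space Models (SSMs) for remote sensing change detection, addressing residual co-registration noise, small object-level spatial shifts, and semantic ambiguity in bi-temporal imagery.
Result: Experiments on LEVIR-CD, WHU-CD, and CDD show that the method achieves the best F1 score and IoU among the evaluated methods, including recent Mamba-based baselines, with inference latency comparable to SSM-based methods.
Insight: The novelty lies in integrating modern vision architectures (ConvNeXt, deformable attention, Mask2Former) into one unified framework that combines the local modeling of convolution with the global modeling of attention, improving change detection accuracy and robustness while keeping inference efficient.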
Abstract: State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.
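[37] Subtle Motion Blur Detection and Segmentation from Static Image Artworks cs.CV
Ganesh Samarth, Sibendu Paul, Solale Tabarestani, Caren Chen
TL;DR: This paper proposes SMBlurDetect, a framework for detecting and segmenting subtle motion blur in static images. It combines high-quality motion blur dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Realistic blur is synthesized over SAM-segmented regions through controllable camera and object motion simulation, and a U-Net detector trained with a hybrid mask-centric and image-centric strategy localizes subtle blur precisely.
Details
Motivation: Address the subtle motion blur that is pervasive in visual assets such as thumbnails on streaming services, which reduces visual clarity and hurts user trust and click-through rates. Existing methods and datasets focus on severe blur and lack the fine-grained pixel-level annotations needed for quality-critical applications.
Result: The method reaches 89.68% accuracy on GoPro (baseline: 66.50%) and 59.77% mean IoU on CUHK (baseline: 9.00%), a 6.6x improvement in segmentation, demonstrating strong zero-shot generalization.
Insight: The contributions are: 1) a high-quality dataset of subtle, spatially localized blur generated through controllable motion simulation and SAM segmentation; 2) a hybrid mask-centric and image-centric training strategy that incorporates curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution-aware augmentation; 3) precise localization of subtle blur, which supports automated filtering of low-quality frames and region-of-interest extraction for intelligent cropping.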
Abstract: Streaming services serve hundreds of millions of viewers worldwide, where visual assets such as thumbnails, box art, and cover images are critical for engagement. Subtle motion blur remains a pervasive quality issue, reducing visual clarity and negatively affecting user trust and click-through rates. However, motion blur detection from static images is underexplored, as existing methods and datasets focus on severe blur and lack the fine-grained pixel-level annotations needed for quality-critical applications. Benchmarks such as GoPro and NFS are dominated by strong synthetic blur and often contain residual blur in their sharp references, leading to ambiguous supervision. We propose SMBlurDetect, a unified framework combining high-quality, motion-blur-specific dataset generation with an end-to-end detector capable of zero-shot detection at multiple granularities. Our pipeline synthesizes realistic motion blur from super-high-resolution aesthetic images using controllable camera and object motion simulations over SAM-segmented regions, enhanced with alpha-aware compositing and balanced sampling to generate subtle, spatially localized blur with precise ground-truth masks. We train a U-Net-based detector with ImageNet-pretrained encoders using a hybrid mask-centric and image-centric strategy incorporating curriculum learning, hard negatives, focal loss, blur frequency channels, and resolution-aware augmentation. Our method achieves strong zero-shot generalization, reaching 89.68% accuracy on GoPro (vs. 66.50% baseline) and 59.77% mean IoU on CUHK (vs. 9.00% baseline), a 6.6x improvement in segmentation. Qualitative results show accurate localization of subtle blur artifacts, enabling automated filtering of low-quality frames and precise region-of-interest extraction for intelligent cropping.
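The idea of synthesizing spatially localized blur with an exact ground-truth mask can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a horizontal box kernel stands in for the paper's camera and object motion simulation, a hand-written alpha map stands in for a SAM-segmented region, and the function names are hypothetical.

```python
def horizontal_motion_blur(img, length):
    """Blur each row with a box kernel of `length` pixels (border pixels are clamped)."""
    h, w = len(img), len(img[0])
    half = length // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k in range(-half, length - half):
                xx = min(max(x + k, 0), w - 1)  # clamp to the image border
                acc += img[y][xx]
            out[y][x] = acc / length
    return out


def synthesize_local_blur(sharp, alpha, length):
    """Alpha-composite a blurred copy over the sharp image; return (image, mask).

    alpha[y][x] in [0, 1] marks the moving region (a stand-in for a SAM mask).
    The ground-truth mask is 1 wherever any blurred content was blended in.
    """
    blurred = horizontal_motion_blur(sharp, length)
    image = [
        [a * b + (1.0 - a) * s for s, b, a in zip(row_s, row_b, row_a)]
        for row_s, row_b, row_a in zip(sharp, blurred, alpha)
    ]
    mask = [[1 if a > 0 else 0 for a in row_a] for row_a in alpha]
    return image, mask


sharp = [[0.0, 0.0, 10.0, 0.0, 0.0]]
alpha = [[0.0, 0.0, 1.0, 1.0, 0.0]]
image, mask = synthesize_local_blur(sharp, alpha, length=3)
print(image, mask)
```

Because the mask is derived from the same alpha map used for compositing, the supervision is pixel-exact by construction, which is the property the abstract contrasts with benchmarks whose sharp references still contain residual blur.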
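[38] MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment cs.CV | cs.AI
Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Minh Anh, Shivank Garg
TL;DR: This paper presents the MiSCHiEF benchmark, which comprises two datasets for evaluating fine-grained image-caption alignment in the domains of safety (MiS) and culture (MiC). Both use a contrastive pair design to test how well vision-language models distinguish image-caption pairs that differ only minimally. Models are better at confirming correct pairs than at rejecting incorrect ones, and selecting the correct caption from two highly similar captions is easier than the converse task, revealing persistent modality alignment challenges in current models.
Details
Motivation: Address the weaknesses of vision-language models in fine-grained image-caption alignment, particularly in socially critical domains such as safety and culture, where misreading a subtle visual or linguistic cue can have significant real-world consequences. This calls for evaluating precise cross-modal alignment on subtle semantic and visual differences.
Result: Four vision-language models were evaluated on MiSCHiEF. Models perform better at confirming correct image-caption pairs than at rejecting incorrect ones, and accuracy is higher when selecting the correct caption from two highly similar captions for a given image than in the converse task of selecting the correct image. Overall, the results highlight the difficulty current models have with modality alignment.
Insight: The novelty lies in a contrastive-pair benchmark focused on fine-grained alignment in the safety and culture domains, which emphasizes performance on minimal differences. Viewed objectively, minimal-pair testing effectively exposes the limits of cross-modal alignment and points to ways of improving robustness in sensitive real-world applications.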
Abstract: Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
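[39] Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code cs.CV | cs.AI
Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng
TL;DR: This paper proposes a pipeline that synthesizes a multimodal geometry reasoning dataset, GeoCode, from scratch. The pipeline uses symbolic seed construction, verified instantiation, and code-based rendering to keep each problem consistent across structure, text, reasoning, and image, and it uses the plotting code in the dataset to introduce code prediction as an explicit visual-symbolic alignment objective.
Details
Motivation: Current vision-language models struggle with complex geometric constructions, mainly because training data is limited and visual-symbolic alignment is weak. This paper targets data scarcity and insufficient alignment in multimodal geometry reasoning.
Result: Extensive experiments on multiple geometry benchmarks show that training on GeoCode brings consistent improvements, supporting the effectiveness of both the dataset and the proposed alignment strategy.
Insight: The novelty lies in a synthesis pipeline that generates complex multimodal geometry problems from scratch while ensuring mathematical correctness, and in treating code prediction as an explicit supervised structured prediction task that tightly aligns visual understanding with symbolic reasoning.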
Abstract: Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision–language models struggle with complex geometric constructions due to limited training data and weak visual–symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.
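[40] MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions cs.CV
Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao
TL;DR: This paper proposes the MIRROR framework, which strengthens the multimodal reasoning of vision-language models (VLMs) through reflection grounded in visual regions. Visual reflection is the core mechanism: a closed loop of draft, critique, region-based verification, and revision. The authors also build the ReflectV dataset for training. Experiments show that MIRROR improves answer correctness and reduces visual hallucination on multiple benchmarks.
Details
Motivation: Existing vision-language models tend to hallucinate or make logic errors on ambiguous or complex visual inputs, and even when prompted to "reflect", their corrections may remain detached from the image evidence. A mechanism for iterative reasoning and verification based on visual evidence is therefore needed.
Result: Experiments on general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucination.
Insight: The core idea is to formalize reflection as an evidence-driven verification loop over visual regions rather than a purely textual revision step. This is realized through supervised training on the dedicated ReflectV dataset and underscores the importance of visual grounding in iterative reasoning.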
Abstract: In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to “reflect”, their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct ReflectV, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
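[41] Benchmarking Computational Pathology Foundation Models For Semantic Segmentation cs.CV
Lavish Ramchandani, Aashay Tinaikar, Dev Kumar Das, Rohit Garg, Tijo Thomas
TL;DR: This study systematically evaluates 10 foundation models on semantic segmentation of histopathology images. It proposes a fast benchmarking approach that requires no fine-tuning: attention maps serve as pixel-wise features and XGBoost performs the classification. The vision-language model CONCH performs best, and fusing features from multiple models substantially improves segmentation.
Details
Motivation: Systematic, independent evaluations of foundation models for pixel-level semantic segmentation in histopathology are lacking. The study aims to compare models on morphological tissue-region and cellular/nuclear segmentation tasks through standardized benchmarking.
Result: Experiments on four histopathology datasets show that the vision-language model CONCH performs best, with PathDino second. Concatenating features from CONCH, PathDino, and CellViT outperforms the individual models by 7.95% on average across the datasets.
Insight: The novelty lies in a model-agnostic evaluation framework based on attention maps and XGBoost that enables fast evaluation without fine-tuning. The key finding is that models trained on different pathology data are complementary and that feature fusion improves generalization, which supports multi-model ensembles in computational pathology.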
Abstract: In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel-wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model-agnostic evaluation without fine-tuning. We show that the vision-language foundation model CONCH performed the best across datasets when compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all the datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.
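[42] Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement cs.CV | cs.GR
Yuran Dong, Hang Dai, Mang Ye
TL;DR: This paper proposes the EditedID framework, which uses alignment, disentanglement, and entanglement strategies to counter the loss of facial identity consistency when multimodal large models edit portraits, providing training-free, plug-and-play facial identity restoration.
Details
Motivation: Existing multimodal editing large models suffer from declining facial identity consistency in realistic portrait editing, mainly because of cross-source distribution bias and cross-source feature contamination, which hinders practical deployment.
Result: Extensive experiments show that EditedID achieves state-of-the-art performance in preserving both the original facial identity and the consistency of edited elements, setting a new benchmark for single-person and multi-person facial identity restoration in open-world settings.
Insight: By analyzing diffusion trajectories, sampler behaviors, and attention properties, the paper introduces an adaptive mixing strategy, a hybrid solver, and an attentional gating mechanism that systematically handle the alignment, disentanglement, and entanglement aspects of identity restoration.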
Abstract: Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent and long-standing limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye’s high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination. To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) Adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process. 2) Hybrid solver that disentangles source-specific identity attributes and details. 3) Attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency. As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios. The code is available at https://github.com/NDYBSNDY/EditedID.
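[43] TAG: Thinking with Action Unit Grounding for Facial Expression Recognition cs.CV | cs.AI
Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu
TL;DR: This paper proposes the TAG framework, which explicitly anchors multimodal reasoning to facial Action Units (AUs) to improve the robustness and interpretability of facial expression recognition. Combining supervised fine-tuning with reinforcement learning, the method outperforms existing vision-language models on multiple datasets and reduces hallucination.
Details
Motivation: Existing vision-language models often produce reasoning for facial expression recognition that is weakly tied to visual evidence and cannot be verified, leading to poor cross-dataset robustness and hallucination. A grounded reasoning method based on verifiable facial cues is therefore needed.
Result: On RAF-DB, FERPlus, and AffectNet, TAG outperforms both open-source and closed-source vision-language model baselines while improving visual faithfulness. Ablations show that AU-grounded rewards stabilize reasoning and reduce hallucination.
Insight: The novelty lies in using AUs as a structured intermediate representation that constrains multimodal reasoning, with an AU-aware reward that aligns predicted regions with external AU detectors. This makes the reasoning more verifiable and robust and suggests a direction for trustworthy multimodal reasoning.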
Abstract: Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision–language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision–language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG .
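[44] Initialization matters in few-shot adaptation of vision-language models for histopathological image classification cs.CV
Pablo Meseguer, Rocío del Amor, Valery Naranjo
TL;DR: This paper proposes Zero-Shot Multiple-Instance Learning (ZS-MIL) for whole-slide histopathology image classification, where few-shot multiple instance learning frameworks perform poorly because classifier weights are randomly initialized. The method uses the class-level embeddings of the vision-language model's text encoder as the starting point of the classification layer to compute each sample's bag-level probabilities, improving few-shot adaptation.
Details
Motivation: In few-shot adaptation, a randomly initialized linear classifier in a multiple instance learning framework can perform worse than zero-shot prediction. The paper aims to improve supervised fine-tuning of vision-language models for histopathological image classification.
Result: Across multiple experiments in an efficient transfer learning scenario for few-shot subtyping prediction, ZS-MIL is more robust than well-known weight initialization techniques in terms of both performance and variability.
Insight: The novelty lies in using the vision-language model's text encoder embeddings directly to initialize the multiple instance learning classification layer, which avoids the uncertainty of random initialization and yields more stable and better classification under few-shot conditions.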
Abstract: Vision language models (VLM) pre-trained on datasets of histopathological image-caption pairs enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door for supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require the incorporation of multiple instance learning (MIL) due to the gigapixel size of the WSI. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on Linear Probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization that underperform zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer’s starting point to compute each sample’s bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques both in terms of performance and variability in an ETL few-shot scenario for subtyping prediction.
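The initialization idea can be sketched as follows. This is a minimal illustration rather than the authors' implementation: mean pooling as the MIL aggregator, cosine-similarity logits, and a CLIP-style temperature of 100 are all assumptions, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mean_pool(patch_features):
    """Aggregate patch-level features into one slide-level (bag) embedding.

    Assumption: the abstract does not say which MIL aggregator is used.
    """
    dim = len(patch_features[0])
    n = len(patch_features)
    return [sum(p[i] for p in patch_features) / n for i in range(dim)]


def init_classifier_from_text(class_text_embeddings):
    """Use one normalized text embedding per class as the initial weight row; zero bias."""
    weights = [l2_normalize(e) for e in class_text_embeddings]
    bias = [0.0] * len(class_text_embeddings)
    return weights, bias


def bag_probabilities(patch_features, weights, bias, temperature=100.0):
    """Bag-level class probabilities from a linear layer on the pooled embedding."""
    z = l2_normalize(mean_pool(patch_features))
    logits = [
        temperature * sum(w_i * z_i for w_i, z_i in zip(w, z)) + b
        for w, b in zip(weights, bias)
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D embeddings for two classes and a bag of two patches (made-up numbers).
W, b = init_classifier_from_text([[2.0, 0.0], [0.0, 3.0]])
print(bag_probabilities([[0.9, 0.1], [0.8, 0.3]], W, b))
```

With this starting point the linear layer reproduces a zero-shot-style prediction before any gradient step, so few-shot training refines the weights from there instead of from random values.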
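[45] Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection cs.CV
Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen
TL;DR: This paper proposes LMP, a dual-branch detector for cross-domain few-shot object detection (CD-FSOD). It learns multi-modal prototypes by combining textual guidance with visual exemplars drawn from the target domain, in order to detect novel classes in unseen target domains from only a few labeled examples.
Details
Motivation: Existing open-vocabulary detectors built on vision-language models (VLMs) depend mainly on text prompts, which encode domain-invariant semantics but lack the domain-specific visual information needed for precise localization under few-shot supervision. A method that uses both textual semantics and visual exemplars is needed to improve cross-domain few-shot detection.
Result: On six cross-domain benchmark datasets under the standard 1/5/10-shot settings, the method achieves state-of-the-art or highly competitive mAP.
Insight: The contributions are: 1) a Visual Prototype Construction module that aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds; 2) a dual-branch architecture in which the visual-guided branch injects prototypes into the detection pipeline and the text-guided branch preserves open-vocabulary semantics, with both branches trained jointly and ensembled at inference to combine semantic abstraction with domain-adaptive detail.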
Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.
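The two ingredients of the Visual Prototype Construction module can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's procedure: prototypes are taken as the mean of support RoI features, and a jittered box counts as a hard negative when its IoU with the ground-truth box falls in a chosen range; the range, the shift magnitude, and all function names are hypothetical.

```python
import random


def class_prototypes(support_features, support_labels):
    """Average the RoI feature vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def jittered_negatives(box, n, max_shift=0.5, iou_range=(0.1, 0.5), seed=0, max_tries=1000):
    """Sample shifted copies of `box` that overlap it only partially (assumed criterion)."""
    rng = random.Random(seed)
    w, h = box[2] - box[0], box[3] - box[1]
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        dx = rng.uniform(-max_shift, max_shift) * w
        dy = rng.uniform(-max_shift, max_shift) * h
        cand = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
        if iou_range[0] <= iou(box, cand) <= iou_range[1]:
            negatives.append(cand)
    return negatives


protos = class_prototypes([[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]], ["a", "a", "b"])
negs = jittered_negatives((10.0, 10.0, 50.0, 50.0), n=5)
print(protos, len(negs))
```

In a real detector the features of the jittered boxes would be pooled from the query image and turned into negative prototypes; the sketch stops at box sampling because the abstract does not specify that step.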
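[46] Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations cs.CV
Xiaoyu Dong, Jiahuan Li, Ziteng Cui, Naoto Yokoya
TL;DR: This paper proposes RobSelf, a fully self-supervised cross-modal super-resolution method for real-world pairs of low-resolution source images and high-resolution guide images that are spatially misaligned. It needs no training data, ground-truth supervision, or pre-alignment; it is optimized online using a misalignment-aware feature translator and a content-aware reference filter. It achieves state-of-the-art performance and efficiency across a variety of tasks, and the paper also introduces the RealMisSR dataset.
Details
Motivation: Address real-world cross-modal super-resolution where only a limited number of low-resolution source and high-resolution guide image pairs with complex spatial misalignments are available, without relying on training data, ground-truth supervision, or pre-alignment.
Result: Across a variety of tasks, RobSelf achieves state-of-the-art performance and superior efficiency. The paper also introduces the real-world RealMisSR dataset.
Insight: The contributions are: 1) a misalignment-aware feature translator that reformulates unsupervised cross-modal, cross-resolution alignment as a weakly supervised, misalignment-aware translation subtask; 2) a content-aware reference filter that performs reference-based discriminative self-enhancement guided by the aligned guide feature, producing super-resolution predictions with high resolution and high fidelity; 3) a fully self-supervised, online-optimized framework that needs no external data or alignment preprocessing.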
Abstract: We study cross-modal super-resolution (SR) on real-world misaligned data, where only a limited number of low-resolution (LR) source and high-resolution (HR) guide image pairs with complex spatial misalignments are available. To address this challenge, we propose RobSelf, a fully self-supervised model that is optimized online, requiring no training data, ground-truth supervision, or pre-alignment. RobSelf features two key techniques: a misalignment-aware feature translator and a content-aware reference filter. The translator reformulates unsupervised cross-modal and cross-resolution alignment as a weakly-supervised, misalignment-aware translation subtask, producing an aligned guide feature with inherent redundancy. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source, enabling SR predictions with high resolution and high fidelity. Across a variety of tasks, we demonstrate that RobSelf achieves state-of-the-art performance and superior efficiency. Additionally, we introduce a real-world dataset, RealMisSR, to advance research on this topic. Dataset and code: https://github.com/palmdong/RobSelf.
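[47] Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs cs.CV
Chengwei Xia, Fan Ma, Ruijie Quan, Yunqiu Xu, Kun Zhan
TL;DR: This paper proposes a framework for generating copyright triggers for multimodal large language models (MLLMs), using adversarial-guided dual injection to embed verifiable ownership information. The resulting trigger images elicit ownership-related textual responses in fine-tuned derivatives of the original model while remaining inert in non-derivative models, which makes model lineage traceable.
Details
Motivation: With the rapid deployment and wide adoption of MLLMs, disputes over model version attribution and ownership are increasingly frequent, raising significant concerns about intellectual property protection. The paper addresses how to embed verifiable ownership information into MLLMs.
Result: Extensive experiments show that the dual-injection approach effectively tracks model lineage under various fine-tuning and domain-shift scenarios.
Insight: The novelty lies in an adversarial-guided dual-injection mechanism. The first injection operates at the text level, through a consistency loss between the output of an auxiliary MLLM and a predefined ownership-relevant target text. The second operates at the semantic level, by minimizing the distance between the CLIP features of the image and those of the target text. In addition, adversarial training against an auxiliary model derived from the original model improves robustness in heavily fine-tuned derivatives.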
Abstract: With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This auxiliary model is specifically trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.
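[48] DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference cs.CV | cs.AI
Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
TL;DR: This paper proposes DUET-VLM, a dual-stage, unified, efficient token compression framework for training and inference of vision-language models (VLMs). It first applies redundancy-aware compression to the vision encoder's output, then drops less important visual tokens layer by layer inside the language backbone under text guidance, greatly reducing computation while preserving semantic integrity.
Details
Motivation: Vision-language models are computationally expensive because of dense visual tokenization, and existing efficiency methods often trade accuracy for speed. The paper aims for more efficient compression without losing critical semantics, through coordinated token management.
Result: On LLaVA-1.5-7B, the method retains over 99% of baseline accuracy with 67% fewer tokens and still retains >97% at 89% reduction. With dual-stage compression applied during training, it reaches 99.7% accuracy at 67% reduction and 97.6% at 89% reduction, surpassing prior state-of-the-art visual token reduction methods on multiple benchmarks. Integrated into Video-LLaVA-7B, it exceeds 100% of baseline accuracy with a 53.1% token reduction and retains 97.6% accuracy at an extreme 93.4% reduction.
Insight: The novelty lies in a coordinated two-stage compression strategy that pairs vision-side redundancy-aware compression with text-guided, layer-wise token dropping inside the language backbone. It supports efficient token reduction under end-to-end training, produces compact yet semantically rich representations within the same computational budget, and can be integrated into existing VLMs as a plug-and-play module.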
Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy aware compression of vision encoder’s output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline – achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
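The second stage, text-guided dropping of visual tokens, can be sketched as a top-k selection, together with the token budgets that the reported reduction rates imply. This is a simplified illustration, not the DUET-VLM implementation: the 576-token baseline is an assumption (24 x 24 patches, as in LLaVA-1.5 at 336-pixel resolution), the saliency scores are made up, and the function names are hypothetical.

```python
BASELINE_VISUAL_TOKENS = 576  # assumption: 24 x 24 patch grid


def tokens_kept(total: int, reduction: float) -> int:
    """Number of visual tokens left after removing a fraction `reduction` of them."""
    return round(total * (1.0 - reduction))


def text_guided_keep(visual_tokens, text_saliency, keep):
    """Keep the `keep` visual tokens with the highest text-derived saliency.

    text_saliency[i] stands for how strongly the text attends to visual token i.
    Surviving tokens stay in their original order so positional structure is preserved.
    """
    ranked = sorted(range(len(visual_tokens)), key=lambda i: text_saliency[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [visual_tokens[i] for i in kept_indices], kept_indices


for reduction in (0.67, 0.89):
    print(reduction, tokens_kept(BASELINE_VISUAL_TOKENS, reduction))

tokens = ["t0", "t1", "t2", "t3", "t4"]
saliency = [0.10, 0.50, 0.05, 0.30, 0.05]
print(text_guided_keep(tokens, saliency, keep=2))
```

Under the 576-token assumption, the 67% and 89% settings correspond to roughly 190 and 63 visual tokens. A layer-wise schedule would call the selection repeatedly with a shrinking `keep` as tokens pass through the language backbone.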
[49] Open-Vocabulary Domain Generalization in Urban-Scene Segmentation cs.CVPDF
Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe
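TL;DR: This paper introduces a new task, Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), which targets generalization of segmentation models to both unseen domains and unseen categories. The authors build the first OVDG-SS benchmark for autonomous driving and propose S2-Corr, a state-space-based text-image correlation refinement mechanism that mitigates the distortion of text-image correlations that domain shift causes in pre-trained vision-language models.
Details
Motivation: Conventional domain-generalized semantic segmentation methods are limited to a fixed set of known categories, while existing open-vocabulary segmentation models are sensitive to domain shift and lack robustness when deployed in the real open world, especially in urban driving scenes. This paper aims to fill that gap by jointly addressing unseen domains and unseen categories.
Result: Extensive experiments on the authors' autonomous-driving OVDG-SS benchmark, which covers synthetic-to-real and real-to-real generalization, show that S2-Corr outperforms existing open-vocabulary semantic segmentation methods in both cross-domain performance and efficiency.
Insight: The core contribution is the first definition of the OVDG-SS task together with a corresponding benchmark. Methodologically, S2-Corr builds on the observation that domain shift distorts text-image correlations in pre-trained VLMs, and refines these correlations with a state-space-driven mechanism to improve robustness under distribution changes.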
Abstract: Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text-image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that mitigates domain-induced distortions and produces more consistent text-image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.
[50] Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation cs.CVPDF
Shile Li, Markus Karmann, Onay Urfalioglu
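TL;DR: This paper proposes an end-to-end joint post-training quantization framework for Vision Transformers (ViTs) that needs no labeled data. It jointly optimizes all layers and inter-block dependencies, and quantizing ViT-small takes only one hour on a single GPU. The method achieves SOTA W4A4 and W3A3 accuracy on ImageNet and, for the first time, maintains strong accuracy on ViT, DeiT, and Swin-T under the extremely low-bit W1.58A8 setting. It also introduces a data-free calibration strategy that uses learned multi-mode prompts to guide Stable Diffusion Turbo in synthesizing diverse, label-free samples, performing on par with calibration on real ImageNet data.
Details
Motivation: Existing post-training quantization and block-wise reconstruction methods for ViTs cannot jointly optimize all layers and inter-block dependencies, rely on labeled data, or are computationally inefficient. The goal is efficient, label-free, end-to-end quantization that facilitates edge deployment.
Result: On ImageNet, W4A4 and W3A3 quantization reach SOTA accuracy. Under the extremely low-bit W1.58A8 setting, strong accuracy is maintained for the first time on ViT, DeiT, and Swin-T, surpassing prior methods. The data-free calibration strategy performs on par with calibration on real ImageNet data and outperforms simple text-prompt baselines.
Insight: Contributions include an efficient end-to-end joint quantization framework that needs no labeled data, and a data-free calibration strategy that learns multi-mode prompts to guide Stable Diffusion Turbo toward diverse samples, improving quantization performance. Viewed objectively, the joint optimization and the prompt-guided data generation are both worth borrowing, particularly for resource-constrained settings.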
Abstract: We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as “a
[51] Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning cs.CVPDF
Zhuofan Xie, Zishan Lin, Jinliang Lin, Jie Qi, Shaohua Hong
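TL;DR: This paper proposes the Similarity-as-Evidence (SaE) framework for calibrating the overconfidence of vision-language models in medical-image active learning. It reinterprets the text-image similarity vector as evidence and parameterizes a Dirichlet distribution over labels, explicitly quantifying lack of evidence and conflicting evidence. On this basis, SaE uses a dual-factor acquisition strategy that prioritizes high-uncertainty samples at different stages of active learning, improving selection efficiency and clinical interpretability.
Details
Motivation: In medical-image active learning with scarce labels (the cold-start problem), zero-shot predictions from vision-language models become overconfident because softmax outputs are interpreted deterministically. This misleads sample selection and wastes annotation budget.
Result: On ten public medical imaging datasets with a 20% label budget, SaE reaches a state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset it also achieves superior calibration, with a negative log-likelihood of 0.425.
Insight: The core idea is to treat the similarity vector as evidence and model it with a Dirichlet distribution, which explicitly separates and quantifies two types of uncertainty: lack of evidence (vacuity) and conflicting evidence (dissonance). This gives active learning a more interpretable selection criterion and effectively mitigates model overconfidence.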
Abstract: Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
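To make vacuity and dissonance concrete, the sketch below computes both from a vector of non-negative evidence using the standard subjective-logic definitions (Dirichlet parameters alpha_k = e_k + 1, vacuity u = K / S with S the sum of alpha_k). These formulas are an assumption here: the abstract does not give the exact form used by the Similarity Evidence Head, and the mapping from similarities to evidence is not shown. The function name `dirichlet_uncertainty` is hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Vacuity and dissonance of a Dirichlet opinion built from evidence.

    evidence: non-negative evidence per class (assumed to come from a head
              that maps text-image similarities to evidence, not shown).
    Returns (vacuity, dissonance).
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]          # Dirichlet parameters
    strength = sum(alpha)                        # S = sum_k alpha_k
    belief = [e / strength for e in evidence]    # b_k = e_k / S
    vacuity = num_classes / strength             # u = K / S

    def balance(bj, bk):
        # Relative balance: 1 when two beliefs are equal, 0 when one is zero.
        return 1.0 - abs(bj - bk) / (bj + bk) if bj + bk > 0 else 0.0

    dissonance = 0.0
    for k, bk in enumerate(belief):
        others = [bj for j, bj in enumerate(belief) if j != k]
        total = sum(others)
        if total > 0:
            dissonance += bk * sum(bj * balance(bj, bk) for bj in others) / total
    return vacuity, dissonance


no_evidence = dirichlet_uncertainty([0.0, 0.0, 0.0])   # nothing known yet
conflict = dirichlet_uncertainty([20.0, 20.0, 0.0])    # two classes tie
confident = dirichlet_uncertainty([40.0, 0.0, 0.0])    # one class dominates
print(no_evidence, conflict, confident)
```

Under these definitions, the first case has maximal vacuity and would be favored in early rounds, while the second has low vacuity but high dissonance and would be favored later, matching the dual-factor acquisition order described in the abstract.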
[52] Enhancing 3D LiDAR Segmentation by Shaping Dense and Accurate 2D Semantic Predictions cs.CVPDF
Xiaoyu Dong, Tiankui Xian, Wanshui Gan, Naoto Yokoya
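TL;DR: This paper proposes MM2D3D, a multi-modal segmentation model that improves semantic segmentation of 3D LiDAR point clouds by producing dense and accurate 2D semantic predictions. Using camera images as auxiliary data, it applies cross-modal guided filtering and dynamic cross pseudo supervision to overcome the sparsity of projected LiDAR and label maps, raising the final 3D accuracy.
Details
Motivation: In 3D LiDAR semantic segmentation, the intrinsic sparsity of projected LiDAR and label maps makes the intermediate 2D semantic predictions sparse and inaccurate, which in turn limits the final 3D accuracy.
Result: Experiments show that the method outperforms previous methods in both 2D and 3D spaces, producing intermediate 2D semantic predictions that are densely distributed and more accurate, and effectively improving the final 3D accuracy.
Insight: The novelty is using the dense semantic relations of camera images (via cross-modal guided filtering) and their dense distribution (via dynamic cross pseudo supervision) to constrain and guide the sparse projected LiDAR data, so that better 2D predictions feed back into 3D segmentation. It is an effective way to exploit complementary multi-modal information against data sparsity.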
Abstract: Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real-world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in return limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi-modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross-modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.
[53] FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model cs.CV | cs.AIPDF
Zhou Liu, Tonghua Su, Hongshi Zhang, Fuxiang Yang, Donglin Di
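TL;DR: This paper proposes FOCA, a framework based on a multimodal large language model that integrates discriminative features from the RGB spatial and frequency domains through a cross-attention fusion module. It targets three problems in image forgery detection and localization: over-reliance on semantic content, neglect of textural cues, and limited interpretability. Besides detecting and localizing forgeries accurately, it provides human-interpretable cross-domain explanations.
Details
Motivation: Existing image forgery detection and localization methods rely too heavily on semantic content while neglecting textural cues, and offer limited interpretability for subtle low-level tampering traces. The aim is to improve both detection performance and interpretability.
Result: In extensive experiments, FOCA outperforms state-of-the-art methods in detection performance and interpretability across both the spatial and frequency domains, and its effectiveness is validated on the newly introduced large-scale dataset FSE-Set.
Insight: The novelty is integrating spatial and frequency-domain features within a multimodal large language model and using cross-attention fusion to produce cross-domain explanations. Viewed objectively, combining frequency-domain analysis with explainable AI offers a new multimodal approach to forgery detection.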
Abstract: Advances in image tampering techniques, particularly generative models, pose significant challenges to media verification, digital forensics, and public trust. Existing image forgery detection and localization (IFDL) methods suffer from two key limitations: over-reliance on semantic content while neglecting textural cues, and limited interpretability of subtle low-level tampering traces. To address these issues, we propose FOCA, a multimodal large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains via a cross-attention fusion module. This design enables accurate forgery detection and localization while providing explicit, human-interpretable cross-domain explanations. We further introduce FSE-Set, a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations. Extensive experiments show that FOCA outperforms state-of-the-art methods in detection performance and interpretability across both spatial and frequency domains.
[54] PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation cs.CV | cs.GRPDF
Dan Wang, Xinrui Cui, Serge Belongie, Ravi Ramamoorthi
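TL;DR: PhysConvex is a physics-informed 3D dynamic convex radiance field that unifies visual rendering and physical simulation. It represents deformable radiance fields with convex primitives grounded in continuum mechanics and models non-uniform deformation through a boundary-driven dynamic convex representation. It also introduces a reduced-order convex simulation based on neural skinning eigenmodes to handle complex geometries and heterogeneous materials efficiently.
Details
Motivation: Existing neural representations such as NeRF and 3DGS excel at visual reconstruction but struggle to capture complex material deformation and dynamics, so a method is needed that reconstructs and simulates dynamic 3D scenes with both visual realism and physical consistency.
Result: Experiments show that PhysConvex reconstructs geometry, appearance, and physical properties from video with high fidelity and outperforms existing methods.
Insight: The novelty lies in combining physical simulation with neural radiance fields: convex primitives give compact, gap-free volumetric coverage, and reduced-order simulation with neural skinning eigenmodes handles dynamic deformation efficiently, improving both geometric efficiency and simulation fidelity.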
Abstract: Reconstructing and simulating dynamic 3D scenes with both visual realism and physical consistency remains a fundamental challenge. Existing neural representations, such as NeRFs and 3DGS, excel in appearance reconstruction but struggle to capture complex material deformation and dynamics. We propose PhysConvex, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation. PhysConvex represents deformable radiance fields using physically grounded convex primitives governed by continuum mechanics. We introduce a boundary-driven dynamic convex representation that models deformation through vertex and surface dynamics, capturing spatially adaptive, non-uniform deformation, and evolving boundaries. To efficiently simulate complex geometries and heterogeneous materials, we further develop a reduced-order convex simulation that advects dynamic convex fields using neural skinning eigenmodes as shape- and material-aware deformation bases with time-varying reduced DOFs under Newtonian dynamics. Convex dynamics also offers compact, gap-free volumetric coverage, enhancing both geometric efficiency and simulation fidelity. Experiments demonstrate that PhysConvex achieves high-fidelity reconstruction of geometry, appearance, and physical properties from videos, outperforming existing methods.
[55] SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World cs.CVPDF
Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak
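TL;DR: This paper proposes SafeDrive, an end-to-end driving planning framework that performs explicit and interpretable safety reasoning through a trajectory-conditioned sparse world model. It comprises two complementary networks. The Sparse World Network (SWNet) builds trajectory-conditioned sparse worlds that simulate the future behavior of critical dynamic agents and road entities, and the Fine-grained Reasoning Network (FRNet) evaluates agent-specific collision risks and temporal adherence to drivable regions, pinpointing safety-critical events across future timesteps.
Details
Motivation: The end-to-end (E2E) paradigm maps sensor inputs directly to driving decisions and has drawn attention for its unified modeling capability and scalability, but ensuring safety within this unified framework remains a key challenge.
Result: SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions in 12,146 scenarios (0.5%); on Bench2Drive it attains a driving score of 66.8%.
Insight: The novelty is introducing a trajectory-conditioned sparse world model for explicit safety reasoning: SWNet provides interaction-centric representations and FRNet performs fine-grained risk assessment on top of them, improving both safety and interpretability in end-to-end driving.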
Abstract: The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.
[56] SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google’s Native Multimodal Model cs.CV | cs.HCPDF
Luca Cazzaniga
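TL;DR: This paper proposes SCHEMA, a structured prompt engineering methodology designed specifically for Google Gemini 3 Pro Image. Built on systematic professional practice covering approximately 4,800 generated images, it provides a framework with a three-tier progressive system, a modular label architecture, a decision tree, and documented model limitations. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, and substantially higher inter-generation coherence for structured prompts in a batch consistency test.
Details
Motivation: Generic prompt guidelines and model-agnostic tips lack structured, systematic guidance for controlling image generation on native multimodal models such as Gemini 3 Pro Image. The paper aims to provide an engineered framework for controlled AI image generation in professional domains.
Result: In an evaluation across six professional domains, structured prompts show a 91% Mandatory compliance rate and a 94% Prohibitions compliance rate, and a batch consistency test shows substantially higher inter-generation coherence for structured prompts. The paper also reports independent practitioner validation (n=40), and a dedicated information design validation across approximately 300 publicly verifiable infographics shows >95% first-generation compliance for spatial and typographical control.
Insight: The novelty is an engineered, structured prompt framework for one specific model (Gemini 3 Pro Image), with scalable control tiers, modular components, and explicit decision routing. It moves prompt engineering from empirical tips toward a systematic methodology and backs it with quantified compliance and consistency validation.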
Abstract: This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, commercial product photography, editorial content, storyboards, commercial campaigns, and information design. The methodology introduces a three-tier progressive system (BASE, MEDIO, AVANZATO) that scales practitioner control from exploratory (approximately 5%) to directive (approximately 95%), a modular label architecture with 7 core and 5 optional structured components, a decision tree with explicit routing rules to alternative tools, and systematically documented model limitations with corresponding workarounds. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation (n=40), and a dedicated Information Design validation demonstrating >95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics. Previously published on Zenodo (doi:10.5281/zenodo.18721380).
[57] Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates cs.CVPDF
Shengjie Zhu, Ahmed Abdelkader, Mark J. Matthews, Xiaoming Liu, Wen-Sheng Chu
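TL;DR: This paper proposes Marginalized Bundle Adjustment (MBA), a new method for integrating Monocular Depth Estimation (MDE) into Structure-from-Motion (SfM). MBA leverages the density of MDE depth maps to mitigate their high error variance, recovering camera poses and scene geometry from multi-view images with robust performance across settings of different scales.
Details
Motivation: Although deep learning enables MDE to predict depth accurately from a single image, integrating it into a conventional SfM pipeline remains difficult, because the dense depth maps produced by MDE have much higher error variance than conventional sparse point clouds.
Result: In extensive evaluations, the method achieves state-of-the-art (SoTA) or competitive results on SfM and camera relocalization tasks, and performs consistently and robustly from few-frame setups to large multi-view systems with thousands of images.
Insight: The core contribution is Marginalized Bundle Adjustment (MBA), inspired by modern RANSAC estimators. By exploiting the density of MDE depth maps to mitigate their error variance, it makes dense but noisy MDE depth maps usable for accurate multi-view geometry, which highlights the significant potential of MDE in multi-view 3D vision.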
Abstract: Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.
[58] Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation cs.CVPDF
Kaiming Jin, Yuefan Wu, Shengqiong Wu, Bobo Li, Shuicheng Yan
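TL;DR: This paper proposes DACo, a dual-agent framework for vision-and-language scene navigation. It decouples global planning from local execution: a Global Commander performs high-level strategic planning, and a Local Operative handles egocentric observation and fine-grained execution, addressing the cognitive overload and instruction drift that existing methods suffer in long-horizon navigation.
Details
Motivation: Existing vision-and-language navigation methods either rely on multiple agents, which incurs high coordination and resource costs, or adopt a single-agent paradigm in which one agent handles both global planning and local perception, leading to degraded reasoning and instruction drift in long-horizon settings.
Result: In zero-shot settings on the R2R, REVERIE, and R4R benchmarks, DACo achieves absolute improvements of 4.9%, 6.5%, and 5.4% respectively over the best-performing baselines, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL series) backbones.
Insight: The core contribution is a dual-agent architecture that decouples planning from execution. Combined with dynamic subgoal planning and adaptive replanning, it enables structured and resilient navigation and offers a principled, extensible paradigm for robust long-horizon navigation.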
Abstract: Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observing and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL Series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo
[59] YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos cs.CVPDF
Kedi Sun, Le Zhang
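TL;DR: This paper proposes a YOLOv10-based multi-task framework that localizes hands and classifies their laterality (left or right) in real time in trauma surgery videos. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, and data augmentation together with a multi-task detection design improves robustness to motion blur, lighting variations, and diverse hand appearances.
Details
Motivation: Trauma surgery requires real-time hand tracking to support rapid and precise intraoperative decisions.
Result: On the designated dataset, the model achieves 67% left-hand and 71% right-hand classification accuracy and an $mAP_{[0.5:0.95]}$ of 0.33, while maintaining real-time inference speed.
Insight: The novelty is combining YOLOv10 with multi-task learning to handle hand localization and laterality classification together, with data augmentation improving robustness in complex surgical scenes. The work lays a foundation for later analysis of hand-instrument interaction.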
Abstract: Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, consisting of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. Evaluation demonstrates accurate left-hand (67%) and right-hand (71%) classification, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.
[60] Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding cs.CVPDF
Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg
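TL;DR: This paper proposes Frame2Freq, a family of frequency-aware adapters for adapting image-pretrained backbones to video. It applies a Fast Fourier Transform along the time dimension and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. This addresses the tendency of existing time-domain adapters to overlook medium-speed motion in fine-grained video understanding, and the method performs strongly on several fine-grained action recognition datasets.
Details
Motivation: Existing methods for adapting image-pretrained backbones to video usually rely on time-domain adapters tuned to a single temporal scale. These modules pick up static image cues and very fast flicker changes but overlook medium-speed motion, whereas capturing dynamics across multiple time scales is crucial for fine-grained temporal analysis.
Result: Experiments on five fine-grained activity recognition datasets show that Frame2Freq outperforms prior parameter-efficient fine-tuning methods and even surpasses fully fine-tuned models on four of them.
Insight: The main novelty is bringing frequency-domain analysis into image-to-video transfer learning, modeling multi-scale temporal dynamics through spectral encoding and band-specific embeddings. This provides a strong new tool for modeling temporal dynamics and suggests that frequency-domain methods have significant potential for fine-grained video understanding.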
Abstract: Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (i.e., opening vs. closing bottle). To address this, we introduce Frame2Freq – a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
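The idea of transforming features along time and then treating frequency bands separately can be illustrated with a toy example. The sketch below takes one feature channel over T frames, computes a naive discrete Fourier transform along time, and sums the spectral energy in equal-width bands. It is only an assumption-laden illustration: the learned band-specific embeddings of Frame2Freq are not shown, and the function name `band_energies` and the banding scheme are hypothetical.

```python
import cmath
import math


def band_energies(signal, num_bands):
    """Split the one-sided temporal spectrum of a feature track into bands.

    signal: one feature channel sampled over T frames.
    Returns the spectral energy in up to num_bands equal-width bands.
    The DC term is excluded so static appearance does not dominate.
    """
    T = len(signal)
    # Naive DFT along time; an FFT yields the same values faster.
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * f * t / T) for t in range(T))
        for f in range(1, T // 2 + 1)
    ]
    energy = [abs(c) ** 2 for c in spectrum]
    width = math.ceil(len(energy) / num_bands)
    return [sum(energy[i:i + width]) for i in range(0, len(energy), width)]


T = 16
slow = [math.sin(2 * math.pi * 1 * t / T) for t in range(T)]    # 1 cycle per clip
medium = [math.sin(2 * math.pi * 4 * t / T) for t in range(T)]  # 4 cycles per clip
slow_bands = band_energies(slow, num_bands=4)
medium_bands = band_energies(medium, num_bands=4)
print(slow_bands.index(max(slow_bands)), medium_bands.index(max(medium_bands)))
```

Slow and medium-speed motion land in different bands, so a per-band weighting (learned, in the paper's setting) can emphasize medium-speed dynamics that a single-scale temporal adapter might miss.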
[61] IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition cs.CVPDF
Yuyang Ji, Yixuan Shen, Kien Nguyen, Lifeng Zhou, Feng Liu
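TL;DR: This paper proposes IDSelect, a reinforcement-learning-based cost-aware selector for video-based multi-modal person recognition. It dynamically selects one pre-trained model per modality for each sequence, optimizing the accuracy-efficiency trade-off and substantially reducing compute.
Details
Motivation: Existing video-based person recognition systems typically run a fixed heavyweight ensemble over all modalities regardless of input complexity, which wastes computation. IDSelect aims to use an input-conditioned selector to choose models dynamically and surpass fixed ensembles with fewer resources.
Result: On CCVID, IDSelect achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
Insight: The novelty is training a lightweight agent with reinforcement learning for end-to-end budget-aware optimization. The reward balances recognition accuracy against computational cost, entropy regularization prevents premature convergence, and the result is dynamic model selection with fusion of modality-specific similarities, a scalable way to improve efficiency in multi-modal systems.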
Abstract: Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address these limitations, we propose IDSelect, a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per-sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end using actor-critic reinforcement learning with budget-aware optimization. The reward balances recognition accuracy with computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses modality-specific similarities for the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect’s superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
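The three ingredients named in the abstract (a reward that trades accuracy against cost, greedy per-modality selection at inference, and fusion of modality-specific similarities) can be sketched as follows. The exact reward, fusion rule, and candidate models used by IDSelect are not given in the abstract, so the linear cost penalty, the weighted-mean fusion, and every model name below are assumptions made for illustration.

```python
def reward(correct, cost, budget, penalty=1.0):
    """Accuracy term minus a penalty on normalized compute cost (assumed form)."""
    return (1.0 if correct else 0.0) - penalty * cost / budget


def select_models(policy_probs):
    """At inference, pick the most probable model for each modality."""
    return {
        modality: max(probs, key=probs.get)
        for modality, probs in policy_probs.items()
    }


def fuse_scores(similarities, weights=None):
    """Fuse modality-specific similarities into one score (weighted mean)."""
    weights = weights or {m: 1.0 for m in similarities}
    total = sum(weights[m] for m in similarities)
    return sum(weights[m] * s for m, s in similarities.items()) / total


# Hypothetical policy output for one video sequence.
policy_probs = {
    "face": {"face_small": 0.7, "face_large": 0.3},
    "body": {"body_small": 0.2, "body_large": 0.8},
    "gait": {"gait_small": 0.6, "gait_large": 0.4},
}
chosen = select_models(policy_probs)
score = fuse_scores({"face": 0.9, "body": 0.6, "gait": 0.3})
step_reward = reward(correct=True, cost=2.0, budget=10.0)
print(chosen, score, step_reward)
```

During training, an actor-critic method would update the policy that produces `policy_probs` from such rewards, with an entropy bonus to keep exploration alive; that training loop is omitted here.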
[62] Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction cs.CVPDF
Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei
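TL;DR: This paper proposes a framework based on conditional binary segmentation for object-level visual correspondence across viewpoints in video, in particular between egocentric and exocentric views. The query object mask is encoded into a latent representation that guides localization of the corresponding object in the target video. A cycle-consistency training objective strengthens view-invariant representations and also enables test-time training (TTT) at inference.
Details
Motivation: Establishing object-level visual correspondence across viewpoints in video, especially between egocentric and exocentric views, is challenging. The aim is to establish robust correspondences without requiring ground-truth annotations.
Result: The method achieves state-of-the-art performance on the Ego-Exo4D and HANDAL-X benchmarks, validating the effectiveness of the optimization objective and the TTT strategy.
Insight: The novelty is a cycle-consistency training objective used as a self-supervisory signal: the bidirectional constraint improves view-invariant representations and supports test-time training to boost inference performance. Viewed objectively, pairing a simple conditional segmentation framework with cycle consistency addresses cross-view correspondence effectively without annotations, which makes it practical and likely to generalize.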
Abstract: We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
[63] A Benchmark and Knowledge-Grounded Framework for Advanced Multimodal Personalization Study cs.CVPDF
Xia Hu, Honglei Zhuang, Brian Potetz, Alireza Fathi, Bo Hu
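TL;DR: To address the lack of suitable benchmarks for advanced multimodal personalization research, this paper introduces Life-Bench, a comprehensive, synthetically generated multimodal benchmark that simulates user digital footprints and contains a broad set of questions evaluating capabilities from persona understanding to complex reasoning over historical data. It also proposes LifeGraph, an end-to-end framework that organizes personal context into a knowledge graph to facilitate structured retrieval and reasoning. Experiments show that existing methods perform poorly on complex personalization tasks, while LifeGraph narrows the gap significantly by leveraging structured knowledge, pointing to a direction for future research.
Details
Motivation: The strong reasoning capabilities of modern vision-language models open a new frontier for advanced personalization research, but progress is severely hampered by the lack of suitable benchmarks.
Result: Experiments on Life-Bench show that existing methods fall significantly short on complex personalization tasks, exposing large performance headroom, especially in relational, temporal, and aggregative reasoning. LifeGraph narrows this gap significantly by leveraging structured knowledge and demonstrates a promising direction.
Insight: The main contributions are Life-Bench, a comprehensive multimodal personalization benchmark that goes well beyond prior benchmarks and reflects the key demands of real-world applications, and LifeGraph, a knowledge-graph framework that structures personal context to strengthen retrieval and reasoning. Together they provide a new benchmark and a methodological direction for the complex reasoning challenges in advanced multimodal personalization.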
Abstract: The powerful reasoning of modern Vision Language Models opens a new frontier for advanced personalization study. However, progress in this area is critically hampered by the lack of suitable benchmarks. To address this gap, we introduce Life-Bench, a comprehensive, synthetically generated multimodal benchmark built on simulated user digital footprints. Life-Bench features over questions evaluating a wide spectrum of capabilities, from persona understanding to complex reasoning over historical data. These capabilities expand far beyond prior benchmarks, reflecting the critical demands essential for real-world applications. Furthermore, we propose LifeGraph, an end-to-end framework that organizes personal context into a knowledge graph to facilitate structured retrieval and reasoning. Our experiments on Life-Bench reveal that existing methods falter significantly on complex personalized tasks, exposing a large performance headroom, especially in relational, temporal and aggregative reasoning. While LifeGraph closes this gap by leveraging structured knowledge and demonstrates a promising direction, these advanced personalization tasks remain a critical open challenge, motivating new research in this area.
[64] MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment cs.CVPDF
Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai
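TL;DR: MoBind is a hierarchical contrastive learning framework for fine-grained alignment between IMU signals and 2D pose sequences from video, targeting cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition.
Details
Motivation: The goal is to learn a joint representation between IMU signals and 2D pose sequences extracted from video, addressing three challenges in cross-modal alignment: filtering out irrelevant visual background, modeling structured multi-sensor IMU configurations, and achieving fine-grained (sub-second) temporal alignment.
Result: Evaluated on mRi, TotalCapture, and EgoHumans, MoBind outperforms strong baselines on all four tasks, showing robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities.
Insight: Three design choices stand out. First, IMU signals are aligned with skeletal motion sequences rather than raw pixels, isolating motion-relevant cues. Second, full-body motion is decomposed into local body-part trajectories, each paired with its corresponding IMU for semantically grounded multi-sensor alignment. Third, a hierarchical contrastive strategy first aligns token-level temporal segments and then fuses local (body-part) alignment with global (body-wide) motion aggregation. Viewed objectively, this hierarchical modeling improves the precision and robustness of cross-modal alignment.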
Abstract: We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
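Contrastive alignment between paired IMU and pose segments is commonly expressed with an InfoNCE-style loss, in which each IMU segment should score highest against its own pose segment. The sketch below shows that generic objective on toy embeddings. It is a swapped-in standard loss, not MoBind's hierarchical formulation, whose exact form is not given in the abstract; the function names and the temperature value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(imu_embs, pose_embs, temperature=0.1):
    """Mean InfoNCE loss; the i-th IMU and i-th pose segment form the positive pair."""
    losses = []
    for i, imu in enumerate(imu_embs):
        logits = [cosine(imu, pose) / temperature for pose in pose_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for three temporal segments (made-up values).
imu = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pose_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
pose_shuffled = [pose_aligned[1], pose_aligned[2], pose_aligned[0]]
loss_aligned = info_nce(imu, pose_aligned)
loss_shuffled = info_nce(imu, pose_shuffled)
print(loss_aligned < loss_shuffled)
```

A hierarchical variant in the spirit of the abstract would apply such a loss to token-level temporal segments per body part and again to aggregated full-body embeddings; how the two levels are weighted is not specified in the source.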
[65] An interpretable framework using foundation models for fish sex identification cs.CV | cs.AIPDF
Zheng Miao, Tien-Chieh Hung
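TL;DR: This paper proposes FishProtoNet, an interpretable computer vision framework built on foundation models for non-invasive sex identification of the endangered delta smelt. It uses a visual foundation model to extract regions of interest, extracts features from them, and identifies sex with an interpretable prototype network, performing well at the early spawning and post-spawning stages.
Details
Motivation: Most existing fish sex identification methods are invasive or stressful and may cause additional mortality in endangered populations, so a robust, non-invasive method is needed.
Result: At the early spawning and post-spawning stages of delta smelt, FishProtoNet achieves accuracies of 74.40% and 81.16% and F1 scores of 74.27% and 79.43% respectively.
Insight: The novelty is combining a visual foundation model, which reduces the influence of background noise, with an interpretable prototype network that makes model decisions transparent, giving a robust and interpretable solution for non-invasive monitoring of endangered species.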
Abstract: Accurate sex identification in fish is vital for optimizing breeding and management strategies in aquaculture, particularly for species at the risk of extinction. However, most existing methods are invasive or stressful and may cause additional mortality, posing severe risks to threatened or endangered fish populations. To address these challenges, we propose FishProtoNet, a robust, non-invasive computer vision-based framework for sex identification of delta smelt (Hypomesus transpacificus), an endangered fish species native to California, across its full life cycle. Unlike the traditional deep learning methods, FishProtoNet provides interpretability through learned prototype representations while improving robustness by leveraging foundation models to reduce the influence of background noise. Specifically, the FishProtoNet framework consists of three key components: fish regions of interest (ROIs) extraction using visual foundation model, feature extraction from fish ROIs and fish sex identification based on an interpretable prototype network. FishProtoNet demonstrates strong performance in delta smelt sex identification during early spawning and post-spawning stages, achieving the accuracies of 74.40% and 81.16% and corresponding F1 scores of 74.27% and 79.43% respectively. In contrast, delta smelt sex identification at the subadult stage remains challenging for current computer vision methods, likely due to less pronounced morphological differences in immature fish. The source code of FishProtoNet is publicly available at: https://github.com/zhengmiao1/Fish_sex_identification
[66] Towards Calibrating Prompt Tuning of Vision-Language Models cs.CVPDF
Ashshak Sharifdeen, Fahad Shamshad, Muhammad Akhtar Munir, Abhishek Basu, Mohamed Insaf Ismithdeen
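TL;DR: This paper addresses the poor confidence calibration and unreliable predictive uncertainty that arise in prompt tuning of large-scale vision-language models such as CLIP, and proposes a calibration framework. The framework improves predictive reliability through two complementary regularizers while preserving the geometry of the pretrained CLIP embedding space, which is essential for robust generalization.
Details
Motivation: The goal is to fix the poor calibration that CLIP-like models exhibit after prompt tuning, where predicted probabilities do not reliably reflect the actual likelihood of being correct, which undermines the reliability of predictive uncertainty.
Result: Extensive experiments across 7 prompt-tuning methods and 11 diverse datasets show that, compared with competitive calibration techniques, the method significantly reduces the Expected Calibration Error (ECE) on both base and novel classes.
Insight: The novelty is a dual-regularizer calibration framework: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their mean and minimizing their dispersion, mitigating underconfidence and overconfidence spikes; (2) a text moment-matching loss that aligns the first and second moments of the tuned text embeddings with their frozen CLIP counterparts, preserving the semantic dispersion needed for generalization. The framework improves calibration while deliberately preserving the geometry of the pretrained embedding space.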
Abstract: Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
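For readers unfamiliar with the metric reported above, the following is a minimal sketch of the commonly used equal-width-bin definition of Expected Calibration Error. It is not taken from the paper: the function name, the default of 10 bins, and the half-open binning convention are assumptions made for illustration, and the paper's own regularizers are not implemented here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted top-class probabilities in [0, 1].
    correct: booleans, True where the prediction was right.
    n_bins: number of bins (10 is an assumed, commonly used default).
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each prediction to a bin [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that predicts with 0.9 confidence but is right only three times out of four would score roughly 0.15 under this definition; lower values indicate better calibration.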
[67] TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation cs.CV | cs.ROPDF
Qingwen Zhang, Chenhan Jiang, Xiaomeng Zhu, Yunqi Miao, Yushan Zhang
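TL;DR: This paper proposes TeFlow, a method for self-supervised feed-forward scene flow estimation that enables multi-frame supervision by mining temporally consistent supervisory signals, addressing the unreliability of conventional two-frame correspondence supervision under occlusion.
Details
Motivation: Existing self-supervised feed-forward scene flow methods are supervised by two-frame point correspondences, which are unreliable and often break down under occlusion. Multi-frame supervision could provide more stable guidance by incorporating motion cues from past frames, but naive extensions of two-frame objectives are ineffective because point correspondences change abruptly across frames.
Result: Extensive evaluation on the Argoverse 2 and nuScenes datasets shows that TeFlow sets a new state of the art for self-supervised feed-forward methods, with performance gains of up to 33%; it performs on par with leading optimization-based methods while being 150 times faster.
Insight: The novelty is a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames, making effective use of multi-frame information to improve the robustness and accuracy of scene flow estimation.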
Abstract: Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state of the art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet is 150 times faster. The code is open-sourced at https://github.com/KTH-RPL/OpenSceneFlow along with trained model weights.
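The idea of aggregating the most temporally consistent cues from a candidate pool can be illustrated with a toy consensus rule. The sketch below is only an assumption about what such a selection could look like for a single point: the function name `consensus_flow`, the distance tolerance `tol`, and the "largest agreeing group, then average" rule are all hypothetical and are not the procedure described in the TeFlow paper.

```python
import math


def consensus_flow(candidates, tol=0.1):
    """Return the mean of the largest group of mutually close candidate motion vectors.

    candidates: list of equal-length tuples, each a candidate motion vector for one
                point gathered from a different frame pair (hypothetical setup).
    tol: maximum Euclidean distance for two candidates to count as agreeing (assumed).
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    # For each candidate, collect the candidates that agree with it within tol,
    # and keep the largest such group (ties keep the earliest candidate's group).
    best_group = []
    for anchor in candidates:
        group = [other for other in candidates if math.dist(anchor, other) <= tol]
        if len(group) > len(best_group):
            best_group = group

    # Average the agreeing candidates to form a single supervisory vector.
    dim = len(best_group[0])
    return tuple(sum(vec[i] for vec in best_group) / len(best_group) for i in range(dim))
```

In this toy version an outlier candidate, such as one produced by a correspondence that jumped because of occlusion, simply falls outside the agreeing group and does not affect the result.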
[68] Direction-aware 3D Large Multimodal Models cs.CVPDF
Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao
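TL;DR: This paper proposes a new paradigm that enables direction-aware 3D large multimodal models by identifying and supplementing ego poses in point cloud benchmarks and transforming the point cloud data accordingly. The method has two core designs: PoseRecover (a pipeline that automatically matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and Z-buffer visibility checks) and PoseAlign (a transformation that aligns the point cloud data with the identified ego poses).
Details
Motivation: Existing 3D large multimodal models rely heavily on ego poses for directional question answering and spatial reasoning, yet most point cloud benchmarks contain rich directional queries without the corresponding ego poses, leaving the problem ill-posed.
Result: Extensive experiments on multiple 3D LMM backbones (such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA) show consistent improvements, for example raising ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%.
Insight: The novelty is a rigorous new paradigm that reformulates direction-aware 3D LMM problems by supplementing ego poses, together with a pose alignment method that does not require injecting poses into textual prompts or projection layers. The method is simple, generic, and training-efficient, requiring only instruction tuning, and establishes a strong baseline for direction-aware 3D LMMs.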
Abstract: 3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses in point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility checks with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.
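Aligning a point cloud with an ego pose can be pictured as an ordinary rigid change of coordinates. The sketch below assumes the pose is given as an ego-to-world rotation matrix R and an ego position t, so that a world point p becomes R^T (p - t) in the ego frame. The abstract does not specify PoseAlign's exact transformation, so the function names and the pose convention here are assumptions for illustration only.

```python
import math


def yaw_rotation(theta):
    """3x3 rotation about the z-axis by theta radians (ego-to-world, assumed convention)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def world_to_ego(points, rotation, translation):
    """Express world-frame points in the ego frame: p_ego = R^T (p_world - t).

    points: iterable of (x, y, z) tuples in world coordinates.
    rotation: 3x3 ego-to-world rotation matrix as nested lists.
    translation: ego position (x, y, z) in world coordinates.
    """
    aligned = []
    for p in points:
        # Shift so the ego position becomes the origin.
        d = [p[i] - translation[i] for i in range(3)]
        # Multiply by the transpose of R (the inverse of a rotation matrix).
        aligned.append(tuple(sum(rotation[k][j] * d[k] for k in range(3)) for j in range(3)))
    return aligned
```

For example, with the ego at (1, 0, 0) facing along the world +y axis (yaw of 90 degrees), a world point at (1, 2, 0) ends up about 2 units straight ahead on the ego x-axis, so a question such as "what is in front of me" has a well-defined answer.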
[69] Restoration-Guided Kuzushiji Character Recognition Framework under Seal Interference cs.CVPDF
Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori
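TL;DR: This paper proposes RG-KCR, a three-stage restoration-guided framework designed to address the drop in recognition accuracy for pre-modern Japanese (Kuzushiji) characters caused by seal interference (e.g., seals covering characters). The framework comprises character detection, seal-interference restoration, and character classification, and its effectiveness is validated through constructed datasets and experiments.
Details
Motivation: Existing automatic Kuzushiji recognition methods perform well on relatively clean document images, but their accuracy drops significantly on historical documents with the commonly occurring seal interference. This paper targets the specific challenge of recognizing characters overlapped by seals.
Result: On the constructed test set, the YOLOv12-medium model used in Stage 1 achieves 98.0% precision and 93.3% recall. Adding the Stage 2 restoration module raises the Top-1 accuracy of Metom, the Vision Transformer-based classifier used in Stage 3, from 93.45% to 95.33%. Restoration performance is also evaluated quantitatively with PSNR and SSIM.
Insight: The core novelty is a guided framework that explicitly couples document restoration with character recognition, using an intermediate restoration stage to improve final classification performance, which offers an effective paradigm for OCR under a specific type of image degradation (seal interference). The purpose-built datasets also provide a benchmark for research in this area.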
Abstract: Kuzushiji was one of the most popular writing styles in pre-modern Japan and was widely used in both personal letters and official documents. However, due to its highly cursive forms and extensive glyph variations, most modern Japanese readers cannot directly interpret Kuzushiji characters. Therefore, recent research has focused on developing automated Kuzushiji character recognition methods, which have achieved satisfactory performance on relatively clean Kuzushiji document images. However, existing methods struggle to maintain recognition accuracy under seal interference (e.g., when seals overlap characters), despite the frequent occurrence of seals in pre-modern Japanese documents. To address this challenge, we propose a three-stage restoration-guided Kuzushiji character recognition (RG-KCR) framework specifically designed to mitigate seal interference. We construct datasets for evaluating Kuzushiji character detection (Stage 1) and classification (Stage 3). Experimental results show that the YOLOv12-medium model achieves a precision of 98.0% and a recall of 93.3% on the constructed test set. We quantitatively evaluate the restoration performance of Stage 2 using PSNR and SSIM. In addition, we conduct an ablation study to demonstrate that Stage 2 improves the Top-1 accuracy of Metom, a Vision Transformer (ViT)-based Kuzushiji classifier employed in Stage 3, from 93.45% to 95.33%. The implementation code of this work is available at https://ruiyangju.github.io/RG-KCR.
[70] Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling cs.CV | cs.GR | cs.LGPDF
Qi Sun, Can Wang, Jiaxiang Shang, Yingchun Liu, Jing Liao
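TL;DR: This paper proposes Ani3DHuman, a framework for generating photorealistic 3D human animation. It combines kinematics-based animation with video diffusion priors, separates rigid motion from residual non-rigid motion through a layered motion representation, and introduces a novel self-guided stochastic sampling method to handle out-of-distribution initial renderings, producing high-quality videos that are used to optimize the non-rigid motion field.
Details
Motivation: Existing 3D human animation methods struggle to achieve photorealism: kinematics-based methods lack non-rigid dynamics (such as clothing dynamics), while methods that use video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss.
Result: Extensive experiments show that Ani3DHuman generates photorealistic 3D human animation and outperforms existing methods.
Insight: The core novelties are the layered motion representation (separating rigid from residual non-rigid motion) and the self-guided stochastic sampling method, which combines stochastic sampling (for realism) with self-guidance (for identity fidelity). This addresses the difficulty of sampling from diffusion models on out-of-distribution inputs and enables high-quality recovery and optimization of non-rigid motion.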
Abstract: Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non-rigid motion. Rigid motion is generated by a kinematic method, which then produces a coarse rendering to guide the video diffusion model in generating video sequences that restore the residual non-rigid motion. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, we propose a novel self-guided stochastic sampling method, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of the residual non-rigid motion field. Extensive experiments demonstrate that Ani3DHuman can generate photorealistic 3D human animation, outperforming existing methods. Code is available at https://github.com/qiisun/ani3dhuman.
[71] CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension cs.CVPDF
Lihao Liu, Yan Wang, Biao Yang, Da Li, Jiangxia Cao
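TL;DR: This paper proposes CREM, which enhances multimodal representations through a compression-driven unified framework while preserving generative ability, achieving state-of-the-art retrieval performance and retaining strong comprehension capability.
Details
Motivation: To address the poor retrieval performance of multimodal large language models caused by the mismatch between output formats and optimization objectives, while avoiding the loss of generative ability incurred by conventional contrastive fine-tuning.
Result: CREM achieves state-of-the-art retrieval performance on the MMEB benchmark and maintains strong generative performance on multiple comprehension benchmarks.
Insight: The novelty is a compression-driven unified training framework that integrates contrastive and generative objectives through learnable aggregation tokens and compression-aware attention, and it shows that generative supervision can further improve the quality of multimodal representations.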
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), with a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt design with learnable chorus tokens to aggregate multimodal semantics and a compression-driven training strategy that integrates contrastive and generative objectives through compression-aware attention. Extensive experiments demonstrate that CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks. Our findings highlight that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.
[72] Universal 3D Shape Matching via Coarse-to-Fine Language Guidance cs.CVPDF
Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang
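TL;DR: This paper proposes UniMatch, a framework that establishes dense semantic correspondences between cross-category, non-isometric 3D shapes through coarse-to-fine language guidance. The method first performs class-agnostic 3D segmentation to obtain semantic parts and uses a multimodal large language model to identify part names, then extracts text embeddings with a pretrained vision-language model to build coarse correspondences; a rank-based contrastive scheme then uses these coarse correspondences to guide the learning of dense correspondences.
Details
Motivation: To overcome the reliance of existing 3D shape correspondence methods on near-isometric assumptions and a single category (e.g., human shapes only), with the aim of establishing dense semantic correspondences for strongly non-isometric, cross-category objects.
Result: Extensive experiments show that UniMatch outperforms existing methods across a variety of challenging scenarios, achieving universal matching for cross-category and non-isometric shapes.
Insight: The novelty lies in combining class-agnostic segmentation, language guidance (using MLLMs and VLMs), and rank-based contrastive learning to achieve universal 3D shape matching without predefined part proposals. Viewed objectively, its coarse-to-fine guidance via semantic cues effectively overcomes the limitations of purely geometric correspondence.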
Abstract: Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, but prior approaches depend on near-isometric assumptions and homogeneous subject types (e.g., operating only on human shapes). Building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To address this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift “coarse” semantic cues into “fine” correspondence, which is achieved through two stages. In the “coarse” stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the “fine” stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guidance, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate that UniMatch consistently outperforms competing methods in various challenging scenarios.
[73] Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models cs.CVPDF
Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang
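TL;DR: This paper proposes Symbolic Projective Layout (SymPL), a framework that addresses the degraded spatial reasoning of vision-language models (VLMs) under allocentric (object-centered) viewpoints. SymPL reformulates allocentric reasoning problems into symbolic-layout forms that the models handle well, substantially improving performance across a range of spatial reasoning tasks.
Details
Motivation: VLMs perform well at spatial reasoning from egocentric (observer-centered) viewpoints, but their performance drops markedly under allocentric viewpoints, where spatial relations must be inferred from the perspective of objects in the scene. This paper addresses this underexplored challenge.
Result: Extensive experiments show that SymPL substantially improves performance on both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to the gains.
Insight: The novelty is the systematic conversion of allocentric spatial reasoning problems into symbolic-layout representations via four key factors: projection, abstraction, bipartition, and localization. This provides an effective and principled approach to complex perspective-aware spatial reasoning.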
Abstract: Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.
[74] StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification cs.CVPDF
Jiapeng Li, Yingjing Huang, Fan Zhang, Yu Liu
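TL;DR: This paper introduces StreetTree, the first large-scale global benchmark dataset designed for fine-grained street tree classification. It contains over 12 million images from 133 countries across five continents, covers more than 8,300 common street tree species, and provides expert-verified observational data and a hierarchical taxonomy.
Details
Motivation: To address the slow progress in fine-grained street tree classification caused by the lack of large-scale, geographically diverse, publicly available benchmark datasets, in support of urban planning, streetscape management, and the assessment of urban ecosystem services.
Result: Extensive experiments with a variety of vision models establish strong baselines and reveal the limitations of existing methods in handling real-world complexity (such as high inter-species visual similarity, long-tailed distributions, and intra-class variation caused by seasonal change).
Insight: The novelty is the first large-scale, globally distributed benchmark for fine-grained street tree classification, together with a hierarchical taxonomy that supports research on hierarchical classification and representation learning. It provides a key resource at the intersection of computer vision and urban science.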
Abstract: The fine-grained classification of street trees is a crucial task for urban planning, streetscape management, and the assessment of urban ecosystem services. However, progress in this field has been significantly hindered by the lack of large-scale, geographically diverse, and publicly available benchmark datasets specifically designed for street trees. To address this critical gap, we introduce StreetTree, the world’s first large-scale benchmark dataset dedicated to fine-grained street tree classification. The dataset contains over 12 million images covering more than 8,300 common street tree species, collected from urban streetscapes across 133 countries spanning five continents, and supplemented with expert-verified observational data. StreetTree poses substantial challenges for pretrained vision models under complex urban environments: high inter-species visual similarity, long-tailed natural distributions, significant intra-class variations caused by seasonal changes, and diverse imaging conditions such as lighting, occlusions from buildings, and varying camera angles. In addition, we provide a hierarchical taxonomy (order-family-genus-species) to support research in hierarchical classification and representation learning. Through extensive experiments with various visual models, we establish strong baselines and reveal the limitations of existing methods in handling such real-world complexities. We believe that StreetTree will serve as a key resource for the refined management and research of urban street trees, while also driving new advancements at the intersection of computer vision and urban science.
[75] Mapping Networks cs.CVPDF
Lord Sen, Shyamapada Mukherjee
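TL;DR: This paper proposes Mapping Networks, which replace the high-dimensional weight space with a compact, trainable latent vector to address the training-efficiency and overfitting problems caused by growing parameter counts in modern deep learning models. The method rests on the hypothesis that the trained parameters of large networks lie on smooth, low-dimensional manifolds, and it uses the Mapping Theorem and a dedicated Mapping Loss to show, theoretically and in practice, that a mapping from the latent space to the target weight space exists. On complex vision and sequence tasks such as image classification and deepfake detection, Mapping Networks significantly reduce overfitting and match or exceed the target network's performance, with a reported reduction in trainable parameters of around 500× (99.5%).
Details
Motivation: The parameter counts of modern deep learning models keep growing, making training inefficient and overfitting increasingly severe. The paper aims to address these challenges by replacing the high-dimensional weight space with a low-dimensional latent representation.
Result: On complex vision and sequence tasks such as image classification and deepfake detection, Mapping Networks achieve performance comparable to or better than the target network while reducing trainable parameters by around 500× (99.5%) and significantly reducing overfitting.
Insight: The novelty is Mapping Networks built on the low-dimensional manifold hypothesis, which use the Mapping Theorem and Mapping Loss to compress high-dimensional weights into a compact latent space. This offers a new approach to reducing model parameters and mitigating overfitting, with potential in resource-constrained settings.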
Abstract: The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and to the resolution of overfitting. We address this by introducing Mapping Networks, which replace the high-dimensional weight space with a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. The Mapping Theorem, enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve performance comparable to or better than the target network across complex vision and sequence tasks, including image classification and deepfake detection, with a $\mathbf{99.5\%}$ (around $500\times$) reduction in trainable parameters.
[76] CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion cs.CV | cs.LGPDF
Sijie Mai, Shiqin Han
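TL;DR: This paper proposes CaReFlow, which extends rectified flow to modality distribution mapping in order to address the modality gap in multimodal fusion. The method uses a 'one-to-many mapping' strategy so that each source-modality data point can observe the overall target-modality distribution, and it designs adaptive relaxed alignment and cyclic rectified flow to improve mapping accuracy and prevent information loss.
Details
Motivation: Existing methods (such as diffusion models and adversarial learning) typically focus on one-to-one alignment and do not expose source-modality data points to the global distribution of the target modality; they are also constrained by insufficient paired data, which limits the effectiveness of multimodal fusion.
Result: On multiple multimodal affective computing tasks, the method achieves very competitive results even with a simple fusion method, and visualizations confirm that it effectively reduces the modality gap.
Insight: The novelties are: extending rectified flow to modality distribution mapping with a one-to-many mapping strategy that exploits global distribution information; adaptive relaxed alignment, which adjusts the strictness of alignment according to sample or category relationships; and cyclic rectified flow, which ensures that transferred features can be mapped back and so retains modality-specific information. Together these mechanisms improve the robustness and accuracy of distribution alignment.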
Abstract: Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the 'one-to-many mapping' strategy in rectified flow that allows each data point of the source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design 'adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce 'cyclic rectified flow' to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce the modality gap.
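The straight-trajectory property of rectified flow mentioned above can be made concrete with the standard formulation: a point on the path is x_t = (1 - t) x_0 + t x_1, and the regression target for the velocity field is x_1 - x_0. The sketch below shows only this generic building block in plain Python; the function names are illustrative, and the paper's one-to-many pairing, adaptive relaxed alignment, and cyclic flow are not implemented.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source feature x0 to target feature x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path; the regression target in rectified flow."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the straight-path target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)


def euler_transport(x0, velocity_fn, steps=10):
    """Move x0 along a velocity field with fixed-step Euler integration from t=0 to t=1.

    velocity_fn(x, t) stands in for a trained network (hypothetical here).
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a perfectly learned field, which returns the constant x_1 - x_0 along the path, Euler integration lands on x_1 regardless of the number of steps, which is why straight trajectories are attractive for mapping one distribution to another.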
[77] VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval cs.CV | cs.CLPDF
Diogo Glória-Silva, David Semedo, João Magalhães
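TL;DR: VIGiA is a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. By combining multimodal plan reasoning with plan-based retrieval, it supports dialogue grounded in visual inputs, instructional plans, and interleaved user interactions, and it performs strongly on an instructional video dialogue dataset covering the cooking and DIY domains.
Details
Motivation: To address the fact that prior work focuses mainly on text-only guidance or treats vision and language in isolation, with the aim of enabling grounded, plan-aware dialogue that reasons over visual inputs, instructional plans, and user interactions.
Result: VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting and reaches over 90% accuracy on plan-aware visual question answering.
Insight: The novelty is the integration of two key capabilities, multimodal plan reasoning and plan-based retrieval, which let the model align unimodal and multimodal queries with the current task plan and retrieve relevant plan steps as text or visual representations, enabling deeper understanding of and interaction with complex instructional videos.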
Abstract: We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
[78] Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation cs.CVPDF
Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu
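TL;DR: This paper proposes Flash-VAED, a plug-and-play acceleration framework for VAE decoders in video generation. It reduces latency through channel pruning and optimization of causal 3D convolutions while staying aligned with the original latent distribution. On the Wan and LTX-Video decoders it achieves roughly a 6× speedup while retaining up to 96.9% of reconstruction performance, and it accelerates the end-to-end generation pipeline by up to 36% with negligible quality loss.
Details
Motivation: Latent diffusion models enable high-quality video synthesis, but their inference is costly and slow. As diffusion transformers become more efficient, the VAE decoder becomes the latency bottleneck, so its latency needs to be reduced without sacrificing quality.
Result: On the Wan and LTX-Video VAE decoders, the method outperforms baselines in both quality and speed, achieving roughly a 6× speedup while retaining up to 96.9% of reconstruction performance; on VBench-2.0 it accelerates the end-to-end generation pipeline by up to 36% with negligible quality loss.
Insight: The novelties include independence-aware channel pruning to mitigate channel redundancy and stage-wise dominant operator optimization to lower the inference cost of causal 3D convolutions, plus a three-phase dynamic distillation framework that efficiently transfers the original VAE decoder's capabilities, giving a general acceleration scheme for VAE decoders in video generation.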
Abstract: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while maintaining the reconstruction performance up to 96.9%. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
[79] Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition cs.CV | cs.CL | cs.LGPDF
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao
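TL;DR: This paper proposes ADAMAB, an efficient embedding calibration framework for few-shot implicit pattern recognition. It keeps the pretrained embedding model fixed and trains a lightweight calibrator on top of it to reduce computational cost, and it introduces an adaptive data augmentation strategy based on the multi-armed bandit to reduce the need for large-scale training data. Experiments show that ADAMAB performs strongly in few-shot settings, improving accuracy by up to 40% with fewer than 5 initial samples per class.
Details
Motivation: Current pretrained foundation models (such as LLMs and VLMs) struggle with long-tail implicit pattern recognition tasks, because fine-tuning usually requires large amounts of training data and computation that are often unavailable in practice.
Result: Multi-modal experiments show that ADAMAB performs strongly in few-shot training, with accuracy improvements of up to 40% using fewer than 5 initial samples per class, reaching strong performance on implicit pattern recognition tasks.
Insight: The novelties are: (1) an embedder-agnostic lightweight calibrator that avoids accessing the pretrained model's parameters, lowering computational cost; (2) an adaptive data augmentation strategy based on the multi-armed bandit, which uses a modified upper confidence bound algorithm to reduce gradient shifting and guarantee convergence in theory, making it suitable for few-shot settings.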
Abstract: Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To reduce computational costs as far as possible, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments demonstrate the superior performance of ADAMAB, with up to 40% accuracy improvement when training with fewer than 5 initial data samples of each class.
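To give a sense of how a bandit can steer the choice among augmentation options, here is a sketch of the textbook UCB1 rule, which picks the arm maximizing the empirical mean reward plus an exploration bonus of sqrt(2 ln t / n). This is not ADAMAB's algorithm: the abstract states that the authors use a modified upper confidence bound method whose details are not given here, and treating each arm as an augmentation operation with a scalar reward is an assumption made for illustration.

```python
import math


class UCB1:
    """Textbook UCB1 bandit; each arm could stand for one augmentation operation (assumed)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of times each arm was pulled
        self.values = [0.0] * n_arms  # running mean reward of each arm

    def select(self):
        """Return the index of the arm to pull next."""
        # Pull every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        """Record the observed reward for the pulled arm (incremental mean update)."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a training loop one would call `select()` to choose an augmentation, apply it, measure some reward such as a validation improvement (a hypothetical choice), and pass that reward to `update()`; arms that help more are then chosen more often while the bonus term keeps the others from being ignored entirely.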
[80] JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation cs.CV | cs.MM | cs.SDPDF
Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang
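TL;DR: This paper proposes JavisDiT++, a unified modeling and optimization framework for joint audio-video generation. It introduces a modality-specific mixture-of-experts module, a temporal-aligned rotary position embedding strategy, and an audio-video direct preference optimization method to address the shortcomings of existing open-source models in generation quality, temporal synchrony, and alignment with human preferences.
Details
Motivation: Current open-source joint audio-video generation methods still lag behind advanced commercial models in generation quality, temporal synchrony, and alignment with human preferences; this paper aims to close that gap.
Result: Built on Wan2.1-1.3B-T2V and trained on only about 1M public training entries, the model significantly outperforms prior approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance.
Insight: The novelties are a modality-specific mixture-of-experts design that balances cross-modal interaction with single-modal generation quality, a temporal-aligned rotary position embedding strategy for frame-level synchronization, and an audio-video direct preference optimization method for aligning with human preferences. Ablation studies validate the effectiveness of these modules.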
Abstract: AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance merely with around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules. All the code, model, and dataset are released at https://JavisVerse.github.io/JavisDiT2-page.
[81] BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment cs.CVPDF
Kanglei Zhou, Chang Li, Qingyi Pan, Liyuan Wang
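TL;DR: This paper proposes BriMA (Bridged Modality Adaptation), an approach to the challenge of missing modalities in multi-modal continual action quality assessment (AQA). It reconstructs missing modalities with a memory-guided bridging imputation module and uses a modality-aware replay mechanism to prioritize informative samples, improving assessment performance in realistic scenarios with modality imbalance.
Details
Motivation: Real-world multi-modal AQA deployments often suffer from missing or unstable modalities due to sensor failures or annotation gaps, whereas existing continual AQA methods assume that all modalities remain complete and stable, which limits their practicality. A robust method that adapts to modality-missing conditions is therefore needed.
Result: Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA improves performance under different modality-missing conditions, with 6–8% higher correlation and 12–15% lower error on average, demonstrating its effectiveness under real-world constraints.
Insight: The novelties are the bridging imputation module, which reconstructs missing modalities by combining task-agnostic and task-specific representations, and the modality-aware replay mechanism, which is based on modality distortion and distribution drift. Together they offer a new approach to robust multi-modal continual learning in non-stationary environments.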
Abstract: Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6–8% higher correlation and 12–15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.
[82] EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease cs.CVPDF
Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong
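TL;DR: This paper proposes EMAD, an evidence-centric multimodal diagnosis framework for Alzheimer’s disease. Using a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism, it generates structured diagnostic reports in which every diagnostic claim is explicitly linked to multimodal evidence.
Details
Motivation: To address the fact that current deep learning models for medical image analysis act as black boxes, are not aligned with clinical guidelines, and do not explicitly link decisions to supporting evidence. This matters especially in Alzheimer’s disease diagnosis, where predictions should be grounded in anatomical and clinical findings.
Result: On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods.
Insight: The novelties include the hierarchical SEA grounding mechanism, GTX-Distill for reducing dense annotation requirements, and Executable-Rule GRPO, a reinforcement fine-tuning scheme that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence, improving the trustworthiness of medical vision-language models.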
Abstract: Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer’s disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
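[83] VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery cs.CV
Wenhao Shen, Hao Wang, Wanqi Yin, Fayao Liu, Xulei Yang
TL;DR: This paper proposes a VLM-guided group preference alignment framework to improve diffusion-based human mesh recovery (HMR). A dual-memory augmented HMR critique agent with self-reflection produces context-aware quality scores for predicted meshes. These scores are used to build a group-wise preference dataset, which is then used to fine-tune the diffusion model so that it generates 3D human meshes that are more physically plausible and more consistent with the image.
Details
Motivation: HMR from a monocular RGB image is inherently ambiguous. Existing diffusion methods generate multiple hypotheses but often sacrifice accuracy, yielding predictions that are physically implausible or inconsistent with the input image under occlusion or in complex scenes.
Result: Extensive experiments show that the method outperforms current state-of-the-art approaches.
Insight: The novelty lies in a critique agent that produces fine-grained quality scores for building a preference dataset, and in using that dataset within a group preference alignment framework to fine-tune the diffusion model, injecting rich preference signals about 3D human motion structure, physical feasibility, and image alignment. Viewed objectively, this is a new paradigm that combines high-quality critique signals (possibly from a VLM) with model fine-tuning to improve the physical plausibility and consistency of generated results.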
Abstract: Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.
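[84] PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration cs.CV
Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou
TL;DR: This paper proposes PositionOCR, a parameter-efficient hybrid architecture that strengthens the positional awareness of multi-modal large language models (MLLMs) in OCR-centric visual question answering. By combining the coordinate prediction of a text spotting specialist with the contextual reasoning of an LLM, it achieves strong multi-modal performance with few trainable parameters.
Details
Motivation: Current MLLMs rely on a language-model decoder that lacks the positional reasoning needed for precise visual tasks, and they are large and costly to train. Text spotting specialists achieve SOTA coordinate prediction but lack semantic reasoning. The paper aims to combine specialist efficiency with LLM contextual ability to build a positionally accurate MLLM.
Result: PositionOCR consistently surpasses traditional MLLMs on tasks such as text grounding and text spotting, with a framework of only 131M trainable parameters.
Insight: The novelty is a parameter-efficient hybrid architecture that integrates the positional strengths of a text spotting model with the contextual reasoning of an LLM. It addresses both the weak positional awareness and the high compute demands of MLLMs, and suggests a route to accurate, efficient multi-modal systems.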
Abstract: In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model’s positional strengths with an LLM’s contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.
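[85] FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery cs.CV | cs.AI
Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo
TL;DR: This paper proposes FUSAR-GPT, a visual language model designed specifically for Synthetic Aperture Radar (SAR) imagery. It builds the first SAR image-text-geospatial feature triplet dataset, introduces a geospatial baseline model as prior knowledge, and embeds multi-source remote-sensing temporal features through "spatiotemporal anchors" to dynamically compensate for the sparse representation of targets in SAR images. A two-stage supervised fine-tuning strategy decouples knowledge injection from task execution, yielding state-of-the-art performance on several typical remote-sensing vision-language benchmarks.
Details
Motivation: Existing visual language models (VLMs) cannot be applied directly and effectively to intelligent SAR image interpretation because of the complexity of the SAR imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora.
Result: FUSAR-GPT achieves state-of-the-art performance on several typical remote-sensing vision-language benchmarks, outperforming mainstream baseline models by over 12%.
Insight: The contributions are: (1) the first SAR image-text-geospatial feature triplet dataset; (2) a geospatial baseline model used as a "world knowledge" prior; (3) "spatiotemporal anchors" that embed multi-source remote-sensing temporal features to dynamically compensate for target sparsity; and (4) a two-stage supervised fine-tuning strategy that decouples knowledge injection from task execution. These offer a reusable pattern for domain-specific vision-language modeling (such as remote sensing), particularly where data is scarce and modality characteristics are complex.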
Abstract: Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a ‘world knowledge’ prior and embeds multi-source remote-sensing temporal features into the model’s visual backbone via ‘spatiotemporal anchors’, enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.
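[86] Prompt Tuning for CLIP on the Pretrained Manifold cs.CV
Xi Yang, Yuanrong Xu, Weigang Zhang, Guangming Lu, David Zhang
TL;DR: This paper proposes ManiPT, a framework that performs prompt tuning on the pretrained manifold to address the representation drift and degraded generalization that prompt tuning causes under limited supervision. It introduces cosine consistency constraints and a structural bias that confine the learned representations to the pretrained geometric neighborhood and guide adaptation along transferable directions.
Details
Motivation: Under limited supervision, prompt tuning alters pretrained representations and pushes downstream features away from the pretrained manifold, which harms generalization.
Result: Experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. ManiPT achieves higher average performance than the baseline methods across these settings.
Insight: The novelty is a framework for prompt tuning on the pretrained manifold. It uses cosine consistency constraints in both the text and image modalities, together with a structural bias, to mitigate overfitting under limited data, and it analyzes the overfitting mechanism of prompt tuning from a theoretical perspective.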
Abstract: Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.
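The cosine consistency idea in the ManiPT abstract can be illustrated with a minimal Python sketch. The loss form used here (the mean of one minus the cosine similarity between prompt-tuned features and frozen pretrained features) and the function names are assumptions for illustration; the abstract does not state the exact formulation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_consistency_loss(tuned, pretrained):
    """Assumed loss: mean (1 - cos) between tuned and frozen pretrained features.

    tuned, pretrained: lists of equal-length feature vectors, paired by index.
    The loss is 0 when every tuned feature points in the same direction as
    its pretrained counterpart, and grows as the features drift apart.
    """
    losses = [1.0 - cosine_similarity(t, p) for t, p in zip(tuned, pretrained)]
    return sum(losses) / len(losses)
```

In a training loop, a term like this would typically be added to the task loss for both the text and the image branch, with a weighting coefficient that the sketch leaves out.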
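[87] UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models cs.CV
Gang Xu, Zhiyu Zhu, Junhui Hou
TL;DR: This paper proposes UniE2F, a unified diffusion framework that uses the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event-camera data. It first builds a baseline by conditioning directly on event data, then introduces event-based inter-frame residual guidance, grounded in the physical correlation between the event stream and video frames, to improve reconstruction accuracy. By modulating the reverse diffusion sampling process, the method also extends to video frame interpolation and prediction in a zero-shot manner.
Details
Motivation: Event cameras excel at high-speed, low-power, high-dynamic-range scene perception, but they record only relative intensity changes rather than absolute intensity, so their data streams lose substantial spatial information and static texture detail. The paper uses the generative capacity of a pre-trained video diffusion model to overcome this limitation and reconstruct high-quality frames from sparse event data.
Result: Experiments on real-world and synthetic datasets show that the method significantly outperforms previous approaches both quantitatively and qualitatively.
Insight: The core novelty is combining the generative prior of a pre-trained video diffusion model with the physical properties of event data, using event conditioning and inter-frame residual guidance to improve reconstruction quality. The resulting unified framework extends zero-shot to interpolation and prediction and offers a new approach to high-quality visual reconstruction from event-camera data.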
Abstract: Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.
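[88] SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation cs.CV
Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, he yao
TL;DR: This paper proposes SegMoTE, a token-level mixture-of-experts framework for medical image segmentation. Built on SAM, it adds a small number of learnable parameters to adapt dynamically across modalities and tasks, and it uses a progressive prompt tokenization mechanism to enable fully automatic segmentation, greatly reducing annotation dependence. Trained on the MedSeg-HQ dataset, the model achieves SOTA performance across diverse imaging modalities and anatomical tasks.
Details
Motivation: Transferring general interactive segmentation models such as SAM to medical imaging faces two key bottlenecks: the lack of adaptive mechanisms for modality- and anatomy-specific tasks, and the fact that existing medical adaptation methods fine-tune on large heterogeneous datasets without selection, which leads to noisy supervision, high cost, and negative transfer.
Result: Trained on the curated MedSeg-HQ dataset (less than 1% the size of existing large-scale datasets), SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks.
Insight: The contributions are a token-level mixture-of-experts mechanism for efficient cross-modality adaptation; a progressive prompt tokenization mechanism that supports fully automatic segmentation and reduces annotation dependence; and an efficient, robust, and scalable adaptation of a general segmentation model to the medical domain at very low annotation cost.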
Abstract: Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM’s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.
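[89] Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing cs.CV
Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia
TL;DR: This paper proposes a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG) that draws on external commonsense knowledge to enrich and diversify the questions generated for remote sensing images, going beyond simple pixel-level description. The model uses image captioning as an intermediate representation and adopts a vision-language pre-training and fine-tuning strategy to cope with low-data regimes. Two new datasets are built for evaluation, and the results show that KRSVQG outperforms existing methods and generates rich questions grounded in both image and domain knowledge.
Details
Motivation: Automatically generated questions are often simplistic and template-based, which limits the deployment of question answering and visual dialogue systems in real applications. The paper aims to enrich and diversify question generation for remote sensing images by integrating image content with commonsense knowledge.
Result: On the newly built NWPU-300 and TextRS-300 datasets, both automatic metrics and human assessment show that KRSVQG outperforms existing methods, reaching SOTA level and generating rich questions grounded in image and domain knowledge.
Insight: The contributions are integrating external knowledge triplets into question generation to broaden question content; using image captions as an intermediate representation to keep questions grounded in the image; and adopting vision-language pre-training and fine-tuning to handle low-data regimes. Viewed objectively, the work pushes image understanding in vision-language research beyond the pixel level and supports the development of vision-language systems enriched with vision-grounded human commonsense.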
Abstract: With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model’s adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.
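[90] Controlled Face Manipulation and Synthesis for Data Augmentation cs.CV | cs.LG
Joris Kirchner, Amogh Gudi, Marian Bittner, Chirag Raman
TL;DR: This paper proposes a controllable facial expression editing method for data augmentation that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). It reduces attribute entanglement through dependency-aware conditioning and orthogonal projection, and adds an expression neutralization step to enable absolute Action Unit (AU) edits. The edited data is used to balance AU occurrence and to diversify identities and demographics, which improves the accuracy and disentanglement of AU detectors.
Details
Motivation: Deep learning vision models need abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet existing edits often introduce artifacts and entangle non-target attributes. The paper studies AU manipulation in facial expression analysis, where AU annotation is costly and AU co-activation causes entanglement.
Result: Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts. The method outperforms alternative data-efficient training strategies, and the learning-curve analysis suggests gains comparable to what would otherwise require substantially more labeled data. Compared to prior methods, the edits are stronger, produce fewer artifacts, and preserve identity better.
Insight: The contributions are: (1) dependency-aware conditioning that accounts for AU co-activation to reduce feature entanglement; (2) orthogonal projection that removes nuisance attribute directions (e.g., glasses); and (3) an expression neutralization step that enables absolute AU edits. Together they give more controllable, higher-quality face editing and an effective approach to data augmentation.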
Abstract: Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Using lightweight linear models, we reduce entanglement of semantic features via (i) dependency-aware conditioning that accounts for AU co-activation, and (ii) orthogonal projection that removes nuisance attribute directions (e.g., glasses), together with an expression neutralization step to enable absolute AU edits. We use these edits to balance AU occurrence by editing labeled faces and to diversify identities/demographics via controlled synthesis. Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts, outperforming alternative data-efficient training strategies and suggesting improvements similar to what would require substantially more labeled data in our learning-curve analysis. Compared to prior methods, our edits are stronger, produce fewer artifacts, and preserve identity better.
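The orthogonal projection step mentioned in this abstract can be sketched in a few lines of Python. This is a generic vector projection, shown under the assumption that the edit direction and the nuisance direction (for example a hypothetical "glasses" direction) are plain vectors in the same latent space; it is not taken from the authors' code.

```python
def project_out(direction, nuisance):
    """Remove from `direction` its component along `nuisance`.

    Computes d - ((d . n) / (n . n)) * n, so the result is orthogonal to
    `nuisance`. Both arguments are equal-length lists of floats, and
    `nuisance` must be non-zero.
    """
    dot = sum(d * n for d, n in zip(direction, nuisance))
    norm_sq = sum(n * n for n in nuisance)
    scale = dot / norm_sq
    return [d - scale * n for d, n in zip(direction, nuisance)]
```

Moving a latent code along the projected direction then changes the target attribute while leaving the nuisance attribute unchanged to first order, which is the intended effect of this kind of disentanglement.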
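[91] No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection cs.CV | cs.AI
Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
TL;DR: This paper proposes LAVIDA, an end-to-end zero-shot video anomaly detection framework that needs no real anomaly data. An Anomaly Exposure Sampler generates pseudo-anomalies to improve generalization, a multimodal large language model is integrated to improve semantic understanding, and a token compression method based on reverse attention handles the spatio-temporal sparsity of anomalous patterns while reducing computational cost. Evaluation on four benchmark datasets shows that LAVIDA achieves SOTA frame-level and pixel-level anomaly detection in the zero-shot setting.
Details
Motivation: Video anomaly detection performs poorly in open-world scenarios, mainly because of limited dataset diversity and inadequate understanding of context-dependent anomalous semantics, and because anomaly data is hard to collect and detect given its rarity and spatio-temporal sparsity.
Result: On four benchmark video anomaly detection datasets, LAVIDA reaches state-of-the-art performance in the zero-shot setting for both frame-level and pixel-level anomaly detection.
Insight: The contributions are pseudo-anomaly generation (the Anomaly Exposure Sampler), which removes the dependence on real anomaly data; integration of a multimodal large language model to improve contextual understanding of anomalous semantics; and a reverse-attention token compression method that handles spatio-temporally sparse anomalous patterns efficiently and cuts computational overhead. Together these enable efficient zero-shot anomaly detection.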
Abstract: Collecting and detecting video anomaly data has long been challenging because anomalies are rare and spatio-temporally scarce. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, (i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. (ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension. (iii) We design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and reduce computational cost. Training is conducted solely on pseudo-anomalies, without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.
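[92] DD-CAM: Minimal Sufficient Explanations for Vision Models Using Delta Debugging cs.CV | cs.SE
Krishna Khadka, Yu Lei, Raghu N. Kacker, D. Richard Kuhn
TL;DR: This paper proposes DD-CAM, a gradient-free framework that generates minimal, sufficient, and decision-preserving explanations for vision models. It identifies and isolates the smallest subset of representational units whose joint activation is enough to preserve the model's original prediction, producing cleaner and more focused saliency maps.
Details
Motivation: Existing methods, such as those based on class activation maps (CAM), usually aggregate all units, which yields cluttered saliency maps that are hard to interpret. The paper seeks a minimal, sufficient subset of features that preserves the prediction, in order to give more faithful and more precise explanations.
Result: Experimental evaluation shows that the method outperforms state-of-the-art CAM-based approaches in both explanation faithfulness and localization accuracy.
Insight: The core novelty is adapting delta debugging, a systematic reduction strategy from software debugging, to explanation generation for vision models so that minimal sufficient subsets can be found efficiently. The search strategy is configured according to whether units interact in the classifier head (testing individual units when they do not, and unit combinations when they do), and the resulting saliency maps highlight only the most essential features.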
Abstract: We introduce a gradient-free framework for identifying minimal, sufficient, and decision-preserving explanations in vision models by isolating the smallest subset of representational units whose joint activation preserves predictions. Unlike existing approaches that aggregate all units, often leading to cluttered saliency maps, our approach, DD-CAM, identifies a 1-minimal subset whose joint activation suffices to preserve the prediction (i.e., removing any unit from the subset alters the prediction). To efficiently isolate minimal sufficient subsets, we adapt delta debugging, a systematic reduction strategy from software debugging, and configure its search strategy based on unit interactions in the classifier head: testing individual units for models with non-interacting units and testing unit combinations for models in which unit interactions exist. We then generate minimal, prediction-preserving saliency maps that highlight only the most essential features. Our experimental evaluation demonstrates that our approach can produce more faithful explanations and achieve higher localization accuracy than the state-of-the-art CAM-based approaches.
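The reduction strategy this abstract adapts can be sketched with the classic delta debugging (ddmin) loop. The Python below is a generic illustration, not the DD-CAM implementation: `preserves` is a hypothetical callback that returns True when keeping only the given units still yields the original prediction, and the empty set is assumed never to preserve it.

```python
def ddmin(units, preserves):
    """Return a 1-minimal sublist of `units` for which `preserves` holds.

    1-minimal means that removing any single unit from the result makes
    `preserves` return False.
    """
    assert preserves(units), "the full set must preserve the prediction"
    n = 2  # current granularity: number of chunks to split into
    while len(units) >= 2:
        chunk = max(len(units) // n, 1)
        subsets = [units[i:i + chunk] for i in range(0, len(units), chunk)]
        reduced = False
        # Try to keep just one chunk.
        for s in subsets:
            if preserves(s):
                units, n, reduced = s, 2, True
                break
        # Otherwise try to drop one chunk (keep its complement).
        if not reduced:
            for s in subsets:
                complement = [u for u in units if u not in s]
                if complement and preserves(complement):
                    units, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(units):
                break  # already at single-unit granularity: 1-minimal
            n = min(n * 2, len(units))  # refine the granularity
    return units
```

For example, with eight units of which only units 2 and 5 matter jointly, `ddmin(list(range(8)), lambda s: {2, 5} <= set(s))` reduces the set to `[2, 5]`. In a real model the callback would mask all other units and re-run the classifier head, which is where most of the cost lies.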
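[93] A Two-Stage Detection-Tracking Framework for Stable Apple Quality Inspection in Dense Conveyor-Belt Environments cs.CV
Keonvin Park, Aditya Pal, Jin Hong Mok
TL;DR: This paper proposes a two-stage detection-tracking framework for stable apple quality inspection in dense conveyor-belt environments. A YOLOv8 model first localizes the apples, ByteTrack multi-object tracking then maintains consistent identities, and a ResNet18 classifies defects on the cropped apple regions, with track-level aggregation used to strengthen temporal consistency.
Details
Motivation: Most existing industrial fruit inspection systems are evaluated at the image level and cannot guarantee temporal stability in video streams, especially under dense multi-object interactions and continuous motion. The paper addresses this by providing a stable multi-apple quality inspection approach for conveyor-belt environments.
Result: Experiments show improved stability compared with frame-wise inference. Video-level industrial metrics, such as track-level defect ratio and temporal consistency, are defined and used to evaluate the robustness of the system under realistic processing conditions.
Insight: The novelty is combining detection, tracking, and classification, and introducing track-level aggregation to enforce temporal consistency and reduce prediction oscillation. Viewed objectively, the framework underlines the importance of integrating tracking into practical automated fruit grading systems and offers a reusable approach for object inspection in similarly dynamic, dense environments.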
Abstract: Industrial fruit inspection systems must operate reliably under dense multi-object interactions and continuous motion, yet most existing works evaluate detection or classification at the image level without ensuring temporal stability in video streams. We present a two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments. An orchard-trained YOLOv8 model performs apple localization, followed by ByteTrack multi-object tracking to maintain persistent identities. A ResNet18 defect classifier, fine-tuned on a healthy-defective fruit dataset, is applied to cropped apple regions. Track-level aggregation is introduced to enforce temporal consistency and reduce prediction oscillation across frames. We define video-level industrial metrics such as track-level defect ratio and temporal consistency to evaluate system robustness under realistic processing conditions. Results demonstrate improved stability compared to frame-wise inference, suggesting that integrating tracking is essential for practical automated fruit grading systems.
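Track-level aggregation as described in this abstract can be sketched as follows. The aggregation rule (a majority vote over per-frame labels) and the definition of the defect ratio (the fraction of tracks whose aggregated label is defective) are assumptions for illustration, since the abstract names the metrics without defining them; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_by_track(frame_predictions):
    """Majority-vote the per-frame labels of each track.

    frame_predictions: iterable of (track_id, label) pairs, one per tracked
    detection per frame. Returns {track_id: most frequent label}.
    """
    votes = defaultdict(Counter)
    for track_id, label in frame_predictions:
        votes[track_id][label] += 1
    return {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}

def track_defect_ratio(track_labels, defect_label="defective"):
    """Assumed metric: fraction of tracks whose aggregated label is defective."""
    if not track_labels:
        return 0.0
    defective = sum(1 for lbl in track_labels.values() if lbl == defect_label)
    return defective / len(track_labels)
```

Because each apple receives one label per track rather than one per frame, a single misclassified frame no longer flips the decision, which is the oscillation the abstract aims to reduce.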
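[94] MRI Contrast Enhancement Kinetics World Model cs.CV
Jindi Kong, Yuting He, Cong Xia, Rongjun Ge, Shuo Li
TL;DR: This paper proposes MRI CEKWorld, a model that simulates MRI contrast enhancement kinetics through SpatioTemporal Consistency Learning (STCL) to address the inefficient information yield of clinical MRI acquisition. Latent Alignment Learning (LAL) enforces spatial content consistency and Latent Difference Learning (LDL) enforces temporally smooth kinetics, so the model generates continuous and realistic contrast-enhanced dynamic images.
Details
Motivation: Clinical MRI contrast-enhanced acquisition suffers from inefficient information yield: a risky, costly acquisition protocol is mismatched with a fixed, sparse acquisition sequence. Existing world models are limited by low-temporal-resolution training data, which leads to distorted content and temporal discontinuities.
Result: Extensive experiments on two datasets show that MRI CEKWorld outperforms baseline methods in generating realistic content and kinetics.
Insight: The contributions are the first world model for MRI contrast enhancement kinetics and the introduction of STCL. Within STCL, LAL uses patient-level structural consistency to constrain content, and LDL learns the continuous kinetics law through interpolation and smoothness constraints in the latent space. This provides a reusable spatiotemporal consistency framework for medical image generation.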
Abstract: Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast-free dynamics. However, the low temporal resolution of MRI acquisition restricts the training of world models, because it yields a sparsely sampled dataset. Directly training a generative model to capture the kinetics has two limitations: (a) because no data exists for the missing time points, the model tends to overfit to irrelevant features, leading to content distortion; (b) because continuous temporal supervision is lacking, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose an MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL), which constructs a patient-specific template and constrains contents to align with it. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL), which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space among interpolated sequences. Extensive experiments on two datasets show that MRI CEKWorld achieves more realistic content and kinetics. Code will be available at https://github.com/DD0922/MRI-Contrast-Enhancement-Kinetics-World-Model.
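[95] Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition cs.CV | cs.SD
Alexandros Haliassos, Rodrigo Mira, Stavros Petridis
TL;DR: This paper proposes CTC-driven teacher forcing, a new method for improving the Unified Speech Recognition (USR) framework. Greedily decoded CTC pseudo-labels are used to generate attention targets in a single forward pass, which substantially reduces training time and improves robustness to out-of-distribution inputs.
Details
Motivation: The existing USR framework relies on autoregressive pseudo-labelling, which makes training expensive, and its decoupled supervision of the CTC and attention branches is prone to self-reinforcing errors, particularly under distribution shifts such as longer sequences, noise, or unseen domains.
Result: The resulting method, USR 2.0, achieves state-of-the-art results on the LRS3, LRS2, and WildVSR benchmarks, halves training time, and surpasses both the original USR and modality-specific self-supervised baselines.
Insight: The novelty is using CTC pseudo-labels to drive the generation of attention targets directly, which combines the robustness of CTC with the expressiveness of attention without costly beam search. Mixed sampling is also introduced to mitigate the decoder's exposure bias.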
Abstract: Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
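The greedy CTC decoding that produces the pseudo-labels in this abstract follows the standard rule: take the best class per frame, merge consecutive repeats, then drop blanks. The Python sketch below shows only that rule; the input layout (a list of per-frame score lists) and the blank index 0 are assumptions, and feeding the result into the decoder for teacher forcing is not shown.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding.

    frame_scores: list of per-frame score lists (one score per class).
    Returns the token sequence obtained by taking the argmax of each frame,
    collapsing consecutive repeats, and removing blank tokens.
    """
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded = []
    prev = None
    for tok in best:
        # Emit a token only when it differs from the previous frame's token
        # and is not the blank symbol.
        if tok != prev and tok != blank:
            decoded.append(tok)
        prev = tok
    return decoded
```

Note that a blank between two identical tokens keeps them separate, so the frame-level path 1, 1, blank, 1, 2, 2, blank decodes to the tokens 1, 1, 2.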
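[96] US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound cs.CV | cs.AI | cs.LG
Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W. Arnold
TL;DR: This paper proposes US-JEPA, a self-supervised representation learning framework for medical ultrasound images. It adopts the Static-teacher Asymmetric Latent Training (SALT) objective, in which a frozen, domain-specific teacher provides stable latent targets, avoiding the reliance of conventional Joint-Embedding Predictive Architectures (JEPAs) on an online teacher updated by exponential moving average. On UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions, US-JEPA matches or exceeds domain-specific and general vision foundation model baselines under linear probing on diverse classification tasks.
Details
Motivation: The inherently noisy acquisition process of ultrasound imaging (low signal-to-noise ratio and stochastic speckle patterns) undermines standard self-supervised methods that rely on pixel-level reconstruction objectives. Conventional JEPA approaches also depend on online teachers that are sensitive to hyperparameters and computationally expensive.
Result: Under linear probing on diverse classification tasks in the UltraBench benchmark, US-JEPA performs on par with or better than domain-specific and general vision foundation model baselines.
Insight: The novelty is adopting the SALT objective, in which a frozen domain-specific teacher provides stable latent targets. This decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher, giving a stable and efficient path to robust ultrasound representations. The paper also provides the first rigorous comparison of publicly available state-of-the-art ultrasound foundation models.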
Abstract: Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.
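[97] DefenseSplat: Enhancing the Robustness of 3D Gaussian Splatting via Frequency-Aware Filtering cs.CV
Yiran Qiao, Yiren Lu, Yunlai Zhou, Rui Yang, Linlin Hou
TL;DR: This paper proposes DefenseSplat. It uses wavelet transforms to analyze how adversarial perturbations behave differently in the low- and high-frequency components of input images, and designs a frequency-aware defense that filters high-frequency noise from training views while preserving low-frequency content. This improves the robustness of 3D Gaussian Splatting (3DGS) to adversarial attacks without significantly impairing training on clean data.
Details
Motivation: 3DGS performs well for real-time, high-fidelity 3D reconstruction, but it has been shown to be highly vulnerable to adversarial perturbations in input views. Small but consistent perturbations can degrade rendering quality, increase training and rendering time, inflate memory usage, and even cause server denial-of-service, so its robustness needs to be improved.
Result: Experiments on multiple benchmarks and across a wide range of attack intensities show that the method substantially improves the robustness of 3DGS without access to clean ground-truth supervision, while maintaining performance on clean inputs and achieving a good trade-off between robustness and performance.
Insight: The novelty is using frequency analysis (wavelet transforms) to reveal the different behavior of adversarial perturbations in low and high frequencies, and designing a frequency-aware filtering defense on that basis. Viewed objectively, the method is simple and effective, offers a new approach to the security and robustness of 3D reconstruction systems, and highlights the potential of frequency-domain processing for adversarial defense.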
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for real-time and high-fidelity 3D reconstruction from posed images. However, recent studies reveal its vulnerability to adversarial corruptions in input views, where imperceptible yet consistent perturbations can drastically degrade rendering quality, increase training and rendering time, and inflate memory usage, even leading to server denial-of-service. In our work, to mitigate this issue, we begin by analyzing the distinct behaviors of adversarial perturbations in the low- and high-frequency components of input images using wavelet transforms. Based on this observation, we design a simple yet effective frequency-aware defense strategy that reconstructs training views by filtering high-frequency noise while preserving low-frequency content. This approach effectively suppresses adversarial artifacts while maintaining the authenticity of the original scene. Notably, it does not significantly impair training on clean data, achieving a desirable trade-off between robustness and performance on clean inputs. Through extensive experiments under a wide range of attack intensities on multiple benchmarks, we demonstrate that our method substantially enhances the robustness of 3DGS without access to clean ground-truth supervision. By highlighting and addressing the overlooked vulnerabilities of 3D Gaussian Splatting, our work paves the way for more robust and secure 3D reconstructions.
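The general idea of filtering high-frequency content with a wavelet transform can be shown with a one-level 2D Haar transform. This is a deliberately simplified sketch under stated assumptions: the abstract does not specify the wavelet family, the number of decomposition levels, or how the high-frequency bands are treated, so zeroing all detail bands of a single Haar level is an illustrative choice, not the DefenseSplat filter.

```python
def haar_lowpass(image):
    """One-level 2D Haar low-pass filter for a grayscale image.

    image: list of rows (lists of floats) with even height and width.
    For each 2x2 block, the orthonormal Haar approximation coefficient is
    LL = (a + b + c + d) / 2. Zeroing the three detail bands (LH, HL, HH)
    and inverting the transform sets every pixel in the block to LL / 2,
    which equals the block mean.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(0, height, 2):
        for j in range(0, width, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll = (a + b + c + d) / 2.0  # approximation coefficient
            for di in (0, 1):
                for dj in (0, 1):
                    out[i + di][j + dj] = ll / 2.0  # inverse with details zeroed
    return out
```

A practical defense would more likely attenuate or threshold the detail bands rather than discard them completely, since removing them outright also blurs legitimate scene detail.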
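[98] RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework cs.CV | cs.AI
Mohammad Tahmid Noor, Shayan Abrar, Jannatul Adan Mahi, Md Parvez Mia, Asaduzzaman Hridoy
TL;DR: This study proposes a deep learning framework, deployed as RetinaVision, for precise retinal disease classification from optical coherence tomography (OCT) images. On the Retinal OCT Image Classification - C8 dataset, which contains 24,000 labeled images spanning eight conditions, it evaluates CNN architectures such as Xception and InceptionV3, uses CutMix and MixUp data augmentation to improve generalization, and applies GradCAM and LIME for interpretability evaluation. A web application was developed for deployment in real-world scenarios.
Details
Motivation: Early and accurate classification of retinal diseases is essential for preventing vision loss and guiding clinical management. The study aims to develop a deep learning method that combines high accuracy with interpretability to meet this clinical need.
Result: On the Retinal OCT Image Classification - C8 dataset, Xception achieved the highest accuracy at 95.25%, followed closely by InceptionV3 at 94.82%, indicating that the method classifies OCT retinal diseases effectively.
Insight: The novelty is combining data augmentation (CutMix, MixUp) with explainable AI tools (GradCAM, LIME) in an end-to-end pipeline from model training to clinically interpretable use, deployed through a web application. The work stresses the importance of pursuing both accuracy and interpretability in medical applications.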
Abstract: Early and accurate classification of retinal diseases is critical for preventing vision loss and guiding clinical management. In this study, we propose a deep learning method for retinal disease classification using optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and evaluated with two convolutional neural network (CNN) architectures, Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to improve model generalization, and GradCAM and LIME were applied to evaluate interpretability. We deployed the method in a real-world scenario through a web application named RetinaVision. Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods enable effective OCT retinal disease classification and highlight the importance of combining accuracy with interpretability in clinical applications.
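Of the two augmentation techniques named in this abstract, MixUp is the simpler to illustrate: two samples and their one-hot labels are blended with a weight drawn from a Beta(alpha, alpha) distribution. The sketch below works on flattened pixel lists; the default `alpha=0.2` and the function signature are assumptions, since the abstract does not report the settings used.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels (MixUp).

    x1, x2: equal-length lists of pixel values (flattened images).
    y1, y2: equal-length one-hot label lists.
    Returns (mixed_x, mixed_y, lam) with lam drawn from Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam
```

The mixed label remains a valid probability distribution because it is a convex combination of two one-hot vectors. CutMix differs in that it pastes a rectangular patch from one image into the other and weights the labels by the patch area.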
[99] MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose cs.CV | cs.AI
Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang
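TL;DR: This paper presents MultiDiffSense, a diffusion-based method for multi-modal visuo-tactile image generation. Dual conditioning on CAD-derived depth maps and structured prompts (encoding sensor type and 4-DoF contact pose) lets a single architecture generate controllable, physically consistent images for several vision-based tactile sensors (ViTac, TacTip, ViTacTip), easing the data-collection bottleneck in tactile sensing.
Details
Motivation: Acquiring aligned multi-modal visuo-tactile datasets is slow and costly, and existing synthesis methods are mostly single-modality, which limits cross-modal learning. A controllable multi-modal generation method is therefore needed.
Result: Evaluated on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by 36.3% on ViTac, 134.6% on ViTacTip, and 64.7% on TacTip. For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real data halves the real data required while maintaining competitive performance.
Insight: The contribution is a unified diffusion model with dual conditioning on CAD depth maps and structured prompts, enabling controllable generation across sensors and modalities. Viewed objectively, synthetic data effectively reduces the need for real data, offering a scalable way to generate multi-modal datasets for robotic applications.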
Abstract: Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
[100] PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis cs.CV
Zhilin Guo, Jing Yang, Kyle Fogarty, Jingyi Wan, Boqiao Zhang
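TL;DR: PoseCraft is a diffusion framework for photorealistic human image synthesis whose core innovation is a tokenized 3D interface. It encodes sparse 3D body landmarks and camera extrinsics as discrete conditioning tokens and injects them into the diffusion process via cross-attention, avoiding the re-projection ambiguity that conventional 2D control images suffer under large pose and viewpoint changes. The paper also introduces GenHumanRF, a data generation workflow for training and evaluation at scale.
Details
Motivation: Existing approaches to digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera control have drawbacks: skinning-based workflows are laborious, and neural volumetric methods rely on canonical templates or require re-optimization for each unseen pose. PoseCraft aims to address these issues and enable more flexible, higher-fidelity human image synthesis.
Result: Experiments show that PoseCraft significantly outperforms diffusion-centric methods in perceptual quality and matches or exceeds the latest state-of-the-art volumetric rendering methods on quantitative metrics, while better preserving details such as fabric and hair.
Insight: The main contribution is a tokenized 3D conditioning interface that feeds 3D semantics (landmarks and camera parameters) into the diffusion process directly as discrete tokens, which preserves 3D consistency better than relying on rasterized 2D geometry controls. In addition, the GenHumanRF workflow supplies large-scale, diverse supervision for training, an important enabler of the results.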
Abstract: Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fittings, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to latest volumetric rendering SOTA while better preserving fabric and hair details.
[101] MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations cs.CV | cs.LG
Nilay Yilmaz, Maitreya Patel, Naga Sai Abhiram Kusumba, Yixuan He, Yezhou Yang
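TL;DR: This paper introduces MentalBlackboard, a benchmark that evaluates the spatial visualization ability of vision-language models (imagining, transforming, and manipulating the spatial characteristics of objects), focusing on prediction and planning tasks in Paper Folding and Hole Punching tests. It finds that even state-of-the-art models struggle markedly with symmetrical transformations, rotation awareness, and planning multi-stage symmetry processes.
Details
Motivation: To explore whether current state-of-the-art vision-language models possess the human cognitive ability of spatial visualization, that is, whether they can mentally imagine and manipulate spatial transformations of objects.
Result: In the prediction task, models perform poorly on symmetrical transformations and rotations. In the planning task, Claude Opus 4.1 achieves the highest accuracy, at only 10%. The top-performing model, o3, reaches 71.6% accuracy on the generalization task (which does not require spatial visualization) but only 25% on text-based prediction tasks.
Insight: The contribution is an open-ended spatial visualization benchmark that exposes core limitations of VLMs in higher-level spatial reasoning, especially symmetry and multi-stage planning, adding a new dimension for evaluating model cognition.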
Abstract: Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly. Also, rotations introduce a significant challenge to the physical situational awareness for models. The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10%. The top-performing model, o3, attains a peak performance of 71.6% on the generalization task, which does not require spatial visualization but transfers spatial data; however, it achieves only 25% accuracy on text-based prediction tasks.
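Illustration (not from the paper): the symmetrical transformation behind a hole-punching test can be shown with a toy model. It assumes axis-aligned fold lines and a punch that passes through every layer. The fold encoding and the function name `unfold` are hypothetical and do not reflect the benchmark's actual data format.

```python
def unfold(holes, folds):
    """Undo folds in reverse order; each unfold mirrors every hole across the fold line.

    `holes` is a list of (x, y) points punched into the folded paper.
    Each fold is ("vertical", c) for the line x = c or ("horizontal", c)
    for the line y = c, listed in the order the folds were made.
    """
    result = set(holes)
    for axis, c in reversed(folds):
        if axis == "vertical":
            mirrored = {(2 * c - x, y) for x, y in result}
        elif axis == "horizontal":
            mirrored = {(x, 2 * c - y) for x, y in result}
        else:
            raise ValueError(f"unknown fold axis: {axis}")
        result |= mirrored  # the original holes stay, the mirrored copies are added
    return sorted(result)
```

Each fold doubles the number of holes that are not on the fold line, so a single punch after two folds produces four holes. This is the multi-stage symmetry process that, according to the abstract, models find difficult.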
[102] Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization cs.CV
Huayu Wang, Bahaa Alattar, Cheng-Yen Yang, Hsiang-Wei Huang, Jung Heon Kim
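TL;DR: This paper proposes CL-MC, a detector-in-the-loop tracking framework for glottic opening localization in laryngoscopy videos. It combines a single-frame detector with a SAM2 tracker and uses high-confidence detections to actively rectify the tracker's memory, addressing tracking drift and achieving state-of-the-art performance on emergency intubation videos.
Details
Motivation: To address the temporal stability challenge in glottic opening localization, which arises because single-frame detectors lack temporal context while foundation-model trackers suffer from memory drift, particularly in video laryngoscopy scenes with rapid tissue deformation, occlusions, and visual ambiguity.
Result: On an emergency intubation video dataset, CL-MC significantly reduces drift and missing rate compared with SAM2 variants and open-loop methods, achieving state-of-the-art performance.
Insight: The contribution is a training-free detector-in-the-loop supervision mechanism built on confidence-aligned state decisions and active memory rectification. High-confidence detections act as semantic reset signals that overwrite corrupted tracker memory, effectively mitigating drift accumulation in complex endoscopic scenes.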
Abstract: Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lacks temporal context, while the latter suffers from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises Segment Anything Model 2 (SAM2) through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, effectively mitigating drift accumulation with a training-free foundation tracker in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with the SAM2 variants and open-loop methods. Our results establish memory correction as a crucial component for reliable clinical video tracking. Our code will be available at https://github.com/huayuww/CL-MR.
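Illustration (not from the paper): the control flow of a detector-in-the-loop reset can be sketched with a toy tracker. `ToyTracker` is a stand-in and is not the SAM2 API. The single threshold `reset_threshold=0.8` is an assumed simplification of the confidence-aligned state decisions the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2)
    confidence: float   # detector score in [0, 1]


@dataclass
class ToyTracker:
    """Stand-in for a memory-based tracker (hypothetical, not SAM2)."""
    memory: list = field(default_factory=list)

    def track(self):
        # Predict by repeating the most recent memory entry, or None if empty.
        return self.memory[-1] if self.memory else None

    def reset(self, box):
        # Overwrite the whole memory with a trusted box.
        self.memory = [box]


def closed_loop_step(tracker, detection, reset_threshold=0.8):
    """Process one frame and return (box, was_reset)."""
    if detection is not None and detection.confidence >= reset_threshold:
        # A high-confidence detection replaces possibly drifted memory.
        tracker.reset(detection.box)
        return detection.box, True
    # Otherwise fall back to the tracker's own prediction.
    return tracker.track(), False
```

The design point is that the detector never competes with the tracker frame by frame. It only intervenes when it is confident, which keeps temporal smoothness from the tracker while bounding how long drift can accumulate.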
[103] PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention cs.CV
Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong
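TL;DR: This paper proposes PA-Attack, a gray-box adversarial attack on the vision encoders of large vision-language models (LVLMs). Prototype-anchored guidance provides a stable attack direction, and a two-stage attention enhancement mechanism concentrates perturbations on critical visual tokens and adaptively recalibrates attention weights, improving attack effectiveness and task generalization.
Details
Motivation: Existing white-box attacks generalize poorly across tasks, and black-box methods depend on expensive transfer attacks and are inefficient. The vision encoder, standardized and often shared across LVLMs, offers a stable gray-box entry point with strong cross-model transferability, which motivates a gray-box attack built on it.
Result: Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average score reduction rate (SRR) of 75.1%, demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs.
Insight: The contribution lies in using the vision encoder as a stable pivot for gray-box attacks, with prototype-anchored guidance to address the attribute-restricted issue and limited task generalization of conventional attacks, plus a two-stage attention enhancement mechanism that dynamically focuses and adjusts the perturbation. This offers a new, more efficient approach to adversarial attacks on multimodal models.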
Abstract: Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at https://github.com/hefeimei06/PA-Attack.
[104] Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images cs.CV
Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao
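TL;DR: This paper proposes Hepato-LLaVA, an expert multimodal large language model for analyzing whole slide images in hepatocellular carcinoma pathology. It introduces a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology, and builds HepatoPathoVQA, a clinical dataset of 33K hierarchically structured question-answer pairs. Experiments show state-of-the-art performance on hepatocellular carcinoma diagnosis and captioning tasks.
Details
Motivation: Current computational approaches for whole slide images are constrained by fixed-resolution processing and inefficient feature aggregation, leading to severe information loss or high feature redundancy, and they cannot meet the need for fine-grained analysis of gigapixel images in hepatocellular carcinoma diagnosis.
Result: Hepato-LLaVA achieves state-of-the-art performance on hepatocellular carcinoma diagnosis and image captioning tasks, significantly outperforming existing methods.
Insight: The main contributions are: (1) the Sparse Topo-Pack Attention mechanism, which explicitly models 2D tissue topology and aggregates local diagnostic evidence into semantic summary tokens while preserving global context; and (2) HepatoPathoVQA, a clinically grounded, multi-scale, hierarchically structured visual question answering dataset validated by expert pathologists, which addresses the lack of multi-scale data in this field.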
Abstract: Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
[105] TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation cs.CV
Dong-Guw Lee, Tai Hyoung Rhee, Hyunsoo Jang, Young-Sik Shin, Ukcheol Shin
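TL;DR: This paper proposes TherA, a framework for controllable RGB-to-thermal-infrared image translation via thermal-aware visual-language prompting. It couples TherA-VLM with a latent diffusion model to generate diverse, thermally plausible images.
Details
Motivation: Thermal infrared (TIR) data is difficult to collect and annotate, and existing RGB-to-TIR translation methods rely on RGB priors while overlooking thermal physics, yielding implausible heat distributions.
Result: The method achieves state-of-the-art translation performance, with zero-shot translation performance improving by up to 33% averaged across all metrics.
Insight: The contribution is thermal-aware visual-language prompting that encodes scene, object, material, and heat-emission context, enabling fine-grained control over time of day, weather, and object state and improving the realism and controllability of thermal infrared synthesis.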
Abstract: Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object levels. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation performance by up to 33% averaged across all metrics.
[106] CountEx: Fine-Grained Counting via Exemplars and Exclusion cs.CV
Yifeng Huang, Gia Khanh Nguyen, Minh Hoai
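TL;DR: This paper presents CountEx, a discriminative visual counting framework for fine-grained counting via exemplars and exclusion. It targets a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. CountEx lets users express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts that combine natural language descriptions with optional visual exemplars. At its core is a novel Discriminative Query Refinement module that refines the counting query by jointly reasoning over inclusion and exclusion cues.
Details
Motivation: Existing prompt-based counting methods cannot explicitly exclude visually similar distractors in cluttered scenes. The goal is to avoid the ambiguity and overcounting caused by confusable object categories.
Result: On the newly introduced CoCount benchmark (1,780 videos and 10,086 annotated frames across 97 category pairs), CountEx achieves substantial improvements over state-of-the-art methods when counting objects from both known and novel categories.
Insight: The contribution is letting users specify inclusion and exclusion intent together through multimodal prompts, along with a Discriminative Query Refinement module that handles both kinds of cue jointly. The module refines the counting query by identifying shared features, isolating exclusion-specific patterns, and applying selective suppression, making fine-grained counting more robust.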
Abstract: This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at https://github.com/bbvisual/CountEx.
[107] UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment cs.CV
Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang
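TL;DR: This paper proposes UrbanAlign, a training-free, post-hoc semantic calibration framework for aligning vision-language model (VLM) outputs with human preferences. Through three tightly coupled stages (concept mining, multi-agent structured scoring, and geometric calibration) on top of a frozen VLM, it achieves preference alignment on subjective perception tasks using only a handful of human annotations.
Details
Motivation: Aligning VLM outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and substantial compute. The paper sets out to show that, for subjective perception tasks, this alignment can be achieved without any model training, because VLMs are already strong concept extractors but poor decision calibrators.
Result: On the Place Pulse 2.0 urban perception dataset, the method achieves 72.2% accuracy (κ=0.45) across six categories, outperforming the best supervised baseline by 15.1 percentage points and uncalibrated VLM scoring by 16.3 percentage points, with full dimension-level interpretability and no modification of model weights.
Insight: The claimed contribution is a training-free, post-hoc concept-bottleneck pipeline that unifies concept mining, multi-agent structured scoring, and geometric calibration through an end-to-end dimension optimization loop. Viewed objectively, the core idea is to treat the VLM as a fixed concept extractor and recast preference alignment as an external calibration problem, which avoids costly model training and preserves interpretability.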
Abstract: Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($κ=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
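Illustration (not from the paper): locally-weighted ridge regression can be shown in a reduced form. The paper calibrates multi-dimensional concept scores on a hybrid visual-semantic manifold. The sketch below assumes a single scalar score, a Gaussian kernel, and an L2 penalty on the slope only. The function name and the `bandwidth` and `ridge` defaults are hypothetical.

```python
import math


def local_ridge_predict(x_query, xs, ys, bandwidth=1.0, ridge=0.0):
    """Fit y ~ a + b*x around x_query with Gaussian weights and a ridge penalty on b.

    Minimizes sum_i w_i * (y_i - a - b*x_i)^2 + ridge * b^2, where
    w_i = exp(-(x_i - x_query)^2 / (2 * bandwidth^2)), then returns a + b*x_query.
    """
    weights = [math.exp(-((x - x_query) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    s_w = sum(weights)
    s_x = sum(w * x for w, x in zip(weights, xs))
    s_xx = sum(w * x * x for w, x in zip(weights, xs))
    s_y = sum(w * y for w, y in zip(weights, ys))
    s_xy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Normal equations: [[s_w, s_x], [s_x, s_xx + ridge]] @ [a, b] = [s_y, s_xy],
    # solved here with Cramer's rule.
    det = s_w * (s_xx + ridge) - s_x * s_x
    a = (s_y * (s_xx + ridge) - s_x * s_xy) / det
    b = (s_w * s_xy - s_x * s_y) / det
    return a + b * x_query
```

Here `xs` would be raw scores from the frozen VLM and `ys` the matching human ratings. Because the fit is recomputed around each query, the calibration can bend differently in different regions of the score space without changing any model weights.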
[108] Decoupling Vision and Language: Codebook Anchored Visual Adaptation cs.CV
Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang
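TL;DR: This paper proposes CRAFT, a lightweight method that addresses the weak performance of large vision-language models (LVLMs) on domain-specific visual tasks such as medical image diagnosis or fine-grained classification. It fine-tunes the vision encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design lets the adapted encoder seamlessly boost LVLMs with different language architectures, as long as they share the same codebook.
Details
Motivation: The vision encoders of LVLMs often underperform on domain-specific visual tasks, and their representation errors cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes.
Result: Across 10 domain-specific benchmarks such as VQARAD and PlantVillage, CRAFT achieves an average gain of 13.51% while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
Insight: The paper's claimed contribution is CRAFT, a lightweight adaptation method that decouples the vision and language components: anchoring visual representations with a discrete codebook achieves domain adaptation without touching the rest of the model, so the adapted encoder can be applied flexibly to LVLMs with different language architectures. Viewed objectively, the core idea is to use the discrete codebook as a stable interface that decouples vision-encoder updates from the language model, improving adaptation efficiency and model compatibility.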
Abstract: Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM’s linguistic capabilities and outperforming peer methods that operate on continuous tokens.
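Illustration (not from the paper): the abstract does not say how features are assigned to codebook entries. The sketch below assumes the usual vector-quantization rule, nearest entry by squared Euclidean distance. The function name `quantize` is hypothetical.

```python
def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    def sq_dist(u, v):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))
        for feature in features
    ]
```

The language model only ever sees code indices. As long as the fine-tuned encoder's features still land near meaningful codebook entries, the interface to the language model stays the same, which is why an adapted encoder could be reused across LVLMs that share the codebook.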
[109] Physics-informed Active Polarimetric 3D Imaging for Specular Surfaces cs.CV | physics.optics
Jiazhang Wang, Hyelim Yang, Tianyi Wang, Florian Willomitzer
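TL;DR: This paper proposes a physics-informed deep learning framework for single-shot polarimetric 3D imaging of complex specular surfaces. It combines polarization cues with geometric information encoded by structured illumination, processes them through a dual-encoder architecture with mutual feature modulation, and directly infers surface normals, enabling fast, accurate, and robust single-shot 3D imaging.
Details
Motivation: Fast, accurate 3D imaging of specular surfaces in dynamic settings such as in-line inspection or hand-held scanning remains challenging. Deflectometry relies on multi-shot acquisition; Fourier-based single-shot methods degrade on surfaces with high spatial frequency structure or large curvature; and polarimetric 3D imaging, although single-shot and robust, is limited in accuracy by the orthographic imaging assumption.
Result: The method achieves accurate and robust normal estimation in a single shot with fast inference, making it suitable for practical 3D imaging of complex specular surfaces. The abstract does not mention specific benchmarks or state-of-the-art comparisons.
Insight: The contribution is to use polarization cues as orientation priors that complement the geometric information encoded by structured illumination, with a dual-encoder architecture whose mutual feature modulation resolves their nonlinear coupling and infers surface normals directly, overcoming the accuracy limit that the orthographic imaging assumption places on conventional polarimetric methods.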
Abstract: 3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial frequency structure or large curvature. Alternatively, polarimetric 3D imaging in computer vision operates in a single-shot fashion and exhibits robustness to geometric complexity. However, its accuracy is fundamentally limited by the orthographic imaging assumption. In this paper, we propose a physics-informed deep learning framework for single-shot 3D imaging of complex specular surfaces. Polarization cues provide orientation priors that assist in interpreting geometric information encoded by structured illumination. These complementary cues are processed through a dual-encoder architecture with mutual feature modulation, allowing the network to resolve their nonlinear coupling and directly infer surface normals. The proposed method achieves accurate and robust normal estimation in single-shot with fast inference, enabling practical 3D imaging of complex specular surfaces.
[110] Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model cs.CV
Zheang Huai, Hui Tang, Hualiang Wang, Xiaomeng Li
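TL;DR: This paper proposes FRLA (Forgetting-Resistant and Lesion-Aware), a new source-free domain adaptation method for fundus image analysis. It uses a vision-language (ViL) model to assist adaptation: a forgetting-resistant adaptation module preserves the target model's reliable predictions, and a lesion-aware adaptation module exploits the ViL model's fine-grained knowledge, improving target-domain performance with only unlabeled target data and the source model.
Details
Motivation: Conventional source-free domain adaptation (SFDA) methods are error-prone under domain shift, and existing methods that leverage ViL models for SFDA have two shortcomings: the target model forgets some of its own superior predictions during adaptation, and the rich fine-grained knowledge in the ViL model is overlooked.
Result: Extensive experiments show that the method not only significantly outperforms the vision-language model itself but also achieves consistent improvements over state-of-the-art methods.
Insight: The contribution is the explicit design of two modules: the forgetting-resistant module prevents performance degradation by preserving the target model's confident predictions, and the lesion-aware module uses the ViL model's fine-grained, patch-wise predictions to make the target model aware of lesion areas, integrating the foundation model's prior knowledge more effectively.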
Abstract: Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: (i) Although mutual information is exploited to consider the joint distribution between the predictions of ViL model and the target model, we argue that the forgetting of some superior predictions of the target model still occurs, as indicated by the decline of the accuracies of certain classes during adaptation; (ii) Prior research disregards the rich, fine-grained knowledge embedded in the ViL model, which offers detailed grounding for fundus image diagnosis. In this paper, we introduce a novel forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model. Specifically, a forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, and a lesion-aware adaptation module yields patch-wise predictions from ViL model and employs them to help the target model be aware of the lesion areas and leverage the ViL model’s fine-grained knowledge. Extensive experiments show that our method not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods. Our code will be released.
[111] Exploiting Label-Independent Regularization from Spatial Dependencies for Whole Slide Image Analysis cs.CV
Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li
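TL;DR: This paper proposes a spatially regularized multiple instance learning (MIL) framework to address the challenges that scarce annotations and immense data size pose for whole slide image analysis. It uses spatial dependencies among patch features as label-independent regularization signals and jointly optimizes feature-induced spatial reconstruction and label-guided classification objectives to improve performance.
Details
Motivation: Whole slide image analysis faces immense data size and scarce annotations. Existing MIL methods suffer from a fundamental imbalance in which a single bag-level label must guide the learning of numerous patch-level features, making it difficult to reliably identify discriminative patches during training and leading to unstable optimization and suboptimal solutions.
Result: Experiments on multiple public datasets show significant improvements over existing state-of-the-art methods.
Insight: The contribution is to use spatial relationships among patch features as label-independent regularization signals, enforcing consistency in feature learning by jointly optimizing spatial reconstruction and classification objectives. This offers a new regularization approach for weakly supervised learning.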
Abstract: Whole slide images, with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing MIL methods face challenges due to the fundamental imbalance where a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction.
[112] MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models cs.CV
Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
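TL;DR: This paper introduces MICON-Bench, a benchmark for evaluating unified multimodal models on multi-image context generation. To address the shortcomings of existing models on such tasks, it also proposes a training-free Dynamic Attention Rebalancing mechanism that improves generation quality and cross-image coherence.
Details
Motivation: Existing benchmarks focus mainly on text-to-image or single-image editing tasks and rarely evaluate multi-image context reasoning and generation, so a dedicated benchmark is needed to expose and address this challenge.
Result: Extensive experiments on various state-of-the-art open-source models show that MICON-Bench effectively exposes multi-image reasoning challenges and that Dynamic Attention Rebalancing significantly improves generation quality and cross-image coherence.
Insight: The contribution is a comprehensive benchmark focused on multi-image context generation, together with a training-free, plug-and-play Dynamic Attention Rebalancing mechanism applied at inference to enhance coherence and reduce hallucinations under multi-image input.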
Abstract: Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. GitHub: https://github.com/Angusliuuu/MICON-Bench.
[113] A Text-Guided Vision Model for Enhanced Recognition of Small Instances cs.CV
Hyun-Ki Jung
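TL;DR: This paper proposes an improved text-guided vision model aimed at raising the accuracy and efficiency of small-object detection in drone imagery. Building on YOLO-World, it replaces the C2f layer in the backbone with a C3k2 layer to strengthen local feature representation and uses parallel processing optimization to speed up computation and keep the design lightweight. Experiments on the VisDrone dataset show that the model outperforms the original YOLO-World in precision, recall, F1 score, and mAP@0.5 while reducing parameter count and computation.
Details
Motivation: As drone-based object detection evolves, demand is shifting from merely detecting objects to letting users precisely identify specific targets through text prompts, especially for small objects. Existing models still leave room for improvement in accuracy and efficiency, so a more efficient, lightweight text-guided detection model is needed.
Result: On the VisDrone dataset, the improved model raises precision from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%. The parameter count falls from 4 million to 3.8 million and FLOPs from 15.7 billion to 15.2 billion, so accuracy improves while the model becomes lighter.
Insight: The contribution is replacing the backbone's C2f layer with a C3k2 layer to better capture local features of small objects or objects with clearly defined boundaries, together with parallel processing optimization that improves processing speed and reduces model complexity. This offers a practical solution for precise, efficient object detection in drone applications.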
Abstract: As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its enhanced accuracy. Furthermore, the model demonstrates superior lightweight performance, with the parameter count reduced from 4 million to 3.8 million and FLOPs decreasing from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone-based applications.
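Illustration (not from the paper): the reported F1 scores can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The helper names below are hypothetical. All input numbers come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


baseline_f1 = f1_score(40.6, 30.8)   # original YOLO-World, values in percent
improved_f1 = f1_score(41.6, 31.0)   # proposed model, values in percent

param_cut = relative_reduction(4.0, 3.8)     # parameters, in millions
flops_cut = relative_reduction(15.7, 15.2)   # FLOPs, in billions
```

Rounded to one decimal place, the two F1 values agree with the 35% and 35.5% quoted in the abstract. The parameter and FLOPs figures correspond to reductions of about 5% and 3%.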
[114] Test-Time Computing for Referring Multimodal Large Language Models cs.CV
Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu
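TL;DR: This paper proposes ControlMLLM++, a test-time adaptation framework that requires no retraining or fine-tuning. It injects learnable visual prompts into a frozen multimodal large language model to steer the model toward user-specified image regions, enabling fine-grained, region-based visual reasoning.
Details
Motivation: Existing MLLMs struggle with fine-grained, region-based visual understanding at inference time and typically require costly retraining or fine-tuning to acquire it.
Result: The method supports diverse visual prompt types (bounding boxes, masks, scribbles, and points) and demonstrates strong out-of-domain generalization and interpretability.
Insight: The core idea is to exploit the semantic correspondences that cross-modal attention maps intrinsically encode: during inference, a latent visual token modifier is optimized via a task-specific energy function to steer model attention. An improved optimization strategy and a prompt debiasing mechanism further enhance optimization stability and mitigate language prompt biases.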
Abstract: We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
[115] ORION: ORthonormal Text Encoding for Universal VLM AdaptatION cs.CV
Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed
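TL;DR: ORION is a text encoder fine-tuning framework that improves vision-language model (VLM) performance by making the textual representations of class names more orthogonal. It uses low-rank adaptation (LoRA) with an orthogonality loss and a prototype deviation penalty to refine the text prototypes of pretrained VLMs, strengthening task-specific discriminability.
Details
Motivation: In zero-shot classification with existing VLMs, text prototypes derived from frozen text encoders and handcrafted prompts may be correlated or weakly separated, limiting task-specific discriminability.
Result: Extensive experiments on 11 benchmarks and three large VLM backbones show that the refined text embeddings are effective replacements for standard CLIP prototypes. Added as a plug-and-play module to various state-of-the-art methods, ORION yields consistent and significant gains across zero-shot, few-shot, and test-time adaptation settings.
Insight: The contribution is a text encoder fine-tuning framework based on an orthogonality loss and a prototype deviation penalty, optimized efficiently via low-rank adaptation. The paper also gives the orthogonality penalty a probabilistic interpretation, connecting it to maximum likelihood estimation via Huygens' theorem, and the result is more discriminative text representations.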
Abstract: Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.
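Illustration (not from the paper): the abstract describes a loss with two terms, one for pairwise orthogonality and one for deviation from the initial prototypes, but does not give its exact form. The sketch below assumes squared cosine similarity for the first term, squared Euclidean distance for the second, and a scalar `weight` between them. All names are hypothetical, and the LoRA optimization itself is not shown.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def orthogonality_deviation_loss(prototypes, initial_prototypes, weight=1.0):
    """Assumed two-term loss over L2-normalized class prototypes."""
    current = [normalize(p) for p in prototypes]
    initial = [normalize(p) for p in initial_prototypes]

    # Term 1: squared cosine similarity over all distinct class pairs.
    # It is zero exactly when the prototypes are mutually orthogonal.
    ortho = 0.0
    for i in range(len(current)):
        for j in range(i + 1, len(current)):
            cosine = sum(a * b for a, b in zip(current[i], current[j]))
            ortho += cosine * cosine

    # Term 2: squared distance of each prototype from its initial value.
    deviation = sum(
        sum((a - b) ** 2 for a, b in zip(cur, ini))
        for cur, ini in zip(current, initial)
    )
    return ortho + weight * deviation
```

The two terms pull in opposite directions: the first spreads class prototypes apart, and the second keeps them close to the pretrained embeddings so that the knowledge in the text encoder is not discarded.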
[116] Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems cs.CV | cs.CR | cs.LG
Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu
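TL;DR: This paper systematically evaluates how effective low-cost cosmetic attacks (fake beards, grey hair, makeup, and simulated wrinkles) are against AI age estimation systems. These simple physical modifications frequently cause minors to be misclassified as adults: a synthetic beard alone reaches an Attack Conversion Rate (ACR) of 28% to 69% across all tested models, and the combined attack reaches up to 83% ACR.
Details
Motivation: Age estimation systems are widely used to restrict minors' access to online content, but their robustness to cosmetic modifications has not been systematically evaluated. The paper investigates whether low-cost cosmetic attacks can make these systems misjudge the age of minors.
Result: On a test set of 329 facial images of people aged 10 to 21, with attacks simulated by a VLM image editor, the authors evaluate five specialized architectures and three vision-language models. The combined attack raises predicted age by 7.7 years on average and reaches up to 83% ACR. Vision-language models show slightly lower ACR (59%-71%) than specialized models (63%-83%), but the difference is not statistically tested.
Insight: The paper introduces the ACR, an evaluation metric that does not depend on the ratio of minors to adults in the test set, and presents the first systematic evaluation of the threat that cosmetic attacks pose to age estimation systems. The study exposes a critical vulnerability in deployed systems and argues that adversarial robustness evaluation should be a mandatory criterion for model selection.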
Abstract: Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
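The ACR definition in the abstract (the fraction of images predicted as minor at baseline that flip to adult after the attack) translates directly into a few lines of code. The sketch below is illustrative only: the function name and the adult threshold of 18 years are assumptions, since the abstract does not state the cutoff used.

```python
def attack_conversion_rate(baseline_ages, attacked_ages, adult_age=18.0):
    """Fraction of baseline-minor predictions that flip to adult after the attack.

    adult_age=18 is an assumed threshold; the abstract does not specify one.
    """
    if len(baseline_ages) != len(attacked_ages):
        raise ValueError("need one attacked prediction per baseline prediction")
    # Keep only images the model predicted as minor before the attack.
    pairs = [(b, a) for b, a in zip(baseline_ages, attacked_ages) if b < adult_age]
    if not pairs:
        raise ValueError("ACR is undefined when no image is predicted as minor at baseline")
    # Count how many of those predictions cross the adult threshold after the attack.
    flipped = sum(1 for _, a in pairs if a >= adult_age)
    return flipped / len(pairs)
```

Because the denominator counts only baseline-minor predictions, the value does not change when adult-predicted images are added to or removed from the test set, which is the population-agnostic property the abstract refers to.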
[117] Vinedresser3D: Agentic Text-guided 3D Editing cs.CV
Yankuan Chi, Xiang Li, Zixuan Huang, James M. Rehg
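TL;DR: This paper presents Vinedresser3D, an agentic framework for text-guided 3D editing. A multimodal large language model analyzes the original 3D asset and the editing instruction to produce structural and appearance-level text guidance, an image editing model supplies visual guidance, and an inversion-based rectified-flow inpainting pipeline performs the edit in the latent space of a native 3D generative model, yielding high-quality, mask-free 3D editing.
Details
Motivation: Current text-guided 3D editing methods struggle to simultaneously handle complex instructions, automatically localize the edit region in 3D space, and preserve unedited content. The paper aims to address these challenges and achieve more precise, coherent 3D editing.
Result: Experiments on a variety of 3D editing tasks show that Vinedresser3D outperforms existing baselines in both automatic metrics and human preference studies.
Insight: The novelty is an agentic framework that combines the analytical ability of a multimodal LLM with visual guidance from an image editing model and edits in the 3D latent space via a rectified-flow inpainting pipeline. This enables automatic inference of the edit region and edit type while maintaining 3D consistency and preserving content during editing.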
Abstract: Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the edit region and edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.
[118] VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense cs.CV
Nadav Kadvil, Ayellet Tal
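TL;DR: This paper proposes VALD, a multi-stage vision attack detection method for efficiently defending large vision-language models (LVLMs) against adversarial image attacks. It combines image transformations with agentic data consolidation to recover correct model behavior without training. A two-stage detection mechanism quickly filters out clean inputs, and a large language model (LLM) is invoked only when needed to resolve attack-induced divergences, achieving state-of-the-art accuracy with high efficiency.
Details
Motivation: LVLMs are vulnerable to adversarial image attacks that subtly bias model outputs toward plausible but incorrect responses, so a general, efficient, and training-free defense is needed.
Result: In experiments, the method reaches state-of-the-art accuracy while remaining notably efficient: most clean images skip the costly processing, and overhead stays minimal even when many adversarial examples are present.
Insight: The novelty lies in the two-stage detection mechanism (first assessing image consistency under content-preserving transformations, then checking discrepancies in a text-embedding space) and in the agentic data consolidation strategy, which merges results using both the similarities and the differences among multiple responses. The staged processing lowers computational cost and makes the defense more practical.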
Abstract: Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
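The staged control flow described in the VALD abstract (a cheap consistency check, then a comparison in text-embedding space, then consolidation only when needed) can be sketched as a generic cascade. Everything concrete here is an assumption: the abstract does not say how each stage is computed, so `is_consistent`, `respond`, `embed`, and `consolidate` are hypothetical callables supplied by the caller, and the cosine threshold of 0.9 is arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cascaded_defense(image, is_consistent, respond, transforms, embed,
                     consolidate, threshold=0.9):
    """Return (answer, stage). All callables are hypothetical placeholders."""
    # Stage 1: cheap consistency check under content-preserving transformations.
    if is_consistent(image, transforms):
        return respond(image), "stage 1: clean"
    # Stage 2: compare responses to the original and transformed images
    # in a text-embedding space.
    responses = [respond(image)] + [respond(t(image)) for t in transforms]
    base = embed(responses[0])
    if all(cosine(base, embed(r)) >= threshold for r in responses[1:]):
        return responses[0], "stage 2: clean"
    # Stage 3: only now pay for the expensive consolidation step.
    return consolidate(responses), "stage 3: consolidated"
```

The point of the ordering is that the expensive consolidation call sits behind two cheaper filters, so inputs that pass an early check never reach it.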
[119] HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies cs.CV
Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo
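TL;DR: This paper introduces HOCA-Bench, a benchmark for evaluating the predictive world-modeling ability of video large language models (Video-LLMs), specifically their ability to detect physical anomalies. Taking a Hegelian perspective, the benchmark divides anomalies into ontological and causal anomalies and uses state-of-the-art generative video models to build a test set of 1,439 videos and 3,470 QA pairs. An evaluation of 17 Video-LLMs reveals marked weaknesses on causal reasoning tasks.
Details
Motivation: Video-LLMs have progressed in semantic perception but still fall short on predictive world modeling, which is central to physically grounded intelligence, and they lack an understanding of physical laws.
Result: Across the 17 Video-LLMs evaluated on HOCA-Bench, models do reasonably well on static ontological anomalies (e.g., shape mutations), but performance drops by more than 20% on causal anomalies (e.g., violations of gravity or friction). System-2 "Thinking" modes improve reasoning but do not close the gap.
Insight: The novelty is a benchmark focused on physical anomaly detection, built from a Hegelian perspective (the ontological/causal dichotomy). It exposes a cognitive lag: current models are good at pattern recognition but weak at applying basic physical laws. This offers a new direction for evaluating and advancing world-modeling ability.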
Abstract: Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 “Thinking” modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
[120] ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization cs.CV
Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
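TL;DR: This paper proposes ConceptPrism, a framework that addresses concept entanglement in personalized text-to-image generation through residual token optimization. By comparing images within a set, it automatically separates the shared visual concept from image-specific residuals without manual guidance, achieving a better trade-off between concept fidelity and text alignment.
Details
Motivation: Personalized text-to-image generation suffers from concept entanglement: irrelevant residual information in the reference images is captured, forcing a trade-off between concept fidelity and text alignment. Existing disentanglement methods rely on manual guidance (such as linguistic cues or segmentation masks), which limits their applicability and fails to fully express the target concept.
Result: Extensive experiments show that ConceptPrism effectively resolves concept entanglement and achieves a significantly improved trade-off between fidelity and alignment.
Insight: The novelty is an automatic disentanglement framework that jointly optimizes a target token and image-specific residual tokens, combining a reconstruction loss with a new exclusion loss that makes the residual tokens discard the shared concept, so the target token captures the pure concept without supervision. Separating concepts through within-set image comparison and residual optimization is a novel approach.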
Abstract: Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.
[121] CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning cs.CV | cs.AI | cs.MM
Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan
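TL;DR: This paper proposes Cross-Level Semantic Collaborative Representation (CLCR), a multimodal learning method. Existing methods project all modalities into a single latent space for fusion and overlook the asynchronous, multi-level semantic structure of multimodal data, which causes semantic misalignment and error propagation. CLCR organizes each modality's features into shallow, mid, and deep semantic levels, and uses an Intra-Level Co-Exchange Domain (IntraCED) and an Inter-Level Co-Aggregation Domain (InterCAD) to constrain cross-modal interaction and cross-level information integration, respectively, improving representation quality.
Details
Motivation: Existing multimodal learning methods typically project all modalities into a single latent space for fusion, overlooking the inherently asynchronous, multi-level semantic structure of multimodal data. This causes semantic misalignment and error propagation and degrades representation quality. The paper addresses the problem by explicitly modeling semantic levels and constraining cross-modal interaction.
Result: Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.
Insight: The core novelty is to explicitly organize each modality's features into a three-level semantic hierarchy and to design two modules, IntraCED and InterCAD, which respectively constrain the exchange of shared information across modalities within a level and synchronize and aggregate information across levels. This structured co-representation separates shared from private features, reduces semantic misalignment, and offers a new, more interpretable fusion framework for multimodal learning.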
Abstract: Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality’s features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. Then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.
[122] Satellite-Based Detection of Looted Archaeological Sites Using Machine Learning cs.CV | cs.AI
Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla
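TL;DR: This paper presents a scalable pipeline based on satellite imagery and machine learning for automatically detecting looted archaeological sites. Using PlanetScope monthly mosaics and a labeled dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved), it compares end-to-end CNN classifiers trained on raw RGB patches with traditional machine learning methods trained on handcrafted features and remote-sensing foundation-model embeddings.
Details
Motivation: Looting of archaeological sites is a serious threat to cultural heritage, yet monitoring thousands of remote locations remains operationally very difficult. The paper develops a scalable, satellite-based automated detection method to address this monitoring problem.
Result: ImageNet-pretrained CNNs combined with spatial masking perform best, with an F1 score of 0.926, clearly ahead of the strongest traditional ML setup (SatCLIP-V+RF+Mean, F1 score 0.710). Ablation studies confirm the benefit of ImageNet pretraining and spatial masking.
Insight: The contribution is a satellite imagery dataset and evaluation pipeline dedicated to looting detection, together with a systematic comparison of machine learning paradigms. A key finding is that looting signatures are extremely localized, so that embeddings from recent geospatial foundation models perform on par with handcrafted features, while ImageNet pretraining (despite domain shift) and spatial context are critical to CNN performance.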
Abstract: Looting at archaeological sites poses a severe risk to cultural heritage, yet monitoring thousands of remote locations remains operationally difficult. We present a scalable and satellite-based pipeline to detect looted archaeological sites, using PlanetScope monthly mosaics (4.7m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) with multi-year imagery (2016–2023) and site-footprint masks. We compare (i) end-to-end CNN classifiers trained on raw RGB patches and (ii) traditional machine learning (ML) trained on handcrafted spectral/texture features and embeddings from recent remote-sensing foundation models. Results indicate that ImageNet-pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly surpassing the strongest traditional ML setup, which attains an F1 score of 0.710 using SatCLIP-V+RF+Mean, i.e., location and vision embeddings fed into a Random Forest with mean-based temporal aggregation. Ablation studies demonstrate that ImageNet pretraining (even in the presence of domain shift) and spatial masking enhance performance. In contrast, geospatial foundation model embeddings perform competitively with handcrafted features, suggesting that looting signatures are extremely localized. The repository is available at https://github.com/microsoft/looted_site_detection.
[123] Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness cs.CV
Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li
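TL;DR: This paper proposes a plug-and-play module that substantially improves the reasoning of vision-language models (VLMs) about rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. The method learns multimodal class embeddings from vision foundation models and synonym-augmented text descriptions, improves fine-grained object details through a lightweight attention-based enhancement module, and uses the learned embeddings as object-aware detectors to generate hints that are injected into the text prompt.
Details
Motivation: VLMs have achieved remarkable success in broad visual understanding, but object-centric reasoning about rare objects remains difficult, mainly because such instances are scarce in pretraining data. Existing methods mitigate this by retrieving additional data or introducing stronger vision encoders, but they are computationally expensive when finetuning the VLM and do not fully exploit the original training data.
Result: Experiments on two benchmarks show that the method brings consistent and substantial gains to pretrained VLMs on rare object recognition and reasoning.
Insight: The novelty is an efficient plug-and-play module that requires no VLM finetuning: multimodal class embeddings compensate for limited training examples, an attention mechanism enhances visual-token detail, and object-aware hints injected into the text prompt guide the model's attention, which together improve reasoning about rare objects.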
Abstract: Vision-language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs’ reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM’s attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM’s ability to focus on and reason about rare objects.
[124] PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring cs.CV | cs.AI | cs.HC
Injun Baek, Yearim Kim, Nojun Kwak
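TL;DR: This paper presents PedaCo-Gen, a human-AI collaborative system for generating instructional videos, grounded in Mayer's Cognitive Theory of Multimedia Learning (CTML). It introduces an intermediate representation phase in which educators interactively review and refine video blueprints (scripts and visual descriptions) with an AI reviewer, improving both the quality of instructional videos and authoring efficiency.
Details
Motivation: Current text-to-video generative models usually prioritize visual fidelity over instructional effectiveness, which limits their use in educational content creation. By combining pedagogical theory with generative AI, this study aims to help educators reclaim pedagogical agency in video authoring and improve the instructional value of the resulting videos.
Result: A study with 23 education experts shows that, compared with baselines, PedaCo-Gen significantly improves video quality across topics and CTML principles. Participants reported high production efficiency (M=4.26) and high guide validity (M=4.04).
Insight: The main novelty is the systematic integration of pedagogical theory (CTML) into the generation pipeline, with an interactive intermediate representation phase serving as a scaffold for human-AI collaboration. This offers a paradigm for future AI authoring tools: principled co-creation that harmonizes generative power with human professional expertise, rather than merely issuing instructions.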
Abstract: While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer’s Cognitive Theory of Multimedia Learning (CTML). Moving away from traditional “one-shot” generation, PedaCo-Gen introduces an Intermediate Representation (IR) phase, enabling educators to interactively review and refine video blueprints (comprising scripts and visual descriptions) with an AI reviewer. Our study with 23 education experts demonstrates that PedaCo-Gen significantly enhances video quality across various topics and CTML principles compared to baselines. Participants perceived the AI-driven guidance not merely as a set of instructions but as a metacognitive scaffold that augmented their instructional design expertise, reporting high production efficiency (M=4.26) and guide validity (M=4.04). These findings highlight the importance of reclaiming pedagogical agency through principled co-creation, providing a foundation for future AI authoring tools that harmonize generative power with human professional expertise.
[125] TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures cs.CV | cs.AI
Hyeongjin Nam, Daniel Sungho Jung, Kyoung Mu Lee
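TL;DR: TeHOR is a framework for jointly reconstructing a 3D human and object from a single image. It uses text descriptions and appearance cues to improve the semantic alignment and visual plausibility of the reconstruction, and is particularly suited to non-contact human-object interactions.
Details
Motivation: Existing methods rely heavily on physical contact information and cannot handle non-contact interactions (such as gazing at or pointing toward an object). Their reconstruction is also driven mainly by local geometric proximity, neglecting the appearance information that provides global context.
Result: The framework produces accurate and semantically coherent reconstructions and achieves state-of-the-art performance on the relevant tasks.
Insight: The novelty is to introduce text descriptions as a semantic constraint in 3D reconstruction, which covers a wider range of interaction types, and to incorporate human and object appearance cues to capture holistic context, improving the visual plausibility and semantic coherence of the reconstruction.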
Abstract: Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.
[126] Universal Pose Pretraining for Generalizable Vision-Language-Action Policies cs.CV | cs.LG | cs.RO
Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling
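TL;DR: This paper proposes Pose-VLA, a decoupled training paradigm for vision-language-action (VLA) models. It targets the feature collapse and low training efficiency that arise when existing VLA models entangle high-level perception with sparse, embodiment-specific action supervision. Training is split into two phases: pre-training to learn universal 3D spatial priors in a unified camera-centric space, and post-training for efficient embodiment alignment in the robot-specific action space. With discrete pose tokens as a universal representation, Pose-VLA integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robot demonstrations.
Details
Motivation: Existing VLA models are usually built on VLM backbones optimized for Visual Question Answering (VQA). They excel at semantic identification but often overlook the subtle 3D state variations that determine distinct action patterns, which leads to poor feature alignment and low training efficiency. The paper aims to resolve this misalignment and obtain more general and efficient embodied policies.
Result: Pose-VLA reaches a 79.5% average success rate on RoboTwin 2.0, a state-of-the-art result, and a competitive 96.0% on LIBERO. Real-world experiments further validate its generalization: with only 100 demonstrations per task, it generalizes robustly across diverse objects.
Insight: The core novelty is the decoupled training paradigm, which separates learning universal 3D spatial priors from alignment to a specific robot's actions, with discrete pose tokens serving as a unified representation across datasets and tasks. This effectively integrates geometry-level trajectory supervision and improves training efficiency and spatial awareness.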
Abstract: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
[127] Pixels Don’t Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision cs.CV
Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall
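TL;DR: This paper proposes DeepfakeJudge, a framework that addresses the problem that natural-language explanations generated by deepfake detection models are often not grounded in visual evidence and are therefore unreliable. It combines an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that assess reasoning rationales without ground-truth rationales, enabling scalable reasoning supervision and evaluation.
Details
Motivation: Existing deepfake detection models can generate natural-language explanations, but their reasoning often lacks grounding in visual evidence, which limits reliability. Existing evaluations also measure only classification accuracy and overlook reasoning fidelity.
Result: On the proposed meta-evaluation benchmark, the reasoning-bootstrapped model reaches 96.2% accuracy, outperforming baselines 30x larger. The reasoning judge correlates very highly with human ratings and attains 98.9% pairwise agreement on the human-annotated meta-evaluation subset. In the user study, participants preferred the reasoning generated by the framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, over that of other models and datasets.
Insight: The novelty is to establish reasoning fidelity as a quantifiable dimension of deepfake detection and, through a bootstrapped generator-evaluator optimization process, to scale human feedback into structured reasoning supervision that supports both pointwise and pairwise evaluation, yielding scalable supervision without ground-truth rationales.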
Abstract: Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground-truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasonings generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared to those produced by other models and datasets. All of our datasets, models, and codebase are open-sourced at https://github.com/KjAeRsTuIsK/DeepfakeJudge.
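For readers unfamiliar with pairwise agreement as a judge metric, the sketch below shows one common way to compute it: for each pair of candidate rationales, check whether the judge's higher-scored item matches the human preference. This is a generic illustration, not the paper's evaluation code. The input format and the rule that judge ties count as disagreement are assumptions.

```python
def pairwise_agreement(judge_scores, human_choices):
    """Share of pairs where the judge's preferred item matches the human choice.

    judge_scores: list of (score_a, score_b) tuples, one per pair (assumed format).
    human_choices: list of "a" or "b", the human-preferred item per pair.
    """
    if len(judge_scores) != len(human_choices):
        raise ValueError("need one human choice per scored pair")
    if not judge_scores:
        raise ValueError("agreement is undefined for an empty set of pairs")
    agree = 0
    for (score_a, score_b), choice in zip(judge_scores, human_choices):
        if score_a == score_b:
            continue  # assumption: a judge tie counts as disagreement
        judge_choice = "a" if score_a > score_b else "b"
        agree += judge_choice == choice
    return agree / len(judge_scores)
```

A pointwise judge can be evaluated the same way by scoring each rationale independently and then comparing the two scores within each pair.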
[128] Towards Personalized Multi-Modal MRI Synthesis across Heterogeneous Datasets cs.CV
Yue Zhang, Zhizheng Zhuo, Siyao Xu, Shan Lv, Zhaoxi Liu
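TL;DR: This paper proposes PMM-Synth, a personalized multi-modal MRI synthesis framework that addresses the limited cross-dataset generalization of existing unified synthesis models. It is jointly trained on multiple datasets that differ in modality coverage, disease types, and intensity distributions, and introduces three core innovations that enable flexible, high-quality synthesis of missing modalities across datasets.
Details
Motivation: Existing unified multi-modal MRI synthesis models are usually trained and evaluated on a single dataset, which limits their generalization across clinical datasets and hinders practical deployment. The paper develops a personalized MRI synthesis framework that generalizes to heterogeneous datasets and supports a variety of synthesis tasks.
Result: Evaluated on four clinical multi-modal MRI datasets, PMM-Synth consistently outperforms state-of-the-art methods on both one-to-one and many-to-one synthesis tasks, with higher PSNR and SSIM scores. Qualitative results show better preservation of anatomical structures and pathological details. Downstream tumor segmentation and radiological reporting studies further suggest its potential to support reliable diagnosis in real-world missing-modality scenarios.
Insight: The core novelty is a personalized framework trained jointly across datasets, with three key components: a Personalized Feature Modulation module that dynamically adapts feature representations based on a dataset identifier, a Modality-Consistent Batch Scheduler that enables stable and efficient batch training under inconsistent modality conditions, and a selective supervision loss that ensures effective learning when ground-truth modalities are partially missing. These designs mitigate distribution shift and improve generalization and practicality.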
Abstract: Synthesizing missing modalities in multi-modal magnetic resonance imaging (MRI) is vital for ensuring diagnostic completeness, particularly when full acquisitions are infeasible due to time constraints, motion artifacts, and patient tolerance. Recent unified synthesis models have enabled flexible synthesis tasks by accommodating various input-output configurations. However, their training and evaluation are typically restricted to a single dataset, limiting their generalizability across diverse clinical datasets and impeding practical deployment. To address this limitation, we propose PMM-Synth, a personalized MRI synthesis framework that not only supports various synthesis tasks but also generalizes effectively across heterogeneous datasets. PMM-Synth is jointly trained on multiple multi-modal MRI datasets that differ in modality coverage, disease types, and intensity distributions. It achieves cross-dataset generalization through three core innovations: a Personalized Feature Modulation module that dynamically adapts feature representations based on dataset identifier to mitigate the impact of distributional shifts; a Modality-Consistent Batch Scheduler that facilitates stable and efficient batch training under inconsistent modality conditions; and a selective supervision loss to ensure effective learning when ground truth modalities are partially missing. Evaluated on four clinical multi-modal MRI datasets, PMM-Synth consistently outperforms state-of-the-art methods in both one-to-one and many-to-one synthesis tasks, achieving superior PSNR and SSIM scores. Qualitative results further demonstrate improved preservation of anatomical structures and pathological details. Additionally, downstream tumor segmentation and radiological reporting studies suggest that PMM-Synth holds potential for supporting reliable diagnosis under real-world modality-missing scenarios.
[129] VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments cs.CV
Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao
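TL;DR: This paper proposes VGGT-MPR, a multimodal place recognition framework for autonomous driving. It adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. Geometrically rich visual features are extracted through depth-aware and point-map supervision, and predicted depth maps are used to densify sparse LiDAR point clouds and improve structural representation, which increases the discriminative power of the fused multimodal features. A training-free re-ranking mechanism then uses VGGT's cross-view keypoint tracking to refine the retrieval results.
Details
Motivation: Existing multimodal place recognition methods rely mainly on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. The paper addresses this with a more efficient and more robust framework.
Result: Extensive experiments on large-scale autonomous driving benchmarks and self-collected data show that VGGT-MPR achieves state-of-the-art performance and is highly robust to severe environmental changes, viewpoint shifts, and occlusions.
Insight: The novelty is to use VGGT as a unified geometric engine, combining depth-aware supervision with point cloud densification to improve feature representation, and to design a training-free re-ranking mechanism that refines results via keypoint tracking, reducing computation and training cost.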
Abstract: In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT’s cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
[130] Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis cs.CV
Junhyeok Choi, Sangwoo Mo, Minwoo Chae
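TL;DR: This paper proposes a training-free, prototype-guided data synthesis method for multimodal dataset distillation, addressing the reliance of existing methods on large-scale training, their architecture dependence, and their limited cross-architecture generalization. The method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and synthesizes images with an unCLIP decoder, enabling efficient and scalable multimodal dataset distillation.
Details
Motivation: Multimodal learning depends on large-scale image-text datasets, which makes training costly and inefficient. Existing dataset distillation methods require full-dataset training and joint optimization, which limits cross-architecture generalization.
Result: In extensive experiments, the method consistently outperforms optimization-based dataset distillation and subset selection methods and achieves state-of-the-art cross-architecture generalization.
Insight: The novelty is a learning-free framework that synthesizes data from CLIP embeddings with an unCLIP decoder, avoiding training and optimization and improving scalability and generalization. The approach simplifies the multimodal distillation pipeline, reduces compute requirements, and has practical potential.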
Abstract: Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
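The prototype step can be pictured with a small clustering sketch. The abstract only says that the method "obtains prototypes" from CLIP embeddings, so using k-means here is an assumption made for illustration, as are the function name and the deterministic first-k initialization. Toy 2-D vectors stand in for CLIP embeddings, and the unCLIP decoding step is not shown.

```python
def kmeans_prototypes(embeddings, k, iters=20):
    """Plain Lloyd's k-means; returns k centroid vectors as prototypes.

    Assumed stand-in for the prototype step, initialized with the first k points.
    """
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    centers = [list(e) for e in embeddings[:k]]
    for _ in range(iters):
        # Assignment step: attach each embedding to its nearest center.
        clusters = [[] for _ in range(k)]
        for e in embeddings:
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(e, centers[c])),
            )
            clusters[nearest].append(e)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers
```

In a pipeline of the kind the abstract describes, each prototype would then be passed to a decoder to synthesize one distilled image, which is why no gradient-based optimization over pixels is needed.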
[131] One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image cs.CV
Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang
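TL;DR: One2Scene is a framework for generating an explorable 3D scene from a single image. It decomposes the problem into three sub-tasks: generating panoramic anchor views, lifting them into an explicit 3D geometric scaffold with a generalizable feed-forward Gaussian Splatting network, and generating photorealistic novel views at arbitrary viewpoints conditioned on that scaffold, enabling immersive scene exploration.
Details
Motivation: Existing single-image 3D scene generation methods produce severe geometric distortions and noisy artifacts during free exploration, especially when the viewpoint moves far from the original one. The paper addresses this problem.
Result: One2Scene substantially outperforms existing state-of-the-art methods on several tasks, including panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation.
Insight: The paper reformulates single-image reconstruction as multi-view stereo matching, leveraging robust geometric priors learned from large-scale multi-view datasets. A bidirectional feature fusion module enforces cross-view consistency and yields an efficient, geometrically reliable 3D scaffold. Explicitly conditioning on the 3D-consistent scaffold keeps the method stable under large camera motions.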
Abstract: Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to leverage robust geometric priors learned from large-scale multi-view datasets. A bidirectional feature fusion module is used to enforce cross-view consistency, yielding an efficient and geometrically reliable scaffold. Finally, the scaffold serves as a strong prior for a novel view generator to produce photorealistic and geometrically accurate views at arbitrary cameras. By explicitly conditioning on a 3D-consistent scaffold to perform reconstruction, One2Scene works stably under large camera motions, supporting immersive scene exploration. Extensive experiments show that One2Scene substantially outperforms state-of-the-art methods in panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation. Code and models will be released.
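[132] TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding cs.CV
Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li
TL;DR: TraceVision is a trajectory-aware vision-language model that strengthens spatial understanding by integrating human visual attention trajectories. It performs trajectory-guided description generation and region localization end to end, and extends to trajectory-guided segmentation and video scene understanding.
Details
Motivation: Existing large vision-language models focus mainly on global image understanding. They struggle to simulate human visual attention trajectories and to explain how descriptions relate to specific regions, so a model that integrates trajectory information is needed to improve spatial understanding and interpretability.
Result: Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation show that TraceVision achieves state-of-the-art performance.
Insight: The contributions are the Trajectory-aware Visual Perception module, geometric simplification for extracting semantic keypoints, a three-stage training pipeline, and the newly constructed RILN dataset. Together they strengthen logical reasoning and interpretability and lay a foundation for intuitive spatial interaction.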
Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
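The abstract mentions geometric simplification of raw trajectories into keypoints but does not name the algorithm. As a minimal sketch, assuming a Ramer-Douglas-Peucker style simplification stands in for that step, the idea can be illustrated as follows. The function names and the `epsilon` tolerance are hypothetical and not taken from the paper.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all 2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify_trajectory(points, epsilon):
    """Ramer-Douglas-Peucker: keep only points that deviate more than
    epsilon from the chord between the current endpoints."""
    if len(points) < 3:
        return list(points)
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [points[0], points[-1]]
    left = simplify_trajectory(points[:index + 1], epsilon)
    right = simplify_trajectory(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point

# Hypothetical mouse trace: a nearly flat stroke followed by a diagonal one.
trace = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 4)]
keypoints = simplify_trajectory(trace, epsilon=0.2)
```

A larger `epsilon` keeps fewer keypoints. How the actual model selects keypoints and attaches semantics to them is not specified in the abstract.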
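[133] Open-vocabulary 3D scene perception in industrial environments cs.CV
Keno Moenck, Adrian Philip Florea, Julian Koch, Thorsten Schüppstuhl
TL;DR: This paper proposes a training-free method for open-vocabulary 3D perception of industrial scenes. It generates masks by merging superpoints according to their semantic features and uses the domain-adapted vision-language foundation model IndustrialCLIP for open-vocabulary querying, addressing the poor generalization of existing methods to industrial objects.
Details
Motivation: Existing open-vocabulary methods built on 2D vision-language foundation models rely on class-agnostic segmentation models pre-trained on non-industrial datasets, which generalize poorly to common industrial objects. An open-vocabulary 3D perception approach suited to industrial environments is therefore needed.
Result: Qualitative results show that the method successfully segments industrial objects in a representative 3D industrial workshop scene.
Insight: The novelty lies in generating instance masks through training-free superpoint merging, which sidesteps the generalization bottleneck of pre-trained segmentation models, combined with a domain-adapted VLFM that improves open-vocabulary querying in industrial scenes.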
Abstract: Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial objects. Therefore, we propose a training-free, open-vocabulary 3D perception pipeline that overcomes this limitation. Instead of using a pre-trained model to generate instance proposals, our method simply generates masks by merging pre-computed superpoints based on their semantic features. We then evaluate the domain-adapted VLFM “IndustrialCLIP” on a representative 3D industrial workshop scene for open-vocabulary querying. Our qualitative results demonstrate successful segmentation of industrial objects.
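Merging superpoints by semantic similarity can be sketched with a union-find structure. This is an illustrative assumption about how such merging might work, not the paper's pipeline: the cosine-similarity criterion, the `threshold` value, and the adjacency list are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_superpoints(features, adjacency, threshold=0.9):
    """Merge adjacent superpoints whose features are similar enough.

    features:  one feature vector per superpoint
    adjacency: pairs (i, j) of spatially neighbouring superpoints
    returns:   a mask label per superpoint (smallest member index)
    """
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in adjacency:
        if cosine(features[i], features[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(features))]

# Four toy superpoints: the first two and the last two share semantics.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
adjacency = [(0, 1), (1, 2), (2, 3)]
labels = merge_superpoints(features, adjacency)
```

Each distinct label corresponds to one merged mask, which could then be queried with text embeddings from a vision-language model.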
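[134] TextShield-R1: Reinforced Reasoning for Tampered Text Detection cs.CV
Chenfan Qu, Yiwu Zhong, Jian Liu, Xuekang Zhu, Bohan Yu
TL;DR: This paper presents TextShield-R1, a reinforcement-learning-based multimodal large language model (MLLM) solution for tampered text detection and reasoning. Using Forensic Continual Pre-training, Group Relative Policy Optimization, and OCR Rectification, it targets the weaknesses of existing MLLMs in detecting micro-level artifacts, localizing tampered text regions, and relying on expensive annotations. The authors also build the Text Forensics Reasoning (TFR) benchmark, which covers multiple languages and tampering techniques, to support comprehensive evaluation.
Details
Motivation: Tampered images are increasingly common and pose security threats. Existing MLLMs struggle to detect micro-level artifacts, localize tampered text regions with low accuracy, and depend heavily on expensive annotations for forgery interpretation, so a more reliable and interpretable detection method is needed.
Result: Extensive experiments on the newly built TFR benchmark (over 45k real and tampered images covering 16 languages, 10 tampering techniques, and diverse domains) show that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
Insight: The contributions are: 1) Forensic Continual Pre-training, an easy-to-hard curriculum that uses large-scale, cheap data from natural image forensics and OCR tasks; 2) Group Relative Policy Optimization with novel reward functions during fine-tuning, which reduces annotation dependency and improves reasoning; 3) OCR Rectification at inference time, which uses the MLLM's strong text recognition ability to refine localization predictions; 4) the comprehensive TFR benchmark, which addresses seven major limitations of existing benchmarks and supports cross-style, cross-method, and cross-language evaluation.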
Abstract: The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM’s strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
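The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses in its group rather than against a learned value function. A minimal sketch of that normalization step follows. The reward values are made up, and the paper's actual reward functions are not described in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population standard deviation here is an assumption;
    implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to the same tampered image,
# e.g. 1.0 when the detection verdict is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no policy gradient signal.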
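[135] M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting cs.CV
Penghui Niu, Taotao Cai, Suqi Zhang, Junhua Gu, Ping Zhang
TL;DR: This paper proposes M3S-Net, a multimodal feature fusion network for ultra-short-term photovoltaic (PV) power forecasting. A multi-scale partial channel selection network extracts the boundary features of thin clouds, an FFT-based multi-scale sequence-to-image analysis network resolves the complex periodicity of meteorological data, and a cross-modal Mamba interaction module with a dynamic C-matrix swapping mechanism achieves deep structural coupling between the visual and temporal modalities.
Details
Motivation: The inherent intermittency and high-frequency variability of PV generation, especially during rapid cloud advection, threaten the stability of high-penetration PV grids. Existing multimodal forecasting methods are limited by shallow feature concatenation and coarse binary cloud segmentation, so they fail to capture fine-grained optical cloud features and the complex spatiotemporal coupling between visual and meteorological modalities.
Result: On a newly constructed fine-grained PV power dataset, M3S-Net reduces mean absolute error by 6.2% in 10-minute forecasts compared with state-of-the-art baselines.
Insight: The contributions are: 1) a multi-scale partial channel selection network that uses partial convolutions to explicitly isolate the boundary features of optically thin clouds, going beyond the precision limits of coarse binary masks; 2) an FFT-based multi-scale sequence-to-image analysis network that disentangles the periodicity of meteorological data across time horizons; 3) a cross-modal Mamba interaction module with dynamic C-matrix swapping, which exchanges state-space parameters to couple the modalities deeply while keeping linear computational complexity.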
Abstract: The inherent intermittency and high-frequency variability of solar irradiance, particularly during rapid cloud advection, present significant stability challenges to high-penetration photovoltaic grids. Although multimodal forecasting has emerged as a viable mitigation strategy, existing architectures predominantly rely on shallow feature concatenation and binary cloud segmentation, thereby failing to capture the fine-grained optical features of clouds and the complex spatiotemporal coupling between visual and meteorological modalities. To bridge this gap, this paper proposes M3S-Net, a novel multimodal feature fusion network based on multi-scale data for ultra-short-term PV power forecasting. First, a multi-scale partial channel selection network leverages partial convolutions to explicitly isolate the boundary features of optically thin clouds, effectively transcending the precision limitations of coarse-grained binary masking. Second, a multi-scale sequence to image analysis network employs Fast Fourier Transform (FFT)-based time-frequency representation to disentangle the complex periodicity of meteorological data across varying time horizons. Crucially, the model incorporates a cross-modal Mamba interaction module featuring a novel dynamic C-matrix swapping mechanism. By exchanging state-space parameters between visual and temporal streams, this design conditions the state evolution of one modality on the context of the other, enabling deep structural coupling with linear computational complexity, thus overcoming the limitations of shallow concatenation. Experimental validation on the newly constructed fine-grained PV power dataset demonstrates that M3S-Net achieves a mean absolute error reduction of 6.2% in 10-minute forecasts compared to state-of-the-art baselines. The dataset and source code will be available at https://github.com/she1110/FGPD.
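The abstract describes an FFT-based sequence-to-image analysis without giving details. One common way to realize such a step, assumed here purely for illustration, is to find the dominant period of a series from its spectrum and fold the series into a 2D array with one period per row. The sketch uses a naive discrete Fourier transform from the standard library instead of an FFT routine, and both function names are hypothetical.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-DC frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        # Naive DFT coefficient at frequency index k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None

def fold_to_image(series, period):
    """Reshape a 1D series into rows of one period each (tail is dropped)."""
    period = int(round(period))
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]

# Synthetic signal with a 12-sample cycle, standing in for a weather variable.
signal = [math.sin(2 * math.pi * t / 12) for t in range(48)]
period = dominant_period(signal)
image = fold_to_image(signal, period)
```

Repeating this for the top few frequencies would give several 2D views of the same series at different scales, which is one possible reading of "multi-scale".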
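[136] Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation cs.CV
Filip Wolf, Blaž Rolih, Luka Čehovin Zajc
TL;DR: This paper proposes a dual-teacher contrastive distillation framework for multispectral remote sensing imagery, addressing efficient knowledge transfer across sensors and modalities in Earth Observation. It combines a multispectral teacher with an optical vision foundation model teacher and aligns the student's pretraining objective with the contrastive self-distillation paradigm, enabling coherent cross-modal representation learning.
Details
Motivation: Earth Observation involves many sensors and modalities, so a single universal foundation model is unrealistic and multiple specialized models will coexist. Efficient cross-modal knowledge transfer is therefore essential. Most existing pretraining relies on masked image modeling, which emphasizes local reconstruction but offers limited control over global semantic structure.
Result: Experiments on diverse optical and multispectral benchmarks show that the model adapts to multispectral data without hurting performance on optical-only inputs and achieves state-of-the-art results in both settings, with average improvements of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification.
Insight: The novelty is the dual-teacher contrastive distillation framework, which brings the contrastive self-distillation paradigm of modern optical vision foundation models to multispectral remote sensing pretraining and aligns representations across modalities. It offers a principled and efficient approach to scalable representation learning across heterogeneous Earth Observation data sources.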
Abstract: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student’s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.
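[137] ApET: Approximation-Error Guided Token Compression for Efficient VLMs cs.CV
Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song
TL;DR: This paper proposes ApET, an approximation-error guided token compression framework for vision-language models (VLMs). Taking an information-theoretic view, it reconstructs the original visual tokens by linear approximation and uses the approximation error to identify and drop the least informative tokens, compressing visual tokens and reducing compute. Because it does not rely on attention, it integrates with efficient attention kernels such as FlashAttention for further inference speedups.
Details
Motivation: Redundant visual tokens in VLMs cause high computational overhead and degrade inference efficiency. Prior work usually relies on attention to identify redundant tokens, which introduces positional bias and is incompatible with efficient attention kernels such as FlashAttention, limiting practical deployment. This paper drops the dependence on attention and revisits visual token compression from an information-theoretic perspective, aiming to preserve as much visual information as possible.
Result: Extensive experiments across multiple VLMs and benchmarks show that ApET retains 95.2% of the original performance on image understanding tasks and reaches 100.4% on video understanding tasks, while compressing the token budget by 88.9% and 87.5%, respectively.
Insight: The novelty is an information-theoretic approach to visual token compression that uses linear approximation and approximation-error guided token dropping and avoids attention entirely. This lets the method integrate with FlashAttention and makes VLM acceleration more practical to deploy.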
Abstract: Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
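The idea of scoring tokens by how well a few basis tokens reconstruct them can be sketched in plain Python. This is a toy illustration under stated assumptions, not the released ApET code: the basis indices are given by hand, the error is the least-squares residual norm (computed by projecting onto an orthonormalized basis), and tokens with the smallest residual are treated as least informative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormalization; near-dependent vectors are skipped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def approximation_errors(tokens, basis_indices):
    """Residual norm of each token after projection onto the basis span."""
    basis = orthonormal_basis([tokens[i] for i in basis_indices])
    errors = []
    for t in tokens:
        residual = list(t)
        for b in basis:
            c = dot(residual, b)
            residual = [ri - c * bi for ri, bi in zip(residual, b)]
        errors.append(math.sqrt(dot(residual, residual)))
    return errors

def compress_tokens(tokens, basis_indices, keep):
    """Keep the basis tokens plus the non-basis tokens with largest error."""
    errors = approximation_errors(tokens, basis_indices)
    basis_set = set(basis_indices)
    others = [i for i in range(len(tokens)) if i not in basis_set]
    ranked = sorted(others, key=lambda i: errors[i], reverse=True)
    extra = ranked[:max(0, keep - len(basis_set))]
    return sorted(list(basis_set) + extra)

tokens = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [0.5, 0.5, 0.1]]
errors = approximation_errors(tokens, basis_indices=[0, 1])
kept = compress_tokens(tokens, basis_indices=[0, 1], keep=3)
```

In this toy case the token `[1, 1, 0]` is a linear combination of the basis and is dropped first, while `[0, 0, 1]` carries information the basis cannot express and is kept. Nothing here touches attention scores, which is the property that keeps this style of pruning compatible with fused attention kernels.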
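[138] BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations cs.CV
Lucas Martini, Alexander Lappe, Anna Bognár, Rufin Vogels, Martin A. Giese
TL;DR: This paper introduces BigMaQ, a large-scale 3D motion and animation dataset of macaques with detailed 3D pose descriptions for more than 750 interaction scenes. By building subject-specific textured avatars, it provides more accurate pose descriptions than existing methods. The derived BigMaQ500 action recognition benchmark shows that adding pose information substantially improves mean average precision (mAP).
Details
Motivation: Animal behavior recognition currently relies mainly on video and sparse keypoints and lacks accurate reconstruction of 3D pose and shape. For non-human primates in particular, mesh-based tracking lags behind, which limits how fully action dynamics can be captured.
Result: On the BigMaQ500 action recognition benchmark, combining the proposed pose descriptors with features from existing image and video encoders substantially improves mAP over using no pose information.
Insight: The work is the first to integrate dynamic 3D pose-shape representations into the learning task of animal action recognition. It provides more accurate pose descriptions through individualized textured avatars and creates the first recognition benchmark linking surface-based pose vectors to single frames across multiple monkeys, offering a rich resource for studying visual appearance, posture, and social interaction in non-human primates.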
Abstract: The recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that are unable to fully capture the richness of action dynamics. To address this gap, we introduce the Big MacaQue 3D Motion and Animation Dataset (BigMaQ), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys. This allows us to provide pose descriptions that are more accurate than previous state-of-the-art surface-based animal tracking methods. From the original dataset, we derive BigMaQ500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By pairing features extracted from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, BigMaQ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates. The code and data are publicly available at https://martinivis.github.io/BigMaQ/ .
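[139] Monocular Mesh Recovery and Body Measurement of Female Saanen Goats cs.CV
Bo Jin, Shichao Zhao, Jin Lyu, Bin Zhang, Tao Yu
TL;DR: Motivated by the link between body size and milk yield in Saanen dairy goats, this paper proposes 3D reconstruction and automated body measurement from monocular RGBD input. The authors build the FemaleSaanenGoat dataset of multi-view RGBD videos of 55 female Saanen goats, fuse the point cloud sequences into high-fidelity 3D scans with multi-view DynamicFusion, and develop the parametric 3D shape model SaanenGoat from them. The model has refined skeletal joints and udder representation, supports high-precision 3D reconstruction from a single view, and automatically measures six key body dimensions: body length, height, chest width, chest girth, hip width, and hip height.
Details
Motivation: The milk yield of Saanen dairy goats is closely related to body size, but existing 3D reconstruction methods lack authentic goat-specific 3D data and cannot accurately assess milk production potential.
Result: Experiments show that the method is more accurate in both 3D reconstruction and body measurement, offering a new paradigm for large-scale 3D vision applications in precision livestock farming.
Insight: The work builds the first multi-view RGBD dataset of female Saanen goats and a dedicated parametric 3D shape model, SaanenGoat, with a refined template (41 skeletal joints) and enhanced udder representation. This enables an end-to-end path from monocular RGBD input to high-precision 3D reconstruction and automated body measurement.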
Abstract: The lactation performance of Saanen dairy goats, renowned for their high milk yield, is intrinsically linked to their body size, making accurate 3D body measurement essential for assessing milk production potential, yet existing reconstruction methods lack goat-specific authentic 3D data. To address this limitation, we establish the FemaleSaanenGoat dataset containing synchronized eight-view RGBD videos of 55 female Saanen goats (6-18 months). Using multi-view DynamicFusion, we fuse noisy, non-rigid point cloud sequences into high-fidelity 3D scans, overcoming challenges from irregular surfaces and rapid movement. Based on these scans, we develop SaanenGoat, a parametric 3D shape model specifically designed for female Saanen goats. This model features a refined template with 41 skeletal joints and enhanced udder representation, registered with our scan data. A comprehensive shape space constructed from 48 goats enables precise representation of diverse individual variations. With the SaanenGoat model, we obtain high-precision 3D reconstruction from single-view RGBD input and achieve automated measurement of six critical body dimensions: body length, height, chest width, chest girth, hip width, and hip height. Experimental results demonstrate the superior accuracy of our method in both 3D reconstruction and body measurement, presenting a novel paradigm for large-scale 3D vision applications in precision livestock farming.
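[140] ExpPortrait: Expressive Portrait Generation via Personalized Representation cs.CV | cs.GR
Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang
TL;DR: This paper proposes ExpPortrait, a method for generating expressive portrait videos through a personalized representation. It first builds a high-fidelity personalized head representation that effectively disentangles identity and expression, and introduces an expression transfer module that transfers pose and expression details between identities. The representation then serves as the conditioning signal for a diffusion transformer (DiT) based generator that synthesizes richly detailed portrait videos.
Details
Motivation: Existing portrait generation methods that rely on intermediate signals such as 2D landmarks or parametric models have limited disentanglement ability and cannot express personalized details. They therefore struggle to preserve subject identity and expression accurately, which hinders the generation of highly expressive portrait videos.
Result: Extensive experiments on self-reenactment and cross-reenactment tasks show that the method outperforms previous models in identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
Insight: The core contribution is a more effective high-fidelity personalized head representation that captures both static global geometry and dynamic expression details. Used as a strong conditioning signal for a DiT-based generator, it enables high-quality disentanglement and control of identity and expression.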
Abstract: While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
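[141] Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery cs.CV
Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao
TL;DR: This paper proposes SSR^2-GCD, a multi-modal representation learning framework for Generalized Category Discovery (GCD). Through semi-supervised rate reduction, it emphasizes intra-modality relationship alignment to learn cross-modal representations with desirable structural properties, and it integrates prompt candidates using the inter-modal alignment of vision-language models to promote knowledge transfer.
Details
Motivation: Existing GCD methods rely mainly on inter-modality alignment and lack proper intra-modality alignment to produce a desirable structure in the representation distribution, which limits representation quality.
Result: Extensive experiments on generic and fine-grained benchmark datasets show that the method achieves superior, state-of-the-art performance.
Insight: The novelty is using semi-supervised rate reduction to emphasize intra-modality alignment and improve representation structure, together with integrating prompts through the inter-modal alignment of vision-language models to strengthen knowledge transfer.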
Abstract: Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for the GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them apply a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.
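[142] Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation cs.CV
Yilong Yang, Jianxin Tian, Shengchuan Zhang, Liujuan Cao
TL;DR: This paper proposes DSS, a progressive framework for zero-shot camouflaged object segmentation (COS) with three stages: discover, segment, and select. A feature-coherent object discovery module generates diverse candidate regions, SAM produces refined segmentations, and an MLLM-driven semantic mask selection module picks the best result. Without any training, DSS achieves state-of-the-art performance on multiple COS benchmarks.
Details
Motivation: Existing zero-shot COS methods use a two-stage discover-then-segment pipeline, and relying on MLLMs to generate visual prompts often leads to inaccurate localization, false positives, and missed detections. This paper aims to address these problems.
Result: DSS reaches state-of-the-art performance on multiple COS benchmarks, performs especially well in multi-instance scenes, and requires no training or supervision.
Insight: The novelty is extending the conventional two-stage pipeline into a progressive three-stage mechanism, adding a feature-coherent object discovery module that increases the diversity of candidates, and using the semantic understanding of MLLMs for mask selection, which improves the robustness and accuracy of zero-shot segmentation.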
Abstract: Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature-coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state-of-the-art performance on multiple COS benchmarks, especially in multiple-instance scenes.
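[143] When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators cs.CV | cs.AI
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue
TL;DR: This paper studies how reliable modern text-to-image (T2I) diffusion models are as generators of synthetic vision data. The authors generate large-scale synthetic datasets with state-of-the-art T2I models released between 2022 and 2025 and train standard classifiers on this data alone. Although visual fidelity and prompt adherence improve, classification accuracy on real test data consistently declines. The analysis shows that these models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment.
Details
Motivation: The paper revisits the promise of synthetic data as a scalable substitute for real training sets and tests a growing assumption: that progress in generative realism implies progress in data realism.
Result: Classifiers trained on synthetic data from newer T2I models show consistently lower accuracy on real test data, revealing a performance regression.
Insight: The paper exposes a key weakness of modern T2I models as training data generators: in pursuing visual appeal they sacrifice distributional diversity and semantic alignment, a warning for research that depends on synthetic data. The study offers empirical analysis and a critical perspective on assessing the data utility of generative models.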
Abstract: Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.
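[144] RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection cs.CV
Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long
TL;DR: This paper proposes RL-RIG, a reinforcement learning framework for reflection-based image generation that targets the weak spatial reasoning of existing generative models. It follows a Generate-Reflect-Edit paradigm in which four components (Diffuser, Checker, Actor, and Inverse Diffuser) work together to better capture fine-grained spatial relationships in the prompt and generate images with more plausible structure.
Details
Motivation: Existing image generation models struggle with spatial reasoning. They fail to capture fine-grained spatial relationships from the prompt and to generate structurally sound scenes, so the output looks appealing but is structurally unreasonable.
Result: On the LAION-SG dataset, RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in controllable and precise spatial reasoning. Spatial consistency is measured with Scene Graph IoU and a VLM-as-a-Judge strategy.
Insight: The contributions include the reinforcement learning framework, the Generate-Reflect-Edit paradigm, and Reflection-GRPO, which trains the VLM Actor and the Image Editor separately to improve intuition over generation trajectories and image quality. The focus is on spatial accuracy rather than visual appeal alone.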
Abstract: Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
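The abstract names Scene Graph IoU as a metric without defining it. A minimal sketch, assuming scene graphs are represented as sets of (subject, relation, object) triples and that IoU is taken over exact triple matches, could look like this. The paper's actual matching rules may be more lenient.

```python
def scene_graph_iou(predicted, reference):
    """Intersection over union of two sets of relation triples."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(pred & ref) / len(union)

# Hypothetical triples: the prompt asks for the lamp left of the sofa,
# but the generated image places it on the right.
reference = [("cat", "on", "sofa"), ("lamp", "left of", "sofa")]
predicted = [("cat", "on", "sofa"), ("lamp", "right of", "sofa")]
score = scene_graph_iou(predicted, reference)
```

Here one of three distinct triples is shared, so the score is one third. A reflection step could use the unmatched triples as the basis for an edit instruction.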
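[145] Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH) cs.CV | cs.AI
Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Luciana Bueno Dos Reis Fernandes, Alvaro Doria Dos Santos, Ricardo V. Godoy
TL;DR: This paper introduces DAPWH, an image dataset of parasitoid wasps and associated Hymenoptera containing 3,556 high-resolution images, of which 1,739 carry multi-class bounding box annotations in COCO format. It is intended to support computer vision models for groups such as Ichneumonoidea, which are hard to classify accurately because of cryptic morphology and the large number of species.
Details
Motivation: The superfamily Ichneumonoidea (families Ichneumonidae and Braconidae) is a key group for biodiversity monitoring and agricultural management, but cryptic morphology and many undescribed species make taxonomic identification extremely difficult. Reliable digital resources for automated identification are scarce, so the authors build a dedicated dataset to advance computer vision models for these groups.
Result: The abstract reports no quantitative experiments or benchmarks. The COCO-format dataset with multi-class annotations (full insect body, wing venation, and scale bars) provides a foundation for later model development and may improve the accuracy and robustness of identifying these groups.
Insight: The contribution is a curated image dataset for parasitoid wasps and associated Hymenoptera that offers both high-resolution images and multi-class bounding box annotations, which supports training finer-grained computer vision models. The dataset fills a gap in digital resources for this field, and annotating key morphological features such as wing venation is a way to strengthen model identification.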
Abstract: Accurate taxonomic identification is the cornerstone of biodiversity monitoring and agricultural management, particularly for the hyper-diverse superfamily Ichneumonoidea. Comprising the families Ichneumonidae and Braconidae, these parasitoid wasps are ecologically critical for regulating insect populations, yet they remain one of the most taxonomically challenging groups due to their cryptic morphology and vast number of undescribed species. To address the scarcity of robust digital resources for these key groups, we present a curated image dataset designed to advance automated identification systems. The dataset contains 3,556 high-resolution images, primarily focused on Neotropical Ichneumonidae and Braconidae, while also including supplementary families such as Andrenidae, Apidae, Bethylidae, Chrysididae, Colletidae, Halictidae, Megachilidae, Pompilidae, and Vespidae to improve model robustness. Crucially, a subset of 1,739 images is annotated in COCO format, featuring multi-class bounding boxes for the full insect body, wing venation, and scale bars. This resource provides a foundation for developing computer vision models capable of identifying these families.
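For readers unfamiliar with the COCO annotation format, the sketch below shows what a record with the three annotated classes might look like and how to read it. COCO stores each box as `[x, y, width, height]` in pixels. The category names follow the abstract, while the ids, file name, image size, and coordinates are invented for illustration and do not come from the dataset.

```python
# Hypothetical COCO-style record; all concrete values are made up.
annotation_file = {
    "images": [
        {"id": 1, "file_name": "specimen_0001.jpg", "width": 4000, "height": 3000},
    ],
    "categories": [
        {"id": 1, "name": "insect_body"},
        {"id": 2, "name": "wing_venation"},
        {"id": 3, "name": "scale_bar"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [500, 400, 2800, 1900]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [1200, 600, 1500, 700]},
    ],
}

def bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

def boxes_by_category(data):
    """Group corner-format boxes under their category name."""
    names = {c["id"]: c["name"] for c in data["categories"]}
    grouped = {}
    for ann in data["annotations"]:
        name = names[ann["category_id"]]
        grouped.setdefault(name, []).append(bbox_to_corners(ann["bbox"]))
    return grouped

grouped = boxes_by_category(annotation_file)
```

Annotating the scale bar alongside the specimen makes it possible, in principle, to convert pixel measurements into physical units.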
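[146] Closing the gap in multimodal medical representation alignment cs.CV | cs.LG
Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello
TL;DR: This paper studies the modality gap in multimodal learning, particularly in the medical domain, and proposes a modality-agnostic framework that closes the gap to improve semantic alignment between radiology images and clinical text.
Details
Motivation: CLIP-based contrastive losses still suffer from a modality gap in complex multimodal settings such as the medical domain, leading to sparse and fragmented latent spaces that hinder true semantic alignment.
Result: The method improves alignment between radiology images and clinical text and boosts performance on cross-modal retrieval and image captioning.
Insight: The paper proposes a modality-agnostic framework that actively closes the modality gap in medical multimodal representations, so that semantically related representations are more closely aligned regardless of their source modality.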
Abstract: In multimodal learning, CLIP has emerged as the de-facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is present also in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.
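A common way to quantify a modality gap is the distance between the centroids of the normalized image and text embeddings. The sketch below assumes that definition. The paper may measure the gap differently, and the embeddings here are toy values.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def modality_gap(image_embeddings, text_embeddings):
    """Euclidean distance between the centroids of the two modalities."""
    ci = centroid([normalize(v) for v in image_embeddings])
    ct = centroid([normalize(v) for v in text_embeddings])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))

# Toy case: the two modalities occupy orthogonal directions.
gap = modality_gap([[2.0, 0.0], [1.0, 0.0]], [[0.0, 3.0], [0.0, 1.0]])
```

A value near zero would indicate that the two modalities share the same region of the latent space. In this toy case the gap is the square root of two.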
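[147] MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving cs.CV | cs.RO
Junli Wang, Xueyi Liu, Yinan Zheng, Zebing Xing, Pengfei Li
TL;DR: MeanFuser is a trajectory generation method for end-to-end autonomous driving. It introduces Gaussian Mixture Noise (GMN) to obtain a continuous representation of the trajectory space, removing the dependence on discrete anchor vocabularies; it uses the MeanFlow Identity to model the mean velocity field and accelerate inference; and it adds an Adaptive Reconstruction Module (ARM) that flexibly selects or reconstructs trajectories.
Details
Motivation: Existing anchor-guided generative models for trajectory planning must trade off vocabulary size against model performance and depend on discrete anchors covering the test distribution for robustness, which limits efficiency and performance.
Result: On the NAVSIM closed-loop benchmark, MeanFuser achieves strong performance without PDM Score supervision and shows excellent inference efficiency.
Insight: The contributions are: Gaussian Mixture Noise for a continuous trajectory representation in place of discrete anchors; the MeanFlow Identity, which models the mean velocity field instead of the instantaneous one, reducing numerical error from ODE solvers and speeding up inference; and a lightweight Adaptive Reconstruction Module that uses attention weights to implicitly select or reconstruct a trajectory, improving flexibility and robustness.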
Abstract: Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt the “MeanFlow Identity” to end-to-end planning, which models the mean velocity field between GMN and the trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score, as well as exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at https://github.com/wjl2244/MeanFuser.
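Designs (1) and (2) can be illustrated with a toy sketch. It assumes the usual MeanFlow convention in which a sample is obtained in one step as noise minus the mean velocity over the whole interval, and it replaces the learned network with a hand-written stand-in. The mixture parameters, the `toy_mean_velocity` function, and the 2D "trajectory" are all hypothetical.

```python
import random

def sample_gmn(means, std, weights, rng):
    """Draw one noise vector from a Gaussian mixture with shared isotropic std.

    Returns the sample and the index of the chosen component.
    """
    idx = rng.choices(range(len(means)), weights=weights, k=1)[0]
    return [rng.gauss(m, std) for m in means[idx]], idx

def one_step_generate(noise, mean_velocity):
    """One-step sampling: x = z - u(z, r=0, t=1), with u the mean velocity."""
    u = mean_velocity(noise, 0.0, 1.0)
    return [z - ui for z, ui in zip(noise, u)]

# Stand-in for a trained network: for a straight-line path from a fixed
# target to the noise, the mean velocity over [0, 1] is (noise - target).
TARGET = [1.0, 2.0]

def toy_mean_velocity(z, r, t):
    return [zi - ti for zi, ti in zip(z, TARGET)]

rng = random.Random(0)
# Two hypothetical mixture components, e.g. "keep lane" and "turn left".
noise, component = sample_gmn(
    means=[[0.0, 5.0], [-3.0, 3.0]], std=0.5, weights=[0.7, 0.3], rng=rng
)
trajectory = one_step_generate(noise, toy_mean_velocity)
```

Because the mean velocity already integrates the flow over the full interval, no iterative ODE solver is needed, which is where the claimed inference speedup comes from.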
[148] HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images cs.CV | cs.AIPDF
Kundan Thota, Xuanhao Mu, Thorsten Schlachter, Veit Hagenmeyer
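TL;DR: HeatPrompt is a zero-shot vision-language energy modeling framework that estimates annual heat demand from satellite images, basic Geographic Information System (GIS) data, and building-level features. It uses a pretrained large vision-language model to extract visual attributes and a multi-layer perceptron (MLP) regressor to improve prediction accuracy.
Details
Motivation: Most municipalities lack the detailed building-level data needed to compute accurate heat-demand maps. The work aims to support heat planning in data-scarce regions.
Result: Compared with the baseline model, the MLP regressor improves R² by 93.7% and reduces mean absolute error by 30%. Qualitative analysis shows that high-impact tokens align with high-demand zones.
Insight: The contributions are a domain-specific prompt that steers the vision-language model to act as an energy planner when extracting visual attributes, and the combination of satellite imagery with GIS data for zero-shot heat-demand modeling, which offers lightweight support for data-scarce regions.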
Abstract: Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack the detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS) data, and building-level features. We feed pretrained Large Vision Language Models (VLMs) a domain-specific prompt to act as an energy planner and extract from the RGB satellite image the visual attributes that correspond to the thermal load, such as roof age and building density. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an $R^2$ uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.
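The two reported metrics can be computed as in the following minimal sketch. The helper names are hypothetical and not taken from the paper. The abstract does not say whether the 93.7% uplift is relative or absolute, so `relative_change` shows one possible reading.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_change(new, old):
    """Relative change of a metric with respect to a baseline value."""
    return (new - old) / old
```

Under the relative reading, a 30% MAE reduction corresponds to `relative_change(mae_model, mae_baseline) == -0.3`.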
[149] The Invisible Gorilla Effect in Out-of-distribution Detection cs.CV | cs.LGPDF
Harry Anthony, Ziyun Liang, Hermione Warr, Konstantinos Kamnitsas
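TL;DR: This paper reveals a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact is visually similar (e.g. in colour) to the model's region of interest (ROI) and drops when it is not. The authors call this the "Invisible Gorilla Effect".
Details
Motivation: OOD detection performance is known to vary by artefact type, but the underlying causes remain underexplored. The work aims to identify latent biases in OOD detection.
Result: The authors annotated artefact colour in 11,355 images from three public datasets (e.g. ISIC), generated colour-swapped counterfactuals, and evaluated 40 OOD methods on 7 benchmarks. Most methods showed significant performance drops when the artefact differed from the ROI.
Insight: The contribution is the first report of the "Invisible Gorilla Effect" in OOD detection, in which detection performance depends on the visual similarity between artefact and ROI. The finding gives guidance for designing more robust detectors and highlights an overlooked failure mode of current methods.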
Abstract: Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model’s ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: https://github.com/HarryAnthony/Invisible_Gorilla_Effect.
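The AUROC figures quoted above measure how well an OOD score separates in-distribution from OOD samples. As a minimal sketch, AUROC can be computed as the probability that a randomly chosen OOD sample receives a higher score than a randomly chosen in-distribution sample. The function name is hypothetical, and the convention that a higher score means "more OOD" is an assumption.

```python
def auroc(id_scores, ood_scores):
    """AUROC as P(ood score > id score), with ties counted as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))
```

Running the same detector on artefacts that resemble the ROI and on artefacts that do not, then comparing the two AUROC values, is the kind of comparison the abstract reports for red versus black ink.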
[150] Do Large Language Models Understand Data Visualization Principles? cs.CVPDF
Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi
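TL;DR: This paper presents the first systematic evaluation of how well large language models (LLMs) and vision-language models (VLMs) understand and reason about data visualization principles. Using hard verification ground truth derived from Answer Set Programming (ASP), the authors build a dataset of about 2,000 Vega-Lite specifications annotated with explicit principle violations, plus more than 300 real-world charts, and evaluate models on detecting violations and repairing flawed chart specifications.
Details
Motivation: The work asks whether LLMs and VLMs can reason about and enforce visualization principles directly. That would sidestep a limitation of constraint-based systems, which need expert knowledge to translate principles into formal specifications, and would let these models serve as flexible validators and editors of visualization designs.
Result: Frontier models are more effective at correcting violations than at reliably detecting them, and a persistent gap with symbolic solvers remains on the more nuanced aspects of visual perception.
Insight: The contribution is the first systematic evaluation of LLM and VLM understanding of visualization principles, which exposes an asymmetry between detection and correction. Viewed objectively, the work gives an empirical basis for using large models as automated validators of visualization designs and points out current limits in fine-grained reasoning.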
Abstract: Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our results highlight both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. They also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.
[151] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues cs.CV | cs.AIPDF
Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani
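TL;DR: This paper proposes StructXLIP, a fine-tuning paradigm that strengthens vision-language alignment with multimodal structural cues. It extracts edge maps from images as a proxy for visual structure, filters captions to emphasize structural information, and fine-tunes with three structure-centric losses, with the goal of improving cross-modal retrieval on long, detail-rich captions.
Details
Motivation: The work extends the early-vision principle that edges are fundamental cues to vision-language alignment. By isolating and aligning structural cues across modalities, it addresses the difficulty of aligning vision and language when fine-tuning on long, detail-rich captions, with a focus on cross-modal retrieval.
Result: The method outperforms current competitors on cross-modal retrieval in both general and specialized domains, reaching state-of-the-art performance.
Insight: The contribution is a structure-centric fine-tuning paradigm that maximizes the mutual information between multimodal structural representations as an auxiliary objective, which in theory guides the model toward more robust and semantically stable minima. The method can be integrated into future approaches as a plug-and-play boosting recipe.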
Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them “structure-centric”. Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
[152] Benchmarking Unlearning for Vision Transformers cs.CV | cs.AIPDF
Kairan Zhao, Iurie Luca, Peter Triantafillou
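TL;DR: This paper establishes the first machine unlearning (MU) benchmark for Vision Transformers (ViT and Swin-T). It evaluates performance across datasets, unlearning algorithms, and protocols, and analyzes how training-data memorization affects unlearning.
Details
Motivation: Machine unlearning is critical for building safe and fair AI, but existing work focuses mainly on CNNs and no unlearning benchmark exists for Vision Transformers. This paper fills that gap.
Result: Using unified evaluation metrics (forget quality, accuracy on unseen test data, and accuracy on retained data), the study provides a reproducible, fair, and comprehensive basis for comparing existing and future MU algorithms on VTs, and shows for the first time how well existing algorithms perform in VT settings.
Insight: The contribution is the first systematic extension of MU benchmarking to Vision Transformers, with a focus on methods that leverage training-data memorization to improve unlearning, together with an analysis of how VTs and CNNs differ in memorization and how that difference affects unlearning.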
Abstract: Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While benchmarking MU efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization has been recently discovered to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. And, for the first time, it sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.
[153] Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning cs.CVPDF
Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani
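TL;DR: Flow3r is a framework for scalable visual geometry learning that trains on unlabeled monocular videos using dense 2D correspondences (optical flow) as supervision. Its core idea is to factor the flow prediction module into geometry latents and pose latents, which guides the learning of both scene geometry and camera motion and extends naturally to dynamic scenes.
Details
Motivation: Current feed-forward 3D/4D reconstruction systems depend on dense geometry and pose supervision, which is expensive to collect, limited in scale, and particularly scarce for dynamic real-world scenes. The paper aims to use flow from unlabeled monocular videos as supervision for scalable visual geometry learning.
Result: In controlled experiments, factored flow prediction outperforms alternative designs, and performance improves consistently as the amount of unlabeled data grows. After integrating factored flow into existing visual geometry architectures and training on about 800K unlabeled videos, Flow3r achieves state-of-the-art results on eight benchmarks spanning static and dynamic scenes, with the largest gains on in-the-wild dynamic videos where labeled data is scarcest.
Insight: The claimed contribution is factoring flow prediction into geometry latents and pose latents, a design that directly guides joint learning of scene geometry and camera motion. Viewed objectively, it is an effective way to couple a self-supervised signal (flow) with geometric representation learning in a structured manner, and it offers a scalable answer to the scarcity of supervision for dynamic scenes.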
Abstract: Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
[154] A Very Big Video Reasoning Suite cs.CV | cs.AI | cs.LG | cs.MM | cs.ROPDF
Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer
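TL;DR: This paper introduces the VBVR dataset and the VBVR-Bench evaluation framework to address the lack of large-scale training data and systematic evaluation tools for video reasoning. The VBVR dataset contains 200 curated reasoning tasks and over one million video clips, far more than existing datasets. VBVR-Bench provides rule-based, human-aligned scoring for reproducible and interpretable evaluation of video reasoning. Using the suite, the authors run a large-scale scaling study of video reasoning and observe early signs of emergent generalization to unseen tasks.
Details
Motivation: Research on video models has focused mainly on visual quality. Video reasoning (reasoning over spatiotemporal structure such as continuity, interaction, and causality) remains underexplored and lacks both large-scale training data and systematic evaluation methods.
Result: The authors build the VBVR dataset (about three orders of magnitude larger than existing datasets) and the VBVR-Bench framework, use them for a large-scale scaling study, and find early signs of emergent generalization to unseen reasoning tasks.
Insight: The contributions are: (1) VBVR, a large-scale video reasoning dataset organized by a principled taxonomy, which fills a data gap; (2) VBVR-Bench, which uses rule-based, human-aligned scorers instead of model-based judging and thereby improves reproducibility and interpretability; (3) one of the first large-scale scaling studies of video reasoning, which lays a foundation for research on generalizable video reasoning.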
Abstract: Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/.
[155] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device cs.CVPDF
Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar
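TL;DR: Mobile-O is a compact vision-language-diffusion model that brings unified multimodal understanding and generation to mobile devices. Its core module, the Mobile Conditioning Projector, fuses vision-language features with a diffusion generator efficiently using depthwise-separable convolutions and layerwise alignment. The model is trained on only a few million samples and post-trained in a novel quadruplet format, improving both visual understanding and generation. It generates a 512x512 image in about 3 seconds on an iPhone, which the authors present as the first practical framework for real-time unified multimodal intelligence on edge devices.
Details
Motivation: Existing unified multimodal models are typically data-hungry and too heavy to deploy on edge devices. The paper aims to build a compact, efficient model for multimodal understanding and generation that runs in real time on a mobile device.
Result: For generation, Mobile-O reaches 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, it surpasses them by 15.3% and 5.1% on average across seven benchmarks. Generating a 512x512 image takes about 3 seconds on an iPhone.
Insight: The claimed contributions are: (1) the Mobile Conditioning Projector, which uses depthwise-separable convolutions and layerwise alignment for efficient cross-modal conditioning; (2) a novel quadruplet post-training format that jointly improves understanding and generation; (3) the first practical real-time unified multimodal framework that runs entirely on-device with no cloud dependency. Viewed objectively, the combination of an efficient architecture with a lightweight training strategy offers a feasible path to deploying unified models on edge devices.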
Abstract: Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
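To see why depthwise-separable convolutions keep a projector cheap, the weight counts of a standard convolution and a depthwise-separable one can be compared. This is a generic sketch with hypothetical function names. Bias terms are ignored, and the layer sizes are illustrative rather than taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Illustrative sizes (not from the paper): 64 -> 128 channels, 3 x 3 kernel.
standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
reduction = standard / separable
```

For these illustrative sizes the separable layer uses roughly one eighth of the weights of the standard layer.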
cs.AI [Back]
[156] Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications cs.AI | cs.CL | cs.HC | cs.LGPDF
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar
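TL;DR: This paper introduces Hierarchical Reward Design from Language (HRDL), a problem formulation for turning complex human preferences about AI agent behavior into reward functions for hierarchical reinforcement learning (RL), and proposes Language to Hierarchical Rewards (L2HR) as a solution. Experiments show that agents trained with L2HR-designed rewards not only complete tasks effectively but also adhere better to human behavioral specifications.
Details
Motivation: As AI agents take on increasingly complex tasks, keeping their behavior consistent with detailed human specifications (covering not only whether a task is completed but also how) is critical for responsible AI deployment. Existing reward design methods are limited in capturing the nuanced human preferences that arise in long-horizon tasks.
Result: Agents trained with rewards designed via L2HR adhere better to human behavioral specifications, which improves behavioral alignment.
Insight: The main contribution is extending classical reward design to hierarchical reward design (HRDL) and proposing a way to generate hierarchical rewards directly from natural language (L2HR), which opens a route to finer-grained human-AI behavioral alignment in complex, long-horizon tasks.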
Abstract: When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.
[157] The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol cs.AI | cs.CLPDF
Andreas Schlapbach
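TL;DR: This paper argues that Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) converge on a single paradigm for deterministic, auditable LLM-agent interaction. From this convergence it extracts five foundational principles for schema design and proposes concrete design patterns, positioning schema-driven governance as a scalable mechanism for AI system oversight.
Details
Motivation: The paper sets out to identify and analyze the shared core idea behind SGD (built for dialogue-based API discovery) and MCP (the de facto standard for LLM-tool integration): schemas can encode not only tool signatures but also operational constraints and reasoning guidance. The analysis aims to address how both frameworks under-use failure modes and inter-tool relationships, and to give guidance for production-scale use (e.g. progressive disclosure).
Result: The paper reports no quantitative experiments or benchmarks. Its results are theoretical: five foundational schema-design principles (semantic completeness over syntactic precision, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration), with concrete design patterns for each that fill gaps in the existing frameworks.
Insight: The claimed contributions are: (1) SGD's original design was fundamentally sound and should be inherited by MCP; (2) both SGD and MCP leave failure modes and inter-tool relationships unexploited, gaps the paper identifies and resolves; (3) progressive disclosure is a critical production-scaling insight under real-world token constraints. Viewed objectively, the paper unifies two seemingly different frameworks under one theoretical paradigm and distills generally applicable design principles, giving a theoretical basis for more robust, auditable, and scalable LLM-agent interaction systems.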
Abstract: This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction. SGD, designed for dialogue-based API discovery (2019), and MCP, now the de facto standard for LLM-tool integration, share the same core insight – that schemas can encode not just tool signatures but operational constraints and reasoning guidance. By analyzing this convergence, we extract five foundational principles for schema design: (1) Semantic Completeness over Syntactic Precision, (2) Explicit Action Boundaries, (3) Failure Mode Documentation, (4) Progressive Disclosure Compatibility, and (5) Inter-Tool Relationship Declaration. These principles reveal three novel insights: first, SGD’s original design was fundamentally sound and should be inherited by MCP; second, both frameworks leave failure modes and inter-tool relationships unexploited – gaps we identify and resolve; third, progressive disclosure emerges as a critical production-scaling insight under real-world token constraints. We provide concrete design patterns for each principle. These principles position schema-driven governance as a scalable mechanism for AI system oversight without requiring proprietary system inspection – central to Software 3.0.
[158] Benchmark Test-Time Scaling of General LLM Agents cs.AI | cs.CLPDF
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang
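TL;DR: This paper introduces General AgentBench, a benchmark for evaluating general-purpose LLM agents across search, coding, reasoning, and tool-use domains. Leading LLM agents lose substantial performance when moving from domain-specific evaluation to the general-agent setting, and neither sequential scaling (iterative interaction) nor parallel scaling (sampling multiple trajectories) at test time improves performance effectively in practice.
Details
Motivation: Existing benchmarks focus on domain-specific environments for developing specialized agents. Evaluating general-purpose agents requires a more realistic setting that challenges them to operate across multiple skills and tools within a unified environment.
Result: An evaluation of ten leading LLM agents on General AgentBench shows a large performance drop in the general-agent setting. Neither test-time scaling method (sequential or parallel) yields effective performance gains.
Insight: The contribution is a unified evaluation framework for general agents, together with the identification of two fundamental limitations of test-time scaling: the context ceiling in sequential scaling and the verification gap in parallel scaling. These point to directions for improving general-agent performance.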
Abstract: LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
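One way to picture the verification gap in parallel scaling is to compare an oracle that succeeds whenever any sampled trajectory is correct with a selector that must pick one trajectory using an imperfect verifier score. The sketch below is a hypothetical illustration, not the benchmark's evaluation code. The data layout and function names are assumptions.

```python
def oracle_success(correct_flags):
    """Upper bound: the task counts as solved if any sampled trajectory is correct."""
    return any(correct_flags)

def selected_success(correct_flags, verifier_scores):
    """Practical outcome: keep only the trajectory the verifier scores highest."""
    best = max(range(len(verifier_scores)), key=lambda i: verifier_scores[i])
    return correct_flags[best]

def verification_gap(tasks):
    """tasks: list of (correct_flags, verifier_scores) pairs, one per task."""
    n = len(tasks)
    oracle = sum(oracle_success(flags) for flags, _ in tasks) / n
    selected = sum(selected_success(flags, scores) for flags, scores in tasks) / n
    return oracle, selected, oracle - selected

# Hypothetical samples: the verifier prefers a wrong trajectory in the first task.
tasks = [
    ([False, True, False], [0.9, 0.4, 0.1]),
    ([True, False], [0.8, 0.2]),
    ([False, False], [0.5, 0.6]),
]
oracle_rate, selected_rate, gap = verification_gap(tasks)
```

Sampling more trajectories can raise the oracle rate, but the realized success rate improves only if the verifier ranks a correct trajectory first.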
[159] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing cs.AI | cs.CL | cs.LOPDF
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk
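TL;DR: This paper evaluates the reasoning ability of four large language models (Gemini 2.5 Pro and Flash, Llama 3.3 70B, GPT-OSS 120B) in formally specified, rule-governed environments, using General Game Playing tasks such as state prediction and legal action generation. Most models perform well, but performance declines as the number of reasoning steps grows. Feature analysis and obfuscation experiments expose common error types on logic-based problems.
Details
Motivation: The paper systematically evaluates LLM reasoning from a new angle: formal, rule-driven environments, exemplified by General Game Playing, to probe how the models behave on structured problems and where they fall short.
Result: Across several reasoning tasks on General Game Playing instances, three of the four models perform well in most experimental settings, but performance drops as the evaluation horizon increases. The study also analyzes correlations between 40 structural game features and model performance.
Insight: The contribution is placing the evaluation of LLM reasoning in a formal rule-governed setting (GGP) and systematically analyzing the effect of structural features, linguistic semantics (through obfuscation experiments), and possible exposure to the games during training. Viewed objectively, the case analysis of common reasoning errors (such as hallucinated rules, redundant state facts, and syntactic errors) gives new insight into the limits of LLM formal reasoning.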
Abstract: This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks (next-state and multistep state formulation, and legal action generation) across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps). Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors. Overall, the paper reports clear progress in formal reasoning capabilities of contemporary models.
[160] Classroom Final Exam: An Instructor-Tested Reasoning Benchmark cs.AI | cs.CE | cs.CL | cs.CVPDF
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song
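TL;DR: This paper introduces CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning ability of large language models across more than 20 STEM domains. The benchmark is built from authentic, repeatedly used university homework and exam problems with instructor-provided reference solutions, and it is a significant challenge for frontier models. A diagnostic analysis shows that models struggle to reliably maintain correct intermediate states in multi-step reasoning and use their reasoning steps inefficiently.
Details
Motivation: The goal is a more authentic and more challenging benchmark for evaluating LLM reasoning on complex, multi-step STEM problems, addressing the limited authenticity and depth of existing benchmarks.
Result: On CFE, the best model, Gemini-3.1-pro-preview, reaches an overall accuracy of 59.69%, and Gemini-3-flash-preview reaches 55.46%, which leaves considerable room for improvement even for frontier models.
Insight: The contribution is a benchmark built from real teaching material across many STEM domains, paired with a diagnostic analysis based on decomposing solutions into reasoning flows. The analysis exposes two core weaknesses in multi-step reasoning: maintaining coherent intermediate states and step efficiency.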
Abstract: We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
[161] Time Series, Vision, and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces cs.AI | cs.CVPDF
Pratham Yashwante, Rose Yu
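TL;DR: This paper studies the limits of alignment among time series, vision, and language in contrastive representation spaces. Independently pretrained encoders show near-orthogonal geometry without explicit coupling. Training projection heads with contrastive learning achieves post-hoc alignment, and alignment improves with model size but is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language.
Details
Motivation: The work tests whether the Platonic Representation Hypothesis holds for the time-series modality, that is, whether pretrained models from different modalities share a latent structure of the world, and analyzes how time series align with vision and language.
Result: In contrastive representation spaces, alignment improves as model size grows, but time series align more strongly with vision than with text. Using images as intermediaries improves alignment between time series and language. Richer textual descriptions and richer visual representations help alignment only up to a threshold, beyond which there is no further gain.
Insight: The paper exposes an asymmetry in aligning non-conventional modalities such as time series within multimodal systems, which informs the design of systems that go beyond vision and language. Post-hoc alignment can effectively couple independently pretrained encoders, and images help as intermediaries for cross-modal alignment.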
Abstract: The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.
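Post-hoc alignment with contrastive learning can be sketched with a symmetric InfoNCE objective over paired embeddings. The abstract does not name the exact loss, so InfoNCE is an assumption here, and the function names and temperature value are hypothetical. In the paper's setting the inputs would be outputs of projection heads on frozen encoders. Plain lists stand in for them below.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(anchors)

def symmetric_info_nce(xs, ys, temperature=0.1):
    """Average of the x-to-y and y-to-x contrastive losses."""
    return 0.5 * (info_nce(xs, ys, temperature) + info_nce(ys, xs, temperature))

# Toy paired embeddings (hypothetical): aligned pairs versus swapped pairs.
series_emb = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_info_nce(series_emb, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_info_nce(series_emb, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each time-series embedding is closest to its own pair and large when the pairs are swapped, which is the signal a projection head would be trained on.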
[162] A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data cs.AI | cs.CVPDF
Joseph Bingham
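TL;DR: This paper proposes a multimodal computational framework that models core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations extracted from large-scale, crowd-sourced imagery. The system combines scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, approximating human perceptual categorization, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions.
Details
Motivation: Establishing stable mappings between natural language expressions and visual percepts is a foundational problem in cognitive science and artificial intelligence. The work aims to understand the mechanisms by which humans achieve cross-modal alignment in noisy, ambiguous perceptual contexts.
Result: Evaluated on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), the framework achieves robust referential grounding. It needs 65% fewer utterances than human interlocutors to reach stable mappings, and it identifies the target object from a single referring expression 41.66% of the time (versus 20% for humans), which is competitive on this classic cognitive benchmark.
Insight: The contribution is combining relatively simple perceptual-linguistic alignment mechanisms (SIFT and UQI) with linguistic processing operations to model human referential behavior. The results suggest that simple mechanisms can reach human-competitive performance on a cognitive task, with implications for models of grounded communication, perceptual inference, and cross-modal concept formation.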
Abstract: Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at https://anonymous.4open.science/r/metasequoia-9D13/README.md.
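The Universal Quality Index mentioned above has a compact closed form, Q = 4 * cov(x, y) * mean(x) * mean(y) / ((var(x) + var(y)) * (mean(x)^2 + mean(y)^2)). The sketch below computes it globally over two equal-length intensity sequences. How the paper applies UQI (for example over local windows or after SIFT alignment) is not stated in the abstract, so the global form and the function name are assumptions.

```python
def universal_quality_index(x, y):
    """Global UQI between two equal-length sequences of pixel intensities."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    denominator = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denominator == 0:
        raise ValueError("UQI is undefined for constant or zero-mean inputs")
    return 4 * cov_xy * mean_x * mean_y / denominator
```

The index lies in [-1, 1] and equals 1 only when the two signals are identical, so it can rank candidate images by similarity to a reference.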
cs.RO [Back]
[163] RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning cs.RO | cs.AI | cs.CVPDF
Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won
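TL;DR: This paper proposes RoboCurate, a framework for generating high-quality synthetic robot data. It replays the actions predicted for a generated video in a simulator and evaluates and filters action quality by comparing the consistency of motion between the simulator rollout and the generated video. It also increases observation diversity and appearance variety through image-to-image editing and action-preserving video-to-video transfer. Experiments show that the generated data substantially improves success rates on several robot learning tasks.
Details
Motivation: Synthetic robot data from video generative models suffers from inconsistent action quality because generated videos are imperfect, and existing vision-language models (VLMs) used to validate video quality cannot directly assess the physical accuracy of the generated actions.
Result: Relative to training on real data only, success rates improve by +70.1% on GR-1 Tabletop (300 demos), by +16.1% on DexMimicGen in the pre-training setup, and by +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
Insight: The core contribution is a data-filtering mechanism that verifies the physical consistency of generated actions directly through simulator replay, combined with image editing and video transfer to increase diversity, yielding higher-quality and more reliable synthetic training data for robots.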
Abstract: Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe RoboCurate’s generated data yield substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
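The replay-and-compare filter can be sketched as follows. The abstract does not specify the consistency measure, so the mean Euclidean distance between time-aligned trajectory points is used here purely as an assumption. The sample layout, function names, and threshold are hypothetical.

```python
import math

def motion_discrepancy(sim_traj, video_traj):
    """Mean Euclidean distance between time-aligned trajectory points."""
    if len(sim_traj) != len(video_traj) or not sim_traj:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, q) for p, q in zip(sim_traj, video_traj))
    return total / len(sim_traj)

def filter_samples(samples, threshold):
    """Keep samples whose simulator replay stays close to the motion in the video."""
    return [
        s for s in samples
        if motion_discrepancy(s["sim"], s["video"]) <= threshold
    ]

# Hypothetical samples: end-effector positions from replay and from the video.
samples = [
    {"id": "a", "sim": [(0.0, 0.0), (1.0, 0.0)], "video": [(0.0, 0.0), (1.0, 0.0)]},
    {"id": "b", "sim": [(0.0, 0.0), (1.0, 0.0)], "video": [(0.0, 1.0), (1.0, 1.0)]},
]
kept = filter_samples(samples, threshold=0.5)
```

Only the sample whose replayed motion matches the video survives the filter. In practice the threshold would need to be tuned against held-out data.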
[164] WildOS: Open-Vocabulary Object Search in the Wild cs.RO | cs.CVPDF
Hardik Shah, Erica Tevere, Deegan Atha, Marcel Kaufmann, Shehryar Khattak
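TL;DR: This paper presents WildOS, a unified system for long-range, open-vocabulary object search in the wild. It combines safe geometric exploration with semantic visual reasoning: a sparse navigation graph maintains spatial memory, and a foundation-model-based vision module, ExploRFM, scores the graph's frontier nodes by simultaneously predicting traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation.
Details
Motivation: In complex, unstructured outdoor environments, robots must navigate autonomously over long ranges without prior maps and with limited depth sensing. Exploring on geometric frontiers alone is often insufficient, so semantic reasoning about where to go and what is safe to traverse is needed for robust, efficient exploration.
Result: Extensive closed-loop field experiments across diverse off-road and urban terrains show that WildOS navigates robustly and significantly outperforms purely geometric and purely vision-based baselines in efficiency and autonomy.
Insight: The paper combines a foundation-model-based vision module (ExploRFM) with geometric exploration, unifying semantic information and geometric grounding. It also introduces a particle-filter-based method for coarse localization of the open-vocabulary target query, which estimates candidate goal positions beyond the robot's immediate depth horizon and so supports planning toward distant goals. This illustrates the potential of vision foundation models to drive open-world robot behavior that is both semantically informed and geometrically grounded.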
Abstract: Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient, and the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while utilizing a foundation-model-based vision module, ExploRFM, to score frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation tasks. The resulting vision-scored graph enables the robot to explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query, which estimates candidate goal positions beyond the robot's immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded. Project Page: https://leggedrobotics.github.io/wildos/
[165] Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization cs.RO | cs.CV | cs.LGPDF
Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas
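TL;DR: This paper proposes a test-time computation framework, value-guided multi-path reflection, for optimizing the decisions of vision-language models (VLMs) in complex, long-horizon robotic manipulation. The method decouples state evaluation from action generation, explicitly models the advantage of an action plan (quantified by its reduction in distance to the goal), and estimates it with a scalable critic. To address the stochasticity of single-trajectory evaluation, it uses beam search to explore multiple future paths and aggregates them during decoding to model their expected long-term return, producing more robust actions. A lightweight, confidence-based trigger allows early exit when direct predictions are reliable and invokes reflection only when necessary.
Details
Motivation: Existing methods that guide VLMs with reflective planning are limited in complex manipulation tasks: they learn state values implicitly, inefficiently, and inaccurately from noisy foresight predictions; they evaluate only a single greedy future; and they incur high inference latency.
Result: Extensive experiments on diverse, unseen multi-stage manipulation tasks show a 24.6% improvement in success rate over state-of-the-art baselines and a 56.5% reduction in inference time.
Insight: The main contributions are decoupling state evaluation from action generation to provide a more direct, fine-grained supervisory signal; explicitly modeling the advantage of an action plan and aggregating multiple beam-search futures to estimate expected return; and a lightweight confidence-based trigger for conditional reflection, which improves performance while substantially reducing computation.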
Abstract: Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.
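The interplay of the advantage definition, multi-path aggregation, and the confidence trigger can be sketched as below. This is only a toy illustration: the function names, the plain averaging over paths, and the fixed threshold are assumptions, and the critic's distance predictions are passed in as numbers rather than computed by a model.

```python
def plan_advantage(dist_before, dist_after):
    """Advantage of a plan: how much it reduces the distance to the goal."""
    return dist_before - dist_after

def expected_advantage(dist_before, path_dists_after):
    """Average advantage over several explored future paths.

    Plain averaging is an assumption standing in for the paper's aggregation.
    """
    gains = [plan_advantage(dist_before, d) for d in path_dists_after]
    return sum(gains) / len(gains)

def select_action(direct_action, confidence, threshold, dist_before, candidates):
    """Return the direct prediction when confident, otherwise reflect.

    candidates maps each candidate action to the predicted distances-to-goal
    at the end of each explored path (hypothetical critic outputs).
    """
    if confidence >= threshold:
        return direct_action  # early exit: skip the costlier reflection step
    return max(candidates,
               key=lambda a: expected_advantage(dist_before, candidates[a]))
```

With a confident direct prediction, no candidate is evaluated at all, which is where the latency saving in such a scheme would come from.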
[166] To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation cs.RO | cs.AI | cs.CVPDF
Apoorva Vashisth, Manav Kulshrestha, Pranav Bakshi, Damon Conover, Guillaume Sartoretti
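TL;DR: This paper introduces a new problem, Lifelong Interactive Navigation, which targets real environments where clutter blocks every feasible path. The authors propose an LLM-driven, constraint-based planning framework with active perception that lets a mobile robot with manipulation abilities reason over a scene graph, decide which obstacles to move, where to place them, and where to explore next, and thereby clear its own path to complete sequential object-placement tasks.
Details
Motivation: In real-world settings such as homes and warehouses, clutter can block all routes, which breaks the standard visual-navigation assumption that at least one obstacle-free path exists. The goal is to complete sequential object-placement tasks that require moving obstacles to open a path.
Result: Evaluated in the physics-enabled ProcTHOR-10k simulator, the approach outperforms non-learning and learning-based baselines on sequential object-placement tasks, and it is further demonstrated qualitatively on real hardware.
Insight: Coupling LLM reasoning with a structured scene graph and active perception lets the agent plan under constraints and explore the regions expected to contribute to task completion rather than mapping the environment exhaustively, which enables zero-shot generalization in complex, changing environments.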
Abstract: Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered/planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks, each involving placing a given object (e.g., Alarm clock, Pillow) onto a target object (e.g., Dining table, Desk, Bed). To address this lifelong setting, where the effects of environment changes accumulate and persist over the long term, we propose an LLM-driven, constraint-based planning framework with active perception. Our framework allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information. This coupling of reasoning and active perception allows the agent to explore the regions expected to contribute to task completion rather than exhaustively mapping the environment. A standard motion planner then executes the corresponding navigate-pick-place or detour sequence, ensuring reliable low-level control. Evaluated in the physics-enabled ProcTHOR-10k simulator, our approach outperforms non-learning and learning-based baselines. We further demonstrate our approach qualitatively on real-world hardware.
[167] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning cs.RO | cs.AI | cs.CVPDF
Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian
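TL;DR: NovaPlan is a hierarchical framework for zero-shot long-horizon robotic manipulation that combines closed-loop vision-language model (VLM) planning with a video generation model and uses geometric information for robot execution. A high-level VLM decomposes tasks and monitors execution, re-planning autonomously to recover from single-step failures. At the low level, object keypoints and human hand poses are extracted from generated videos as kinematic priors, and a switching mechanism selects the better reference to guide robot actions, keeping execution stable under occlusion or inaccurate depth.
Details
Motivation: Long-horizon tasks require robots to integrate high-level semantic reasoning with low-level physical interaction, but existing VLMs and video generation models lack physical grounding and are hard to use directly for real-world execution.
Result: Effectiveness is shown on three long-horizon tasks and the Functional Manipulation Benchmark (FMB); the system performs complex assembly tasks and exhibits dexterous error-recovery behavior without any prior demonstrations or training.
Insight: The framework unifies closed-loop VLM planning with video planning and uses geometric priors extracted from generated videos (object keypoints and human hand poses) as references for robot actions, with a switching mechanism for robustness, achieving zero-shot, training-free long-horizon manipulation.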
Abstract: Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
q-bio.QM [Back]
[168] AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting q-bio.QM | cs.AI | cs.CL | cs.LGPDF
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari
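TL;DR: AAVGen is a generative AI framework for de novo design of adeno-associated virus (AAV) capsid proteins with enhanced multi-trait profiles. It integrates a protein language model (PLM), supervised fine-tuning (SFT), and a reinforcement learning technique called Group Sequence Policy Optimization (GSPO), guided by a composite reward from three ESM-2-based regression predictors (for production fitness, kidney tropism, and thermostability), and it generates a diverse library of novel VP1 protein sequences.
Details
Motivation: AAVs are promising gene therapy vectors, but native serotypes are limited in tissue tropism, immune evasion, and production efficiency. Capsid engineering is hard because the sequence space is vast and multiple functional properties must be optimized at once; the kidney, with its unique anatomical barriers and cellular targets, demands especially precise and efficient vector engineering.
Result: In silico validation shows that most generated variants perform better on all three evaluation indices (production fitness, kidney tropism, and thermostability), indicating successful multi-objective optimization. Structural analysis with AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification.
Insight: The paper contributes a generative AI framework that integrates a PLM, SFT, and GSPO for data-driven viral vector engineering, steering design with a multi-objective reward to accelerate development of next-generation AAV vectors with tailored functional characteristics. More broadly, pairing generative AI with predictors of specific biophysical properties offers a new paradigm for rational design of complex protein sequences.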
Abstract: Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. The complexity increases further for the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validations revealed that the majority of the generated variants have superior performance across all three employed indices, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.
cs.LG [Back]
[169] TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning cs.LG | cs.AI | cs.CLPDF
Yujiao Yang
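TL;DR: This paper proposes TRUE, a trustworthy unified explanation framework that addresses the difficulty of interpreting the reasoning of large language models (LLMs). It integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis to provide multi-level, verifiable explanations of LLM reasoning.
Details
Motivation: Existing explanation methods lack trustworthy structural insight and are limited to single-instance analysis, so they cannot reveal reasoning stability or systematic failure mechanisms.
Result: Extensive experiments on multiple reasoning benchmarks show that the framework provides multi-level, verifiable explanations: executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance.
Insight: The framework redefines reasoning traces as executable process specifications, builds feasible-region DAGs through structure-consistent perturbations to characterize reasoning stability explicitly, and introduces a Shapley-value-based causal failure mode analysis that identifies recurring structural failure patterns and quantifies their influence.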
Abstract: Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.
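To make the Shapley-value step concrete, here is a minimal sketch of exact Shapley attribution over a small set of failure modes. It shows the general technique only; how TRUE defines the value function is not stated in the abstract, so the `failure_rate` table and the mode names below are made-up placeholders.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orders.

    value maps a frozenset of players to a number. Cost grows factorially,
    so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orders) for p in players}

# Hypothetical numbers: failure rate observed when a set of modes is present.
failure_rate = {
    frozenset(): 0.0,
    frozenset({"unit_error"}): 0.2,
    frozenset({"dropped_step"}): 0.1,
    frozenset({"unit_error", "dropped_step"}): 0.5,
}
scores = shapley_values(["unit_error", "dropped_step"], failure_rate.__getitem__)
```

The attributions always sum to the value of the full set minus the value of the empty set, which is what makes them usable as a quantified importance ranking.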
[170] Learning to Detect Language Model Training Data via Active Reconstruction cs.LG | cs.AI | cs.CLPDF
Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi
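TL;DR: This paper proposes the Active Data Reconstruction Attack (ADRA) for detecting language model training data. It uses reinforcement learning to actively induce the model to reconstruct a given text and exploits the difference in reconstructibility between training data and non-members for membership inference, substantially improving detection performance.
Details
Motivation: Conventional membership inference attacks (MIAs) operate passively on fixed model weights, using log-likelihoods or text generations, which limits their effectiveness. This work aims to separate training data from non-members more effectively by actively training the model to reconstruct text.
Result: ADRA and its adaptive variant ADRA+ improve on the previous runner-up by 10.7% on average when detecting pre-training, post-training, and distillation data. Over Min-K%++, the gains are 18.8% on BookMIA for pre-training detection and 7.6% on AIME for post-training detection, reaching state-of-the-art performance.
Insight: The work moves membership inference from a passive setting to active reconstruction, fine-tuning the model with reinforcement learning to strengthen data reconstruction and designing reconstruction metrics and contrastive rewards. By actively intervening through training, the method amplifies the separability of training data and non-members.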
Abstract: Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
[171] SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning cs.LG | cs.AI | cs.CL | stat.MLPDF
Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin
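TL;DR: This paper proposes a hybrid knowledge-injection framework for time-series diagnostic reasoning that combines the reasoning ability of general reasoning LLMs (GRLMs) with the domain knowledge of time-series LLMs (TSLMs). It injects TSLM-generated insights directly into the GRLM's reasoning trace and uses reinforcement learning with verifiable rewards (RLVR) to produce knowledge-rich traces automatically, yielding significant gains on multiple datasets. The authors also release the SenTSR-Bench benchmark.
Details
Motivation: Existing approaches to time-series diagnostic reasoning fall short in two ways: GRLMs lack the domain knowledge to understand complex time-series patterns, while fine-tuned TSLMs lack the general reasoning ability to handle more complicated questions.
Result: On SenTSR-Bench and other public datasets, the method consistently surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%, delivering robust, context-aware time-series diagnostic insights.
Insight: The contributions are a hybrid knowledge-injection framework that integrates domain knowledge directly into the reasoning trace; an RLVR-based approach that produces knowledge-rich traces without human supervision; and SenTSR-Bench, a multivariate time-series diagnostic reasoning benchmark collected from real-world industrial operations.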
Abstract: Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM’s reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an in-domain thinking trace into GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series-based diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%, delivering robust, context-aware time-series diagnostic insights.
[172] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning cs.LG | cs.CLPDF
Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang
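TL;DR: This paper proposes DSDR (Dual-Scale Diversity Regularization), a reinforcement learning framework that addresses insufficient exploration in LLM reasoning. It decomposes reasoning diversity into two scales: globally, it promotes diversity among correct reasoning trajectories to explore distinct solution modes; locally, it applies length-invariant, token-level entropy regularization on correct trajectories to prevent entropy collapse within a mode. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories.
Details
Motivation: Existing verifier-based reinforcement learning methods for LLM reasoning explore too little: policies collapse onto a few reasoning patterns and stop deep exploration prematurely, while conventional entropy regularization introduces only local stochasticity and fails to produce meaningful path-level diversity, leaving group-based policy optimization with weak, unstable learning signals.
Result: Experiments on multiple reasoning benchmarks show consistent improvements in accuracy and pass@k.
Insight: The method regularizes reasoning diversity at a global (between-path) scale and a local (within-path) scale, with theoretical support showing that it preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. This gives RLVR a systematic mechanism for promoting diversity in deep exploration.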
Abstract: Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
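The local term, a length-invariant token-level entropy restricted to correct trajectories, can be illustrated with a small sketch. This assumes that "length-invariant" means averaging per-token entropy over the trajectory length; the function names and the simple mean across trajectories are hypothetical and omit the paper's global-to-local weighting.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def length_invariant_entropy(token_dists):
    """Mean per-token entropy, so long trajectories do not dominate the term."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def local_entropy_bonus(trajectories):
    """Average entropy over correct trajectories only.

    trajectories is a list of (is_correct, token_dists) pairs; incorrect
    trajectories contribute nothing, matching the stated restriction.
    """
    correct = [length_invariant_entropy(d) for ok, d in trajectories if ok]
    return sum(correct) / len(correct) if correct else 0.0
```

Because the entropy is averaged rather than summed over tokens, a 2-token and a 10-token trajectory with the same per-token uncertainty receive the same bonus.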
[173] Wide Open Gazes: Quantifying Visual Exploratory Behavior in Soccer with Pose Enhanced Positional Data cs.LG | cs.CVPDF
Joris Bekkers
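TL;DR: This paper proposes a continuous stochastic vision layer, built on pose-enhanced spatiotemporal tracking data, to quantify soccer players' visual exploratory behavior. Probabilistic field-of-view and occlusion models use head and shoulder rotation angles to generate speed-dependent vision maps in a two-dimensional top-down plane; these maps are combined with pitch control and pitch value surfaces to analyze the phase in which a player awaits a pass and the subsequent on-ball phase. Using 32 games from the 2024 Copa America, the study shows that aggregated visual metrics (such as the percentage of defended area observed while awaiting a pass) predict the controlled pitch value gained at the end of dribbling actions.
Details
Motivation: The traditional approach counts visual exploratory actions (VEAs), defined as rapid head movements exceeding 125°/s, but it suffers from player-position bias (a focus on central midfielders), annotation difficulty, binary measurement (a player is scanning or not), an inability to predict short-term in-game success, and incompatibility with fundamental soccer analytics models such as pitch control.
Result: Using synchronized pose-enhanced tracking data and on-ball event data from 32 games of the 2024 Copa America, the study shows that aggregated visual metrics predict the controlled pitch value gained at the end of dribbling actions. The method works regardless of player position, needs no manual annotation, and provides continuous measurements.
Insight: The contributions are a continuous stochastic vision layer for quantifying visual perception, speed-dependent vision maps generated from probabilistic field-of-view and occlusion models, and seamless integration with existing soccer analytics frameworks such as pitch control and pitch value. The method addresses the limitations of the traditional approach and provides a finer-grained, position-independent, continuous measure of visual behavior.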
Abstract: Traditional approaches to measuring visual exploratory behavior in soccer rely on counting visual exploratory actions (VEAs) based on rapid head movements exceeding 125°/s, but this method suffers from player position bias (i.e., a focus on central midfielders), annotation challenges, and binary measurement constraints (i.e., a player is scanning, or not), lacks the power to predict relevant short-term in-game future success, and is incompatible with fundamental soccer analytics models such as pitch control. This research introduces a novel formulaic continuous stochastic vision layer to quantify players' visual perception from pose-enhanced spatiotemporal tracking. Our probabilistic field-of-view and occlusion models incorporate head and shoulder rotation angles to create speed-dependent vision maps for individual players in a two-dimensional top-down plane. We combine these vision maps with pitch control and pitch value surfaces to analyze the awaiting phase (when a player is awaiting the ball to arrive after a pass from a teammate) and their subsequent on-ball phase. We demonstrate that aggregated visual metrics, such as the percentage of defended area observed while awaiting a pass, are predictive of controlled pitch value gained at the end of dribbling actions using 32 games of synchronized pose-enhanced tracking data and on-ball event data from the 2024 Copa America. This methodology works regardless of player position, eliminates manual annotation requirements, and provides continuous measurements that seamlessly integrate into existing soccer analytics frameworks. To further support the integration with existing soccer analytics frameworks, we open-source the tools required to make these calculations.
[174] Phase-Consistent Magnetic Spectral Learning for Multi-View Clustering cs.LG | cs.CVPDF
Mingdong Lu, Zhikui Chen, Meng Liu, Shubin Ma, Liang Zhao
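TL;DR: This paper proposes Phase-Consistent Magnetic Spectral Learning for multi-view clustering. It explicitly models cross-view directional agreement as a phase term, combines it with a nonnegative magnitude backbone to build a complex-valued magnetic affinity matrix, extracts a stable shared spectral signal with a Hermitian magnetic Laplacian, and uses that signal as structured self-supervision for unsupervised multi-view representation learning and clustering.
Details
Motivation: In unsupervised multi-view clustering, existing methods rely on magnitude-only affinities or early pseudo targets; when different views induce relations of comparable strength but contradictory direction, the global spectral geometry is distorted and clustering performance degrades.
Result: Extensive experiments on multiple public multi-view benchmarks show that the method consistently outperforms strong baselines.
Insight: The method models cross-view directional agreement as a phase term and builds a complex-valued magnetic affinity matrix to capture a more stable shared structural signal; anchor-based high-order consensus modeling and a lightweight refinement provide robust inputs for spectral extraction, improving the robustness and performance of multi-view clustering.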
Abstract: Unsupervised multi-view clustering (MVC) aims to partition data into meaningful groups by leveraging complementary information from multiple views without labels, yet a central challenge is to obtain a reliable shared structural signal to guide representation learning and cross-view alignment under view discrepancy and noise. Existing approaches often rely on magnitude-only affinities or early pseudo targets, which can be unstable when different views induce relations with comparable strengths but contradictory directional tendencies, thereby distorting the global spectral geometry and degrading clustering. In this paper, we propose Phase-Consistent Magnetic Spectral Learning for MVC: we explicitly model cross-view directional agreement as a phase term and combine it with a nonnegative magnitude backbone to form a complex-valued magnetic affinity, extract a stable shared spectral signal via a Hermitian magnetic Laplacian, and use it as structured self-supervision to guide unsupervised multi-view representation learning and clustering. To obtain robust inputs for spectral extraction at scale, we construct a compact shared structure with anchor-based high-order consensus modeling and apply a lightweight refinement to suppress noisy or inconsistent relations. Extensive experiments on multiple public multi-view benchmarks demonstrate that our method consistently outperforms strong baselines.
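The combination of a magnitude backbone with a phase term can be illustrated with the standard magnetic Laplacian construction, L = D - W ⊙ exp(iΘ). The sketch below is a generic instance of that construction in plain Python, not the paper's exact formulation; it assumes a symmetric nonnegative magnitude matrix and an antisymmetric phase matrix, and the helper names are hypothetical.

```python
import cmath

def magnetic_laplacian(magnitude, phase):
    """Build L = D - H with H[i][j] = magnitude[i][j] * exp(1j * phase[i][j]).

    Assumes magnitude is symmetric and nonnegative and phase is antisymmetric
    (phase[j][i] == -phase[i][j]); under those assumptions L is Hermitian.
    """
    n = len(magnitude)
    lap = [[0j] * n for _ in range(n)]
    for i in range(n):
        degree = sum(magnitude[i])
        for j in range(n):
            h = magnitude[i][j] * cmath.exp(1j * phase[i][j])
            lap[i][j] = (degree if i == j else 0.0) - h
    return lap

def is_hermitian(m, tol=1e-12):
    """Check m[i][j] == conj(m[j][i]) for all entries, up to tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i].conjugate()) <= tol
               for i in range(n) for j in range(n))

def quadratic_form(m, x):
    """Real part of x^H m x; nonnegative for a positive semidefinite m."""
    n = len(m)
    return sum(x[i].conjugate() * m[i][j] * x[j]
               for i in range(n) for j in range(n)).real
```

Hermitian structure is what guarantees real eigenvalues, so a spectral embedding can be extracted even though the affinity entries are complex.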
[175] Bayesian Lottery Ticket Hypothesis cs.LG | cs.CVPDF
Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz
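TL;DR: This paper studies the Lottery Ticket Hypothesis (LTH) in Bayesian neural networks (BNNs). It finds that sparse subnetworks also exist in the Bayesian setting and can match or even exceed the accuracy of the original dense network, and it examines pruning strategies and how strongly models depend on mask structure and weight initialization.
Details
Motivation: BNNs are useful for uncertainty quantification but computationally expensive. The LTH has shown that sparse subnetworks are effective in conventional networks; this work tests whether it also holds in BNNs, in order to motivate sparse training algorithms and deepen understanding of the training process.
Result: Experiments on common computer vision models show that the LTH holds in BNNs: sparse subnetworks match or exceed the original accuracy across model sizes, with degradation at very high sparsity. Pruning should rely primarily on weight magnitude and secondarily on standard deviation.
Insight: The paper extends the LTH to the Bayesian setting and proposes a transplantation method that connects BNNs with deterministic lottery tickets. The study indicates that sparse subnetworks are common in BNNs and offers new observations on pruning strategy and dependence on model structure, which can inform the design of efficient Bayesian models.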
Abstract: Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can train to the same or even surpassing accuracy as the original dense network. Such sparse networks can lower the demand for computational resources at inference, and during training. The existence of the LTH and corresponding sparse subnetworks in BNNs could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study towards a transplantation method connecting BNNs with deterministic Lottery Tickets. We generally find that the LTH holds in BNNs, and winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude, secondly on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.
[176] DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation cs.LG | cs.CV | cs.CYPDF
Yangchen Zeng
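TL;DR: DeepInterestGR is a generative recommendation framework that uses multi-modal large language models to mine users' deep multi-interests and introduces reward labels and interest-enhanced Semantic ID discretization. It targets the weak interest modeling of existing methods, which rely only on shallow behavioral signals, and achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing generative recommendation methods mainly rely on shallow textual features such as item titles and descriptions, so the model fails to capture the latent, semantically rich interests behind user interactions, which limits personalization and interpretability.
Result: On three Amazon Review benchmarks, DeepInterestGR consistently outperforms state-of-the-art baselines on HR@K and NDCG@K.
Insight: The contributions are: (1) mining deep textual and visual interest representations with multi-modal LLMs through Chain-of-Thought prompting; (2) assigning reward labels to mined interests with a lightweight binary classifier, providing supervision signals for reinforcement learning; and (3) encoding deep interests and quantizing them into Semantic ID tokens. The two-stage training pipeline (supervised fine-tuning followed by reinforcement learning with an interest-aware reward) is also worth noting.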
Abstract: Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, limiting both personalization depth and recommendation interpretability. DeepInterestGR introduces three key innovations: (1) Multi-LLM Interest Mining (MLIM): We leverage multiple frontier LLMs along with their multi-modal variants to extract deep textual and visual interest representations through Chain-of-Thought prompting. (2) Reward-Labeled Deep Interest (RLDI): We employ a lightweight binary classifier to assign reward labels to mined interests, enabling effective supervision signals for reinforcement learning. (3) Interest-Enhanced Item Discretization (IEID): The curated deep interests are encoded into semantic embeddings and quantized into SID tokens via RQ-VAE. We adopt a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by our Interest-Aware Reward. Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines across HR@K and NDCG@K metrics.
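For readers unfamiliar with the reported metrics, a minimal sketch of HR@K and NDCG@K follows. It assumes the common single-target (leave-one-out) protocol, which the abstract does not confirm; with exactly one relevant item the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1).

```python
import math

def hit_rate_at_k(ranked_items, target, k):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K for a single relevant item: 1 / log2(rank + 1), or 0 if missed."""
    top = ranked_items[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position in the ranking
    return 1.0 / math.log2(rank + 1)
```

Reported scores are then averages of these per-user values over the test set.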
cs.DL [Back]
[177] Iconographic Classification and Content-Based Recommendation for Digitized Artworks cs.DL | cs.AI | cs.CV | cs.IRPDF
Krzysztof Kutt, Maciej Baczyński
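TL;DR: This paper presents a proof-of-concept system that uses the Iconclass vocabulary and AI methods to automate iconographic classification and content-based recommendation for digitized artworks. The prototype integrates YOLOv8 object detection, mappings to Iconclass codes, rule-based inference of abstract meanings, and three complementary recommenders, with the aim of accelerating cataloging and navigation in large cultural heritage repositories.
Details
Motivation: Manual iconographic classification and content recommendation in digitized artwork repositories are inefficient and scale poorly; the work uses AI methods to automate them and make cultural heritage management more efficient.
Result: The evaluation shows the system's potential: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and improve navigation, although more engineering is still needed.
Insight: The key idea is to let computer vision detect visible elements and to use symbolic structure (the Iconclass hierarchy) to infer meaning, automating the mapping from visual features to abstract meaning and offering a scalable AI approach for digitized cultural heritage.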
Abstract: We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity). Although more engineering is still needed, the evaluation demonstrates the potential of this solution: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and enhance navigation in large heritage repositories. The key insight is to let computer vision propose visible elements and to use symbolic structures (Iconclass hierarchy) to reach meaning.
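Two of the three recommenders named in the abstract, Jaccard similarity and IDF-weighted overlap, are standard set-based measures and can be sketched as below. The data layout (each artwork as a set of code strings), the function names, and the placeholder codes in the usage note are assumptions for illustration; the hierarchical-proximity recommender is not sketched here.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two code sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def idf_weights(collection):
    """IDF per code as log(N / document frequency) over all artworks."""
    n = len(collection)
    df = {}
    for codes in collection.values():
        for code in set(codes):
            df[code] = df.get(code, 0) + 1
    return {code: math.log(n / count) for code, count in df.items()}

def idf_overlap(a, b, idf):
    """Sum of IDF weights of shared codes, so rare shared codes count more."""
    return sum(idf.get(code, 0.0) for code in set(a) & set(b))

def rank_similar(query_id, collection, score):
    """Other artwork ids sorted by descending score against the query."""
    query = collection[query_id]
    others = [k for k in collection if k != query_id]
    return sorted(others, key=lambda k: score(query, collection[k]), reverse=True)
```

For example, `rank_similar("w1", works, jaccard)` ranks by raw overlap, while passing `lambda a, b: idf_overlap(a, b, idf_weights(works))` down-weights codes that appear on nearly every artwork.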
eess.IV [Back]
[178] TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking eess.IV | cs.CV | cs.LG | cs.MMPDF
Abdullah All Tanvir, Agnibh Dasgupta, Xin Zhong
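TL;DR: TIACam is a zero-watermarking framework designed for robustness to camera recapture. A learnable auto-augmentor simulates camera distortions, a text-anchored invariant feature learner provides cross-modal semantic alignment, and a zero-watermarking head embeds binary messages in the invariant feature space without modifying image pixels, jointly optimizing invariance, semantic alignment, and watermark recoverability.
Details
Motivation: Camera recapture introduces complex optical degradations (such as perspective warping, illumination shifts, and Moiré interference) that challenge deep watermarking systems; the goal is to improve watermark robustness under physical camera capture.
Result: Extensive experiments on synthetic and real-world camera captures show that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy.
Insight: The contributions are a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moiré operators; a text-anchored invariant feature learner that enforces semantic consistency through cross-modal adversarial alignment between image and text; and a zero-watermarking head that binds messages in the invariant feature space without modifying pixels. Together they form a principled bridge between multimodal invariance learning and physically robust zero-watermarking.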
Abstract: Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moiré interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moiré operators; (2) a text-anchored invariant feature learner that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a zero-watermarking head that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.
[179] Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images eess.IV | cs.CVPDF
Wen-Liang Lin, Yun-Chien Cheng
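TL;DR: This paper proposes an unsupervised domain adaptation semantic segmentation framework for pulmonary embolism detection in CTPA images. It uses a Transformer backbone and a Mean-Teacher architecture, improves pseudo-label reliability with prototype alignment, global and local contrastive learning, and an attention-based auxiliary local prediction module, and is validated on cross-center and cross-modality datasets.
Details
Motivation: Deep learning for pulmonary embolism detection in CTPA is hard to deploy in practice because of domain shift and the high cost of expert annotation.
Result: On the cross-center tasks between FUMPE and CAD-PE, IoU improves from 0.1152 to 0.4153 (FUMPE -> CAD-PE) and from 0.1705 to 0.4302 (CAD-PE -> FUMPE). On the CT -> MRI cross-modality task on the MMWHS dataset, the Dice score reaches 69.9% without using target-domain labels for model selection.
Insight: The contributions are prototype alignment to reduce category-level distribution discrepancies, global and local contrastive learning to capture pixel-level topology and global semantics, and an attention-based auxiliary local prediction module that increases sensitivity to small lesions. Together they improve pseudo-label quality and model generalization in unsupervised domain adaptation.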
Abstract: While deep learning has demonstrated considerable promise in computer-aided diagnosis for pulmonary embolism (PE), practical deployment in Computed Tomography Pulmonary Angiography (CTPA) is often hindered by “domain shift” and the prohibitive cost of expert annotations. To address these challenges, an unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean-Teacher architecture for cross-center semantic segmentation. The primary focus is placed on enhancing pseudo-label reliability by learning deep structural information within the feature space. Specifically, three modules are integrated and designed for this task: (1) a Prototype Alignment (PA) mechanism to reduce category-level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL) to capture both pixel-level topological relationships and global semantic representations; and (3) an Attention-based Auxiliary Local Prediction (AALP) module designed to reinforce sensitivity to small PE lesions by automatically extracting high-information slices from Transformer attention maps. Experimental validation conducted on cross-center datasets (FUMPE and CAD-PE) demonstrates significant performance gains. In the FUMPE -> CAD-PE task, the IoU increased from 0.1152 to 0.4153, while the CAD-PE -> FUMPE task saw an improvement from 0.1705 to 0.4302. Furthermore, the proposed method achieved a 69.9% Dice score in the CT -> MRI cross-modality task on the MMWHS dataset without utilizing any target-domain labels for model selection, confirming its robustness and generalizability for diverse clinical environments.
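The two metrics quoted above, IoU and Dice, can be computed from a pair of binary masks as in the sketch below. The flat-list mask representation, the function names, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; the paper's own evaluation code is not described in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally sized binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


def dice_from_iou(iou):
    """Identity linking the two metrics for a single mask pair: Dice = 2*IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)
print(iou, dice)  # 2 shared pixels, union of 4: IoU = 0.5, Dice = 2/3
```

The identity in `dice_from_iou` holds for one mask pair only. It does not carry over to scores averaged across cases, so it should not be used to convert the dataset-level numbers reported above.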
cs.CR [Back]
[180] Watermarking LLM Agent Trajectories cs.CR | cs.CL
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li
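TL;DR: This paper proposes ActHook, the first watermarking method for LLM agent trajectory data. It inserts hook actions at decision points that are activated by a secret input key, enabling highly reliable black-box detection without affecting performance on the original task.
Details
Motivation: Existing work has overlooked copyright protection for LLM agent trajectory data, leaving data creators exposed to theft and unable to trace misuse. A dedicated watermarking method is needed to protect these costly datasets.
Result: Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B with negligible performance degradation.
Insight: The key idea borrows the hook mechanism from software engineering: the watermark is embedded as hook actions that a secret key can activate, and the sequential nature of agent execution lets them be inserted without disrupting the task flow. This gives an effective copyright protection scheme for sequential decision data.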
Abstract: LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.
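As a rough illustration of how black-box detection could be scored, the sketch below measures how often a hook action appears in a model's trajectories and computes an AUC from per-model hook rates. The action names, the example rates, and the use of the hook rate as the detection score are assumptions for this example; the abstract does not specify ActHook's actual detection statistic.

```python
def hook_rate(actions, hook_action):
    """Fraction of actions in a trajectory that equal the hook action."""
    return actions.count(hook_action) / len(actions)


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical trajectory produced when the activation key is present in the input.
trajectory = ["search", "verify", "answer", "verify"]
print(hook_rate(trajectory, "verify"))  # 2 of 4 actions are the hook action: 0.5

# Hypothetical hook rates for models trained on watermarked vs. clean trajectories.
watermarked = [0.30, 0.25, 0.10]
clean = [0.05, 0.10, 0.02]
print(auc(watermarked, clean))  # 8.5 of 9 pairs favor the watermarked model
```

An AUC near 1 means the hook rate separates the two groups almost perfectly, while 0.5 means it carries no signal. The pairwise definition used here is equivalent to the area under the ROC curve.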
[181] PrivacyBench: Privacy Isn’t Free in Hybrid Privacy-Preserving Vision Systems cs.CR | cs.CV
Nnaemeka Obiefuna, Samuel Oyeneye, Similoluwa Odunaiya, Iremide Oyelaja, Steven Kolawole
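TL;DR: This paper presents PrivacyBench, a benchmarking framework for evaluating combinations of techniques in hybrid privacy-preserving vision systems. It finds that combining federated learning (FL) with differential privacy (DP) causes severe convergence failure and performance loss, whereas combining FL with secure multi-party computation (SMPC) keeps performance near baseline. With automated configuration and resource monitoring, the framework offers a systematic platform for evaluating privacy-utility-cost trade-offs.
Details
Motivation: Privacy-preserving machine learning deployments in sensitive deep learning applications often need to combine several techniques, yet there is no systematic guidance for assessing the synergistic and non-additive interactions of these hybrid configurations. Existing approaches rely on analyzing techniques in isolation and miss critical system-level interactions.
Result: A systematic evaluation of ResNet18 and ViT models on medical datasets shows that with FL + DP accuracy drops from 98% to 13% while compute cost and energy consumption rise substantially, whereas FL + SMPC maintains near-baseline performance with modest overhead.
Insight: The paper provides the first systematic platform for evaluating the interaction effects of combined privacy techniques, shows that privacy techniques cannot be composed arbitrarily, and moves privacy-preserving computer vision from ad-hoc evaluation toward principled systems design.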
Abstract: Privacy-preserving machine learning deployments in sensitive deep learning applications, from medical imaging to autonomous systems, increasingly require combining multiple techniques. Yet, practitioners lack systematic guidance to assess the synergistic and non-additive interactions of these hybrid configurations, relying instead on isolated technique analysis that misses critical system-level interactions. We introduce PrivacyBench, a benchmarking framework that reveals striking failures in privacy technique combinations with severe deployment implications. Through systematic evaluation across ResNet18 and ViT models on medical datasets, we uncover that FL + DP combinations exhibit severe convergence failure, with accuracy dropping from 98% to 13% while compute costs and energy consumption substantially increase. In contrast, FL + SMPC maintains near-baseline performance with modest overhead. Our framework provides the first systematic platform for evaluating privacy-utility-cost trade-offs through automated YAML configuration, resource monitoring, and reproducible experimental protocols. PrivacyBench enables practitioners to identify problematic technique interactions before deployment, moving privacy-preserving computer vision from ad-hoc evaluation toward principled systems design. These findings demonstrate that privacy techniques cannot be composed arbitrarily and provide critical guidance for robust deployment in resource-constrained environments.
cs.CY [Back]
[182] Can Large Language Models Replace Human Coders? Introducing ContentBench cs.CY | cs.AI | cs.CL
Michael Haman
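TL;DR: This paper introduces ContentBench, a public benchmark suite for assessing whether low-cost large language models can replace human coders in content analysis. The author reports results from the first benchmark dataset, ContentBench-ResearchTalk v1.0, which contains 1,000 synthetic social media posts about academic research in five categories. Among the 59 evaluated models, the best low-cost LLMs reach roughly 97-99% agreement with the reference labels at very low cost, but small open-weight models perform poorly on sarcastic content.
Details
Motivation: The paper asks whether low-cost large language models can take over the interpretive coding work that still anchors much of empirical content analysis, and provides a standardized evaluation framework to quantify their performance and cost.
Result: On ContentBench-ResearchTalk v1.0, the best low-cost LLMs (e.g., GPT-4o-mini) reach roughly 97-99% agreement with reference labels that were agreed unanimously by top reasoning models and audited by the author, far above GPT-3.5 Turbo. Coding 50,000 posts costs only a few dollars. However, small locally run models (e.g., Llama 3.2 3B) do very poorly on sarcastic content, reaching only 4% agreement on hard-sarcasm items.
Insight: The contribution is ContentBench itself, a versioned, community-extensible benchmark suite for systematically evaluating the performance and cost-effectiveness of LLMs on interpretive coding tasks. Its method of using synthetic data, deriving reference labels from the consensus of top models, and adding a human audit gives a reliable and reproducible standard for studying whether LLMs can replace human coders.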
Abstract: Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control audit. Among the 59 evaluated models, the best low-cost LLMs reach roughly 97-99% agreement with these jury labels, far above GPT-3.5 Turbo, the model behind early ChatGPT and the initial wave of LLM-based text annotation. Several top models can code 50,000 posts for only a few dollars, pushing large-scale interpretive coding from a labor bottleneck toward questions of validation, reporting, and governance. At the same time, small open-weight models that run locally still struggle on sarcasm-heavy items (for example, Llama 3.2 3B reaches only 4% agreement on hard-sarcasm). ContentBench is released with data, documentation, and an interactive quiz at contentbench.github.io to support comparable evaluations over time and to invite community extensions.
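The labeling protocol described above (keep an item only when all jury models agree, then score a candidate model against those labels) can be sketched as follows. The data layout, function names, and toy posts are assumptions made for illustration, and the author's manual audit step is not modeled.

```python
def unanimous_labels(jury_votes):
    """Keep only items on which every juror gave the same label.

    jury_votes maps an item id to the list of labels assigned by the jury models.
    """
    return {
        item: votes[0]
        for item, votes in jury_votes.items()
        if len(set(votes)) == 1
    }


def agreement(model_labels, reference):
    """Share of reference items on which the candidate model matches the reference label."""
    scored = [item for item in reference if item in model_labels]
    if not scored:
        raise ValueError("no overlapping items to score")
    matches = sum(model_labels[item] == reference[item] for item in scored)
    return matches / len(scored)


# Toy votes from three hypothetical jury models.
votes = {
    "p1": ["praise", "praise", "praise"],
    "p2": ["sarcasm", "critique", "sarcasm"],   # no consensus, so dropped
    "p3": ["question", "question", "question"],
    "p4": ["sarcasm", "sarcasm", "sarcasm"],
}
reference = unanimous_labels(votes)

# Labels from a hypothetical low-cost model that misreads the sarcastic post.
candidate = {"p1": "praise", "p2": "critique", "p3": "question", "p4": "praise"}
print(agreement(candidate, reference))  # matches on p1 and p3, misses p4: 2/3
```

Because disputed items are removed before scoring, agreement is measured only on posts where the jury was unanimous. That is worth keeping in mind when comparing such figures with human inter-coder reliability computed on unfiltered samples.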
[183] Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos cs.CY | cs.AI | cs.CV
Yixuan Shen, Peng He, Honglu Liu, Yuyang Ji, Tingting Li
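TL;DR: This paper presents SciIBI, the first benchmark built on K-12 science classroom videos, for evaluating how well multimodal large language models analyze instructional practices. It finds that existing models struggle to distinguish similar instructional practices and that adding the video modality does not yield consistent gains, indicating that models often rely on surface shortcuts rather than genuine pedagogical reasoning.
Details
Motivation: Existing benchmarks for classroom discourse analysis rely mainly on transcripts from mathematics lessons and overlook the visual artifacts and model-based reasoning of science classrooms, so they fall short of what the Next Generation Science Standards (NGSS) require. This paper fills the gap with a multimodal benchmark for assessing how well AI understands science instructional practices.
Result: Eight state-of-the-art LLMs and multimodal LLMs were evaluated on SciIBI, which contains 113 NGSS-aligned video clips annotated with Core Instructional Practices and their sophistication levels. The models show fundamental limitations in distinguishing similar instructional practices, the gains from video input vary by architecture and are inconsistent, and overall the models fall short of genuine pedagogical understanding.
Insight: The paper builds the first multimodal analysis benchmark for science classroom videos and exposes a basic weakness of current multimodal LLMs on pedagogical reasoning tasks: reliance on surface pattern matching. It points toward human-AI collaboration, in which models act as evidence-retrieval tools that assist expert review rather than replacing humans.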
Abstract: K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet, the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation reveals that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, where models retrieve evidence to accelerate expert review rather than replace it.