Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 60]
- cs.LG [Total: 5]
- cs.AI [Total: 5]
- cs.MA [Total: 2]
- cs.CY [Total: 1]
- cs.CR [Total: 1]
- eess.AS [Total: 1]
- cs.IR [Total: 2]
- cs.SD [Total: 1]
cs.CL
[1] Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs cs.CL | cs.AI
Snehit Vaddi, Pujith Vaddi
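TL;DR: This paper examines whether hallucination neurons in large language models generalize across knowledge domains. It finds that these neurons predict hallucinations well within a given domain, but performance drops markedly across domains, indicating that hallucination mechanisms are domain-specific.
Details
Motivation: To test whether previously identified hallucination neurons generalize across domains, and thereby whether the neural mechanism behind hallucination is unified.
Result: Across 6 knowledge domains and 5 open-weight models, hallucination-neuron classifiers reach an AUROC of 0.783 within-domain but only 0.563 across domains; the drop is significant and consistent.
Insight: The key finding is that the neural mechanism of hallucination is domain-specific rather than a single universal mechanism, which implies that neuron-level hallucination detectors must be calibrated per domain and cannot be deployed universally.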
Abstract: Recent work identifies a sparse set of “hallucination neurons” (H-neurons), less than 0.1% of feed-forward network neurons, that reliably predict when large language models will hallucinate. These neurons are identified on general-knowledge question answering and shown to generalize to new evaluation instances. We ask a natural follow-up question: do H-neurons generalize across knowledge domains? Using a systematic cross-domain transfer protocol across 6 domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) and 5 open-weight models (3B to 8B parameters), we find they do not. Classifiers trained on one domain’s H-neurons achieve AUROC 0.783 within-domain but only 0.563 when transferred to a different domain (delta = 0.220, p < 0.001), a degradation consistent across all models tested. Our results suggest that hallucination is not a single mechanism with a universal neural signature, but rather involves domain-specific neuron populations that differ depending on the knowledge type being queried. This finding has direct implications for the deployment of neuron-level hallucination detectors, which must be calibrated per domain rather than trained once and applied universally.
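The headline comparison is a difference between two AUROC values (0.783 within-domain versus 0.563 cross-domain, a gap of 0.220). As a minimal sketch of what that metric measures, the snippet below computes AUROC as a pairwise ranking probability. The function name `auroc` and the toy scores are illustrative assumptions, not the authors' code or data.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. O(P*N), which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up probe scores (1 = hallucinated answer, 0 = faithful answer).
within = auroc([1, 1, 0, 0], [0.9, 0.7, 0.4, 0.2])  # scores separate the classes
cross = auroc([1, 1, 0, 0], [0.6, 0.3, 0.5, 0.4])   # scores barely separate them
print(within, cross, within - cross)
```

An AUROC near 0.5 means the classifier ranks hallucinated and faithful answers about as well as chance, which is why a cross-domain value of 0.563 reads as near-failure of transfer.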
[2] OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models cs.CL | cs.AI
Haijian Liang, Zenghao Niu, Junjie Wu, Changwang Zhang, Wangchunshu Zhou
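TL;DR: This paper proposes OThink-SRR1, a framework that uses reinforcement learning to train an iterative Search-Refine-Reason process, strengthening the ability of large language models to handle complex multi-hop questions. Its core innovation is a Refine stage that condenses retrieved documents into concise, relevant facts before reasoning. The model is trained end to end with the GRPO-IR reinforcement learning algorithm, which rewards accurate evidence identification and penalizes excessive retrieval.
Details
Motivation: Static retrieval in current RAG systems struggles with complex multi-hop questions, while dynamic retrieval strategies face two challenges: retrieval noise that misleads reasoning, and the high computational cost of processing full documents.
Result: Experiments on four multi-hop QA benchmarks show that the method achieves higher accuracy than strong baselines while using fewer retrieval steps and fewer tokens.
Insight: The main innovation is integrating refinement as a separate, learnable stage of the retrieval-augmented pipeline and jointly optimizing retrieval accuracy and efficiency with a dedicated reinforcement learning algorithm (GRPO-IR), which offers a promising foundation for efficient information-seeking agents.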
Abstract: Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient. Experiments on four multi-hop QA benchmarks show our approach achieves superior accuracy over strong baselines while using fewer retrieval steps and tokens. This positions OThink-SRR1 as a potent foundational model for information-seeking agents.
[3] Hybrid Multi-Phase Page Matching and Multi-Layer Diff Detection for Japanese Building Permit Document Review cs.CL | cs.CV
Mitsumasa Wada
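TL;DR: This paper proposes a hybrid multi-phase page matching algorithm and a multi-layer diff detection engine for automated review of Japanese building permit documents. By combining longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage, the algorithm robustly pairs pages across revision cycles even when page order, numbering, or content changes substantially. A multi-layer diff engine with text-level, table-level, and pixel-level visual differencing then produces highlighted difference reports.
Details
Motivation: Japanese building permit review requires cross-referencing large PDF document sets across revision cycles, which is labor-intensive and error-prone when done manually.
Result: On real-world permit document sets, evaluated against a manually annotated benchmark, the method achieves F1=0.80 and precision=1.00, with zero false-positive matched pairs.
Insight: The novelty lies in combining structural alignment, multi-phase consensus matching, and dynamic programming to handle complex changes between document revisions, and in using multi-layer (text, table, pixel) diff detection to produce comprehensive difference reports, improving the robustness and accuracy of automated review.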
Abstract: We present a hybrid multi-phase page matching algorithm for automated comparison of Japanese building permit document sets. Building permit review in Japan requires cross-referencing large PDF document sets across revision cycles, a process that is labor-intensive and error-prone when performed manually. The algorithm combines longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage to robustly pair pages across revisions even when page order, numbering, or content changes substantially. A subsequent multi-layer diff engine – comprising text-level, table-level, and pixel-level visual differencing – produces highlighted difference reports. Evaluation on real-world permit document sets achieves F1=0.80 and precision=1.00 on a manually annotated ground-truth benchmark, with zero false-positive matched pairs.
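The first ingredient named in the abstract, LCS structural alignment, can be sketched in a few lines. This is only an illustration of plain LCS over page labels, assuming each page has already been reduced to a comparable string; the function name `lcs_align` and the sample page labels are hypothetical, and the paper's seven-phase pipeline and DP alignment stage are not reproduced here.

```python
def lcs_align(old_pages, new_pages):
    """Pair pages of two revisions along a longest common subsequence.

    Returns a list of (old_index, new_index) pairs in increasing order.
    """
    n, m = len(old_pages), len(new_pages)
    # dp[i][j] = LCS length of old_pages[i:] and new_pages[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old_pages[i] == new_pages[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal pairing.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if old_pages[i] == new_pages[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


old = ["cover", "site plan", "floor plan", "elevation"]
new = ["cover", "index", "floor plan", "site plan", "elevation"]
print(lcs_align(old, new))
```

In this toy case the reordered "site plan" page is left unpaired, because LCS only preserves order-consistent matches. That limitation is a plausible reason for following LCS with additional matching phases, though the digest does not spell out how the later phases work.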
[4] PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models cs.CL | cs.AI
Jiyuan An, Jiachen Zhao, Fan Chen, Liner Yang, Zhenghao Liu
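TL;DR: PR-CAD is an LLM-based progressive refinement framework for unified, controllable, and faithful text-to-CAD generation and editing. Using a specially curated high-fidelity interaction dataset and a reasoning agent enhanced by reinforcement learning, it integrates intent understanding, parameter estimation, and precise edit localization into one system, giving an "all-in-one" solution from creation to modification.
Details
Motivation: Traditional CAD modeling relies on manual work and specialized expertise, and existing LLM-based text-to-CAD methods usually treat generation and editing as separate tasks, which limits their practicality. This paper aims to unify generation and editing to make CAD modeling more controllable and more faithful.
Result: On public benchmarks, PR-CAD achieves state-of-the-art (SOTA) controllability and faithfulness in both generation and refinement scenarios and significantly improves CAD modeling efficiency. Experiments show strong mutual reinforcement between generation and editing tasks, and between qualitative and quantitative modalities.
Insight: The core contributions are a unified progressive refinement framework that merges generation and editing; a high-quality interaction dataset covering the full CAD lifecycle with multiple representations and description types; and a single reasoning agent, enhanced by reinforcement learning, that carries out the end-to-end flow from intent understanding to precise localization.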
Abstract: The construction of CAD models has traditionally relied on labor-intensive manual operations and specialized expertise. Recent advances in large language models (LLMs) have inspired research into text-to-CAD generation. However, existing approaches typically treat generation and editing as disjoint tasks, limiting their practicality. We propose PR-CAD, a progressive refinement framework that unifies generation and editing for controllable and faithful text-to-CAD modeling. To support this, we curate a high-fidelity interaction dataset spanning the full CAD lifecycle, encompassing multiple CAD representations as well as both qualitative and quantitative descriptions. The dataset systematically defines the types of edit operations and generates highly human-like interaction data. Building on a CAD representation tailored for LLMs, we propose a reinforcement learning-enhanced reasoning framework that integrates intent understanding, parameter estimation, and precise edit localization into a single agent. This enables an “all-in-one” solution for both design creation and refinement. Extensive experiments demonstrate strong mutual reinforcement between generation and editing tasks, and across qualitative and quantitative modalities. On public benchmarks, PR-CAD achieves state-of-the-art controllability and faithfulness in both generation and refinement scenarios, while also proving user-friendly and significantly improving CAD modeling efficiency.
[5] Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs cs.CL
Amirul Rahman, Aisha Karim, Kenji Nakamura, Yi-Fan Ng
TL;DR: This paper proposes Budget-Adaptive Curriculum Reasoning (BACR), a unified framework for optimizing both the compute efficiency and the performance of large language models on reasoning tasks. Through three cooperating components (a budget-conditioned unified policy, a curriculum-aware budget scheduler, and a truncation-aware dense reward), it dynamically adjusts how compute is allocated to problems of different difficulty, addressing the "overthinking" and "underthinking" caused by fixed or uniformly sampled compute budgets.
Details
Motivation: When scaling test-time compute to improve LLM reasoning, existing methods usually use fixed or uniformly sampled budgets and ignore the mismatch between problem difficulty and allocated compute. This wastes compute on easy problems (overthinking) and starves hard problems (underthinking), lowering overall compute efficiency.
Result: On mathematical reasoning benchmarks (MATH, GSM8K, AIME, Minerva Math), BACR outperforms strong baselines at every compute budget, with up to 8.3% higher accuracy under tight budgets and 34% lower average token consumption than unconstrained reasoning.
Insight: The innovations are: 1) a unified policy that embeds the compute budget as a continuous conditioning signal, with no need for decoupled thinking and summarization strategies; 2) a curriculum-aware scheduler that adapts the training budget distribution to real-time learning progress; 3) a truncation-aware dense reward that uses process-level verification to assign fine-grained credit at intermediate reasoning steps; 4) Budget-Conditioned Advantage Estimation (BCAE), which reduces the variance of the policy gradient. Together these mechanisms allocate compute adaptively to problem difficulty and improve reasoning efficiency.
Abstract: Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets, ignoring the fundamental mismatch between problem difficulty and allocated compute. This leads to overthinking on easy problems and underthinking on hard ones, resulting in suboptimal token efficiency across diverse reasoning scenarios. In this paper, we propose Budget-Adaptive Curriculum Reasoning (BACR), a unified framework that jointly optimizes reasoning quality and token efficiency through three synergistic components: (1) a budget-conditioned unified policy that embeds the token budget as a continuous conditioning signal, eliminating the need for decoupled thinking and summarization strategies; (2) a curriculum-aware budget scheduler that adaptively shifts the training budget distribution from easy to hard problems based on real-time learning progress; and (3) a truncation-aware dense reward mechanism that provides fine-grained credit assignment at intermediate reasoning steps via process-level verification. We further introduce Budget-Conditioned Advantage Estimation (BCAE), a novel variance reduction technique that conditions the advantage baseline on the sampled budget, yielding more stable policy gradients. Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) demonstrate that BACR consistently outperforms other strong baselines across all token budgets, achieving up to 8.3% accuracy improvement under tight budgets while reducing average token consumption by 34% compared to unconstrained reasoning.
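The abstract describes BCAE only as conditioning the advantage baseline on the sampled budget. One simple way to read that, sketched below, is to subtract the mean reward of rollouts that share the same budget instead of a single global mean. This is an assumption for illustration, not the paper's estimator; the function name and the toy rollouts are hypothetical.

```python
from collections import defaultdict


def budget_conditioned_advantages(rollouts):
    """Compute advantages with a per-budget mean-reward baseline.

    rollouts: list of (budget, reward) tuples.
    Returns one advantage per rollout, in input order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for budget, reward in rollouts:
        totals[budget] += reward
        counts[budget] += 1
    return [reward - totals[budget] / counts[budget] for budget, reward in rollouts]


# Toy rollouts: a tight 256-token budget succeeds half the time,
# a generous 1024-token budget always succeeds.
rollouts = [(256, 0.0), (256, 1.0), (1024, 1.0), (1024, 1.0)]
print(budget_conditioned_advantages(rollouts))
```

With a global baseline (mean reward 0.75) the successful tight-budget rollout would get the same advantage as the easy large-budget ones. Conditioning on the budget removes the part of the reward that is explained by the budget alone, which is the intuition behind the claimed variance reduction.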
[6] LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans cs.CL | cs.AI | cs.CY
Ljubisa Bojic, Alexander Felfernig, Bojana Dinic, Velibor Ilic, Achim Rettinger
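TL;DR: Using more than 120,000 unique agent-persona combinations derived from 1,511 Serbian participants, this study evaluates how accurately LLM-based agents predict human social media reactions (like, dislike, comment, share, no reaction). LLM agents show predictive ability beyond chance (Matthews Correlation Coefficient, MCC, of 0.29), but a conventional TF-IDF text classifier outperforms them (MCC of 0.36), suggesting that their predictive gains come mainly from access to semantics rather than from uniquely agentic reasoning.
Details
Motivation: As autonomous AI agents increasingly take part in social media, understanding their behavioral fidelity is critical for platform governance and democratic resilience. Earlier work shows that LLM-powered agents can replicate aggregate survey responses, but few studies test whether agents can predict a specific individual's reaction to specific content.
Result: Agents reach 70.7% overall accuracy when predicting the five reaction types, with the choice of LLM producing a 13 percentage-point spread. In the binary forced-choice (like/dislike) evaluation, agents reach an MCC of 0.29, while the conventional TF-IDF text classifier does better at 0.36.
Insight: The contribution is a large-scale empirical evaluation of zero-shot persona-prompted LLM agents for individual-level behavior prediction, showing that their predictive ability likely stems from semantic understanding rather than uniquely agentic reasoning. This warns of manipulation risks from easily deployed swarms of behaviorally distinct AI agents on social media, while also opening opportunities to use such agents in simulations that predict polarization dynamics and inform AI policy. The zero-shot approach needs no task-specific training, which makes large-scale deployment easy.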
Abstract: Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals’ reactions to specific content. This study benchmarks LLM-based agents’ accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.
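Study 2 reports the Matthews Correlation Coefficient because it corrects for chance agreement in a way raw accuracy does not. A minimal sketch of the standard binary MCC formula follows; the confusion-matrix counts are made up for illustration and are not taken from the study.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal is empty (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# A predictor that always answers "like" on a 90/10 like/dislike split
# is 90% accurate but carries no signal, so its MCC is 0.
print(mcc(tp=90, fp=10, fn=0, tn=0))
# A perfect predictor scores 1.
print(mcc(tp=50, fp=0, fn=0, tn=50))
```

This is why an MCC of 0.29 counts as genuine signal beyond chance even though it looks modest next to the 70.7% accuracy from Study 1: the two numbers are on different scales.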
[7] From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents cs.CL
Md Nayem Uddin, Kumar Shubham, Eduardo Blanco, Chitta Baral, Gengyu Wang
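TL;DR: This paper introduces Memora, a benchmark for evaluating the long-term memory of personalized agents. It covers user conversations spanning weeks to months and evaluates three tasks: remembering, reasoning, and recommending. The authors propose Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on outdated memory. Evaluating four large language models and six memory agents reveals that they frequently reuse invalid memories and struggle with evolving memories.
Details
Motivation: Existing benchmarks mostly treat long-term memory evaluation as retrieving facts from past conversations, which does not adequately test an agent's ability to consolidate memory over time or handle frequent knowledge updates. A more comprehensive evaluation framework is needed.
Result: On Memora, the four LLMs and six memory agents often reuse invalid memories and fail to reconcile evolving memories. Memory agents bring only marginal improvements, exposing the shortcomings of long-term memory in current personalized agents.
Insight: The novelty is a benchmark (Memora) that simulates long-term, dynamic user interaction, together with the FAMA metric for quantifying reliance on outdated memory, giving a more complete view of how well agents consolidate and update memory in realistic settings.
Abstract: Personalized agents that interact with users over long periods must maintain persistent memory across sessions and update it as circumstances change. However, existing benchmarks predominantly frame long-term memory evaluation as fact retrieval from past conversations, providing limited insight into agents’ ability to consolidate memory over time or handle frequent knowledge updates. We introduce Memora, a long-term memory benchmark spanning user conversations that last weeks to months. The benchmark evaluates three memory-grounded tasks: remembering, reasoning, and recommending. To ensure data quality, we employ automated memory-grounding checks and human evaluation. We further introduce Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on obsolete or invalidated memory when evaluating long-term memory. Evaluations of four LLMs and six memory agents reveal frequent reuse of invalid memories and failures to reconcile evolving memories. Memory agents offer marginal improvements, exposing shortcomings in long-term memory for personalized agents.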
[8] TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs cs.CL | cs.AI
Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang
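TL;DR: TriEx is a game-based, tri-view explainability framework for explaining the internal reasoning of multi-agent large language models in interactive, partially observable environments. Through three aligned views (first-person self-reasoning, second-person beliefs about opponents, and third-person environment audits), it turns explanations from free-form narratives into evidence-anchored objects that can be compared across time and perspectives.
Details
Motivation: In interactive, partially observable environments, the decisions of LLM agents depend on evolving beliefs and on other agents, which makes their explainability especially challenging.
Result: On a controlled testbed of imperfect-information strategic games, TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do.
Insight: The novelty is treating explainability as an interaction-dependent property and proposing a multi-view, evidence-grounded evaluation framework that turns explanations into structured objects that can be verified and compared, rather than free-text narratives.
Abstract: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.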
[9] Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text cs.CL | cs.LG
Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie
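TL;DR: This paper proposes POP, a self-play framework for extending post-training of large language models (LLMs) to open-ended tasks. The same LLM synthesizes an evaluation rubric and an input-output pair for each example, and the rubric is then used to score outputs and train the model. The method is grounded on a content-rich pretraining corpus to reduce reward hacking and prevent mode collapse. On Qwen-2.5-7B, POP improves both pretrained and instruction-tuned models on tasks such as healthcare QA, creative writing, and instruction following.
Details
Motivation: Existing self-play methods are applied mainly to verifiable tasks such as math and coding. This paper aims to extend self-play to more realistic open-ended tasks, generating high-quality training data cheaply and reducing reliance on humans or expensive proprietary models.
Result: On Qwen-2.5-7B, POP improves both pretrained and instruction-tuned models on long-form healthcare QA, creative writing, and instruction following.
Insight: The novelty is a self-play framework (POP) that incorporates rubric synthesis and uses a pretraining corpus to ensure a generation-verification gap, which reduces reward hacking and mode collapse and offers a new approach to post-training on open-ended tasks.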
Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
[10] Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework cs.CL
Chenyuan Zhang, Qiguang Chen, Xie Chen, Zhuotao Tian, Bowen Xing
TL;DR: This paper proposes UL-XCoT, an efficient unified-logic cross-lingual chain-of-thought reasoning framework that aims to cut computational redundancy in cross-lingual reasoning. It selects a small set of candidate languages in a unified logic space and monitors logic trajectories during decoding to prune low-quality reasoning paths, substantially reducing token consumption and latency at inference time.
Details
Motivation: Existing cross-lingual chain-of-thought (XCoT) methods are expensive because they sample full reasoning trajectories extensively across many languages, and the representations of multilingual large language models (LLMs) vary by language, which hinders direct feature comparison and effective pruning.
Result: In experiments with DeepSeek-R1-Distill-Qwen-7B on PolyMath (18 languages) and MMLU-ProX-Lite (29 languages), UL-XCoT keeps competitive accuracy while cutting decoding token cost by more than 50% relative to prior sampling baselines, and shows more stable gains and stronger robustness on low-resource languages.
Insight: The novelty is a language-invariant unified logic space that supports cross-lingual candidate language selection and dynamic monitoring and pruning of reasoning paths, giving higher inference efficiency under a limited sampling budget. This offers a new way to reduce the compute overhead of complex multilingual reasoning.
Abstract: Cross-lingual chain-of-thought (XCoT) with self-consistency markedly enhances multilingual reasoning, yet existing methods remain costly due to extensive sampling of full trajectories across languages. Moreover, multilingual LLM representations vary strongly by language, hindering direct feature comparisons and effective pruning. Motivated by this, we introduce UL-XCoT, the first efficient unified logic cross-lingual reasoning framework that minimizes redundancy in token usage and latency, yielding the greatest efficiency under limited sampling budgets during inference. Specifically, UL-XCoT (1) uses fewer languages by selecting, per query, a small candidate language set in a language-invariant unified logic space, (2) uses fewer tokens by monitoring logic-space trajectory dynamics during decoding to prune low-quality reasoning paths, and (3) aggregates the remaining high-quality trajectories via voting. Experiments on PolyMath across 18 languages and MMLU-ProX-Lite across 29 languages with DeepSeek-R1-Distill-Qwen-7B demonstrate that UL-XCoT achieves competitive accuracy while cutting decoding token cost by over 50% versus prior sampling baselines. UL-XCoT also delivers more stable gains on low-resource languages, underscoring consistently superior robustness where the standard XCoT self-consistency method fails.
[11] To Know is to Construct: Schema-Constrained Generation for Agent Memory cs.CL
Lei Zheng, Weinan Song, Daili Li, Yanming Yang
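TL;DR: This paper proposes SCG-MEM, an agent memory architecture grounded in constructivist epistemology that recasts memory access as schema-constrained generation. It targets two problems: the noise in dense-retrieval memory systems from entries that are semantically similar but contextually mismatched, and the "structural hallucination" that open-ended generation can cause.
Details
Motivation: Existing dense-retrieval agent memory systems rely heavily on semantic overlap or entity matching within sentences and struggle to distinguish instances that are semantically similar but contextually different, so they retrieve many mismatched, noisy entries. Using open-ended generation for memory access directly can instead produce "structurally hallucinated" keys that do not exist in memory, causing lookup failures.
Result: Experiments on the LoCoMo benchmark show that SCG-MEM substantially outperforms retrieval-based baselines in all categories.
Insight: The core idea is to formalize memory access as schema-constrained generation: a dynamic cognitive schema strictly constrains LLM decoding so that only valid memory entry keys are generated, which rules out structural hallucination by construction. Memory updates are modeled through assimilation (grounding inputs in existing schemas) and accommodation (extending schemas with new concepts), and an associative graph enables multi-hop reasoning through activation propagation.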
Abstract: Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks “Structural Hallucination” where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
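The guarantee against structural hallucination rests on constraining decoding to valid memory keys. A common way to get that property, sketched here, is to store the valid keys in a trie and let the decoder choose only among the trie's children at each step. Whether SCG-MEM uses a trie is an assumption; the key tuples, the `END` marker, and the toy scoring function are all hypothetical stand-ins for a schema and an LLM's token scores.

```python
END = "<end>"  # hypothetical end-of-key marker


def build_trie(keys):
    """Build a nested-dict trie from keys given as tuples of tokens."""
    root = {}
    for key in keys:
        node = root
        for tok in key:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_decode(trie, score):
    """Greedy decoding restricted to paths in the trie.

    score(prefix, token) stands in for the model's next-token score.
    The result is always one of the keys used to build the trie.
    """
    if not trie:
        raise ValueError("schema contains no keys")
    node, out = trie, []
    while True:
        # Only tokens that continue some valid key are ever considered.
        tok = max(node, key=lambda t: score(out, t))
        if tok == END:
            return tuple(out)
        out.append(tok)
        node = node[tok]


keys = [("user", "diet", "vegan"), ("user", "diet"), ("user", "job")]
# Toy scores: the "model" would love to emit "pet", which is not in the schema.
prefs = {"pet": 5.0, "user": 1.0, "diet": 0.9, "vegan": 0.8, "job": 0.2, END: 0.1}
print(constrained_decode(build_trie(keys), lambda prefix, tok: prefs.get(tok, 0.0)))
```

Because the out-of-schema token "pet" is never offered as a candidate, the decoder cannot produce a key that is missing from memory, no matter how the scores fall.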
[12] AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce cs.CL | cs.IR
Biao Zhang, Lixin Chen, Bin Zhang, Zongwei Wang, Tong Liu
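TL;DR: This paper proposes Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL) to address the weak fine-grained semantic understanding of multimodal representation models in e-commerce. It frames fine-grained product understanding as an attribute generation task, uses a multimodal large language model (MLLM) to extract key attributes from product images and text, and strengthens representation learning with a two-stage training framework (attribute-guided contrastive learning and retrieval-aware attribute reinforcement).
Details
Motivation: Large multimodal representation models such as VLM2Vec perform well at multimodal understanding but struggle with the fine-grained semantics needed to tell apart highly similar items, which is essential for e-commerce tasks such as identical product retrieval.
Result: Extensive experiments on large-scale e-commerce datasets show that the method reaches state-of-the-art (SOTA) performance on multiple downstream retrieval tasks.
Insight: The novelty is reframing fine-grained understanding as attribute generation and designing a two-stage cooperative training framework in which the generative ability of the MLLM guides and reinforces representation learning, putting generative models to work for fine-grained representation learning.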
Abstract: Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM’s attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.
[13] Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving cs.CL
Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang
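TL;DR: This paper proposes the Dual-Cluster Memory Agent (DCM-Agent) to address the structural ambiguity that large language models face in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms. The method uses historical solutions without training: it builds a memory with a modeling cluster and a coding cluster and distills each cluster into three kinds of structured knowledge (Approach, Checklist, and Pitfall) that provide generalizable guidance. At inference time, the agent uses memory-augmented inference to navigate solution paths dynamically, detect and repair errors, and switch reasoning paths adaptively. Experiments on seven optimization benchmarks show an average improvement of 11%-21%. The analysis also reveals a "knowledge inheritance" phenomenon: memory built by larger models can guide smaller models to better performance.
Details
Motivation: Large language models run into structural ambiguity when solving optimization problems: one problem maps to several conflicting modeling paradigms, which hinders the generation of effective solutions.
Result: On seven optimization benchmarks, DCM-Agent achieves an average improvement of 11%-21%. The analysis also finds "knowledge inheritance": memory built by larger models improves the performance of smaller models.
Insight: The core contributions are dual-cluster memory construction, which organizes historical solutions into modeling and coding clusters and distills them into structured knowledge (Approach, Checklist, Pitfall), and the memory-augmented inference mechanism. The training-free use of historical knowledge, the structured knowledge representation, and the demonstrated cross-model transfer (knowledge inheritance) are the ideas most worth borrowing.
Abstract: Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster’s content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a “knowledge inheritance” phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework’s scalability and efficiency.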
[14] Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context cs.CL
Yilun Zhu, Yuan Zhuang, Nikhita Vedula, Dushyanta Dhyani, Shaoyuan Xu
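TL;DR: This paper proposes Quantile Token Regression, a method for LLM-based text regression that predicts a full conditional distribution rather than a single value. It inserts dedicated quantile tokens into the input sequence to create a direct input-output pathway for each quantile, and uses retrieval to bring in semantically similar neighbor instances and their empirical distributions, grounding predictions locally.
Details
Motivation: Current LLM-based distributional regression has two key limitations: distribution estimates lack local grounding, and reliance on shared representations creates an indirect bottleneck between inputs and quantile outputs. This paper addresses both to obtain more accurate and reliable distribution predictions.
Result: On the Inside Airbnb and StackSample benchmark datasets, with LLMs from 1.7B to 14B parameters, quantile tokens with neighbors consistently outperform baselines (about 4 points lower mean absolute percentage error and prediction intervals half as wide). Gains are largest on smaller and more challenging datasets, where the method yields sharper and more accurate distributions.
Insight: The contributions are inserting dedicated quantile tokens into the input sequence for the first time, giving direct input-output pathways through self-attention; retrieving neighbor instances to strengthen local grounding; and providing the first theoretical analysis of quantile regression loss functions, clarifying which distributional objective each optimizes. The structural design improves both the accuracy and the interpretability of distribution predictions.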
Abstract: Many applications of LLM-based text regression require predicting a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes, and the target distribution is represented by a dense grid of quantiles. We address two key limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations that create an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling direct input-output pathways for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions to ground predictions with local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objectives each optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (~4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets where quantile tokens produce substantially sharper and more accurate distributions.
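Quantile outputs of this kind are commonly trained with the pinball (quantile) loss. The digest does not say which losses the paper analyzes, so the sketch below shows only the standard pinball loss as an assumed example; the function names and the numbers are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one prediction at quantile level q in (0, 1).

    Under-prediction is weighted by q, over-prediction by (1 - q).
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


def mean_pinball_loss(y_true, preds_by_quantile):
    """Average pinball loss over a grid given as {quantile_level: prediction}."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in preds_by_quantile.items()]
    return sum(losses) / len(losses)


# For the 0.9 quantile, predicting too low costs far more than predicting too high.
print(pinball_loss(10.0, 8.0, 0.9))   # under-prediction by 2
print(pinball_loss(10.0, 12.0, 0.9))  # over-prediction by 2
print(mean_pinball_loss(10.0, {0.1: 7.0, 0.5: 10.0, 0.9: 13.0}))
```

The asymmetry is what pushes each output toward its own quantile: minimizing the loss at level q in expectation yields the q-th quantile of the target distribution.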
[15] Hybrid Policy Distillation for LLMs cs.CL | cs.AI
Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
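TL;DR: This paper proposes Hybrid Policy Distillation (HPD), a new method for compressing large language models (LLMs). It combines the strengths of forward and reverse KL divergence and pairs off-policy data with lightweight, approximate on-policy sampling, aiming to improve the optimization stability, computational efficiency, and final performance of knowledge distillation.
Details
Motivation: The effectiveness of existing knowledge distillation (KD) methods depends on several intertwined choices (divergence direction, optimization strategy, and data regime), and a unified design perspective is missing. This paper provides a unified framework that connects existing methods and addresses the challenge of balancing mode coverage against mode seeking.
Result: HPD is validated on long-generation math reasoning and on short-generation dialogue and code tasks, where it improves optimization stability, computational efficiency, and final performance across model families and scales.
Insight: The main contributions are reformulating knowledge distillation as a reweighted token-level log-likelihood objective, and proposing HPD, which combines forward and reverse KL divergence and mixes off-policy with approximate on-policy data to balance mode coverage and mode seeking, yielding a more stable and efficient distillation framework.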
Abstract: Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
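To make the two divergence directions concrete, the sketch below computes forward KL, KL(teacher || student), and reverse KL, KL(student || teacher), on toy next-token distributions. The linear interpolation in `hybrid_kl` and its weight `alpha` are assumptions for illustration only; the abstract does not say how HPD actually combines the two terms.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_kl(teacher, student, alpha):
    """Hypothetical mix: alpha * forward KL + (1 - alpha) * reverse KL."""
    forward = kl(teacher, student)  # mode-covering direction
    reverse = kl(student, teacher)  # mode-seeking direction
    return alpha * forward + (1 - alpha) * reverse


teacher = [0.49, 0.49, 0.02]   # two strong modes
covering = [0.40, 0.40, 0.20]  # student that spreads mass over both modes
peaked = [0.96, 0.02, 0.02]    # student that commits to one mode

for name, student in [("covering", covering), ("peaked", peaked)]:
    print(name, kl(teacher, student), kl(student, teacher), hybrid_kl(teacher, student, 0.5))
```

On these toy numbers, forward KL penalizes the peaked student far more heavily than the covering one because it ignores a teacher mode, which is the mode-coverage pressure the abstract refers to.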
[16] RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings cs.CL | cs.LG
Wei Han, David Martinez, Anna Khanina, Lawrence Cavedon, Karin Verspoor
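TL;DR: This paper proposes RADS (Reinforcement Adaptive Domain Sampling), a sample selection strategy based on reinforcement learning. It targets low-resource, class-imbalanced clinical settings, where traditional active learning methods such as uncertainty sampling and diversity sampling tend to pick outliers rather than truly informative samples and performance degrades as a result.
Details
Motivation: To make few-shot fine-tuning more effective in low-resource, class-imbalanced clinical settings, where traditional active learning methods perform poorly.
Result: Experiments on several real-world clinical datasets show that, compared with traditional methods, RADS improves model transferability and stays robust under extreme class imbalance.
Insight: The novelty is applying a reinforcement learning framework to sample selection so that the most informative samples are identified adaptively, improving transfer learning in low-resource, imbalanced settings.
Abstract: A common strategy in transfer learning is few-shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real-world clinical datasets show our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance compared to traditional methods.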
[17] Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking cs.CL
Mo Zhou, Jianwei Wang, Kai Wang, Helen Paik, Ying Zhang
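TL;DR: This paper proposes MSR-MEL, an unsupervised multimodal entity linking framework that uses large language models for multi-perspective evidence synthesis and reasoning. The framework has two stages. The offline stage synthesizes a multi-perspective evidence set comprising instance-centric, group-level, lexical, and statistical evidence, with group-level evidence aggregating neighborhood information through a graph neural network. The online stage uses an LLM as a reasoning module that analyzes the correlation and semantics of this evidence to induce an effective ranking strategy, enabling accurate entity linking without supervision.
Details
Motivation: Most existing multimodal entity linking methods focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies underexplored. Inspired by how human experts rely on multi-perspective judgment, the paper addresses the insufficient use of evidence in unsupervised multimodal entity linking.
Result: Extensive experiments on widely used multimodal entity linking benchmarks show that MSR-MEL consistently outperforms state-of-the-art unsupervised methods.
Insight: The novelty is a two-stage unsupervised framework that combines multi-perspective evidence synthesis with LLM reasoning. Its core contribution is the synthesis of group-level evidence, which aggregates key neighborhood information through a graph neural network, together with the use of an LLM as a reasoning module that analyzes the evidence to induce an effective ranking strategy. This offers a new way to integrate structured evidence with the semantic reasoning of LLMs in unsupervised multimodal tasks.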
Abstract: Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that human experts' decision-making relies on multi-perspective judgment, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of LLMs as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
[18] WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning cs.CL | cs.LG | cs.SE
Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim
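TL;DR: The paper proposes WebGen-R1, an end-to-end reinforcement learning framework designed for project-level website generation. It introduces a scaffold-driven structured generation paradigm to constrain the action space and a novel cascaded multimodal reward that combines structural guarantees, execution-grounded functional feedback, and vision-based aesthetic supervision. Experiments show that the framework turns a 7B base model into one that generates deployable, aesthetically pleasing multi-page websites, outperforming much larger open-source models and rivaling the state-of-the-art DeepSeek-R1 in some respects.
Details
Motivation: The paper addresses the difficulty large language models have with project-level generation of fully functional, visually appealing multi-page websites. Existing methods are usually limited to single-page static websites, while agentic frameworks rely on multi-turn execution with proprietary models, leading to high cost, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning is a promising alternative, but it is bottlenecked by the need for rewards that are both reliable and computationally feasible for website generation.
Result: In extensive experiments, WebGen-R1 transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. It consistently outperforms larger open-source models (up to 72B) and rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment.
Insight: The novelty lies in a scaffold-driven structured generation paradigm for managing the large open-ended action space and a cascaded multimodal reward that couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Viewed objectively, this offers a viable path for scaling small open models from function-level code generation to project-level web application generation, particularly for multi-dimensional evaluation challenges such as subjective aesthetics, cross-page interactions, and functional correctness.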
Abstract: While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
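One way to picture a cascaded reward is as a sequence of gates, where later and more expensive checks run only if earlier ones pass. The sketch below is an assumption-laden illustration: the gate order follows the abstract (structure, then function, then aesthetics), but the function name, the weights, and the partial-credit value of 0.1 are hypothetical and not taken from the paper.

```python
def cascaded_reward(structure_ok, functional_score, aesthetic_score,
                    w_func=0.5, w_aes=0.5):
    """Scalar reward in [0, 1]; later stages count only if earlier stages pass.

    structure_ok:     bool, e.g. the generated project keeps the scaffold intact
    functional_score: float in [0, 1] from execution-based checks (hypothetical)
    aesthetic_score:  float in [0, 1] from a vision-based judge (hypothetical)
    """
    if not structure_ok:
        return 0.0  # stage 1 failed: no further evaluation
    if functional_score <= 0.0:
        return 0.1  # stage 2 failed: small credit for valid structure only
    # stage 3: blend functional and aesthetic feedback on top of the base credit
    return 0.1 + 0.9 * (w_func * functional_score + w_aes * aesthetic_score)
```

Gating in this way means an aesthetically pleasing but broken site cannot outscore a working one, which is the kind of ordering a cascade is meant to enforce.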
[19] DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories cs.CL | cs.AI | cs.LG
Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
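TL;DR: The paper introduces DialToM, a benchmark for evaluating the Theory of Mind (ToM) abilities of large language models (LLMs), in particular their ability to forecast dialogue trajectories from mental states. LLMs perform well at identifying mental states (Literal ToM) but show marked deficits in using those states for Prospective Diagnostic Forecasting (Functional ToM), and their inferences are only weakly semantically similar to human inferences.
Details
Motivation: The goal is to determine whether the ToM abilities exhibited by LLMs stem from robust reasoning or from spurious correlations, and to build a human-verified evaluation benchmark from natural human dialogue for that purpose.
Result: On DialToM, LLMs excel at identifying mental states (Literal ToM), but most of them, with the exception of Gemini 3 Pro, fail to use those states to forecast state-consistent dialogue trajectories (Functional ToM), revealing a significant reasoning asymmetry.
Insight: The novelty lies in separating and evaluating the literal understanding and the functional utility of Theory of Mind, and in introducing a Prospective Diagnostic Forecasting task that directly tests whether models can reason about social trajectories from mental states. This offers a new way to evaluate deeper reasoning in LLMs rather than surface pattern matching.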
Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting – probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
[20] Knowledge Capsules: Structured Nonparametric Memory Units for LLMs cs.CL | cs.AI
Bin Ju, Shenfeng Weng, Danying Zhou, Kunkai Su, Rongkai Xu
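TL;DR: This paper proposes Knowledge Capsules, structured nonparametric memory units for updating and extending the knowledge of large language models. Using an External Key Value Injection framework, capsules are compiled into attention-compatible key-value representations so that external knowledge participates directly in the model's attention computation, outperforming conventional retrieval-augmented generation on multiple QA benchmarks.
Details
Motivation: Large language models encode knowledge in their parameter weights, which makes it costly to update or extend. Retrieval-augmented generation brings in external knowledge through context expansion, but its influence is indirect and unstable, especially in long-context and multi-hop reasoning scenarios.
Result: Across multiple QA benchmarks, the method consistently outperforms RAG and GraphRAG, with higher stability and accuracy in long-context and multi-hop reasoning, and requires no parameter updates.
Insight: The novelty lies in shifting knowledge integration from context-level augmentation to memory-level interaction. Structured nonparametric memory units and an external key-value injection framework let external knowledge participate directly in attention computation, improving the efficiency and stability of knowledge use.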
Abstract: Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model’s attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
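The idea of letting external knowledge take part in attention as key-value pairs, rather than as extra input tokens, can be sketched for a single query vector. This is only an illustration under stated assumptions: the function names are hypothetical, the external keys and values are assumed to be precomputed elsewhere, and the abstract does not say at which layers or heads injection happens.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (plain Python lists)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

def attend_with_injection(query, ctx_keys, ctx_values, ext_keys, ext_values):
    """Attention over the context K/V concatenated with externally supplied K/V."""
    return attend(query, ctx_keys + ext_keys, ctx_values + ext_values)
```

Because the external entries sit in the same softmax as the context entries, a well-matched external key receives attention weight directly, without lengthening the token sequence.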
[21] LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures cs.CL | cs.AI
Yuhang Wu, Qinyuan Liu, Qiuyang Zhao, Qingwei Chong
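TL;DR: This paper proposes LayerTracer, an architecture-agnostic, end-to-end analysis framework that jointly performs task particle localization and layer vulnerability quantification for arbitrary large language model architectures. By extracting hidden states layer by layer and mapping them to vocabulary probability distributions, it reveals how hierarchical representations evolve, where task knowledge forms, and where robustness bottlenecks lie across different LLM architectures.
Details
Motivation: LLM architectures are increasingly diverse, yet how hierarchical representations evolve, where task knowledge forms, and what causes robustness bottlenecks in different architectures remain unclear. This poses core challenges for hybrid architecture design and model optimization.
Result: Experiments on models of different parameter scales show that task particles mainly appear in the deep layers regardless of model size, while larger models exhibit stronger hierarchical robustness.
Insight: The novelty lies in defining two core concepts, the task particle (the key layer where the target token probability first rises significantly) and the vulnerable layer (the layer with the largest JS divergence between output distributions before and after mask perturbation), and in proposing a unified, architecture-agnostic framework for analyzing them jointly. This provides a basis for layer division, module ratios, and gating switches in hybrid architectures, and supports LLM structure design and interpretability research.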
Abstract: Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model’s task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
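The two definitions above can be made concrete with a small sketch. It assumes per-layer target-token probabilities and per-layer output distributions have already been extracted; the threshold `jump` that stands in for "rises significantly" is a hypothetical choice, as are the function names.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def task_particle_layer(target_probs, jump=0.2):
    """First layer whose target-token probability rises by at least `jump`
    over the previous layer; None if no layer qualifies."""
    for layer in range(1, len(target_probs)):
        if target_probs[layer] - target_probs[layer - 1] >= jump:
            return layer
    return None

def vulnerable_layer(clean_dists, perturbed_dists):
    """Index l maximizing JS(clean_dists[l], perturbed_dists[l]), where entry l holds
    the output distribution without and with a mask perturbation at layer l."""
    divs = [js_divergence(p, q) for p, q in zip(clean_dists, perturbed_dists)]
    return max(range(len(divs)), key=divs.__getitem__)
```

For example, with target probabilities `[0.01, 0.02, 0.05, 0.40, 0.90]` the first jump of at least 0.2 occurs at layer index 3.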
[22] LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation cs.CL
Serhii Zabolotnii
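TL;DR: This paper presents LLM StructCore, a two-stage design that combines Schema-Guided Reasoning (SGR) with deterministic compilation to automatically fill dyspnea Case Report Forms (CRFs) from clinical notes. Stage 1 produces a stable JSON summary. Stage 2 is an LLM-free compiler that parses the summary, canonicalizes item names, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. The method achieves high macro-F1 on the development set and the hidden test set and is language-agnostic.
Details
Motivation: The paper tackles automatic CRF filling from clinical notes, which is difficult because of noisy language, strict output contracts, and the high cost of false positives. The task is also extremely sparse (most fields are unknown), and the official scoring penalizes both empty values and unsupported predictions.
Result: On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (English) and 0.6905 (Italian). On the hidden test200, the submitted English variant scores 0.63 on Codabench. The method is language-agnostic: Italian results match or exceed English with no language-specific engineering.
Insight: The novelty is the move from single-step LLM prediction to a two-stage decomposition: Stage 1 uses SGR to generate a stable summary, and Stage 2 uses a fully deterministic, LLM-free compiler for parsing and normalization. This improves robustness and interpretability and yields a language-agnostic pipeline.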
Abstract: Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step “LLM predicts 134 fields” approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.
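The Stage 2 steps (parse, canonicalize, normalize, evidence-gate, expand) can be sketched as a small deterministic function. Everything specific here is hypothetical: the three-item list stands in for the 134 CRF items, the JSON layout of the Stage 1 summary and the vocabulary mapping are invented for illustration, and the evidence gate is a plain substring check rather than the submission's actual filter.

```python
import json

# Hypothetical stand-ins: the real CRF has 134 items and an official vocabulary.
ITEMS = ["dyspnea", "cough", "fever"]
VOCAB = {"yes": "present", "present": "present", "no": "absent", "absent": "absent"}

def canonicalize(name):
    """Normalize an item name to a canonical key."""
    return name.strip().lower().replace(" ", "_")

def compile_crf(stage1_json, note_text):
    """Deterministically expand a Stage 1 JSON summary into the full item list."""
    summary = json.loads(stage1_json)
    output = {item: "unknown" for item in ITEMS}  # sparse default for every item
    note = note_text.lower()
    for entries in summary.values():  # one list of entries per domain key
        for entry in entries:
            item = canonicalize(entry["item"])
            value = VOCAB.get(entry["value"].strip().lower())
            evidence = entry.get("evidence", "").lower()
            # Evidence gate: keep a prediction only if its quoted evidence
            # actually appears in the clinical note.
            if item in output and value and evidence and evidence in note:
                output[item] = value
    return output
```

Because no model call occurs in this stage, the same Stage 1 summary always compiles to the same output, which is what makes the contract checkable.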
[23] Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains cs.CL
Seunghyun Park, Yuanyuan Lei
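TL;DR: This paper proposes an intervention framework that targets the fragility of large language models in multi-step logical reasoning. It finds that logical connectives are high-entropy forking points in reasoning chains, where models easily choose the wrong logical direction. Using gradient-based steering, localized look-ahead search, and targeted preference optimization, the framework intervenes only at critical logical junctions, improving reasoning accuracy while preserving efficiency.
Details
Motivation: Large language models are fragile in multi-step logical reasoning: an error in a single step propagates along the chain and leads to unstable performance. The authors find that logical connectives are the main points of this structural fragility, where models struggle to determine the correct logical direction.
Result: By concentrating intervention on critical logical transitions, the framework achieves a better accuracy-efficiency trade-off than global inference-time scaling methods such as beam search and self-consistency, while remaining efficient.
Insight: The novelty lies in localizing the fragility of logical reasoning to specific logical connectives (high-entropy forking points) and proposing a multi-layered intervention framework. Through gradient-based steering, localized search, and surgical reinforcement learning, it selectively optimizes single-token preferences at key junctions, achieving precise and efficient guidance of reasoning.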
Abstract: While LLMs demonstrate impressive reasoning capabilities, they remain fragile in multi-step logical deduction, where a single transition error can propagate through the entire reasoning chain, leading to unstable performance. In this work, we identify logical connectives as primary points of this structural fragility. Through empirical analysis, we show that connective tokens function as high-entropy forking points, at which models frequently struggle to determine the correct logical direction. Motivated by this observation, we hypothesize that intervening in logical connective selection can guide LLMs toward a more correct logical direction, thereby improving the overall reasoning chain. To validate this hypothesis, we propose a multi-layered framework that intervenes specifically at these logic-critical junctions in the reasoning process. Our framework includes (1) Gradient-based Logical Steering to guide LLMs' internal representations towards valid reasoning subspaces, (2) Localized Branching to resolve ambiguity via targeted look-ahead search, and (3) Targeted Transition Preference Optimization, a surgical reinforcement learning objective that selectively optimizes single-token preferences at logical pivots. Crucially, by concentrating intervention solely on logic-critical transitions, our framework achieves a favorable accuracy–efficiency trade-off compared to global inference-time scaling methods like beam search and self-consistency.
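The notion of connectives as high-entropy forking points can be illustrated by flagging connective tokens whose predictive distribution was uncertain. This is a rough sketch: the connective list, the entropy threshold, and the assumption that `next_token_dists[i]` is the distribution from which token `i` was chosen are all illustrative choices, not details from the paper.

```python
import math

# Illustrative connective list; the paper's actual inventory is not given here.
CONNECTIVES = {"therefore", "because", "however", "but", "so", "thus"}

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_points(tokens, next_token_dists, threshold=1.0):
    """Indices of connective tokens chosen from a distribution whose entropy
    exceeds `threshold`; these are candidate sites for intervention."""
    return [
        i
        for i, (tok, dist) in enumerate(zip(tokens, next_token_dists))
        if tok.lower() in CONNECTIVES and entropy(dist) > threshold
    ]
```

Restricting attention to these indices is what keeps the intervention local: the rest of the chain is decoded as usual.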
[24] Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents cs.CL
Yuxuan Cai, Jie Zhou, Qin Chen, Liang He
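TL;DR: This paper presents ProactAgent, an experience-driven lifelong learning framework that improves online lifelong agents through proactive retrieval over a structured experience base. It comprises Experience-Enhanced Online Evolution (ExpOnEvo), which continually improves the policy and memory, and Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve.
Details
Motivation: Existing methods typically treat experience retrieval as a passive operation triggered only at task initialization or after a step completes. As a result, agents struggle to identify knowledge gaps during interaction and to proactively retrieve the experience most useful for the current decision.
Result: Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent significantly improves lifelong agent performance, reaching success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and achieving performance competitive with proprietary models on StuLife.
Insight: The novelty lies in modeling retrieval as an explicit policy action and learning when and what to retrieve through paired-branch process rewards. The structured experience base, which contains factual memory, episodic memory, and behavioral skills, lets retrieval supply both relevant evidence and actionable guidance.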
Abstract: Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
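The paired-branch comparison can be sketched as a reward computed from two continuations of the same interaction prefix, one with retrieval and one without. The formula below is an assumption for illustration only: the abstract states that retrieval is encouraged when it improves outcomes or efficiency, but the parameter names, the linear form, and the cost values are hypothetical.

```python
def paired_branch_reward(score_with, score_without, steps_with, steps_without,
                         retrieval_cost=0.05, step_weight=0.01):
    """Step-level reward for a retrieval decision.

    score_*: task outcome of each branch (e.g. 1.0 success, 0.0 failure)
    steps_*: number of steps each branch needed to finish
    """
    outcome_gain = score_with - score_without
    efficiency_gain = step_weight * (steps_without - steps_with)
    # Retrieval is rewarded only if its gains outweigh its (hypothetical) cost.
    return outcome_gain + efficiency_gain - retrieval_cost
```

Under this sketch, retrieving when both branches succeed in the same number of steps yields a negative reward, which is how unnecessary retrieval would be discouraged.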
[25] Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows cs.CL
Shivani Kumar, Adarsh Bharathwaj, David Jurgens
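TL;DR: This paper studies the collaborative performance of multi-agent large language model (LLM) teams in scientific workflows. The authors benchmark 35 open-weight LLMs on six behavioral economics games and find that the cooperative profiles derived from these games robustly predict downstream team performance on science tasks under shared budget constraints, such as data analysis, model building, and report writing.
Details
Motivation: The goal is to test whether an LLM's cooperative behavior in stylized behavioral economics games predicts its performance in realistic, complex multi-agent scientific collaboration, which would provide a low-cost diagnostic for screening models with good cooperative dispositions before deployment.
Result: Models that coordinate effectively and invest in team production in the games, rather than following greedy strategies, produce better scientific reports on three outcomes: accuracy, quality, and completion. The association holds after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs.
Insight: The novelty lies in bringing the behavioral economics games framework to multi-agent LLM evaluation and establishing a predictive link between a model's behavior in stylized cooperation settings and its team performance in complex, resource-constrained scientific workflows. This adds a new dimension to model screening and a fast diagnostic tool.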
Abstract: Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model’s behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.
[26] Intersectional Fairness in Large Language Models cs.CL
Chaima Boufaied, Ronnie De Souza Santos, Ann Barcomb
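TL;DR: This paper systematically evaluates the fairness of six large language models across intersectional demographic attributes, using ambiguous and disambiguated contexts from two benchmark datasets and analyzing bias scores, subgroup fairness metrics, accuracy, and consistency. Modern LLMs perform well in ambiguous contexts, but this limits the informativeness of fairness metrics. In disambiguated contexts, accuracy is influenced by stereotype alignment, especially at race-gender intersections, where accuracy is higher when the correct answer reinforces a stereotype. Subgroup fairness metrics show uneven outcome distributions, response consistency varies, and no model behaves reliably fairly across all intersectional settings.
Details
Motivation: As large language models are increasingly deployed in socially sensitive settings, concerns about fairness and bias are growing, particularly across intersectional demographic attributes. This calls for a systematic evaluation of intersectional fairness.
Result: In ambiguous contexts, LLMs perform well but non-unknown predictions are sparse. In disambiguated contexts, accuracy is influenced by stereotype alignment, with stronger directional bias at race-gender intersections. Subgroup fairness metrics show uneven outcome distributions, response consistency varies, and no model is consistently reliable or fair across all intersectional settings.
Insight: The novelty lies in systematically evaluating intersectional fairness in LLMs by combining bias, subgroup fairness, and consistency metrics, underscoring the importance of evaluation beyond accuracy. Viewed objectively, the study shows that apparent model competence is partly associated with stereotype-consistent cues and provides a multi-dimensional framework for fairness evaluation.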
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
[27] RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering cs.CL
Marisa Hudspeth, Patrick J. Burns, Brendan O’Connor
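TL;DR: This paper introduces RespondeoQA, a benchmark dataset for bilingual Latin-English question answering and translation containing about 7,800 question-answer pairs drawn from Latin pedagogical sources from the 1800s to the present. It covers knowledge-based, skill-based, multihop reasoning, constrained translation, and mixed-language question types, and is the first QA benchmark centered on Latin. As a case study, the authors evaluate three large language models, LLaMa 3, Qwen QwQ, and OpenAI o3-mini, and find that all perform worse on skill-oriented questions. The reasoning models do better on scansion and literary-device tasks but offer limited improvement overall.
Details
Motivation: The paper addresses the lack of QA benchmark datasets for Latin, a specialized linguistic and cultural domain, in order to assess large language models in bilingual (Latin-English) settings, particularly on pedagogical and translation tasks.
Result: LLaMa 3, Qwen QwQ, and OpenAI o3-mini were evaluated on RespondeoQA. All models perform worse on skill-oriented questions. The reasoning models perform better on scansion and literary-device tasks but offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, while LLaMa 3 and o3-mini are more task dependent.
Insight: The novelty lies in creating the first QA benchmark dataset centered on Latin, covering diverse question types, with a creation process that can be easily adapted to other languages. Viewed objectively, the work highlights the limitations of large language models in specialized linguistic and cultural domains and provides a new resource and methodology for multilingual and low-resource language research.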
Abstract: We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models – LLaMa 3, Qwen QwQ, and OpenAI’s o3-mini – finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
cs.CV
[28] Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning cs.CV | cs.AI | cs.LG
Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali
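TL;DR: This paper presents a deep-learning-based automated rabies diagnosis system that analyzes fluorescence microscopy images using transfer learning and data augmentation, aiming to address the shortage of diagnostic resources in low-data settings.
Details
Motivation: Rabies is a major public health problem in Africa and Asia. Conventional diagnosis relies on fluorescence microscopy and trained personnel, but such expertise is scarce in regions with low sample volumes, which motivates an automated AI-based diagnostic solution.
Result: On a dataset of 155 microscopy images, an EfficientNetB0 model with Geometric & Color augmentation, selected through stratified 3-fold cross-validation, achieved the best classification performance on cropped images. TrivialAugmentWide was found to be the most effective data augmentation strategy.
Insight: The novelty lies in systematically comparing four deep learning architectures and three data augmentation strategies in a low-data setting, confirming that transfer learning with targeted data augmentation improves model robustness, and deploying an online tool to support practical use.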
Abstract: Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.
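With only 32 negative images out of 155, stratification matters: each fold should keep roughly the same class ratio. The sketch below shows one simple way to build stratified folds using the class counts from the abstract. The round-robin assignment and the function name are illustrative assumptions, not the authors' procedure.

```python
import random

def stratified_folds(labels, k=3, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class out round-robin
    return folds

labels = [1] * 123 + [0] * 32  # 123 positive and 32 negative, as in the abstract
folds = stratified_folds(labels, k=3)
```

Each fold then holds 41 positive images and 10 or 11 negative ones, so every validation split still contains examples of the minority class.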
[29] KD-Judge: A Knowledge-Driven Automated Judge Framework for Functional Fitness Movements on Edge Devices cs.CV
Shaibal Saha, Fan Li, Yunge Li, Arun Iyengar, Lucas Alves
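TL;DR: This paper proposes KD-Judge, a knowledge-driven automated judging framework for functional fitness movements that addresses the subjectivity and time cost of human judging and the problem of evolving rules. It uses an LLM-based retrieval-augmented generation and chain-of-thought rule-structuring pipeline to convert unstructured rulebooks into executable, machine-readable representations, and combines them with a deterministic rule-based judging system with pose-guided kinematic reasoning to assess repetition validity and temporal boundaries. To improve efficiency on edge devices, it introduces a dual-strategy caching mechanism. Experiments on the CFRep dataset show accurate repetition-level assessment and faster-than-real-time execution.
Details
Motivation: Functional fitness movements are widely used in training, competition, and health programs, but subjective human judgment, time constraints, and evolving rules make it hard to enforce repetition standards consistently. Existing AI-based approaches mainly rely on learned scoring or reference-based comparison and lack explicit rule-based, transparent, and deterministic repetition-level validation.
Result: With judgment evaluation conducted on the CFRep dataset, the system achieves reliable rule-structuring performance and accurate repetition-level assessment, with faster-than-real-time execution (real-time factor RTF < 1). With caching enabled on a resource-constrained edge device, it achieves speedups of up to 3.36x for pre-recorded and 15.91x for live-streaming scenarios over the non-caching baseline.
Insight: The novelty lies in converting unstructured rules into executable, machine-readable representations through an LLM-driven retrieval-augmented generation and chain-of-thought pipeline, and in combining deterministic rules with pose-guided kinematic reasoning for transparent, interpretable judging. The dual-strategy caching mechanism, optimized for edge devices, improves computational efficiency and enables a scalable, rule-grounded repetition-level analysis system.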
Abstract: Functional fitness movements are widely used in training, competition, and health-oriented exercise programs, yet consistently enforcing repetition (rep) standards remains challenging due to subjective human judgment, time constraints, and evolving rules. Existing AI-based approaches mainly rely on learned scoring or reference-based comparisons and lack explicit rule-based reasoning, limiting transparency and deterministic rep-level validation. To address these limitations, we propose KD-Judge, a novel knowledge-driven automated judging framework for functional fitness movements. It converts unstructured rulebook standards into executable, machine-readable representations using an LLM-based retrieval-augmented generation and chain-of-thought rule-structuring pipeline. The structured rules are then incorporated by a deterministic rule-based judging system with pose-guided kinematic reasoning to assess rep validity and temporal boundaries. To improve efficiency on edge devices, including a high-performance desktop and the resource-constrained Jetson AGX Xavier, we introduce a dual-strategy caching mechanism that can be selectively applied to reduce redundant and unnecessary computation. Experiments demonstrate reliable rule-structuring performance and accurate rep-level assessment, with judgment evaluation conducted on the CFRep dataset, achieving faster-than-real-time execution (real-time factor (RTF) < 1). When the proposed caching strategy is enabled, the system achieves up to 3.36x and 15.91x speedups on a resource-constrained edge device compared to the non-caching baseline for pre-recorded and live-streaming scenarios, respectively. These results show that KD-Judge enables transparent, efficient, and scalable rule-grounded rep-level analysis that can complement human judging in practice.
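The two efficiency measures quoted above follow the usual definitions: the real-time factor is processing time divided by media duration, and speedup is baseline time divided by optimized time. The sketch below assumes those standard definitions; the function names and the sample timings in the usage note are hypothetical, not measurements from the paper.

```python
def real_time_factor(processing_seconds, video_seconds):
    """RTF = processing time / media duration; below 1 means faster than real time."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    return processing_seconds / video_seconds

def speedup(baseline_seconds, cached_seconds):
    """How many times faster the cached run is than the non-caching baseline."""
    if cached_seconds <= 0:
        raise ValueError("cached_seconds must be positive")
    return baseline_seconds / cached_seconds
```

For instance, judging a 60-second clip in 30 seconds gives an RTF of 0.5, and a run that drops from 33.6 seconds to 10 seconds corresponds to a 3.36x speedup.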
[30] Environmental Understanding Vision-Language Model for Embodied Agent cs.CV | cs.AI
Jinsik Bang, Jaeyeon Bae, Donggyu Lee, Siyeol Jung, Taehwan Kim
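TL;DR: This paper proposes Environmental Understanding Embodied Agent (EUEA), a framework that improves the environmental understanding of instruction-following embodied agents by fine-tuning a vision-language model (VLM) on four core skills: object perception, task planning, action understanding, and goal recognition. It adds a recovery step and a group relative policy optimization (GRPO) stage to correct errors and refine skill predictions, yielding more reliable task execution.
Details
Motivation: Although vision-language models show strong perception and reasoning, existing instruction-following embodied agents remain limited in environmental understanding, for example failing on interactions or relying on environment metadata. Their environmental understanding needs to improve for more reliable task execution.
Result: On the ALFRED benchmark, the proposed VLM framework significantly outperforms a behavior-cloning baseline, improving average success rate by 8.86%. The recovery step and GRPO stage add a further 3.03% gain, improving overall agent performance.
Insight: The novelty lies in systematically defining and fine-tuning four core environmental-understanding skills and adding a recovery mechanism and a GRPO stage to make the agent more robust. Viewed objectively, this skill decomposition and iterative refinement offers a reusable framework for applying VLMs to embodied tasks.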
Abstract: Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named Environmental Understanding Embodied Agent (EUEA), which fine-tunes four core skills: 1) object perception for identifying relevant objects, 2) task planning for generating interaction subgoals, 3) action understanding for judging success likelihood, and 4) goal recognition for determining goal completion. By fine-tuning VLMs with EUEA skills, our framework enables more reliable task execution for instruction-following. We further introduce a recovery step that leverages these core skills and a group relative policy optimization (GRPO) stage. The recovery step samples alternative actions to correct failure cases, and the GRPO stage refines inconsistent skill predictions. Across ALFRED tasks, our VLM significantly outperforms a behavior-cloning baseline, achieving an 8.86% improvement in average success rate. The recovery and GRPO stages provide an additional 3.03% gain, further enhancing overall performance. Finally, our skill-level analyses reveal key limitations in the environmental understanding of closed- and open-source VLMs and identify the capabilities necessary for effective agent-environment interaction.
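The core of group relative policy optimization is that each sampled response is scored against the other responses in its own group, instead of against a learned value function. The sketch below shows only that advantage computation, under the common formulation of subtracting the group mean and dividing by the group standard deviation. How EUEA defines its rewards is not stated in the abstract, so the reward values used in the usage note are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the two successful samples receive positive advantages and the two failures receive negative ones of equal magnitude. If every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.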
[31] If you’re waiting for a sign… that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems cs.CV | cs.AI
Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere
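TL;DR: This paper studies trust boundary confusion in embodied Vision-Language Agentic Systems (VLAS) built on large vision-language models (LVLMs): agents struggle to tell legitimate environmental signals from malicious visual injections. The authors build a dual-intent dataset and evaluation framework, find that existing LVLM agents cannot reliably balance this trade-off, and propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs, significantly reducing misled behavior while preserving correct responses under adversarial perturbations.
Details
Motivation: The paper addresses the security challenge posed by the dual nature of environmental signals in embodied vision-language agentic systems. Agents must respond to legitimate environmental cues such as traffic lights while remaining robust to misleading visual injections that can override user intent and pose security risks. This tension is called trust boundary confusion.
Result: Seven LVLM agents were systematically evaluated across multiple embodied settings under both structure-based and noise-based visual injections, showing that existing agents cannot reliably balance the trade-off. The proposed multi-agent defense framework significantly reduces misled behavior while preserving correct responses and provides robustness guarantees under adversarial perturbations.
Insight: The claimed contributions are: 1) identifying and formalizing trust boundary confusion in VLAS; 2) building a dual-intent dataset and evaluation framework to quantify it; and 3) proposing a multi-agent defense framework that separates perception from decision-making and mitigates the vulnerability through dynamic reliability assessment. Viewed objectively, the core novelty is bringing a security perspective to visual signal processing in embodied agents and improving robustness through architectural separation rather than model improvement alone, which offers a new direction for secure VLAS design.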
Abstract: Recent advances in embodied Vision-Language Agentic Systems (VLAS), powered by large vision-language models (LVLMs), enable AI systems to perceive and reason over real-world scenes. Within this context, environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual-intent dataset and evaluation framework, through which we show that current LVLM-based agents fail to reliably balance this trade-off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure-based and noise-based visual injections. To address these vulnerabilities, we propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at https://anonymous.4open.science/r/Visual-Prompt-Inject.
[32] Wan-Image: Pushing the Boundaries of Generative Visual Intelligence cs.CV
Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao
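TL;DR: Wan-Image is a unified visual generation system intended to turn image generation models from casual synthesizers into professional-grade productivity tools. By combining the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, it addresses the bottlenecks of existing models in controllability, complex typography rendering, and strict identity preservation. Built on large-scale multi-modal data scaling, a fine-grained annotation engine, and reinforcement learning data, the system offers professional capabilities including ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis.
Details
Motivation: Although current diffusion models excel at aesthetic generation, they frequently hit bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. The paper aims to solve these problems and elevate image generation models into professional productivity tools.
Result: Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance and reaches parity with Nano Banana Pro on challenging tasks.
Insight: The core novelty is the natively unified multi-modal architecture, in which the cognitive capabilities of large language models work together with the high-fidelity pixel synthesis of diffusion transformers to translate nuanced user intent into precise visual output. In addition, large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and carefully curated reinforcement learning data jointly unlock expert-level professional generation capabilities that go beyond basic instruction following.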
Abstract: We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
[33] SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze cs.CV
Pavan Kumar Sharma, Pranamesh Chakraborty
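TL;DR: This paper proposes SGAP-Gaze, a driver point-of-gaze estimation network that uses a Transformer-based scene-grid attention mechanism to fuse driver facial information with traffic-scene images and improve the accuracy of point-of-gaze estimation. The authors also build UD-FSG, a benchmark dataset of synchronized driver-face and traffic-scene images, for training and testing the model.
Details
Motivation: Existing driver gaze estimation models rely mainly on driver facial information and ignore the contextual cues provided by the surrounding traffic scene. The paper aims to build a more robust and accurate driver point-of-gaze model by using facial information and scene images together, in order to better understand the driver's situational awareness of the traffic environment.
Result: SGAP-Gaze achieves a mean pixel error of 104.73 on the proposed UD-FSG dataset and 63.48 on the LBW dataset. Compared with state-of-the-art driver gaze estimation models, mean pixel error is reduced by 23.5%. Spatial pixel distribution analysis shows that the model achieves lower error across all spatial ranges, including the critical outer regions of the scene.
Insight: The core novelty is explicitly incorporating traffic-scene images into point-of-gaze modelling through a Transformer-based scene-grid attention mechanism, achieving multi-modal fusion of the facial modalities and scene context. Viewed objectively, this scene-aware attention mechanism offers a new and effective way to understand drivers' visual attention in complex dynamic environments, and the multi-modal benchmark dataset UD-FSG is a valuable resource for research in this area.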
Abstract: Driver gaze estimation is essential for understanding the driver’s situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which can help improve the gaze estimation model, along with the face images. We propose SGAP-Gaze, Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism fusing face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on LBW dataset, achieving a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.
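The abstract describes computing attention scores over a spatial scene grid and reports mean pixel error. The sketch below is only an illustration of that idea, not the authors' network: it assumes scaled dot-product attention between a gaze intent vector and per-cell scene features, and it assumes the point of gaze is read out as the attention-weighted average of cell centres (the abstract does not say how the PoG is derived from the scores). All function names and the toy grid are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point_of_gaze(gaze_intent, cell_features, cell_centers):
    """Attend over scene-grid cells; return the attention-weighted centre (x, y).

    Assumption: scaled dot-product scores and an expected-coordinate readout.
    """
    d = len(gaze_intent)
    scores = [
        sum(q * k for q, k in zip(gaze_intent, feat)) / math.sqrt(d)
        for feat in cell_features
    ]
    weights = softmax(scores)
    x = sum(w * cx for w, (cx, _) in zip(weights, cell_centers))
    y = sum(w * cy for w, (_, cy) in zip(weights, cell_centers))
    return (x, y), weights


def mean_pixel_error(preds, targets):
    """Mean Euclidean distance in pixels between predicted and true points."""
    return sum(math.dist(p, t) for p, t in zip(preds, targets)) / len(preds)
```

With a 2x2 toy grid, a gaze intent vector that aligns strongly with one cell's feature pulls the predicted point close to that cell's centre; `mean_pixel_error` then averages the Euclidean distance to the ground-truth points.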
[34] MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV | cs.AI | cs.LG
Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang
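TL;DR: MMCORE is a unified framework for multimodal image generation and editing. It uses a pre-trained vision-language model (VLM) to predict semantic visual embeddings via learnable query tokens and feeds them to a diffusion model as conditioning signals, efficiently transferring the VLM's rich understanding and reasoning capabilities into the visual generation process.
Details
Motivation: To design an efficient, unified framework that transfers the strong semantic understanding of pre-trained VLMs to image generation, while avoiding deep fusion between autoregressive and diffusion models or training from scratch, thereby reducing computational overhead.
Result: Across a broad range of text-to-image and single/multi-image editing benchmarks, MMCORE consistently outperforms state-of-the-art baselines and achieves high-fidelity synthesis.
Insight: The novelty is bridging the VLM and the diffusion model with learnable query tokens so that generation is conditioned on semantic embeddings. This is a parameter-efficient, lightweight approach that needs no deep architectural fusion; the idea of using a large pre-trained model as a semantic controller to strengthen a generative model is worth borrowing.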
Abstract: We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.
[35] UniCon3R: Contact-aware 3D Human-Scene Reconstruction from Monocular Video cs.CV
Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu
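TL;DR: UniCon3R is a unified feed-forward framework for online, physically contact-aware 4D reconstruction of 3D humans and scenes from monocular video. It explicitly models 3D contact between the human pose and the scene geometry and uses the contact information as a corrective cue when generating the final pose, jointly recovering high-fidelity scene geometry and spatially aligned 3D humans within the scene. This addresses physically implausible artifacts of existing methods, such as floating bodies or scene penetration.
Details
Motivation: Existing feed-forward methods can reconstruct world-coordinate human motion and scenes in real time, but they often produce physically implausible artifacts such as bodies floating above the ground or penetrating parts of the scene. The key reason is that they do not model the physical interaction between the human and the environment. Predicting contact merely as an auxiliary output is not enough; contact must be actively used to correct the reconstruction.
Result: Experiments on human-centric video benchmarks including RICH, EMDB, 3DPW, and SLOPER4D show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while achieving real-time online inference.
Insight: The core novelty is treating contact as a powerful internal prior rather than merely an external metric. By inferring 3D contact from the human pose and scene geometry and actively using it as a corrective cue to generate the final pose, the work establishes a new, physically grounded paradigm for joint human-scene reconstruction.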
Abstract: We introduce UniCon3R (Unified Contact-aware 3D Reconstruction), a unified feed-forward framework for online human-scene 4D reconstruction from monocular videos. Recent feed-forward methods enable real-time world-coordinate human motion and scene reconstruction, but they often produce physically implausible artifacts such as bodies floating above the ground or penetrating parts of the scene. The key reason is that existing approaches fail to model physical interactions between the human and the environment. A natural next step is to predict human-scene contact as an auxiliary output – yet we find this alone is not sufficient: contact must actively correct the reconstruction. To address this, we explicitly model interaction by inferring 3D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the final pose. This enables UniCon3R to jointly recover high-fidelity scene geometry and spatially aligned 3D humans within the scene. Experiments on standard human-centric video benchmarks such as RICH, EMDB, 3DPW and SLOPER4D show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while achieving real-time online inference. We experimentally demonstrate that contact serves as a powerful internal prior rather than just an external metric, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. Project page is available at https://surtantheta.github.io/UniCon3R .
[36] Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning cs.CV | cs.AI
Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu, Deepak Kumar, Shefalika Gautam
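TL;DR: This paper presents Infection-Reasoner, a compact 4B-parameter vision-language model for chronic wound infection classification and evidence-grounded clinical reasoning. A two-stage training pipeline addresses the scarcity of expert-labeled data: first, GPT-5.1 generates chain-of-thought rationales for unlabeled wound images, which are used for knowledge distillation; then reinforcement learning post-training on a small labeled dataset refines classification reasoning. On a heterogeneous wound dataset the model outperforms several baselines, including GPT-5.1, in accuracy, sensitivity, and specificity, and its generated rationales were evaluated by multimodal large language model judges and by wound experts.
Details
Motivation: Assessing chronic wound infection from photographs is challenging because visual appearance varies with wound etiology, anatomical location, and imaging conditions. Existing image-based deep learning methods focus mainly on classification and offer limited interpretability, whereas clinical decision making needs evidence-grounded explanations.
Result: On a held-out heterogeneous wound dataset, Infection-Reasoner reaches 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines including GPT-5.1. For rationale quality, visual-support agreement scores across four MLLM judges range from 0.722 to 0.903, and expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
Insight: The novelty is a compact vision-language model dedicated to wound infection classification and reasoning, together with a two-stage training pipeline (reasoning distillation followed by reinforcement learning post-training) that addresses the scarcity of expert-labeled data in this domain and combines strong performance with high interpretability. Viewed objectively, combining chain-of-thought distillation with domain-specific reinforcement learning fine-tuning offers an effective recipe for building interpretable models for data-scarce medical vision tasks.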
Abstract: Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
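The three headline numbers are standard confusion-matrix metrics. As a reminder of how they relate, here is a minimal sketch, assuming binary labels with 1 = infected and 0 = not infected and at least one example of each class; the label encoding and the function name `binary_metrics` are illustrative and not taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), specificity (recall on negatives).

    Assumes labels are 0/1 and both classes occur in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of infected wounds that are caught
        "specificity": tn / (tn + fp),  # share of uninfected wounds correctly cleared
    }
```

Reporting sensitivity and specificity alongside accuracy matters for a screening task like this one, since accuracy alone can hide a model that misses most infected cases when classes are imbalanced.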
[37] Visual Reasoning through Tool-supervised Reinforcement Learning cs.CV
Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha
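TL;DR: This paper proposes ToolsRL, a tool-supervised reinforcement learning framework that helps multimodal large language models master tool use for complex visual reasoning tasks. It uses a two-stage curriculum: the model first learns tool calling from tool-specific rewards, then is trained with task-accuracy rewards, which avoids optimization conflict. Experiments show the method efficiently improves tool-use ability on complex visual tasks.
Details
Motivation: To address the difficulty multimodal large language models have in learning and using tools (such as zoom, rotate, and draw) for complex visual reasoning, and thereby improve their task-solving ability.
Result: Experiments show that ToolsRL trains efficiently and yields strong tool-use capabilities on complex visual reasoning tasks, but the abstract names no specific benchmarks and gives no quantitative comparisons.
Insight: The novelty is a tool-supervised reinforcement learning curriculum that optimizes tool-calling and task solving in separate stages, using simple, interpretable visual tools whose supervision is easy to collect, and avoiding optimization conflict among heterogeneous tasks.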
Abstract: In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.
[38] Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens cs.CV
Xinxuan Lu, Charless Fowlkes, Alexander C. Berg
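TL;DR: This paper proposes a method for precise camera control in text-to-image generation by learning parametric camera tokens. It combines 3D-rendered images with photorealistic augmentation data and achieves state-of-the-art accuracy while preserving image quality and prompt fidelity.
Details
Motivation: Current text-to-image models struggle to provide precise camera control through natural language alone. The paper addresses this by introducing global scene understanding to achieve precise control over camera viewpoint.
Result: Qualitative and quantitative experiments show state-of-the-art accuracy while preserving image quality and prompt fidelity, and the factorized geometric representations learned by the camera tokens transfer to unseen object categories.
Insight: By learning parametric camera tokens, the method embeds explicit 3D camera structure in the text-vision latent space, opening a path toward geometrically aware prompts and avoiding the overfitting to object-specific appearance correlations seen in prior methods.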
Abstract: Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/
[39] DistortBench: Benchmarking Vision Language Models on Image Distortion Identification cs.CV | cs.AI | cs.LG | cs.RO
Divyanshu Goyal, Akhil Eppa, Vanya Bannihatti Kumar
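TL;DR: This paper presents DistortBench, a diagnostic benchmark for evaluating no-reference distortion perception in vision-language models (VLMs). It contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels. Of the 18 VLMs evaluated, even the best reaches only 61.9% accuracy, below the human majority-vote baseline of 65.7%, showing a clear weakness of current VLMs in low-level visual perception.
Details
Motivation: VLMs are increasingly used in areas where sensitivity to low-level image degradation matters, such as content moderation, image restoration, and quality monitoring, yet their ability to recognize distortion type and severity is poorly understood. A dedicated benchmark is therefore needed to measure and improve low-level visual perception in VLMs.
Result: Eighteen VLMs were evaluated on DistortBench (17 open-weight models and one proprietary model). The best model reaches 61.9% accuracy, below the human majority-vote baseline of 65.7% (average individual: 60.2%). The analysis also finds weak, non-monotonic scaling with model size, performance drops in most base-thinking pairs, and distinct severity-response patterns across model families.
Insight: The contribution is DistortBench, a comprehensive diagnostic benchmark dedicated to no-reference distortion perception in VLMs that systematically covers many distortion types and severity levels. Viewed objectively, the study exposes a significant deficiency of current VLMs in low-level visual understanding and points to an important direction for model improvement; the benchmark design also accounts for human perceptual calibration (inherited from KADID-10k) and extends it with rotation distortions.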
Abstract: Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base–thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.
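The abstract reports two human reference points: a majority-vote baseline and an average individual score. The sketch below shows one plausible way to compute both from raw annotator answers. It is an illustration under assumptions, not the benchmark's evaluation code: the data layout (`responses[i]` is the list of annotator answers to question `i`), the pooled estimate of individual accuracy, and the tie-breaking rule are all assumed here.

```python
from collections import Counter


def majority_vote(answers):
    """Most frequent answer; ties go to the answer seen first (Counter order)."""
    return Counter(answers).most_common(1)[0][0]


def human_baselines(responses, gold):
    """Return (majority-vote accuracy, pooled individual accuracy).

    responses[i] holds every annotator's answer to question i;
    gold[i] is the correct option for question i.
    """
    n = len(gold)
    majority_acc = sum(majority_vote(r) == g for r, g in zip(responses, gold)) / n
    total_answers = sum(len(r) for r in responses)
    individual_acc = (
        sum(a == g for r, g in zip(responses, gold) for a in r) / total_answers
    )
    return majority_acc, individual_acc
```

Majority voting can exceed the average individual score because independent errors tend to cancel, which matches the ordering of the two human numbers in the abstract. For four-choice questions, random guessing would score 25%.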
[40] A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement cs.CV
Haoning Xue, Jingwen Zhang, Xiaohui Wang, Diane Dagyong Kim, Yunya Song
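TL;DR: Grounded in Message Sensation Value (MSV) theory, this study develops a computational model that predicts viewers' sensory and behavioral engagement from the multimodal features of short videos. The model was built with multimodal feature analysis and human evaluation of 1,200 short videos and validated on 14,492 unseen videos from three short-video platforms. MSV is positively associated with sensory engagement but has an inverted U-shaped relationship with behavioral engagement: higher MSV elicits stronger sensory stimulation, while moderate MSV optimizes behavioral engagement.
Details
Motivation: Today's media landscape is full of sensational short videos, but existing research mostly examines the effects of individual features; the combined effect of multimodal features on viewer engagement remains unclear. A computational model is therefore needed to predict sensory and behavioral engagement with short videos.
Result: The model was validated on two unseen datasets (from three short-video platforms, 14,492 videos in total). MSV is positively associated with sensory engagement and has an inverted U-shaped relationship with behavioral engagement, indicating that moderate MSV maximizes behavioral engagement.
Insight: The novelty is applying MSV theory to multimodal feature analysis of short videos and building a computational model that predicts engagement. Viewed objectively, the study uses large-scale data to confirm a nonlinear effect of multimodal features on engagement and provides a theory-based tool for short-video research and content optimization.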
Abstract: The contemporary media landscape is characterized by sensational short videos. While prior research examines the effects of individual multimodal features, the collective impact of multimodal features on viewer engagement with short videos remains unknown. Grounded in the theoretical framework of Message Sensation Value (MSV), this study develops and tests a computational model of MSV with multimodal feature analysis and human evaluation of 1,200 short videos. This model that predicts sensory and behavioral engagement was further validated across two unseen datasets from three short video platforms (combined N = 14,492). While MSV is positively associated with sensory engagement, it shows an inverted U-shaped relationship with behavioral engagement: Higher MSV elicits stronger sensory stimulation, but moderate MSV optimizes behavioral engagement. This research advances the theoretical understanding of short video engagement and introduces a robust computational tool for short video research.
[41] RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery cs.CV
Bowen Zhang, Jesse T. Boulerice, Charvi Mendiratta, Nikhil Kuniyil, Satish Kumar
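TL;DR: This paper presents RareSpot+, a benchmark, model, and active learning framework for detecting small, rare wildlife in aerial imagery. The framework tackles the difficulty of small-object detection and the high cost of annotation through multi-scale consistency learning, context-aware augmentation, and geospatially guided active learning.
Details
Motivation: To address two main challenges of automated wildlife monitoring from aerial imagery: the difficulty of detecting small, rare species and the high cost of large-scale expert annotation. Prairie dogs are the example: they are small, sparsely distributed, and hard to distinguish from the background, which poses a severe challenge for conventional detection models.
Result: On a 2 km^2 aerial dataset, RareSpot+ improves the baseline mAP@50 by +35.2% (absolute +0.13). Cross-dataset tests on HerdNet, AED, and several other wildlife benchmarks demonstrate robust detector-level transferability. The active learning module further raises prairie dog AP by 14.5% using an annotation budget of only 1.7% of the unlabeled tiles.
Insight: The contributions include: 1) a multi-scale consistency loss that needs no architectural change and improves small-object localization by aligning intermediate feature maps across detection heads; 2) context-aware augmentation that improves robustness by synthesizing hard, ecologically plausible examples; 3) a geospatial active learning module that exploits domain-specific spatial priors (such as the association between prairie dogs and burrows), together with test-time augmentation and a meta-uncertainty model, to reduce redundant labeling. The framework also links vision-based detection with quantitative ecological analysis such as clustering and co-occurrence.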
Abstract: Automated wildlife monitoring from aerial imagery is vital for conservation but remains limited by two persistent challenges: the difficulty of detecting small, rare species and the high cost of large-scale expert annotation. Prairie dogs exemplify this problem – they are ecologically important yet appear tiny, sparsely distributed, and visually indistinct from their surroundings, posing a severe challenge for conventional detection models. To overcome these limitations, we present RareSpot+, a detection framework that integrates multi-scale consistency learning, context-aware augmentation, and geospatially guided active learning to address these issues. A novel multi-scale consistency loss aligns intermediate feature maps across detection heads, enhancing localization of small (approx. 30 pixels wide) objects without architectural changes, while context-aware augmentation improves robustness by synthesizing hard, ecologically plausible examples. A geospatial active learning module exploits domain-specific spatial priors linking prairie dogs and burrows, together with test-time augmentation and a meta-uncertainty model, to reduce redundant labeling. On a 2 km^2 aerial dataset, RareSpot+ improves detection over the baseline mAP@50 by +35.2% (absolute +0.13). Cross-dataset tests on HerdNet, AED, and several other wildlife benchmarks demonstrate robust detector-level transferability. The active learning module further boosts prairie dog AP by 14.5% using an annotation budget of just 1.7% of the unlabeled tiles. Beyond detection, RareSpot+ enables spatial ecological analyses such as clustering and co-occurrence, linking vision-based detection with quantitative ecology.
[42] EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training cs.CV | cs.AI | cs.CL
Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, Chenyan Xiong
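TL;DR: This paper proposes EmbodiedMidtrain, which uses mid-training to bridge the data distribution gap between vision-language models (VLMs) and vision-language-action models (VLAs). A lightweight learnable proximity estimator selects the most VLA-aligned candidates from a large pool of VLM data; the VLM is mid-trained on this data before downstream VLA fine-tuning. Experiments show consistent gains on three robot manipulation benchmarks across different VLM backbones.
Details
Motivation: Most existing VLAs are built directly on general-purpose VLMs that have not been adapted to the embodied domain, which limits downstream performance. The mismatch between VLM and VLA data distributions therefore needs to be addressed.
Result: On three robot manipulation benchmarks, mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and with off-the-shelf VLMs trained at larger model scale and with larger training budgets.
Insight: The novelty is a data distribution analysis showing that VLA data occupy compact regions largely separated from the broader VLM distribution, together with a data engine that uses both dataset-level and sample-level alignment signals. The engine favors spatial reasoning over text-centric tasks while preserving the diversity of the VLM data, giving VLA fine-tuning a stronger initialization.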
Abstract: Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.
[43] Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers cs.CV | cs.AI
Ethan Knights
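TL;DR: This paper investigates whether fine-tuning the self-attention weights of a Vision Transformer (ViT) on human saliency fixation maps can narrow the cognitive gap between the model and human attentional characteristics. Fine-tuning significantly improves alignment with human attention without harming classification performance on ImageNet, ImageNet-C, and ObjectNet. In contrast, applying the same procedure to a ResNet-50 CNN degrades both alignment and accuracy, suggesting that the ViT's modular self-attention mechanism is uniquely suited to separating spatial priority from representational logic.
Details
Motivation: State-of-the-art Vision Transformers process images in ways that diverge substantially from human attentional characteristics. The paper explores whether fine-tuning the self-attention weights can shrink this cognitive gap and thereby improve transformer interpretability.
Result: The fine-tuned ViT-B/16 shows significantly better alignment on five saliency metrics and acquires three human-like attention biases: it reverses the baseline's anti-human large-object bias toward small objects, amplifies the animacy preference, and reduces extreme attention entropy. Bayesian parity analysis provides decisive to very strong evidence that this cognitive alignment comes at no cost to the model's original classification performance on the ImageNet, ImageNet-C, and ObjectNet benchmarks.
Insight: The novelty is showing that biologically grounded priors (human attention biases) can be instilled in a ViT as a free emergent property of human-aligned attention, improving interpretability without sacrificing performance. Viewed objectively, the ViT's modular self-attention mechanism appears uniquely able to decouple spatial priority from representational logic, which suggests a route to vision models that are both more interpretable and high-performing.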
Abstract: For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture but their processing diverges substantially from human attentional characteristics. We investigate whether this cognitive gap can be shrunk by fine-tuning the self-attention weights of Google’s ViT-B/16 on human saliency fixation maps. To isolate the effects of semantically relevant signals from generic human supervision, the tuned model is compared against a shuffled control. Fine-tuning significantly improved alignment across five saliency metrics and induced three hallmark human-like biases: tuning reversed the baseline’s anti-human large-object bias toward small-objects, amplified the animacy preference and diminished extreme attention entropy. Bayesian parity analysis provides decisive to very-strong evidence that this cognitive alignment comes at no cost to the model’s original classification performance on in- (ImageNet), corrupted (ImageNet-C) and out-of-distribution (ObjectNet) benchmarks. An equivalent procedure applied to a ResNet-50 Convolutional Neural Network (CNN) instead degraded both alignment and accuracy, suggesting that the ViT’s modular self-attention mechanism is uniquely suited for dissociating spatial priority from representational logic. These findings demonstrate that biologically grounded priors can be instilled as a free emergent property of human-aligned attention, to improve transformer interpretability.
[44] Learning to count small and clustered objects with application to bacterial colonies cs.CV
Minghua Zheng, Na Helian, Peter C. R. Lane, Yi Sun, Allen Donald
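TL;DR: The paper proposes ACFamNet and ACFamNet Pro to address the challenges of bacterial colony counting: small size, clustering, high annotation cost, and cross-species generalisation. ACFamNet handles small and clustered objects through an improved region of interest pooling with alignment and optimised feature engineering; ACFamNet Pro adds multi-head attention and residual connections to weight objects dynamically and improve gradient flow. In experiments, ACFamNet Pro achieves a mean normalised absolute error of 9.64% under 5-fold cross-validation, outperforming the baseline methods.
Details
Motivation: The motivation is to solve four key challenges in automated bacterial colony counting: small physical size, object clustering, high data annotation cost, and limited cross-species generalisation. The existing method FamNet is effective for clustered objects and costly annotation, but its effectiveness for small colony sizes and cross-species generalisation is unknown.
Result: ACFamNet Pro achieves a mean normalised absolute error (MNAE) of 9.64% under 5-fold cross-validation, outperforming ACFamNet and FamNet by 2.23% and 12.71%, respectively.
Insight: The contributions are: for small and clustered objects, a novel region of interest pooling with alignment and optimised feature engineering (ACFamNet); and, on top of that, multi-head attention and residual connections that enable dynamic weighting of objects and improved gradient flow (ACFamNet Pro). These improve generalisation and counting accuracy and could carry over to other small-object detection and dense counting tasks.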
Abstract: Automated bacterial colony counting from images is an important technique to obtain data required for the development of vaccines and antibiotics. However, bacterial colonies present unique machine vision challenges that affect counting, including (1) small physical size, (2) object clustering, (3) high data annotation cost, and (4) limited cross-species generalisation. While FamNet is an established object counting technique effective for clustered objects and costly data annotation, its effectiveness for small colony sizes and cross-species generalisation remains unknown. To address the first three challenges, we propose ACFamNet, an extension of FamNet that handles small and clustered objects using a novel region of interest pooling with alignment and optimised feature engineering. To address all four challenges above, we introduce ACFamNet Pro, which augments ACFamNet with multi-head attention and residual connections, enabling dynamic weighting of objects and improved gradient flow. Experiments show that ACFamNet Pro achieves a mean normalised absolute error (MNAE) of 9.64% under 5-fold cross-validation, outperforming ACFamNet and FamNet by 2.23% and 12.71%, respectively.
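The abstract reports results as mean normalised absolute error (MNAE) but does not spell out the formula. A minimal sketch follows, assuming the common definition in which each image's absolute count error is divided by its ground-truth count and the ratios are averaged; this definition and the function name `mnae` are assumptions, not taken from the paper.

```python
def mnae(predicted_counts, true_counts):
    """Mean normalised absolute error over images.

    Assumed definition: mean of |predicted - true| / true,
    which requires every ground-truth count to be positive.
    """
    if len(predicted_counts) != len(true_counts):
        raise ValueError("prediction and ground-truth lists differ in length")
    ratios = [abs(p - t) / t for p, t in zip(predicted_counts, true_counts)]
    return sum(ratios) / len(ratios)
```

Normalising by the true count keeps plates with many colonies from dominating the average, which is one reason a relative metric suits counting tasks where per-image counts vary widely.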
[45] PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers cs.CV | cs.CR
Dazhuang Liu, Yanqi Qiao, Rui Wang, Kaitai Liang, Georgios Smaragdakis
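TL;DR: This paper proposes PASTA, a twofold-stealthy backdoor attack on Vision Transformers (ViTs). It exploits the Trigger Radiating Effect (TRE) to activate the backdoor with the trigger at an arbitrary patch location, while remaining stealthy in both the pixel and attention domains.
Details
Motivation: Existing patch-wise backdoor attacks typically assume a fixed trigger location at inference to maximize attention on the trigger, but they overlook the long-range dependencies captured by self-attention in ViTs, and they often sacrifice visual and attention stealthiness, which makes them easy to detect.
Result: Experiments on four datasets show that PASTA achieves an average attack success rate of 99.13% across arbitrary patch locations, improves visual and attention stealthiness by 144.43x and 18.68x respectively, and improves robustness against state-of-the-art ViT defenses by 2.79x, outperforming CNN- and ViT-based baselines.
Insight: The novelty is identifying the Trigger Radiating Effect (TRE) and proposing a multi-location trigger insertion strategy with a bi-level optimization framework in which the model and the trigger iteratively adapt to each other, achieving stealthiness in both the pixel and attention domains while keeping a high attack success rate.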
Abstract: Vision Transformers (ViTs) have achieved remarkable success across vision tasks, yet recent studies show they remain vulnerable to backdoor attacks. Existing patch-wise attacks typically assume a single fixed trigger location during inference to maximize trigger attention. However, they overlook the self-attention mechanism in ViTs, which captures long-range dependencies across patches. In this work, we observe that a patch-wise trigger can achieve high attack effectiveness when activating backdoors across neighboring patches, a phenomenon we term the Trigger Radiating Effect (TRE). We further find that inter-patch trigger insertion during training can synergistically enhance TRE compared to single-patch insertion. Prior ViT-specific attacks that maximize trigger attention often sacrifice visual and attention stealthiness, making them detectable. Based on these insights, we propose PASTA, a twofold stealthy patch-wise backdoor attack in both pixel and attention domains. PASTA enables backdoor activation when the trigger is placed at arbitrary patches during inference. To achieve this, we introduce a multi-location trigger insertion strategy to enhance TRE. However, preserving stealthiness while maintaining strong TRE is challenging, as TRE is weakened under stealthy constraints. We therefore formulate a bi-level optimization problem and propose an adaptive backdoor learning framework, where the model and trigger iteratively adapt to each other to avoid local optima. Extensive experiments show that PASTA achieves 99.13% attack success rate across arbitrary patches on average, while significantly improving visual and attention stealthiness (144.43x and 18.68x) and robustness (2.79x) against state-of-the-art ViT defenses across four datasets, outperforming CNN- and ViT-based baselines.
[46] Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging cs.CV
Peiming Luo, Nan Wang, Litong Liu, Jiahan Huang, Chenxu Wu
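TL;DR: This paper proposes a semi-supervised flow matching framework that fuses a low-resolution mosaiced hyperspectral image with a high-resolution panchromatic image to enable single-shot, video-rate high-resolution hyperspectral imaging. Training has two stages: an unsupervised prior network is pretrained to produce an initial pseudo high-resolution hyperspectral image, then a conditional flow matching model is trained to generate the target image, with a random voting mechanism that iteratively refines the initial estimate. At inference, a conflict-free gradient guidance strategy ensures spectrally and spatially consistent reconstruction.
Details
Motivation: To solve the severely ill-posed problem of fusing a low-resolution mosaiced hyperspectral image with a high-resolution panchromatic image from a single shot, as an alternative to existing diffusion-based methods that are constrained by specific protocols or handcrafted assumptions.
Result: Experiments on multiple benchmark datasets show that the method significantly outperforms representative baselines in both quantitative and qualitative terms.
Insight: The novelty is integrating an unsupervised scheme with flow matching into a generalizable and efficient generative framework; the random voting mechanism and the conflict-free gradient guidance strategy improve the robustness and consistency of fusion. The framework can be extended to other image fusion tasks and integrated with unsupervised or blind image restoration algorithms.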
Abstract: Fusing a low resolution (LR) mosaiced hyperspectral image (HSI) with a high resolution (HR) panchromatic (PAN) image offers a promising avenue for video-rate HR-HSI imaging via single-shot acquisition, yet its severely ill-posed nature remains a significant challenge. In this work, we propose a novel semi-supervised flow matching framework for mosaiced and PAN image fusion. Unlike previous diffusion-based approaches constrained by specific protocols or handcrafted assumptions, our method seamlessly integrates an unsupervised scheme with flow matching, resulting in a generalizable and efficient generative framework. Specifically, our method follows a two-stage training pipeline. First, we pretrain an unsupervised prior network to produce an initial pseudo HR-HSI. Building on this, we then train a conditional flow matching model to generate the target HR-HSI, introducing a random voting mechanism that iteratively refines the initial HR-HSI estimate, enabling robust and effective fusion. During inference, we employ a conflict-free gradient guidance strategy that ensures spectrally and spatially consistent HR-HSI reconstruction. Experiments on multiple benchmark datasets demonstrate that our method achieves superior quantitative and qualitative performance by a significant margin compared to representative baselines. Beyond mosaiced and PAN fusion, our approach provides a flexible generative framework that can be readily extended to other image fusion tasks and integrated with unsupervised or blind image restoration algorithms.
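For readers unfamiliar with flow matching, the sketch below shows the generic training target and sampler on plain lists of floats. It assumes the common linear (straight-line) probability path between a source sample `x0` and a target sample `x1`; the abstract does not state which path the authors use, and the conditioning on the mosaiced and PAN inputs, the random voting mechanism, and the gradient guidance are not reproduced here. All names are illustrative.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its target velocity.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1, so velocity is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would be fitted so that its output at `(x_t, t)` minimizes `cfm_loss` against `v_target`; at inference, `euler_sample` transports the initial estimate along the learned velocity field. In the paper's setting the starting point would presumably be related to the pseudo HR-HSI from the prior network, though the abstract does not give that detail.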
[47] IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory cs.CV | cs.AI
Weitong Kong, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong
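TL;DR: IMPACT-CYCLE is a contract-based multi-agent system for claim-level supervisory correction of long-video semantic memory. It reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory (a structured, versioned state containing typed claims, a claim dependency graph, and a provenance log). Role-specialized agents decompose verification under explicit authority contracts, and the system escalates to human arbitration when automated evidence is insufficient, reducing correction cost.
Details
Motivation: To address the excessive cost of correcting errors in long-video understanding. Existing multimodal pipelines produce opaque outputs with no intermediate state, so annotators must revisit the raw video and rebuild the temporal logic from scratch. The core bottleneck is the lack of a supervisory interface that makes human effort proportional to the scope of each error.
Result: Experiments on the VidOR dataset show the downstream reasoning (VQA) score rising from 0.71 to 0.79 and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation.
Insight: The novelty is recasting long-video understanding as iterative maintenance of a claim-level semantic memory and introducing a contract-based multi-agent system that performs local verification and dependency-closure re-verification, so that correction cost stays proportional to error scope. Viewed objectively, the structured state representation and the authority escalation mechanism offer new ideas for interpretability and human-AI collaboration in video understanding systems.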
Abstract: Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory – a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.
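The dependency-closure step described in the abstract can be pictured as a graph traversal: once a claim is corrected, only the claims that transitively depend on it need re-verification. The following is a minimal sketch under stated assumptions, not the authors' implementation; the graph layout, the claim ids, and the function name `dependency_closure` are hypothetical.

```python
from collections import deque

def dependency_closure(dependents, corrected):
    """Return the corrected claims plus every claim that transitively depends on them.

    `dependents` maps a claim id to the claim ids that depend on it
    (a hypothetical encoding of the claim dependency graph).
    """
    seen = set(corrected)
    queue = deque(corrected)
    while queue:
        claim = queue.popleft()
        for child in dependents.get(claim, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical graph: c2 and c3 depend on c1, c4 depends on c2, c6 depends on c5.
dependents = {"c1": ["c2", "c3"], "c2": ["c4"], "c5": ["c6"]}
to_reverify = dependency_closure(dependents, ["c1"])
print(sorted(to_reverify))
```

Claims outside the closure (here c5 and c6) are left untouched, which is what keeps the re-verification workload proportional to the scope of the error.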
[48] HumanScore: Benchmarking Human Motions in Generated Videos cs.CVPDF
Yusu Fang, Tiange Xiang, Tian Tan, Narayan Schuetz, Scott Delp
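TL;DR: This paper presents HumanScore, a framework for systematically evaluating the quality of human motion in AI-generated videos. It defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, and evaluates 13 state-of-the-art models.
Details
Motivation: Video generation has advanced in visual realism, but no method systematically evaluates how faithfully human motion and dynamics are rendered. HumanScore aims to fill this gap.
Result: Evaluating 13 SOTA models with carefully designed prompts reveals a gap between perceptual plausibility and the biomechanical fidelity of motion, identifies common failure modes (e.g., temporal jitter, anatomically implausible poses, and motion drift), and yields robust model rankings based on quantitative, physically meaningful criteria.
Insight: The novelty lies in proposing the first framework for systematically evaluating human motion in generated videos. Its multi-dimensional metrics enable fine-grained diagnosis beyond evaluation based on visual realism alone and point to concrete directions for model improvement.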
Abstract: Recent advances in model architectures, compute, and data scale have driven rapid progress in video generation, producing increasingly realistic content. Yet, no prior method systematically measures how faithfully these systems render human bodies and motion dynamics. In this paper, we present HumanScore, a systematic framework to evaluate the quality of human motions in AI-generated videos. HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, enabling fine-grained diagnosis beyond visual realism alone. Through carefully designed prompts, we elicit a diverse set of movements at varying intensities and evaluate videos generated by thirteen state-of-the-art models. Our analysis reveals consistent gaps between perceptual plausibility and motion biomechanical fidelity, identifies recurrent failure modes (e.g., temporal jitter, anatomically implausible poses, and motion drift), and produces robust model rankings from quantitative and physically meaningful criteria.
[49] WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring cs.CV | cs.LGPDF
Mobin Habibpour, Niloufar Alipour Talemi, John Spodnik, Camren J. Khoury, Fatemeh Afghah
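TL;DR: This paper introduces WildFireVQA, a large-scale visual question answering benchmark for aerial wildfire monitoring that combines RGB imagery with radiometric thermal measurements. The benchmark contains 6,097 RGB-thermal samples, each paired with 34 questions, for a total of 207,298 multiple-choice questions covering presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning.
Details
Motivation: Existing aerial VQA benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements, so a dedicated benchmark that integrates radiometric thermal data is needed to support situational awareness in wildfire monitoring.
Result: Experiments show that RGB is the strongest modality for current models, while retrieval-augmented thermal context improves the performance of stronger multimodal large language models. This highlights both the value of temperature-grounded reasoning and the limitations of existing models in safety-critical wildfire scenarios.
Insight: The novelty lies in building the first aerial wildfire VQA benchmark that incorporates radiometric thermal data, in a reliable annotation pipeline that combines MLLM-assisted answer generation with sensor-driven deterministic labeling, manual verification, and consistency checks, and in a comprehensive evaluation protocol based on radiometric thermal statistics.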
Abstract: Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.
[50] From Scene to Object: Text-Guided Dual-Gaze Prediction cs.CV | cs.AI | cs.ROPDF
Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu
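TL;DR: This paper proposes DualGaze-VLM, a dual-branch gaze prediction framework for interpretable driver attention prediction. It addresses the fact that existing datasets provide only scene-level global gaze annotations and therefore cannot support fine-grained object-level cognitive modeling. By building G-W3DA, the first object-level driver attention dataset, and designing a vision-language model architecture that combines semantic queries with condition-aware feature modulation, it achieves precise text-guided object-level spatial anchoring.
Details
Motivation: Existing driver attention prediction datasets lack fine-grained object-level annotations. As a result, vision-language models suffer from severe text-vision decoupling and visual-bias hallucinations during semantic reasoning and cannot make precise, text-grounded object-level attention predictions.
Result: Extensive experiments on the W3DA benchmark show that DualGaze-VLM consistently surpasses existing state-of-the-art models on spatial alignment metrics, with up to a 17.8% improvement in the Similarity metric under safety-critical scenarios. In a visual Turing test, 88.22% of human evaluators judged its generated attention heatmaps to be authentic.
Insight: The novelty lies in a complete paradigm from data construction to model architecture: 1) a multimodal large language model is combined with SAM3 under rigorous cross-validation to build the high-quality object-level dataset G-W3DA, which eliminates annotation hallucinations at the source; 2) the DualGaze-VLM architecture extracts the hidden states of semantic queries and dynamically modulates visual features through a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring.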
Abstract: Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.
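For readers unfamiliar with the Similarity (SIM) score mentioned above, the sketch below assumes it refers to the histogram-intersection metric commonly used in saliency evaluation: both maps are normalized to sum to one and the element-wise minima are summed. Whether the paper uses exactly this definition is an assumption; the function name and the toy maps are illustrative.

```python
import numpy as np

def similarity(pred, gt, eps=1e-12):
    """Histogram-intersection similarity between two non-negative attention maps."""
    p = pred / (pred.sum() + eps)  # normalize the prediction to a distribution
    q = gt / (gt.sum() + eps)      # normalize the ground truth to a distribution
    return float(np.minimum(p, q).sum())

# Toy 2x2 heatmaps (hypothetical values).
pred = np.array([[1.0, 0.0], [0.0, 1.0]])
gt_same = pred.copy()
gt_disjoint = np.array([[0.0, 1.0], [1.0, 0.0]])
print(similarity(pred, gt_same), similarity(pred, gt_disjoint))
```

Under this definition the score ranges from 0 (no overlap) to 1 (identical distributions).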
[51] Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation cs.CVPDF
Tianshui Chen, Jianman Lin, Zhijing Yang, Chunmei Qing, Guangrun Wang
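TL;DR: This paper proposes a novel spatial-temporal coherent correlation learning algorithm for speech-preserving facial expression manipulation, a task that modifies facial emotion while precisely preserving the mouth animation tied to the spoken content. The method models the strong spatial and temporal correlations in a speaker's facial animation across different emotions and uses them as explicit metrics to supervise generation, removing the dependence on hard-to-obtain paired training data.
Details
Motivation: Current methods rely on person-specific paired training samples that are hard to obtain, which limits the use of speech-preserving facial expression manipulation in real-world scenarios. The paper builds on the finding that facial animations expressing the same content under different emotions are highly correlated in space and time, and uses this correlation as an effective supervision signal.
Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with existing methods.
Insight: The main novelty is the finding that facial animations for the same speech content under different emotions are spatially and temporally correlated. Based on it, the paper designs a spatial coherent correlation metric, a temporal coherent correlation metric, and a correlation-aware adaptive strategy, and integrates these metrics as an additional loss to supervise training, enabling high-quality expression manipulation without paired data.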
Abstract: Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates the metrics to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we have also crafted a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.
[52] Bio-inspired Color Constancy: From Gray Anchoring Theory to Gray Pixel Methods cs.CVPDF
Kai-Fu Yang, Fu-Ya Luo, Yong-Jie Li
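TL;DR: This paper presents a comprehensive technical framework for bio-inspired color constancy that integrates biological mechanisms, computational theory, and algorithmic implementation. It systematically revisits the computational theory of biological color constancy, shows that illuminant estimation can be reduced to gray-anchor detection in early vision, and reinterprets typical gray-pixel detection methods within a unified theoretical framework. Finally, the authors propose a simple learning-based method that couples reflection-model constraints with feature learning to explore the potential of bio-inspired color constancy based on gray-pixel detection.
Details
Motivation: Color constancy is a fundamental ability of many biological visual systems and a crucial step in computer imaging systems. Bio-inspired modeling offers a promising way to elucidate the computational principles of color constancy and to develop efficient computational methods. However, bio-inspired color constancy methods remain underexplored and lack a comprehensive analysis.
Result: Extensive experiments confirm the effectiveness of gray-pixel detection for color constancy and demonstrate the potential of bio-inspired methods.
Insight: The main novelty is a unified framework that integrates biology, computational theory, and algorithms, reformulates illuminant estimation as gray-anchor detection, and proposes a simple method that combines physical-model constraints with data-driven learning, offering a new perspective on bio-inspired color constancy and evidence of its potential.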
Abstract: Color constancy is a fundamental ability of many biological visual systems and a crucial step in computer imaging systems. Bio-inspired modeling offers a promising way to elucidate the computational principles underlying color constancy and to develop efficient computational methods. However, bio-inspired methods for color constancy remain underexplored and lack a comprehensive analysis. This paper presents a comprehensive technical framework that integrates biological mechanisms, computational theory, and algorithmic implementation for bio-inspired color constancy. Specifically, we systematically revisit the computational theory of biological color constancy, which shows that illuminant estimation can be reduced to the task of gray-anchor (pixel or surface) detection in early vision. Subsequently, typical gray-pixel detection methods, including Gray-Pixel and Grayness-Index, are reinterpreted within a unified theoretical framework with the Lambertian reflection model and biological color-opponent mechanisms. Finally, we propose a simple learning-based method that couples reflection-model constraints with feature learning to explore the potential of bio-inspired color constancy based on gray-pixel detection. Extensive experiments confirm the effectiveness of gray-pixel detection for color constancy and demonstrate the potential of bio-inspired methods.
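To make the reduction from illuminant estimation to gray-pixel detection concrete, the sketch below assumes the gray pixels have already been detected (the mask is given) and a single global illuminant. Under a Lambertian model an achromatic surface reflects the illuminant color, so averaging the detected pixels gives an estimate of its direction. The detection step itself, which is the subject of the paper, is not implemented here; the function names and toy values are hypothetical.

```python
import numpy as np

def estimate_illuminant(image, gray_mask):
    """Estimate the illuminant direction as the mean RGB of detected gray pixels."""
    pixels = image[gray_mask]            # shape (K, 3)
    illum = pixels.mean(axis=0)
    return illum / np.linalg.norm(illum)  # unit-norm RGB direction

def white_balance(image, illum):
    """Apply a diagonal (von Kries style) correction using the estimated illuminant."""
    gains = illum.mean() / illum
    return image * gains

# Toy scene: a uniformly gray surface lit by a warm illuminant (hypothetical values).
illum_true = np.array([1.0, 0.8, 0.6])
image = np.full((4, 4, 3), 0.5) * illum_true
mask = np.ones((4, 4), dtype=bool)        # assume every pixel was detected as gray

est = estimate_illuminant(image, mask)
balanced = white_balance(image, est)
print(est, balanced[0, 0])
```

After correction the gray surface has equal R, G, and B values, which is the behavior a color-constancy step is meant to produce.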
[53] X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference cs.CVPDF
Yixiao Zeng, Jianlei Zheng, Chaoda Zheng, Shijia Chen, Mingdian Liu
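TL;DR: This paper proposes X-Cache, a training-free method for accelerating inference in driving world models built on autoregressive video diffusion. It caches across consecutive generation chunks rather than across denoising steps, uses a dual-metric gating mechanism over block-input fingerprints to decide whether to reuse cached residuals, and identifies KV update chunks to cut off error propagation, achieving a significant inference speedup with minimal performance degradation.
Details
Motivation: Driving world models based on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost is a bottleneck for interactive deployment. Existing diffusion caching methods are designed for offline video generation with multi-step denoising and do not suit few-step distilled models or closed-loop interactive generation: few-step models have no inter-step redundancy, and sequence-level parallelization requires future conditioning that interactive generation cannot provide.
Result: Implemented on X-world (a production multi-camera, action-conditioned driving world model built on a multi-block causal DiT with few-step denoising and a rolling KV cache), X-Cache achieves a 71% block skip rate and a 2.6x wall-clock speedup with minimal performance degradation.
Insight: The novelty lies in a new caching axis (across consecutive generation chunks rather than across denoising steps), a dual-metric gating mechanism with a structure- and action-aware block-input fingerprint that decides independently whether each block recomputes or reuses, and forced full computation on KV update chunks to cut off error propagation. Together these address redundancy exploitation and error accumulation in few-step autoregressive world-model inference.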
Abstract: Real-time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost remains a bottleneck for interactive deployment. Existing diffusion caching methods are designed for offline video generation with multiple denoising steps, and do not transfer to this scenario. Few-step distilled models have no inter-step redundancy left for these methods to reuse, and sequence-level parallelization techniques require future conditioning that closed-loop interactive generation does not provide. We present X-Cache, a training-free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X-Cache maintains per-block residual caches that persist across chunks, and applies a dual-metric gating mechanism over a structure- and action-aware block-input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X-Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X-Cache on X-world, a production multi-camera action-conditioned driving world model built on multi-block causal DiT with few-step denoising and rolling KV cache. X-Cache achieves a 71% block skip rate and a 2.6x wall-clock speedup with minimal degradation.
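The gating logic described above can be sketched as a per-block cache that reuses its stored residual only when the current input fingerprint is close to the last computed one, and that always recomputes on KV update chunks. This is an illustrative sketch, not the authors' code: the abstract does not say which two metrics form the gate, so cosine similarity and relative L2 distance are assumptions, as are the thresholds, the class name `BlockCache`, and the way the fingerprint is passed in.

```python
import numpy as np

class BlockCache:
    """Per-block residual cache that persists across generation chunks (sketch)."""

    def __init__(self, cos_min=0.99, rel_l2_max=0.05):
        self.cos_min = cos_min        # assumed threshold on cosine similarity
        self.rel_l2_max = rel_l2_max  # assumed threshold on relative L2 change
        self.fingerprint = None
        self.residual = None

    def should_reuse(self, fp, is_kv_update):
        # KV update chunks always run full computation so cached approximations
        # never get written into the persistent KV cache.
        if is_kv_update or self.fingerprint is None:
            return False
        ref = self.fingerprint
        cos = float(fp @ ref / (np.linalg.norm(fp) * np.linalg.norm(ref) + 1e-12))
        rel = float(np.linalg.norm(fp - ref) / (np.linalg.norm(ref) + 1e-12))
        return cos >= self.cos_min and rel <= self.rel_l2_max  # both metrics must pass

    def forward(self, x, fp, block_fn, is_kv_update=False):
        if self.should_reuse(fp, is_kv_update):
            return x + self.residual, True   # skip the block, reuse cached residual
        out = block_fn(x)
        self.residual = out - x              # refresh the cache after a full pass
        self.fingerprint = fp.copy()
        return out, False

# Toy usage with a stand-in block that doubles its input.
cache = BlockCache()
x = np.ones(4)
fp = np.array([1.0, 2.0, 3.0])
_, reused_first = cache.forward(x, fp, lambda v: 2 * v)
_, reused_second = cache.forward(x, fp, lambda v: 2 * v)
print(reused_first, reused_second)
```

Each transformer block would hold its own cache instance, so the recompute-or-reuse decision is made independently per block, as the abstract describes.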
[54] Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training cs.CVPDF
Pham Phuong Nam Nguyen, Nam Tien Le, Thi Kim Trang Vo, Nhu Tinh Anh Nguyen
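TL;DR: This paper proposes a deployment-oriented, efficient INT8 single-image super-resolution framework with an extract-refine-upsample design. Computation in low-resolution space, a lightweight re-parameterizable backbone, and PixelShuffle reconstruction yield a compact inference graph. Training proceeds in three stages: learning a basic reconstruction mapping; fidelity refinement with Charbonnier loss and DCT-domain supervision, combined with confidence-weighted output-level distillation from a Mamba-based teacher; and quantization-aware training directly on the fused deploy graph, with weight clipping and BatchNorm recalibration to improve quantization stability.
Details
Motivation: Efficient single-image super-resolution (SISR) must balance reconstruction fidelity, model compactness, and robustness under low-bit deployment, which is especially challenging for x3 super-resolution.
Result: On the MAI 2026 Quantized 4K Image Super-Resolution Challenge test set, the final AIO MAI submission achieves 29.79 dB PSNR and 0.8634 SSIM under the target mobile INT8 deployment setting, with a final score of 1.8. Ablations show that teacher-guided supervision improves the dynamic INT8 TFLite reconstruction from 29.91 dB/0.853 to 30.0003 dB/0.856, while the fixed-shape deployable INT8 TFLite model reaches 30.006 dB/0.857.
Insight: The contributions include a deployment-aware quantization framework (extract-refine-upsample), a three-stage teacher-guided training pipeline (spatial supervision, DCT-domain supervision, and output-level distillation), and quantization optimizations (weight clipping and BatchNorm recalibration). Viewed objectively, the tight coupling of architecture design with the quantization training pipeline, and in particular the interplay of teacher distillation and quantization-aware training, offers a systematic path to practical deployment of low-bit super-resolution models.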
Abstract: Efficient single-image super-resolution (SISR) requires balancing reconstruction fidelity, model compactness, and robustness under low-bit deployment, which is especially challenging for x3 SR. We present a deployment-oriented quantized SISR framework based on an extract-refine-upsample design. The student performs most computation in the low-resolution space and uses a lightweight re-parameterizable backbone with PixelShuffle reconstruction, yielding a compact inference graph. To improve quality without significantly increasing complexity, we adopt a three-stage training pipeline: Stage 1 learns a basic reconstruction mapping with spatial supervision; Stage 2 refines fidelity using Charbonnier loss, DCT-domain supervision, and confidence-weighted output-level distillation from a Mamba-based teacher; and Stage 3 applies quantization-aware training directly on the fused deploy graph. We further use weight clipping and BatchNorm recalibration to improve quantization stability. On the MAI 2026 Quantized 4K Image Super-Resolution Challenge test set, our final AIO MAI submission achieves 29.79 dB PSNR and 0.8634 SSIM, obtaining a final score of 1.8 under the target mobile INT8 deployment setting. Ablation on Stage 3 optimization shows that teacher-guided supervision improves the dynamic INT8 TFLite reconstruction from 29.91 dB/0.853 to 30.0003 dB/0.856, while the fixed-shape deployable INT8 TFLite artifact attains 30.006 dB/0.857.
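Two of the building blocks named in the abstract have standard definitions that are easy to show in isolation. The sketch below gives a NumPy version of the Charbonnier loss and of the PixelShuffle (depth-to-space) rearrangement for an x3 upscale. It is not the authors' training code; the `eps` value and the tensor sizes are assumptions, and the channel ordering follows the convention used by common deep-learning frameworks.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like loss: mean of sqrt(diff^2 + eps^2)."""
    diff = pred - target
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channels into (C, r_row, r_col)
    x = x.transpose(0, 3, 1, 4, 2)   # reorder to (C, H, r_row, W, r_col)
    return x.reshape(c, h * r, w * r)

# x3 super-resolution: 9 low-resolution channels become one high-resolution channel.
features = np.arange(9 * 2 * 2, dtype=np.float64).reshape(9, 2, 2)
upsampled = pixel_shuffle(features, 3)
print(upsampled.shape, charbonnier_loss(upsampled, upsampled))
```

Because PixelShuffle only rearranges values, most of the computation can stay at low resolution, which is consistent with the compact inference graph the abstract describes.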
[55] Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA cs.CV | cs.AIPDF
Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie
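TL;DR: This paper proposes Dual Causal Inference (DCI), a new framework that addresses both observable and unobserved confounding bias in medical visual question answering (MedVQA). It is the first to unify Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning, and aims to extract deconfounded representations that capture genuine causal relationships, improving generalization and the reliability of diagnostic reasoning.
Details
Motivation: Existing MedVQA methods tend to overfit to superficial cross-modal correlations and neglect the biases inherent in multimodal medical data, which leaves models vulnerable to cross-modal confounding and unable to provide trustworthy diagnostic reasoning.
Result: Extensive experiments on four benchmark datasets (SLAKE, SLAKE-CP, VQA-RAD, and PathVQA) show that the method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization.
Insight: The main novelty is the first unified architecture that handles observable and unobserved confounders together through a Structural Causal Model (SCM): BDA mitigates observable bias, and an IV learned from a shared latent space compensates for unobserved confounding. To ensure the IV is valid, mutual information constraints maximize its dependence on the fused multimodal representation while minimizing its association with the unobserved confounders and the target answers. This improves the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.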
Abstract: Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
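For reference, the two ingredients combined in DCI can be written schematically as follows. The first line is the standard backdoor adjustment formula for an observed confounder set Z. The second is only a schematic reading of the mutual-information constraints described in the abstract, where V denotes the learned instrument, F the fused multimodal representation, U the unobserved confounders, and A the target answer; the symbols and the weights \lambda_1, \lambda_2 are assumptions for illustration, not the paper's notation or exact objective.

```latex
% Backdoor adjustment over an observed confounder set Z (standard formula)
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)

% Schematic instrument-validity objective (assumed form; \lambda_1, \lambda_2 are hypothetical weights)
\max_{V} \; I(V; F) \;-\; \lambda_1\, I(V; U) \;-\; \lambda_2\, I(V; A)
```

The first term of the schematic objective corresponds to instrument relevance, and the two penalty terms correspond to keeping the instrument from acting through the confounders or directly on the answer.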
[56] UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval cs.CV | cs.MMPDF
Haokun Wen, Xuemeng Song, Haoyu Zhang, Xiangyu Zhao, Weili Guan
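TL;DR: This paper proposes UniCVR, the first unified zero-shot composed visual retrieval framework. It handles composed image retrieval, multi-turn composed image retrieval, and composed video retrieval together, without any task-specific human-annotated data. The framework combines the compositional query understanding of multimodal large language models (MLLMs) with the structured visual retrieval of vision-language pre-trained (VLP) models in two stages: Stage I trains the MLLM as a compositional query embedder, and Stage II introduces an MLLM-guided dual-level reranking mechanism.
Details
Motivation: Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval share a common paradigm (composing reference visual content with modification text to retrieve a target), yet they have been studied in isolation, with no unified framework and no zero-shot solution.
Result: Extensive experiments on five benchmarks covering all three tasks show that UniCVR achieves state-of-the-art performance, validating its effectiveness and generalizability.
Insight: The contributions include: 1) the first unified zero-shot composed visual retrieval framework, which brings three related tasks together; 2) a two-stage design that combines the complementary strengths of MLLMs and VLP models; 3) a cluster-based hard negative sampling strategy that strengthens contrastive supervision; 4) MLLM-guided adaptive budgeted subset scoring and dual-level reranking, which improve ranking accuracy with minimal computational overhead.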
Abstract: Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.
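The general shape of the Stage II idea, spending an expensive scorer only on a few top-ranked candidates, can be sketched as follows. This is a deliberately simplified stand-in and not UniCVR's adaptive budgeted subset scoring or dual-level re-scoring: the fixed budget `k`, the linear fusion weight `alpha`, and the `rescore_fn` callback (which would wrap an MLLM relevance judgment) are all hypothetical.

```python
def rerank_top_k(scores, rescore_fn, k=3, alpha=0.5):
    """Rerank only the k best first-stage candidates with a costly second scorer."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    head, tail = order[:k], order[k:]
    # Fuse the cheap retrieval score with the expensive relevance score.
    fused = {i: (1 - alpha) * scores[i] + alpha * rescore_fn(i) for i in head}
    head = sorted(head, key=lambda i: fused[i], reverse=True)
    return head + tail  # candidates outside the budget keep their original order

# Toy example: first-stage similarities and hypothetical second-stage relevance.
first_stage = [0.9, 0.8, 0.1, 0.7]
relevance = {0: 0.0, 1: 1.0, 3: 0.5}
ranking = rerank_top_k(first_stage, lambda i: relevance[i], k=3)
print(ranking)
```

Since the second scorer is called only `k` times per query, the added cost stays small relative to scoring the whole gallery.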
[57] SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark cs.CVPDF
Gui Wang, YongSong Zhou, Kaijun Deng, Wooi Ping Cheah, Rong Qu
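TL;DR: This paper introduces SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) spatiotemporal reasoning by multimodal large language models (MLLMs) on surgical videos. The benchmark covers 7 surgical specialties and 35 procedures and assesses five core reasoning dimensions through a structured CoT framework (Question-Option-Knowledge-Clue-Answer). An evaluation of 10 leading MLLMs shows that commercial models outperform open-source and medical-specialized models and that significant reasoning gaps remain, while SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning.
Details
Motivation: Fine-grained spatiotemporal reasoning on surgical videos is critical, but the capabilities of MLLMs in this domain remain largely unexplored, so a benchmark is needed to bridge the gap between MLLM capabilities and the demands of clinical reasoning.
Result: An evaluation of 10 leading MLLMs on SurgCoT shows that: 1) commercial models outperform open-source and medical-specialized models; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning.
Insight: The novelty lies in a unified, structured benchmark (SurgCoT) whose intensive annotation protocol, with "Knowledge" and "Clue" fields, systematically evaluates multi-dimensional chain-of-thought reasoning by MLLMs on surgical videos, providing a reproducible testbed for narrowing the gap between model capabilities and clinical needs.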
Abstract: Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
[58] Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CVPDF
Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo
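TL;DR: This paper proposes HyLaR, a framework for hybrid latent reasoning that combines discrete text generation with continuous visual latent representations. It uses DePO (Decoupled Policy Optimization) to apply reinforcement learning in the hybrid space, decomposing the policy gradient objective and imposing independent trust-region constraints to optimize the reasoning process effectively.
Details
Motivation: Existing chain-of-thought methods for vision tasks typically discretize signals to fit LLM inputs, causing early semantic collapse and loss of fine-grained detail, while external tools introduce a rigid bottleneck that limits reasoning flexibility. Latent reasoning paradigms internalize visual states to overcome these limitations, but optimizing the resulting hybrid discrete-continuous action space remains challenging.
Result: On fine-grained perception and general multimodal understanding benchmarks, HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning methods, achieving SOTA performance.
Insight: The novelty lies in the hybrid latent reasoning framework HyLaR and the decoupled policy optimization method DePO, which constrains the textual and latent components independently and adds an exact closed-form vMF KL regularizer, addressing the optimization difficulty of the hybrid action space and improving the flexibility and performance of multimodal reasoning.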
Abstract: Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
[59] Image Generators are Generalist Vision Learners cs.CV | cs.AIPDF
Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun
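TL;DR: This paper introduces Vision Banana, a model that reframes visual perception tasks as image generation and uses image generation pretraining to learn general visual representations. It achieves SOTA performance on a variety of 2D and 3D vision tasks while retaining the base model's image generation capabilities.
Details
Motivation: The paper asks whether generative vision models have strong visual understanding capabilities, and whether image generation pretraining can serve as a general vision learning paradigm in the way LLM pretraining does for language.
Result: The model surpasses Segment Anything Model 3 on segmentation tasks, rivals the Depth Anything series on depth estimation, and reaches SOTA results or results comparable to domain-specialist models on a variety of vision tasks.
Insight: The novelty lies in parameterizing the outputs of vision tasks as RGB images, which unifies perception and generation in one framework. Viewed objectively, image generation pretraining may become a core paradigm for building foundation vision models, much like the role of text generation in language tasks.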
Abstract: Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model’s image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation’s role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
[60] X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis cs.CVPDF
Gui Wang, Zehao Zhong, YongSong Zhou, Yudong Li, Ende Wu
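TL;DR: This paper presents X-PCR, the first benchmark for evaluating the cross-modality progressive clinical reasoning of multimodal large language models in ophthalmic diagnosis. The benchmark simulates the full diagnostic workflow with a progressive reasoning chain and a cross-modality integration task. It is built from 26,415 images and 177,868 expert-verified VQA pairs and covers 52 ophthalmic diseases. An evaluation of 21 MLLMs reveals critical shortcomings in progressive reasoning and cross-modal integration.
Details
Motivation: The clinical reasoning capacity of existing MLLMs has not been properly evaluated, and current benchmarks mostly use single-modality data, so they cannot assess the progressive reasoning and cross-modal integration that are essential in clinical practice.
Result: An evaluation of 21 MLLMs on the proposed X-PCR benchmark reveals significant gaps in progressive reasoning and cross-modal integration.
Insight: The novelty lies in building the first cross-modality reasoning benchmark that simulates the complete clinical workflow (a six-stage reasoning chain from image quality assessment to clinical decision-making) and integrates six imaging modalities, providing a comprehensive, structured testbed for evaluating the clinical diagnostic capabilities of MLLMs.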
Abstract: Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, which mostly rely on single-modality data, cannot evaluate the progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: https://github.com/CVI-SZU/X-PCR.
[61] SignDATA: Data Pipeline for Sign Language Translation cs.CV | cs.CLPDF
Kuanwei Chen, Tingyi Lin
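TL;DR: This paper introduces SignDATA, a data preprocessing toolkit for sign language translation. It addresses the difficulty of processing sign-language datasets consistently when they differ in annotation schema, clip timing, signer framing, and privacy constraints. The toolkit provides two end-to-end recipes, one producing standardized pose data and one producing cropped video data, and its modular design supports different backends and configurations to improve the reproducibility and comparability of sign-language research.
Details
Motivation: Existing sign-language preprocessing pipelines are usually fragmented, tied to a specific backend, and poorly documented, so the conversion from raw video to training-ready pose or video data lacks standardization, which hinders fair comparison and reproducible research.
Result: The toolkit is validated through a research-oriented evaluation design that includes backend comparison, preprocessing ablations, and privacy-aware video generation. The abstract does not report performance levels (such as SOTA) on any specific benchmark.
Insight: The main novelty is a config-driven, reproducible preprocessing framework that makes decisions such as extractor choice, normalization policy, and privacy tradeoffs explicit and configurable, with flexibility and comparability provided by a common interface, typed job schemas, and per-stage checkpointing.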
Abstract: Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable. Code is available at https://github.com/balaboom123/signdata-slt.
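The idea behind per-stage checkpointing with config- and manifest-aware hashes can be illustrated with a small sketch: a stage's cached output is keyed by a digest of the stage name, its configuration, and the manifest rows it consumes, so a change to any of them invalidates the checkpoint. This is not SignDATA's actual API; the function name `stage_key`, the config fields, and the manifest layout are hypothetical.

```python
import hashlib
import json

def stage_key(stage, config, manifest_rows):
    """Deterministic checkpoint key for one pipeline stage (illustrative only)."""
    payload = json.dumps(
        {"stage": stage, "config": config, "manifest": manifest_rows},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical manifest and two configs that differ only in the pose backend.
manifest = [{"clip_id": "a001", "start": 0.0, "end": 2.5}]
key_mediapipe = stage_key("landmarks", {"backend": "mediapipe", "fps": 25}, manifest)
key_mmpose = stage_key("landmarks", {"backend": "mmpose", "fps": 25}, manifest)
print(key_mediapipe[:12], key_mmpose[:12])
```

A pipeline built this way can skip a stage whenever an output stored under the same key already exists, and recompute it when the backend or the manifest changes.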
[62] Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models cs.CVPDF
Rong Quan, Yantao Lai, Dong Liang, Jie Qin
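TL;DR: This paper proposes ScanVLA for object referring-guided scanpath prediction, a task that predicts the attention scanpath of a person searching a visual scene for a specific target object given a language description. The model uses a vision-language model to extract aligned multimodal features, and improves its perception of fine-grained positional information through a history-enhanced decoder and a frozen Segmentation LoRA component, yielding a significant gain in prediction performance.
Details
Motivation: To address the key problem of how to fuse visual and linguistic information effectively under object referring, so as to predict accurately the human attention scanpath during visual search.
Result: In extensive experiments on object referring-guided scanpath prediction, ScanVLA significantly outperforms existing scanpath prediction methods.
Insight: The novelty lies in being the first to use a vision-language model to extract aligned multimodal features, in a History Enhanced Scanpath Decoder that directly uses the positions of historical fixations, and in a frozen Segmentation LoRA used as an auxiliary component to localize the referred object more precisely; these changes improve task performance without a large increase in computational or time cost.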
Abstract: Object Referring-guided Scanpath Prediction (ORSP) aims to predict the human attention scanpath when they search for a specific target object in a visual scene according to a linguistic description describing the object. Multimodal information fusion is a key point of ORSP. Therefore, we propose a novel model, ScanVLA, to first exploit a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance the ScanVLA’s perception of fine-grained positional information, we not only propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes historical fixations’ position information as input to help predict a more reasonable position for the current fixation, but also adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object more precisely, which improves the scanpath prediction task without incurring additional large computational and time costs. Extensive experimental results demonstrate that ScanVLA can significantly outperform existing scanpath prediction methods under object referring.
[63] Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation cs.CVPDF
Xingyu Zhu, Junfeng Fang, Shuo Wang, Beier Zhu, Zhicai Wang
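TL;DR: This paper proposes MPD, a dual-stage framework that mitigates hallucinations in large vision-language models (LVLMs) while avoiding performance degradation. The method uses semantic-aware component disentanglement to extract pure hallucination components, and interpretable parameter updates to selectively modify the parameters most relevant to hallucination. Experiments show that MPD achieves state-of-the-art performance on the LLaVA-Bench and MME benchmarks, reducing hallucinations by 23.4% while retaining 97.4% of general generative capability, with no additional computational cost.
Details
Motivation: LVLMs have strong generative capabilities but frequently hallucinate, which undermines output reliability. Fine-tuning on annotated data is the most direct solution but is computationally expensive, while existing representation-based methods, though efficient, degrade general generation capacity because they extract hallucination components incompletely and update parameters non-selectively.
Result: Extensive experiments on LLaVA-Bench and MME show that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4% while retaining 97.4% of general generative capability, with no additional computational cost.
Insight: The novelty is the dual-stage MPD framework, which combines semantic-aware component disentanglement, for a purer extraction of hallucination components, with interpretable selective parameter updates that avoid harming general generative capability. Viewed objectively, the method mitigates hallucination efficiently while avoiding performance degradation, offering a new approach to balancing model reliability and capability.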
Abstract: Large Vision-Language Models (LVLMs) exhibit powerful generative capabilities but frequently produce hallucinations that compromise output reliability. Fine-tuning on annotated data devoid of hallucinations offers the most direct solution, while its high computational cost motivates recent representation-based methods, which focus on mitigating hallucinatory components within hidden representations. Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. Specifically, our MPD relies on two essential factors: (1) semantic-aware component disentanglement to extract pure hallucination components, and (2) interpretable parameter updates that selectively modify parameters most relevant to hallucination. Extensive experiments demonstrate that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4% while maintaining 97.4% of general generative capability as evaluated on LLaVA-Bench and MME, with no additional computational cost.
[64] LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel cs.CV | cs.AIPDF
Zhe Feng, Sen Lian, Changwei Wang, Muyang Zhang, Tianlong Tan
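TL;DR: LaplacianFormer is a new Transformer variant that replaces softmax attention with a Laplacian kernel to reduce computational complexity and improve mid-range token interactions. The method combines theoretical analysis with empirical observation, introduces a provably injective feature map to preserve fine-grained information, and uses a Nyström approximation with Newton–Schulz iteration for efficient computation.
Details
Motivation: To address the quadratic complexity of softmax attention. Existing linear attention methods approximate with Gaussian kernels, which lack theoretical grounding and oversuppress mid-range token interactions, so the paper proposes a Laplacian-kernel alternative.
Result: Experiments on ImageNet show that LaplacianFormer achieves a strong performance-efficiency trade-off while improving attention expressiveness.
Insight: The contributions are: 1) a Laplacian kernel as a principled alternative to softmax; 2) a provably injective feature map that prevents expressiveness degradation under low-rank approximation; 3) a Nyström approximation combined with Newton–Schulz iteration for efficient computation without matrix inversion or SVD; 4) custom CUDA implementations that support edge deployment.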
Abstract: The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton–Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
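The combination of a Nyström approximation with Newton–Schulz iteration can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation: the L1 form of the Laplacian kernel, the bandwidth `sigma`, the evenly spaced landmark choice, and the iteration count are all assumptions, and the paper's injective feature map, row normalization, and CUDA kernels are omitted.

```python
import numpy as np


def laplacian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||_1 / sigma); the L1 form and sigma are assumptions.
    dist = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)
    return np.exp(-dist / sigma)


def newton_schulz_inverse(A, num_iters=30):
    # Classical Newton-Schulz iteration X <- X (2I - A X), which needs only
    # matrix products. The start A^T / (||A||_1 ||A||_inf) guarantees convergence
    # for nonsingular A.
    eye = np.eye(A.shape[0])
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        X = X @ (2.0 * eye - A @ X)
    return X


rng = np.random.default_rng(0)
n, m, dim = 64, 16, 8
tokens = rng.standard_normal((n, dim))
values = rng.standard_normal((n, dim))
landmark_idx = np.arange(0, n, n // m)  # every 4th token as a landmark (assumption)
landmarks = tokens[landmark_idx]

K_nm = laplacian_kernel(tokens, landmarks, sigma=4.0)     # n x m
K_mm = laplacian_kernel(landmarks, landmarks, sigma=4.0)  # m x m
K_mm_inv = newton_schulz_inverse(K_mm)

# Nystrom: K ~= K_nm K_mm^{-1} K_nm^T. Multiplying right to left avoids ever
# forming the n x n kernel matrix, so cost is linear in n for fixed m.
out = K_nm @ (K_mm_inv @ (K_nm.T @ values))  # unnormalized attention output
```

One property worth noting is that the Nyström approximation reproduces the kernel rows of the landmark tokens exactly (up to the accuracy of the inverse); the approximation error is confined to the remaining rows.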
[65] Self-supervised pretraining for an iterative image size agnostic vision transformer cs.CVPDF
Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool
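TL;DR: This paper proposes a sequential-to-global self-supervised learning framework, based on DINO's self-distillation objective, for training an image-size agnostic vision Transformer. An efficient integral-image patch extraction method enables large-scale pretraining, and the approach achieves competitive performance on ImageNet-1K and downstream classification tasks while keeping computational cost independent of input resolution.
Details
Motivation: To address the computational inefficiency of existing vision Transformers (such as DINO) in self-supervised learning and their difficulty with high-resolution images, with the aim of developing an image-size agnostic foundational backbone.
Result: Competitive performance on ImageNet-1K and downstream classification tasks, with a constant computational budget that does not change with input resolution.
Insight: The contributions include the sequential-to-global self-supervised learning framework, the integral-image patch extraction method, and iterative processing of multi-zoom patches to achieve resolution agnosticism, all of which can inform the design of efficient vision models.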
Abstract: Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO’s self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
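The abstract does not describe the integral-image patch extraction in detail, so the following is only a sketch of the classical summed-area-table technique it presumably builds on. The function names, the single-channel input, and the requirement that the window size be divisible by the output size are assumptions made for illustration. The point is that once the table is built, a window at any zoom level can be average-pooled to a fixed-size patch with a constant number of lookups per output pixel.

```python
import numpy as np


def integral_image(img):
    # Zero-padded summed-area table: S[i, j] = sum of img[:i, :j].
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return S


def box_sum(S, top, left, bottom, right):
    # Sum of img[top:bottom, left:right] from four table lookups.
    return S[bottom, right] - S[top, right] - S[bottom, left] + S[top, left]


def zoomed_patch(S, top, left, size, out=4):
    # Average-pool a size x size window down to an out x out patch.
    # Assumes size is divisible by out.
    cell = size // out
    patch = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            t, l = top + i * cell, left + j * cell
            patch[i, j] = box_sum(S, t, l, t + cell, l + cell) / (cell * cell)
    return patch


rng = np.random.default_rng(0)
img = rng.random((32, 32))
S = integral_image(img)
fine = zoomed_patch(S, 12, 12, size=8)    # small window, high zoom
coarse = zoomed_patch(S, 0, 0, size=32)   # whole image, low zoom
```

Both patches are 4 x 4 regardless of the window they cover, which is what keeps the per-step cost independent of image size.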
[66] SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation cs.CV | cs.ROPDF
Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz
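TL;DR: This paper proposes SpaCeFormer, a proposal-free open-vocabulary 3D instance segmentation method whose space-curve Transformer architecture runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than existing multi-stage 2D+3D methods. It also builds the large-scale SpaCeFormer-3M dataset, with 3 million multi-view-consistent captions over 604K instances, and achieves significant zero-shot gains on several benchmarks.
Details
Motivation: To remove the bottlenecks of existing open-vocabulary 3D instance segmentation methods: multi-stage 2D+3D pipelines are slow (hundreds of seconds per scene), while end-to-end methods rely on fragmented masks and external region proposals.
Result: 11.1 zero-shot mAP on ScanNet200, a 2.8x improvement over the prior best proposal-free method; 22.9 and 24.1 mAP on ScanNet++ and Replica respectively, surpassing all prior methods (including those using multi-view 2D inputs).
Insight: The contributions include: 1) spatially coherent feature extraction that combines spatial window attention with Morton-curve serialization; 2) a RoPE-enhanced decoder that predicts instance masks directly from learned queries without external proposals; 3) a method for building a high-quality large-scale dataset through multi-view mask clustering and vision-language model captioning.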
Abstract: Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
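Morton-curve serialization, mentioned above, orders 3D voxels along a Z-order curve by interleaving the bits of their integer coordinates, so that voxels close in the sequence tend to be close in space. The sketch below shows only this standard technique; the function names and the 10-bit coordinate width are assumptions, and how SpaCeFormer quantizes points or combines the ordering with window attention is not shown.

```python
def morton3d(x, y, z, bits=10):
    # Interleave the low `bits` bits of x, y, z: bit b of x lands at position 3b,
    # bit b of y at 3b + 1, and bit b of z at 3b + 2.
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def serialize(voxels):
    # Return voxel indices sorted along the Morton (Z-order) curve.
    return sorted(range(len(voxels)), key=lambda i: morton3d(*voxels[i]))


voxels = [(3, 3, 3), (0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 2, 2)]
order = serialize(voxels)  # indices of voxels in curve order
```

A sequence model can then attend over fixed-length windows of this ordering as a cheap stand-in for spatial neighborhoods.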
[67] Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing cs.CVPDF
Xi Chen, Xu Chen, Xiangyang Jia, Xu Zhang, Shuquan Wei
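TL;DR: This paper proposes Fast-then-Fine, a two-stage retrieval framework for remote sensing image-text cross-modal retrieval, where fine-grained alignment and retrieval efficiency are hard to achieve together. The framework decomposes retrieval into a text-agnostic recall stage and a text-guided rerank stage: the former uses coarse-grained representations for efficient candidate selection, and the latter strengthens fine-grained alignment with a parameter-free balanced text-guided interaction block. An inter- and intra-modal loss jointly optimizes the multi-granular representations.
Details
Motivation: To address the difficulty of cross-modal retrieval caused by dense multi-object distributions and complex backgrounds in remote sensing imagery: existing methods either rely on complex cross-modal interactions that make retrieval inefficient, or require large-scale pre-training that consumes massive data and computational resources.
Result: Extensive experiments on public benchmarks show that the proposed FTF framework achieves competitive retrieval accuracy while significantly improving retrieval efficiency over existing methods.
Insight: The core idea is to decompose retrieval into two decoupled stages, fast recall and fine reranking, and to introduce a parameter-free text-guided interaction block for fine-grained alignment, which avoids additional learnable parameters; multi-granular representations and a joint loss design optimize the alignment, balancing accuracy and efficiency.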
Abstract: Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
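The recall-then-rerank decomposition can be illustrated with a small NumPy sketch. Everything specific here is an assumption rather than the paper's design: the global embedding is a plain mean of region features, and the "parameter-free text-guided interaction" is approximated by softmax-weighted pooling of region features using their similarity to the text, with a hypothetical `temperature`. The sketch only shows why the split is efficient: stage one needs a single matrix-vector product over precomputed text-agnostic embeddings, and the more expensive text-conditioned scoring touches only k candidates.

```python
import numpy as np


def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def recall(text_vec, image_global, k):
    # Stage 1: coarse, text-agnostic image embeddings scored by one mat-vec product.
    scores = image_global @ text_vec
    return np.argsort(-scores)[:k]


def rerank(text_vec, image_regions, candidates, temperature=0.1):
    # Stage 2: pool each candidate's region features with text-conditioned softmax
    # weights (no learned parameters), then rescore against the text.
    scored = []
    for idx in candidates:
        regions = image_regions[idx]                      # (num_regions, dim)
        sims = regions @ text_vec
        weights = np.exp((sims - sims.max()) / temperature)
        weights /= weights.sum()
        pooled = l2norm(weights @ regions)
        scored.append((float(pooled @ text_vec), int(idx)))
    return [idx for _, idx in sorted(scored, reverse=True)]


rng = np.random.default_rng(0)
num_images, num_regions, dim = 20, 5, 32
image_regions = l2norm(rng.standard_normal((num_images, num_regions, dim)))
text_vec = l2norm(rng.standard_normal(dim))
image_regions[3, 0] = text_vec                    # plant a matching region in image 3
image_global = l2norm(image_regions.mean(axis=1))

candidates = recall(text_vec, image_global, k=5)
ranked = rerank(text_vec, image_regions, candidates)
```

Whether a relevant image survives stage one depends entirely on the quality of the coarse embeddings, which is the usual trade-off of this design.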
[68] CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs cs.CVPDF
Xingcheng Zhou, Hao Guo, Rui Song, Walter Zimmer, Mingyu Liu
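TL;DR: This paper presents CCTVBench, a traffic video question-answering benchmark based on contrastive consistency. It pairs real accident videos with world-model-generated counterfactual videos and adds minimally different, mutually exclusive hypothesis questions to evaluate multimodal large language models on safety-critical traffic reasoning.
Details
Motivation: Safety-critical traffic reasoning requires contrastive consistency: a model must detect true hazards when an accident occurs and reliably reject plausible-but-false hypotheses in near-identical counterfactual scenes.
Result: Experiments show a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency for both open-source and proprietary video LLMs, with unreliable "none of the above" rejection as a key bottleneck. The proposed C-TCD contrastive decoding method uses a semantically exclusive counterpart video as the contrast input at inference time and improves both instance-level QA and contrastive consistency.
Insight: The novelty is a contrastive-consistency benchmark built on paired real and counterfactual videos, with an actionable diagnostic framework that decomposes failure modes; the proposed C-TCD method introduces contrastive decoding and effectively improves reasoning reliability in complex traffic scenes.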
Abstract: Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
[69] DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion cs.CVPDF
Yongji Long, Shijun Liang, Jintao Li, Yun Li
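TL;DR: This paper proposes DynamicRad, a content-adaptive sparse attention paradigm for long video diffusion models. Building on a radial locality prior, it uses a dual-mode strategy (static-ratio and dynamic-threshold) to balance efficiency and quality, and relies on offline Bayesian optimization and a semantic motion router for robust, low-overhead adaptive sparsification.
Details
Motivation: Existing methods sparsify with rigid static masks, which can lose critical long-range information in complex dynamic scenes and lead to a poor efficiency-quality trade-off. The paper targets adaptive sparse attention for long video diffusion.
Result: On HunyuanVideo and Wan2.1-14B, DynamicRad achieves 1.7x to 2.5x inference speedups with over 80% effective sparsity, and in some long-sequence settings the dynamic mode even matches or exceeds the dense baseline.
Insight: The contributions include adaptive sparse attention grounded in a radial locality prior, a dual-mode execution strategy, and a robust optimization pipeline that combines offline Bayesian optimization with a lightweight semantic motion router, giving video diffusion models an efficient sparse attention mechanism that preserves quality.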
Abstract: Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency–quality Pareto frontier, achieving 1.7×–2.5× inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.
[70] Video-ToC: Video Tree-of-Cue Reasoning cs.CVPDF
Qizhong Tan, Zhuotao Tian, Guangming Lu, Jun Yu, Wenjie Pei
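TL;DR: This paper proposes Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. It has three key innovations: a tree-guided visual cue localization mechanism, a dynamic reward mechanism based on estimated reasoning demand, and an automated annotation pipeline that builds the training datasets. Evaluations on multiple video understanding benchmarks demonstrate its superiority.
Details
Motivation: Existing video large language models show limited reasoning capability and may hallucinate on complex video understanding tasks, mainly because they rely on pretrained inherent reasoning rationales and lack perception-aware adaptation to the input video content.
Result: Extensive evaluations on six video understanding benchmarks and one video hallucination benchmark show that Video-ToC outperforms baselines and recent methods.
Insight: The main contributions are: 1) tree-guided fine-grained visual cue localization, which strengthens the model's perception; 2) a dynamic reinforcement learning reward mechanism based on estimated reasoning demand, which provides on-demand incentives; 3) an automated pipeline for constructing the datasets used in supervised fine-tuning and reinforcement learning. Viewed objectively, combining a structured (tree-shaped) reasoning pattern with a dynamic reward strategy is an effective attempt at adaptive, fine-grained reasoning over video content.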
Abstract: Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.
[71] ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CVPDF
Wentao Yan, Shengqin Wang, Huichi Zhou, Yihang Chen, Kun Shao
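TL;DR: This paper proposes ProMMSearchAgent, a generalizable multimodal search agent trained with process-oriented rewards. It targets two obstacles to reinforcement learning for knowledge-intensive visual reasoning: the sparsity of outcome-based supervision and the unpredictability of live web environments. The method adopts a novel Sim-to-Real training paradigm that decouples policy learning into a deterministic, local static sandbox, and uses an introspective process-oriented reward to generate dense behavioral metadata that rewards the correct cognitive decision (starting a search only when visually or factually uncertain). Experiments show that the locally trained policy transfers zero-shot to the live Google Search API and reaches new state-of-the-art performance on several benchmarks.
Details
Motivation: To resolve two fundamental bottlenecks in training multimodal agents with reinforcement learning for knowledge-intensive visual reasoning: the extreme sparsity of outcome-based supervision signals and the unpredictability of live web environments.
Result: On the FVQA-test, InfoSeek, and MMSearch benchmarks, ProMMSearchAgent outperforms the previous MMSearch-R1 model by +5.1%, +6.3%, and +11.3% respectively, setting new state-of-the-art performance.
Insight: The core novelty is a Sim-to-Real training paradigm that decouples policy learning into a controllable static sandbox, together with an introspective process-oriented reward. The reward probes the boundaries of the agent's own parametric knowledge to generate dense supervision signals, explicitly rewarding the correct cognitive decision process (judging when an external search is needed), which offers a new approach to policy learning under sparse rewards and complex environments.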
Abstract: Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent’s own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
[72] Evian: Towards Explainable Visual Instruction-tuning Data Auditing cs.CV | cs.AIPDF
Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei
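TL;DR: This paper proposes EVIAN, a framework for explainable auditing of the training data of large vision-language models (LVLMs). It builds a 300K-sample benchmark injected with diverse subtle defects and introduces a new "Decomposition-then-Evaluation" paradigm that breaks model responses into three cognitive components (visual description, subjective inference, and factual claim), enabling targeted evaluation along three orthogonal axes: Image-Text Consistency, Logical Coherence, and Factual Accuracy.
Details
Motivation: Existing LVLM training datasets are of uneven quality, and current data filtering methods rely on coarse-grained scores that cannot identify subtle semantic flaws such as logical fallacies or factual errors, which has become a bottleneck in developing more reliable models.
Result: Experiments show that a model fine-tuned on a compact, high-quality subset selected by EVIAN consistently surpasses models trained on datasets orders of magnitude larger, challenging the prevailing scale-centric paradigm. The study also finds that Logical Coherence is the most critical factor in data quality evaluation.
Insight: The novelty is the "Decomposition-then-Evaluation" auditing paradigm, which breaks the complex auditing task into verifiable subtasks (visual description, subjective inference, factual claim) and evaluates them automatically along three orthogonal axes (consistency, coherence, accuracy), providing an explainable and effective tool for building high-quality training data.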
Abstract: The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel “Decomposition-then-Evaluation” paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.
[73] Exploring Spatial Intelligence from a Generative Perspective cs.CVPDF
Muzhi Zhu, Shunyao Jiang, Huanyi Zheng, Zekai Luo, Hao Zhong
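TL;DR: This paper introduces the concept of generative spatial intelligence (GSI) and GSI-Bench, the first benchmark for quantifying it, which comprises the real-world dataset GSI-Real and the synthetic dataset GSI-Syn. Experiments show that fine-tuning on GSI-Syn improves not only a model's generative spatial intelligence but also its spatial understanding.
Details
Motivation: Current evaluation of spatial intelligence in multimodal large language models is largely limited to the understanding perspective. The paper asks whether modern generative or unified multimodal models possess generative spatial intelligence (the ability to respect and manipulate 3D spatial constraints during image generation) and how this ability can be measured and improved.
Result: Experiments show that fine-tuning unified multimodal models on the synthetic GSI-Syn dataset yields substantial gains on both synthetic and real tasks and, unexpectedly, also improves downstream spatial understanding, providing the first clear evidence that generative training can tangibly strengthen spatial reasoning.
Insight: The main contribution is the first systematic definition and evaluation of spatial intelligence from a generative perspective, together with GSI-Bench, a scalable benchmark containing real and synthetic data. The core insight is that generative training can be a new route to stronger spatial reasoning in multimodal models, which challenges the conventional practice of evaluating understanding and generation separately.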
Abstract: Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
[74] Where are they looking in the operating room? cs.CVPDF
Keqi Chen, Séraphin Baributsa, Lilien Schewski, Vinkle Srivastav, Didier Mutter
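TL;DR: This paper introduces gaze-following to the operating room for the first time and proposes new gaze-prediction-based methods for clinical role prediction, surgical phase recognition, and team communication detection, achieving SOTA performance on the 4D-OR and Team-OR datasets.
Details
Motivation: To fill the gap in visual attention modeling for the operating room, a complex, high-stakes environment, in order to advance surgical workflow analysis, understanding of clinical roles, and research on team interaction.
Result: On the 4D-OR and Team-OR datasets, the approach reaches F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition, and improves team communication detection by over 30% compared with existing baselines, reaching SOTA.
Insight: The novelty is extending gaze-following to the surgical domain and proposing a gaze-heatmap-only approach and a self-supervised spatial-temporal encoding model, which provide a new perspective and new tools for surgical data analysis.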
Abstract: Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain, and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. Methods: We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following annotations and new team communication activity annotations. Then, we propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses gaze predictions only; for team communication detection, we train a spatial-temporal model in a self-supervised way that encodes gaze-based clip features, and then feed the features into a temporal activity detection model. Results: Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving previous best performances by over 30%. Conclusion: We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions.
[75] Beyond ZOH: Advanced Discretization Strategies for Vision Mamba cs.CV | cs.AIPDF
Fady Ibrahim, Guangjun Liu, Guanghui Wang
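TL;DR: This paper systematically compares six discretization schemes (ZOH, FOH, BIL, POL, HOH, RK4) within the Vision Mamba framework on vision tasks, finds that BIL offers the best trade-off between accuracy and efficiency, and recommends it as the default discretization baseline for SSM models.
Details
Motivation: The zero-order hold (ZOH) discretization used by Vision Mamba assumes that the input signal is constant within each sampling interval, which degrades temporal fidelity in dynamic visual environments and limits the accuracy of SSM-based vision models, so more advanced discretization strategies are worth exploring.
Result: On standard vision benchmarks (image classification, semantic segmentation, and object detection), POL and HOH give the largest accuracy gains but at higher training-time computational cost, while BIL gives consistent improvements over ZOH with modest additional overhead, offering the best balance of accuracy and efficiency.
Insight: The novelty is the first systematic evaluation of multiple discretization schemes in Vision Mamba, which reveals the key role of discretization in SSM-based vision architectures and, based on the empirical results, recommends BIL as the default discretization method for SOTA SSM models, giving model design a new direction for optimization.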
Abstract: Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time computation. In contrast, the BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.
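To illustrate the difference between two of the compared schemes, the sketch below discretizes a scalar continuous-time state equation h'(t) = a h(t) + b x(t) with the textbook ZOH and bilinear (Tustin) formulas. This is a generic illustration under assumptions, not the paper's code: Vision Mamba uses per-channel, input-dependent step sizes and its own parameterization, and the scalar `a`, `b`, and `dt` values here are arbitrary.

```python
import numpy as np


def zoh(a, b, dt):
    # Zero-order hold: exact if the input is constant over each step.
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def bilinear(a, b, dt):
    # Bilinear (Tustin) transform: a rational approximation of exp(dt * a).
    denom = 1.0 - 0.5 * dt * a
    return (1.0 + 0.5 * dt * a) / denom, dt * b / denom


def simulate(a_bar, b_bar, x):
    # Discrete recurrence h_k = a_bar * h_{k-1} + b_bar * x_k, starting from h = 0.
    h, out = 0.0, []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        out.append(h)
    return np.array(out)


a, b, dt = -1.0, 1.0, 0.1
x = np.ones(50)                                   # constant input
t = dt * np.arange(1, len(x) + 1)
exact = (b / a) * (np.exp(a * t) - 1.0)           # closed-form step response

h_zoh = simulate(*zoh(a, b, dt), x)
h_bil = simulate(*bilinear(a, b, dt), x)
```

For a constant input the ZOH recurrence matches the closed-form response, as its assumption is satisfied; the bilinear transform maps any stable pole (a < 0) inside the unit circle, which is one reason it is a common default in control and signal processing.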
[76] RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking cs.CV | cs.AIPDF
Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin
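TL;DR: This paper presents RSRCC, a question-answering benchmark dataset for remote sensing regional change comprehension with 126k questions. It addresses the limitation that traditional change detection only localizes changes and cannot explain in natural language what changed. The dataset focuses on fine-grained reasoning questions tied to specific semantic changes.
Details
Motivation: Existing remote sensing change captioning datasets usually describe only overall image-level differences and leave localized fine-grained semantic reasoning unexplored. Filling this gap requires a benchmark built specifically for fine-grained, reasoning-based remote sensing change question answering.
Result: The RSRCC dataset contains 126k questions, split into 87k training, 17.1k validation, and 22k test instances. To the authors' knowledge, it is the first remote sensing change question-answering benchmark designed explicitly for this kind of fine-grained reasoning supervision.
Insight: The novelty is a hierarchical semi-supervised curation pipeline whose core is retrieval-augmented Best-of-N ranking as the critical final disambiguation stage, which filters noisy and ambiguous candidates at scale while preserving semantically meaningful changes. This offers an effective way to build high-quality, fine-grained vision-language datasets.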
Abstract: Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.
[77] The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm cs.CV | cs.AIPDF
Karan Goyal, Dikshant Kukreja
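TL;DR: This paper questions the trustworthiness of current vision-language models (VLMs) for unified multimodal knowledge discovery, arguing that the dominant paradigm suffers from functional blindness: models over-rely on language priors and neglect visual representations. The authors propose an information-theoretic Modality Translation Protocol, define three new metrics (ToS, CoS, FoS) and the Semantic Sufficiency Criterion (SSC), and hypothesize a Divergence Law of Multimodal Scaling, aiming to lay the foundation for genuinely trustworthy multimodal reasoning.
Details
Motivation: To address the lack of trustworthiness in how current VLMs synthesize multimodal data, in particular that visual representation bottlenecks push models to over-rely on language priors, producing functional blindness and preventing them from extracting reliable knowledge from visual inputs.
Result: The Modality Translation Protocol quantifies the Expense of Seeing; the paper proposes the ToS, CoS, and FoS metrics and the SSC, and hypothesizes a Divergence Law of Multimodal Scaling, offering a new way to evaluate models. The abstract reports no specific benchmarks or quantitative results.
Insight: The novelty is an information-theoretic challenge to conventional multimodal evaluation that avoids conflating dataset bias with architectural incapacity, and the use of the SSC as an active architectural blueprint to push AI systems toward true multimodal reasoning rather than the pursuit of illusory multimodal gain.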
Abstract: The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics – the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing – culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of “multimodal gain”. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.
[78] R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs cs.CVPDF
Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
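TL;DR: This paper proposes R-CoV, a region-aware chain-of-verification method that alleviates object hallucinations in large vision-language models (LVLMs) as a training-free post-hoc step. Mimicking how humans focus on specific image regions, it elicits region-level processing from the model itself and uses it as a cue to detect and mitigate the model's own object hallucinations.
Details
Motivation: LVLMs perform well on multimodal understanding and reasoning tasks but still suffer from object hallucinations, i.e., claiming objects that do not exist in the visual input. Inspired by the way humans attend to the details of specific regions when comprehending complex visual information, the authors propose R-CoV.
Result: Extensive experiments on multiple LVLMs and widely used hallucination benchmarks show that R-CoV significantly alleviates object hallucinations in LVLMs.
Insight: The novelty is a post-hoc chain-of-verification framework that needs no training and no external detection model, and mitigates hallucinations by guiding the model through introspective region-level verification; the method is simple, effective, and easy to integrate into existing models.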
Abstract: Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information – often focusing on specific image regions or details within a given sample – we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.
[79] SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CVPDF
Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
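TL;DR: This paper proposes SSL-R1, a generic self-supervised reinforcement learning framework for post-training multimodal large language models (MLLMs). It derives verifiable rewards directly from images by recasting widely used self-supervised learning tasks as a set of verifiable visual puzzles, with no human or external-model supervision. Experiments on multiple multimodal understanding and reasoning benchmarks show substantial performance gains.
Details
Motivation: Existing reinforcement learning methods with verifiable rewards rely mainly on language-centric priors and expensive manual annotation, which limits both the intrinsic visual understanding of MLLMs and the scalability of reward design. The paper addresses this by deriving verifiable rewards directly from images.
Result: Models post-trained with SSL-R1 improve substantially on multiple multimodal understanding and reasoning benchmarks.
Insight: The core novelty is recasting visual self-supervised learning tasks as verifiable visual puzzles for RL post-training, yielding a vision-centric, scalable reward design framework that needs no manual annotation. This offers new ideas and practical experience for training multimodal models with RL at scale.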
Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs’ intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
[80] GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers cs.CVPDF
Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao
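TL;DR: GeoRelight is a unified framework that uses a multi-modal diffusion transformer to jointly solve geometry reconstruction and relighting of a person from a single photo. It introduces iNOD, a distortion-free 3D representation, and a mixed-data training strategy, and it outperforms both sequential models and methods that ignore geometry.
Details
Motivation: Relighting a person from a single photo is ill-posed. Existing methods either use sequential pipelines that accumulate errors or do not explicitly use 3D geometry during relighting, which limits physical consistency.
Result: By jointly optimizing geometry and relighting, GeoRelight outperforms sequential models and earlier systems that ignore geometry.
Insight: The novelty is a unified multi-modal diffusion transformer framework for joint geometry reconstruction and relighting, together with iNOD, a distortion-free 3D representation compatible with latent diffusion models, and a mixed training strategy that combines synthetic data with auto-labeled real data. Together these improve physical consistency and performance.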
Abstract: Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.
[81] Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback cs.CVPDF
Guotao Liang, Zhangcheng Wang, Juncheng Hu, Haitao Zhou, Ziteng Xue
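TL;DR: This paper proposes Render-in-the-Loop, a new generation paradigm that produces vector graphics (SVG) through visual self-feedback. It recasts SVG synthesis as a step-wise, visual-context-aware process: intermediate code states are rendered onto a cumulative canvas so that the model can observe the evolving visual context and use on-the-fly feedback to guide subsequent generation.
Details
Motivation: Existing SVG generation methods based on multimodal large language models (MLLMs) typically take an open-loop “blind drawing” approach, in which the model emits symbolic code sequences without perceiving intermediate visual results. This underuses the strong visual priors in MLLM vision encoders and makes it hard to handle partial canvas states and implicit occlusion relationships.
Result: On the standard MMSVGBench benchmark, a multimodal foundation model instantiated with this framework outperforms strong open-weight baselines, showing strong data efficiency and generalization on both Text-to-SVG and Image-to-SVG tasks.
Insight: The core novelty is bringing a visual feedback loop into SVG generation, together with fine-grained path decomposition to build dense multi-step visual trajectories, a Visual Self-Feedback (VSF) training strategy, and a Render-and-Verify (RaV) inference mechanism. These make use of intermediate visual states and give better control over generation.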
Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop “blind drawing” approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs’ vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
[82] Amodal SAM: A Unified Amodal Segmentation Framework with Generalization cs.CVPDF
Bo Zhang, Zhuotao Tian, Xin Tao, Songlin Tang, Jun Yu
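TL;DR: This paper proposes Amodal SAM, a unified amodal segmentation framework that uses SAM (Segment Anything Model) for amodal image and video segmentation. With a lightweight Spatial Completion Adapter, a Target-Aware Occlusion Synthesis pipeline, and novel learning objectives, it extends SAM to amodal segmentation while preserving its strong generalization ability.
Details
Motivation: Existing amodal segmentation methods focus mainly on segmentation within the training domain and generalize poorly to novel object categories and unseen scenes. This calls for a unified framework with strong generalization for predicting occluded regions.
Result: Amodal SAM achieves state-of-the-art performance on standard benchmarks while generalizing robustly to novel scenarios.
Insight: The novelties are a lightweight Spatial Completion Adapter for reconstructing occluded regions, a Target-Aware Occlusion Synthesis pipeline that addresses the scarcity of amodal annotations, and novel learning objectives that enforce regional consistency and topological regularization. Together these improve the generalization and practicality of amodal segmentation.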
Abstract: Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.
[83] Exploring High-Order Self-Similarity for Video Understanding cs.CVPDF
Manjin Kim, Heeseung Kwon, Karteek Alahari, Minsu Cho
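TL;DR: This paper proposes the Multi-Order Self-Similarity (MOSS) module to strengthen motion modeling for video understanding. By capturing space-time self-similarity features at different orders, the module integrates multiple aspects of video dynamics and yields substantial gains on several video tasks at marginal computational cost.
Details
Motivation: The goal is to explore higher-order space-time self-similarity (STSS) to represent temporal dynamics in video more fully, addressing the limits of existing methods in capturing complex motion patterns.
Result: Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks show substantial improvements, supporting MOSS as a general-purpose temporal modeling module.
Insight: The novelty is the notion of multi-order STSS and a lightweight MOSS module that learns and integrates these features. Viewed objectively, higher-order similarity exposes deeper structure in video dynamics and offers a new perspective on temporal modeling.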
Abstract: Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.
[84] GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction cs.CVPDF
Zhenlong Wu, Zihan Zheng, Xuanxuan Wang, Qianhe Wang, Hua Yang
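TL;DR: This paper proposes GeoRect4D, a unified framework for dynamic 3D reconstruction from sparse multi-view videos. Through a closed-loop optimization process, it couples explicit 3D geometric consistency with generative refinement to address geometric collapse, trajectory drift, and floating artifacts.
Details
Motivation: Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed. Existing methods bring in generative priors directly to fill in missing content, but the mismatch between stochastic 2D generation and deterministic 3D geometry often causes structural drift and temporal inconsistency.
Result: Extensive experiments on multiple datasets show that GeoRect4D achieves state-of-the-art reconstruction fidelity, perceptual quality, and spatiotemporal consistency.
Insight: The core novelty is a closed-loop optimization framework with a degradation-aware feedback mechanism that combines a robust anchor-based dynamic 3D Gaussian Splatting (3DGS) substrate with a single-step diffusion rectifier, using a structural locking mechanism and spatiotemporal coordinated attention to preserve physical plausibility. In addition, a progressive optimization strategy removes floaters through stochastic geometric purification and injects texture detail into the explicit representation through generative distillation.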
Abstract: Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.
[85] LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CVPDF
Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng
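TL;DR: LLaDA2.0-Uni is a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. With continuous visual inputs discretized by SigLIP-VQ, the backbone performs block-level masked diffusion over text and visual inputs, while the decoder reconstructs visual tokens into high-fidelity images. Prefix-aware optimizations in the backbone and few-step distillation in the decoder improve inference efficiency beyond parallel decoding. Backed by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized vision-language models (VLMs) in multimodal understanding while performing strongly in image generation and editing.
Details
Motivation: Current multimodal models usually handle understanding and generation separately. The aim is a natively unified, scalable framework that supports interleaved generation and reasoning.
Result: The paper claims that the model matches specialized VLMs in multimodal understanding and performs strongly in image generation and editing, but it names no specific benchmarks or SOTA comparisons.
Insight: The novelty is the native integration of a discrete diffusion mechanism with an MoE-based LLM backbone. Fully semantic discretization (SigLIP-VQ) and block-level masked diffusion handle text and visual inputs uniformly, and prefix-aware optimization plus few-step distillation improve inference efficiency, offering a promising paradigm for unified foundation models.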
Abstract: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
[86] OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model cs.CV | cs.AI | cs.CLPDF
Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang
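TL;DR: This paper presents OMIBench, a benchmark for evaluating the Olympiad-level multi-image reasoning of large vision-language models (LVLMs). It covers biology, chemistry, mathematics, and physics, and includes manually annotated rationales and evaluation protocols.
Details
Motivation: Current Olympiad-level multimodal reasoning benchmarks focus mainly on single-image analysis and do not exploit contextual information across multiple images, so a dedicated evaluation is needed for reasoning when evidence is distributed over several images.
Result: Extensive experiments on OMIBench reveal substantial performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, reach only about 50% accuracy, which establishes the benchmark as a key resource for studying multi-image reasoning.
Insight: The novelty is the first benchmark focused on Olympiad-level multi-image reasoning. With manually annotated rationales and a semantic matching protocol, it offers a standardized tool for improving multi-image contextual understanding in LVLMs.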
Abstract: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
[87] DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation cs.CVPDF
Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho, Hanbyul Joo
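TL;DR: This paper proposes DeVI, a framework that uses text-conditioned synthetic videos for dexterous physics-based character control. A hybrid reward combining 3D human tracking with 2D object tracking overcomes the physical imprecision of generated videos and enables zero-shot interaction with unseen objects.
Details
Motivation: Synthetic videos are hard to use directly for physics-based character control because of their low physical fidelity and purely 2D nature. The aim is to exploit the rich interaction knowledge in video generative models for motion planning in dexterous robotic manipulation.
Result: DeVI outperforms existing methods that imitate 3D human-object interaction demonstrations in modeling dexterous hand-object interaction. Its effectiveness is also validated in multi-object scenes and on text-driven action diversity, showing the advantage of video as an HOI-aware motion planner.
Insight: The novelty is a hybrid tracking reward that combines 3D human tracking with 2D object tracking, so that zero-shot generalization requires only the generated video. This offers a new way to use synthetic video for physics-based control.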
Abstract: Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.
cs.LG [Back]
[88] Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization cs.LG | cs.CLPDF
Carter Adams, Rafael Oliveira, Gabriel Almeida, Sofia Torres
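TL;DR: This paper gives a theoretical analysis of reinforcement fine-tuning with verifiable rewards (RLVR) for large vision-language models (LVLMs). By introducing the Tool-Augmented Markov Decision Process (TA-MDP) framework, it explains the convergence of GRPO, the effect of reward decomposition, and the generalization of policies to out-of-distribution tasks.
Details
Motivation: Reinforcement fine-tuning with verifiable rewards (e.g., Visual-ARFT) has succeeded empirically, but its theoretical basis is weak. There is no rigorous account of how the composite structure of rewards affects GRPO convergence, or of why training on a small set of tool-augmented tasks generalizes to out-of-distribution domains.
Result: Within the TA-MDP framework, the paper proves that GRPO under composite rewards converges to a first-order stationary point at rate O(1/√T) (Theorem 1); derives a Reward Decomposition Theorem that bounds the sub-optimality gap between decomposed and joint optimization (Theorem 2); and establishes a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (Theorem 3).
Insight: The novelty is the first systematic theoretical framework (TA-MDP) with rigorous guarantees for reinforcement fine-tuning of LVLMs. It characterizes when reward decomposition helps or hurts and why tool-use policies generalize, giving theoretical guidance for later algorithm design.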
Abstract: Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the Tool-Augmented Markov Decision Process (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (Theorem 1). Second, we derive a Reward Decomposition Theorem that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (Theorem 2). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (Theorem 3).
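To make the composite-reward setting concrete, the following is a minimal sketch of how a group-relative advantage could be computed from the three reward components named in the abstract (format compliance, answer accuracy, tool executability). The equal component weights, the binary component values, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, answer_correct, tool_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three binary reward components (weights are hypothetical)."""
    components = (float(format_ok), float(answer_correct), float(tool_ok))
    return sum(w * c for w, c in zip(weights, components))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: A_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same prompt (toy values).
group = [
    composite_reward(True, True, True),
    composite_reward(True, False, True),
    composite_reward(False, False, False),
    composite_reward(True, True, False),
]
advantages = group_relative_advantages(group)
```

Because the advantages are standardized within the group, they sum to roughly zero, and responses that satisfy more reward components receive larger advantages.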
[89] DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data cs.LG | cs.AI | cs.CL | cs.IRPDF
Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yusheng Song
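TL;DR: This paper presents DR-Venus, a 4B-parameter edge-scale deep research agent trained on only about 10K open data. Using a two-stage recipe (agentic supervised fine-tuning followed by agentic reinforcement learning) with improved data quality and utilization, the model significantly outperforms prior models under 9B parameters on multiple deep research benchmarks and narrows the gap to 30B-class models.
Details
Motivation: The aim is to train a strong edge-scale deep research agent from limited open data, meeting practical deployment needs for cost, latency, and privacy.
Result: On multiple deep research benchmarks, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters and narrows the gap to 30B-class systems, showing the deployment potential of small models.
Insight: The novelties are agentic supervised fine-tuning combined with strict data cleaning and resampling of long-horizon trajectories to improve data quality and utilization, and turn-level rewards based on information gain and format-aware regularization that increase supervision density and improve credit assignment in reinforcement learning. Further analysis suggests that 4B agents already have strong performance potential, underlining the value of test-time scaling.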
Abstract: Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open data, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.
[90] Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine cs.LG | cs.AI | cs.CLPDF
Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu
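TL;DR: This paper proposes BMBE, a modular medical dialogue framework that separates the language model from a Bayesian inference engine, addressing the way existing autonomous diagnostic agents conflate natural-language processing with probabilistic reasoning. The LLM serves only as a sensor that handles patient language, while diagnostic inference is performed by a separate Bayesian engine, yielding privacy protection, calibrated diagnosis, and cost efficiency.
Details
Motivation: When used as autonomous diagnostic agents, large language models conflate two capabilities, natural-language communication and probabilistic reasoning, which the authors regard as an architectural flaw rather than an engineering shortcoming. The paper strictly separates the two functions through modular design to improve diagnostic reliability, privacy, and interpretability.
Result: Validated on empirical and LLM-generated knowledge bases, a cheap sensor paired with the engine outperforms a frontier standalone model at a fraction of the cost, indicating that the advantage is architectural rather than informational. The framework is also robust to adversarial patient communication styles.
Insight: The novelty is a modular architecture that decouples language processing from probabilistic reasoning, giving the diagnostic system calibrated selective diagnosis, a statistical separation advantage, and built-in privacy protection. Viewed objectively, this separation gives medical AI an efficient solution with auditable, replaceable components and avoids the uncertainty of black-box LLM reasoning.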
Abstract: Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
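The division of labor described in the abstract can be illustrated with a toy belief update. The sketch below assumes a naive-Bayes engine (findings conditionally independent given the condition) and a simple posterior threshold for selective diagnosis; the abstract does not specify BMBE's actual inference model, and the conditions, findings, probabilities, and function names here are all hypothetical.

```python
def bayes_update(belief, likelihoods, finding, present):
    """One deterministic update: posterior(d) is proportional to belief(d) * P(finding state | d)."""
    unnormalized = {}
    for condition, prob in belief.items():
        p_finding = likelihoods[condition][finding]
        unnormalized[condition] = prob * (p_finding if present else 1.0 - p_finding)
    total = sum(unnormalized.values())
    return {condition: value / total for condition, value in unnormalized.items()}


def selective_diagnosis(belief, threshold):
    """Return the top condition only if its posterior clears the threshold, else abstain (None)."""
    best = max(belief, key=belief.get)
    return best if belief[best] >= threshold else None


# Toy knowledge base: P(finding present | condition). All values are made up.
likelihoods = {
    "flu": {"fever": 0.9, "cough": 0.8},
    "cold": {"fever": 0.2, "cough": 0.7},
}
belief = {"flu": 0.5, "cold": 0.5}

# Structured evidence as the LLM "sensor" might emit it after parsing patient utterances.
for finding, present in [("fever", True), ("cough", True)]:
    belief = bayes_update(belief, likelihoods, finding, present)
```

Raising the threshold makes the engine abstain more often, which is one simple way to obtain an adjustable accuracy-coverage tradeoff of the kind the abstract mentions.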
[91] CHASM: Unveiling Covert Advertisements on Chinese Social Media cs.LG | cs.AI | cs.CL | cs.CV | cs.CYPDF
Jingyi Zheng, Tianyi Hu, Yule Liu, Zhen Sun, Zongmin Zhang
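TL;DR: This paper introduces CHASM, the first dataset for evaluating how well multimodal large language models (MLLMs) detect covert advertisements on social media. Built from real-world scenarios on the Chinese social media platform Rednote, it contains 4,992 high-quality anonymized instances. Experiments show that current MLLMs cannot reliably detect covert advertisements under either zero-shot or in-context learning settings. Fine-tuning on the dataset improves performance, but challenges remain, such as detecting subtle cues in comments and differences in visual and textual structure.
Details
Motivation: Existing benchmarks for evaluating large language models in social media moderation completely overlook covert advertisements, a serious threat: disguised as regular posts, they deceive consumers and raise significant ethical and legal concerns.
Result: Under zero-shot and in-context learning settings, no current MLLM detects covert advertisements reliably enough. Fine-tuning open-source MLLMs on CHASM yields noticeable gains, but significant challenges remain.
Insight: The novelty is CHASM, the first high-quality multimodal evaluation dataset for covert advertisement detection on Chinese social media. It exposes the limits of current MLLMs on this task and the benefit of fine-tuning, giving later research and platform moderation a benchmark and a direction.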
Abstract: Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging. The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements. Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures. We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.
[92] ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control cs.LG | cs.CVPDF
Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik
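TL;DR: This paper proposes ParetoSlider, a multi-objective reinforcement learning post-training framework for diffusion models. It targets a limitation of conventional methods, which use a single scalar reward and therefore cannot flexibly trade off conflicting objectives at inference time (such as prompt adherence versus source fidelity in image editing). By training with continuously varying preference weights as a conditioning signal, a single model approximates the entire Pareto front, giving fine-grained control over optimal trade-offs at inference time without retraining or maintaining multiple checkpoints.
Details
Motivation: RL-based post-training of generative models usually relies on a single scalar reward. When several criteria matter, the common practice of “early scalarization” merges the rewards into a fixed weighted sum, which fixes the model at a single trade-off point during training and leaves no inference-time control over inherently conflicting goals.
Result: Evaluation on three state-of-the-art flow-matching backbones (SD3.5, FluxKontext, and LTX-2) shows that a single preference-conditioned model matches or exceeds baselines trained separately for fixed reward trade-offs while providing fine-grained control over competing generative goals.
Insight: The novelty is combining a multi-objective RL framework with diffusion model post-training. Conditioning on continuous preference weights lets a single model cover the entire Pareto front and gives dynamic, continuous control over multi-objective trade-offs at inference time, avoiding the cost of maintaining multiple models or retraining.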
Abstract: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of “early scalarization” collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals – such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
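The contrast between a fixed weighted sum and preference-conditioned training can be sketched in a few lines. The example below assumes preference weights drawn uniformly from the probability simplex and a linear scalarization of the reward vector; the abstract does not state ParetoSlider's sampling distribution or conditioning mechanism, and the candidate outputs and reward values are hypothetical.

```python
import random


def sample_preference(num_objectives, rng):
    """Draw weights uniformly from the probability simplex (normalized exponential draws)."""
    raw = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(raw)
    return [x / total for x in raw]


def scalarize(rewards, weights):
    """Preference-weighted reward for one sample; the same weights would also condition the model."""
    return sum(w * r for w, r in zip(weights, rewards))


# Hypothetical reward vectors (prompt adherence, source fidelity) for three candidate edits.
candidates = {"A": (0.9, 0.2), "B": (0.6, 0.6), "C": (0.2, 0.9)}


def best_candidate(weights):
    """Candidate with the highest scalarized reward under the given preference."""
    return max(candidates, key=lambda name: scalarize(candidates[name], weights))


# During training a fresh preference would be sampled per step instead of being fixed once.
weights = sample_preference(2, random.Random(0))
```

With weights (0.9, 0.1) the preferred candidate is A, with (0.5, 0.5) it is B, and with (0.1, 0.9) it is C, which is the kind of inference-time slider a single fixed weighting cannot offer.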
cs.AI [Back]
[93] ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models cs.AI | cs.CL | cs.LGPDF
Kemal Düzkar
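TL;DR: This paper presents ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups, component analysis, and full cycle analysis. Ground truth is computed programmatically with CoolProp 7.2.0, and six frontier large language models are evaluated. Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro perform best, and the results reveal marked differences in cross-tier reasoning ability.
Details
Motivation: Existing benchmarks fall short in evaluating the thermodynamic reasoning of large language models, in particular in distinguishing memorized knowledge retrieval from deeper reasoning, so the author built a multi-tier, open-ended thermodynamics problem set.
Result: Across three independent runs, Claude Opus 4.6 leads with 94.1% accuracy, followed by GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 to 32.5 percentage points, indicating that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis act as natural discriminators, with performance spreads of 40-60 percentage points.
Insight: The novelty is a thermodynamics benchmark with programmatically generated ground truth across three reasoning tiers, with reasoning consistency introduced as a separate evaluation axis. This provides a new methodology for evaluating deep LLM reasoning in complex engineering domains and underlines the importance of multi-tier evaluation and consistency analysis.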
Abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa
[94] Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM cs.AI | cs.CLPDF
Mohammad AL-Smadi
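TL;DR: This paper presents a system for automatically detecting dosing errors in clinical trial narratives, using a gradient boosting (LightGBM) model with comprehensive multi-modal feature engineering. The feature set contains 3,451 features, covering traditional NLP features, dense semantic embeddings, domain-specific medical patterns, and transformer-based scores, extracted from nine complementary text fields. On the severely class-imbalanced CT-DEB benchmark dataset, 5-fold ensemble averaging reaches a test ROC-AUC of 0.8725.
Details
Motivation: Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge for patient safety and trial integrity. The paper aims to automate the detection of dosing errors in unstructured clinical trial narratives.
Result: On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), 5-fold ensemble averaging reaches a test ROC-AUC of 0.8725 (cross-validation AUC 0.8833 ± 0.0091). Feature efficiency analysis shows that selecting the top 500-1000 features gives the best performance (AUC 0.886-0.887), better than the full 3,451-feature set (AUC 0.879).
Insight: The novelty is large-scale, multi-modal feature engineering together with a systematic evaluation of how different feature types (such as sentence embeddings and sparse lexical features) contribute to clinical text classification under severe class imbalance. The study finds that feature selection matters as a regularization technique in its own right, and that sparse lexical features complement dense representations in specialized clinical text classification.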
Abstract: Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample), ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 ± 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.
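As an illustration of the evaluation step, the sketch below averages per-fold predicted probabilities and scores the result with ROC-AUC, computed as the fraction of positive-negative pairs ranked correctly. It is a dependency-free stand-in, not the authors' pipeline: the labels and fold scores are toy values, and a real setup would take the scores from the trained LightGBM fold models.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def ensemble_average(fold_scores):
    """Average the predicted probabilities of several fold models, sample by sample."""
    return [sum(column) / len(column) for column in zip(*fold_scores)]


# Toy test set: 1 = narrative contains a dosing error, 0 = it does not.
labels = [1, 0, 0, 1, 0]
fold_scores = [
    [0.8, 0.3, 0.6, 0.4, 0.1],  # hypothetical fold-1 model outputs
    [0.6, 0.2, 0.3, 0.7, 0.2],  # hypothetical fold-2 model outputs
    [0.7, 0.4, 0.3, 0.5, 0.3],  # hypothetical fold-3 model outputs
]
ensemble_scores = ensemble_average(fold_scores)
single_fold_auc = roc_auc(labels, fold_scores[0])
ensemble_auc = roc_auc(labels, ensemble_scores)
```

In this toy case the first fold misranks one pair on its own, while the averaged scores rank every positive above every negative. ROC-AUC depends only on ranking, which is one reason it is a common choice under heavy class imbalance.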
[95] From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents cs.AI | cs.CL | cs.ET | cs.MA | cs.RO
Trilok Padhi, Ramneet Kaur, Krishiv Agarwal, Adam D. Cobb, Daniel Elenius
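TL;DR: This paper proposes a framework for interpreting how temporal concepts evolve in LLM agents. It combines step-wise reward modeling with conformal prediction to statistically label the model's internal representations, then trains linear probes to identify latent directions corresponding to task success, failure, or reasoning drift.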
Details
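Motivation: Although LLM agents are increasingly capable at multi-step reasoning and decision-making tasks, the internal mechanisms guiding their sequential behavior remain opaque, so a method is needed to interpret how their internal concepts evolve over time.
Result: Experiments on two simulated interactive environments, ScienceWorld and AlfWorld, show that these temporal concepts are linearly separable and reveal interpretable structure aligned with task success. Preliminary results also show that steering along the identified success directions inside the model can improve LLM agent performance.
Insight: The novelty lies in combining conformal prediction with step-wise reward modeling to give statistically reliable temporal interpretations of an LLM agent's internal representations, enabling early failure detection and intervention and opening a path toward trustworthy autonomous language models in complex interactive settings.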
Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step-wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to statistically label model’s internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts - latent directions in the model’s activation space that correspond to consistent notions of success, failure or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent’s performance by leveraging the proposed framework for steering the identified successful directions inside the model. The proposed approach, thus, offers a principled method for early failure detection as well as intervention in LLM-based agents, paving the path towards trustworthy autonomous language models in complex interactive settings.
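The abstract does not spell out how conformal prediction turns step-wise reward scores into labels. A minimal sketch of standard split conformal prediction for a binary success/failure label is given below. It assumes a reward model that outputs a success probability per step, and the names `conformal_quantile`, `prediction_set`, `alpha`, and `qhat` are hypothetical rather than taken from the paper.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Threshold from held-out nonconformity scores (split conformal).

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, the threshold is infinite and every label is kept.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(p_success, qhat):
    """Labels whose nonconformity (1 - predicted probability) is within qhat."""
    probs = {"success": p_success, "failure": 1.0 - p_success}
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


# Nonconformity of the true label on a hypothetical calibration set.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
qhat = conformal_quantile(cal_scores, alpha=0.25)
print(prediction_set(0.9, qhat))  # a confident step gets a single label
print(prediction_set(0.5, qhat))  # an ambiguous step keeps both labels
```

Steps whose prediction set contains exactly one label could then serve as statistically labeled training points for a linear probe, while ambiguous steps are left out. That filtering rule is an assumption of this sketch, not a detail stated in the abstract.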
[96] ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks cs.AI | cs.CL
Jan-Philipp Schmidt
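TL;DR: ActuBench is a multi-agent LLM pipeline for automatically generating and evaluating advanced actuarial assessment items aligned with the International Actuarial Association education syllabus. The pipeline separates four LLM roles by adapter: one drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a fourth, cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The study evaluates 50 language models from eight providers on two complementary benchmarks and reports three headline findings.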
Details
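Motivation: To address the challenge of automatically generating and evaluating advanced assessment items in the actuarial domain, and to create a publicly accessible benchmark platform.
Result: 50 models were evaluated on two benchmarks: 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge. The study finds that multi-agent verification is essential, that locally hosted open-weights models sit on the cost-performance Pareto front, and that rankings differ meaningfully between the multiple-choice and LLM-as-Judge modes.
Insight: The novelty lies in a multi-agent LLM pipeline with separated roles, independent verification, and repair loops for generating and quality-assuring assessments in a specialized domain (actuarial science). Viewed objectively, it applies a multi-agent collaboration framework to benchmark creation and model evaluation in a specific vertical domain and shows how the evaluation mode (MCQ vs. open-ended) affects rankings, which gives it both methodological and practical value.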
Abstract: We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks – 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge – and report three headline findings. First, multi-agent verification is load-bearing: the independent verifier flags a majority of drafted items on first pass, most of which the one-shot repair loop resolves. Second, locally-hosted open-weights inference sits on the cost-performance Pareto front: a Gemma 4 model running on consumer hardware and a Cerebras-hosted 120B open-weights model dominate the near-zero-cost region, with the latter within one item of the top of the leaderboard. Third, MCQ and LLM-as-Judge rankings differ meaningfully: the MCQ scaffold inflates the performance ceiling, and Judge-mode evaluation is needed to discriminate at the frontier.
[97] Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning cs.AI | cs.CL
Zoya Volovikova, Nikita Sorokin, Dmitriy Lukashevskiy, Aleksandr Panov, Alexey Skrynnik
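TL;DR: This paper proposes SuperIgor, a framework for instruction-following tasks. It lets a language model generate and refine high-level plans through a self-learning mechanism, reducing reliance on manual annotation. Its core is iterative co-training of an RL agent and a language model: the agent learns to execute plans, while the language model adjusts the plans based on RL feedback and preferences, forming a mutually reinforcing feedback loop.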
Details
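Motivation: Existing instruction-following methods depend on predefined subtasks and large amounts of manually annotated data. The paper aims to combine self-guided plan extraction with reinforcement learning for more flexible, data-efficient instruction understanding and execution.
Result: Validated in environments with rich dynamics and stochasticity, SuperIgor agents follow instructions more strictly than baseline methods and generalize strongly to previously unseen instructions.
Insight: The novelty lies in iteratively co-training a language model, used as the plan generator, with a goal-conditional reinforcement learning agent. This forms a self-guided feedback loop in which plan quality and task execution improve together, reducing dependence on external supervision.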
Abstract: We introduce SuperIgor, a framework for instruction-following tasks. Unlike prior methods that rely on predefined subtasks, SuperIgor enables a language model to generate and refine high-level plans through a self-learning mechanism, reducing the need for manual dataset annotation. Our approach involves iterative co-training: an RL agent is trained to follow the generated plans, while the language model adapts and modifies these plans based on RL feedback and preferences. This creates a feedback loop where both the agent and the planner improve jointly. We validate our framework in environments with rich dynamics and stochasticity. Results show that SuperIgor agents adhere to instructions more strictly than baseline methods, while also demonstrating strong generalization to previously unseen instructions.
cs.MA [Back]
[98] Trust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents cs.MA | cs.AI | cs.CL
Suveen Ellawela
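TL;DR: This paper studies the social dynamics that emerge when large language model (LLM) agents repeatedly play the hidden-role deception game The Resistance: Avalon. With agents retaining memory of past games (including roles and behavior), the study finds two key phenomena: first, cross-game memory spontaneously produces role-conditional reputation dynamics, with high-reputation players receiving more team inclusions; second, higher reasoning effort supports more sophisticated strategic deception.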
Details
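Motivation: To go beyond prior work on single-game performance by having LLM agents play repeated games while retaining memory, in order to examine how social dynamics such as reputation and deception emerge as interactions evolve.
Result: Across 188 games, the study finds: 1) when agents retain cross-game memory, reputation dynamics emerge spontaneously, and high-reputation players receive 46% more team inclusions; 2) under high reasoning effort, evil players more often build trust first and sabotage later (75% vs. 36% in low-effort games).
Insight: The novelty lies in the multi-round, memory-retaining repeated-game setup, through which measurable, role-conditional reputation dynamics and the effect of reasoning effort on strategic deception are systematically observed in LLM agents for the first time, offering a new paradigm for studying the emergence of complex social interaction.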
Abstract: We study emergent social dynamics in LLM agents playing The Resistance: Avalon, a hidden-role deception game. Unlike prior work on single-game performance, our agents play repeated games while retaining memory of previous interactions, including who played which roles and how they behaved, enabling us to study how social dynamics evolve. Across 188 games, two key phenomena emerge. First, reputation dynamics emerge organically when agents retain cross-game memory: agents reference past behavior in statements like “I am wary of repeating last game’s mistake of over-trusting early success.” These reputations are role-conditional: the same agent is described as “straightforward” when playing good but “subtle” when playing evil, and high-reputation players receive 46% more team inclusions. Second, higher reasoning effort supports more strategic deception: evil players more often pass early missions to build trust before sabotaging later ones, 75% in high-effort games vs 36% in low-effort games. Together, these findings show that repeated interaction with memory gives rise to measurable reputation and deception dynamics among LLM agents.
[99] Anchor-and-Resume Concession Under Dynamic Pricing for LLM-Augmented Freight Negotiation cs.MA | cs.AI | cs.CL
Hoang Nguyen, Lu Wang, Marta Gaia Bras
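TL;DR: This paper proposes an anchor-and-resume concession framework for freight negotiation under dynamic pricing. A β parameter derived from the live spread makes concession adaptive, and the anchor-and-resume mechanism guarantees monotonically non-decreasing offers. Together these address the inability of a classical fixed β to adapt to pricing updates, as well as the high cost and non-determinism of LLM agents.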
Details
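Motivation: When freight brokerages negotiate under dynamic pricing, classical time-dependent concession frameworks use a fixed β and cannot adapt when the pricing model revises targets mid-conversation; LLM agents, meanwhile, suffer from high inference cost, non-deterministic pricing, and vulnerability to prompt injection.
Result: In an empirical evaluation across 115,125 negotiations, the adaptive β behaves differently by spread regime: in narrow spreads it concedes quickly to prioritize deal closure; in medium and wide spreads it matches or exceeds the best fixed-β baselines in broker savings. Against a 20-billion-parameter LLM broker it achieves similar agreement rates and savings; against LLM-powered carriers, which are more realistic stochastic counterparties, it maintains comparable savings and higher agreement rates than against rule-based opponents.
Insight: The novelty lies in decoupling pricing decisions from the LLM: the LLM serves only as a natural-language translation layer, while the core concession logic is handled by a deterministic formula, giving scalability, low inference cost, and transparent decisions. The proposed two-index anchor-and-resume mechanism guarantees offer monotonicity under arbitrary pricing shifts.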
Abstract: Freight brokerages negotiate thousands of carrier rates daily under dynamic pricing conditions where models frequently revise targets mid-conversation. Classical time-dependent concession frameworks use a fixed shape parameter $β$ that cannot adapt to these updates. Deriving $β$ from the live spread enables adaptation but introduces a new problem: a pricing shift can cause the formula to retract a previous offer, violating monotonicity. LLM-powered brokers offer flexibility but require expensive reasoning models, produce non-deterministic pricing, and remain vulnerable to prompt injection. We propose a two-index anchor-and-resume framework that addresses both limitations. A spread-derived $β$ maps each load’s margin structure to the correct concession posture, while the anchor-and-resume mechanism guarantees monotonically non-decreasing offers under arbitrary pricing shifts. All pricing decisions remain in a deterministic formula; the LLM, when used, serves only as a natural-language translation layer. Empirical evaluation across 115,125 negotiations shows that the adaptive $β$ tailors behavior by regime: in narrow spreads, it concedes quickly to prioritize deal closure and load coverage; in medium and wide spreads, it matches or exceeds the best fixed-$β$ baselines in broker savings. Against an unconstrained 20-billion-parameter LLM broker, it achieves similar agreement rates and savings. Against LLM-powered carriers as more realistic stochastic counterparties, it maintains comparable savings and higher agreement rates than against rule-based opponents. By decoupling the LLM from pricing logic, the framework scales horizontally to thousands of concurrent negotiations with negligible inference cost and transparent decision-making.
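The abstract does not give the concession formula or the details of the two-index mechanism, so the following is only an illustrative sketch. It uses the classical time-dependent concession curve, offer = anchor + (target - anchor) * (t/T)^(1/β), and one simple way to keep offers monotone when the target moves: restart the curve from the last offer and never emit a lower one. The class `Conceder`, its method names, and the re-anchoring rule are assumptions, not the paper's method; β is passed in rather than derived from the spread.

```python
class Conceder:
    """Time-dependent concession with re-anchoring on pricing updates (sketch)."""

    def __init__(self, opening, target, rounds, beta):
        self.anchor = opening      # price the current curve starts from
        self.anchor_round = 0      # round at which the curve was (re)started
        self.target = target       # current ceiling supplied by the pricing model
        self.rounds = rounds       # total negotiation rounds available
        self.beta = beta           # shape parameter, assumed > 0
        self.last_offer = opening

    def retarget(self, new_target, current_round):
        """Pricing update: restart the curve from the last offer made."""
        self.anchor = self.last_offer
        self.anchor_round = current_round
        self.target = new_target

    def offer(self, t):
        """Offer at round t, clamped so it never falls below an earlier offer."""
        span = self.rounds - self.anchor_round
        if span <= 0:
            frac = 1.0
        else:
            frac = min(1.0, max(0.0, (t - self.anchor_round) / span))
        raw = self.anchor + (self.target - self.anchor) * frac ** (1.0 / self.beta)
        self.last_offer = max(self.last_offer, raw)
        return self.last_offer


broker = Conceder(opening=1000.0, target=1400.0, rounds=10, beta=1.0)
print(broker.offer(5))            # halfway along a linear curve
broker.retarget(1100.0, current_round=5)
print(broker.offer(6))            # target dropped below the last offer: hold
```

Keeping this logic in a small deterministic class reflects the design point made in the abstract: pricing stays auditable and cheap to run, and an LLM, if present, only phrases the resulting number.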
cs.CY [Back]
[100] Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics cs.CY | cs.CV | econ.EM
Xinxin Zhuo, Mengyuan Niu, Ruizhe Wang, Junyan Yang, Qiao Wang
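TL;DR: This paper proposes a visual-semantic and spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI), for diagnosing the economic vitality of urban streets. The method parses physical and semantic streetscape information through instance segmentation, standardizes brand hierarchies with a dual-stage VLM-LLM pipeline, and introduces a temporal lag design using LBS data to capture realized demand, yielding a three-dimensional diagnostic system covering commercial activity, spatial utilization, and physical environment.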
Details
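Motivation: Existing methods for micro-scale street-level economic assessment based on Street View Imagery (SVI) are semantically shallow and overlook brand hierarchy heterogeneity and structural recession, so a finer-grained framework is needed to support precision spatial resource allocation.
Result: Time-lagged geographically weighted regression experiments across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity: street vitality arises from the interaction between hierarchical brand clustering and mall-induced externalities, high-quality interfaces attract most strongly at midday and in the evening, and structural recession produces a lagged nighttime repulsion effect.
Insight: The novelty lies in integrating visual-semantic parsing with a temporal lag design and building a three-dimensional diagnostic system from a brand premium index and a category-weighted Gaussian spillover model, providing data-driven evidence for precision spatial governance.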
Abstract: Micro-scale street-level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual-semantic and field-based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual-stage VLM-LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location-Based Services (LBS) data to capture realized demand. Combined with a category-weighted Gaussian spillover model, we construct a three-dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time-lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall-induced externalities. High-quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence-based support for precision spatial governance.
cs.CR [Back]
[101] AgentSOC: A Multi-Layer Agentic AI Framework for Security Operations Automation cs.CR | cs.AI | cs.CL
Joyjit Roy, Samaresh Kumar Singh
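TL;DR: This paper proposes AgentSOC, a multi-layer agentic AI framework aimed at the challenges Security Operations Centers (SOCs) face in correlating heterogeneous alerts, interpreting multi-stage attack progressions, and selecting safe and effective response actions. By integrating perception, anticipatory reasoning, and risk-based action planning, the framework provides a unified abstraction layer and operational loop for SOC automation.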
Details
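Motivation: SOCs today handle heterogeneous alerts, complex attack chains, and policy-compliant response planning inefficiently and inconsistently, and need a more automated, intelligent solution to improve operational efficiency and security.
Result: A conceptual evaluation indicates that AgentSOC improves triage consistency, anticipates attacker intent, and recommends operationally feasible containment options that balance security efficacy against operational impact. A proof-of-concept (POC) demonstration using LANL authentication data also shows the architecture is feasible.
Insight: The novelty lies in a multi-layer agentic framework that integrates perception, reasoning, and planning, achieving end-to-end SOC automation through a unified abstraction layer and operational loop. Its hybrid agentic reasoning approach offers a potential foundation for adaptive, safer SOC automation systems.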
Abstract: Security Operations Centers (SOCs) increasingly encounter difficulties in correlating heterogeneous alerts, interpreting multi-stage attack progressions, and selecting safe and effective response actions. This study introduces AgentSOC, a multi-layered agentic AI framework that enhances SOC automation by integrating perception, anticipatory reasoning, and risk-based action planning. The proposed architecture consolidates several layers of abstraction to provide a single operational loop to support normalizing alerts, enriching context, generating hypotheses, validating structural feasibility, and executing policy-compliant responses. Conceptually evaluated within a large enterprise environment, AgentSOC improves triage consistency, anticipates attackers’ intentions, and provides recommended containment options that are both operationally feasible and well-balanced between security efficacy and operational impact. The results suggest that hybrid agentic reasoning has the potential to serve as a foundation for developing adaptive, safer SOC automation in large enterprises. Additionally, a minimal Proof-Of-Concept (POC) demonstration using LANL authentication data demonstrated the feasibility of the proposed architecture.
eess.AS [Back]
[102] Explainable Speech Emotion Recognition: Weighted Attribute Fairness to Model Demographic Contributions to Social Bias eess.AS | cs.AI | cs.CL
Tomisin Ogunnubi, Yupei Li, Björn Schuller
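TL;DR: This paper proposes a fairness modelling approach for speech emotion recognition (SER) systems that explicitly captures allocative bias by learning the joint relationship between demographic attributes and model error. The fairness metric is validated on synthetic data and then applied to HuBERT and WavLM models fine-tuned on the CREMA-D dataset. Results indicate that the approach better captures the mutual information between protected attributes and bias, quantifies the absolute contribution of individual attributes to bias in self-supervised learning (SSL)-based SER models, and reveals indications of gender bias in both HuBERT and WavLM.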
Details
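Motivation: SER systems are increasingly used in sensitive domains such as mental health and education, where biased predictions can cause harm, yet traditional fairness metrics (e.g., Equalised Odds, Demographic Parity) often overlook the joint dependency between demographic attributes and model predictions.
Result: The proposed fairness metric is validated on synthetic data and used to evaluate HuBERT and WavLM models fine-tuned on CREMA-D. The proposed fairness model captures more mutual information between protected attributes and bias and quantifies the absolute contribution of individual attributes to bias in SSL-based SER models; the analysis also reveals indications of gender bias in both HuBERT and WavLM.
Insight: The novelty lies in a new fairness modelling approach that explicitly learns the joint relationship between demographic attributes and model error to capture allocative bias, exposing complex dependencies among attributes better than traditional fairness metrics. Viewed objectively, the approach offers a way to quantify each attribute's contribution to model bias, giving a finer-grained tool for understanding and mitigating social bias in SER models.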
Abstract: Speech Emotion Recognition (SER) systems have growing applications in sensitive domains such as mental health and education, where biased predictions can cause harm. Traditional fairness metrics, such as Equalised Odds and Demographic Parity, often overlook the joint dependency between demographic attributes and model predictions. We propose a fairness modelling approach for SER that explicitly captures allocative bias by learning the joint relationship between demographic attributes and model error. We validate our fairness metric on synthetic data, then apply it to evaluate HuBERT and WavLM models finetuned on the CREMA-D dataset. Our results indicate that the proposed fairness model captures more mutual information between protected attributes and biases and quantifies the absolute contribution of individual attributes to bias in SSL-based SER models. Additionally, our analysis reveals indications of gender bias in both HuBERT and WavLM.
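The abstract evaluates fairness through the mutual information between protected attributes and model bias. The sketch below shows only that underlying quantity: empirical mutual information between a discrete attribute and a per-sample error indicator. It is not the authors' fairness model, and the function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Toy example: 1 marks a misclassified utterance.
gender = ["f", "f", "m", "m"]
errors_biased = [1, 1, 0, 0]     # errors fall entirely on one group
errors_unbiased = [1, 0, 1, 0]   # errors are independent of the group
print(mutual_information(gender, errors_biased))
print(mutual_information(gender, errors_unbiased))
```

A value near zero means the error pattern carries no information about the attribute, while larger values indicate that errors concentrate in particular groups.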
cs.IR [Back]
[103] SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition cs.IR | cs.CL
Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong
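TL;DR: This paper proposes SAKE, a framework that coordinates internal knowledge exploitation with external knowledge exploration through self-aware reasoning and adaptive search tool invocation, addressing long-tailed, rapidly evolving, and unseen entities in open-world social media and improving performance on the GMNER task.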
Details
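Motivation: Existing GMNER methods rely either on heuristic retrieval or on iterative refinement of the internal knowledge of MLLMs. The former can introduce noisy evidence that lowers precision on known entities, while the latter is limited by the model's knowledge boundaries and prone to hallucination, so a framework that adaptively decides when to retrieve is needed.
Result: Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness, but the abstract gives no specific quantitative results and does not say whether state of the art is reached.
Insight: The novelties include: 1) quantifying entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals; 2) constructing SAKE-SeCoT, a high-quality Chain-of-Thought dataset that gives the model basic self-awareness and tool-use capabilities; 3) agentic reinforcement learning with a hybrid reward function, which moves the model from rigid search imitation to genuinely self-aware retrieval decisions.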
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model’s entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE’s effectiveness.
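The abstract says entity-level uncertainty is quantified through multiple forward samplings but does not give the measure. One common choice, shown here purely as an assumption, is the entropy of the answers sampled for an entity, with a search triggered when the entropy exceeds a threshold. The names `answer_entropy` and `needs_search` and the 0.5-bit threshold are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    counts = Counter(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def needs_search(samples, max_entropy_bits=0.5):
    """Flag a knowledge gap when sampled answers disagree too much."""
    return answer_entropy(samples) > max_entropy_bits


# Entity types sampled from several hypothetical forward passes.
print(needs_search(["PER", "PER", "PER", "PER"]))  # consistent answers
print(needs_search(["PER", "ORG", "PER", "ORG"]))  # split answers
```

A signal like this could supply the knowledge-gap tags the abstract describes, leaving retrieval off for entities the model already labels consistently.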
[104] Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge cs.IR | cs.CL | cs.DB | cs.LG
Naizhong Xu
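TL;DR: This paper proposes SmartVector, a framework that strengthens knowledge representation in retrieval-augmented generation (RAG) systems by augmenting dense vector embeddings with three explicit properties (temporal awareness, confidence decay, and relational awareness) and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. Retrieval replaces pure cosine similarity with a four-signal score mixing semantic relevance, temporal validity, live confidence, and graph-relational importance, and a background consolidation agent detects contradictions, builds dependency edges, and propagates updates.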
Details
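Motivation: Modern RAG systems treat vector embeddings as static, context-free artifacts with no notion of creation time, source trustworthiness, or dependencies. This leads to low accuracy on versioned technical queries (VersionRAG reports only 58%), because retrieval can return semantically similar but temporally invalid content.
Result: On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0%), lowers the stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and remains robust across contradiction-injection rates from 0% to 75%.
Insight: The novelties include: 1) a neuroscience-inspired memory-consolidation lifecycle model; 2) explicit encoding of temporal, confidence, and relational properties in embeddings; 3) retrieval with a four-signal hybrid score that goes beyond pure semantic similarity; 4) a graph-neural-network-style mechanism for propagating dependency updates; 5) a confidence function combining Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. These designs let a RAG system handle the temporal validity, trustworthiness, and dependencies of knowledge dynamically, improving accuracy and efficiency.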
Abstract: Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties – temporal awareness, confidence decay, and relational awareness – and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.
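The abstract names the ingredients of the confidence function and the four-signal score but gives neither closed form. The sketch below shows one plausible shape under stated assumptions: confidence as exponential decay with a half-life plus a logarithmic access bonus and an additive feedback term, and the retrieval score as a weighted sum of the four signals. The function names, weights, half-life, and gain constants are illustrative and not values from the paper.

```python
import math


def live_confidence(base, age_days, half_life_days, accesses,
                    feedback=0.0, access_gain=0.05):
    """Confidence after decay, access reinforcement, and user feedback.

    Assumed form: base * 0.5 ** (age / half_life)
                  + access_gain * ln(1 + accesses) + feedback, clamped to [0, 1].
    """
    decay = 0.5 ** (age_days / half_life_days)
    reinforcement = access_gain * math.log1p(accesses)
    return min(1.0, max(0.0, base * decay + reinforcement + feedback))


def retrieval_score(semantic, temporal, confidence, relational,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted mix of the four signals, each expected in [0, 1]."""
    signals = (semantic, temporal, confidence, relational)
    return sum(w * s for w, s in zip(weights, signals))


# A superseded policy version that matches the query slightly better
# semantically, compared with the currently valid version.
stale = retrieval_score(semantic=0.9, temporal=0.0, confidence=0.5, relational=0.2)
fresh = retrieval_score(semantic=0.8, temporal=1.0, confidence=0.5, relational=0.2)
print(fresh > stale)
```

With pure cosine similarity the superseded version would rank first; adding a temporal-validity term is what lets the current version overtake it, which is the failure mode the abstract attributes to conventional RAG.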
cs.SD [Back]
[105] From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR cs.SD | cs.CV
Nan Xu, Shiheng Li, Shengchao Hou
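TL;DR: This paper proposes a new approach to second-stage decoding in Optical Music Recognition (OMR), which decodes the symbol and event candidates output by the visual pipeline into an editable, verifiable, and exportable score structure, with a focus on complex polyphonic piano scores. The approach frames decoding as a structure decoding problem, with topology recognition and probability-guided search (BeadSolver) at its core.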
Details
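Motivation: To address the two main bottlenecks in OMR for complex polyphonic scores (especially piano scores), voice separation and intra-measure timing, with the aim of decoding visual recognition results into structured, usable score data.
Result: The paper presents a practical decoding component usable in real OMR systems and a path to accumulating structured score data for future end-to-end, multimodal, and RL-style methods. The abstract mentions no specific benchmarks or quantitative comparisons.
Insight: The novelty lies in formalizing second-stage decoding as a structure decoding problem with topology recognition and probability-guided search (BeadSolver) as the core method, together with a data strategy that combines procedural generation with recognition-feedback annotations to support model training and data accumulation.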
Abstract: We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.