cs.CL

[1] Large Language Models Explore by Latent Distilling cs.CL | cs.AI | cs.LG

Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li

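TL;DR: This paper proposes Exploratory Sampling (ESamp), a decoding method that increases the semantic diversity of large language model (LLM) generations. It trains a lightweight Distiller at test time to predict the LLM's deep-layer hidden representations and uses the prediction error as a novelty signal to reweight candidate tokens, steering decoding toward less-explored semantic patterns.

Motivation: Standard stochastic sampling mostly produces surface-level lexical variation, which limits semantic exploration, yet diverse responses are crucial for test-time scaling of LLMs. The paper targets this lack of semantic diversity in LLM generation.

Result: Experiments show that ESamp significantly improves the Pass@k efficiency of reasoning models, generalizes robustly across mathematics, science, and code generation benchmarks, and breaks the trade-off between diversity and coherence in creative writing, performing better than or comparably to strong stochastic and heuristic baselines.

Insight: The key idea is to exploit the fact that neural networks make lower-error predictions on familiar inputs and higher-error predictions on novel ones. A Distiller is trained dynamically at test time to model the LLM's depth-wise representation transitions, and its prediction error serves as a novelty signal that guides decoding. This explicitly encourages semantic exploration at low overhead (under 5% in the worst case).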

Abstract: Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations to model the LLM’s depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training–inference pipeline, with less than 5% worst-case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code is released at: https://github.com/LinesHogan/tLLM.
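
The novelty-reweighting idea in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `LinearDistiller` class, the additive `beta` bonus on logits, and the hand-written state vectors are hypothetical and are not taken from the paper or its repository. The sketch only shows the general mechanism: a small predictor trained online has low error on state pairs it has already seen and higher error on unseen ones, and that error can shift probability toward the less-explored candidate.

```python
import math


class LinearDistiller:
    """Toy stand-in for the Distiller: a linear map from shallow to deep states."""

    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]

    def predict(self, shallow):
        return [sum(w * x for w, x in zip(row, shallow)) for row in self.W]

    def error(self, shallow, deep):
        # Mean squared prediction error, used here as the novelty signal.
        pred = self.predict(shallow)
        return sum((p - d) ** 2 for p, d in zip(pred, deep)) / self.dim

    def update(self, shallow, deep):
        # One gradient step on the squared error (constant factor folded into lr).
        pred = self.predict(shallow)
        for i in range(self.dim):
            residual = pred[i] - deep[i]
            for j in range(self.dim):
                self.W[i][j] -= self.lr * residual * shallow[j] / self.dim


def reweight(logits, novelty, beta=1.0):
    """Add a novelty bonus to candidate logits and renormalize with a softmax."""
    scores = [l + beta * n for l, n in zip(logits, novelty)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (shallow, deep) state pairs for two candidate continuations.
familiar = ([1.0, 0.5, -0.5, 0.0], [0.2, -0.1, 0.4, 0.3])
novel = ([0.0, 1.0, 1.0, 0.0], [-0.3, 0.5, 0.1, -0.2])

distiller = LinearDistiller(dim=4)
for _ in range(200):  # the distiller keeps adapting to what was already generated
    distiller.update(*familiar)

novelty = [distiller.error(*familiar), distiller.error(*novel)]
probs = reweight([0.0, 0.0], novelty)  # equal logits, so only novelty differs
```

With equal logits, the candidate whose state transition the distiller has not yet learned receives the larger probability. How ESamp actually obtains per-candidate representations and schedules training is not specified in the abstract.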


[2] BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks cs.CL | cs.AI | cs.SE

Xinming Tu, Tianze Wang, Yingzhou Lu, Kexin Huang

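TL;DR: This paper presents BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. It uses frontier LLMs as systematic auditors that cross-verify all benchmark artifacts (such as specifications and evaluation scripts) through structured protocols, optionally using agent solutions or execution traces as diagnostic evidence. Applied to two scientific benchmarks, it finds defects in the benchmarks themselves, such as fatal errors and implicit assumptions, at low cost.

Motivation: As benchmarks grow more complex, many apparent agent failures actually stem from defects in the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. An automated way to audit the evaluation infrastructure is therefore needed to keep it reliable and fair.

Result: On ScienceAgentBench, BenchGuard identified 12 author-confirmed issues, including fatal errors that make tasks unsolvable. On the BIXBench Verified-50 subset, it exactly matched 83.3% of expert-identified issues and caught defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15.

Insight: The core innovation is to turn frontier LLMs from subjects of evaluation into active validators of the evaluation infrastructure, enabling automated, low-cost benchmark auditing. This points toward AI-assisted benchmark development, in which models help validate and improve the evaluation framework rather than only being tested by it.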

Abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.


[3] Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases cs.CL

Jun Li, Mingxuan Liu, Jiazhen Pan, Che Liu, Wenjia Bai

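TL;DR: This paper proposes Dynamic Decision Learning (DDL), a framework for clinical abnormality grounding in rare diseases, where data scarcity makes supervised fine-tuning impractical and single-pass inference unstable. By optimizing instructions and consolidating predictions under visual perturbations, DDL lets a frozen large vision-language model (LVLM) iteratively refine its decisions in both language and visual spaces, improving localization quality and producing a consensus-based reliability score.

Motivation: Clinical abnormality grounding for rare diseases is often hindered by data scarcity, which makes supervised fine-tuning impractical and single-pass inference highly unstable. This calls for a fine-tuning-free method that improves decision stability and accuracy at test time.

Result: On brain imaging benchmarks, including a rare-disease dataset covering 281 pathology types, DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Under severe distribution shifts and increasing task difficulty, its reliability scores are also better calibrated to localization accuracy.

Insight: The novelty is a test-time evolution framework that lets a frozen LVLM refine its decisions dynamically, through instruction optimization and consolidation of predictions under visual perturbations, improving rare-disease grounding without fine-tuning. The consensus-based reliability score quantifies model confidence and improves robustness and interpretability under distribution shift.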

Abstract: Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: https://lijunrio.github.io/DDL/
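
One way to picture "consolidating predictions under visual perturbations" into a consensus-based reliability score is sketched below. The abstract does not define the score, so this is an assumption: the coordinate-wise median box and the mean pairwise IoU are illustrative choices, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consolidate(boxes):
    """Return (consensus box, reliability) for boxes predicted under perturbations."""
    # Consensus box: coordinate-wise median over all perturbed predictions.
    consensus = tuple(median(b[i] for b in boxes) for i in range(4))
    # Reliability: mean pairwise IoU, i.e. how much the predictions agree.
    pairs = list(combinations(boxes, 2))
    reliability = sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    return consensus, reliability


stable = [(10, 10, 50, 50), (10, 10, 50, 50), (10, 10, 50, 50)]
unstable = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
stable_box, stable_score = consolidate(stable)
_, unstable_score = consolidate(unstable)
```

Predictions that agree across perturbations score near 1, and predictions that jump around the image score near 0, which is the behavior a reliability score of this kind is meant to capture.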


[4] Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models cs.CL

Dan Shi, Zhuowen Han, Simon Ostermann, Renren Jin, Josef van Genabith

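TL;DR: Through a feature-level mechanistic analysis, this paper investigates why reinforcement learning (RL) post-training improves the generalization of large language models (LLMs) while supervised fine-tuning (SFT) often causes forgetting of general capabilities. SFT rapidly introduces many highly specialized features that stabilize early, whereas RL induces more restrained, continually evolving feature changes that preserve the base model's representations. The authors identify a compact, task-agnostic set of features that directly mediate cross-task generalization and verify its causal role with feature interventions.

Motivation: The paper aims to uncover the mechanism by which RL post-training, compared with SFT, better improves LLM reasoning and generalizes. The root cause of this difference in generalization has remained unclear.

Result: In a controlled comparison of RL- and SFT-tuned models trained from the same base model on identical data, feature interventions confirm the causal role of the identified feature set: disabling these features significantly degrades the RL models' generalization, while amplifying them improves the base model's performance.

Insight: The novelty is a feature-level mechanistic analysis method that aligns the internal activations of different models in a shared feature space so that feature evolution can be traced interpretably. RL preserves base representations by inducing more restrained, continually evolving feature changes, and a compact, task-agnostic set of generalization-mediating features is identified, offering a new interpretability perspective on generalization and a potential handle for intervention.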

Abstract: Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models’ representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models’ generalization performance, while amplifying them improves base models’ performance. The code is available at https://github.com/danshi777/RL-generalization.
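
The "disable or amplify" interventions mentioned above can be sketched as rescaling selected feature activations before mapping them back to activation space. This assumes a linear feature dictionary (one decoder row per feature), which the abstract does not state. The `intervene` function and the toy numbers are hypothetical.

```python
def intervene(features, decoder, feature_ids, scale):
    """Rescale selected feature activations, then decode to activation space.

    scale=0.0 disables the chosen features; scale>1.0 amplifies them.
    decoder[k] is the (assumed linear) direction written by feature k.
    """
    edited = [a * scale if k in feature_ids else a for k, a in enumerate(features)]
    dim = len(decoder[0])
    return [sum(edited[k] * decoder[k][j] for k in range(len(edited))) for j in range(dim)]


features = [1.0, 2.0, 3.0]                       # toy feature activations
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy feature directions

baseline = intervene(features, decoder, set(), 1.0)
disabled = intervene(features, decoder, {2}, 0.0)
amplified = intervene(features, decoder, {2}, 2.0)
```

Comparing task performance with `baseline`, `disabled`, and `amplified` activations is the general shape of a causal test. The paper's actual feature space and intervention procedure may differ.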


[5] Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs cs.CL | cs.AI

Sagnik Chatterjee, Atharva Patil, Sricharan Ramesh

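TL;DR: This paper proposes Dual-Track CoT, a method that uses budget-aware stepwise guidance to help small language models (SLMs) perform reliable multi-step reasoning under limited compute and token budgets.

Motivation: Existing reasoning methods such as self-consistency and Tree-of-Thoughts improve performance but usually at high token cost and without fine-grained step-level control. The paper asks whether small models can reason reliably with the same or fewer tokens. The question is both scientific (can process supervision and simple test-time controls substitute for model scale?) and practical (on-device, low-latency, or cost-constrained deployments).

Result: The abstract reports no quantitative results or benchmarks. It only indicates that the method aims to improve SLM reasoning at fixed cost.

Insight: The core contribution is a budget-aware stepwise guidance framework that emphasizes fine-grained control over reasoning steps (such as token budget management and rejection of redundant steps). It attempts to offset the smaller scale of SLMs with process supervision and test-time optimization, suggesting a route to efficient reasoning in resource-constrained settings.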

Abstract: Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test-time reasoning methods such as self-consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique-revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.
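
The two test-time controls named in the abstract, token budgets and rejection of redundant steps, can be sketched as a simple filter over proposed reasoning steps. The details are assumptions: whitespace token counting, Jaccard word overlap as the redundancy test, the `max_overlap` threshold, and charging rejected steps to the budget are illustrative choices, not the authors' method.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two reasoning steps."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def guided_steps(candidate_steps, token_budget, max_overlap=0.8):
    """Accept steps in order until the token budget runs out, skipping near-duplicates."""
    accepted, spent = [], 0
    for step in candidate_steps:
        cost = len(step.split())  # crude token count: whitespace-separated words
        if spent + cost > token_budget:
            break
        spent += cost  # generated tokens are charged even if the step is rejected
        if any(jaccard(step, prev) >= max_overlap for prev in accepted):
            continue  # redundant with an earlier accepted step
        accepted.append(step)
    return accepted, spent


steps = [
    "x = 3 + 4 = 7",
    "x = 3 + 4 = 7",        # exact repeat, rejected as redundant
    "y = x * 2 = 14",
    "so the answer is 14",  # would exceed the budget of 24 tokens
]
accepted, spent = guided_steps(steps, token_budget=24)
```

In a real system the candidate steps would come from the model one at a time, and the redundancy test would likely be semantic rather than lexical.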


[6] Analyzing LLM Reasoning to Uncover Mental Health Stigma cs.CL | cs.AI

Sreehari Sankar, Aliakbar Nafar, Mona Barman, Hannah K. Heitz, Ashwin Kumar

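TL;DR: By analyzing the intermediate reasoning steps of large language models (LLMs) in mental health applications, this paper uncovers stigmatizing language and internal biases hidden in the models' underlying logic that traditional multiple-choice evaluations cannot capture. The study uses clinical expertise to categorize and tag patterns of stigmatizing language, rates their severity, and extends an existing mental health stigma benchmark to cover more psychological conditions.

Motivation: Existing evaluations of mental health stigma in LLMs rely mainly on multiple-choice questions, which cannot capture biases in the models' underlying logic. Analyzing the reasoning steps is needed to reveal hidden stigmatizing language and the rationales behind it.

Result: Evaluating model reasoning exposes substantially more stigma than traditional multiple-choice methods and helps identify flaws in the LLMs' logic and in their understanding of mental health conditions.

Insight: The novelty is to systematically identify and categorize stigmatizing language in LLMs' intermediate reasoning steps rather than only in final outputs, and to add severity ratings that distinguish overt from subtle bias, giving a finer-grained framework for evaluating and mitigating social bias in AI models.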

Abstract: While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models’ underlying logic. In this paper, we analyze the intermediate reasoning steps of LLMs to uncover hidden stigmatizing language and the internal rationales driving it. We leverage clinical expertise to categorize common patterns of stigmatizing language directed at individuals with psychological conditions and use this framework to identify and tag problematic statements in LLM reasoning. Furthermore, we rate the severity of these statements, distinguishing between overt prejudice and more subtle, less immediately harmful biases. To broaden the reasoning domain and capture a wider array of patterns, we also extend an existing mental health stigma benchmark by incorporating additional psychological conditions. Our findings demonstrate that evaluating model reasoning not only exposes substantially more stigma than traditional MCQ-based methods but also helps to identify the flaws in the LLMs’ logic and their understanding of mental health conditions.


[7] The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue cs.CL | cs.HC

Ashish Mehta, Jared Moore, Jacy Reese Anthis, William Agnew, Eric Lin

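TL;DR: Using a dataset of chat logs from individuals who exhibited delusional thinking, this paper develops a latent state model that quantifies the accumulating and decaying bidirectional influence of false beliefs between humans and chatbots. The bidirectional model substantially outperforms a unidirectional one: humans exert strong but short-lived influence on chatbots, chatbots exert longer-lasting influence on humans, and chatbots show stable self-influence over their own outputs, which is the main pathway sustaining delusions over long conversations.

Motivation: There is concern that AI chatbots may fuel users' delusional beliefs, but quantitative evidence that humans and chatbots mutually reinforce false beliefs is lacking. The paper models and quantifies this bidirectional influence to expose the feedback loops in the interaction.

Result: On a unique dataset of chats involving delusional thinking, the bidirectional influence model substantially outperforms the unidirectional model. Human influence on chatbots is strong but short-lived, while chatbot influence on humans lasts longer. Chatbot self-influence is the dominant pathway for accumulated influence over long conversations, indicating that chatbots can perpetuate delusions.

Insight: The paper provides the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposed into pathways with distinct temporal dynamics. More broadly, the model highlights the importance of timescale differences in AI safety risks and can inform the development of safer AI systems.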

Abstract: There is growing concern that AI chatbots might fuel delusional beliefs in users. Some have suggested that humans and chatbots mutually reinforce false beliefs over time, but quantitative evidence is lacking. Using a unique dataset of chat logs from individuals who exhibited delusional thinking, we developed a latent state model that captures accumulating and decaying influences between humans and chatbots. We find that a bidirectional influence model substantially outperforms a unidirectional alternative where humans are the primary driver of delusion. We find that humans exert strong but short-lived influence on chatbots, whereas chatbots exert longer-lasting influence on humans. Moreover, chatbots exert strong, stable self-influence over their own future outputs that tends to perpetuate delusions over long stretches of conversation. In fact, this chatbot self-influence constituted the dominant pathway when considering accumulated influence over time. Overall, these results indicate that humans tend to drive sharp, immediate increases in delusion, whereas chatbots sustain and propagate these effects over longer timescales. Together, these findings provide the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposable into distinct pathways with dissociable temporal dynamics. By doing so, they can inform the development of safer AI systems.
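
The contrast between "strong but short-lived" and "stable, long-lasting" influence can be made concrete with a leaky accumulator, one common way to model accumulating and decaying effects. This is a sketch under assumptions: the paper's latent state model is not specified in the abstract, and the weights and decay rates below are invented for illustration, not estimates from the paper.

```python
def influence_trace(weight, decay, steps):
    """Influence of one unit input after 0..steps-1 turns: weight * decay**k."""
    return [weight * decay ** k for k in range(steps)]


def latent_state(inputs, weight, decay):
    """Leaky accumulator: state_t = decay * state_{t-1} + weight * input_t."""
    state, history = 0.0, []
    for x in inputs:
        state = decay * state + weight * x
        history.append(state)
    return history


# Illustrative parameters only, not estimates from the paper.
human_to_bot = influence_trace(weight=1.0, decay=0.3, steps=50)  # strong, short-lived
bot_to_bot = influence_trace(weight=0.4, decay=0.9, steps=50)    # weaker, persistent

immediate = (human_to_bot[0], bot_to_bot[0])
accumulated = (sum(human_to_bot), sum(bot_to_bot))
```

The accumulated influence of a geometric kernel approaches weight / (1 - decay), so a pathway with a smaller immediate effect but slower decay can dominate over a long conversation. This matches the qualitative pattern the abstract reports for chatbot self-influence.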


[8] Diagnosis, Bad Planning & Reasoning. Treatment, SCOPE – Planning for Hybrid Querying over Clinical Trial Data cs.CL

Suparno Roy Chowdhury, Manan Roy Choudhury, Tejas Anvekar, Muhammad Ali Khan, Kaneez Zahra Rubab Khakwani

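TL;DR: This paper proposes SCOPE, a multi-LLM planner-based framework that decomposes clinical trial table reasoning into three stages (row selection, structured planning, and execution) to address the "bad reasoning" that existing LLM approaches exhibit under implicit planning assumptions.

Motivation: In clinical trial table reasoning, answers must often be derived indirectly through semantic understanding (normalization, classification, extraction, or lightweight domain reasoning). Existing LLM approaches frequently reason incorrectly under implicit planning assumptions, so an explicit planning mechanism is needed to recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status.

Result: On 1,500 hybrid reasoning questions over oncology clinical-trial tables, SCOPE achieves higher accuracy than the zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent baselines, along with a better accuracy-efficiency trade-off.

Insight: The novelty is to frame clinical trial reasoning as a distinct table understanding problem and to use hybrid planner-based decomposition, in which the source field, reasoning rules, and output constraints are made explicit to reduce ambiguity and improve accuracy. More broadly, the structured task decomposition of a multi-LLM planner offers a scalable approach to table reasoning in complex domains.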

Abstract: We study clinical trial table reasoning, where answers are not directly stored in visible cells but must be reasoned from semantic understanding through normalization, classification, extraction, or lightweight domain reasoning. Motivated by the observation that current LLM approaches often suffer from “bad reasoning” under implicit planning assumptions, we focus on settings in which the model must recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status from partially observed clinical-trial tables. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.
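
The three-stage decomposition (row selection, structured planning, execution) can be sketched with an explicit plan object that names the source field, the reasoning rule, and the output constraints. This is a minimal illustration under assumptions: in SCOPE these stages are driven by LLMs, whereas here plain Python functions stand in for them, and the `Plan` class, the toy table, and the "+" rule for therapy type are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    """Explicit plan: where to read, how to reason, what outputs are allowed."""
    source_field: str
    rule: Callable[[str], str]
    allowed_outputs: frozenset


def select_rows(table, predicate):
    """Stage 1: keep only the rows relevant to the question."""
    return [row for row in table if predicate(row)]


def execute(plan, rows):
    """Stage 3: apply the plan's rule and enforce its output constraints."""
    answers = []
    for row in rows:
        value = plan.rule(row[plan.source_field])
        if value not in plan.allowed_outputs:
            raise ValueError(f"output {value!r} violates the plan's constraints")
        answers.append(value)
    return answers


# Toy table: therapy type is implicit in the free-text arm description.
table = [
    {"trial": "T1", "arm": "drug A + drug B"},
    {"trial": "T2", "arm": "drug A"},
]

# Stage 2: a hand-written plan standing in for the planner's structured output.
plan = Plan(
    source_field="arm",
    rule=lambda arm: "combination" if "+" in arm else "monotherapy",
    allowed_outputs=frozenset({"combination", "monotherapy"}),
)

rows = select_rows(table, lambda row: row["trial"] == "T1")
answers = execute(plan, rows)
```

Making the plan a first-class object means an out-of-vocabulary answer fails loudly at execution time instead of silently reaching the user.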


[9] CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation cs.CL

Rui Qi, Fengran Mo, Sijin Lu, Yufeng Chen, Jian-Yun Nie

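TL;DR: This paper proposes CroSearch-R1, a search-augmented reinforcement learning framework that better exploits multilingual knowledge for Retrieval-Augmented Generation (RAG). It uses multi-turn retrieval with cross-lingual knowledge integration to dynamically align knowledge from different languages into a unified representation space, and introduces a multilingual rollout mechanism to optimize the transferability of reasoning across languages.

Motivation: In a multilingual collection, simply concatenating knowledge from different languages may fail to improve RAG because of disparities across languages. The paper aims to use cross-lingual knowledge more effectively to supplement and correct facts in the original language.

Result: Experiments show that the framework effectively leverages cross-lingual complementarity and improves RAG over multilingual collections.

Insight: The contributions are integrating cross-lingual knowledge into the GRPO reinforcement learning process, a multi-turn retrieval strategy that dynamically aligns multilingual knowledge, and a multilingual rollout mechanism that optimizes cross-lingual reasoning transfer. More broadly, the method combines dynamic integration of cross-lingual knowledge with RL optimization, suggesting a design direction for multilingual RAG systems.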

Abstract: A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.
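
For readers unfamiliar with GRPO, its central step is to score each rollout relative to the other rollouts sampled for the same query, normalizing rewards by the group mean and standard deviation. The sketch below shows only that generic step. Treating rollouts in different languages as one group is an assumption about how the multilingual rollout mechanism might plug in, and the reward values are invented.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one question,
# e.g. two retrieving in the original language and two in another language.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so no separate value network is needed. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.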


[10] BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate cs.CL | cs.AI | cs.LG

Arnon Mazza, Elad Levi

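TL;DR: BARRED uses dimension decomposition and multi-agent debate to generate high-quality synthetic training data from only a task description and a small set of unlabeled examples. The data is used to train custom policy guardrail models that surpass existing large language models and dedicated guardrail models in accuracy and efficiency.

Motivation: Deploying custom policy guardrails is hard: generic safety models miss task-specific requirements, and prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency but requires large amounts of costly labeled data.

Result: Across diverse custom policy tasks, small language models fine-tuned on BARRED's synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablations confirm that dimension decomposition and debate-based verification are both critical for data diversity and label fidelity.

Insight: The novelty is to ensure comprehensive coverage of the domain space through dimension decomposition and to verify label correctness through multi-agent debate, yielding faithful and diverse synthetic training data. The method removes the reliance on extensive human annotation and offers a scalable route to accurate custom guardrails.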

Abstract: Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
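
Dimension decomposition can be pictured as enumerating every combination of values along a few axes, so that synthetic examples are requested for each cell rather than sampled haphazardly. The sketch below assumes made-up dimensions and uses a unanimity check as a crude stand-in for debate-based verification. The actual debate protocol is not described in the abstract.

```python
from itertools import product


def coverage_grid(dimensions):
    """Enumerate every combination of dimension values as one generation target."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*(dimensions[n] for n in names))]


def keep_if_agreed(label, verdicts):
    """Crude stand-in for debate: keep a label only if every verifier agrees."""
    return all(v == label for v in verdicts)


# Hypothetical dimensions for a customer-support policy.
dimensions = {
    "topic": ["refunds", "medical advice"],
    "tone": ["polite", "hostile", "ambiguous"],
}
grid = coverage_grid(dimensions)
```

Each cell of `grid` would become a prompt for a data generator, and only examples whose labels survive verification would enter the training corpus.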


[11] Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs cs.CL

Jianghang Lin, Haihua Yang, Deli Yu, Kai Wu, Kai Ye

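TL;DR: This paper proposes an entity-centric medical data engineering framework. Entities are automatically extracted from authoritative medical literature to build a Medical Entity Tree (MET), on top of which a data engine provides node-guided retrieval, a two-stage hybrid filtering and alignment pipeline, and knowledge-aware data synthesis, significantly improving general-purpose multimodal large language models (MLLMs) on medical tasks.

Motivation: Conventional coarse-grained data curation by modality or department fails to capture the hierarchical, interconnected nature of clinical medical knowledge, limiting MLLMs' fine-grained recognition and complex reasoning in medical applications.

Result: Extensive evaluation on six medical benchmarks shows that the approach significantly strengthens the medical capabilities of general-purpose MLLMs, improving their handling of complex clinical queries and achieving state-of-the-art performance across diverse medical contexts.

Insight: The novelty is to build a structured Medical Entity Tree as a unified knowledge repository and to design a data engineering pipeline around it, systematically encoding medical knowledge and achieving precise visual-semantic alignment, which provides a scalable data foundation for medical multimodal learning.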

Abstract: Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models’ ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical literature to construct a Medical Entity Tree (MET), a hierarchical structure that systematically encodes diseases, anatomical structures, modalities, and symptoms into a unified knowledge repository. Building upon the MET, we propose an advanced data engine that includes: (1) node-guided retrieval to anchor raw data to specific medical concepts, (2) a two-stage hybrid filtering and alignment pipeline to ensure precise visual-semantic correspondence, and (3) knowledge-aware data synthesis to generate enriched captions and targeted reasoning VQA pairs, leveraging structural constraints. Extensive evaluations across six medical benchmarks demonstrate that our approach significantly enhances the medical capabilities of general-purpose MLLMs, improving their ability to handle complex clinical queries and achieve state-of-the-art performance in diverse medical contexts.


[12] The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models cs.CL | cs.AI

Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, Vineet Agarwal

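TL;DR: This paper introduces the Structured Output Benchmark (SOB), a multi-source benchmark that evaluates how well large language models extract structured data from three source modalities: text, images, and audio. It contains 5,000 text, 209 image, and 115 audio records. Each task asks the model to produce a schema-compliant answer given a natural-language question and a JSON schema. An evaluation of 21 frontier and open-weight models shows near-perfect schema compliance but weak value accuracy (exact leaf-value match), especially on long-context audio tasks.

Motivation: Existing benchmarks either focus on schema compliance alone or evaluate value correctness within a single source domain. A fair, multi-source benchmark of structured-output quality that isolates the effect of raw vision or speech processing has been missing.

Result: 21 models are evaluated on seven metrics across the text, image, and audio source domains. Schema compliance is near perfect, but the best Value Accuracy (exact leaf-value match) is 83.0% on text, 67.2% on images, and only 23.7% on audio, indicating that longer context makes extraction substantially harder.

Insight: The novelty is a multi-source (text, image, audio) benchmark that uses text-normalized representations to isolate the effect of raw modality processing, enabling a fair, source-agnostic comparison. The benchmark exposes a marked gap between schema compliance and value accuracy in structured output, especially in complex long-context scenarios, giving a clear direction for model improvement.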

Abstract: Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.
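
"Exact leaf-value match" can be read as flattening the ground-truth JSON into leaf paths and counting how many leaves the prediction reproduces exactly. The sketch below follows that reading. It is an assumption rather than the benchmark's published scoring code, and details such as the type-strict comparison and the handling of extra predicted keys are illustrative choices.

```python
def leaves(value, path=()):
    """Yield (path, leaf_value) pairs for a nested JSON-like structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaves(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from leaves(child, path + (index,))
    else:
        yield path, value


_MISSING = object()


def value_accuracy(predicted, truth):
    """Fraction of ground-truth leaves that the prediction matches exactly."""
    truth_leaves = dict(leaves(truth))
    if not truth_leaves:
        return 1.0
    pred_leaves = dict(leaves(predicted))
    hits = 0
    for path, expected in truth_leaves.items():
        got = pred_leaves.get(path, _MISSING)
        # Type-strict equality, so 1 does not match True and 42 does not match "42".
        if got is not _MISSING and type(got) is type(expected) and got == expected:
            hits += 1
    return hits / len(truth_leaves)


truth = {"invoice": {"id": "A-17", "total": 42.5}, "items": [{"sku": "x", "qty": 2}]}
predicted = {"invoice": {"id": "A-17", "total": 42.0}, "items": [{"sku": "x", "qty": 2}]}
score = value_accuracy(predicted, truth)
```

Here the prediction is fully schema-compliant yet gets one of four leaf values wrong, which is exactly the gap between schema compliance and value accuracy that the abstract highlights.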


[13] Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives cs.CL | cs.AI

Yu Wang, Emmanuele Chersoni, Chu-Ren Huang

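TL;DR: Using demonstratives (such as "this/that" in English and "zhè/nà" in Chinese), this paper examines whether large language models truly acquire embodied cognition and cultural conventions from text. A human baseline built from 6,400 responses by 320 native speakers shows systematic differences between English and Chinese speakers in spatial reference and perspective-taking. Five state-of-the-art LLMs, however, fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning.

Motivation: The paper asks whether LLMs truly acquire embodied cognition (cognition grounded in the body and environment) and cultural conventions from text data, focusing on spatial demonstratives, which are common across languages but differ subtly in use.

Result: In the human baseline, English speakers reliably distinguish proximal from distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. By contrast, five state-of-the-art LLMs fail to understand the proximal-distal contrast and show no cultural differences, and their reasoning is English-centric.

Insight: The contributions are (1) a new demonstrative-based task as a lens for evaluating embodied cognition and cultural conventions in models; (2) empirical evidence of cross-cultural asymmetries in human interpretation; (3) a new perspective on the egocentric-sociocentric debate, showing that both orientations coexist but vary across languages; and (4) a call to address individual variation in future model design. More broadly, the study turns a classic linguistic phenomenon, demonstratives, into a quantifiable benchmark that exposes fundamental model limitations.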

Abstract: Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like “this/that” in English and “zhè/nà” in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.


[14] One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement cs.CL

Yixiao Zhou, Dongzhou Cheng, Zhiliang Wu, Yi Yang, Yu Cheng

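TL;DR: This paper proposes ReQueR, a framework that trains a query Refiner with reinforcement learning to rewrite ambiguous human queries into explicit logical decompositions, eliciting the latent reasoning capabilities of large language models at inference time. A single Refiner generalizes to eliciting reasoning in diverse unseen models.

Motivation: LLMs often fail to use their latent reasoning capabilities because of a distributional mismatch between ambiguous human queries and the structured logic required for machine activation. Existing methods are limited: fine-tuning is costly and static prompts are ineffective.

Result: ReQueR achieves absolute gains of 1.7% to 7.2% across diverse model architectures and benchmarks and outperforms strong baselines by 2.1% on average, demonstrating generality and effectiveness.

Insight: The novelty is to treat reasoning elicitation as an inference-time alignment task and to introduce the Adaptive Solver Hierarchy, a curriculum mechanism rooted in the Zone of Proximal Development from educational psychology, to stabilize training. Its core value is a one-to-many paradigm: a Refiner trained on a small set of models can elicit reasoning in diverse unseen models.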

Abstract: Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (Reinforcement Query Refinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner’s evolving competence. ReQueR yields consistent absolute gains of 1.7%–7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at https://github.com/newera-xiao/ReQueR.


[15] From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation cs.CL | cs.AI

Dominik Borawski, Marta Szulc, Robert Chudy, Małgorzata Giedrowicz, Piotr Mironowicz

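TL;DR: This paper proposes a dependency-aware, multi-stage prompt pipeline for generating coherent role-playing game (RPG) content. Generation is decomposed into sequential stages (world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion), each conditioned on structured JSON outputs from previous stages, which reduces narrative drift and hallucinations and improves coherence and controllability.

Motivation: LLM-based narrative generation for complex, multi-layered RPGs suffers from problems of coherence, controllability, and structural consistency. The paper addresses them with structured intermediate representations and a dependency-driven pipeline.

Result: In a human-centered qualitative evaluation over multiple independent runs, the pipeline consistently produces logically sound and structurally valid RPG content, assessed on structural completeness, internal consistency, narrative coherence, diversity, and actionability, with no quality degradation as complexity increases.

Insight: The novelty is a dependency-aware, multi-stage prompt pipeline that enforces schemas and data flow through structured JSON intermediate representations and separates high-level campaign planning from detailed quest expansion, improving both global structure and local storytelling. The approach may generalize to other domains requiring sequential reasoning over evolving contextual states.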

Abstract: Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion. Each stage conditions on structured JSON outputs from previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. The system is evaluated qualitatively through human-centered analysis across multiple independent runs. Outputs are assessed using criteria such as structural completeness, internal consistency, narrative coherence, diversity, and actionability. Results show that the pipeline consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling. These findings suggest that dependency-aware prompt pipelines with structured intermediate representations are an effective design pattern for LLM-based procedural content generation. This approach may also generalize to other domains requiring sequential reasoning over evolving contextual states.
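
The staged design described above can be sketched as a loop in which each stage receives only the JSON outputs of the stages it depends on and must return JSON containing a set of required keys. The stage names follow the abstract. The dependency lists, required keys, and `generate` callback are hypothetical placeholders, and a real pipeline would call an LLM and likely validate against full JSON schemas.

```python
import json

# (stage name, stages it conditions on, keys its JSON output must contain)
# Dependencies and required keys are illustrative assumptions.
STAGES = [
    ("world", [], ["name", "factions"]),
    ("npcs", ["world"], ["characters"]),
    ("player", ["world"], ["character"]),
    ("campaign", ["world", "npcs", "player"], ["quests"]),
    ("quest_expansion", ["campaign"], ["expanded_quests"]),
]


def run_pipeline(generate):
    """Run the stages in order, passing each one the JSON of its dependencies."""
    outputs = {}
    for name, deps, required in STAGES:
        context = {dep: outputs[dep] for dep in deps}
        raw = generate(name, json.dumps(context))  # stand-in for an LLM call
        data = json.loads(raw)                     # fails loudly on invalid JSON
        missing = [key for key in required if key not in data]
        if missing:
            raise ValueError(f"stage {name!r} is missing keys: {missing}")
        outputs[name] = data
    return outputs


def fake_generate(stage, context_json):
    """Hypothetical generator that returns placeholder values for required keys."""
    required = {name: keys for name, _, keys in STAGES}[stage]
    return json.dumps({key: f"<{stage}.{key}>" for key in required})


result = run_pipeline(fake_generate)
```

Rejecting a stage's output before it reaches later stages is what keeps an early error from propagating through the whole campaign.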


[16] Modeling Human-Like Color Naming Behavior in Context cs.CL

Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza

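TL;DR: In the NeLLCom-Lex framework, the color naming lexicons that neural agents acquire through supervised learning and reinforcement learning diverge systematically from human color categories, for example by occupying non-convex regions of color space. This paper introduces two factors to address this: upsampling rare color terms during supervised learning and multi-listener interactions during reinforcement learning. Upsampling improves lexical diversity and system-level informativeness, multi-listener setups promote more convex color categories, and a moderate combination of the two yields lexicons closest to human color naming systems.

Motivation: Lexicon-emergence models based on interacting neural agents, such as NeLLCom-Lex, can simulate human lexicon learning, but the color lexicons they produce form highly non-convex regions in color space, unlike the convex categories typical of humans. The model therefore needs improvement to better simulate human color naming behavior.

Result: Upsampling rare color terms improves lexical diversity and system-level informativeness, multi-listener RL interactions promote more convex color categories, and a moderate combination of the two produces lexicons most similar to human systems, as quantified with a convexity measure.

Insight: The novelty is to correct systematic biases of lexicon-emergence models through rare-term upsampling and multi-listener interaction, and to use a convexity measure to quantify the geometric coherence of color categories, offering a methodological reference for simulating more realistic human semantic category formation.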

Abstract: Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.


[17] Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation cs.CLPDF

Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani, Sadra Hakim, Mahshid Keivandarian

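TL;DR: This paper proposes a reinforcement-learning-based post-training paradigm that uses Direct Preference Optimization (DPO) to improve the translation quality of neural machine translation (NMT) systems. The method needs only a general text corpus and iterative feedback from an expert (human or AI). On the high-resource English-to-German pair, it significantly raises the COMET score of the gemma3-1b model from 0.703 to 0.747.

Details

Motivation: Although NMT systems trained on supervised parallel data have made great progress, they still exhibit persistent translation errors. This paper aims to correct these errors effectively through an RL-based post-training paradigm.

Result: On English-to-German translation, applying the DPO-driven framework raises the COMET score of gemma3-1b from 0.703 to 0.747, indicating a significant improvement in translation quality.

Insight: The novelty lies in bringing DPO, an efficient preference-optimization method, into NMT post-training. Pre-trained models can be improved stably using only general text and expert feedback, avoiding the instability of traditional RL methods and opening a new path for preference-based fine-tuning.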

Abstract: Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
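
For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. The sketch below shows the generic DPO loss for one preferred and one rejected translation. It does not reproduce this paper's training setup: how preference pairs are built is not described here, and `beta=0.1` is an assumed value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected outputs
    under the trained policy and the frozen reference model. beta is an
    assumed temperature, not a value taken from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the loss equals log 2. It falls as the policy shifts probability toward the chosen translation relative to the reference.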


[18] CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation cs.CL | cs.AIPDF

Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin

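TL;DR: This paper systematically evaluates a range of methods, from traditional lexical matching to deep semantic encoders to large language models (LLMs), on estimating nutrients from unstructured recipe text. Under the strict EU regulatory criteria, it finds a clear trade-off between predictive accuracy and computational efficiency: the TF-IDF baseline is fast at inference but only moderately accurate, DeBERTa-v3 performs poorly when data is scarce, and few-shot LLM inference and a hybrid LLM refinement pipeline achieve the highest validation accuracy across all nutrient categories, at the cost of substantially higher inference latency.

Details

Motivation: To address the challenge of accurately estimating nutrients from unstructured recipe text, which is important for dietary monitoring and is made difficult by ambiguous ingredient terminology and highly variable quantity expressions.

Result: On the FoodBench-QA 2026 benchmark, under the strict tolerance criteria of EU Regulation 1169/2011, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) achieve the highest validation accuracy across all nutrient categories, the best performance among the approaches evaluated.

Insight: The paper’s claimed contribution is a systematic comparison of models with different representational capacity on this task, revealing the trade-off between accuracy and efficiency. Viewed objectively, its core insight is that LLMs can draw on pre-trained world knowledge to resolve ambiguous terms and normalize non-standard units, which purely lexical methods struggle to do. This offers an effective way to handle semantic ambiguity in unstructured text, while also highlighting the trade-off between real-time efficiency and nutritional precision in practical deployment.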

Abstract: Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.


[19] MAIC-UI: Making Interactive Courseware with Generative UI cs.CL | cs.AI | cs.HCPDF

Shangqing Tu, Yanjia Li, Keyu Chen, Sichen Zhang, Jifan Yu

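TL;DR: MAIC-UI is a zero-code authoring system that helps educators quickly create and edit interactive STEM courseware from materials such as textbooks, PPTs, and PDFs. Through structured knowledge analysis, a two-stage generate-verify-optimize pipeline, and unified-diff-based incremental generation, it addresses the shortcomings of existing generative AI tools in producing interactivity, handling long documents, ensuring pedagogical accuracy, and editing efficiently.

Details

Motivation: Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, which is a barrier for educators. Existing generative AI tools produce static presentations rather than interactive simulations, struggle with long documents, lack mechanisms for ensuring pedagogical accuracy, and take too long to fully regenerate after a modification, interrupting the authoring flow.

Result: A controlled lab study with 40 participants shows that, compared with direct text-to-HTML generation, MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability. A three-month classroom deployment with 53 high school students shows that the pilot class using MAIC-UI gained 9.21 points in STEM subjects while control classes dropped 2.32 points, suggesting that it fosters learning agency and reduces outcome disparities.

Insight: The claimed contributions are: 1) structured knowledge analysis with multimodal understanding to ensure pedagogical rigor; 2) a two-stage generate-verify-optimize pipeline that separates content alignment from visual refinement; 3) unified-diff-based incremental generation combined with Click-to-Locate editing, achieving iteration cycles of under 10 seconds. Viewed objectively, the core innovation is moving generative UI authoring from a “generate static code in one shot” paradigm to an incremental editing workflow that is fast, precise, and pedagogy-oriented, which has practical value for lowering the barrier to creating educational technology.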

Abstract: Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, creating barriers for educators. While generative AI can produce HTML code, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200–600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities – the pilot class achieved 9.21-point gains in STEM subjects compared to -2.32 points in control classes. Our code is available at https://github.com/THU-MAIC/MAIC-UI.
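
To make the idea of diff-based incremental editing concrete, the sketch below uses Python’s standard `difflib` to express a small courseware change as a unified diff rather than a full regenerated page. It only illustrates the patch format: the helper `make_patch`, the file name `courseware.html`, and the sample HTML are hypothetical, and MAIC-UI’s own generation and patch-application logic is not described in the abstract.

```python
import difflib


def make_patch(old_html: str, new_html: str, path: str = "courseware.html") -> str:
    """Return a unified diff that turns old_html into new_html.

    The path is a hypothetical file name used only in the diff header.
    Inputs are assumed to end with a newline so diff lines stay separate.
    """
    old_lines = old_html.splitlines(keepends=True)
    new_lines = new_html.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=f"a/{path}", tofile=f"b/{path}"
    )
    return "".join(diff)


# Sample input: a one-line edit to a button label.
OLD_PAGE = '<h1>Pendulum</h1>\n<button id="run">Start</button>\n'
NEW_PAGE = '<h1>Pendulum</h1>\n<button id="run">Run simulation</button>\n'

if __name__ == "__main__":
    print(make_patch(OLD_PAGE, NEW_PAGE))
```

The resulting patch contains only the changed line plus a little context. This is the property that makes incremental edits much cheaper than regenerating an entire document.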


[20] DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios cs.CLPDF

Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu

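TL;DR: This paper introduces DV-World, a benchmark of 260 tasks for evaluating data visualization (DV) agents in real-world scenarios. It covers three domains (native environment interaction, cross-platform evolution, and proactive intent alignment) and uses a hybrid evaluation framework for comprehensive performance assessment.

Details

Motivation: Existing benchmarks suffer from code-sandbox confinement, single-language creation-only tasks, and the assumption of perfect intent, so they fail to reflect the complex demands of real-world data visualization. An evaluation suite closer to actual workflows is therefore needed.

Result: Experiments show that current state-of-the-art models achieve less than 50% overall performance on DV-World, exposing critical deficits in handling the complex challenges of real-world data visualization.

Insight: Contributions include: 1) multi-domain benchmark tasks covering real-world professional lifecycles; 2) a hybrid evaluation framework combining numerical-precision alignment with MLLM-based semantic-visual assessment; 3) a user simulator that mimics ambiguous requirements to evaluate agents’ proactive intent alignment, providing a realistic testbed for developing general-purpose DV agents suited to enterprise workflows.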

Abstract: Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at https://github.com/DA-Open/DV-World.


cs.CV [Back]

[21] Interactive Episodic Memory with User Feedback cs.CVPDF

Nikesh Subedi, Loris Bazzani, Ziad Al-Halah

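TL;DR: This paper proposes EM-QnF, an interactive episodic memory task that introduces user feedback to address incorrect responses caused by ambiguous or incomplete queries in the conventional EM-NLQ task. The authors build feedback-interaction datasets and propose a lightweight training scheme and a plug-and-play Feedback ALignment Module (FALM) that lets existing models use feedback effectively to refine their predictions.

Details

Motivation: Existing EM-NLQ methods use a one-shot query-response setup and ignore the ambiguity and incompleteness that queries may have in real-world scenarios, which limits their applicability. This paper aims to close this gap by introducing an interactive feedback mechanism.

Result: On three challenging benchmarks, the method significantly outperforms the state of the art (SOTA) and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.

Insight: The novelty lies in bringing interactive feedback into the episodic memory task, proposing a lightweight training scheme that avoids the cost of sequential optimization, and designing the plug-and-play FALM module so that existing models can quickly adapt to feedback-based interaction. This provides a scalable solution for handling ambiguous queries.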

Abstract: In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., “Where did I place the mug?”) that requires searching a long egocentric video, captured from the user’s perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model’s initial prediction or add more information (e.g., “Before this. I’m looking for the big blue mug not the white one”), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.


[22] Agentic AI for Remote Sensing: Technical Challenges and Research Directions cs.CVPDF

Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan

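TL;DR: This position paper examines the technical challenges and research directions in applying agentic AI to remote sensing. It argues that remote sensing workflows involve georeferenced, multi-modal, and temporally structured data, and that the implicit assumptions of generic agentic AI break down in such workflows, causing errors to propagate silently across steps. It therefore calls for redesigning remote-sensing-native agents around principles such as structured geospatial state, tool-aware reasoning, and verifier-guided execution.

Details

Motivation: To address the failure of generic agentic AI in multi-step remote sensing analysis workflows, which is caused by geospatial, temporal, and physical constraints, in order to build reliable geospatial agents.

Result: As a position paper, it reports no quantitative experimental results. It proposes future research directions, including remote-sensing-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy.

Insight: The novelty lies in systematically analyzing the structural failure modes of generic agentic models in remote sensing workflows and proposing design principles for native agents that center on structured geospatial state and emphasize geospatial consistency and physical validity, offering a new perspective for developing domain-specific agents.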

Abstract: Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.


[23] Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing cs.CVPDF

Cheng-Han Lee, Maniratnam Mandal, Neil Birkbeck, Yilin Wang, Balu Adsumilli

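TL;DR: Targeting video consumption on mobile devices with varying display resolutions and orientation modes, this paper proposes adapting video aspect ratios by temporally cropping the important regions within video frames, aiming to minimize distortion and preserve core content. To address the lack of large-scale annotated data in this area, the authors build the LIVE-YouTube Video Cropping database (LIVE-YT VC), which contains 1800 videos annotated by 90 subjects and is the largest publicly available subjective video portrait region cropping database to date. They also present a post-processed version, LIVE-YT VC++, which smooths the subjective annotations with a novel intra-frame temporal filter. The paper demonstrates the database’s value using the SmartVidCrop algorithm and state-of-the-art video grounding models, explores the similarity between its labels and video saliency predictions, and repurposes video grounding models for the aspect-ratio change task, fine-tuning them on the dataset.

Details

Motivation: As video consumption on mobile devices diversifies (different handheld display resolutions and orientation modes), changing a video’s aspect ratio becomes challenging: static cropping and border padding degrade visual quality, while warping may distort the video’s intended meaning. A more effective approach is needed that temporally crops the important regions of video frames to minimize distortion and preserve key content.

Result: The paper builds the LIVE-YouTube Video Cropping database (LIVE-YT VC), containing 1800 videos annotated by 90 human subjects, the largest publicly available subjective video portrait region cropping database to date. Evaluations with the SmartVidCrop algorithm and state-of-the-art video grounding models demonstrate its usefulness as a benchmark for future research. Fine-tuning video grounding models on the aspect-ratio change task further validates the data resource.

Insight: The main contributions are: 1) the large-scale subjective video portrait region cropping database LIVE-YT VC, which fills a gap in data resources for this area; 2) the LIVE-YT VC++ version, which introduces an intra-frame temporal filter to smooth subjective annotations and improve their temporal consistency; 3) repurposing video grounding models for video aspect-ratio change, broadening their range of application; 4) an exploration of the similarity between video cropping annotations and video saliency predictions, offering insight for cross-task research. Together these provide a resource for advancing video aspect-ratio transformation models so that reshaped, mobile-friendly video retains its quality and meaning.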

Abstract: With the rise of mobile video consumption on diverse handheld display resolutions and orientation modes, adapting videos to different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video’s intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos, annotated by 90 human subjects. Using videos sourced from the YouTube-UGC and LSVQ Databases, this new resource is the largest publicly-available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, whereby a novel intra-frame temporal filter was deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels bear resemblances to video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks, and fine-tuned them on our dataset. As a service to the research community, we plan to open source the project.


[24] ViPO: Visual Preference Optimization at Scale cs.CV | cs.AIPDF

Ming Li, Jie Wu, Justin Cui, Xiaojie Li, Rui Wang

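TL;DR: This paper presents ViPO, a large-scale visual preference optimization dataset, and the Poly-DPO algorithm. To address the noise, low resolution, and imbalanced distributions of existing open-source preference datasets, the authors build ViPO, which contains high-quality image and video pairs at the million scale, and propose Poly-DPO, which adds a polynomial term that dynamically adjusts model confidence to improve robustness to noisy data. On the high-quality dataset, Poly-DPO reduces to standard DPO, which confirms the importance of data quality.

Details

Motivation: Preference optimization for visual generative models currently faces two major challenges. First, existing open-source preference datasets suffer from data bottlenecks: conflicting preference patterns, low resolution, limited prompt diversity, and imbalanced distributions. Second, optimizing directly on such noisy data works poorly, which hinders effective scaling of the preference-optimization paradigm.

Result: On the noisy Pick-a-Pic V2 dataset, Poly-DPO improves over Diffusion-DPO on the GenEval benchmark by 6.87 for SD1.5 and 2.32 for SDXL. Models trained on the high-quality ViPO dataset far outperform those trained on existing open-source preference datasets.

Insight: The main contributions are: 1) ViPO, a large-scale, high-quality, balanced visual preference dataset that addresses the data bottleneck; 2) the adaptive Poly-DPO algorithm, which handles noisy data robustly by dynamically adjusting confidence. The core insight is that scaling visual preference optimization requires both algorithmic adaptability and data quality. When data quality is high enough, sophisticated optimization algorithms may become unnecessary, which underscores the fundamental importance of high-quality datasets.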

Abstract: While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO’s adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.


[25] DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation cs.CVPDF

Mohamad Zamini, Diksha Shukla

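TL;DR: This paper proposes DouC, a training-free dual-branch CLIP framework for open-vocabulary semantic segmentation. It decomposes dense prediction into two complementary components: OG-CLIP improves the reliability of local patches through lightweight inference-time token gating, and FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, with no additional learnable parameters or retraining, preserving CLIP’s zero-shot generalization.

Details

Motivation: To address the limitation that existing training-free CLIP-based methods for open-vocabulary segmentation rely on a single inference mechanism and therefore cannot handle unreliable local tokens and insufficient spatial coherence at the same time.

Result: Extensive experiments on eight benchmarks and multiple CLIP backbones show that DouC consistently outperforms prior training-free methods and improves as model capacity increases.

Insight: The novelty lies in decoupling dense prediction into two complementary branches, one focused on local reliability and one on global structure awareness, and combining them through logit fusion. Its training-free design, which preserves zero-shot generalization, suggests a new way to use foundation models efficiently.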

Abstract: Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP’s zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
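
Logit-level fusion of two branches can be illustrated with a few lines of plain Python. The abstract states only that the branches are fused at the logit level, so the weighted sum below, the `weight` parameter, and the function names are assumptions for illustration rather than DouC’s actual fusion rule.

```python
def fuse_logits(branch_a, branch_b, weight=0.5):
    """Blend two logit tables of shape [patches][classes].

    A weighted sum is assumed here; weight is a hypothetical mixing
    coefficient applied to branch_a, with (1 - weight) applied to branch_b.
    """
    fused = []
    for row_a, row_b in zip(branch_a, branch_b):
        fused.append([weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)])
    return fused


def predict(logits):
    """Pick the highest-scoring class index for every patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]
```

For example, with `branch_a = [[2.0, 0.0], [0.0, 1.0]]` and `branch_b = [[0.0, 1.0], [0.0, 3.0]]`, an equal-weight fusion keeps class 0 for the first patch and class 1 for the second. Because the combination happens on logits, neither branch needs extra learnable parameters.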


[26] ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching cs.CVPDF

Jong Woo Nam, Amanda S. Rios, Bartlett W. Mel

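TL;DR: This paper presents ShapeY, a principled benchmarking framework for evaluating the shape recognition capability of object recognition (OR) systems. It comprises 68,200 grayscale images showing 200 3D objects rendered from multiple viewpoints, optionally with non-shape “appearance” changes. Using a nearest-neighbor matching task, ShapeY probes the fine-grained structure of an OR system’s embedding space to assess whether it clusters object views by 3D shape similarity regardless of viewpoint and other non-shape changes.

Details

Motivation: Human object recognition relies heavily on shape cues and the ability to recognize objects across 3D viewpoints, whereas deep networks often rely on non-shape cues such as texture and background, which leaves weaknesses in generalization and robustness. Closing this gap requires a benchmark dedicated to evaluating shape recognition capability.

Result: Tests of 321 pre-trained networks with diverse architectures show that robust shape-based recognition remains a major challenge even for state-of-the-art models. These models struggle to generalize consistently across 3D viewpoint and appearance changes, and occasionally make egregious errors by matching objects of obviously completely different shape.

Insight: The novelty lies in a principled benchmarking framework focused on shape recognition capability, which evaluates systems comprehensively through a nearest-neighbor matching task and a suite of quantitative and qualitative readouts (such as error rate graphs and viewpoint tuning curves). Its core insight is the importance of developing artificial vision systems with disentangled and invariant object encodings in order to move toward human-like shape recognition.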

Abstract: Object recognition (OR) in humans relies heavily on shape cues and the ability to recognize objects across varying 3D viewpoints. Unlike humans, deep networks often rely on non-shape cues such as texture and background, leading to vulnerabilities in generalization and robustness. To address this gap, we introduce ShapeY, a novel and principled benchmarking framework designed to evaluate shape-based recognition capability in OR systems. ShapeY comprises 68,200 grayscale images of 200 3D objects rendered from multiple viewpoints and optionally subjected to non-shape “appearance” changes. Using a nearest-neighbor matching task, ShapeY specifically probes the fine-grained structure of an OR system’s embedding space by evaluating whether object views are clustered by 3D shape similarity across varying 3D viewpoints and other non-shape changes. ShapeY provides a suite of quantitative and qualitative performance readouts, including error rate graphs, viewpoint tuning curves, histograms of positive and negative matching scores, and grids showing ordered best matches, which together offer a comprehensive evaluation of an OR system’s shape understanding capability. Testing of 321 pre-trained networks with diverse architectures reveals significant challenges in achieving robust shape-based recognition: even state-of-the-art models struggle to generalize consistently across 3D viewpoint and appearance changes, and are prone to infrequent but egregious matches of objects of obviously completely different shape. ShapeY establishes a principled framework for advancing artificial vision systems toward human-like shape recognition capabilities, emphasizing the importance of disentangled and invariant object encodings.
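
A nearest-neighbor matching test on an embedding space can be sketched as below. This is a simplified illustration that assumes cosine similarity and excludes only the query view itself. ShapeY’s actual protocol, including which candidate views are excluded and how appearance changes are applied, is not specified in the abstract, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nn_match_accuracy(embeddings, object_ids):
    """Fraction of views whose nearest other view shows the same object."""
    correct = 0
    for i, query in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        best = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        correct += object_ids[best] == object_ids[i]
    return correct / len(embeddings)
```

A model whose embedding clusters views by object scores close to 1.0 on this measure, while a model that clusters by viewpoint or texture scores lower.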


[27] Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CVPDF

Weixing Wang, Liudvikas Zekas, Anton Hackl, Constantin Alexander Auga, Parisa Shahabinejad

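TL;DR: This paper proposes XTC-Bench, an evaluation framework that measures the cross-task semantic consistency of unified multimodal models between visual understanding and visual generation, and introduces the Continuous Cross-Task Agreement (CCTA) metric for quantitative analysis.

Details

Motivation: Existing evaluation protocols assess the understanding and generation capabilities of unified multimodal models independently and cannot test whether the two are semantically aligned, so a new framework is needed to diagnose the consistency of the models’ internal representations.

Result: Experiments on eight open-source unified models and one commercial unified model show that high generation or understanding performance does not guarantee strong cross-task alignment, and that consistency depends on how tightly the learning objectives are coupled rather than on architectural unification alone.

Insight: The novelty lies in an evaluation framework built on scene graphs together with the CCTA metric, which enable fine-grained quantification of cross-task semantic consistency and give a concrete direction for moving beyond isolated task performance toward better-aligned internal representations.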

Abstract: Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.


[28] One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations cs.CVPDF

Ravikumar Balakrishnan, Sanket Mendapara

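TL;DR: This paper probes the safety of vision-language models (VLMs) using embedding-guided typographic perturbations. It finds that multimodal embedding distance effectively predicts attack success rate and reveals two failure modes that arise from the interplay of perceptual readability and safety alignment.

Details

Motivation: Prior work focuses mainly on maximizing the success rate of typographic prompt injection attacks but does not explain why certain renderings bypass a VLM’s safety alignment. This paper aims to examine the underlying causes from an interpretability perspective.

Result: In experiments on four VLMs including GPT-4o and Claude, multimodal embedding distance is strongly negatively correlated with attack success rate (r = -0.71 to -0.93, p < 0.01). Optimizing embedding similarity via CWA-SSA both recovers text readability and reduces safety refusals, with the dominant mechanism depending on the strength of the model’s safety filter and the degree of visual degradation.

Insight: The paper proposes multimodal embedding distance as a model-agnostic, interpretable proxy for predicting attack success rate. It decomposes attack success into two co-occurring failure modes, perceptual readability and safety alignment, and develops a red-teaming tool that requires no access to the target model.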

Abstract: Typographic prompt injection exploits vision language models’ (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r = -0.71$ to $-0.93$, $p < 0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model’s safety filter strength and the degree of visual degradation.
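
The correlation analysis reported in the abstract uses a standard statistic, the Pearson coefficient between embedding distance and ASR. The sketch below computes it in plain Python. The numbers in `TOY_DISTANCES` and `TOY_ASR` are made-up values for illustration only and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (NOT from the paper): one entry per rendering.
TOY_DISTANCES = [0.2, 0.4, 0.6, 0.8]  # image-text embedding distance
TOY_ASR = [0.9, 0.7, 0.4, 0.1]        # attack success rate

if __name__ == "__main__":
    print(round(pearson_r(TOY_DISTANCES, TOY_ASR), 3))
```

A coefficient near -1 means renderings whose image embedding lies farther from the text embedding succeed less often. This is the direction of the relationship the abstract reports.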


[29] M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering cs.CV | cs.AIPDF

Jiatong Ma, Longteng Guo, Yuchen Liu, Zijia Zhao, Dongze Hao

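TL;DR: This paper presents M$^3$-VQA, a novel knowledge-based visual question answering benchmark for evaluating multimodal large language models (MLLMs) on fine-grained multimodal entity understanding and complex multi-hop reasoning. The benchmark contains diverse multi-entity questions involving multiple distinct entities from visual and textual sources, and requires models to perform sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. The authors evaluate 16 leading MLLMs. The results reveal significant challenges in knowledge acquisition and reasoning, and reasoning-aware agentic retrieval outperforms heuristic methods.

Details

Motivation: Existing VQA datasets focus on coarse-grained categories and simple reasoning over single entities, and do not evaluate fine-grained multimodal entity understanding or complex multi-hop reasoning. To fill this gap, the authors propose the M$^3$-VQA benchmark to evaluate and advance the multimodal reasoning capabilities of MLLMs more comprehensively.

Result: Sixteen leading MLLMs are evaluated under three settings (no external knowledge, gold evidence provided, and retrieval-augmented input). Results are generally poor, revealing significant challenges in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when given precise evidence. In addition, reasoning-aware agentic retrieval surpasses heuristic methods.

Insight: The paper’s novelty lies in building a VQA benchmark focused on multimodal, multi-entity, multi-hop reasoning that emphasizes fine-grained entity understanding and complex cross-document reasoning. Viewed objectively, its structured reasoning requirements (sequential and parallel multi-hop reasoning) and its evaluation framework combining traceable evidence with a multimodal knowledge base offer a new, more challenging evaluation paradigm and direction for improving the complex multimodal understanding of MLLMs.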

Abstract: We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA-IVA-Lab/M3VQA.


[30] IAM: Identity-Aware Human Motion and Shape Joint Generation cs.CVPDF

Wenqi Jia, Zekun Li, Abhay Mittal, Chengcheng Tang, Chuan Guo

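TL;DR: This paper proposes IAM, an identity-aware framework for joint generation of human motion and shape. It represents identity with multimodal signals (such as natural language descriptions and visual cues) and generates motion sequences and body shape parameters simultaneously, addressing the fact that existing methods ignore the influence of body morphology on motion dynamics.

Details

Motivation: Existing text-driven human motion generation methods usually assume identity-neutral motion and generate movements with a canonical body representation, ignoring the significant influence of body morphology (such as proportions, mass distribution, and age) on motion dynamics. This often leads to physically inconsistent motion.

Result: Extensive experiments on motion capture datasets and large-scale in-the-wild videos show that the method improves motion realism and motion-identity consistency while maintaining high motion quality.

Insight: The novelty lies in a multimodal identity representation (instead of explicit geometric measurements) and a joint motion-shape generation paradigm that lets identity cues directly modulate motion dynamics, improving physical consistency.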

Abstract: Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: https://vjwq.github.io/IAM


[31] FCMBench-Video: Benchmarking Document Video Intelligence cs.CV | cs.CE | cs.MMPDF

Runze Cui, Fangxin Shang, Yehui Yang, Qing Yang, Tao Chen

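TL;DR: This paper introduces FCMBench-Video, a benchmark for evaluating document-video intelligence, including document perception, temporal grounding, and evidence-grounded reasoning. It contains 1,200 long-form videos and 11,322 expert-annotated question-answer instances in Chinese and English. Nine recent video multimodal large language models are evaluated on it, showing performance differences across systems and tasks.

Details

Motivation: To meet the need for document-video understanding in authenticity-sensitive applications such as financial credit review and remote verification. These settings demand high decision accuracy and evidence traceability, while existing benchmarks do not evaluate the temporal redundancy of document videos, cross-frame evidence integration, or acquisition-process cues.

Result: Evaluation on nine recent Video-MLLMs shows that FCMBench-Video separates system capabilities effectively: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection adds a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating that the benchmark is neither saturated nor dominated by trivial cases.

Insight: Contributions include: building a privacy-compliant yet realistic document-video dataset through an atomic-acquisition and composition workflow; introducing temporal spans and controlled degradations to simulate real capture conditions; and designing tasks (such as Cross-Document Validation) that specifically evaluate evidence integration in authenticity-sensitive applications, providing a reproducible benchmark for document-video understanding.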

Abstract: Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question–answer instances, covering 28 document types over 20s–60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.


[32] Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation cs.CV

Wentao Jiang, Yuanchan Xu, Heng Yuan

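TL;DR: This paper proposes RDCNet, a new image classification network built on the ResNet-34 architecture that improves classification performance by integrating three complementary modules: a multi-branch random dilated convolution module that captures multi-scale fine-grained features and improves robustness to noise and overfitting; a fine-grained feature enhancement module that fuses global context with local features; and a context excitation module that dynamically emphasizes task-relevant features through spatial attention and channel recalibration.

Details

Motivation: Conventional convolutional neural networks struggle in image classification to capture multi-scale contextual information and suppress background noise at the same time, and they tend to overfit on noisy regions.

Result: In extensive experiments on five benchmark datasets (CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof), RDCNet achieves state-of-the-art classification accuracy on all of them, outperforming the second-best competing methods by 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively.

Insight: The main novelty lies in combining randomness (a stochastic masking mechanism) with structured multi-scale feature extraction (multi-branch dilated convolution) to improve robustness, and in using the fine-grained feature enhancement and context excitation modules to fuse global context and local features dynamically and adaptively, which increases sensitivity to subtle visual patterns and suppresses background interference.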

Abstract: Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets – CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof – demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.
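As a rough illustration of why parallel branches with different dilation rates see different scales, the sketch below computes the spatial extent of a dilated kernel using the standard relation k_eff = k + (k - 1)(d - 1). The 3x3 kernel and the rates 1, 2, 3 are assumptions chosen for illustration; the abstract does not state RDCNet's actual branch configuration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated convolution kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical branch configuration (not taken from the paper).
ASSUMED_DILATION_RATES = (1, 2, 3)

for rate in ASSUMED_DILATION_RATES:
    extent = effective_kernel_size(3, rate)
    print(f"dilation={rate}: a 3x3 kernel spans {extent}x{extent} pixels")
```

Each branch keeps the same number of weights, so larger rates widen the context a branch sees without adding parameters.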


[33] DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams cs.CV | cs.AI | cs.CL

Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Gaurav Najpande, Manan Suri, Dinesh Manocha

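TL;DR: This paper introduces DRAGON, a benchmark for evaluating whether models reason from visual evidence in diagram question answering. Given a diagram, a question, and the correct answer, the model must localize the diagram regions that support the answer (such as axes, legends, and labels), which tests whether its reasoning is grounded in visual information rather than textual correlations or dataset artifacts. The dataset contains 11,664 annotated instances drawn from six existing diagram QA datasets, and the authors release a 2,445-instance test set together with a standardized evaluation framework.

Details

Motivation: Existing vision-language models can reach high accuracy on diagram question answering, but this does not guarantee that their reasoning is grounded in the visual evidence in the diagram. They may rely on textual correlations or dataset artifacts, which limits reliable evaluation and model interpretability.

Result: Eight recent vision-language models are evaluated on DRAGON, with an analysis of their ability to localize reasoning evidence across diagram domains. The benchmark provides a standard for systematic evaluation of diagram reasoning.

Insight: The novelty is a diagram-reasoning benchmark focused on evidence localization, which requires models to output bounding boxes for the visual regions that support the answer and so encourages interpretable, evidence-grounded reasoning. Viewed objectively, it consolidates multi-source diagram datasets under a unified annotation format, providing a useful tool for assessing genuine visual understanding.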

Abstract: Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.
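To make the bounding-box task concrete, here is a minimal sketch of how predicted evidence boxes could be scored against annotated ones. Intersection-over-union matching with a 0.5 threshold is an assumption borrowed from common detection practice; the abstract does not specify DRAGON's actual metric, and `evidence_recall` is a hypothetical helper name.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_recall(predicted, annotated, threshold=0.5):
    """Fraction of annotated evidence boxes matched by at least one prediction.

    The threshold is an assumed value, not one reported by the benchmark.
    """
    if not annotated:
        return 0.0
    hits = sum(
        1 for gold in annotated
        if any(box_iou(pred, gold) >= threshold for pred in predicted)
    )
    return hits / len(annotated)
```

A model that answers correctly but predicts boxes far from the legend or axis it should have read would score low here, which is the failure mode the benchmark is meant to expose.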


[34] Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation cs.CV

Tianshui Chen, Yujie Zhu, Jianman Lin, Zhijing Yang, Chunmei Qing

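TL;DR: This paper proposes a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm for speech-preserving facial expression manipulation (SPFEM), where scarce paired data makes direct supervision difficult. By introducing personalized prompts and feature differencing, the algorithm improves the supervision provided by existing vision-language models (VLMs), aligning visual and semantic features more precisely, and it can be integrated into existing SPFEM models as a plug-and-play module.

Details

Motivation: SPFEM aims to enhance facial expressiveness without altering the mouth movements tied to the original speech. Its core challenge is the lack of aligned frames of the same person with identical speech but different expressions, which prevents direct supervision of emotional manipulation. Existing VLMs can extract aligned visual and semantic features, but applying them directly has limitations.

Result: Extensive experiments across several datasets demonstrate the superior efficacy of the algorithm.

Insight: The main contributions are: 1) learning personalized prompts conditioned on individual visual information to capture expression differences among individuals and establish finer-grained visual-semantic correlations; 2) using feature differencing to correlate cross-modal features, bridging the inherent modality gap by matching changes in visual features to changes in semantic features and yielding more precisely aligned supervision. The module is designed to be plug-and-play and integrates into existing models.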

Abstract: Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.
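The feature-differencing idea can be sketched as comparing directions of change rather than raw features. The code below is an illustrative assumption, not the paper's loss: it uses a cosine distance between the visual change (edited minus source frame features) and the semantic change (target minus source prompt features), and all function and argument names are hypothetical.

```python
import math


def _difference(u, v):
    """Element-wise u - v for two equal-length feature vectors."""
    return [a - b for a, b in zip(u, v)]


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def differencing_loss(vis_source, vis_edited, sem_source, sem_target):
    """Assumed loss: 1 - cos(visual change, semantic change)."""
    visual_change = _difference(vis_edited, vis_source)
    semantic_change = _difference(sem_target, sem_source)
    return 1.0 - _cosine(visual_change, semantic_change)
```

Because only the differences are compared, a constant offset between the visual and semantic feature distributions cancels out, which is one plausible reading of how differencing helps with the modality gap.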


[35] Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval cs.CV

Guosheng Zhang, Linkai Liu, Keyao Wang, Haixiao Yue, Zhiwen Tan

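TL;DR: This paper proposes SSA-ME, a framework that addresses visual neglect and semantic drift in large multimodal models for unified multimodal retrieval. Through saliency-aware modeling, it uses visual experts to identify salient visual concepts in image-text pairs and introduces a saliency-guided objective that better aligns cross-modal attention with semantic regions, strengthening fine-grained representation learning.

Details

Motivation: Existing contrastive-learning-based unified multimodal retrieval methods focus on sample-level objectives and overlook crucial subject-level semantics. As a result, models struggle to group semantically coherent subjects in complex multimodal queries, which shows up as semantic alignment deviation (failure to accurately localize the salient visual regions that the text refers to). Models also over-rely on textual cues, leading to neglect of the visual modality and underuse of visual knowledge.

Result: The method achieves state-of-the-art performance on the MMEB benchmark, indicating that subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate its interpretability and effectiveness.

Insight: The novelty is a saliency-aware multimodal embedding framework that explicitly models and emphasizes salient visual subjects through a saliency-guided objective and a feature regeneration module, balancing cross-modal integration and mitigating semantic drift. Viewed objectively, combining subject-level semantics with saliency detection offers a new direction for improving the fine-grained alignment and interpretability of multimodal representations.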

Abstract: Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model’s ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation, where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.


[36] OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding cs.CV

Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

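TL;DR: This paper presents OmniVTG, a large-scale dataset for open-world video temporal grounding, together with a Self-Correction Chain-of-Thought training paradigm intended to strengthen the temporal grounding ability of multimodal large language models. The dataset is built with a Semantic Coverage Iterative Expansion pipeline and annotated with a caption-centric data engine. Experiments show that the method performs well on OmniVTG and achieves state-of-the-art zero-shot performance on four existing VTG benchmarks.

Details

Motivation: Existing video temporal grounding datasets are limited in scale and semantic diversity, which causes a performance gap between common and rare concepts and makes it hard for models to adapt to open-world settings.

Result: The method performs well on the proposed OmniVTG dataset and achieves state-of-the-art zero-shot performance on four existing VTG benchmarks: Charades-STA, ActivityNet Captions, TACoS, and QVHighlights.

Insight: The contributions include: 1) building a large-scale open-world dataset through the Semantic Coverage Iterative Expansion pipeline; 2) designing a caption-centric annotation engine that exploits the strength of MLLMs at dense captioning; 3) proposing the Self-Correction Chain-of-Thought training paradigm, which uses a predict, reflect, and refine process to transfer the stronger video understanding ability of MLLMs to direct grounding and narrow the performance gap between rare and common concepts.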

Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs’ video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.


[37] The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents cs.CV | cs.AI

Yuwei Sun, Yuxuan Yao, Hui Li, Siyu Zhu

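TL;DR: This paper proposes a recursive sparse mixture-of-experts framework, integrated into conventional diffusion models, that iteratively refines visual tokens to strengthen structured reasoning in multimodal text-to-image generation.

Details

Motivation: Diffusion models succeed at high-fidelity data synthesis but remain limited in complex structured reasoning (such as text-following tasks). The main obstacle is that visual tokens are continuous rather than discrete, which makes it difficult to extend the latent reasoning and recursion strategies used in language models.

Result: Comprehensive evaluation on class-conditioned ImageNet image generation and on the GenEval and DPG benchmarks shows that the method improves image generation performance.

Insight: Drawing inspiration from modular human cognition, the work introduces a recursive component within joint attention layers that iteratively refines visual tokens through sparse selection of neural modules, and designs a gating network that dynamically selects expert modules, strengthening structured reasoning in multimodal diffusion models.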

Abstract: Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning, as in text-following tasks, remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is devised to dynamically select specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation tasks and additional studies on the GenEval and DPG benchmarks demonstrate the superiority of the proposed method in enhancing model image generation performance.
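The recursive, sparsely gated refinement described in the abstract can be sketched as a loop that, at every latent step, scores all expert modules, keeps the top k, and applies their weighted outputs as a residual update. This is an illustrative toy under stated assumptions: top-k softmax gating and the residual form are common mixture-of-experts choices rather than details confirmed by the abstract, and `gate_fn`, `experts`, and `recursive_refine` are hypothetical names.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in chosen])
    return list(zip(chosen, weights))


def recursive_refine(token, experts, gate_fn, steps, k):
    """Refine one token vector over several latent steps with shared experts.

    gate_fn maps the current token to one logit per expert; in a real model it
    would also see the diffusion timestep and the conditioning information.
    """
    for _ in range(steps):
        update = [0.0] * len(token)
        for index, weight in top_k_gate(gate_fn(token), k):
            output = experts[index](token)
            update = [u + weight * o for u, o in zip(update, output)]
        token = [t + u for t, u in zip(token, update)]  # assumed residual update
    return token
```

The same expert list is reused at every step, so additional refinement steps add computation but no new parameters.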


[38] Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images cs.CV

Fabian Dionys Schrag, Mehmet Ozgur Turkoglu, Konrad Schindler, Ralph Lukas Stoop

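TL;DR: This paper studies domain adaptation (DA) for detecting the weed Rumex obtusifolius (broad-leaved dock) in drone images. The authors use a published dataset collected by ground vehicles as the source domain and evaluate on a custom target-domain dataset collected by drones. They find that CNN models (such as ResNet) generalize poorly even after fine-tuning, whereas applying DA techniques (such as moment matching and maximum classifier discrepancy) substantially improves performance. More importantly, self-supervised pretrained Vision Transformer models (DINOv2/DINOv3) handle domain shift better and reach high performance on the target domain, with F1 scores around 0.8.

Details

Motivation: Models for detecting Rumex obtusifolius in drone images generalize poorly because the source domain (ground-vehicle images) and the target domain (drone images) have different data distributions. This work addresses that problem.

Result: On the custom drone dataset (AGSMultiRumex), self-supervised pretrained ViT models fine-tuned on the source data reach F1 ≈ 0.8, surpassing ResNet models trained with DA techniques such as moment matching.

Insight: Self-supervised pretrained Vision Transformers (such as the DINO family) acquire general-purpose representations through large-scale pretraining that handle domain shift more effectively, and in domain adaptation tasks they may have an advantage over conventional CNNs combined with DA techniques. In addition, the publicly released drone-based weed detection dataset (AGSMultiRumex) can support further research.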

Abstract: Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performance of around F1 = 0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.


[39] Assessment of the quantitative impact of occlusal positioning splints on temporomandibular joint conditions cs.CV

Agnieszka Anna Tomaka, Krzysztof Domino, Dariusz Pojda, Michał Tarnawski

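TL;DR: This paper proposes a computational method for quantitatively analyzing the effect of occlusal positioning splints on temporomandibular joint (TMJ) configuration. The method models a positioning splint as the physical realization of a predefined rigid transformation of the mandible, derived from multimodal data (CBCT, facial motion acquisition, and dental scans). Splints are designed and fabricated, their positioning accuracy is evaluated, errors are represented as transformations in the space of rigid motions and analyzed statistically, and joint space changes are then simulated, enabling indirect assessment of TMJ configuration.

Details

Motivation: Conventional assessment of TMJ configuration at different mandibular positions requires repeated imaging. The goal is quantitative assessment from a single anatomical model plus transformation data, reducing the need for repeated imaging.

Result: The study is a methodological demonstration supported by a clear step-by-step graphical presentation and does not provide clinical validation. Positioning accuracy is evaluated using repeated scans of plaster models, with transformation-based error analysis and surface distance metrics quantifying the differences between planned and achieved configurations.

Insight: The novelty lies in modeling the occlusal positioning splint as the physical realization of a rigid transformation, integrating multimodal data in a common coordinate system, and assessing joint changes indirectly through statistical analysis of error transformations and simulated propagation, which gives a computational framework for quantitative TMJ analysis.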

Abstract: A computational method for quantitative analysis of temporomandibular joint (TMJ) configuration using occlusal positioning splints is proposed and demonstrated. The method models a positioning splint as a physical realization of a predefined rigid transformation of the mandible, derived from multimodal data, including CBCT, facial motion acquisition, and dental scans integrated within a common coordinate system. Splints corresponding to selected mandibular positions are designed and fabricated, and their positioning accuracy is evaluated using repeated scans of plaster models. Discrepancies are represented as error transformations and analyzed statistically in the space of rigid motions. The estimated transformations are propagated to segmented TMJ structures, enabling simulation-based evaluation of joint space changes. Transformation-based error analysis and surface distance metrics are used to quantify differences between planned and achieved configurations. The method enables indirect assessment of TMJ configuration using a single anatomical model and transformation data, reducing the need for repeated imaging across multiple mandibular positions. This study is intended as a methodological demonstration, supported by a clear step-by-step graphical presentation, and does not aim to provide clinical validation.
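One way to write down the "error transformation" mentioned in the abstract is shown below. The notation is an assumption made for illustration (homogeneous 4x4 matrices, error defined by left-composition with the inverse of the planned pose); the paper may use a different convention.

```latex
% Assumed convention: homogeneous 4x4 matrices acting on column vectors.
T = \begin{pmatrix} R & t \\ 0^{\top} & 1 \end{pmatrix}, \qquad
E = T_{\mathrm{achieved}} \, T_{\mathrm{planned}}^{-1}

% Scalar summaries of the error transformation E = (R_E, t_E):
\theta_E = \arccos\!\left( \frac{\operatorname{tr}(R_E) - 1}{2} \right), \qquad
d_E = \lVert t_E \rVert_2

% Propagating the achieved pose to a point x on a segmented mandibular structure:
x' = R_{\mathrm{achieved}} \, x + t_{\mathrm{achieved}}
```

When the achieved pose equals the planned one, E is the identity, so both the rotation angle and the translation magnitude are zero. Note that the translation part of E depends on the choice of coordinate origin, which is one reason a common coordinate system matters for this kind of analysis.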


[40] HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation cs.CV

Bingzi Zhang, Kaisi Guan, Ruihua Song

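TL;DR: This paper proposes HuM-Eval, a coarse-to-fine framework for evaluating the quality of human motion in generated videos. The framework first uses a vision-language model for a coarse assessment of global video quality, then verifies anatomical correctness with 2D pose and evaluates motion stability with 3D human motion. Experiments show that HuM-Eval outperforms existing baselines in correlation with human judgments. The paper also introduces HuM-Bench, a benchmark of 1,000 diverse prompts, and uses it for a detailed evaluation of existing text-to-video models.

Details

Motivation: Existing video generation metrics mainly focus on global scene statistics and often overlook fine-grained human details, so they fail to align with human subjective preference. Closing this gap requires a human-centric evaluation framework that accurately assesses the quality of generated human motion videos.

Result: HuM-Eval achieves an average human correlation of 58.2% in experiments, outperforming state-of-the-art baselines.

Insight: The novelty is the coarse-to-fine strategy, which combines global quality assessment with fine-grained analysis of anatomical correctness and motion stability. Viewed objectively, by integrating 2D pose and 3D motion analysis the framework offers a more comprehensive video quality assessment that better matches human perception, and it establishes a dedicated benchmark, HuM-Bench, for the field.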

Abstract: Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.


[41] CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation cs.CV | cs.AI

Qianqian Chen, Anglin Liu, Jingyang Zhang, Yudong Zhang

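TL;DR: This paper proposes Concept-Reasoning Expansion (CoRE), a continual learning framework for brain lesion MRI segmentation that addresses the challenges of pathological and multimodal heterogeneity. By aligning visual features with structured concepts, the framework simulates clinical reasoning to guide interpretable expert routing and demand-based model growth, preventing parameter redundancy while maximizing knowledge reuse.

Details

Motivation: Existing continual learning paradigms applied to brain imaging suffer from capacity limits or redundant parameter growth, and conventional image-perception strategies struggle with pathological and multimodal heterogeneity. The aim is a model that adapts to evolving clinical tasks while keeping knowledge reusable.

Result: Extensive evaluation on 12 sequential brain lesion MRI segmentation tasks shows that CoRE achieves state-of-the-art performance and provides a high knowledge starting point for efficient future adaptation, along with strong few-shot transferability and clinical interpretability.

Insight: The novelty is a concept-reasoning mechanism that makes joint decisions by aligning image tokens with a hierarchical concept library. This simulates clinical reasoning and enables demand-based model growth and interpretable routing, so that continual learning incorporates clinical priors, avoids parameter bloat, and handles non-stationary clinical data streams better.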

Abstract: Accurate brain lesion segmentation in MRI is vital for effective clinical diagnosis and treatment planning. Due to high annotation costs and strict data privacy regulations, universal models require employing Continual Learning (CL) to adapt to evolving clinical tasks without losing previously acquired knowledge. However, existing CL paradigms often suffer from capacity limits or redundant parameter growth, and even advanced dynamic methods rely mostly on image-perception strategies that struggle to handle the substantial pathological and multimodal heterogeneity inherent in brain imaging. To address this issue, we propose the Concept-Reasoning Expansion (CoRE) framework, which establishes a joint decision-making mechanism by integrating visual features with structured concepts. Through the alignment of image tokens with a hierarchical concept library, CoRE simulates clinical reasoning to guide both interpretable expert routing and demand-based model growth. This collaborative process ensures model evolution is grounded in clinical priors, preventing redundant parameter expansion while maximizing knowledge reuse. Extensive evaluations across 12 sequential brain lesion MRI tasks demonstrate that CoRE achieves state-of-the-art performance and provides a high knowledge starting point for efficient future adaptation. Its superior few-shot transferability and clinical interpretability further validate its effectiveness in managing non-stationary clinical data streams. Our code will be released soon.


[42] Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV

Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi

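TL;DR: This paper addresses the challenges that graphical user interface (GUI) agents face in high-dynamic environments and proposes the DynamicGUIBench benchmark and the DynamicUI agent. DynamicGUIBench is an online GUI benchmark spanning ten applications and diverse interaction scenarios, focused on high-dynamic environments where the interface changes substantially between actions. The DynamicUI agent takes screen-recording videos of the interaction process as input and uses three components (a dynamic perceiver, a refinement strategy, and a reflection module) to capture key state information in dynamic interfaces and improve decision-making.

Details

Motivation: Existing GUI agents typically rely on a single screenshot for decision-making. In high-dynamic environments this yields a partially observable, or even unobservable, Markov decision process in which the key interface state needed for actions is not adequately captured. This paper sets out to explore and address that challenge systematically.

Result: Experiments on DynamicGUIBench show that DynamicUI significantly improves performance in high-dynamic GUI environments while remaining competitive on other public benchmarks.

Insight: The contributions include: 1) introducing DynamicGUIBench, a comprehensive online benchmark focused on high-dynamic GUI environments; 2) proposing the DynamicUI agent, which takes screen-recording videos as input, uses a dynamic perceiver to cluster video frames and select key frames that capture the dynamic context, and combines a refinement strategy (action-conditioned filtering) with a reflection module to optimize the decision trajectory, thereby handling the observability problems caused by dynamic interface changes.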

Abstract: Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state, including important information for actions, is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection module. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs action-conditioned filtering to refine thoughts and mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.


[43] Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing cs.CV

Runjie Wang, Weiling Chen, Tiesong Zhao, Chang Wen Chen

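TL;DR: For low-level image processing, this paper proposes a new evaluation criterion, semantic similarity, that measures how well semantic content is preserved after processing. Conventional image quality assessment (IQA) based on visual fidelity is no longer sufficient in the deep learning era, because a processed image can preserve perceptual quality while altering semantic content. The authors formalize the semantic similarity assessment task, propose a structured representation of image semantics based on semantic entities and their relations, and design the Triplet-based Semantic Similarity Score (T3S). T3S assesses semantic similarity by modeling foreground entities, background entities, and their relations, and on the COCO and SPA-Data datasets it outperforms existing fidelity metrics and representative semantic baselines.

Details

Motivation: With the rise of deep learning and generative models, images produced by low-level processing may preserve visual quality while altering semantic content, which makes conventional fidelity-based IQA insufficient for semantic-level assessment. A new evaluation method is therefore needed to measure specifically how well semantic content is preserved through processing.

Result: Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics (such as PSNR and SSIM) and representative semantic-level baselines in assessing semantic similarity, and that it better reflects progressive semantic changes under different degradations.

Insight: The novelty lies in formalizing semantic similarity as a standalone evaluation task for low-level image processing and proposing a structured semantic representation based on semantic entities and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling, giving low-level vision a new tool for semantic evaluation. Viewed objectively, the work highlights the importance of semantic assessment in modern low-level vision and offers an evaluation paradigm that future studies can build on.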

Abstract: Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize Semantic Similarity as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on semantic entities and their relations, and discuss the desired properties and constraints of a valid semantic similarity index. Based on this formulation, we propose Triplet-based Semantic Similarity Score (T3S), which models image semantics through foreground entities, background entities, and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling. Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics and representative semantic-level baselines, while better reflecting progressive semantic changes under diverse degradations. These results highlight the importance of semantic assessment in modern low-level vision.


[44] A Systematic Post-Train Framework for Video Generation cs.CV

Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li

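TL;DR: This paper proposes a systematic post-training framework for video generation that aims to close the gap between the pretraining performance of large-scale video diffusion models and real-world deployment requirements. Through four coordinated stages (supervised fine-tuning, reinforcement learning from human feedback, prompt enhancement, and inference optimization), the framework improves visual quality, temporal coherence, and instruction following.

Details

Motivation: Large-scale video diffusion models generate high-resolution, semantically rich content well, but deployment still faces critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference cost, leaving a significant gap between pretraining performance and real-world requirements.

Result: Extensive experiments show that the unified pipeline mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.

Insight: The novelty is a systematic post-training framework comprising SFT, RLHF (using a GRPO method tailored to video diffusion), prompt enhancement, and inference optimization, designed to improve model performance jointly. It offers a practical blueprint for building post-training pipelines that are stable, adaptable, and effective in real-world deployment.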

Abstract: While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.


[45] Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy cs.CV

Oliver Bause, Jörg Gammerdinger, Julia Werner

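TL;DR: This paper proposes an energy-efficient image compression and frame rate adaptation method for video capsule endoscopy (VCE). An image compression pipeline substantially reduces the amount of transmitted data, and characteristics of the compression process itself are used to identify frames of low diagnostic value, caused mainly by bubbles. The image acquisition and transmission frame rate is lowered dynamically during these phases, which significantly reduces system energy consumption while preserving diagnostic image quality and sensitivity to anomalies.

Details

Motivation: The size constraints of VCE capsules result in short battery life, while capturing and transmitting high-quality images consumes considerable energy, which limits clinical use. This paper aims to resolve that conflict by reducing data transmission to extend operating time while maintaining diagnostic quality.

Result: The method is evaluated on a RISC-V platform using the Kvasir-Capsule and Galar datasets. The compression method achieves a compression ratio of 5.748 (an 82.6% data reduction) at a peak signal-to-noise ratio (PSNR) of 40.3 dB, with negligible loss of visual quality. Mean system energy consumption drops by 20.58%. In addition, the proposed bubble-aware frame rate adaptation strategy reduces energy consumption by up to 40%.

Insight: The main novelty is coupling the image compression process with the identification of low-value frames (such as bubble-dominated frames) without extra image analysis overhead, and adjusting the frame rate dynamically on that basis. This offers an efficient end-to-end energy-saving design for resource-constrained embedded medical devices, optimizing the acquisition and transmission chain according to the value of the content.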

Abstract: Video Capsule Endoscopy (VCE) is a promising method for improving the medical examination of the small intestine in the gastrointestinal tract. A key challenge is the limited size of the capsules, which results in a short battery lifetime that conflicts with the high energy consumption of image capturing and transmission to an on-body device. Thus, we propose an image compression pipeline that substantially reduces the transmitted data while preserving diagnostic image quality. Furthermore, we exploit characteristics of the compression process to identify frames with low diagnostic value mainly caused by bubbles, without requiring additional image analysis. For low-visibility frames, a dynamic bubble-aware frame rate adaptation strategy reduces image acquisition and transmission during these phases while preserving sensitivity to potential anomalies. The proposed compression and frame rate adaptation are evaluated on a RISC-V platform using the Kvasir-Capsule and Galar datasets. The compression method achieves a compression ratio of 5.748 (an 82.6% reduction in data) at a peak signal-to-noise ratio of 40.3 dB, indicating negligible loss of visual quality. The compression reduced the mean energy consumption of the whole system by 20.58%. Additionally, the proposed bubble-aware frame rate adaptation reduced the energy consumption by up to 40%. These results demonstrate the potential of our method to increase the applicability of VCE.
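The two compression figures quoted in the abstract describe the same quantity: a compression ratio r corresponds to a data reduction of 1 - 1/r. The short check below reproduces the 82.6% figure from the ratio 5.748; the function name is a hypothetical helper.

```python
def data_reduction(compression_ratio: float) -> float:
    """Fraction of data saved when the original size is `compression_ratio` times the compressed size."""
    return 1.0 - 1.0 / compression_ratio


saving = data_reduction(5.748)
print(f"{saving:.1%}")  # 82.6%
```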


[46] DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing cs.CV | cs.AI

Hanqing Yang, Qiang Zhou, Yongchao Du, Sashuai Zhou, Zhibin Wang

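TL;DR: This paper proposes DDA-Thinker, a framework centered on a planning module (Thinker) for image editing tasks that require complex reasoning. The framework decouples the planning module from a fixed generative model (Editor) and optimizes it with dual-atomic reinforcement learning, decomposing feedback through verifiable checklists to improve reasoning-driven image editing.

Details

Motivation: Existing image editing models achieve good visual fidelity but struggle with tasks that require complex reasoning. This paper aims to investigate and strengthen reasoning-grounded planning for image editing.

Result: Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, show that the method substantially improves overall performance and enables a community model to achieve results competitive with strong proprietary models.

Insight: The novelty is a decoupled, Thinker-centric paradigm that allows the planning module to be analyzed and optimized independently, together with a dual-atomic reinforcement learning framework in which a cognitive-atomic reward and a visual-atomic reward assess plan quality and final image quality, respectively. Checklist synthesis is additionally grounded in a rational reference description, which improves training.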

Abstract: Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker’s executable plan, which serves as the actionable outcome of the Thinker’s reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.


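[47] DualGeo: A Dual-View Framework for Worldwide Image Geo-localization cs.CV

Junchao Cui, Wenqi Shi, Shaoyong Du, Hang He, Xuanzi Ma

TL;DR: DualGeo is a two-stage framework for worldwide image geo-localization. The first stage fuses image and semantic segmentation features with bidirectional cross-attention to establish a geo-representational foundation, then aligns the fused features with GPS coordinates through dual-view contrastive learning to build a global retrieval database. The second stage re-ranks the retrieved candidate locations using geographic clustering and passes them to large multimodal models for final coordinate prediction.

Details

Motivation: Existing methods rely on visual features that are sensitive to environmental variations (such as lighting, season, and weather) and lack effective post-processing to filter outlier candidates, which limits localization accuracy. DualGeo aims to address these limitations.

Result: Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (<1 km) and city-level (<25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively.

Insight: The contributions are: 1) fusing multimodal features (image and semantic segmentation) through bidirectional cross-attention to strengthen the representation; 2) aligning features with dual-view contrastive learning; 3) adding geographic-clustering re-ranking and large multimodal models as post-processing refinement. Together these give a systematic way to handle environmental variation and candidate filtering.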

Abstract: Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (<1 km) and city-level (<25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available at https://github.com/CJ310177/DualGeo.
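The street-level (<1 km) and city-level (<25 km) figures quoted above are threshold accuracies: the share of test images whose predicted coordinates fall within a given great-circle distance of the ground truth. A minimal sketch of that metric follows, assuming coordinates are (latitude, longitude) pairs in degrees; the function names `haversine_km` and `accuracy_at` are illustrative and are not taken from the DualGeo repository.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at(preds, truths, threshold_km):
    """Fraction of predictions within threshold_km of the matching ground truth."""
    hits = sum(haversine_km(*p, *t) < threshold_km for p, t in zip(preds, truths))
    return hits / len(truths)


# Hypothetical predictions: one a few dozen metres off, one in the wrong city.
preds = [(48.8566, 2.3522), (51.5074, -0.1278)]
truths = [(48.8570, 2.3530), (48.8566, 2.3522)]
street_level = accuracy_at(preds, truths, 1.0)
city_level = accuracy_at(preds, truths, 25.0)
```

Reporting the same predictions at several thresholds, as the abstract does, separates fine-grained precision from coarse regional correctness.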


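[48] Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models cs.CV

Dewei Bai, Hongxiang Peng, Yunyun Zeng, Ziyu Zhang, Hong Qu

TL;DR: This paper proposes Vision SmolMamba, an energy-efficient spiking state-space model for vision tasks. Its core is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance from spike activation strength and first-spike latency and progressively removes redundant tokens to maintain sparsity. The model incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling.

Details

Motivation: Spiking Transformers show potential for long-range visual modeling through spike-driven self-attention, but their quadratic token interactions are fundamentally mismatched with the sparse, event-driven nature of spiking neural computation. This paper addresses that limitation with an energy-efficient architecture better aligned with spiking computation.

Result: Extensive experiments on static and event-based benchmarks (ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture) show that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. Compared with prior spiking Transformer baselines and a Spiking Mamba variant, it reduces estimated energy cost by at least 1.5x while maintaining competitive or improved accuracy.

Insight: The main contribution is combining spike-guided token sparsity with state-space modeling, offering a scalable and energy-efficient paradigm for spiking vision systems. Specifically, SST-TP uses the spike signals themselves (activation strength and first-spike latency) to guide dynamic token pruning, which fits the bio-inspired character of spiking computation. More broadly, merging the efficient long-range modeling of state-space models such as Mamba with the event-driven, sparse computation of spiking neural networks is a promising research direction.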

Abstract: Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
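The abstract names the two signals SST-TP uses (spike activation strength and first-spike latency) but not how they are combined. The sketch below is one plausible reading, not the paper's formula: it assumes each token is a binary spike train, scores it by a weighted sum of firing rate and "earliness" of the first spike, and keeps the top-scoring fraction. The names `token_scores`, `prune_tokens`, `latency_weight`, and `keep_ratio` are hypothetical.

```python
def first_spike_latency(train):
    """Index of the first spike, or len(train) if the token never fires."""
    for step, spike in enumerate(train):
        if spike:
            return step
    return len(train)


def token_scores(spike_trains, latency_weight=0.5):
    """Assumed importance score: mix of firing rate and earliness of the first spike."""
    scores = []
    for train in spike_trains:
        steps = len(train)
        strength = sum(train) / steps                            # activation strength
        earliness = 1.0 - first_spike_latency(train) / steps     # earlier spike -> higher
        scores.append((1.0 - latency_weight) * strength + latency_weight * earliness)
    return scores


def prune_tokens(spike_trains, keep_ratio=0.5, latency_weight=0.5):
    """Return indices of the kept tokens, in their original order."""
    scores = token_scores(spike_trains, latency_weight)
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Four tokens observed over four time steps (hypothetical data).
trains = [[1, 1, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
kept = prune_tokens(trains, keep_ratio=0.5)
```

Applying such a pruner at successive stages with a shrinking `keep_ratio` would give the progressive token removal the abstract describes.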


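[49] Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models cs.CV

Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue, Zeyu Liu

TL;DR: This paper proposes Refinement via Regeneration (RvR), a new framework for improving the image refinement ability of unified multimodal models (UMMs) on text-to-image (T2I) tasks. It reformulates refinement as conditional image regeneration rather than the conventional refinement-via-editing (RvE), allowing a larger modification space and more complete semantic alignment.

Details

Motivation: Current UMM-based refinement methods mainly follow the refinement-via-editing (RvE) paradigm, which relies on coarse editing instructions and enforces pixel-level content preservation, leading to incomplete refinement and a restricted modification space. This paper aims to remove these limitations to improve refinement.

Result: RvR improves performance on several benchmarks: Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

Insight: The core idea is to move refinement from an editing paradigm to a conditional regeneration paradigm. Image regeneration is guided by the target prompt and the semantic tokens of the initial image, which frees the model from pixel-level preservation, enlarges the effective modification space, and yields more thorough semantic alignment. This offers a new approach to iterative refinement in UMMs.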

Abstract: Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.


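[50] Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models cs.CV | cs.AI

Chengsheng Zhang, Chenghao Sun, Xinyan Jiang, Wei Li, Xinmei Tian

TL;DR: This paper proposes Prefill-Time Intervention (PTI), a new method for mitigating hallucination in large vision-language models (LVLMs). It applies a one-time, modality-aware intervention to the model's Key-Value cache during the prefill stage, correcting hallucination-prone representations at their source and thereby avoiding the autoregressive accumulation of errors during decoding.

Details

Motivation: Existing steering-vector methods intervene mainly at the decoding stage to reduce hallucinations, but they inadvertently amplify the severity of residual hallucinations because errors accumulate during autoregressive decoding. This paper addresses the problem by intervening at the prefill stage, before error accumulation occurs.

Result: Extensive experiments show that PTI performs strongly in mitigating hallucinations and generalizes across decoding strategies, LVLMs, and benchmarks (such as MME, MM-Vet, and LLaVA-Bench). PTI is also orthogonal to existing decoding-stage methods and can be integrated plug-and-play for further gains.

Insight: The main contributions are moving the intervention from the decoding stage to the prefill stage and a modality-aware, decoupled steering scheme that steers keys and values separately, correcting hallucination-prone representations at their source. This gives a new and efficient intervention paradigm for LVLM hallucination that complements existing methods.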

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI’s significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.


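[51] Exploring Remote Photoplethysmography for Neonatal Pain Detection from Facial Videos cs.CV | eess.IV

Ashutosh Dhamaniya, Anup Kumar Gupta, Trishna Saikia, Puneet Gupta

TL;DR: This paper proposes a non-contact method for neonatal pain detection based on remote photoplethysmography (rPPG). It extracts pulse signals from facial videos and combines them with audio features to assess neonatal pain objectively.

Details

Motivation: Neonatal pain that is not assessed in time can lead to adverse outcomes such as delayed development. Traditional contact-based physiological monitoring is unsuitable for long-term use and may increase the risk of spreading disease, so a non-contact, objective, and reliable pain assessment method is needed.

Result: Experiments show that rPPG signals provide useful information for neonatal pain detection, that signals extracted from the blue channel outperform those from other colour channels, and that combining rPPG with audio features gives better results than either modality alone.

Insight: The contributions include a quality parameter that selects the ROI signals least affected by skin deformation and the use of signal-to-noise ratio as a fitness parameter to extract the rPPG signal least affected by noise. Together they enable effective non-contact extraction of physiological signals, and multimodal fusion further improves detection performance.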

Abstract: Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse signals in a non-contact manner and employ them for neonatal pain detection. The temporal signals acquired from regions-of-interest (ROIs) affected by skin deformations may exhibit lower quality and provide erroneous rPPG signals. Therefore, we incorporated a quality parameter to select the temporal signals obtained from ROIs that are least affected by skin deformations. Further, we employed signal-to-noise ratio as a fitness parameter to extract the rPPG signal corresponding to the clip that is least affected by noise. Experimental findings demonstrate that the rPPG signals provide useful information for neonatal pain detection, and signals extracted from the blue colour channel outperform those extracted from other colour channels. We also show that combining rPPG and audio features provides better results than individual modalities.


Ran Gu, Benjamin Hou, Mélanie Hébert, Asmita Indurkar, Yifan Yang

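TL;DR: This paper presents OcularChat, a multimodal large language model fine-tuned from Qwen2.5-VL for diagnosing age-related macular degeneration (AMD) from color fundus photographs and for interactive dialogue. The model is trained on a dataset of simulated patient-physician dialogues, performs strongly on AMD severity classification, and can provide diagnostic reasoning and clinical explanations.

Details

Motivation: Existing deep learning models for retinal disease detection mostly produce static predictions without clinical reasoning or interactive explanation. This study uses a multimodal large language model to combine diagnostic prediction with clinically meaningful dialogue in support of clinical decision-making and patient counseling.

Result: On AREDS, OcularChat reaches accuracies of 0.954, 0.849, and 0.678 on the three tasks of diagnosing advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2 it is likewise the best-performing method on all tasks. In a subjective evaluation by three independent ophthalmologists using a 5-point clinical grading rubric, OcularChat's mean scores exceed those of a strong baseline model on every criterion.

Insight: The contribution lies in applying an MLLM to ophthalmic diagnosis and fine-tuning it on large-scale simulated patient-physician dialogues, achieving accurate AMD classification together with interpretable, interactive clinical reasoning. This suggests a path toward image-based AI diagnostic systems that are accurate, interpretable, and clinically useful.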

Abstract: Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.


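[53] Improving Diversity in Black-box Few-shot Knowledge Distillation cs.CV | cs.LG

Tri-Nhan Vo, Dang Nguyen, Kien Do, Sunil Gupta

TL;DR: This paper proposes a new black-box few-shot knowledge distillation method. It modifies the training strategy of generative adversarial networks by adaptively selecting high-confidence images and introducing them into adversarial learning on the fly, which increases the diversity of the distillation set and significantly improves student performance.

Details

Motivation: Conventional knowledge distillation requires large amounts of training data and internal access to the teacher network, and existing black-box few-shot methods lack an active strategy for promoting diversity when generating synthetic images. This paper targets the resulting lack of data diversity in black-box few-shot knowledge distillation.

Result: Extensive experiments on seven image datasets show that the method achieves state-of-the-art performance among few-shot knowledge distillation methods.

Insight: The contribution is a training scheme that dynamically and adaptively selects high-confidence images and incorporates them into adversarial learning, expanding and diversifying the distillation set and thereby raising student accuracy. More broadly, combining teacher supervision with adversarial learning offers a new optimization route for knowledge distillation in data-limited settings.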

Abstract: Knowledge distillation (KD) is a well-known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have given rise to a more practical setting known as black-box few-shot KD, where the student is trained with few images and a black-box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity, a crucial factor for student learning. To address these problems, we propose a novel training scheme for generative adversarial networks, where we adaptively select high-confidence images under the teacher’s supervision and introduce them to the adversarial learning on-the-fly. Our approach helps expand and improve the diversity of the distillation set, significantly boosting student accuracy. Through extensive experiments, we achieve state-of-the-art results among few-shot KD methods on seven image datasets. The code is available at https://github.com/votrinhan88/divbfkd.


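[54] Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning cs.CV

Yashwant Pravinrao Bangde, Debaditya Roy

TL;DR: This paper proposes Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), a decoding framework that addresses the weak alignment between generated content and visual evidence when vision-language models follow instructions. It maintains two token probability distributions in parallel, one instruction-driven and one evidence-driven, and fuses them adaptively through a contrastive gate based on symmetric KL divergence to balance linguistic informativeness and visual faithfulness.

Details

Motivation: Existing vision-language models perform well on instruction following and open-ended vision-language reasoning, but they often produce fluent outputs that are only weakly grounded in visual evidence. Instruction prompting amplifies language priors and makes this worse when the visual signal is uncertain or ambiguous.

Result: Evaluation on POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench shows that IECD2 consistently improves task accuracy and reasoning performance and, compared with state-of-the-art decoding methods, substantially reduces hallucination across all evaluation metrics.

Insight: The contribution is the dual-stream decoding mechanism and the symmetric-KL-based contrastive gate, which explicitly balance linguistic information against visual evidence. Tokens driven only by language priors and lacking visual support are suppressed, improving the visual faithfulness of the generated content.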

Abstract: Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), maintains two parallel probability distributions of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrast-based gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD2 on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering, including POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench. IECD2 demonstrates consistent improvements in task accuracy and reasoning performance, alongside a substantial reduction in hallucination across all evaluation metrics compared to state-of-the-art decoding approaches.
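The abstract describes the gate's behaviour (suppress tokens the instruction stream favours but the evidence stream does not, leave the distribution alone when the streams agree) without giving its formula. The sketch below is one way to obtain that behaviour and should be read as an assumption, not as IECD2 itself: the gate is `1 - exp(-SKL)`, where SKL is the symmetric KL divergence between the two streams, and the fused distribution is a gate-weighted geometric mixture. The function names `sym_kl` and `fuse_streams` are hypothetical.

```python
import math


def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence 0.5 * (KL(p||q) + KL(q||p)) between two distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def fuse_streams(p_instr, p_evid, eps=1e-12):
    """Assumed gate: lean on the evidence stream in proportion to stream disagreement."""
    gate = 1.0 - math.exp(-sym_kl(p_instr, p_evid))  # 0 when the streams agree
    scores = [
        (1.0 - gate) * math.log(pi + eps) + gate * math.log(qi + eps)
        for pi, qi in zip(p_instr, p_evid)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]    # stable softmax over the mixed log-probs
    total = sum(weights)
    return [w / total for w in weights], gate


# Token 0 is favoured by the language prior; token 1 is supported by the image.
instruction_stream = [0.8, 0.1, 0.1]
evidence_stream = [0.1, 0.8, 0.1]
fused, gate = fuse_streams(instruction_stream, evidence_stream)
```

With these toy inputs the gate opens to roughly 0.77 and the fused distribution shifts most of the mass from token 0 to token 1; with identical inputs the gate stays at 0 and the instruction stream is returned unchanged.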


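[55] Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV | cs.SD

Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi

TL;DR: This paper proposes Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. It uses a two-stage training strategy: uni-modal generators are trained first and then coupled into a unified model for joint training. Integrating few-step and multi-step generation modes in one weight-shared model enables self-distillation and better training-inference consistency, so efficient generation is achieved without an additional bidirectional teacher model.

Details

Motivation: The paper addresses two challenges, joint audio-video modeling and fast autoregressive generation. It aims to train a native fast causal audio-video model directly, avoiding the conventional streaming distillation pipeline that first trains a bidirectional model and then converts it into a causal generator through multiple distillation stages.

Result: Experiments show that Mutual Forcing, using only 4 to 8 sampling steps, matches or surpasses strong baselines that need around 50 steps, with clear advantages in both efficiency and quality.

Insight: The contribution is that few-step and multi-step generation modes inside a single weight-shared model reinforce each other, giving self-distillation and improved training-inference consistency without an extra teacher model. This reduces training overhead and lets the model learn directly from real paired data.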

Abstract: In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.


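[56] SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV | cs.AI

Hector G. Rodriguez, Marcus Rohrbach

TL;DR: This paper proposes SIEVES, which improves the selective prediction performance of multimodal large language models (MLLMs) in real-world out-of-distribution (OOD) scenarios by scoring the quality of the visual evidence behind their answers. It substantially raises the system's answer coverage while respecting a user-defined risk level.

Details

Motivation: Although MLLMs keep getting stronger on vision-language tasks and traditional benchmarks are approaching saturation, reliable deployment requires meeting very low error tolerances in OOD scenarios. Selective prediction aims to maximize the share of inputs the system answers (coverage) at a given risk level.

Result: On several challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), SIEVES improves coverage by up to three times over baselines without visual grounding. It generalizes across all five tested OOD datasets and three reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro) without benchmark- or reasoner-specific training or adaptation.

Insight: The core idea is to require the reasoner to produce localized visual evidence while answering and to design a selector that explicitly learns to assess the quality of that localization. This lets the selector generalize and even transfer to proprietary reasoners whose weights and logits are inaccessible (such as o3 and Gemini-3-Pro), with coverage gains beyond what accuracy improvements alone would explain.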

Abstract: Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
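Coverage at a fixed risk level, the quantity this abstract optimizes, can be computed from per-answer confidence scores and correctness labels. The sketch below shows the standard threshold-based computation in general terms; it is not the SIEVES selector, and the function name `coverage_at_risk` and the toy data are assumptions. It sorts answers by confidence, sweeps the abstention threshold, and reports the largest answered share whose error rate stays within the target (ties in confidence are ignored for simplicity).

```python
def coverage_at_risk(confidences, correct, max_risk):
    """Largest fraction of inputs answerable while keeping error rate <= max_risk.

    Answers are accepted in decreasing order of confidence; every prefix of that
    ordering corresponds to one possible abstention threshold.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    best_coverage = 0.0
    errors = 0
    for answered, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        if errors / answered <= max_risk:        # risk = error rate among answered inputs
            best_coverage = answered / len(order)
    return best_coverage


# Hypothetical selector scores and correctness labels for five questions.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_correct = [True, True, False, True, False]
strict = coverage_at_risk(scores, is_correct, max_risk=0.0)
relaxed = coverage_at_risk(scores, is_correct, max_risk=0.25)
```

A better selector ranks wrong answers lower, which lengthens the low-risk prefix and raises coverage without changing the reasoner's accuracy; that is how coverage can improve beyond what accuracy alone explains.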


cs.CY

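[57] Three Models of RLHF Annotation: Extension, Evidence, and Authority cs.CY | cs.AI | cs.CL

Steve Coyne

TL;DR: This paper distinguishes three models of the normative role of human judgment in RLHF annotation (extension, evidence, and authority), argues that these models affect how RLHF pipelines should be designed, and recommends tailoring pipelines to each annotation dimension rather than using a single unified pipeline.

Details

Motivation: The normative role of human annotators' judgments in RLHF is rarely made explicit. The paper examines the implicit assumptions behind that role and their consequences for annotation pipeline design.

Result: Through a survey of the literature, the paper identifies how existing RLHF work implicitly draws on the three models and describes the failure modes that arise from conflating them.

Insight: The paper introduces three conceptual models of RLHF annotation and recommends decomposing annotation into separate dimensions, each handled by a pipeline built around the most appropriate model. This gives a theoretical framework for the normative foundations and practical design of RLHF.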

Abstract: Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers’ own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.


eess.AS

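[58] Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models eess.AS | cs.AI | cs.CL | cs.LG | cs.SD

Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

TL;DR: This paper presents the first systematic empirical study of uncertainty estimation in audio-aware large language models (ALLMs). It evaluates five representative methods across multiple models and evaluation settings and finds that semantic-level and verification-based methods outperform token-level baselines on general audio reasoning, whereas on trustworthiness-oriented tasks the effectiveness of a method depends more on the model and benchmark.

Details

Motivation: ALLMs perform well on audio understanding and reasoning but often produce hallucinated or overconfident outputs. Existing work on uncertainty estimation focuses on text-only LLMs, and the challenges specific to ALLMs, such as perceptual ambiguity and cross-modal grounding introduced by audio-conditioned generation, remain largely unexplored.

Result: On general audio reasoning benchmarks, semantic-level and verification-based methods (semantic entropy, discrete semantic entropy, and P(True)) consistently outperform token-level baselines (predictive entropy and length-normalized entropy). On trustworthiness-oriented benchmarks such as hallucination detection and unanswerable question answering, the relative effectiveness of the methods becomes markedly model- and benchmark-dependent.

Insight: The paper is the first systematic empirical study of uncertainty estimation for ALLMs. It shows that method performance differs between task types (general reasoning versus trustworthiness evaluation), highlights the need for uncertainty estimation methods designed around audio-specific challenges such as perceptual ambiguity, and lays groundwork for downstream uses such as uncertainty-based adaptive inference.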

Abstract: Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
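Three of the five benchmarked estimators can be written down compactly from their usual definitions in the text-only LLM literature. The sketch below assumes the model has been sampled several times for one query, that each sample comes with per-token log-probabilities, and that samples have already been grouped into meaning clusters by some external equivalence check; the clustering step and the function names are assumptions, and the study's exact implementation may differ.

```python
import math
from collections import Counter


def predictive_entropy(sample_logprobs):
    """Monte Carlo estimate: mean negative sequence log-probability over samples."""
    return -sum(sum(tokens) for tokens in sample_logprobs) / len(sample_logprobs)


def length_normalized_entropy(sample_logprobs):
    """Same estimate, with each sequence log-probability divided by its token count."""
    return -sum(sum(tokens) / len(tokens) for tokens in sample_logprobs) / len(sample_logprobs)


def discrete_semantic_entropy(cluster_ids):
    """Entropy of the empirical distribution over meaning clusters of the samples."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Two sampled answers with hypothetical per-token log-probabilities.
samples = [[-0.1, -0.2], [-0.3, -0.3, -0.3]]
token_level = predictive_entropy(samples)
normalized = length_normalized_entropy(samples)
# Four samples whose meanings fall into two equally frequent clusters.
semantic = discrete_semantic_entropy(["a", "a", "b", "b"])
```

The token-level estimators penalize long or lexically varied answers even when their meaning is the same, whereas the cluster-based estimator only grows when the sampled answers disagree in meaning.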


eess.IV

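[59] CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT, Colonoscopy, and Histology Images eess.IV | cs.CV

Daniel Lao

TL;DR: CRC-SAM is a unified SAM-based framework for colorectal cancer segmentation in colonoscopy, CT, and histopathology images. Built on MedSAM, it adds low-rank adaptation (LoRA) layers to a frozen encoder, enabling efficient multi-modal domain transfer and good results on underrepresented modalities with very few trainable parameters.

Details

Motivation: Existing methods are mostly single-modality and cannot provide consistent, modality-agnostic segmentation across a clinical workflow that involves several imaging modalities. The paper aims to provide a unified cross-modal segmentation framework for colorectal cancer analysis.

Result: Experiments on three benchmark datasets (MSD-Colon, CVC-ClinicDB, and EBHI-Seg) show that CRC-SAM outperforms state-of-the-art baselines across modalities.

Insight: The main contribution is integrating lightweight LoRA adaptation into the frozen encoder of a foundation model (MedSAM), enabling efficient multi-modal domain transfer and knowledge sharing. It provides a parameter-efficient cross-modal adaptation scheme for foundation-model-based medical image analysis.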

Abstract: We present CRC-SAM, a unified framework for colorectal cancer segmentation across colonoscopy, CT, and histopathology images. Unlike prior single-modality methods, CRC-SAM provides consistent, modality-agnostic segmentation throughout the clinical workflow. Built on MedSAM, it incorporates low-rank adaptation (LoRA) layers into a frozen encoder, enabling efficient domain transfer to underrepresented modalities with minimal trainable parameters. Experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrate superior performance across modalities, outperforming state-of-the-art baselines and highlighting the effectiveness of lightweight LoRA adaptation for foundation-model-based colorectal cancer analysis.
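Low-rank adaptation keeps a pretrained weight matrix W frozen and learns only a rank-r update, so a layer computes W x + (alpha / r) B A x. The sketch below illustrates that general mechanism on a single linear layer using plain Python lists; it is not CRC-SAM's code, the class name `LoRALinear` is hypothetical, and A is filled with a constant rather than the usual random initialization so the example is deterministic.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, init_a=0.01):
        self.weight = weight                                  # frozen, d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        self.a = [[init_a] * d_in for _ in range(rank)]       # trainable, rank x d_in
        self.b = [[0.0] * rank for _ in range(d_out)]         # trainable, zero-initialized

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))            # B (A x), never forms B A
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_parameters(self):
        return self.rank * (len(self.weight) + len(self.weight[0]))


layer = LoRALinear([[1, 2], [3, 4], [5, 6]], rank=1)
before = layer.forward([1, 1])      # B is zero, so this equals the frozen layer's output
layer.b = [[1.0], [0.0], [0.0]]     # pretend a training step updated B
after = layer.forward([1, 1])
```

Because B starts at zero, the adapted model initially reproduces the frozen model exactly, and the number of trainable values grows with r(d_in + d_out) instead of d_in * d_out, which is what makes adaptation to a data-poor modality cheap.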


quant-ph

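[60] QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding quant-ph | cs.CV

Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov

TL;DR: This paper introduces QCalEval, the first benchmark for evaluating how well vision-language models (VLMs) understand quantum computing calibration plots. It contains 243 samples across 87 scenario types from 22 experiment families, covers superconducting qubits and neutral atoms, and evaluates six question types in zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3; many open-weight models degrade under multi-image in-context learning, while frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that fine-tuning improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case, the authors release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B with a zero-shot average score of 74.7.

Details

Motivation: Quantum computing calibration depends on interpreting experimental data, and calibration plots are the most universal human-readable representation for this task, yet there has been no systematic evaluation of how well VLMs interpret them.

Result: On QCalEval, the best general-purpose zero-shot model reaches a mean score of 72.3. Many open-weight models degrade under multi-image in-context learning, while frontier closed models improve substantially. Supervised fine-tuning (SFT) at the 9-billion-parameter scale improves zero-shot performance but does not narrow the multimodal in-context learning gap. The released reference model, NVIDIA Ising Calibration 1, reaches a zero-shot average of 74.7.

Insight: The main contribution is QCalEval itself, the first VLM benchmark dedicated to quantum calibration plot understanding, which fills the gap in systematic evaluation for this area. The study also shows that current VLMs, open-weight models in particular, are limited in in-context learning on specialized scientific visualizations such as quantum calibration plots, and it identifies multimodal in-context learning as a key challenge. The released reference model gives later work a reproducible baseline.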

Abstract: Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches a zero-shot average score of 74.7.


cs.IR

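[61] GeoSearch: Augmenting Worldwide Geolocalization with Web-Scale Reverse Image Search and Image Matching cs.IR | cs.CV

Tung-Duong Le-Duc, Hoang-Quoc Nguyen-Son, Minh-Son Dao

TL;DR: GeoSearch is an open-world geolocation framework that strengthens worldwide image geolocalization by integrating web-scale reverse image search into a Retrieval-Augmented Generation (RAG) pipeline. It augments the prompts of a Large Multimodal Model (LMM) with textual evidence extracted from web pages and coordinates retrieved from a database, and it reduces noise with a two-layer filter made of image matching followed by confidence-based gating.

Details

Motivation: Existing generative approaches based on RAG and LMMs perform poorly on scenes absent from the reference set. GeoSearch targets this weakness and the broader challenge of global visual diversity.

Result: On the standard benchmarks Im2GPS3k and YFCC4k, GeoSearch shows superior performance under leakage-aware evaluation.

Insight: The contribution is bringing web-scale reverse image search into the RAG pipeline and designing a two-layer filtering mechanism that exploits web information while suppressing noise. This offers open-world geolocalization both a richer data source and a way to denoise it.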

Abstract: Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching, followed by confidence-based gating. Experiments on standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.


cs.GR

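[62] Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation cs.GR | cs.AI | cs.CL

Lanshan He, Haozhou Pang, Qi Gan, Xin Shen, Ziwei Zhang

TL;DR: This paper presents Cutscene Agent, an LLM-based agent framework for automated end-to-end generation of 3D game cutscenes. The framework comprises a toolkit integrated bidirectionally with the game engine through the Model Context Protocol (MCP), a multi-agent system in which a director agent coordinates specialist subagents, and CutsceneBench, a hierarchical benchmark for evaluation.

Details

Motivation: Traditional cutscene production is complex, time-consuming, and requires collaboration among experts from several disciplines. The paper aims to automate the process with LLM agents, lowering the barrier and cost of production.

Result: A range of LLMs is evaluated on the proposed CutsceneBench, and their performance on this long-horizon, multi-step orchestration task is analyzed.

Insight: The main contributions are: 1) bidirectional, closed-loop integration between LLM agents and the game engine via MCP, so agents can both invoke engine operations and observe scene state in real time; 2) a hierarchical multi-agent architecture (a director coordinating specialist subagents) with a visual feedback loop that mirrors a real production workflow; 3) the first benchmark for long-sequence tool-call orchestration under strict constraints, filling a gap in existing benchmarks.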

Abstract: Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of collaborative effort from multidisciplinary teams to produce minutes of polished content. In this work, we present Cutscene Agent, an LLM agent framework for automated end-to-end cutscene generation. The framework makes three contributions: (1) a Cutscene Toolkit built on the Model Context Protocol (MCP) that establishes bidirectional integration between LLM agents and the game engine: agents not only invoke engine operations but continuously observe real-time scene state, enabling closed-loop generation of editable engine-native cinematic assets; (2) a multi-agent system where a director agent orchestrates specialist subagents for animation, cinematography, and sound design, augmented by a visual reasoning feedback loop for perception-driven refinement; and (3) CutsceneBench, a hierarchical evaluation benchmark for cutscene generation. Unlike typical tool-use benchmarks that evaluate short, isolated function calls, cutscene generation requires long-horizon, multi-step orchestration of dozens of interdependent tool invocations with strict ordering constraints, a capability dimension that existing benchmarks do not cover. We evaluate a range of LLMs on CutsceneBench and analyze their performance across this challenging task.


cs.CR [Back]

[63] The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive cs.CR | cs.CLPDF

Alex Bogdan, Adrian de Valois-Franklin

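TL;DR: This paper finds that the token rank-frequency distributions of frontier large language model (LLM) outputs converge strikingly to the same two-parameter Mandelbrot distribution. This statistical regularity enables a CPU-only real-time scoring primitive that takes just 2.6 microseconds per token. The primitive can be used for model fingerprinting and black-box output assessment, serving as a first-pass triage layer in compound evaluation stacks.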

Details

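Motivation: Existing sampling-based detectors have high latency, and there is no method for verifying model provenance without cryptographic watermarks or access to model internals. The paper aims to build an efficient, lightweight real-time verification primitive from a universal statistical regularity found in LLM outputs.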

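Result: Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, 34 of 36 model-by-domain fits exceed a goodness of fit of R² = 0.94, and AIC favors the Mandelbrot distribution over the Zipf distribution in 35 of 36. The fitted parameters are cleanly separable between models, providing a basis for model fingerprinting. Pilot results on the FRANK, TruthfulQA, and HaluEval benchmarks indicate that the scoring primitive is effective at detecting lexical anomalies and unsupported entities, but is structurally limited in detecting reasoning errors expressed in domain-appropriate vocabulary.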

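Insight: The core contribution is the discovery that the output token distributions of different frontier LLMs all follow a Mandelbrot distribution, and the construction of an ultra-low-latency real-time scoring primitive on that basis. This gives black-box model fingerprinting and preliminary output quality assessment, neither of which needs model internals, a new and highly efficient low-level tool that can serve as a lightweight complement to existing, more complex verification pipelines.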

Abstract: We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000$\times$ (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding $R^{2} = 0.94$ and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in $q$ (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of standard deviations of separation per few thousand output tokens. Two capabilities follow. First, statistical model fingerprinting: text from a vendor-delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent-substitution audits. Second, a model-agnostic reference distribution for black-box output assessment, from which we derive a single-pass scoring primitive that composes with model log probabilities when available and degrades to a rank-only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors in domain-appropriate vocabulary). We position the primitive as a first-pass triage layer in compound evaluation stacks, not as a replacement for sampling-based or source-conditioned verifiers.
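
To make the reference distribution concrete, the sketch below assumes the standard Zipf-Mandelbrot form, p(r) proportional to (r + q)^(-s) over ranks r = 1..N. The abstract names the parameter `q`; the exponent name `s`, the function names, and the use of mean surprisal as a rank-only score are assumptions made here for illustration, not the paper's actual primitive.

```python
import math


def mandelbrot_pmf(n_ranks, q, s):
    """Normalized Zipf-Mandelbrot probabilities for ranks 1..n_ranks."""
    weights = [(r + q) ** (-s) for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_surprisal(ranks, pmf):
    """Average -log p(rank) in nats for a sequence of 1-indexed token ranks.

    Assumed scoring rule: text that leans on unusually rare ranks scores higher.
    """
    return sum(-math.log(pmf[r - 1]) for r in ranks) / len(ranks)
```

Setting `q = 0` recovers plain Zipf, which is why the two families can be compared by AIC: Mandelbrot adds one parameter and has to earn it with a better fit.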


cs.LG [Back]

[64] Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks cs.LG | cs.CLPDF

Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

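TL;DR: This paper introduces Odysseys, a new benchmark for evaluating web agents on realistic long-horizon tasks. It contains 200 multi-site, long-workflow tasks derived from real browsing sessions and evaluated on the live Internet. Finding existing binary evaluation inadequate, the authors introduce rubric-based evaluation and propose a Trajectory Efficiency metric. Tests show that even frontier models leave substantial room for improvement in both task success rate (44.5%) and efficiency (1.15% rubric score per step).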

Details

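Motivation: Most existing web agent benchmarks focus on short, single-site tasks on which frontier models are approaching saturation. They do not reflect real-world long-horizon, multi-site workflows that require sustained context and cross-site reasoning, such as comparing prices across domains or planning trips across multiple services.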

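Result: On the Odysseys benchmark, the strongest frontier models achieve a task success rate of 44.5% and a Trajectory Efficiency (rubric score per step) of only 1.15%, indicating that current agents have substantial room to improve in both success rate and efficiency on long-horizon tasks.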

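Insight: The contributions are the first benchmark built from real long-horizon browsing sessions and evaluated on the live Internet, a fine-grained evaluation method based on multi-dimensional rubrics (which outperforms the commonly used trajectory-level LLM-as-a-judge evaluation), and an efficiency-focused Trajectory Efficiency metric. Together they provide a more realistic measure of an agent's ability to operate for extended periods in open-web environments.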

Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
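
The Trajectory Efficiency metric is described only as "rubric score per step". A minimal sketch follows, assuming the rubric score is the fraction of rubric items satisfied; the paper's exact normalization and grading scale may differ, and the function names are illustrative.

```python
def rubric_score(rubric_results):
    """Fraction of rubric items satisfied (assumed definition)."""
    if not rubric_results:
        raise ValueError("at least one rubric item is required")
    return sum(bool(r) for r in rubric_results) / len(rubric_results)


def trajectory_efficiency(rubric_results, num_steps):
    """Rubric score divided by the number of steps the agent took."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(rubric_results) / num_steps
```

Under this reading, an agent that satisfies half of a task's rubrics in 50 steps scores 0.01, that is, 1% per step, so two agents with the same rubric score are separated by how many steps they needed.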


[65] VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation cs.LG | cs.CL | cs.CV | stat.MLPDF

Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi

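TL;DR: This paper studies the problem that scores given by vision-language models (VLMs) used as automated judges of multimodal systems carry no indication of reliability. Using a conformal prediction framework that requires no retraining, the authors convert VLM scores into calibrated prediction intervals and present the first systematic analysis of three VLM judges across 14 visual task categories. They find that evaluation uncertainty is strongly task-dependent and reveal a failure mode not captured by standard evaluation metrics: ranking-scoring decoupling, in which a judge orders responses correctly but cannot assign reliable absolute scores.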

Details

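Motivation: VLMs are increasingly used as automated judges for multimodal systems, but their scores come with no indication of reliability, which limits how far they can be trusted in evaluation.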

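Result: Experiments across 14 visual task categories show that evaluation uncertainty varies by task: prediction intervals cover about 40% of the score range on aesthetics and natural-image tasks, but widen to about 70% on chart and mathematical reasoning tasks. The study also finds that on a clean, multi-annotator captioning benchmark, the same judge and method yield intervals that are 4.5x narrower.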

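Insight: The contribution is the first systematic application of conformal prediction to the VLM-as-a-Judge setting, which quantifies evaluation uncertainty and reveals its task dependence. The core insight is the identification of the ranking-scoring decoupling failure mode, together with the observation that interval width is driven primarily by task difficulty and annotation quality. This offers methodological guidance and a quantitative reliability map for building more reliable multimodal evaluation.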

Abstract: Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge’s point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
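
The idea of turning a point score into a calibrated interval can be sketched with the simplest split-conformal recipe. Note the substitution: the paper derives its intervals from score-token log-probabilities, whereas this sketch uses absolute residuals on a held-out calibration set. The 1 to 10 score range and all names are assumptions.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of calibration residuals |judge - human|.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); returns infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(residuals)[k - 1]


def prediction_interval(point_score, qhat, lo=1.0, hi=10.0):
    """Interval around the judge's score, clipped to an assumed score range."""
    return max(lo, point_score - qhat), min(hi, point_score + qhat)
```

Interval width as a fraction of the score range, `(upper - lower) / (hi - lo)`, is the quantity the reported 40% and 70% figures refer to. A judge can rank well yet still have a large `qhat`, which is the decoupling the abstract describes.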


[66] Barriers to Universal Reasoning With Transformers (And How to Overcome Them) cs.LG | cs.CLPDF

Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn

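TL;DR: This paper examines the limits of Transformers' length generalization when using Chain-of-Thought (CoT), and proposes overcoming these barriers by introducing unique signpost tokens and value-change encodings, yielding a length-generalizable simulation of Turing machines.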

Details

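Motivation: The work asks whether Transformers with CoT can generalize to reasoning chains longer than those seen during training, and addresses the finding that, under standard positional encodings and a finite vocabulary, they cannot go beyond the complexity class TC^0.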

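Result: Theoretical analysis shows that if the vocabulary is allowed to grow with problem size, Transformers can achieve a length-generalizable simulation of Turing machines in which the CoT trace length is linear in the simulated runtime. Empirical results also confirm that the proposed techniques improve length generalization on hard problems.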

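Insight: The contribution is to overcome two core obstacles, repeated copying and last-occurrence retrieval, by introducing unique signpost tokens and an encoding that logs only value changes, which provides actionable guidance for improving Transformer length generalization.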

Abstract: Chain-of-Thought (CoT) has been shown to empirically improve Transformers’ performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that – under standard positional encodings and a finite alphabet – Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts, circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provides actionable guidance to improve length generalization on hard problems.
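
One way to see how logging only value changes lets a reader recover a cell "through counts" is the toy below. It assumes a binary tape that starts at all zeros, so a cell's current value is the parity of how often its signpost appears in the trace. The binary alphabet, the `<posN>` token format, and the function names are simplifications made here; the paper's construction may differ.

```python
from collections import Counter


def log_write(trace, tape, pos, new_value):
    """Apply a write; append the cell's signpost only if the value changes."""
    if tape.get(pos, 0) != new_value:
        tape[pos] = new_value
        trace.append(f"<pos{pos}>")


def read_from_trace(trace, pos):
    """Recover a binary cell by counting its signposts: odd count means 1.

    No search for the most recent write is needed, only a count.
    """
    return Counter(trace)[f"<pos{pos}>"] % 2
```

Because redundant writes add nothing to the trace, the count stays meaningful, and the reader never has to locate the last occurrence of a position or copy earlier tape contents forward.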


[67] Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence cs.LG | cs.AI | cs.CVPDF

NVIDIA: Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki

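TL;DR: Nemotron 3 Nano Omni is the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. With advances in architecture, training data, and training recipes, it improves accuracy consistently over its predecessor, Nemotron Nano V2 VL, across all modalities, with leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the efficient Nemotron 3 Nano 30B-A3B backbone, it also incorporates multimodal token-reduction techniques that deliver substantially lower inference latency and higher throughput than other models of similar size. The authors release model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase.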

Details

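Motivation: To build an efficient, open multimodal model that natively supports audio, text, image, and video inputs, improving performance on multimodal understanding tasks while lowering inference cost.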

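Result: The model consistently outperforms its predecessor, Nemotron Nano V2 VL, across all modalities, and achieves leading (SOTA) results on tasks such as real-world document understanding, long audio-video comprehension, and agentic computer use.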

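Insight: The contributions are native audio input, a first for the Nemotron multimodal series, and multimodal token-reduction techniques that lower inference latency and raise throughput, suggesting a direction for efficient multimodal model design. Releasing the model and some of its resources openly also supports further research and development in the field.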

Abstract: We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.


cs.AI [Back]

[68] Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling cs.AI | cs.CL | cs.LGPDF

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

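TL;DR: This paper revisits the effect of LLM pruning on reasoning performance under test-time compute scaling (TTS) and finds that, unlike structured pruning, unstructured pruning does not significantly degrade TTS performance and at times even outperforms the unpruned full model.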

Details

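Motivation: Current large language models (LLMs) reason well but have massive parameter counts and high inference costs. The work studies pruning methods that reduce model size while preserving performance, and in particular examines whether unstructured pruning is effective in the TTS setting, challenging the earlier assumption that pruning harms TTS performance.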

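Result: Across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, unstructured pruning improves TTS performance relative to structured pruning and at times even outperforms the unpruned full model. The paper also empirically studies the impact of different layer-wise sparsity allocation strategies.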

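Insight: The contribution is to show that unstructured pruning, which selectively removes redundant or detrimental weights, can enhance TTS reasoning. This challenges the conventional view that pruning always harms TTS performance, highlights the importance of the layer-wise sparsity allocation strategy, and suggests a direction for efficient LLM deployment.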

Abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can improve TTS effectiveness even further.
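
As a concrete picture of unstructured pruning, the sketch below zeroes individual weights by magnitude within one layer. Magnitude is used only as a familiar stand-in criterion: the abstract does not say which pruning methods were evaluated, and the function name and list-of-lists weight format are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries of a 2D weight matrix.

    `sparsity` is the target fraction of entries to remove (0.0 to 1.0).
    Ties at the threshold are all pruned, so the result can be slightly
    sparser than requested. Returns a new matrix; the input is unchanged.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
```

A layer-wise sparsity allocation strategy, in this picture, is simply the rule that picks a different `sparsity` value for each layer's matrix. Structured pruning would instead delete whole blocks, regardless of the individual weights inside them.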


[69] Recursive Multi-Agent Systems cs.AI | cs.CL | cs.LGPDF

Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu

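TL;DR: This paper proposes RecursiveMAS, a recursive multi-agent framework that treats the entire multi-agent system as a unified latent-space recursive computation. The framework connects heterogeneous agents into a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. With an inner-outer loop learning algorithm for iterative whole-system co-optimization, it achieves higher accuracy, faster inference, and lower token usage on multiple benchmarks.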

Details

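Motivation: To extend recursive computation, a new scaling axis, from a single model to multi-agent systems, and to ask whether agent collaboration itself can be scaled through recursion to deepen reasoning.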

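Result: On 9 benchmarks spanning mathematics, science, medicine, search, and code generation, compared with advanced single-agent, multi-agent, and recursive computation baselines, RecursiveMAS improves average accuracy by 8.3%, speeds up end-to-end inference by 1.2x-2.4x, and reduces token usage by 34.6%-75.6%.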

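Insight: The multi-agent system as a whole is modeled as a latent-space recursive computation, with the RecursiveLink module providing efficient cross-agent latent state transfer and co-optimization. The proposed inner-outer loop learning algorithm performs gradient-based credit assignment across recursion rounds, keeping recursive training stable and efficient.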

Abstract: Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2$\times$-2.4$\times$ end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and data are available at https://recursivemas.github.io.


cs.RO [Back]

[70] Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System cs.RO | cs.AI | cs.CL | cs.CVPDF

Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He

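TL;DR: This paper proposes Libra-VLA, a novel coarse-to-fine dual-system Vision-Language-Action (VLA) architecture. It reaches a training equilibrium by explicitly decoupling learning complexity into coarse and fine levels, and uses this structural modularity to implement an asynchronous execution strategy, addressing the semantic-actuation gap that existing methods face when mapping high-level semantic instructions to continuous actions.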

Details

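Motivation: Prevailing VLA models adopt a monolithic generation paradigm that maps visual-linguistic features directly to high-frequency motor commands in a flat, non-hierarchical fashion. This overlooks the inherent hierarchy of robotic manipulation, widening the gap between semantics and actuation and imposing a heavy representational burden on grounding high-level semantics in continuous actions.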

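Result: Empirical analysis shows that performance follows an inverted-U curve with respect to action decomposition granularity, peaking when the learning difficulty of the two subsystems is balanced.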

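Insight: The contribution is to explicitly decompose a hybrid action space into discrete macro-directional reaching and continuous micro-pose alignment, and to adopt a coarse-to-fine, dual-system asynchronous architecture, offering a scalable, robust, and responsive solution for open-world manipulation.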

Abstract: Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment, severely widening the semantic-actuation gap and imposing a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment. Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.
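
The split into a discrete macro-directional token and a continuous refinement can be pictured with a toy decomposition of a 3D end-effector displacement. Everything here is hypothetical: the six axis-aligned tokens, the fixed `step` size, and the function names are illustrative choices, not the paper's action tokenizer.

```python
AXES = "xyz"


def decompose(delta, step=0.05):
    """Split a 3D displacement into a coarse direction token and a residual.

    The token names the dominant axis and its sign; the residual is what
    remains after a fixed-size step along that direction.
    """
    axis = max(range(3), key=lambda i: abs(delta[i]))
    sign = 1.0 if delta[axis] >= 0 else -1.0
    token = ("+" if sign > 0 else "-") + AXES[axis]
    residual = list(delta)
    residual[axis] -= sign * step
    return token, tuple(residual)


def recompose(token, residual, step=0.05):
    """Rebuild the full displacement from the token and the residual."""
    sign = 1.0 if token[0] == "+" else -1.0
    action = list(residual)
    action[AXES.index(token[1])] += sign * step
    return tuple(action)
```

In this toy, a larger `step` or a richer token set moves more of the work to the coarse stage and leaves less for the continuous stage, which is one way to read the granularity trade-off behind the reported inverted-U curve.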


[71] VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis cs.RO | cs.CV | cs.LG | eess.SY | math.OCPDF

Antoine P. Leeman, Shuyu Zhan, Melanie N. Zeilinger, Glen Chou

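TL;DR: This paper proposes VISION-SLS, a nonlinear output-feedback control method operating on high-resolution RGB images that provides robust constraint satisfaction guarantees under partial observability, sensor noise, and nonlinear dynamics. The method combines a low-dimensional observation map learned from pretrained visual features, with state-dependent error bounds, and a causal affine time-varying output-feedback policy optimized via System Level Synthesis (SLS). A novel, scalable solver is developed for the resulting nonconvex optimization problem.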

Details

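Motivation: To achieve safe, robust output-feedback control from high-dimensional visual inputs such as images, with guaranteed constraint satisfaction, particularly under partial observability, noise, and nonlinear dynamics.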

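Result: On two simulated visuomotor tasks (a 4D car and a 10D quadrotor, with images of at least 512 x 512 pixels) and a partially observable 59D humanoid task, the method produces safe, information-gathering behavior and guarantees constraint satisfaction within empirically calibrated error bounds. Hardware experiments on a ground vehicle also validate the method, which outperforms baselines in safety rate and solve time.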

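Insight: The contribution is to combine learned visual abstractions, with quantified error bounds, with the SLS framework and to develop an efficient solver (sequential convex programming coupled with Riccati recursions), making SLS-based safe visuomotor output-feedback control practical at scale.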

Abstract: We propose VISION-SLS, a method for nonlinear output-feedback control from high-resolution RGB images which provides robust constraint satisfaction guarantees under calibrated uncertainty bounds despite partial observability, sensor noise, and nonlinear dynamics. To enable scalability while retaining guarantees, we propose: (i) a learned low-dimensional observation map from pretrained visual features with state-dependent error bounds, and (ii) a causal affine time-varying output-feedback policy optimized via System Level Synthesis (SLS). We develop a scalable, novel solver for the resulting nonconvex program that leverages sequential convex programming coupled with efficient Riccati recursions. On two simulated visuomotor tasks (a 4D car and a 10D quadrotor) with >= 512 x 512 pixels and a 59D humanoid task with partial observability, our method enables safe, information-gathering behavior that reduces uncertainty while guaranteeing constraint satisfaction with empirically-calibrated error bounds. We also validate our method on hardware, safely controlling a ground vehicle from onboard images, outperforming baselines in safety rate and solve times. Together, these results show that learned visual abstractions coupled with an efficient solver make SLS-based safe visuomotor output-feedback practical at scale. The code implementation of our method is available at https://github.com/trustworthyrobotics/VISION-SLS.
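
For readers unfamiliar with the Riccati recursions mentioned in the abstract, the sketch below shows the textbook backward recursion for a scalar finite-horizon LQR problem with dynamics x[t+1] = a*x[t] + b*u[t]. It is not the paper's SLS solver, only the basic building block that such solvers reuse for speed; all names are illustrative.

```python
def riccati_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for scalar finite-horizon LQR.

    Stage cost is q*x^2 + r*u^2 and terminal cost is q_final*x^2.
    Returns (gains, p0): gains[t] is the time-varying feedback gain for the
    policy u[t] = -gains[t] * x[t], and p0 is the cost-to-go weight at t = 0.
    """
    p = q_final
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
        gains.append(k)
    gains.reverse()  # computed from the final step backward; reorder to t = 0..N-1
    return gains, p
```

Each pass costs a fixed amount of work per time step, so the recursion scales linearly with the horizon, which is the property that makes it attractive inside an iterative scheme such as sequential convex programming.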