Table of Contents
- cs.CL [Total: 30]
- cs.CV [Total: 77]
- q-bio.QM [Total: 1]
- cs.DB [Total: 1]
- cs.SD [Total: 3]
- cs.CY [Total: 1]
- cs.HC [Total: 2]
- eess.IV [Total: 1]
- cs.LG [Total: 4]
- cs.AI [Total: 1]
cs.CL
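[1] Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams cs.CL
Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon
TL;DR: This paper introduces StreamBench, a benchmark for evaluating language models on massive document streams that contain multiple concurrent events, covering three tasks: topic clustering, temporal question answering, and summarization. Supplying structural cues that organize key facts by event improves performance on clustering and temporal QA, which suggests structural cues are a promising direction for handling massive document streams.
Details
Motivation: Existing benchmarks either focus on a single complex event or provide curated inputs for each query, so they do not evaluate models under the conflicts that arise when multiple concurrent events are mixed in the same document stream. A benchmark closer to real streaming conditions is therefore needed.
Result: On StreamBench (605 events and 15,354 documents drawn from major news stories in 2016 and 2025), structural cues improve clustering by up to 4.37% and temporal QA by up to 9.63%, helping models locate relevant information and separate distinct events.
Insight: The contribution is StreamBench, a benchmark that simulates the conflicts of real news streams, together with a diagnostic demonstration that structural cues (key facts organized by event) improve model performance in streaming settings when supplied as auxiliary information. This points to a direction for future work on massive document streams. Viewed objectively, injecting event structure as external knowledge to reduce information confusion is a practical and reusable idea.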
Abstract: Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
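[2] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams cs.CL | cs.AI
Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang
TL;DR: This paper introduces GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems for evaluating how well large language models perform multi-step geometric reasoning over text and diagrams. Experiments show a clear gap between current advanced LLMs and humans and reveal common failure modes: exact-match failures, weak visual reliance, and overextended reasoning that does not converge.
Details
Motivation: Existing geometry reasoning benchmarks are limited in scale and rarely offer visually grounded multiple-choice questions, which makes reliable evaluation of complex reasoning difficult. The paper builds a large-scale geometry dataset with fine-grained annotations to evaluate LLMs' symbolic reasoning more comprehensively.
Result: In experiments on multiple advanced LLMs, the best model, GPT-5-nano, reaches an exact-match score of 75.89, well below the human score of 94.74.
Insight: The novelty is a large-scale, automatically generated multiple-choice geometry dataset with fine-grained complexity ratings and formal language annotations that support controlled evaluation. Viewed objectively, the work systematically identifies three common failure modes of LLMs in geometric reasoning, giving clear directions for model improvement.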
Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
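[3] From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting cs.CL
Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng
TL;DR: This paper introduces FinReasoning, a benchmark for evaluating large language models on generating Chinese financial research reports. It decomposes report generation into three stages (semantic consistency, data alignment, and deep insight) and adds a fine-grained evaluation framework covering hallucination correction and a 12-indicator rubric for core analytical skills. The evaluation finds that most models show an understanding-execution gap and that no model excels at every stage.
Details
Motivation: Existing financial benchmarks focus on comprehension of completed reports rather than on whether a model can produce reliable analysis. Existing evaluation frameworks only flag hallucinations and lack structured measures of deeper analytical skills, so key analytical bottlenecks go undiscovered.
Result: On FinReasoning, most models show an understanding-execution gap: they can identify errors but struggle to generate accurate corrections, and they can retrieve data but struggle to return it in the correct format. Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three overall, each with a different capability distribution, and no model achieves overwhelming superiority across all three tracks.
Insight: The novelty is a hierarchical evaluation that decomposes financial report generation into three stages aligned with real analyst workflows, plus an evaluation framework that strengthens hallucination-correction assessment and includes a 12-indicator rubric for analytical skills. Together these offer a new way to evaluate models' analytical reasoning systematically.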
Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures (factual errors, numerical inconsistencies, fabricated references, and shallow analysis) that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in the correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.
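[4] LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models cs.CL | cs.AI
Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang
TL;DR: This paper proposes LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that addresses the cognition-action gap large language models show when following length instructions. By combining length-oriented reinforcement learning with hindsight length awareness, it jointly optimizes the model's internal representation of length information and its policy for satisfying length constraints, achieving precise and reliable length instruction following.
Details
Motivation: Existing methods mainly enforce length constraints by externally imposing length signals or optimization objectives and overlook the model's intrinsic deficit in length cognition. The paper targets this underlying limitation in precisely controlling output length.
Result: Extensive experiments on four base models show that LARFT improves by an average of +20.92 points across three length instruction following benchmarks, with only a marginal decline of -1.45 points on four general capability benchmarks, outperforming existing baselines.
Insight: The novelty is aligning length cognition with model action: hindsight self-awareness tasks teach the model to identify the actual length of its own generations, so its internal length representation and its policy are optimized jointly. This offers a new approach to controlling specific constraints in instruction following.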
Abstract: Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model’s intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model’s length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model’s internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
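The hindsight step described in the abstract can be pictured with a small sketch. It is not the authors' implementation: the function names (`length_reward`, `build_hindsight_example`), the prompt wording, the word-based length unit, and the linear reward shape are all illustrative assumptions.

```python
# Illustrative sketch only; not LARFT's actual reward or data format.

def count_words(text: str) -> int:
    """Length unit assumed here: whitespace-separated words."""
    return len(text.split())


def length_reward(generation: str, target_words: int) -> float:
    """Hypothetical length-oriented reward.

    Returns 1.0 on an exact hit and decays linearly to 0.0 as the
    word-count error approaches the target. Assumes target_words > 0.
    """
    error = abs(count_words(generation) - target_words)
    return max(0.0, 1.0 - error / target_words)


def build_hindsight_example(generation: str) -> dict:
    """Turn an on-policy generation into a task asking for its actual length."""
    return {
        "prompt": f"How many words does the following text contain?\n\n{generation}",
        "target": str(count_words(generation)),
    }


sample = "Structural cues help models separate concurrent events."
example = build_hindsight_example(sample)        # hindsight self-awareness task
reward = length_reward(sample, target_words=10)  # policy-side length signal
```

In this toy setup the same generation feeds both signals: the reward scores how close the output came to the requested length, and the hindsight example supervises the model's estimate of the length it actually produced.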
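[5] The α-Law of Observable Belief Revision in Large Language Model Inference cs.CL | cs.AI
Mike Farmer, Abhinav Kochar, Yugyung Lee
TL;DR: This paper studies how instruction-tuned large language models update answer probabilities during inference and finds a multiplicative scaling law (the α-law), in which a belief revision exponent controls how prior beliefs and verification evidence are combined in an update. Theoretical analysis shows that an exponent below one is necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across multiple models and datasets shows near-Bayesian updates that sit slightly above the stability boundary in single-step revision, while in multi-step revision the exponent decreases and yields contractive long-run dynamics, consistent with the theory.
Details
Motivation: LLMs that iteratively revise their outputs through chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees about the stability of their probability updates.
Result: Empirical evaluation on 4,975 problems spanning GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge, with multiple model families including GPT-5.2 and Claude Sonnet 4, shows near-Bayesian update behavior in single-step revision (slightly above the stability boundary). Multi-step experiments show the exponent decreasing over successive revisions, producing contractive long-run dynamics consistent with the theoretical stability predictions. Token-level validation with Llama-3.3-70B further confirms similar behavior for both log-probability measurements and self-reported confidence elicitation.
Insight: The novelty is the α-law, a multiplicative scaling law describing observable inference-time update behavior, with the belief revision exponent serving as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems. Viewed objectively, the work characterizes LLM update dynamics from externally observable behavior rather than internal Bayesian reasoning and exposes architecture-specific trust-ratio patterns (GPT-5.2 balances prior and evidence, while Claude modestly favors new evidence), providing a theoretical framework and empirical baseline for understanding and evaluating the stability of iterative reasoning.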
Abstract: Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
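The abstract does not state the functional form of the α-law, so the following is only a toy model of why an exponent below one matters. It assumes, purely for illustration, that each revision adds a fixed evidence term to the log-odds (a Bayesian step) and then scales the result by `alpha`; the names `revise` and `run` are hypothetical.

```python
# Toy recurrence, NOT the paper's α-law: the functional form is an assumption.

def revise(log_odds: float, evidence: float, alpha: float) -> float:
    """One revision step: Bayesian log-odds addition, scaled by alpha."""
    return alpha * (log_odds + evidence)


def run(alpha: float, evidence: float = 1.0, steps: int = 200) -> float:
    """Apply the same evidence repeatedly, starting from even odds."""
    log_odds = 0.0
    for _ in range(steps):
        log_odds = revise(log_odds, evidence, alpha)
    return log_odds


stable = run(alpha=0.8)    # contracts toward alpha * e / (1 - alpha) = 4.0
unstable = run(alpha=1.1)  # keeps growing as the number of steps increases
```

Under this assumed recurrence the fixed point exists only when `alpha` is below one, which mirrors the stability condition stated in the abstract; with `alpha` above one, repeated revision drives the log-odds without bound.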
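[6] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion cs.CL | cs.AI | cs.LG
Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen
TL;DR: This paper proposes a novel reinforcement distillation framework for distilling robust reasoning from large language models into smaller student models. It uses "Explanatory Inversion" to generate targeted probes that force the student to articulate the logic behind an answer, combined with a novel reinforcement learning algorithm that rewards coherent reasoning, addressing the pattern memorization and poor generalization common in existing distillation methods.
Details
Motivation: Distilling the strong reasoning of LLMs into smaller, computationally efficient student models remains an unresolved challenge. Existing methods often leave the student with superficial pattern memorization and poor generalization. The paper aims to move beyond simple mimicry and instill deeper conceptual understanding.
Result: Extensive evaluation on 12 datasets shows significant improvements. With Gemma-7b as the student, the method yields an average 20.39% gain over zero-shot performance and a 6.02% gain over state-of-the-art distillation baselines. Models trained with it also show strong training efficiency (for example, surpassing vanilla fine-tuning with only 10-25% of the training data) and strong generalization to out-of-distribution tasks.
Insight: There are two core innovations. (1) Explanatory Inversion generates "explanatory probes" that force the model to explain its reasoning rather than memorize answer patterns, which addresses pattern memorization. (2) Explanatory GRPO is a novel reinforcement learning algorithm with a Dialogue Structure Utility Bonus that explicitly rewards the model for keeping its reasoning coherent across the probe dialogue, which improves generalization. Viewed objectively, combining reinforcement learning with explanatory probes gives knowledge distillation a structured, interpretable optimization path that may improve the intrinsic reasoning quality of small models.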
Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at https://github.com/Zhen-Tan-dmml/ExGRPO.git.
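[7] Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication cs.CL | cs.IR
Yuchen Du, Ashley Li, Zixi Huang
TL;DR: This paper proposes a graph-enhanced LLM framework for e-commerce appeal adjudication that addresses information asymmetry in traditional review workflows through explicit action modeling and conflict-aware graph reasoning. The framework builds a knowledge graph from historical cases and adds a Request More Information (RMI) mechanism, substantially improving agreement between automated adjudication and human expert decisions.
Details
Motivation: In hierarchical review workflows, information asymmetry (for example, Checkers have verification actions that Makers or automated systems lack) makes it hard to learn from correction signals. The paper addresses this with explicit action modeling and structured reasoning.
Result: In a large-scale evaluation on e-commerce seller appeal adjudication, a standard LLM-only baseline reaches only 70.8% alignment with human experts. Adding action modeling with RMI raises this to 87.5%, and adding the retrieval-based knowledge graph gives 95.8% offline. After online deployment the framework maintains a 96.3% alignment rate in production, demonstrating real-world effectiveness.
Insight: The novelty includes the EAFD (Evidence-Action-Factor-Decision) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and learns from correction signals through conflict modeling. The framework's RMI capability identifies exactly which verification actions remain unexecuted when evidence is insufficient and generates targeted information requests, improving the interpretability and practicality of the reasoning.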
Abstract: Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
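[8] Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization cs.CL | cs.AI
Quanjia Xiao, Weimin Ouyang, Zonglin Yang, Tianhao Wu, Qingguo Zhou
TL;DR: Targeting combustion science, this paper proposes the first full-stack domain-enhanced LLM workflow, which combines automated domain corpus construction, incremental pre-training, instruction fine-tuning, and reinforcement learning with verifiable rewards so that the model internalizes physical laws rather than only textual statistical patterns. It also releases FlameBench, a dedicated evaluation benchmark. Experiments show the approach significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks.
Details
Motivation: General-purpose LLMs hallucinate severely on complex physical systems such as combustion science because they lack domain knowledge and cannot adhere to physical conservation laws. The goal is a domain-specific model with reliable scientific reasoning.
Result: On FlameBench, a benchmark designed for complex reasoning in combustion science, the proposed model significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. The abstract does not name the baseline models.
Insight: The novelty is a complete, verifiable domain-enhancement workflow spanning data construction through reinforcement learning, designed so the model learns the underlying physics, together with the release of FlameBench, a domain-specific benchmark that provides a basis for reliable evaluation of domain LLMs. Viewed objectively, building verifiability (such as adherence to physical laws) into the whole training pipeline is a useful reference for building highly reliable domain expert models.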
Abstract: Adapting large language models (LLMs) to professional fields through task adaptation and capability enhancement shows significant application potential. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
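[9] From Tokens To Agents: A Researcher’s Guide To Understanding Large Language Models cs.CL
Daniele Barolo
TL;DR: This chapter gives researchers without a technical background a comprehensive framework for understanding large language models (LLMs). It breaks LLMs into six core components (pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities) to help researchers critically assess whether and how LLMs fit their specific research needs, illustrated by a case study on simulating social media dynamics with LLM-based agents.
Details
Motivation: Researchers face a critical choice about using LLMs and need to understand the mechanisms that shape what LLMs can and cannot do, but comprehensive guidance that does not require deep technical expertise is lacking.
Result: The chapter reports no quantitative experiments or benchmark results. It instead builds an analytical framework and demonstrates it through an extended case study on simulating social media dynamics.
Insight: The novelty is breaking the complex technical mechanisms of LLMs into six understandable core components and analyzing each through its research implications and specific affordances and limitations rather than giving prescriptive guidance. This gives researchers across fields a practical framework for critical reasoning.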
Abstract: Researchers face a critical choice: how to use – or not use – large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
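[10] Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation cs.CL | cs.LG
Eslam Reda, Maged Yasser, Sara El-Metwally
TL;DR: This paper presents Autonoma, a hierarchical multi-agent framework for end-to-end workflow automation. Through a high-level Coordinator, Planner, and Supervisor, it turns open-ended natural language instructions into robust multi-step workflows executed by modular specialized agents, addressing the scalability, error propagation, and task focus problems of monolithic agent architectures.
Details
Motivation: Current monolithic agent architectures struggle to translate open-ended user instructions into robust multi-step workflows reliably: they scale poorly, propagate errors, and have difficulty maintaining focus across diverse tasks.
Result: Operating in a secure LAN environment, Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, supporting its operational reliability and efficient collaboration.
Insight: The novelty is a principled hierarchical architecture that decouples orchestration logic from specialized execution, ensures robustness through active monitoring and error handling, and integrates new capabilities as plug-and-play agents, improving extensibility and modularity. The system also accepts multi-modal input (text, voice, image, files) and supports English and Arabic, improving inclusivity.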
Abstract: The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
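[11] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation cs.CL | cs.AI
Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang
TL;DR: This paper introduces CURE, a multimodal benchmark for evaluating the understanding and retrieval abilities of multimodal large language models in clinical diagnosis. It contains 500 multimodal clinical cases mapped to physician-cited reference literature and is designed to separate a model's foundational multimodal reasoning from its ability to retrieve and apply evidence.
Details
Motivation: Existing benchmarks mainly evaluate MLLMs in end-to-end answering scenarios, which limits the ability to distinguish foundational multimodal reasoning from proficiency in evidence retrieval. A new benchmark is needed to disentangle the two.
Result: When supplied with physician reference evidence, advanced models reach up to 73.4% accuracy on differential diagnosis. When relying on independent retrieval mechanisms, performance drops to as low as 25.4%, highlighting the dual challenge of integrating multimodal clinical evidence and retrieving precise supporting literature.
Insight: CURE's novelty is separating the evaluation of reasoning and retrieval through controlled evidence settings, which exposes retrieval as a bottleneck for MLLMs in clinical applications and points to a key direction for future model improvement.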
Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model’s foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
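[12] From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG cs.CL | cs.AI
Yucheng Chu, Haoyu Han, Shen Dong, Hang Li, Kaiqi Yang
TL;DR: This paper proposes a Graph Retrieval-Augmented Generation (GraphRAG) framework for automated short answer grading (ASAG). It organizes reference materials into a structured knowledge graph that explicitly models dependencies between concepts, overcoming the difficulty "flat" vector retrieval has in capturing structural relationships and supporting multi-hop reasoning over complex educational content. Experiments show the structural retrieval approach significantly outperforms standard RAG baselines on an NGSS dataset.
Details
Motivation: LLMs used for ASAG struggle with hallucinations and strict rubric adherence because they rely on generalized pre-training. Standard RAG's "flat" vector retrieval treats knowledge as isolated fragments and fails to capture the structural relationships and reasoning chains that complex educational content requires.
Result: Experimental evaluation on a Next Generation Science Standards (NGSS) dataset shows the structural approach significantly outperforms standard RAG baselines on all metrics. The HippoRAG implementation in particular achieves substantial improvements in evaluating Science and Engineering Practices (SEP), supporting the advantage of structural retrieval for verifying the logical reasoning chains required in higher-order academic assessment.
Insight: The core innovation is moving RAG from "flat" vector retrieval to structured, knowledge-graph-based retrieval (GraphRAG): a knowledge graph is built and a neurosymbolic algorithm (HippoRAG) performs associative graph traversal to retrieve comprehensive, connected evidence subgraphs, which better supports multi-hop reasoning and verification of logical chains. This offers a new knowledge-augmentation paradigm for assessment tasks that require complex reasoning.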
Abstract: Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard “flat” vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
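The "associative graph traversal" mentioned in the abstract can be illustrated with Personalized PageRank, the seeded random-walk ranking that HippoRAG-style retrieval is generally described as using. The sketch below is a minimal pure-Python version under stated assumptions: the concept graph, its edges, and the seed choice are invented for illustration and are not taken from the paper or from the HippoRAG codebase.

```python
# Minimal Personalized PageRank sketch; graph and seeds are hypothetical.

def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Rank nodes by proximity to the seed nodes via power iteration.

    graph: dict mapping each node to a list of neighbor nodes.
    seeds: set of nodes the random walk restarts from.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
                continue
            share = damping * rank[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank


# Toy undirected concept graph (each edge is listed in both directions).
concepts = {
    "photosynthesis": ["chlorophyll", "glucose"],
    "chlorophyll": ["photosynthesis"],
    "glucose": ["photosynthesis", "cellular respiration"],
    "cellular respiration": ["glucose", "ATP"],
    "ATP": ["cellular respiration"],
    "plate tectonics": ["earthquake"],
    "earthquake": ["plate tectonics"],
}

scores = personalized_pagerank(concepts, seeds={"photosynthesis"})
top_evidence = sorted(scores, key=scores.get, reverse=True)[:4]
```

Unlike a flat top-k vector lookup, the walk spreads relevance along edges, so concepts several hops from the query (here `ATP`) still receive some score while the disconnected geology nodes receive none.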
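[13] Multilingual Hate Speech Detection and Counterspeech Generation: A Comprehensive Survey and Practical Guide cs.CL
Zahra Safdari Fesaghandis, Suman Kalyan Maity
TL;DR: This paper is a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation. It analyzes the limitations of monolingual systems in non-English and code-mixed settings and proposes a three-phase framework covering task design, data curation, and evaluation.
Details
Motivation: Online hate speech in multilingual settings poses challenges that monolingual systems do not meet, in particular their failure to capture culturally specific expressions and implicit hate. The aim is fairer, more effective detection and counterspeech generation.
Result: The survey consolidates recent progress in multilingual resources and techniques and identifies persistent obstacles, including data scarcity, system bias, and the need for multimodal solutions. It reports no quantitative experiments or benchmark comparisons.
Insight: The novelty is combining technical progress with ethical and cultural considerations in a scalable framework that emphasizes context-aware, inclusive system design, giving researchers and practitioners a practical guide.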
Abstract: Combating online hate speech in multilingual settings requires approaches that go beyond English-centric models and capture the cultural and linguistic diversity of global online discourse. This paper presents a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation, integrating recent advances in natural language processing. We analyze why monolingual systems often fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions. To address these challenges, we outline a structured three-phase framework - task design, data curation, and evaluation - drawing on state-of-the-art datasets, models, and metrics. The survey consolidates progress in multilingual resources and techniques while highlighting persistent obstacles, including data scarcity in low-resource languages, fairness and bias in system development, and the need for multimodal solutions. By bridging technical progress with ethical and cultural considerations, we provide researchers, practitioners, and policymakers with scalable guidelines for building context-aware, inclusive systems. Our roadmap contributes to advancing online safety through fairer, more effective detection and counterspeech generation across diverse linguistic environments.
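[14] URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models cs.CL | cs.AI | cs.IR
Vinh Nguyen, Cuong Dang, Jiahao Zhang, Hoa Tran, Minh Tran
TL;DR: This paper introduces URAG, a benchmark for systematically evaluating uncertainty quantification in retrieval-augmented generation (RAG) with large language models. It reformulates open-ended generation tasks as multiple-choice questions, quantifies uncertainty with conformal prediction, and evaluates 8 standard RAG methods across healthcare, programming, science, math, and general text.
Details
Motivation: Current RAG evaluations focus mainly on answer correctness and do not adequately measure how retrieval affects LLM uncertainty and reliability, so a comprehensive benchmark is needed to fill this gap.
Result: Eight RAG methods are evaluated on URAG using accuracy and prediction-set sizes based on the LAC and APS metrics. The analysis finds that accuracy gains often coincide with reduced uncertainty, though this relationship breaks under retrieval noise; that simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and that no single RAG method is reliable across all domains.
Insight: The novelty is URAG, a comprehensive benchmark focused on uncertainty quantification for RAG systems, with principled evaluation via conformal prediction. Viewed objectively, it shows that retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations, and it provides a systematic tool for analyzing and improving the trustworthiness of retrieval-augmented systems.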
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
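To make the prediction-set measurement concrete, here is a minimal split conformal prediction sketch for multiple-choice answers using the standard LAC and (non-randomized) APS nonconformity scores. It is not URAG's evaluation pipeline: the option probabilities, the calibration set, and the error level `alpha` below are made up for illustration.

```python
import math

# Sketch of split conformal prediction for multiple-choice QA.
# All probabilities below are invented; this is not URAG's code.

def lac_score(probs, label):
    """LAC nonconformity: one minus the probability of the given option."""
    return 1.0 - probs[label]


def aps_score(probs, label):
    """APS nonconformity (non-randomized): total probability of the
    options ranked at or above the given option."""
    order = sorted(range(len(probs)), key=lambda j: -probs[j])
    total = 0.0
    for j in order:
        total += probs[j]
        if j == label:
            return total
    raise ValueError("label out of range")


def calibrate(score_fn, cal_probs, cal_labels, alpha):
    """Threshold = the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    scores = sorted(score_fn(p, y) for p, y in zip(cal_probs, cal_labels))
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    return scores[k - 1] if k <= len(scores) else float("inf")


def prediction_set(score_fn, probs, qhat):
    """All options whose nonconformity score is within the threshold."""
    return [j for j in range(len(probs)) if score_fn(probs, j) <= qhat]


cal_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.10, 0.20, 0.30, 0.40],
]
cal_labels = [0, 1, 2, 3]

qhat = calibrate(lac_score, cal_probs, cal_labels, alpha=0.25)
confident = prediction_set(lac_score, [0.90, 0.05, 0.03, 0.02], qhat)
ambiguous = prediction_set(lac_score, [0.50, 0.45, 0.03, 0.02], qhat)
```

A larger prediction set signals higher uncertainty: the confident question keeps a single option, while the ambiguous one keeps two. Swapping `lac_score` for `aps_score` in `calibrate` and `prediction_set` gives the APS variant with the same calibration logic.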
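[15] LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection cs.CL | cs.AI
Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
TL;DR: This paper proposes LLM-MRD, a novel teacher-student framework for multimodal fake news detection. A Student Multi-view Reasoning module first builds a comprehensive foundation from textual, visual, and cross-modal perspectives. A Teacher Multi-view Reasoning module then generates deep reasoning chains as supervision signals. Finally, the core Calibration Distillation mechanism distills this complex reasoning knowledge into the lightweight student model.
Details
Motivation: Existing multimodal fake news detection methods lack comprehensive multi-view judgment and fusion and suffer from inefficient reasoning because of the high computational cost of large language models (LLMs). The paper addresses both problems.
Result: Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Evaluated across all competing methods and datasets, it improves accuracy (ACC) by 5.19% and F1 on the fake class (F1-Fake) by 6.33% on average.
Insight: The novelty is a teacher-student framework that combines multi-view reasoning with knowledge distillation: the LLM generates deep reasoning chains as rich supervision signals, and Calibration Distillation transfers the complex reasoning ability to a lightweight model efficiently, keeping performance high while avoiding the inefficiency of LLM inference.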
Abstract: Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at https://github.com/Nasuro55/LLM-MRD.
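[16] PrefPO: Pairwise Preference Prompt Optimization cs.CL
Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
TL;DR: PrefPO is a minimal prompt optimization method based on pairwise preferences, inspired by reinforcement learning from human feedback (RLHF). An LLM discriminator compares model outputs pairwise and gives feedback to an LLM optimizer, which iteratively improves performance. Evaluated on 9 BIG-Bench Hard tasks and IFEval-Hard, it matches or exceeds SOTA methods on 6/9 tasks, remains effective without labels, and substantially improves prompt conciseness and resistance to prompt hacking.
Details
Motivation: Existing automated prompt optimization methods depend on labeled data and produce verbose, repetitive prompts. PrefPO aims to reduce reliance on labeled data and hyperparameter tuning, requiring only a starting prompt and natural language criteria.
Result: On 9 BIG-Bench Hard tasks, PrefPO matches or exceeds SOTA methods including GEPA, MIPRO, and TextGrad on 6/9 tasks. On IFEval-Hard it performs comparably to TextGrad (82.4% vs 84.5%). Without labels, it closely matches its labeled performance on 6/9 tasks. It reduces the prompt-length inflation and repetitive content seen in existing methods by 3-5x, and both LLM and human judges rate its prompts higher than TextGrad's. It is also less susceptible to prompt hacking (37% vs 86% for TextGrad).
Insight: The innovations include a pairwise-preference optimization framework that reduces data dependence, an iterative feedback loop between an LLM discriminator and an LLM optimizer, support for both labeled and unlabeled optimization, and notably better prompt conciseness and robustness. Viewed objectively, it carries the RLHF idea over to prompt optimization and offers a new approach to lightweight automated prompt engineering.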
Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO’s prompts higher than TextGrad’s. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
[17] Prompt-tuning with Attribute Guidance for Low-resource Entity Matching cs.CL | cs.AI
Lihui Liu, Carl Yang
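TL;DR: This paper proposes PROMPTATTRIB, a low-resource entity matching method that uses attribute-level prompt tuning and logical reasoning to address two problems: the dependence of traditional methods on large amounts of labeled data, and the neglect of attribute information and lack of interpretability in existing prompt-tuning methods.
Details
Motivation: Traditional supervised methods require large amounts of high-quality labeled data, which is costly. Existing low-resource prompt-tuning methods focus mainly on entity-level matching, overlooking critical attribute-level information and lacking interpretability.
Result: Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
Insight: The novelty lies in combining entity-level and attribute-level prompts to incorporate richer contextual information, and in using fuzzy logic formulas for inference to improve interpretability. In addition, inspired by SimCSE, the method integrates dropout-based contrastive learning on soft prompts to further boost performance.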
Abstract: Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
[18] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas cs.CL | cs.GT
Víctor Gallego
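TL;DR: This paper studies LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, the method prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. The focus is feedback engineering, comparing sparse feedback (scalar reward only) with dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where social metrics help the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics act as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. The study also runs an adversarial experiment to determine whether LLMs can reward hack these environments, characterizes five attack classes, and discusses mitigations, highlighting an inherent tension between expressiveness and safety in LLM policy synthesis.
Details
Motivation: The paper asks how LLMs can synthesize programmatic policies in multi-agent sequential social dilemmas, how feedback design (sparse vs. dense) affects the cooperativeness and performance of those policies, and what safety risks the approach carries.
Result: On the Gathering and Cleanup sequential social dilemma benchmarks, with Claude Sonnet 4.6 and Gemini 3.1 Pro, dense feedback (including social metrics) matches or outperforms sparse, scalar-reward-only feedback on all evaluation metrics (efficiency, equality, sustainability, peace), with the largest advantage in Cleanup.
Insight: The novelty is a framework that iteratively generates and refines programmatic policies through LLM prompting, together with a systematic study of social metrics as a coordination signal in feedback engineering, showing that they guide the LLM toward complex cooperative strategies such as territory partitioning and role assignment. Viewed objectively, the method offers an interpretable, training-free alternative for multi-agent policy synthesis, but it also exposes a tradeoff between expressiveness and safety, so mitigations for adversarial attacks deserve attention.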
Abstract: We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
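To make the sparse-versus-dense distinction concrete, here is a minimal sketch of one social metric, equality, computed as one minus the Gini coefficient of per-agent returns. That definition is common in the sequential social dilemma literature, but whether the paper uses exactly this formula is an assumption, and the function names and the feedback dictionary layout are hypothetical.

```python
def equality(returns):
    """One minus the Gini coefficient of per-agent returns (assumed definition)."""
    n = len(returns)
    if n == 0:
        raise ValueError("returns must not be empty")
    total = sum(returns)
    if total == 0:
        return 1.0  # nobody collected anything, so returns are trivially equal
    pairwise_gap = sum(abs(a - b) for a in returns for b in returns)
    return 1.0 - pairwise_gap / (2 * n * total)


def build_feedback(returns, dense=True):
    """Hypothetical feedback payload shown to the LLM between iterations."""
    feedback = {"reward": sum(returns) / len(returns)}
    if dense:
        # Dense feedback adds social metrics on top of the scalar reward.
        feedback["equality"] = equality(returns)
    return feedback


sparse = build_feedback([10, 0], dense=False)
dense = build_feedback([10, 0], dense=True)
```

In this toy case both payloads report the same mean reward, but only the dense one reveals that a single agent collected everything.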
[19] EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models cs.CL | cs.IR | cs.LG
J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
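TL;DR: This paper proposes EvidenceRL, a reinforcement learning framework that aims to reduce hallucinations in large language models (LLMs) by reinforcing adherence to evidence during answer generation. The method optimizes the model by scoring candidate responses for grounding (consistency with retrieved evidence and context) and correctness (agreement with reference answers), and it is validated in two high-stakes domains: cardiac diagnosis and legal reasoning.
Details
Motivation: LLMs are fluent but prone to hallucination, producing answers that look plausible yet lack evidential support. This is especially serious in high-stakes domains where decisions must rest on verifiable information. The paper targets this evidence inconsistency in LLM-generated answers.
Result: On cardiac diagnosis with Llama-3.2-3B, F1@3 rises from 37.0 to 54.5 and grounding (G_max@3) from 47.6 to 78.2; hallucinations drop nearly 5x, and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning with Llama-3.1-8B, Faithfulness rises from 32.8% to 67.6%. The results indicate that the method consistently improves grounding and faithfulness across domains without sacrificing task accuracy.
Insight: The novelty is a reinforcement learning framework, optimized with GRPO, that combines evidence consistency and correctness to correct hallucination behavior directly. Viewed objectively, treating evidence consistency as an optimizable RL objective gives a systematic training method for improving model trustworthiness and factuality.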
Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding (G_max@3) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.
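The scoring-then-GRPO step can be illustrated with a small sketch of group-relative advantages. It assumes the grounding and correctness scores lie in [0, 1] and are blended with a fixed weight; the blend, the weight, and the function names are hypothetical and not taken from the paper. Only the within-group normalization reflects the general idea of GRPO.

```python
from statistics import mean, pstdev


def combined_reward(grounding, correctness, weight=0.5):
    """Hypothetical blend of a grounding score and a correctness score."""
    return weight * grounding + (1.0 - weight) * correctness


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (grounding, correctness), made-up values.
candidates = [(0.9, 1.0), (0.2, 1.0), (0.8, 0.0), (0.1, 0.0)]
rewards = [combined_reward(g, c) for g, c in candidates]
advantages = group_relative_advantages(rewards)
```

Responses that are both grounded and correct receive positive advantages relative to their group, so a correct but poorly grounded answer is pushed less strongly than a correct and well-grounded one.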
[20] FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment cs.CL | cs.AI
Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta
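TL;DR: This paper introduces FDARxBench, an expert-curated, real-world benchmark built from U.S. FDA drug label documents for evaluating document-grounded question answering in the context of generic drug assessment. The benchmark covers factual, multi-hop, and refusal tasks and defines evaluation protocols for open-book and closed-book reasoning. Experiments show that current models fall well short in factual grounding, long-context retrieval, and safe refusal behavior.
Details
Motivation: Current language models struggle to understand and answer questions grounded in complex, heterogeneous clinical and regulatory information such as FDA drug labels. The benchmark is motivated in particular by the need for accurate document QA in generic drug assessment.
Result: Experiments on proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior, providing a foundation for challenging, regulatory-grade evaluation of label comprehension.
Insight: The novelty is a high-quality, expert-curated, multi-stage pipeline, built in collaboration with FDA regulatory assessors, for generating real-world QA examples, along with evaluation protocols designed specifically for regulatory document comprehension. This provides a new benchmark and methodology for evaluating LLM reasoning in specialized domains.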
Abstract: We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
[21] TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models? cs.CL
Xinyu Guo, Yazhou Zhang, Jing Qin
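TL;DR: This paper introduces TextReasoningBench, a benchmark for systematically evaluating the effectiveness and efficiency of reasoning strategies for text classification with large language models. Comparing seven reasoning strategies across ten LLMs and five datasets, the study finds that reasoning does not always improve classification performance and is often inefficient.
Details
Motivation: Current work generally assumes that reasoning strategies uniformly improve performance across NLP tasks, but whether reasoning actually benefits text classification, especially given its substantial token and time costs, remains underexplored.
Result: Reasoning strategies do not universally improve classification performance: CoT and SC-CoT bring only limited gains on large models (+1% to +3%), while more complex methods such as ToT and GoT can even degrade performance. Reasoning is also inefficient: many strategies increase token consumption by 10x to 100x while delivering only marginal performance improvements.
Insight: The novelty is the introduction of cost-aware evaluation metrics (performance gain per reasoning token, and the efficiency of performance improvement relative to token cost growth) and a systematic demonstration of the limits of reasoning strategies in text classification, which challenges the assumption that reasoning is universally beneficial.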
Abstract: Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
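The two cost-aware metrics are described only in words, so the following is a sketch of one plausible reading rather than the paper's definitions. Both formulas, the function names, and the example numbers are assumptions made for illustration.

```python
def gain_per_token(acc_baseline, acc_strategy, extra_tokens):
    """Accuracy gain (in points) per additional reasoning token (assumed formula)."""
    if extra_tokens <= 0:
        raise ValueError("extra_tokens must be positive")
    return (acc_strategy - acc_baseline) / extra_tokens


def relative_efficiency(acc_baseline, acc_strategy, tokens_baseline, tokens_strategy):
    """Relative accuracy change divided by relative token growth (assumed formula)."""
    if tokens_strategy <= tokens_baseline:
        raise ValueError("the strategy must use more tokens than the baseline")
    rel_gain = (acc_strategy - acc_baseline) / acc_baseline
    rel_cost = (tokens_strategy - tokens_baseline) / tokens_baseline
    return rel_gain / rel_cost


# Made-up example: direct prompting at 80.0% with 50 tokens per query,
# a reasoning strategy at 82.0% with 500 tokens per query (10x the cost).
per_token = gain_per_token(80.0, 82.0, extra_tokens=450)
efficiency = relative_efficiency(80.0, 82.0, 50, 500)
```

Under these assumed definitions, a 2-point gain bought with a 10× token budget yields an efficiency far below 1, which is the kind of imbalance the abstract describes.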
[22] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs cs.CL
Xuan Qi, Luxi He, Dan Roth, Xingyu Fu
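TL;DR: This paper proposes DataProphet, a training-free metric for estimating how well supervision data for multimodal large language models (MLLMs) generalizes to a target benchmark. The study finds that intuitive task similarity does not reliably predict downstream gains, whereas DataProphet, which combines multimodal perplexity, similarity, and data diversity, predicts the training value of a dataset and guides better data selection.
Details
Motivation: The paper asks how to predict, before any training is run, the effect of a training dataset on a specific target benchmark, in order to guide the selection of supervision data for MLLMs. It challenges the conventional practice of selecting data by intuitive task similarity.
Result: Experiments on 14 vision-language datasets spanning 7 tasks show that the supervision-data rankings produced by DataProphet correlate strongly with rankings based on actual post-training performance gains (Kendall's tau of 86.0%). For data selection, it improves on uniform selection by up to 6.9%, on a state-of-the-art training-based baseline by 1.4%, and even exceeds oracle selection based on experimental performance by 0.2%.
Insight: The novelty is the finding that data generalization in MLLMs depends more on the specific dataset than on its broad task category, together with DataProphet, a training-free quantitative metric that combines multimodal features to predict data utility, giving a new tool for efficient data selection and mixing.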
Abstract: Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
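Since the headline result is a rank correlation, a minimal sketch of Kendall's tau may help. It implements the tie-free tau-a variant in plain Python; the scores below are invented for illustration and are not from the paper, and which tau variant the authors report is not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical metric scores for four datasets and their measured gains.
predicted_utility = [0.9, 0.7, 0.4, 0.2]
measured_gain = [3.1, 2.0, 2.4, 0.5]
tau = kendall_tau(predicted_utility, measured_gain)
```

Here five of the six dataset pairs are ordered the same way by both lists, so tau is (5 - 1) / 6. A value of 1.0 would mean the metric ranks every pair of datasets exactly as the measured gains do.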
[23] LoopRPT: Reinforcement Pre-Training for Looped Language Models cs.CL
Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li
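TL;DR: This paper proposes LoopRPT, a reinforcement pre-training framework designed for looped language models. By reframing next-token prediction as a next-token reasoning task, and by using an EMA teacher reference and noisy latent rollouts, it assigns reinforcement signals directly to latent reasoning steps, shaping intermediate representations and compressing effective reasoning into fewer iterations.
Details
Motivation: Existing reinforcement learning paradigms mainly target output tokens, which does not match the implicit iterative reasoning of looped language models (LoopLMs). The paper addresses this structural mismatch by developing a reinforcement pre-training method for LoopLMs that shapes intermediate representations directly.
Result: Experiments at multiple model scales on the Ouro architecture show that LoopRPT consistently improves per-step representation quality and achieves Pareto dominance in the accuracy-computation tradeoff. In particular, significant gains on hard tokens indicate that LoopRPT strengthens early-stage reasoning rather than merely encouraging premature exits.
Insight: The core innovation is relocating the RL objective from output tokens to the latent reasoning steps of a looped model, assigning rewards via an EMA teacher and noisy latent rollouts so that intermediate representations are optimized directly and efficient latent reasoning is learned. This gives looped models a new, principled training paradigm.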
Abstract: Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
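The abstract mentions an EMA teacher without detailing it. As background, the sketch below shows only the standard exponential-moving-average parameter update that such teachers typically use; how LoopRPT turns the teacher into a reward reference is not specified here, and the parameter dictionary and decay value are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy scalar "parameters"; real models would hold tensors instead.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes slowly, it provides a more stable reference than the student itself, which is the usual reason for using an EMA copy.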
[24] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking cs.CL
Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer
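TL;DR: This paper studies human label variation (HLV) in multimodal large language model (MLLM) benchmarking and proposes an evaluation protocol that accounts for both human label agreement and disagreement. Applying the protocol to models such as Gemma 3 and Qwen 2.5 VL, it finds that larger models perform best on high-agreement data but can underperform medium-sized models on high-disagreement data, indicating that benchmarks based solely on consensus labels overstate model capabilities.
Details
Motivation: Despite rapid progress in large language models, human label variation (systematic differences among annotators' judgments) remains underexplored in benchmarks. The paper aims to fill this gap and assess MLLMs more realistically on subjective tasks such as content moderation.
Result: SOTA MLLM families, Gemma 3 and Qwen 2.5 VL, are evaluated on a social media content classification dataset. Larger models perform best on high-agreement subsets but often underperform medium-sized models on data where human disagreement is high.
Insight: The novelty is an MLLM benchmarking protocol that explicitly accounts for human label agreement and disagreement. Viewed objectively, the key insight is that parameter count alone does not determine a model's sensitivity to ambiguity and subjectivity, and that incorporating label variation gives more realistic and robust assessments, challenging the traditional reliance on consensus labels.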
Abstract: Human Label Variation (HLV), i.e. systematic differences among annotators’ judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
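A minimal sketch of how non-aggregated annotations could be split into the two conditions follows. The agreement measure (share of annotators choosing the majority label), the 0.8 threshold, and the example labels are assumptions made for illustration, not the protocol defined in the paper.

```python
from collections import Counter


def agreement_ratio(annotations):
    """Share of annotators who chose the most frequent label for one item."""
    counts = Counter(annotations)
    return counts.most_common(1)[0][1] / len(annotations)


def split_by_agreement(items, threshold=0.8):
    """Partition item ids into high-agreement and high-disagreement subsets."""
    high, low = [], []
    for item_id, annotations in items.items():
        target = high if agreement_ratio(annotations) >= threshold else low
        target.append(item_id)
    return high, low


# Hypothetical per-annotator labels for two social media posts.
items = {
    "post_1": ["hate", "hate", "hate", "hate", "hate"],
    "post_2": ["hate", "ok", "hate", "ok", "ok"],
}
high_agreement, high_disagreement = split_by_agreement(items)
```

Reporting model accuracy separately on the two subsets, rather than on majority-vote labels alone, is what exposes the gap the abstract describes.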
[25] SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia cs.CL
Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen
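TL;DR: This paper proposes SAGE (Sustainable Agent-Guided Expert-tuning), a framework that addresses cultural attunement in translation for low-resource Southeast Asian languages while also accounting for environmental sustainability. A reinforcement learning agent automatically curates high-quality, culturally relevant training data, and open-source LLMs are fine-tuned efficiently with LoRA, improving translation performance while sharply reducing data use and energy consumption.
Details
Motivation: In low-resource regions, particularly Southeast Asia, high-quality, culturally relevant data is scarce and training on massive, noisy data is energy-prohibitive, which makes LLMs hard to deploy for translation. The paper aims to balance digital inclusion with environmental sustainability.
Result: On translation between English and seven low-resource Southeast Asian languages, SAGE sets a new state of the art on BLEU-4 and COMET-22. It surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%.
Insight: The work proposes an energy-aware paradigm that prioritizes the "right data" over "big data", using a GRPO-optimized RL agent to curate data automatically and LoRA for efficient fine-tuning. This offers a scalable, responsible path to high-performance, low-energy language models in resource-constrained, sustainability-conscious settings.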
Abstract: The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the “right data” over “big data”. Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
[26] RouterKGQA: Specialized–General Model Routing for Constraint-Aware Knowledge Graph Question Answering cs.CL | cs.DB | cs.IR
Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
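TL;DR: This paper proposes RouterKGQA, a framework for knowledge graph question answering (KGQA) in which a small specialized model and a large general model collaborate to preserve reasoning quality at much lower cost. The specialized model generates the initial reasoning paths, and the general model performs KG-guided repair only when needed; the framework also adds constraint-aware answer filtering and a more efficient agent workflow.
Details
Motivation: Existing KGQA methods face a dilemma: retrieval-based small specialized models are efficient but tend to produce unreachable paths and miss implicit constraints, while agent-based large general models ground structure better but are costly. The paper designs a collaborative framework that improves performance at minimal cost.
Result: RouterKGQA outperforms the previous best method by 3.57 F1 points and 0.49 Hits@1 points on average across benchmarks while requiring only 1.15 LLM calls per question on average, improving performance and lowering cost at the same time.
Insight: The novelty is a specialized-general model routing paradigm, combined with constraint-aware answer filtering to reduce redundant answers and a more efficient general agent workflow. The core idea is to invoke resources dynamically, only as needed, to balance cost and performance.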
Abstract: Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized–general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at https://github.com/Oldcircle/RouterKGQA.
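The routing idea, a cheap model first and an expensive model only on failure, can be sketched as below. Every callable, the toy knowledge graph, and the way calls are counted are hypothetical stand-ins; the actual RouterKGQA components are in the authors' repository.

```python
def answer_with_routing(question, specialized, is_valid, general_repair):
    """Return (reasoning_path, model_calls); all callables are stand-ins."""
    path = specialized(question)  # cheap first attempt
    calls = 1
    if not is_valid(path):  # e.g. the path is unreachable in the KG
        path = general_repair(question, path)  # costly fallback, only when needed
        calls += 1
    return path, calls


# Toy knowledge graph: the set of relation paths that actually exist.
VALID_PATHS = {("film.director",), ("film.director", "person.birthplace")}

questions = [
    {"text": "Who directed the film?", "draft": ("film.director",)},
    {"text": "Where was the director born?",
     "draft": ("film.producer", "person.birthplace")},
]

results = [
    answer_with_routing(
        q,
        specialized=lambda q: q["draft"],
        is_valid=lambda path: path in VALID_PATHS,
        general_repair=lambda q, path: ("film.director", "person.birthplace"),
    )
    for q in questions
]
average_calls = sum(calls for _, calls in results) / len(results)
```

The average number of calls stays close to 1 when most drafts are already valid, which is how a router of this shape keeps inference cost low.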
[27] Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues cs.CL
Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier
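TL;DR: This paper investigates how linguistic features of speakers and listeners in dialogue can be used to predict, moment by moment, the listener's state of understanding in explanatory interactions. Analyzing face-to-face board game explanation dialogues from the MUNDEX corpus, it finds that three cognitive-load-related linguistic cues (the information value and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour) vary with the listener's level of understanding. Experiments with off-the-shelf classifiers and a fine-tuned multimodal BERT classifier show that combining these cues with textual features makes it possible to predict four listener states of understanding.
Details
Motivation: The paper addresses moment-by-moment prediction of a listener's state of understanding in explanatory interactions and examines how useful linguistic features, especially cues related to cognitive load, are for this task.
Result: In classification experiments on the MUNDEX corpus, combining the three linguistic cues with textual features improves prediction of the four states of understanding (Understanding, Partial Understanding, Non-Understanding, Misunderstanding), showing that the prediction task is feasible.
Insight: The novelty is combining cognitive-load-related linguistic cues (information value, syntactic complexity, variation in interactive gaze) with textual features to predict states of understanding. Viewed objectively, the multimodal approach (combining verbal and visual behaviour) and the moment-by-moment prediction framing have potential applications in dialogue systems.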
Abstract: We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener’s state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker’s utterances, and the variation in the listener’s interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener’s level of understanding. Listener states (‘Understanding’, ‘Partial Understanding’, ‘Non-Understanding’ and ‘Misunderstanding’) were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
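The abstract operationalises information value with surprisal. As a reminder of that quantity, the sketch below computes surprisal in bits, -log2 p, and averages it over an utterance. The token probabilities are invented, and the language model that would supply them in practice is left out as an assumption.

```python
import math


def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 of its conditional probability."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mean_surprisal(token_probs):
    """Average surprisal of an utterance, given per-token probabilities."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities from some language model.
utterance_probs = [0.5, 0.25, 0.125]
average_bits = mean_surprisal(utterance_probs)
```

Less predictable utterances have higher mean surprisal, which is the sense in which they carry more information and plausibly impose more cognitive load on the listener.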
[28] Reasoning Gets Harder for LLMs Inside A Dialogue cs.CL
Ivan Kartáč, Mateusz Lango, Ondřej Dušek
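TL;DR: This paper introduces the BOULDER benchmark to study how well large language models (LLMs) reason inside task-oriented dialogue (TOD). It finds that, compared with isolated tasks, LLM reasoning performance drops substantially in multi-turn dialogue settings, a gap driven mainly by the multi-turn nature of dialogue, role conditioning, and tool-use requirements.
Details
Motivation: LLMs perform well on reasoning benchmarks, but those evaluations are usually built on isolated tasks that differ from real-world use in task-oriented dialogue. This mismatch raises doubts about whether benchmark performance accurately reflects reasoning robustness in TOD settings.
Result: On BOULDER, which covers eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects, experiments on eight LLMs show a substantial and consistent performance gap between isolated and dialogue-based settings.
Insight: The novelty is a dynamic benchmark, BOULDER, that presents each problem in isolated and dialogue variants, enabling controlled comparison while mitigating data contamination. The main finding is that the multi-turn nature of dialogue is the key driver of the drop in reasoning performance, underscoring the need to evaluate LLM reasoning in realistic interactive scenarios.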
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models’ reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
[29] Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models cs.CL
Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
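TL;DR: This paper introduces a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment to evaluate how instruction-tuned language models trade off faithfulness to in-context evidence against user alignment under user pressure. Through fine-grained ablations on 19 instruction-tuned models spanning 0.27B to 32B parameters, it finds that richer in-context evidence alone does not reliably prevent user-aligned reversals under pressure, and it identifies three primary failure modes.
Details
Motivation: The paper examines how instruction-tuned language models balance user-alignment pressure against faithfulness to in-context evidence in contested domains, and it evaluates that tension.
Result: In the controlled fixed-evidence setting: (1) in model families such as Llama-3 and Gemma-3, adding epistemic nuance (for example, research gaps) increases susceptibility to sycophancy; (2) robustness scales non-monotonically, with certain low-to-mid scale models especially sensitive to adversarial user pressure; (3) models differ in distributional concentration under conflict, for example reasoning-distilled variants (DeepSeek-R1-Qwen) show higher dispersion than their instruction-tuned counterparts.
Insight: The novelty is a controlled epistemic-conflict framework for systematically evaluating the interaction between evidence faithfulness and user pressure. It shows that richer in-context evidence alone does not guarantee epistemic integrity under pressure and underscores the need for explicit training for epistemic integrity. Viewed objectively, the fine-grained ablations over evidence composition and uncertainty cues, and the finding that robustness relates non-monotonically to model scale, carry useful lessons for methodology and model design.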
Abstract: In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
[30] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation cs.CL | cs.AI | cs.LG
Richard J. Young
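TL;DR: This paper shows experimentally that measured chain-of-thought (CoT) faithfulness in large language models (LLMs) depends heavily on the classifier used, rather than being an intrinsic, objective property of the model. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 hint-influenced reasoning traces from 12 open-weight models; the resulting overall faithfulness rates differ significantly and can reverse model rankings.
Details
Motivation: Work on CoT faithfulness usually reports a single aggregate number, implying that faithfulness is an objectively measurable property of a model. The paper challenges this view by exposing how sensitive the measurement is to classifier choice, which makes faithfulness metrics unreliable to compare across studies.
Result: On identical data, the three classifiers yield overall faithfulness rates of 74.4%, 82.6%, and 69.7%, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points, all statistically significant. Inter-classifier agreement is low (Cohen's kappa from 0.06 to 0.42), and classifier choice can reverse model rankings (for example, Qwen3.5-27B moves from 1st to 7th).
Insight: The core contribution is a systematic demonstration that faithfulness measurement is method-dependent: classifiers operationalize "faithfulness" at different levels of stringency (lexical mention vs. epistemic dependence), so they produce divergent measurements of the same behavior. Future evaluations should therefore report sensitivity ranges across multiple classification methods rather than a single point estimate, to improve comparability and reliability.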
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar’s test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen’s kappa ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
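Because the argument rests on inter-classifier agreement, a short sketch of Cohen's kappa may be useful. It follows the standard formula, (p_o - p_e) / (1 - p_e), and the two label lists are invented for illustration rather than drawn from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight traces (1 = faithful, 0 = unfaithful).
pipeline_verdicts = [1, 1, 1, 1, 0, 0, 0, 0]
judge_verdicts = [1, 1, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(pipeline_verdicts, judge_verdicts)
```

In this toy case the raters agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa is 0.5. This correction for chance is why raw agreement rates can look high while kappa stays low.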
cs.CV
[31] Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity cs.CV | cs.AI
Jing Liu, Zhengliang Guo, Yan Wang, Xiaoguang Zhu, Yao Du
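TL;DR: This paper proposes SemanticFL, a federated learning framework that targets non-IID client data, and in particular semantic heterogeneity in multimodal perception settings. It uses a pre-trained Stable Diffusion model (VAE-encoded latents and U-Net hierarchical features) to extract rich semantic representations, builds a shared latent space that aligns heterogeneous clients, and offloads heavy computation to the server through an efficient client-server architecture. A consistency mechanism based on cross-modal contrastive learning stabilizes convergence.
Details
Motivation: Federated learning degrades severely under non-IID client data, especially in multimodal perception settings. Existing methods struggle to resolve the underlying semantic discrepancies between clients, which leads to poor perception performance in multimedia systems.
Result: Extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet cover a range of heterogeneity scenarios. SemanticFL surpasses existing federated learning methods, with accuracy gains of up to 5.49% over FedAvg, supporting its effectiveness at learning robust representations for perception tasks on heterogeneous, multimodal data.
Insight: The novelty is using the rich semantic representations of a pre-trained diffusion model (VAE latents and U-Net features) to build a privacy-preserving shared latent space that aligns heterogeneous clients, combined with an efficient client-server architecture and a cross-modal contrastive consistency mechanism that improve federated learning on non-IID multimodal data. Viewed objectively, the method brings the semantic capacity of generative diffusion models into federated learning, offering a new angle on semantic heterogeneity.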
Abstract: Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
[32] AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis cs.CV
Ufaq Khan, L. D. M. S. Sai Teja, Ayuba Shakiru, Mai A. Shaaban, Yutong Xie
TL;DR: 本文提出AURORA框架,一种基于Qwen3-VL视觉Transformer的统一多任务模型,用于处理超声图像分析中的分割、检测、分类和关键点回归等任务。该方法通过将中间令牌特征投影为空间特征图,并利用轻量级多尺度特征金字塔进行融合,实现了像素级预测与全局推理的共享表示。训练时采用任务感知采样和选择性损失平衡来管理异构监督并减少任务不平衡。
Details
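Motivation: Ultrasound images vary across scanners, operators, and anatomical targets, so models generalize poorly. The FMC-UIA challenge adds the difficulty of requiring a single model to handle multiple tasks, organs, and datasets.
Result: Performance improved from 67% to 85% on the validation set, and the average score across all tasks reached 81.84% on the official test set.
Insight: Contributions include a unified multi-task framework built on a Transformer visual encoder, the projection of intermediate token features into spatial feature maps with multi-scale fusion, and a training strategy that combines task-aware sampling with selective loss balancing. Viewed objectively, shared representations plus lightweight task heads make the method simple to optimize and adaptable across ultrasound tasks, giving medical-imaging multi-task learning an architecture worth borrowing.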
Abstract: Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and landmark regression across diverse organs and datasets. We propose a unified multi-task framework based on a transformer visual encoder from the Qwen3-VL family. Intermediate token features are projected into spatial feature maps and fused using a lightweight multi-scale feature pyramid, enabling both pixel-level predictions and global reasoning within a shared representation. Each task is handled by a small task-specific prediction head, while training uses task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance. Our method is designed to be simple to optimize and adaptable across a wide range of ultrasound analysis tasks. The performance improved from 67% to 85% on the validation set and achieved an average score of 81.84% on the official test set across all tasks. The code is publicly available at: https://github.com/saitejalekkala33/FMCUIA-ISBI.git
[33] LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray cs.CV | cs.AIPDF
Myeongkyun Kang, Yanting Yang, Xiaoxiao Li
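TL;DR: This paper proposes LoFi, a location-aware fine-grained representation learning framework for chest X-rays. It jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model, addressing the lack of region-level supervision in existing contrastive models and the limited fine-grained representation ability of large vision language models in external validation. The fine-grained encoder is then integrated into retrieval-based in-context learning to improve chest X-ray grounding.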
Details
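Motivation: Retrieval and phrase grounding on chest X-rays perform poorly because contrastive models lack region-level supervision and large vision language models have limited ability to capture fine-grained representations in external validation.
Result: Extensive experiments on MIMIC-CXR and PadChest-GR show superior performance on both retrieval and phrase grounding.
Insight: The novelty lies in the location-aware captioning loss, which provides region-level supervision through grounding and dense captioning objectives and thereby promotes fine-grained representation learning. The fine-grained encoder is also combined with retrieval-based in-context learning to improve chest X-ray grounding across diverse settings.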
Abstract: Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
[34] In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing cs.CVPDF
Xiao Fang, Yiming Gong, Stanislav Panev, Celso de Melo, Shuowen Hu
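TL;DR: This paper proposes an in-the-wild camouflage attack framework against vehicle detectors. It formulates vehicle camouflage attacks as a conditional image-editing problem and fine-tunes a ControlNet to synthesize camouflaged vehicles directly on real images, achieving stronger attacks and better stealthiness.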
Details
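Motivation: Deep neural networks in computer vision are vulnerable to adversarial attacks. Camouflage attacks are especially demanding because they must deceive detectors while remaining stealthy to humans.
Result: On the COCO and LINZ datasets, the method reduces AP50 by more than 38%, a significantly stronger attack than existing methods, while better preserving vehicle structure and improving human-perceived stealthiness.
Insight: The novelty lies in formalizing camouflage attacks as a conditional image-editing problem and in a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. The method also generalizes well and shows promising transferability to the physical world.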
Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object’s visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, leading to more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. Project page is available at https://humansensinglab.github.io/CtrlCamo
[35] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models cs.CVPDF
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
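TL;DR: This paper introduces ProactiveBench, a benchmark that evaluates whether multimodal large language models (MLLMs) proactively ask for help when user intervention is needed. The benchmark is built from seven repurposed datasets and covers tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. An evaluation of 22 MLLMs finds that they generally lack proactiveness, that proactiveness does not correlate with model capacity, and that hinting strategies help very little. Notably, conversation histories and in-context learning introduce negative biases. Experiments with reinforcement-learning fine-tuning suggest that proactiveness can be learned.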
Details
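Motivation: The work asks whether MLLMs can, like humans, request simple user interventions when they are stuck, for example asking for an obstruction to be removed when identifying an occluded object, so that collaboration is effective.
Result: Across 22 MLLMs evaluated on ProactiveBench, models generally lack proactiveness, proactiveness does not correlate with model capacity, and hinting yields only marginal gains. Conversation histories and in-context learning hurt performance. Reinforcement-learning fine-tuning shows that proactiveness can be learned and can generalize to unseen scenarios.
Insight: The paper is the first to systematically define and evaluate proactiveness in MLLMs, and it builds a dedicated benchmark, ProactiveBench. The analysis indicates that proactiveness is not an inherent ability of current MLLMs, that conventional in-context learning can backfire, and that reinforcement-learning fine-tuning is an effective way to improve it, which points to a new direction for building more collaborative multimodal models.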
Abstract: Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar “proactive” behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) “hinting” at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
[36] Narrative Aligned Long Form Video Question Answering cs.CVPDF
Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler
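TL;DR: This paper introduces NA-VQA, a benchmark for evaluating deep temporal and narrative reasoning in long-form videos. To address the weakness of existing multimodal large language models on questions that need far-range evidence, it also proposes Video-NaRA, a narrative-centric framework that builds event chains and a structured memory to improve long-range reasoning.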
Details
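Motivation: Most existing long-video reasoning benchmarks rely on localized cues and fail to capture narrative reasoning, which requires tracking intentions, connecting distant events, and reconstructing causal chains across an entire movie. New evaluation methods and models are needed to close this gap.
Result: On NA-VQA, state-of-the-art multimodal large language models perform poorly on questions that require far-range evidence. The proposed Video-NaRA framework improves long-range reasoning performance by up to 3%, showing its effectiveness on complex narrative structures.
Insight: The novelty lies in NA-VQA, a benchmark that emphasizes narrative coherence and long-range dependencies, and in Video-NaRA, which models narrative explicitly by building event-level chains and a structured memory. Together they suggest a new way to improve how models integrate dispersed narrative information.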
Abstract: Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
[37] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following cs.CVPDF
Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
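TL;DR: This paper proposes an instruction-free tuning method for fine-tuning large vision language models (LVLMs) in the medical domain. It fine-tunes on image-description pairs only, replaces carefully curated text instructions with a momentum proxy instruction, and adds a response shuffling strategy, reducing reliance on handcrafted instructions while preserving the model's instruction-following ability.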
Details
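Motivation: In the medical domain, building large-scale, high-quality instruction datasets is difficult because it requires expert knowledge, so reliance on handcrafted instructions needs to be reduced.
Result: The method reaches state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets and significantly improves the fine-tuning efficiency of LVLMs in the medical domain.
Insight: Contributions include a momentum proxy instruction that preserves the instruction-following ability of the pre-trained model and a response shuffling strategy that reduces the model's over-reliance on previous words, enabling effective domain adaptation without explicit instructions.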
Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model’s over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
[38] VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification cs.CVPDF
Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan
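TL;DR: This paper proposes VeloxNet, a lightweight CNN architecture for embedded image classification. It replaces SqueezeNet's fire modules with gated multi-layer perceptron (gMLP) blocks, whose spatial gating unit (SGU) captures global spatial dependencies through learned spatial projections and multiplicative gating, improving accuracy while reducing parameters.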
Details
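Motivation: Deploying deep learning models on embedded devices, for example for aerial disaster monitoring and infrastructure inspection, requires architectures that balance accuracy against strict constraints on model size, memory, and latency.
Result: Evaluated on three aerial image datasets (AIDER, CDD, LDD) against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers, VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD.
Insight: The core idea is to replace conventional local convolutional modules (such as fire modules) with gMLP blocks built on the spatial gating unit (SGU). This gives global spatial modeling within a single layer and improves classification accuracy and parameter efficiency with fewer parameters, a useful option for resource-constrained deployment.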
Abstract: Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet’s fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment. The source code will be made publicly available upon acceptance of the paper.
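The spatial gating described above can be illustrated with a minimal sketch. It follows the general gMLP formulation (split the channels, project one half across all spatial positions, multiply it into the other half) and is not VeloxNet's actual code: the function name, the flattening of the feature map into tokens, and the omission of the normalization usually applied before the projection are all assumptions made for brevity.

```python
def spatial_gating_unit(z, w, b):
    """Hypothetical SGU sketch on a flattened feature map.

    z: list of n_tokens rows, each with d channels (d even).
    w: n_tokens x n_tokens spatial projection (mixes positions, not channels).
    b: n_tokens biases, one per spatial position.
    Returns n_tokens rows with d // 2 channels each.
    """
    n = len(z)
    half = len(z[0]) // 2
    u = [row[:half] for row in z]   # content half, passed through
    v = [row[half:] for row in z]   # gate half, projected across positions
    out = []
    for i in range(n):
        gated = []
        for c in range(half):
            # Each output position sees every input position in one step.
            proj = b[i] + sum(w[i][j] * v[j][c] for j in range(n))
            gated.append(u[i][c] * proj)  # multiplicative gating
        out.append(gated)
    return out
```

With `w` set to zeros and `b` to ones the unit passes the content half through unchanged, which is the usual near-identity starting point for this kind of gate. Because `w` spans all positions, a single layer has a global receptive field, in contrast to the small kernels of a fire module.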
[39] Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement cs.CVPDF
Ange-Clément Akazan, Abdoulaye Koroko, Verlon Roel Mbingui, Choukouriyah Arinloye, Hassan Fifen
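TL;DR: The paper proposes the Vision Tiny Recursion Model (ViTRM), a parameter-efficient image classification architecture. It replaces the L-layer encoder of a conventional ViT with a small block of only 3 layers applied recursively N times. On CIFAR-10 and CIFAR-100, ViTRM remains competitive while using far fewer parameters than CNN and ViT models.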
Details
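Motivation: Current vision models such as CNNs and ViTs have large parameter counts and high compute requirements, which makes them hard to deploy in resource-constrained environments.
Result: On the CIFAR-10 and CIFAR-100 benchmarks, ViTRM uses up to 6× fewer parameters than CNN-based models and up to 84× fewer than ViT while maintaining competitive performance.
Insight: The core idea is to use recursive computation (iterative state refinement) as a parameter-efficient substitute for architectural depth in vision, showing that a small recursive network applied repeatedly can perform comparably to deep models.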
Abstract: The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k = 3$) applied recursively $N$ times. Despite using up to $6\times$ and $84\times$ fewer parameters than CNN-based models and ViT, respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
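A toy sketch can make the weight-sharing idea concrete: one small block is reused for every refinement step, so the parameter count depends on the block size k and not on the number of recursions N. Everything here is illustrative and assumed, not taken from the paper: the helper names are hypothetical, and a residual tanh layer stands in for the real Transformer layers.

```python
import math

def make_block(weights, biases):
    """Build a k-layer block; weights[l] is d x d, biases[l] has length d."""
    def block(state):
        for w, b in zip(weights, biases):
            update = [
                math.tanh(sum(w_ij * s for w_ij, s in zip(row, state)) + b_i)
                for row, b_i in zip(w, b)
            ]
            # Residual update: the state is refined rather than replaced.
            state = [s + u for s, u in zip(state, update)]
        return state
    return block

def refine(block, state, n_steps):
    """Apply the same block n_steps times (shared weights at every step)."""
    for _ in range(n_steps):
        state = block(state)
    return state

def count_params(weights, biases):
    """Parameters of the block; independent of how often it is applied."""
    return sum(len(row) for w in weights for row in w) + sum(len(b) for b in biases)
```

Increasing `n_steps` in `refine` adds computation but no parameters, which is the trade the abstract describes: depth in compute, not in weights.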
[40] Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis cs.CV | cs.AIPDF
Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang
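TL;DR: This paper presents Gastric-X, a large-scale multimodal benchmark dataset for gastric cancer analysis with 1.7K cases. Each case combines resting and dynamic CT scans, endoscopic images, biochemical indicators, diagnostic reports, and tumor region annotations to mirror real clinical workflows. Using the dataset, the authors systematically evaluate current vision-language models on five core tasks (visual question answering, report generation, cross-modal retrieval, disease classification, and lesion localization) to advance medical VLMs and probe the depth of their understanding.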
Details
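Motivation: Vision-language models perform well in natural domains, but their use in medical diagnosis is limited, mainly by the lack of comprehensive, structured datasets that reflect real clinical workflows. Advancing VLMs for clinical applications such as gastric cancer requires a multimodal benchmark that simulates those workflows.
Result: The abstract reports no quantitative results or comparisons with state-of-the-art models. The paper builds the 1.7K-case Gastric-X dataset and uses it to systematically evaluate existing VLMs on five core tasks.
Insight: The novelty lies in a large-scale, multimodal, multi-phase benchmark for gastric cancer analysis (combining imaging, biochemical indicators, text reports, and annotations) whose design closely follows real clinical decision making. It is a valuable resource for evaluating and developing next-generation medical VLMs, and it raises a key research question: can models meaningfully relate biochemical signals to spatial tumor features and textual reports, and so move closer to the cognitive and evidential reasoning of physicians?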
Abstract: Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
[41] ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding cs.CV | cs.LGPDF
Oishi Banerjee, Sung Eun Kim, Alexandra N. Willauer, Julius M. Kernbach, Abeer Rihan Alomaish
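TL;DR: This paper introduces ReXInTheWild, a comprehensive benchmark for evaluating how well vision-language models understand everyday medical photographs. It contains 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. Leading multimodal large language models vary substantially: Gemini-3 achieves the highest accuracy (78%), while the medical specialist model MedGemma reaches only 37%.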
Details
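Motivation: No comprehensive benchmark currently evaluates whether vision-language models can interpret everyday medical photographs. Such photographs are widely used in telemedicine and online health conversations, and interpreting them requires both fine-grained natural image understanding and domain-specific medical reasoning.
Result: On ReXInTheWild, Gemini-3 reaches 78% accuracy, Claude Opus 4.5 72%, and GPT-5 68%, while MedGemma reaches only 37%. A systematic error analysis reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures.
Insight: The novelty lies in a challenging, clinically grounded unified benchmark that combines natural image understanding with medical reasoning, together with a systematic error taxonomy that points to directions for model improvement. Viewed objectively, the benchmark fills a gap in the evaluation of medical photograph understanding and supports research on multimodal models in medicine.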
Abstract: Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.
[42] SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions cs.CV | cs.HC | cs.LGPDF
Vasco Xu, Brian Chen, Eric J. Gonzalez, Andrea Colaço, Henry Hoffmann
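TL;DR: SurfaceXR is a sensor fusion method that combines headset-based hand tracking with smartwatch IMU data to enable robust interaction on everyday surfaces. It addresses the fatigue and imprecision of mid-air gestures in XR and the shortcomings of existing vision methods in surface plane estimation and hand tracking.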
Details
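Motivation: Mid-air gestures in Extended Reality (XR) cause fatigue and imprecision, and current egocentric vision methods are unreliable at hand tracking and surface plane estimation.
Result: A 21-participant study validates SurfaceXR for touch tracking and 8-class gesture recognition, with significant improvements over single-modality approaches.
Insight: The novelty lies in fusing two complementary modalities, headset-based hand tracking (which provides 3D positional data) and smartwatch IMUs (which capture high-frequency motion), to make surface interaction more robust.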
Abstract: Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR’s effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
[43] dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3 cs.CV | cs.AIPDF
Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi
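TL;DR: This paper proposes dinov3.seg, a framework that extends DINOv3 into a dedicated system for open-vocabulary semantic segmentation (OVSS). It combines a task-specific architecture, joint use of global and local text embeddings, early and late refinement of visual representations, and a high-resolution local-global inference strategy, significantly improving dense prediction accuracy and robustness in complex scenes.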
Details
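Motivation: Existing OVSS methods based on vision-language models (VLMs) usually rely on representations learned through global contrastive objectives. These representations are suboptimal for dense prediction, which limits spatial precision and robustness in complex, cluttered scenes.
Result: Extensive experiments on five widely adopted OVSS benchmarks show that the method consistently outperforms current state-of-the-art methods, demonstrating its effectiveness and robustness.
Insight: Contributions include a task-specific architecture tailored to the DINOv3 backbone; joint use of text embeddings aligned with both the global [CLS] token and local patch-level visual features; early refinement of visual representations before image-text interaction followed by late refinement of the resulting correlation features; and a high-resolution local-global inference strategy based on sliding-window aggregation that balances detail and context.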
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
[44] Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion cs.CV | cs.ROPDF
Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi
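TL;DR: This paper proposes a lightweight, socially informed architecture for pedestrian crossing intent prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. A variational bottleneck and a Mahalanobis distance detector quantify uncertainty, yielding calibrated probabilities and actionable risk scores.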
Details
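Motivation: Autonomous vehicles need accurate pedestrian intent prediction to navigate safely in urban environments, and in particular a lightweight model that can fuse social and behavioral information.
Result: On the PSI 1.0 benchmark, the model uses only structured, interpretable features and achieves 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC, outperforming recent vision language models. On the PSI 2.0 dataset it establishes an initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage.
Insight: Contributions include the multi-stream behavioral fusion architecture, the compact Transformer design, and an uncertainty quantification approach that combines a variational bottleneck with Mahalanobis distance. The model stays efficient while being risk-aware, is modality-agnostic, and is easy to integrate into vision language pipelines.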
Abstract: Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
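Selective prediction at a fixed coverage can be sketched as follows: score each test sample by its Mahalanobis distance to the training feature distribution, then keep only the fraction of samples with the lowest scores. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses a diagonal covariance for simplicity, and the function names, the small variance floor, and the choice of feature space are all hypothetical.

```python
import math

def fit_diagonal_gaussian(features):
    """Per-dimension mean and variance (diagonal covariance is an assumption)."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    # A small floor keeps the variance strictly positive.
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + 1e-8 for j in range(d)]
    return mean, var

def mahalanobis(x, mean, var):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def select_by_coverage(scores, coverage):
    """Indices of the samples kept when predicting on a `coverage` fraction."""
    keep = max(1, int(round(coverage * len(scores))))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])
```

At 80% coverage the model abstains on the 20% of samples that look most out-of-distribution, and accuracy is then measured on the retained samples only, which is how a figure such as the reported gain at 80% coverage would be read.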
[45] Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search cs.CV | cs.AIPDF
Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu
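TL;DR: This paper proposes EvoNAS, an efficient distributed framework for multi-objective evolutionary neural architecture search, aimed at the high inference cost that hinders deploying large vision models on resource-constrained edge devices. It builds a hybrid supernet that integrates Vision State Space and Vision Transformer modules, uses a cross-architecture dual-domain knowledge distillation strategy to improve representational capacity and ranking consistency, and introduces a distributed multi-model parallel evaluation mechanism based on GPU resource pooling and asynchronous scheduling to cut the cost of large-scale validation.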
Details
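Motivation: Modern computer vision must balance predictive accuracy with real-time efficiency, but the high inference cost of large vision models limits deployment on resource-constrained edge devices. Evolutionary neural architecture search suits multi-objective optimization, yet its practical use is hindered by two issues: expensive candidate evaluation and inconsistent ranking among subnetworks.
Result: Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while generalizing well to downstream tasks such as novel view synthesis.
Insight: Contributions include: 1) a hybrid supernet that integrates VSS and ViT modules, combining the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules; 2) a cross-architecture dual-domain knowledge distillation strategy that strengthens the shared supernet's representations and improves ranking consistency; 3) a distributed multi-model parallel evaluation framework based on GPU resource pooling and asynchronous scheduling, which improves efficiency by over 70% relative to conventional data-parallel evaluation. Viewed objectively, the work combines architecture search with knowledge distillation to offer a scalable way of balancing model performance and efficiency.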
Abstract: Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI-Group/evonas
[46] PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition cs.CV | cs.AI | cs.LGPDF
Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li
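TL;DR: This paper proposes PFM-VEPAR, a framework for pedestrian attribute recognition with RGB and event cameras. A lightweight Event Prompter module uses the Discrete Cosine Transform (DCT) to extract frequency-domain features from event data efficiently and augment the RGB branch. An external memory bank combined with Hopfield networks provides associative memory-augmented representation learning, and a cross-attention mechanism fuses the modalities for attribute prediction.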
Details
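Motivation: Existing event-based pedestrian attribute recognition methods use two-stream multimodal fusion, which incurs significant computational overhead, and they ignore the valuable guidance that contextual samples can provide.
Result: Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed RGB-Event PAR framework.
Insight: The novelty lies in: 1) replacing a computationally expensive auxiliary backbone with lightweight DCT/IDCT operations to extract event features; 2) introducing an associative memory mechanism that combines an external memory bank with Hopfield networks to mine and use global relational knowledge across samples.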
Abstract: Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
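The DCT and IDCT operations that the Event Prompter relies on can be written in a few lines. The sketch below implements the orthonormal one-dimensional DCT-II and its inverse in pure Python, plus a hypothetical `low_pass` helper that keeps only the first few coefficients. How the paper actually applies these transforms to event data (dimensionality, which coefficients are retained, how the result prompts the RGB branch) is not stated in the abstract, so the helper is an assumption for illustration only.

```python
import math

def _scale(k, n):
    # Orthonormal scaling: sqrt(1/n) for the DC term, sqrt(2/n) otherwise.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
        for t in range(n)
    ]

def low_pass(x, keep):
    """Hypothetical helper: zero all but the first `keep` coefficients."""
    coeffs = dct(x)
    filtered = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(filtered)
```

Both transforms are fixed and have no learned parameters, which is consistent with the abstract's point that they are far cheaper than a second backbone. In practice a vectorized library routine would replace these O(n²) loops.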
[47] Efficiency Follows Global-Local Decoupling cs.CVPDF
Zhenyu Yang, Gensheng Pei, Tao Chen, Yichao Zhou, Tianfei Zhou
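TL;DR: The paper proposes ConvNeur, an architecture that improves the efficiency of vision models by decoupling the roles of global reasoning and local representation. It has two branches: a lightweight neural memory branch that aggregates global context and a locality-preserving branch that extracts fine structure. A learned gate lets global cues modulate local features, giving subquadratic computational complexity.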
Details
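Motivation: Modern vision models face a trade-off: they must capture image-level context without sacrificing local detail while remaining computationally affordable.
Result: On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy-latency trade-offs at similar budgets.
Insight: The novelty lies in explicitly decoupling global and local processing. The two-branch structure with a learned gate enables efficient modulation, retains the inductive biases of local processing, and reduces the overhead of fully global attention.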
Abstract: Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
[48] CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management cs.CVPDF
Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen
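TL;DR: CurveStream is a training-free, curvature-aware hierarchical visual memory management framework. It targets the out-of-memory errors and catastrophic forgetting that arise when multimodal large language models process streaming video and visual tokens grow linearly. It computes a curvature score in real time to identify key semantic transitions and uses an online K-Sigma dynamic threshold to route frames adaptively into clear and fuzzy memory states, improving streaming video understanding under a strict token budget.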
Details
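Motivation: Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. They lack intrinsic semantic awareness, which can disrupt contextual coherence and blur transient but critical semantic transitions.
Result: In evaluations across multiple temporal scales, CurveStream achieves absolute gains of 10.69% on StreamingBench and 13.58% on OVOBench over the respective baselines, setting a new state of the art.
Insight: The novelty lies in the observation that high-curvature regions along continuous feature trajectories align with critical global semantic transitions. Building on this geometric insight, the curvature score and dynamic threshold give semantically aware, adaptive memory management that improves streaming video understanding without any additional training.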
Abstract: Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
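The routing idea can be illustrated with a small sketch. The abstract does not define the Curvature Score or the K-Sigma rule precisely, so the following choices are assumptions: curvature is approximated by the turning angle between consecutive displacement vectors of the frame features, the threshold is the running mean plus k running standard deviations of past scores, and frames above the threshold are sent to the "clear" state. All names and the default k are hypothetical.

```python
import math

def turning_angle(prev2, prev1, cur):
    """Angle (radians) between successive feature displacements.

    A stand-in for a curvature score: 0 on a straight trajectory,
    larger where the trajectory bends sharply.
    """
    a = [p1 - p2 for p1, p2 in zip(prev1, prev2)]
    b = [c - p1 for c, p1 in zip(cur, prev1)]
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # no motion in feature space, so no bend
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return math.acos(max(-1.0, min(1.0, cos)))

class KSigmaRouter:
    """Online mean + k*std threshold using Welford's running statistics."""

    def __init__(self, k=2.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def route(self, score):
        # Decide using statistics of earlier scores only, then update them.
        if self.n >= 2:
            std = math.sqrt(self.m2 / self.n)
            label = "clear" if score > self.mean + self.k * std else "fuzzy"
        else:
            label = "fuzzy"  # too little history to set a threshold
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label
```

Because the threshold adapts to the stream's own statistics, a slow scene and a fast-cutting scene each flag only their relatively sharp transitions, and no training is involved. Enforcing the token budget itself (how many clear and fuzzy frames are retained) is outside this sketch.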
[49] MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation cs.CVPDF
Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu
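TL;DR: MagicSeg proposes an automatic dataset generation method based on a counterfactual diffusion model for pretraining open-world semantic segmentation. Starting from class labels, it generates high-quality textual descriptions that guide a diffusion model to generate images, and simultaneously generates corresponding negative samples as counterfactual data for contrastive training. Precise segmentation labels are extracted by combining an open-vocabulary detection model with an interactive segmentation model, and model performance is strengthened with pseudo-mask supervision and auxiliary counterfactual contrastive training.
Details
Motivation: Open-world semantic segmentation relies heavily on large-scale image-text pair datasets, which often lack fine-grained pixel-level annotations and are expensive to acquire. To address this data scarcity, the work uses the strong image generation capability of diffusion models to automatically generate training data suited to open-world segmentation.
Result: Evaluated on PASCAL VOC, PASCAL Context, and COCO, the model reaches 62.9%, 26.7%, and 40.2%, respectively, achieving state-of-the-art (SOTA) performance.
Insight: The novelty is a counterfactual diffusion-based automatic data generation pipeline that generates positive and negative samples together for contrastive training and extracts precise labels by combining open-vocabulary detection with interactive segmentation. The approach lowers annotation cost and provides an effective pretraining strategy for open-world segmentation.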
Abstract: Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named “MagicSeg”. Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset’s effectiveness in enhancing open-world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.
[50] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow cs.CVPDF
Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang
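TL;DR: FlowScene is a tri-branch scene generation model conditioned on multimodal graphs. A tightly coupled rectified flow model exchanges object information during generation so that scene layouts, object shapes, and object textures are generated collaboratively, giving fine-grained control over object shapes, textures, and relations together with scene-level style consistency while maintaining high realism.
Details
Motivation: Existing language-driven retrieval methods produce plausible scenes but lack object-level control and scene-level style consistency, while graph-based methods are highly controllable but struggle to produce high-fidelity textured results, which limits practical use. This work aims for precise control over indoor scene geometry and appearance, plus style consistency, without sacrificing realism.
Result: Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences.
Insight: The core novelty is a tightly coupled rectified flow generation framework conditioned on multimodal graphs, in which information exchange between branches enables collaborative reasoning across the graph, unifying fine-grained object control with overall scene style consistency. This offers a new architectural approach to controllable, high-fidelity 3D scene generation.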
Abstract: Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects’ shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
[51] K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups cs.CV | cs.LGPDF
ZhiMing Li
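TL;DR: This paper proposes K-GMRF, an online, training-free framework for tracking non-stationary covariance matrices on manifolds. It reformulates the problem as forced rigid-body motion on Lie groups: a second-order dynamics model derived from the Euler-Poincaré equations interprets observations as torques driving a latent angular velocity, which is propagated with a structure-preserving symplectic integrator.
Details
Motivation: Existing covariance tracking methods either ignore manifold constraints or rely on first-order updates, which incur unavoidable phase lag during rapid evolution. The paper addresses this with a second-order dynamics approach that respects manifold constraints and reduces lag.
Result: Robust tracking is validated in three domains: on synthetic ellipses, K-GMRF reduces angular error by 30x relative to a Riemannian exponential moving average; on SO(3) stabilization with 20% data dropout, it lowers geodesic error from 29.4° to 9.9°; and on the OTB motion-blur sequence BlurCar2, it raises IoU from 0.55 to 0.74 with a 96% success rate.
Insight: The core novelty is modeling covariance tracking as second-order forced rigid-body dynamics on Lie groups and propagating it with a structure-preserving symplectic integrator. The authors prove zero steady-state error under constant rotation, improving on the proportional lag of first-order baselines. The method provides a fully differentiable, plug-and-play geometric prior module.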
Abstract: Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.
[52] Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning cs.CVPDF
Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie
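TL;DR: This paper introduces Physion-Eval, a benchmark dataset for evaluating the physical realism of generated video. Expert human annotation provides fine-grained diagnosis of physical violations by five state-of-the-art video generation models across egocentric and exocentric views. The dataset contains 10,990 expert reasoning traces covering 22 physical categories and shows that physical glitches are pervasive in current models in physics-critical scenarios.
Details
Motivation: Evaluation of video generation models relies mainly on automated metrics or coarse human judgments, which offer little insight into when and why generated dynamics violate real-world physical constraints. A method for systematically evaluating physical realism is therefore needed.
Result: In physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos contain at least one human-identifiable physical glitch, underscoring the limitations of current models.
Insight: The novelty is a large-scale, fine-grained benchmark for physical realism that combines real-world reference videos, temporally localized glitches, structured failure categories, and natural-language explanations, offering a new standard for developing physics-grounded video generation models.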
Abstract: Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.
[53] FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement cs.CV | cs.AIPDF
Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao
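TL;DR: FB-CLIP is a framework for fine-grained zero-shot anomaly detection that strengthens anomaly localization through foreground-background disentanglement and multi-strategy textual representations. In the textual modality it combines End-of-Text (EOT) features, global-pooled representations, and attention-weighted token features for richer semantic cues; in the visual modality it uses multi-view soft separation (along identity, semantic, and spatial dimensions) and background suppression to reduce interference and improve discriminability. In addition, Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging the semantic gap.
Details
Motivation: The paper addresses the difficulty of zero-shot detection caused by scarce labeled data in fine-grained anomaly detection, along with the foreground-background feature entanglement and coarse textual semantics of existing vision-language models such as CLIP.
Result: Experiments show that, in the zero-shot setting, FB-CLIP effectively distinguishes anomalies from complex backgrounds and achieves accurate fine-grained anomaly detection and localization.
Insight: The novel elements are multi-strategy textual representations for richer semantics, multi-view soft separation for foreground-background disentanglement, and Semantic Consistency Regularization (SCR) for better feature alignment and a wider semantic gap. Together they improve discriminability and localization accuracy in zero-shot fine-grained anomaly detection.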
Abstract: Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
[54] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding cs.CVPDF
Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
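TL;DR: This paper proposes ParallelVLM, a training-free speculative decoding framework for lossless acceleration of Video-LLM inference. By parallelizing the draft and verification stages and adding an unbiased verifier-guided pruning strategy, it addresses the mutual waiting between draft and target models and the limited speedup ratio in long-video settings, substantially improving decoding efficiency for video understanding tasks.
Details
Motivation: Video-LLMs perform well on video understanding tasks, but their autoregressive decoding efficiency is severely constrained by the massive number of video tokens. Existing visual token pruning methods partially ease the bottleneck but still lose information and deliver limited acceleration. The paper aims to overcome mutual waiting between draft and target models and the limited speedup ratio in long-video settings, achieving lossless and efficient decoding acceleration.
Result: Extensive experiments show that ParallelVLM expands the draft window by 1.6x to 1.8x while maintaining high accepted lengths. Compared with vanilla autoregressive decoding, it accelerates inference on multiple video understanding benchmarks by 3.36x on LLaVA-Onevision-72B and 2.42x on Qwen2.5-VL-32B.
Insight: The claimed contributions are: 1) a training-free speculative decoding framework with two parallelized stages that maximizes hardware utilization; and 2) an unbiased verifier-guided pruning strategy that better aligns the draft and target models by eliminating positional bias in attention-guided pruning. Viewed objectively, the core novelty is the systematic adaptation of the speculative decoding paradigm to the video-language multimodal setting, where parallelization and unbiased pruning break through the acceleration bottleneck of long-video decoding while remaining lossless (no information is lost).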
Abstract: Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by 1.6x to 1.8x with high accepted lengths, and accelerates various video understanding benchmarks by 3.36x on LLaVA-Onevision-72B and 2.42x on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
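For readers unfamiliar with the draft-then-verify loop that ParallelVLM builds on, the following is a minimal sketch of generic greedy speculative decoding, not the paper's method. It does not model the parallelized stages or the verifier-guided pruning. The two toy "models" (`target_next`, `draft_next`) and the function `speculative_decode` are hypothetical stand-ins chosen so that the draft occasionally disagrees with the target.

```python
def target_next(seq):
    """Toy target model (assumption): counts upward modulo 5."""
    return (seq[-1] + 1) % 5


def draft_next(seq):
    """Toy draft model (assumption): agrees with the target except after a 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 5


def autoregressive_decode(prefix, n_new, next_fn):
    """Plain greedy decoding, one token per model call."""
    seq = list(prefix)
    for _ in range(n_new):
        seq.append(next_fn(seq))
    return seq


def speculative_decode(prefix, n_new, draft_fn, target_fn, k=4):
    """Greedy draft-then-verify decoding with a draft window of k tokens."""
    seq = list(prefix)
    goal = len(prefix) + n_new
    while len(seq) < goal:
        # Draft stage: the cheap model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_fn(seq + draft))
        # Verify stage: a real system scores all k positions in one batched
        # target pass; here the target is queried position by position.
        accepted = []
        for tok in draft:
            expected = target_fn(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace first mismatch, stop
                break
        else:
            # Every draft token matched: the target contributes a bonus token.
            accepted.append(target_fn(seq + accepted))
        seq.extend(accepted)
    return seq[:goal]
```

Because every emitted token is either confirmed or supplied by the target model, the output equals the target's own greedy decoding, which is the sense in which this family of methods is lossless. The speedup comes from accepting several draft tokens per target pass.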
[55] OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis cs.CVPDF
Jinglin Liang, Zijian Zhou, Rui Huang, Shuangping Huang, Yichen Gong
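TL;DR: OrbitNVS reformulates novel view synthesis as an orbit video generation task and, through tailored model design and training strategies, exploits the rich visual priors of a pre-trained video generation model for high-quality synthesis. It uses camera adapters for accurate camera control, a normal map generation branch whose features improve geometric consistency via attention, and pixel-space supervision to alleviate blurry appearance.
Details
Motivation: Existing novel view synthesis methods synthesize unobserved regions poorly under single-view input and struggle to maintain geometric and appearance consistency.
Result: OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., PSNR gains of 2.9 dB and 2.4 dB, respectively).
Insight: The novelty is recasting NVS as orbit video generation to exploit video diffusion priors, with camera adapters, normal-map-guided attention, and pixel-space supervision strengthening camera control, geometric consistency, and appearance sharpness, respectively.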
Abstract: Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via attention mechanism, thereby improving geometric consistency. Moreover, we apply a pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
[56] Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement cs.CVPDF
Chunlei Zhang, Jiahao Xia, Yun Xiao, Bo Jiang, Jian Zhang
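TL;DR: This paper proposes HRNet, a non-iterative hybrid multimodal image registration network that addresses the leakage of modality-private information and the restriction to a single transformation type in existing methods. A cross-scale feature disentanglement and adaptive projection module learns a stable shared feature space, and a hybrid parameter prediction module jointly estimates global rigid parameters and local deformation fields for coarse-to-fine registration.
Details
Motivation: Two key problems in multimodal image registration are addressed: first, existing disentanglement methods mainly regularize the shared part, letting modality-private information leak into the shared feature space; second, most multi-scale frameworks support only a single transformation type and cannot handle hybrid registration scenarios where global misalignment and local deformation coexist.
Result: Extensive experiments on four multimodal datasets show state-of-the-art (SOTA) performance on both rigid and non-rigid registration tasks.
Insight: The contributions include: 1) a cross-scale disentanglement and adaptive projection module that suppresses modality-private information and projects shared features into a stable subspace; 2) a hybrid parameter prediction module that non-iteratively and jointly estimates the global rigid transformation and local deformation field; and 3) modality-specific batch normalization, which strengthens feature extraction in the shared backbone. Viewed objectively, the method couples feature disentanglement with hybrid transformation prediction, providing a unified end-to-end framework for hybrid multimodal registration.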
Abstract: Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.
[57] IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1 cs.CVPDF
Jun Wang, Xiaoyan Huang
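TL;DR: This paper proposes IUP-Pose, a decoupled iterative uncertainty propagation framework for real-time relative pose regression. It aligns cross-view features through implicit dense alignment with a lightweight multi-head bi-directional cross-attention module and uses a decoupled rotation-translation estimation pipeline, achieving high accuracy and efficiency while remaining end-to-end differentiable.
Details
Motivation: Existing relative pose regression methods face a dilemma: feature-matching pipelines are accurate, but non-differentiable RANSAC blocks gradient flow, whereas ViT-based regressors are end-to-end trainable but too computationally expensive for real-time deployment. The core bottlenecks are the coupling of rotation and translation estimation and insufficient cross-view feature alignment.
Result: IUP-Pose reaches 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, a favorable accuracy-efficiency trade-off suited to real-time edge deployment.
Insight: The contributions include: 1) a decoupled iterative uncertainty propagation framework that separates rotation and translation estimation; 2) a lightweight multi-head bi-directional cross-attention module for implicit dense feature alignment without explicit matching supervision; 3) realignment of feature maps via the rotational homography during iteration to improve translation accuracy; and 4) high frame rate and low parameter count at SOTA-level accuracy.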
Abstract: Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
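The rotational homography H_inf mentioned in the abstract has a standard closed form when both views share the same pinhole intrinsics K: H_inf = K R K^{-1}. The sketch below computes it and warps a single pixel. It is only an illustration of that textbook formula under stated assumptions (zero skew, identical intrinsics in both views, R mapping view-1 camera coordinates to view-2). How IUP-Pose applies the warp to feature maps is not described in the abstract, and the helper names are hypothetical.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_z(theta):
    """Rotation by theta radians about the optical (z) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotational_homography(fx, fy, cx, cy, R):
    """H_inf = K R K^{-1} for a zero-skew pinhole camera (assumption)."""
    K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    # Closed-form inverse of an upper-triangular intrinsics matrix.
    K_inv = [[1.0 / fx, 0.0, -cx / fx],
             [0.0, 1.0 / fy, -cy / fy],
             [0.0, 0.0, 1.0]]
    return matmul3(matmul3(K, R), K_inv)


def warp_pixel(H, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w
```

For example, with a 90° roll about the optical axis, a pixel 10 px to the right of the principal point maps to a pixel 10 px below it. Warping by H_inf removes the rotational component of the view change, so any residual misalignment is attributable to translation and scene depth, which is one plausible reason to realign features before predicting translation.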
[58] Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking cs.CV | cs.AIPDF
Yiheng Wang, Changhong Fu, Liangliang Yao, Haobo Zuo, Zijie Zhang
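TL;DR: This paper proposes a dual prompt-driven feature encoding method (DPTracker) to improve the robustness of nighttime UAV tracking. A pyramid illumination prompter extracts multi-scale frequency-aware illumination prompts, and a dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint changes, integrating prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding.
Details
Motivation: Existing feature encoding methods often overlook critical illumination and viewpoint cues in nighttime UAV tracking, which weakens perception and degrades tracking performance under challenging conditions.
Result: Extensive experiments validate the effectiveness of DPTracker for nighttime UAV tracking. Ablation studies highlight the contribution of each component, and real-world tests across diverse nighttime UAV tracking scenarios further demonstrate its robustness and practicality.
Insight: The novelty is a dual prompt-driven framework that integrates illumination and viewpoint prompts, handling these two key challenges with dedicated prompters and using prompt-conditioned feature adaptation and context-aware prompt evolution to achieve domain-invariant feature encoding, thereby improving tracking robustness in complex nighttime conditions.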
Abstract: Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.
[59] Semantic Audio-Visual Navigation in Continuous Environments cs.CVPDF
Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao
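TL;DR: This paper introduces the SAVN-CE (Semantic Audio-Visual Navigation in Continuous Environments) task and the MAGNet model, targeting the navigation difficulty that arises when the target sound intermittently disappears while an agent navigates continuous 3D space. MAGNet uses a multimodal Transformer to jointly encode spatial and semantic goal representations and integrates historical context with self-motion cues for memory-augmented goal reasoning.
Details
Motivation: Existing audio-visual navigation methods typically rely on precomputed room impulse responses for binaural audio rendering, which restricts agents to discrete grid positions and yields spatially discontinuous observations. The paper establishes a more realistic continuous-environment setting and tackles the loss of goal information when the target sound is intermittently interrupted.
Result: Comprehensive experiments on SAVN-CE show that MAGNet significantly outperforms existing state-of-the-art methods, with up to a 12.1% absolute improvement in success rate, and is robust to short-duration sounds and long-distance navigation.
Insight: The novelty is the new setting of semantic audio-visual navigation in continuous environments together with a memory-augmented goal reasoning model based on a multimodal Transformer, which handles interrupted target sounds by jointly encoding spatial-semantic goal representations and integrating historical context.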
Abstract: Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
[60] Making Video Models Adhere to User Intent with Minor Adjustments cs.CVPDF
Daniel Ajisafe, Eric Hedlin, Helge Rhodin, Kwang Moo Yi
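TL;DR: This paper proposes slightly adjusting user-provided bounding boxes to improve both the generation quality of text-to-video diffusion models and their adherence to control inputs. The boxes are optimized to align better with the model's internal attention maps while balancing the focus on foreground and background.
Details
Motivation: Current bounding-box or layout-based control methods for text-to-video diffusion models still fail to ensure that generated content strictly follows the control inputs, so adherence needs to improve.
Result: Experiments, including a user study, validate the method and show that even small bounding-box adjustments can significantly improve generation quality and adherence to control inputs.
Insight: The novelty is a differentiable smooth mask and an attention-maximization objective for optimizing bounding-box positions, bringing control inputs closer to the model's internal representations and improving generation results.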
Abstract: With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
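To make the "smooth mask" idea concrete: a hard box mask has zero gradient with respect to the box position almost everywhere, whereas a mask with soft edges lets an attention-based objective pull the box toward where attention is concentrated. The abstract does not give the mask or the objective, so the sketch below is an assumption throughout: a product of sigmoids as the soft mask, soft-masked attention mass as the objective, and a finite-difference gradient in place of autograd. The names `soft_box_mask`, `attention_in_box`, and `shift_gradient` are hypothetical, and the foreground/background balancing mentioned in the abstract is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_box_mask(x, y, box, tau=1.0):
    """Soft indicator of (x, y) lying inside box = (x0, y0, x1, y1).

    tau controls edge softness; as tau -> 0 this approaches a hard mask.
    """
    x0, y0, x1, y1 = box
    return (sigmoid((x - x0) / tau) * sigmoid((x1 - x) / tau)
            * sigmoid((y - y0) / tau) * sigmoid((y1 - y) / tau))


def attention_in_box(attn, box, tau=1.0):
    """Objective (assumption): attention mass weighted by the soft mask."""
    return sum(a * soft_box_mask(x, y, box, tau)
               for y, row in enumerate(attn)
               for x, a in enumerate(row))


def shift_gradient(attn, box, eps=1e-4, tau=1.0):
    """Central-difference gradient of the objective w.r.t. a horizontal shift."""
    x0, y0, x1, y1 = box
    plus = attention_in_box(attn, (x0 + eps, y0, x1 + eps, y1), tau)
    minus = attention_in_box(attn, (x0 - eps, y0, x1 - eps, y1), tau)
    return (plus - minus) / (2.0 * eps)
```

With an attention map whose mass sits to the right of the user's box, the gradient with respect to a horizontal shift is positive, so a small ascent step nudges the box toward the attended region. In practice the gradient would come from automatic differentiation through the diffusion model's attention maps rather than finite differences.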
[61] Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification cs.CVPDF
Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou
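TL;DR: This paper proposes VLADR, a vision-language attribute disentanglement and reinforcement method for lifelong person re-identification. A multi-grain text attribute disentanglement mechanism mines the global and local attributes of an image, and an inter-domain cross-modal attribute alignment scheme achieves fine-grained knowledge transfer, improving the model's anti-forgetting and generalization capacity.
Details
Motivation: Existing lifelong person re-identification methods usually learn from scratch or start from a visually pretrained model, overlooking the fine-grained attribute knowledge in vision-language models and limiting knowledge acquisition and anti-forgetting capacity. The paper uses the general knowledge of vision-language models and explicitly models shared human attributes to improve inter-domain knowledge transfer.
Result: Experiments show that VLADR outperforms existing state-of-the-art methods by 1.9%-2.2% in anti-forgetting capacity and 2.1%-2.5% in generalization capacity, reaching SOTA on standard benchmarks.
Insight: The novelty is the multi-grain text attribute disentanglement mechanism and the inter-domain cross-modal attribute reinforcement scheme, which apply the general knowledge of vision-language models to lifelong learning at a fine granularity and use attribute alignment for more effective knowledge transfer and forgetting mitigation.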
Abstract: Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR
[62] Unbiased Dynamic Multimodal Fusion cs.CVPDF
Shicai Wei, Kaijie Zhang, Luyi Chen, Tao He, Guiduo Duan
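TL;DR: This paper proposes the Unbiased Dynamic Multimodal Learning (UDML) framework to address two problems of conventional dynamic multimodal fusion: modality quality cannot be assessed accurately under extreme noise conditions, and ignoring the intrinsic modality dependency bias causes hard-to-learn modalities to be doubly suppressed. With a noise-aware uncertainty estimator and a weighting mechanism that quantifies modality dependency bias, the framework achieves more accurate and fairer dynamic fusion.
Details
Motivation: Conventional dynamic multimodal methods rely on empirical metrics to assess modality quality, which fail when noise is extremely low or high. They also usually assume equal initial contributions from all modalities, ignoring the intrinsic modality dependency bias, so hard-to-learn modalities are doubly penalized and performance can fall below that of static fusion.
Result: Extensive experiments on diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of UDML.
Insight: The novelty is a noise-aware uncertainty estimator that adds controlled noise to modality data and predicts its intensity, forcing the model to learn a clear correspondence between feature corruption and noise level and thus measure uncertainty accurately across a wide noise range. In addition, the inherent modality dependency bias of multimodal networks is quantified via modality dropout and incorporated into the weighting mechanism, eliminating the dual suppression of hard-to-learn modalities.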
Abstract: Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.
[63] TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents cs.CVPDF
Shaojie Zhuang, Lu Yin, Guangshun Wei, Yunpeng Li, Xilu Wang
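TL;DR: TSegAgent is a zero-shot tooth segmentation method that combines the representational capacity of general-purpose foundation models with geometric inductive biases from dental anatomy. It recasts tooth segmentation as a geometric reasoning problem, segmenting and identifying teeth in intra-oral scanned 3D models without task-specific training.
Details
Motivation: Existing tooth segmentation methods depend on densely annotated data and task-specific 3D neural network training, leading to high annotation cost and poor generalization.
Result: Experiments show that the method achieves accurate and reliable tooth segmentation and identification at low computational and annotation cost, and generalizes strongly to diverse and previously unseen dental scans.
Insight: The novelty is recasting tooth segmentation as a zero-shot geometric reasoning problem that combines multi-view visual abstraction with geometry-grounded reasoning and explicitly encodes structural constraints, such as dental arch organization and volumetric relationships, to reduce uncertainty and mitigate overfitting.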
Abstract: Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.
[64] WorldAgents: Can Foundation Image Models be Agents for 3D World Models? cs.CVPDF
Ziya Erkoç, Angela Dai, Matthias Nießner
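TL;DR: This paper investigates whether 2D foundation image models possess 3D world model capabilities and proposes a multi-agent framework (WorldAgents) to harness and evaluate this latent ability. A vision-language model (VLM) director, an image generator, and a two-step verifier work together to synthesize expansive, realistic, 3D-consistent worlds that support novel view rendering.
Details
Motivation: Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, the paper asks whether these models inherently possess 3D world modeling capability, and how that latent capability can be systematically evaluated and used for 3D world synthesis.
Result: Extensive experiments across various foundation models indicate that 2D models do encapsulate an understanding of 3D worlds. The proposed agentic approach yields coherent and robust 3D reconstructions whose output scenes can be rendered from novel views, synthesizing expansive, realistic, and 3D-consistent worlds.
Insight: The novelty is framing 3D world synthesis as multi-agent collaboration: a VLM director generates prompts, an image generator synthesizes new views, and a VLM-based two-step verifier filters frames in both 2D image space and 3D reconstruction space, effectively surfacing and exploiting the implicit 3D capability of 2D foundation models. Viewed objectively, this agentic framing offers a systematic and scalable way to evaluate and elicit the 3D understanding of existing models.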
Abstract: Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
[65] BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates cs.CVPDF
Phuong-Anh Nguyen, Tien Anh Pham, Duc-Trong Le, Cam-Van Thi Nguyen
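TL;DR: This paper proposes BALM, a model-agnostic framework that addresses the learning imbalance across modalities caused by imbalanced missing rates (IMR) in multimodal learning. It comprises a Feature Calibration Module (FCM) and a Gradient Rebalancing Module (GRM), which recalibrate features and adjust gradient dynamics to achieve more balanced multimodal representation learning without changing the backbone architecture.
Details
Motivation: Real-world multimodal data often has imbalanced missing rates, so information-rich modalities dominate optimization while weaker or partially missing modalities contribute too little, distorting representation learning and gradient dynamics. A general method is needed to correct this imbalance.
Result: Across multiple multimodal emotion recognition (MER) benchmarks, BALM consistently improves robustness and performance under diverse missing and imbalance settings.
Insight: The novelty is to approach the problem from a training-process perspective and to handle imbalanced missing rates in a model-agnostic way through two complementary modules, feature calibration and gradient rebalancing, giving multimodal learning a plug-in, general-purpose solution.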
Abstract: Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: https://github.com/np4s/BALM_CVPR2026.git
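The IMR setting described in this abstract, where each modality is absent with its own probability, can be illustrated with a minimal sketch. The function names, modality names, and rates below are hypothetical and are not taken from the BALM code base; the sketch only simulates the missing-data pattern, not the FCM or GRM modules.

```python
import random


def apply_imbalanced_missing(sample, missing_rates, rng):
    """Drop each modality independently with its own probability.

    `sample` maps a modality name to its feature (any object); a dropped
    modality is replaced by None. `missing_rates` maps a modality name to
    the probability that it is absent. All names are illustrative.
    """
    masked = {}
    for name, feature in sample.items():
        rate = missing_rates.get(name, 0.0)
        masked[name] = None if rng.random() < rate else feature
    return masked


def empirical_missing_rates(samples, missing_rates, seed=0):
    """Estimate the observed per-modality missing rate over a dataset."""
    rng = random.Random(seed)
    counts = {name: 0 for name in missing_rates}
    for sample in samples:
        masked = apply_imbalanced_missing(sample, missing_rates, rng)
        for name in counts:
            if masked.get(name) is None:
                counts[name] += 1
    return {name: counts[name] / len(samples) for name in counts}
```

With rates such as `{"audio": 0.0, "text": 1.0, "video": 0.5}`, the audio stream is always present, the text stream is always missing, and the video stream is missing about half the time, which is the kind of per-modality asymmetry the abstract refers to.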
[66] PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing cs.CVPDF
Jiadong Liang, Bojun Xiong, Jie Tian, Hua Li, Xiao Long
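TL;DR: This paper proposes PerformRecast, a portrait video editing method that edits only the facial expression according to a driving video while keeping the head pose unchanged. It exploits the separate parameterization of the 3D Morphable Face Model (3DMM) to improve the keypoint transformation formula, which better disentangles expression from head pose, and it supervises facial and non-facial regions separately to improve generation quality.
Details
Motivation: Existing portrait animation research mainly animates a static portrait according to the motion in a driving video. Such methods struggle to disentangle facial expression from head pose rotation and therefore cannot edit expression independently. This paper targets that gap to give the film and animation industry finer-grained performance recasting.
Result: Extensive experiments show that the method produces high-quality results that are more faithful to the driving video and outperforms existing methods in both controllability and efficiency.
Insight: The core innovation is to use the 3DMM's separate parameters to improve the keypoint transformation formula, achieving better disentanglement of expression and head pose. In addition, separate supervision of facial and non-facial regions addresses misalignment around the face boundary in generated results and gives finer-grained control.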
Abstract: This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.
[67] PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement cs.CVPDF
Junzhe Cao, Bo Zhao, Zhiyi Niu, Dan Guo, Yue Sun
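TL;DR: This paper proposes PhysNeXt, a next-generation dual-branch structured attention fusion network for remote photoplethysmography (rPPG) measurement. It jointly uses raw video frames and spatial-temporal map (STMap) representations, combining a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder to make pulse signal extraction more robust. Experiments show that PhysNeXt recovers rPPG signals more stably and at finer granularity under challenging conditions.
Details
Motivation: Current rPPG methods rely either on end-to-end modeling from raw video or on intermediate STMap representations. The former preserves complete spatiotemporal information but admits substantial noise from motion artifacts and illumination changes; the latter reduces data volume and computational complexity but may lose high-frequency detail. The goal is to combine the strengths of both to improve the robustness of rPPG measurement.
Result: Experiments show that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating joint modeling of video and STMap representations.
Insight: The novelty is a dual-input deep learning framework that jointly uses video frames and STMap representations, enhanced by spatio-temporal difference modeling, cross-modal interaction, and a structured attention decoder. Viewed objectively, this dual-branch fusion strategy may balance information completeness against computational efficiency and offers a new direction for rPPG signal processing.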
Abstract: Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To effectively integrate the mutual strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The codes will be released.
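The STMap idea in this abstract, stacking the temporal color changes of several facial regions of interest into a compact 2D array, can be sketched as follows. This is a simplified single-channel illustration under stated assumptions: the function name, the ROI names, and the input layout (one dict of pixel lists per frame) are hypothetical and not PhysNeXt's actual preprocessing.

```python
def build_stmap(frames, roi_names):
    """Build a spatial-temporal map: rows are ROIs, columns are time steps.

    `frames` is a list of dicts mapping an ROI name to a list of pixel
    intensities for that frame. Each cell holds the mean intensity of one
    ROI in one frame, so a video collapses to a len(roi_names) x len(frames)
    array. The input layout is an illustrative assumption.
    """
    stmap = []
    for roi in roi_names:
        row = []
        for frame in frames:
            pixels = frame[roi]
            row.append(sum(pixels) / len(pixels))
        stmap.append(row)
    return stmap
```

Averaging within each ROI is what shrinks the data volume, and it is also where high-frequency spatial detail can be lost, which matches the trade-off the abstract describes.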
[68] Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation cs.CV | cs.AIPDF
Yifei Zhao, Fanyu Zhao, Yinsheng Li
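TL;DR: This paper proposes UPL (Uncertainty-aware Prototype Learning), a probabilistic approach to few-shot point cloud semantic segmentation. A dual-stream prototype refinement module enriches prototype representations by jointly using the limited information in support and query samples, and prototype learning is cast as a variational inference problem, which models uncertainty explicitly and yields robust, interpretable predictions.
Details
Motivation: Existing prototype-based methods typically build compact, deterministic prototypes from the support set to guide query segmentation. Such rigid representations cannot capture the intrinsic uncertainty introduced by scarce supervision, which degrades robustness and limits generalization.
Result: Extensive experiments on the ScanNet and S3DIS benchmarks show that UPL achieves state-of-the-art performance under different settings while providing reliable uncertainty estimates.
Insight: The main contributions are: 1) a dual-stream prototype refinement module that jointly uses support and query information; 2) a formulation of prototype learning as variational inference that treats class prototypes as latent variables, enabling explicit uncertainty modeling. Together these give few-shot learning a probabilistic framework that improves robustness and interpretability.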
Abstract: Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab-upl.github.io/.
[69] FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs cs.CVPDF
Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang
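TL;DR: This paper proposes FREAK, a comprehensive benchmark for fine-grained evaluation of hallucination in multimodal large language models (MLLMs). Using high-quality photorealistic images with fine-grained counter-commonsense edits, it evaluates hallucination in the detailed visual perception of MLLMs. Experiments show that current SOTA models hallucinate severely in detailed visual perception.
Details
Motivation: Existing hallucination benchmarks are either so simplified that their metrics saturate or too narrow in diversity to measure the extent of hallucination in advanced multimodal models, so a new, more comprehensive benchmark is needed.
Result: Extensive experiments on FREAK reveal severe hallucination in SOTA models' detailed visual perception. A controlled subset and a systematic evaluation of Chain-of-Thought (CoT) prompting techniques yield key insights into hallucination patterns and model reasoning processes.
Insight: The novelty is a comprehensive benchmark built from high-quality photorealistic images and fine-grained counter-commonsense edits for fine-grained hallucination assessment of MLLMs, together with a controlled subset that indirectly evaluates a model's ability to perceive target details, enabling deeper analysis of hallucination patterns and reasoning processes.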
Abstract: Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model’s ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
[70] Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images cs.CVPDF
Donghai Fang, Yongheng Li, Zhen Wang, Yuansong Zeng, Wenwen Min
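TL;DR: This paper proposes HINGE, which adapts a pre-trained single-cell foundation model (sc-FM) into a conditional generator of spatial gene expression from histology images. By introducing SoftAdaLN modulation, an expression-space masked diffusion objective, and a warm-start curriculum, it outperforms existing methods on three spatial transcriptomics datasets.
Details
Motivation: Spatial transcriptomics is costly and has limited throughput. Existing generative methods that predict gene expression from HE-stained histology often ignore gene-gene dependencies. Pre-trained sc-FMs capture these relationships but lack a visual pathway, and their pre-training objective does not match the conditional generation task, so an adaptation method is needed.
Result: On three spatial transcriptomics datasets, HINGE outperforms SOTA baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency.
Insight: The contributions are: 1) SoftAdaLN, a lightweight modulation module that injects visual context while preserving the gene relationships the sc-FM learned in pre-training; 2) an expression-space masked diffusion objective and a curriculum that ensure objective alignment and training stability; 3) a practical route for adapting pre-trained sc-FMs to histology-conditioned generation.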
Abstract: Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, ours outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
[71] FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision cs.CVPDF
Zekai Wu, Shuqi Fan, Mengyin Liu, Yuhua Luo, Xincheng Lin
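TL;DR: FlashCap is a millisecond-accurate human motion capture system based on flashing LEDs and event-based vision, designed to address precise motion timing (PMT). The work also builds a multimodal dataset, FlashMotion, and proposes a baseline, ResPose, which substantially improves performance on precise motion timing and high-temporal-resolution human pose estimation.
Details
Motivation: The human pose estimation (HPE) community has largely overlooked precise motion timing (PMT), mainly because high-temporal-resolution labeled datasets are scarce. Existing PMT solutions based on high-speed RGB cameras are costly, light-sensitive, and computationally complex, which makes them impractical for daily use.
Result: On the FlashMotion dataset, ResPose reduces pose estimation error by about 40% and achieves millisecond-level timing accuracy.
Insight: The novelty is FlashCap, presented as the first flashing-LED-based motion capture system for PMT, together with FlashMotion, a millisecond-resolution multimodal dataset covering event, RGB, LiDAR, and IMU data. The ResPose baseline learns residual poses from events and RGB frames; it is simple yet effective and opens new research directions in high-temporal-resolution motion analysis.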
Abstract: Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
[72] Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach cs.CVPDF
Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min
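TL;DR: This paper proposes the TIEdit benchmark and the EditProbe evaluator for systematic evaluation of text-guided image editing (TIE) methods. TIEdit pairs 512 source images with prompts across eight editing tasks, yielding 5,120 edited images from ten SOTA models, and expert annotation produces 15,360 mean opinion scores (MOSs). EditProbe estimates editing quality by probing intermediate-layer representations of multimodal large language models, and experiments show that it correlates more strongly with human perceptual judgments.
Details
Motivation: Existing benchmarks for text-guided image editing are limited in scale and correlate weakly with human perceptual judgments, so a reliable multi-dimensional evaluation standard is lacking.
Result: On TIEdit, widely used automatic metrics correlate only weakly with human judgments, whereas EditProbe aligns substantially better with human perception across the three evaluated dimensions: perceptual quality, editing alignment, and content preservation.
Insight: The novelty is TIEdit, a large-scale, multi-task, expert-annotated TIE benchmark, and EditProbe, an LLM-based evaluator built on intermediate-layer probing, which improves perceptual alignment by using semantic representations from intermediate layers rather than only final outputs.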
Abstract: Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which accumulates into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
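The dataset figures quoted in this abstract are mutually consistent, which a short calculation makes explicit. The sketch below assumes that every source image is edited once by each model and that every expert rates every edited image on all three dimensions; those assumptions are inferred from the numbers and are not stated as such in the abstract. The `mean_opinion_score` helper is a hypothetical illustration of how raw ratings reduce to one MOS.

```python
# Figures quoted in the abstract.
NUM_SOURCE_IMAGES = 512
NUM_MODELS = 10
NUM_DIMENSIONS = 3  # perceptual quality, editing alignment, content preservation
NUM_EXPERTS = 20

# One edited image per (source image, model) pair.
edited_images = NUM_SOURCE_IMAGES * NUM_MODELS

# One mean opinion score per (edited image, dimension) pair.
mos_count = edited_images * NUM_DIMENSIONS

# One raw rating per (expert, edited image, dimension) triple.
raw_ratings = mos_count * NUM_EXPERTS


def mean_opinion_score(ratings):
    """Average the raw ratings for one edited image on one dimension."""
    return sum(ratings) / len(ratings)
```

Under these assumptions the products reproduce the reported 5,120 edited images, 15,360 MOSs, and 307,200 raw ratings.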
[73] ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection cs.CVPDF
Chengzhi Hong, Bijun Li
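TL;DR: This paper proposes ReManNet, a Riemannian manifold network for monocular 3D lane detection. Under the Road-Manifold Assumption, the road is modeled as a smooth 2D manifold in 3D space and lanes as 1D submanifolds embedded in it. Geometry is encoded as Riemannian Gaussian descriptors and fused with visual features, and a 3D Tunnel Lane IoU loss is proposed to improve shape alignment.
Details
Motivation: Monocular 3D lane detection is difficult because of depth ambiguity and weak geometric constraints, which often distort lane shapes. Mainstream methods rely on simplified physical assumptions and lack an invariant geometric-topological coupling between lanes and the road surface, which makes 2D-to-3D lifting ill-posed and brittle.
Result: On standard benchmarks, ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by 8.2% over the baseline and by 1.8% over the previous best method, with scenario-level gains of up to 6.6%.
Insight: The contributions are: the Road-Manifold Assumption, which formalizes lane detection as geometric modeling on a manifold; Riemannian Gaussian descriptors that encode geometry on the symmetric positive-definite (SPD) manifold; and the 3D Tunnel Lane IoU loss, a joint point-curve objective that uses slice-wise overlap of tubular neighborhoods to improve shape alignment.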
Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at https://github.com/changehome717/ReManNet.
[74] One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment cs.CVPDF
Wen Yin, Cencen Liu, Dingrui Liu, Bing Su, Yuan-Fang Li
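TL;DR: This paper proposes TATAR, a framework that unifies image quality assessment (IQA) and image aesthetic assessment (IAA) in a single multimodal large language model through a task-aware post-training strategy. Tailored to the different natures of IQA and IAA, it uses fast-slow reasoning, a two-stage learning pipeline, and asymmetric rewards to resolve the reasoning and optimization mismatches found in existing unified methods.
Details
Motivation: Existing unified IQA and IAA methods apply the same task-agnostic reasoning strategy and reward to both tasks. This conflicts with the nature of the tasks, since IQA relies on low-level, objective perceptual cues while IAA requires high-level semantic aesthetic judgment, and it leads to poor performance. The paper aims to resolve both the reasoning mismatch and the optimization mismatch.
Result: Extensive experiments on eight benchmarks show that TATAR outperforms prior unified baselines on both tasks in in-domain and cross-domain settings, remains competitive with task-specific models, and trains more stably for aesthetic assessment.
Insight: The core innovation is a task-conditioned post-training paradigm with three parts: fast-slow task-specific reasoning construction (concise perceptual rationales for IQA, deliberative aesthetic narratives for IAA); two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that use Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Together these enable effective unified modeling of heterogeneous perceptual scoring tasks.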
Abstract: Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task’s nature. TATAR combines three components: fast–slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.
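The "Gaussian score shaping" reward mentioned for IQA can be pictured with a minimal sketch. The abstract does not give the formula, so the form below (a Gaussian kernel centered on the ground-truth score) and the parameter `sigma` are assumptions made for illustration; the function name is hypothetical and this is not TATAR's implementation.

```python
import math


def gaussian_score_reward(predicted, target, sigma=0.5):
    """Reward that peaks at 1.0 when the prediction equals the target score.

    Assumed form: exp(-(predicted - target)^2 / (2 * sigma^2)). The reward
    decays smoothly as the predicted score moves away from the target, and
    `sigma` controls how quickly near-misses lose reward.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    diff = predicted - target
    return math.exp(-(diff * diff) / (2.0 * sigma * sigma))
```

A smooth reward of this kind gives partial credit to near-correct scores, which suits a task with an objective target; the abstract contrasts this with a ranking-based reward for IAA, where point-wise regression is described as a poor fit.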
[75] Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection cs.CVPDF
Hantao Zheng, Ning Han, Yawen Zeng, Hao Chen
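TL;DR: This paper proposes DeSC (Decoupled Sensitivity-Consistency), a framework for weakly supervised video anomaly detection. It trains two specialized streams separately to handle the conflicting objectives of detecting transient and sustained anomalies, then fuses their strengths through a collaborative inference mechanism to produce balanced predictions.
Details
Motivation: Existing weakly supervised video anomaly detection methods use unified frameworks with joint optimization, which face a trade-off between sensitivity and stability and produce either fragmented or over-smoothed predictions.
Result: DeSC reaches 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%), setting a new state of the art.
Insight: The novelty is to decouple sensitivity and consistency into two specialized streams trained with aggressive and robust optimization strategies respectively, and to reduce their individual biases through collaborative inference. This offers a reusable architectural pattern for tasks involving multi-scale temporal patterns.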
Abstract: Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at https://github.com/imzht/DeSC.
[76] Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation cs.CV | cs.AIPDF
Yifei Zhao, Fanyu Zhao, Zhongyuan Zhang, Shengtang Wu, Yixuan Lin
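TL;DR: This paper proposes HOP3D, a framework for generalized few-shot 3D point cloud segmentation that addresses the stability-plasticity trade-off, in which adapting to novel classes degrades base-class performance. Through hierarchical orthogonal prototype learning and an entropy-based few-shot regularizer, it decouples base and novel learning at both the gradient and representation levels, adapting to novel classes while preserving base-class performance.
Details
Motivation: In generalized few-shot 3D point cloud segmentation, adapting to novel classes can interfere with shared representations and cause base-class forgetting. The paper addresses this stability-plasticity trade-off.
Result: On the ScanNet200 and ScanNet++ benchmarks, HOP3D consistently outperforms state-of-the-art baselines in both 1-shot and 5-shot settings.
Insight: The novelty is a hierarchical orthogonalization mechanism that decouples base and novel learning at the gradient and representation levels, combined with an entropy-based regularizer that uses predictive uncertainty to refine prototype learning and promote balanced predictions, which mitigates base-novel interference.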
Abstract: Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at https://fdueblab-hop3d.github.io/.
[77] From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models cs.CVPDF
Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan
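TL;DR: Modern vision-language models (VLMs) used as generative OCR engines can make rare but severe errors, such as over-generation and unsupported substitutions, because autoregressive decoding favors semantic plausibility over visual verifiability. This paper proposes a model-agnostic Geometric Risk Controller that probes multiple structured views of the input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability meet predefined criteria. OCR with a frozen VLM thus becomes a selective accept/abstain problem, giving predictable control over extreme-error risk and catastrophic over-generation.
Details
Motivation: Generative OCR suffers from a core deployment misalignment: autoregressive decoding pursues semantic plausibility, whereas OCR requires visually grounded, geometrically verifiable output. The mismatch produces severe errors and creates deployment risk even when benchmark accuracy is high.
Result: Experiments on frozen VLM backbones and standard OCR benchmarks show that the method consistently reduces extreme-error risk and catastrophic over-generation at a predictable cost in coverage.
Insight: The novelty is to formalize OCR with a frozen VLM as a selective accept/abstain problem and to propose a model-agnostic Geometric Risk Controller that enforces explicit system-level risk control through multi-view consensus and stability checks, instead of unconstrained generation. This offers a new approach to deploying generative OCR reliably.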
Abstract: Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
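The selective accept/abstain formulation in this abstract can be sketched as a consensus check over transcriptions obtained from several views of the same input. Everything below is a hypothetical illustration: the function name, the exact-match voting rule, the length-based screen, and the threshold values are assumptions, and the paper's controller may define views, screening, and stability quite differently.

```python
from collections import Counter


def accept_or_abstain(view_transcriptions, min_agreement=0.75, max_length=None):
    """Return the consensus transcription, or None to abstain.

    `view_transcriptions` holds one candidate string per structured view of
    the same input. `max_length` is a crude stand-in for structural
    screening that rejects over-long (possibly over-generated) outputs.
    """
    if not view_transcriptions:
        return None

    # Lightweight structural screen: drop empty or over-long candidates.
    candidates = [text.strip() for text in view_transcriptions]
    candidates = [
        text for text in candidates
        if text and (max_length is None or len(text) <= max_length)
    ]
    if not candidates:
        return None

    # Cross-view consensus: the most common surviving transcription.
    text, votes = Counter(candidates).most_common(1)[0]

    # Agreement is measured against all views, so screened-out views
    # count as dissent and make acceptance harder.
    agreement = votes / len(view_transcriptions)
    return text if agreement >= min_agreement else None
```

Raising `min_agreement` trades coverage for lower error risk: fewer inputs are transcribed, but those that are have stronger cross-view support. Sweeping the threshold yields a small family of operating points of the kind the abstract mentions.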
[78] Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy cs.CVPDF
Carolin Teuber, Anwai Archit, Tobias Boothe, Peter Ditte, Jochen Rink
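TL;DR: This paper evaluates several vision foundation models (VFMs), including the general-purpose SAM, SAM2, and DINOv3 and the domain-specific μSAM and PathoSAM, on pixel classification and object classification in microscopy images. Combined with shallow learning and attentive probing, the models are tested on five diverse datasets; they consistently improve over hand-crafted features, and the study establishes a benchmark for the field.
Details
Motivation: In microscopy image analysis, interactive semantic segmentation (pixel classification) and object-level classification still rely widely on feature-based shallow learning, owing to the diversity of the data, the lack of large pretraining datasets, and the need for computational and label efficiency. The paper investigates whether vision foundation models can improve these tasks.
Result: On five diverse and challenging datasets, vision foundation models combined with shallow learning and attentive probing consistently improve over hand-crafted features, establishing a benchmark for VFMs in microscopy.
Insight: The contribution is a systematic evaluation of vision foundation models for pixel and object classification in microscopy, combined with shallow learning strategies, which points to a practical path for improvement and informs future development in the area.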
Abstract: Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy - most notably cellular instance segmentation - already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones ($μ$SAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.
[79] Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision cs.CV | cs.AIPDF
Jiyeong Kim, Yerim So, Hyesong Choi, Uiwon Hwang, Dongbo Min
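TL;DR: This paper proposes SeGroS, a fine-tuning framework that addresses granularity mismatch and supervisory redundancy in unified multimodal models (UMMs). It introduces a novel visual grounding map to construct two complementary supervision signals, improving generation fidelity and cross-modal alignment.
Details
Motivation: Current generative training paradigms for unified multimodal models have inherent limitations, notably granularity mismatch and supervisory redundancy, which hurt performance and alignment.
Result: Extensive evaluation on GenEval, DPGBench, and CompBench shows that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
Insight: The core innovation is the Semantically-Grounded Supervision (SeGroS) framework, which builds two complementary supervision signals: semantic Visual Hints that compensate for sparse text prompts, and a semantically-grounded Corrupted Input that restricts the reconstruction loss to core text-aligned regions. Together they address the granularity mismatch in UMMs and make supervision of masking-based UMMs more efficient.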
Abstract: Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
[80] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks cs.CVPDF
Jingyu Guo, Ziye Chen, Ziwen Li, Zhengqing Gao, Jiaxin Huang
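TL;DR: This paper presents HUGE-Bench, a benchmark for high-level UAV vision-language-action tasks. It tests whether an agent can follow concise, high-level instructions to execute complex, process-oriented flight trajectories safely. It comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that supports scalable generation and collision-aware evaluation.
Details
Motivation: Existing UAV vision-language navigation benchmarks focus on long, step-wise route descriptions and goal-centric evaluation, which makes them poor at diagnosing whether an agent can turn brief, high-level commands into safe multi-stage behavior in real operations. A new benchmark is needed to test an agent's understanding of high-level semantics and its ability to execute complex procedures safely.
Result: Experiments on representative state-of-the-art vision-language-action models reveal significant gaps in high-level semantic completion and safe execution, highlighting the value of HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
Insight: The contributions are: 1) a benchmark focused on process-oriented, safety-aware UAV tasks under concise high-level instructions; 2) an aligned 3DGS-Mesh representation that combines photorealistic rendering with collision-capable geometry to support large-scale generation and precise evaluation; 3) process-oriented and collision-aware metrics that jointly measure process fidelity, terminal accuracy, and safety.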
Abstract: Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
[81] Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation cs.CVPDF
Lokendra Kumar, Shubham Aggarwal
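TL;DR: This paper is the first to apply Hyper-Connections (HC) to multi-modal brain tumor segmentation, using them as a plug-and-play adaptive connection mechanism that replaces the fixed residual connections in five segmentation architectures, including nnU-Net and SwinUNETR. On the BraTS 2021 dataset, dynamic HC consistently improves all 3D models, with up to a 1.03% mean Dice gain at negligible parameter overhead. Gains are largest in the Enhancing Tumor sub-region, indicating better fine-grained boundary delineation. Modality ablation further shows that HC-equipped models are more sharply sensitive to clinically dominant sequences (e.g., T1ce for Tumor Core, FLAIR for Whole Tumor), a behavior that is absent from fixed-connection baselines and consistent across all architectures. In 2D settings the improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation.
Details
Motivation: Fixed residual connections in multi-modal medical image segmentation may limit a model's ability to adaptively fuse features from different modalities. The paper aims to raise sensitivity to key clinical sequences and improve segmentation accuracy with a simple, efficient mechanism.
Result: On the BraTS 2021 dataset, HC improves all tested 3D models, with up to a +1.03% mean Dice gain at negligible parameter overhead. The improvement is most pronounced in the Enhancing Tumor sub-region, and modality sensitivity analysis shows a more clinically plausible reliance on the T1ce (Tumor Core) and FLAIR (Whole Tumor) sequences.
Insight: The novelty is introducing Hyper-Connections (HC) as a plug-and-play adaptive feature aggregation module in place of fixed residual connections. HC dynamically adjusts multi-modal information fusion, raises sensitivity to key clinical sequences, and improves boundary segmentation. Viewed objectively, it is a lightweight, general mechanism that uses 3D spatial context to strengthen multi-modal fusion and should be broadly applicable.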
Abstract: We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.
[82] IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment cs.CV | cs.LGPDF
Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer
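TL;DR: This paper proposes IsoCLIP to address the intra-modal misalignment that vision-language models such as CLIP exhibit on single-modality tasks like image-to-image retrieval. By analyzing the structure of the CLIP projectors, it identifies one operator responsible for inter-modal alignment and another that only performs intra-modal normalization, then removes the anisotropic directions from the projection weights. This improves intra-modal task performance and lowers latency without retraining.
Details
Motivation: Models like CLIP perform well on cross-modal tasks, but when an individual encoder (e.g., the image encoder) is applied to purely visual tasks such as image-to-image retrieval, performance degrades because intra-modal features are misaligned. The paper studies and mitigates the intra-modal misalignment caused by the CLIP projectors.
Result: Across multiple pre-trained CLIP-like models, on intra-modal retrieval and classification benchmarks, the method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches.
Insight: The novelty is a spectral analysis that decomposes the CLIP projectors into an inter-modal alignment operator and an intra-modal normalization operator, and that identifies a well-aligned, approximately isotropic subspace alongside anisotropic, modality-specific directions. Extracting the aligned subspace directly from the projector weights and removing the anisotropic directions improves intra-modal alignment without any training, which offers a new angle on intra-modal use of pre-trained multimodal models.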
Abstract: Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
[83] MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment cs.CVPDF
Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma
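TL;DR: This paper proposes MedQ-Engine, a closed-loop data engine for medical image quality assessment. The engine iteratively evaluates the model, discovers failure prototypes through data-driven clustering, uses those prototypes as retrieval anchors for progressive human-in-the-loop annotation over a million-scale image pool, and evolves the model through quality-assured fine-tuning, forming a self-improving cycle. Experiments show that the method efficiently improves multimodal large language models on medical image quality assessment.
Details
Motivation: Medical image quality assessment is a prerequisite for clinical AI deployment, yet current multimodal large language models fall well short of human experts at producing descriptive assessments with clinical reasoning. Two obstacles hinder improvement: the high cost of acquiring descriptive annotations, and the inability of one-time data collection to keep up with the model's evolving weaknesses.
Result: Experiments across five medical imaging modalities show that MedQ-Engine lifts an 8B-parameter model to surpass GPT-4o by over 13% and narrows the gap with human experts to only 4.34%. It does so with only 10K annotations, at more than 4x the sample efficiency of random sampling.
Insight: The core novelty is a closed-loop, self-improving data engine that uses data-driven failure prototype discovery and an entropy-guided routing mechanism to identify and annotate model weaknesses efficiently and at low cost. Viewed objectively, combining active learning, human-in-the-loop annotation, and iterative fine-tuning in one framework is a systematic engineering answer to high annotation cost and poor model adaptability.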
Abstract: Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model’s evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
[84] SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation cs.CVPDF
Víctor Barreiro, Johannes Jakubik, Francisco Argüello, Dora B. Heras
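TL;DR: The paper proposes SIMPLER, a pre-fine-tuning architecture selection method for efficiently adapting Earth Observation foundation models. Exploiting the stabilization of representations in the deeper layers of pre-trained vision transformers, it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to identify and prune redundant layers, cutting inference and deployment costs before adaptation.
Details
Motivation: Fine-tuning Earth Observation foundation models is computationally expensive, with high time and memory demands in both training and deployment. Existing parameter-efficient methods reduce training cost but keep full inference complexity, while post-hoc compression can optimize inference only after costly full fine-tuning. A method is needed that lowers both training and inference cost before adaptation.
Result: On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and a 2.6x inference speedup. The method also generalizes to TerraMind (a multimodal Earth Observation foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities.
Insight: The novelty is a layer-pruning method that exploits the tendency of deeper representations in pre-trained vision transformers to stabilize, and that requires no gradients, magnitude heuristics, or hyperparameter tuning. Because architecture selection happens before adaptation, it improves both training and inference efficiency, giving a novel and general approach to efficient foundation model adaptation.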
Abstract: Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at https://gitlab.citius.gal/hpc4rs/simpler.
[85] Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery cs.CV | cs.AIPDF
Jizhou Han, Chenhao Ding, Yuhang He, Qiang Wang, Shaokun Wang
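TL;DR: This paper proposes the Analogical Textual Concept Generator (ATCG), a plug-and-play module for Generalized Category Discovery (GCD). The module analogizes from labeled knowledge to generate textual concepts for unlabeled samples and fuses these concepts with visual features, turning discovery into visual-textual reasoning and improving the separation of fine-grained, look-alike categories.
Details
Motivation: Current GCD methods rely mainly on visual-only pipelines and couple supervised learning and discovery only loosely, which yields brittle decision boundaries on fine-grained, look-alike categories. The paper aims to enrich visual representations with textual concepts so that novel categories are discovered more robustly while known categories remain recognized.
Result: Across six benchmarks, ATCG consistently improves the overall, known-class, and novel-class accuracy of GCD models, with the largest gains on fine-grained datasets.
Insight: The core novelty is generating textual concepts by analogical reasoning, which transfers prior knowledge to new data and lets the visual and textual modalities reason jointly, sharpening category boundaries. The method requires no change to the overall design of existing GCD pipelines, which makes it general and extensible.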
Abstract: Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: https://github.com/zhou-9527/AnaLogical-GCD.
[86] SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images cs.CVPDF
Jinyuan Qu, Hongyang Li, Lei Zhang
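TL;DR: SegVGGT is a unified end-to-end framework that performs feed-forward 3D reconstruction and instance segmentation simultaneously, directly from multi-view RGB images. It integrates instance identification deeply into the visual geometry grounded transformer by introducing object queries that interact with multi-level geometric features, and it proposes a frame-level attention distribution alignment strategy to counter attention dispersion.
Details
Motivation: Existing 3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, require complex multi-stage pipelines, and are highly sensitive to reconstruction noise. Feed-forward transformers have recently made breakthroughs in multi-view 3D reconstruction but remain decoupled from high-level semantic understanding. The paper tackles joint 3D reconstruction and instance segmentation directly from multi-view RGB images.
Result: On the ScanNetv2 and ScanNet200 benchmarks, SegVGGT achieves state-of-the-art performance, outperforming recent joint models and RGB-D-based approaches, and it generalizes strongly to ScanNet++.
Insight: The main contributions are: 1) object queries that interact with multi-level geometric features, integrating instance identification deeply into the visual geometry transformer; 2) the Frame-level Attention Distribution Alignment strategy, which explicitly guides object queries toward instance-relevant frames during training and so counters the attention dispersion caused by the large number of global image tokens, without adding inference overhead. It is an effective way to fuse semantic understanding and geometric reconstruction in a unified feed-forward framework.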
Abstract: 3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
[87] HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction cs.CV | cs.AI | cs.LGPDF
Ruicheng Yuan, Zhenxuan Zhang, Anbang Wang, Liwei Hu, Xiangqian Hua
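TL;DR: This paper proposes HiPath, a lightweight vision-language model framework for structured pathology report prediction. Built on frozen UNI2 and Qwen3 backbones, it uses three trainable modules (15M parameters in total) to handle multi-image visual encoding, cross-modal alignment, and structured diagnosis generation. Trained on 749K real-world Chinese pathology cases, it outperforms the baselines on both strict accuracy and clinically acceptable accuracy and generalizes well.
Details
Motivation: Existing pathology vision-language models reduce structured, multi-granular pathology reports to flat labels or free-form text and cannot capture the reports' hierarchical structure, so a model designed specifically for structured report prediction is needed.
Result: On a dataset of 749K real-world Chinese pathology cases, HiPath achieves 68.9% strict accuracy and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines that use the same frozen backbones. In cross-hospital evaluation, strict accuracy drops by only 3.4 percentage points and the safety rate stays at 97.1%, confirming generalization.
Insight: The novelties are treating structured report prediction as the primary training objective and introducing three modules: the Hierarchical Patch Aggregator (HiPA), Hierarchical Contrastive Learning (HiCL), and Slot-based Masked Diagnosis Prediction (Slot-MDP). These handle multi-image encoding, cross-modal alignment, and structured generation respectively, giving lightweight and efficient structured pathology report prediction.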
Abstract: Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
[88] Adaptive Greedy Frame Selection for Long Video Understanding cs.CV | cs.AI | cs.CLPDF
Yuning Huang, Fengqing Zhu
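TL;DR: This paper proposes an adaptive greedy frame selection method to cope with the limited number of input frames that vision-language models can take in long-video understanding. It builds a candidate frame pool, jointly optimizes query relevance and semantic representativeness in the SigLIP and DINOv2 embedding spaces, and uses preset strategies plus a lightweight classifier to select frames adaptively, improving video question answering accuracy.
Details
Motivation: When large vision-language models handle long-video question answering, inference is often bottlenecked by the number of input frames and visual tokens. Naive sparse sampling can miss key frames, while purely relevance-driven selection tends to pick near-duplicate frames and sacrifices coverage of temporally dispersed evidence.
Result: On the MLVU benchmark, the method outperforms uniform sampling and a strong recent baseline across frame budgets, with the largest gains under tight budgets.
Insight: The novelty is formulating frame selection as the maximization of a normalized, monotone, submodular objective, and introducing a question-type classifier that adapts the trade-off between relevance and coverage, which improves long-video understanding under limited compute.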
Abstract: Large vision–language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
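The greedy objective described in this abstract (a modular relevance term plus a facility-location coverage term) can be illustrated with a small sketch. This is not the authors' implementation: the function name `greedy_select`, the trade-off weight `lam`, the normalization by total relevance and pool size, and the toy relevance and similarity values are all assumptions made for illustration. In the paper the relevance scores would come from SigLIP and the similarities from DINOv2; here they are plain non-negative numbers.

```python
def greedy_select(relevance, sim, budget, lam=0.5):
    """Greedily pick frame indices under a fixed budget.

    Assumed objective (illustrative, not taken from the paper's code):
        F(S) = lam * sum_{i in S} relevance[i] / sum(relevance)
             + (1 - lam) * (1/n) * sum_j max_{i in S} sim[j][i]
    With non-negative inputs this is normalized, monotone, and submodular.
    """
    n = len(relevance)
    total_rel = sum(relevance) or 1.0
    selected = []
    # best_cover[j]: how well frame j is already covered by the selected set
    best_cover = [0.0] * n
    for _ in range(min(budget, n)):
        best_gain, best_idx = -1.0, None
        for i in range(n):
            if i in selected:
                continue
            rel_gain = relevance[i] / total_rel
            # Marginal facility-location gain of adding frame i
            cov_gain = sum(
                max(sim[j][i], best_cover[j]) - best_cover[j] for j in range(n)
            ) / n
            gain = lam * rel_gain + (1 - lam) * cov_gain
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        best_cover = [max(best_cover[j], sim[j][best_idx]) for j in range(n)]
    return selected


# Toy pool: frames 0 and 1 are near-duplicates and both highly relevant.
relevance = [0.9, 0.85, 0.4, 0.1]
sim = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
balanced = greedy_select(relevance, sim, budget=2, lam=0.5)
relevance_only = greedy_select(relevance, sim, budget=2, lam=1.0)
```

In this toy pool, pure relevance (`lam=1.0`) picks the two near-duplicate frames, while the balanced setting replaces the second one with a frame that covers the rest of the pool, which is the failure mode and the remedy the abstract describes.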
[89] 2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction cs.CVPDF
Tianbao Zhang, Zhenyu Liang, Zhenbo Song, Nana Wang, Xiaomei Zhang
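TL;DR: This paper proposes 2K Retrofit, a framework that addresses the prohibitive compute and memory cost of running existing geometric foundation models directly on high-resolution (e.g., 2K) images. Without modifying or retraining the backbone, it uses fast coarse predictions and entropy-based sparse refinement to selectively enhance high-uncertainty regions, producing precise, high-fidelity 2K outputs with minimal overhead.
Details
Motivation: High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but scalability limits keep existing foundation models from being applied directly to real-world high-resolution scenes, because direct inference incurs prohibitive compute and memory demands.
Result: Extensive experiments on widely used benchmarks show that 2K Retrofit achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications.
Insight: The novelty is a general framework that leaves the backbone untouched and combines fast coarse prediction with an entropy-based sparse refinement strategy to achieve efficient high-resolution inference. Viewed objectively, the core insight is to let uncertainty (entropy) guide the allocation of compute and refine only the critical regions, which is an effective way to balance efficiency and accuracy.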
Abstract: High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle the issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmarks demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.
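The idea of spending refinement compute only on high-uncertainty regions can be sketched as follows. The abstract does not say how entropy is computed or how regions are chosen, so everything here is an assumption: each patch is assumed to carry a discrete probability distribution from the coarse pass (for example over depth bins), and the helper `select_uncertain_patches` and its `keep_ratio` parameter are hypothetical names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain_patches(patch_probs, keep_ratio=0.25):
    """Return indices of the most uncertain patches, in spatial order.

    patch_probs: one probability distribution per patch (assumed to come
    from a coarse prediction). keep_ratio is a hypothetical knob for the
    fraction of patches sent to the refinement stage.
    """
    scores = [entropy(p) for p in patch_probs]
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four toy patches: confident, maximally uncertain, and two in between.
patches = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
]
to_refine = select_uncertain_patches(patches, keep_ratio=0.5)
```

Only the selected patches would then be passed to a refinement module, while the confident patches keep their coarse prediction.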
[90] VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking cs.CV | cs.AI | cs.CLPDF
Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang
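TL;DR: This paper presents VideoSeek, an agentic model for long-video understanding and reasoning. Its core idea is to use video logic flow to actively seek answer-critical evidence rather than greedily parsing all densely sampled frames, which greatly reduces compute while maintaining or even improving video understanding.
Details
Motivation: Existing video agentic models rely heavily on greedy parsing of densely sampled video frames, which is computationally expensive. The paper addresses this by building an efficient long-video agent that actively seeks key evidence.
Result: On four challenging video understanding and reasoning benchmarks, VideoSeek achieves strong accuracy while using far fewer frames (e.g., 93% fewer frames on LVBench than its base model, GPT-5), and it improves on GPT-5 by 10.2 absolute points on LVBench.
Insight: The claimed novelties are the paradigm of active seeking driven by video logic flow (Tool-Guided Seeking) and a toolkit for a think-act-observe loop that supports multi-granular video observation and query-aware exploration. Viewed objectively, combining active information seeking with video logic flow gives a novel and effective framework for efficient long-video understanding, and the complementary roles of the tools in the toolkit are worth borrowing.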
Abstract: Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
[91] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving cs.CV | cs.AIPDF
Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen
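TL;DR: This paper presents X-World, a controllable multi-camera world model for end-to-end autonomous driving. Given multi-view camera history and a future action sequence, it generates future multi-camera video streams in video space that follow the commanded actions. It also supports control over dynamic traffic agents, static road elements, and appearance-level attributes such as weather and time of day, aiming to provide a scalable and reproducible basis for simulation-based evaluation.
Details
Motivation: Evaluation of end-to-end autonomous driving still relies heavily on real-world road testing, which is costly, limited in scenario coverage, and hard to reproduce. This motivates a real-world simulator that generates realistic future observations under given actions while remaining controllable and stable over long horizons.
Result: Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability, with strict action following and adherence to optional scene controls.
Insight: The core novelty is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. The model also combines world simulation with video style transfer, controlling appearance through appearance prompts while preserving the underlying action and scene dynamics, which gives a practical basis for scalable evaluation.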
Abstract: Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision–language–action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
[92] MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI cs.CVPDF
Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf
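TL;DR: This paper presents MedSPOT, a workflow-aware sequential grounding benchmark for clinical graphical user interfaces (GUIs). It evaluates how well multimodal large language models (MLLMs) perform sequential, multi-step visual grounding in real medical software environments, filling the gap left by existing benchmarks that focus on isolated single-step queries.
Details
Motivation: Existing GUI benchmarks mostly target isolated, single-step grounding queries and overlook the sequential, workflow-driven reasoning that real-world medical interfaces require. The ability of multimodal large language models to perform reliable visual grounding in high-stakes clinical software environments remains underexplored.
Result: The dataset comprises 216 task-driven videos and 597 annotated keyframes, with each task consisting of 2 to 3 interdependent grounding steps. The paper proposes a strict sequential evaluation protocol (task assessment terminates at the first error) and a comprehensive failure taxonomy for systematically diagnosing model behavior in clinical GUI settings.
Insight: The novelty is modeling procedural interaction as a sequence of structured spatial decisions and introducing a workflow-aware sequential reasoning evaluation paradigm that emphasizes error propagation and contextual dependencies in multi-step tasks. This establishes a more realistic, safety-critical benchmark for evaluating multimodal models in medical software environments.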
Abstract: Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
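The strict sequential protocol, in which assessment of a task stops at the first incorrect grounding prediction, can be sketched as below. This is an illustrative reading rather than the benchmark's released evaluation code: the abstract does not define what counts as a correct step, so the sketch assumes a prediction is a click point that must fall inside the target bounding box, and the names `point_in_box` and `evaluate_task` are hypothetical.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the box (x0, y0, x1, y1), edges included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def evaluate_task(predictions, target_boxes):
    """Score one multi-step task under strict sequential evaluation.

    Returns (steps_completed, task_success). Evaluation stops at the first
    missing or incorrect prediction, so later steps earn no credit even if
    they would have been correct in isolation.
    """
    completed = 0
    for pred, box in zip(predictions, target_boxes):
        if pred is None or not point_in_box(pred, box):
            break
        completed += 1
    return completed, completed == len(target_boxes)


# A three-step task: the second click misses, the third would have hit.
targets = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
partial = evaluate_task([(5, 5), (0, 0), (45, 45)], targets)
full = evaluate_task([(5, 5), (25, 25), (45, 45)], targets)
```

Comparing `steps_completed` with per-step accuracy computed independently would show how much of the error is due to propagation across steps rather than to any single grounding decision.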
[93] CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data cs.CVPDF
Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han
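TL;DR: This paper proposes CFCML, a coarse-to-fine crossmodal learning framework that fuses medical images and tabular data for disease diagnosis. By exploring relationships among multi-granularity features and building a hierarchical anchor-based contrastive learning strategy, it progressively narrows the modality gap and extracts discriminative information.
Details
Motivation: In clinical practice there is a significant modality gap between medical images and tabular data. Existing crossmodal learning methods mostly focus on relationships among high-level features and neglect local image information and the extraction of task-relevant information.
Result: Experiments on the MEN and Derm7pt datasets show AUC improvements of 1.53% and 0.91%, respectively, over the previous best methods, reaching state-of-the-art performance.
Insight: The novelty is a two-stage coarse-to-fine framework. The coarse stage explores relationships among multi-granularity features to achieve a preliminary reduction of the modality gap. The fine stage introduces class-aware unimodal and crossmodal prototypes and a hierarchical anchor-based relationship mining strategy for contrastive learning from multiple perspectives, which increases inter-class disparity and reduces intra-class disparity.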
Abstract: In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy uses modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.
[94] Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR cs.CV | cs.AIPDF
Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong
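TL;DR: Multimodal large language models lose fine-grained information on OCR tasks because high-level semantic gradients interfere with low-level visual features. To address this, the paper proposes Detached Skip-Links, a lightweight modification that reuses shallow features in the forward pass while blocking gradient flow through the skip branch during joint training, which decouples feature aggregation from gradient propagation. The paper also introduces R-Probe, a diagnostic tool that uses a shallow decoder to measure the pixel-level reconstructability of projected visual tokens and so verify that fine-grained information is preserved. Experiments across multiple ViT backbones and large-scale multimodal benchmarks validate the method.
Details
Motivation: Multimodal large language models underperform on OCR tasks that need fine-grained visual detail. The authors trace this to an overlooked optimization issue in multi-layer feature fusion: skip connections create direct back-propagation paths from high-level semantic objectives to early visual layers, which overwrites low-level signals and destabilizes training.
Result: Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, the method consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
Insight: The core novelty is Detached Skip-Links, an asymmetric design that adds no learnable parameters and decouples forward feature aggregation from backward gradient propagation to reduce gradient interference, which stabilizes training and improves convergence. R-Probe, in turn, offers a quantifiable way to assess how much fine-grained visual information an MLLM preserves.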
Abstract: Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
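The asymmetry behind Detached Skip-Links (same forward value, no gradient through the skip branch) can be shown on a hand-differentiated scalar toy model. This is only a sketch under strong assumptions: the two-weight model, the squared-error loss, and the function `forward_backward` are invented for illustration and are not the paper's architecture. In an autodiff framework the same effect would typically come from a stop-gradient operation such as `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
def forward_backward(x, y, w_shallow, w_deep, detach_skip):
    """Toy fusion: fused = deep(shallow(x)) + skip(shallow(x)).

    Returns (fused, grad_w_shallow, grad_w_deep) for the loss
    0.5 * (fused - y) ** 2, with gradients derived by hand.
    """
    shallow = w_shallow * x            # early "visual" feature
    deep = w_deep * shallow            # later "semantic" feature
    fused = deep + shallow             # skip link reuses the shallow feature
    err = fused - y                    # dLoss/dfused

    grad_w_deep = err * shallow
    # Path 1: loss -> deep branch -> shallow layer
    grad_w_shallow = err * w_deep * x
    # Path 2: loss -> skip branch -> shallow layer, cut when detached
    if not detach_skip:
        grad_w_shallow += err * x
    return fused, grad_w_shallow, grad_w_deep


attached = forward_backward(x=2.0, y=1.0, w_shallow=0.5, w_deep=3.0, detach_skip=False)
detached = forward_backward(x=2.0, y=1.0, w_shallow=0.5, w_deep=3.0, detach_skip=True)
```

Both calls produce the same fused output, and the deep weight receives the same gradient; only the shallow weight's gradient changes, because the direct path from the objective to the early layer has been removed.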
[95] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models cs.CVPDF
Puskal Khadka, KC Santosh
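TL;DR: This paper proposes MFil-Mamba, a new visual state space model built on a multi-filter scanning backbone. It targets the challenges of extending state space models (SSMs) to computer vision, namely the non-sequential structure of images and their complex 2D spatial dependencies. Multi-filter scanning captures unique, contextually relevant spatial information, and an adaptive weighting mechanism fuses the outputs of multiple scans, which reduces redundancy while preserving spatial relationships.
Details
Motivation: Extending state space models (especially the Mamba architecture) to computer vision is challenging because visual data has a non-sequential structure and complex 2D spatial dependencies. Existing methods mostly apply different traversal strategies to the same input, which introduces redundancy and distorts the spatial relationships within images.
Result: MFil-Mamba outperforms existing state-of-the-art models across benchmarks covering image classification, object detection, instance segmentation, and semantic segmentation. For example, its tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset.
Insight: The novelty is a multi-filter scanning backbone in which each scan captures unique, contextually relevant spatial information with less redundancy, together with an adaptive weighting mechanism that fuses the outputs of multiple scans and further architectural enhancements, so that the spatial dependencies of visual data are handled better.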
Abstract: State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.
[96] Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning cs.CV | cs.AIPDF
Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong
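TL;DR: This paper proposes Chain-of-Adaptation (CoA), an adaptation framework that uses reinforcement learning to integrate domain knowledge into vision-language models while preserving their pretrained multimodal priors and generalization ability.
Details
Motivation: Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors and reduce its generalization ability.
Result: On standard surgical benchmarks, in both in-distribution and out-of-distribution settings, CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning.
Insight: The novelty is a structured reasoning format that, through reinforcement learning, strengthens domain alignment without sacrificing general multimodal competence, offering a reliable path to domain specialization of vision-language models.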
Abstract: Conventional fine-tuning on domain-specific datasets can inadvertently alter a model’s pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model’s inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that, through reinforcement learning, enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model’s core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
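[97] Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection cs.CV
Hui Zhong, Yichun Gao, Luyan Liu, Xusen Guo, Zhaonian Kuang
TL;DR: This paper proposes FacadeFixer, a multi-agent collaborative framework for building facade defect inspection, where variable geometry, complex backgrounds, composite defects, and scarce high-quality pixel-level annotations make it hard for models to generalize. The framework coordinates detection and segmentation agents to handle interference among multiple defect types, and pairs them with a generative agent that performs semantic recomposition, decoupling defects from complex backgrounds and synthesizing high-fidelity augmented data.
Details
Motivation: Facade defect inspection is essential for structural health monitoring and urban maintenance, but large geometric variation, complex backgrounds, composite defects, and a severe shortage of high-quality pixel-level annotations limit the generalization of existing detection and segmentation models.
Result: Extensive experiments on the newly introduced multi-task dataset covering six primary facade categories show that FacadeFixer significantly outperforms existing state-of-the-art baselines, particularly in capturing pixel-level structural anomalies.
Insight: The core novelty is treating defect perception as a collaborative reasoning task rather than isolated recognition: multiple agents (detection, segmentation, generation) cooperate to decouple and semantically recompose defects, and generative synthesis serves as a robust remedy for data scarcity in infrastructure inspection. The work also contributes a new dataset with pixel-level annotations.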
Abstract: Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address these gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.
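[98] Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning cs.CV
Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang
TL;DR: This paper presents DefectBench, the first hierarchical benchmark for evaluating large multimodal models (LMMs) on structural pathology reasoning for buildings. It unifies 12 fragmented datasets under a standardized ontology and evaluates 18 SOTA models along three cognitive dimensions: semantic perception, spatial localization, and generative geometry segmentation. Current LMMs do well at topological understanding and semantic diagnosis but fall short on metric localization precision; the study also validates the feasibility of zero-shot generative segmentation.
Details
Motivation: Automated facade inspection has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN), which are limited to passive perception and generalize poorly, and LMMs lack rigorous evaluation standards in high-stakes engineering domains. This work addresses both gaps.
Result: Across 18 SOTA LMMs evaluated on DefectBench, models excel at semantic perception and topological understanding (diagnosing “what” and “how”) but are markedly weak in spatial localization precision (“where”). Zero-shot generative segmentation is shown to rival specialized supervised networks without domain-specific training.
Insight: Contributions include a human-in-the-loop semi-automated annotation framework that unifies fragmented datasets, the first multi-dimensional hierarchical benchmark for assessing LMM reasoning beyond basic semantic recognition, and evidence that general-purpose foundation models can achieve zero-shot domain adaptation through generative segmentation, setting a new baseline for autonomous AI agents in civil engineering.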
Abstract: Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing “what” and “how”), they exhibit significant deficiencies in metric localization precision (“where”). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
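[99] EgoForge: Goal-Directed Egocentric World Simulator cs.CV | cs.MM
Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li
TL;DR: EgoForge is a goal-directed egocentric world simulator that generates coherent first-person video sequences from minimal static inputs: a single egocentric image, a high-level instruction, and an optional exocentric view. It introduces VideoDiffusionNFT, which optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling to improve intent alignment and temporal consistency.
Details
Motivation: Generative world models struggle with egocentric video because of rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures that depend on latent human intent; existing methods are limited in scene evolution, action-dynamics modeling, or supervision requirements.
Result: In extensive experiments, EgoForge achieves consistent gains over strong baselines in semantic alignment, geometric stability, and motion fidelity, and performs robustly in real-world smart-glasses experiments.
Insight: The novelty lies in generating goal-directed egocentric video from minimal static inputs and in VideoDiffusionNFT, a trajectory-level reward-guided refinement of diffusion sampling that jointly optimizes several video quality criteria, suggesting a route to simulating dynamic environments with little supervision.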
Abstract: Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
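[100] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis cs.CV
Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi
TL;DR: LagerNVS is an encoder-decoder neural network for novel view synthesis (NVS) that uses 3D-aware latent features to generate high-quality views in real time without explicit 3D reconstruction. Its encoder is initialized from a 3D reconstruction network pretrained with explicit 3D supervision and is paired with a lightweight decoder, trained end-to-end with photometric losses; it performs strongly with and without known camera parameters.
Details
Motivation: Although recent work shows that neural networks can perform 3D tasks such as NVS without explicit 3D reconstruction, the authors argue that building strong 3D inductive biases into the network design is still helpful.
Result: LagerNVS reaches 31.4 PSNR on Re10k, achieving state-of-the-art deterministic feed-forward NVS; it renders in real time and generalizes to in-the-wild data.
Insight: The core novelty is initializing the encoder from a 3D reconstruction network pretrained with explicit 3D supervision, which injects strong 3D priors into an end-to-end NVS network and balances quality with efficiency. The method can also be paired with a diffusion decoder for generative extrapolation.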
Abstract: Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We demonstrate this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on ‘3D-aware’ latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
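[101] Improving Image-to-Image Translation via a Rectified Flow Reformulation cs.CV
Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui
TL;DR: This paper proposes Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. It concatenates a noise-corrupted version of the ground-truth target to the backbone input along the channel dimension and optimizes a simple t-reweighted pixel loss, which enables ODE-based progressive refinement at inference while largely preserving the standard supervised training pipeline. Experiments show that I2I-RFR improves performance across many I2I tasks and backbones, especially in perceptual quality and detail preservation.
Details
Motivation: Pixel-wise I2I regression over-smooths ill-posed and multimodal targets, while generative alternatives usually need additional components, task-specific tuning, and more complex training and inference pipelines. The goal is a lightweight continuous-time refinement method for conventional I2I models.
Result: Extensive experiments on multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across tasks and backbones, with especially clear gains in perceptual quality and detail preservation.
Insight: By simply expanding the input channels and using a t-reweighted loss, the method reinterprets a regression network as a rectified flow and enables progressive ODE refinement at inference without distillation or a heavy generative pipeline, making it a lightweight, easy-to-integrate plug-in.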
Abstract: In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
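The few-step ODE refinement described in this abstract can be illustrated with a minimal sketch. It assumes the common rectified-flow interpolation x_t = (1 - t) * noise + t * target, under which a target prediction implies the velocity (target_hat - x_t) / (1 - t); the abstract does not state the exact parameterization, so this is an assumption. The `predict_target` callable is a hypothetical stand-in for the trained backbone, and the toy check below replaces it with an oracle.

```python
import numpy as np

def euler_refine(predict_target, source, noise, steps=3):
    """Integrate from t=0 (pure noise) toward t=1 with explicit Euler steps.

    Assumes x_t = (1 - t) * noise + t * target, so a target prediction
    implies the velocity (target_hat - x_t) / (1 - t).
    """
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # The backbone sees the source image concatenated channel-wise
        # with the current (partially denoised) state.
        net_input = np.concatenate([source, x], axis=0)
        target_hat = predict_target(net_input, t)
        velocity = (target_hat - x) / (1.0 - t)
        x = x + dt * velocity
    return x

# Toy check: an oracle "network" that always returns the true target.
rng = np.random.default_rng(0)
source = rng.normal(size=(3, 8, 8))
target = rng.normal(size=(3, 8, 8))
noise = rng.normal(size=(3, 8, 8))
refined = euler_refine(lambda net_input, t: target, source, noise, steps=3)
```

With a perfect predictor the trajectory is a straight line, so three Euler steps land on the target; a real backbone would only approximate this, and each step would re-estimate the target from the refined state.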
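[102] MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering cs.CV
Yuan Zhou, Yongzhi Li, Yanqi Dai, Xingyu Zhu, Yi Tan
TL;DR: MuSteerNet is a framework that generates 3D human reaction motions from videos. Its observation-reaction mutual steering mechanism addresses the distorted relationship between visual observations and reaction types in existing methods, producing reactions that better match the video content.
Details
Motivation: Existing video-driven human reaction generation methods fail to use the video input effectively to steer synthesis, so the generated reactions do not match the video content; the root cause is a severe relational distortion between visual observations and reaction types.
Result: Extensive experiments and ablation studies validate the method, which achieves competitive performance.
Insight: The paper proposes a Prototype Feedback Steering mechanism that refines visual observations with a gated delta-rectification modulator and a relational margin constraint, and introduces Dual-Coupled Reaction Refinement, which uses the rectified visual cues to further steer the refinement of generated reaction motions and improve reaction quality.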
Abstract: Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: https://github.com/zhouyuan888888/MuSteerNet.
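[103] Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods cs.CV
Sebastian Gerard, Josephine Sullivan
TL;DR: Motivated by uncertain settings (such as wildfire spread, medical diagnosis, and autonomous driving) where multiple plausible outcomes must be predicted, this paper studies training-free ways to increase the sample diversity of segmentation diffusion models. The authors adapt particle guidance and SPELL, both originally designed for natural image generation, to discrete segmentation, and propose a simple clustering-based technique. The methods are validated on the LIDC medical dataset, a modified Cityscapes dataset, and the newly introduced MMFire wildfire spread dataset; compared with naive sampling, they substantially increase sample diversity with little cost to image quality and runtime.
Details
Motivation: Predicting future states in uncertain environments requires models that consider multiple plausible outcomes. Diffusion models can learn multimodal distributions, but naive sampling is computationally inefficient and may need hundreds of samples to find low-probability modes that are still operationally relevant. The paper therefore tackles sample-efficient ambiguous segmentation by evaluating training-free sampling methods that encourage diverse predictions.
Result: The methods raise the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, showing that training-free methods can increase the sample diversity of segmentation diffusion models with little impact on image quality and runtime cost.
Insight: The novelties are the adaptation of particle guidance and SPELL from natural image generation to discrete segmentation and a simple clustering-based technique. Together with the new MMFire dataset, the results show that training-free methods offer a practical gain in sample efficiency for diffusion models in uncertainty-aware prediction.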
Abstract: Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios
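To make the idea of picking diverse segmentation samples concrete, here is a minimal sketch. It is not the paper's clustering-based technique, whose details the abstract does not give; it swaps in a plain greedy farthest-point selection over IoU distance, and the function names `iou` and `select_diverse` are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks (1.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return np.logical_and(a, b).sum() / union

def select_diverse(masks, k):
    """Greedily pick k mask indices, each maximizing its minimum
    IoU distance (1 - IoU) to the masks already chosen."""
    chosen = [0]
    min_dist = np.array([1.0 - iou(masks[0], m) for m in masks])
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist = np.array([1.0 - iou(masks[nxt], m) for m in masks])
        min_dist = np.minimum(min_dist, dist)
    return chosen

# Toy pool: three near-duplicate "left half" masks and one "right half" mask.
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = np.zeros((4, 4), dtype=bool)
right[:, 2:] = True
left_variant = left.copy()
left_variant[0, 2] = True
pool = [left, left_variant, left, right]
picked = select_diverse(pool, k=2)
```

In the toy pool the selection skips the near-duplicates and returns the left and right masks, which is the behavior a diversity-promoting sampler aims for when only a few scenarios can be shown to an operator.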
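[104] CoVR-R: Reason-Aware Composed Video Retrieval cs.CV
Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala
TL;DR: This paper proposes CoVR-R, a zero-shot, reasoning-first approach to composed video retrieval. It uses large multimodal models to infer the causal and temporal consequences implied by an edit and aligns the resulting reasoned queries with candidate videos, without task-specific fine-tuning. The authors also introduce CoVR-Reason, a new benchmark for evaluating reasoning in CoVR. Experiments show that the method outperforms strong baselines on recall metrics, particularly on implicit-effect subsets.
Details
Motivation: Prior composed video retrieval work assumes that the modification text fully specifies the visual changes, overlooking the after-effects and implicit consequences of an edit (e.g., motion, state transitions, viewpoint or duration cues). The authors argue that successful CoVR requires reasoning about these after-effects.
Result: On the proposed CoVR-Reason benchmark, the zero-shot method outperforms strong retrieval baselines on Recall@K and does especially well on implicit-effect subsets. Automatic and human analyses both confirm higher step consistency and effect factuality in the retrieved results.
Insight: The core novelty is bringing explicit reasoning about causal and temporal consequences into general-purpose multimodal models for composed video retrieval. This reduces dependence on task-specific supervision, improves generalization to hard implicit-effect cases, and makes retrieval results more interpretable, pointing toward a scalable and principled framework for explainable video search.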
Abstract: Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
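For readers unfamiliar with the retrieval metric reported here, recall at K is the fraction of queries whose target video appears among the top K ranked candidates. A minimal sketch follows; the video identifiers and rankings are made up for illustration and are not from the benchmark.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears among the top-k ranked candidates."""
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return hits / len(target_ids)

# One ranked candidate list per query, best match first (hypothetical ids).
rankings = [["v3", "v1", "v7"], ["v2", "v9", "v4"], ["v5", "v6", "v8"]]
targets = ["v1", "v4", "v0"]
r1 = recall_at_k(rankings, targets, k=1)
r3 = recall_at_k(rankings, targets, k=3)
```

Here no target is ranked first, so recall at 1 is zero, while two of the three targets appear within the top three.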
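[105] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation cs.CV | cs.AI
Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu
TL;DR: LumosX is a framework for personalized video generation that combines data collection and model design to tackle precise identity-attribute alignment in multi-subject video generation. On the data side it builds a benchmark dataset with subject-attribute relational priors; on the model side it introduces Relational Self-Attention and Relational Cross-Attention for fine-grained, identity-consistent, and semantically aligned generation.
Details
Motivation: Existing diffusion-based text-to-video methods struggle to align multiple subjects (especially faces) precisely with their attributes and lack explicit mechanisms to ensure intra-group consistency. The paper targets this fine-grained identity-attribute alignment problem.
Result: Evaluated on the authors' comprehensive benchmark, LumosX achieves state-of-the-art (SOTA) performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
Insight: The work advances both data and model: 1) on the data side, multimodal large language models infer subject-attribute relational priors from independent videos to build a structured benchmark; 2) on the model side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to model subject-attribute dependencies explicitly, enforcing intra-group cohesion and inter-group separation.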
Abstract: Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
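[106] From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering cs.CV | cs.AI | cs.LG
Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib
TL;DR: Existing image tampering detection benchmarks rely mainly on object masks, which are badly misaligned with the true tampering signal. This paper proposes a new pixel-level, semantics- and language-aware task framework for VLM image tampering. The authors introduce a taxonomy covering edit primitives and their semantic categories, release a new benchmark with per-pixel tamper maps and category supervision, and propose a training framework and evaluation metrics that quantify pixel-level correctness, localization confidence, and the ability to understand and describe tampered regions in natural language.
Details
Motivation: Existing benchmarks depend heavily on object masks: many pixels inside a mask are untouched or only trivially modified, while subtle but consequential edits outside the mask are treated as natural, which is badly out of line with the true tampering signal. VLM image tampering detection therefore needs to be redefined as a pixel-level, semantics- and language-aware task.
Result: The authors re-evaluate strong existing segmentation/localization baselines and recent tamper detectors on the new benchmark, revealing substantial over- and under-scoring by mask-only metrics and exposing failure modes on micro-edits and off-mask changes. The new framework sets a rigorous standard for tamper localization, semantic classification, and description.
Insight: The novelties are: 1) a tampering taxonomy that links low-level pixel changes to high-level semantic understanding; 2) a new benchmark with per-pixel tamper maps and paired category supervision, giving a unified evaluation protocol for detection and classification; 3) a training framework and evaluation metrics that quantify pixel-level correctness, localization confidence, and semantic understanding and language description, moving the field from masks to pixels, semantics, and language descriptions.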
Abstract: Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over- and under-scoring using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
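[107] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints cs.CV
Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang
TL;DR: This paper proposes MME-CoF-Pro, a comprehensive benchmark for evaluating reasoning coherence in video generative models. It contains 303 samples across 16 categories and evaluates causal consistency under three settings: no hint, text hint, and visual hint.
Details
Motivation: The literature lacks an evaluation of reasoning coherence in video generative models; a dedicated benchmark is needed to check that generated events stay causally consistent across frames.
Result: Evaluation of 7 open- and closed-source video models shows that their reasoning coherence is weak and decoupled from generation quality; text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning; visual hints help structured perceptual tasks but struggle with fine-grained perception.
Insight: The novelty lies in defining reasoning coherence as a property, designing an evaluation framework with multi-category samples and three hint settings, and introducing a Reasoning Score that assesses the necessary intermediate reasoning steps at the process level, providing a controlled tool for analyzing how models reason.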
Abstract: Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable deployment, a property we define as reasoning coherence. To address the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing process-level necessary intermediate reasoning steps, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal several insights: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/
q-bio.QM
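[108] Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search q-bio.QM | cs.AI | cs.CV
Himadri Samanta
TL;DR: This study proposes a multimodal retrieval-augmented generation (RAG) system for drafting the impression section of chest radiograph reports. It combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to keep the generated text factually aligned with historical radiology reports and improve clinical reliability.
Details
Motivation: Fully generative approaches to automated radiology report generation often hallucinate and lack clinical grounding, which limits their reliability in real-world workflows.
Result: Experiments on the MIMIC-CXR dataset show that multimodal fusion retrieval significantly outperforms image-only retrieval, reaching Recall@5 above 0.95 on clinically relevant findings; the generated drafts are interpretable and carry explicit citation traceability, which improves trustworthiness.
Insight: The novelty is applying a multimodal RAG framework to radiology report generation: image and text embeddings are fused for similarity retrieval and combined with safety mechanisms such as citation coverage and confidence-based refusal, yielding factually aligned, traceable drafts and a new approach to reliable clinical decision support.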
Abstract: Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports. A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal. Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation.
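The fusion retrieval and confidence-based refusal described above can be sketched as follows. This is only an illustration under stated assumptions: brute-force NumPy search stands in for the FAISS index, tiny hand-written vectors stand in for CLIP and impression embeddings, and the fusion weight `alpha`, the threshold `min_score`, and the function name `fused_retrieve` are hypothetical, since the abstract does not specify how the two similarities are combined.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fused_retrieve(q_img, q_txt, db_img, db_txt, alpha=0.5, k=2, min_score=0.3):
    """Rank database cases by a weighted sum of image and text cosine similarity.

    Returns (indices, scores) for the top-k cases, or None when the best
    fused score is below min_score (a stand-in for confidence-based refusal).
    """
    sim_img = l2_normalize(db_img) @ l2_normalize(q_img)
    sim_txt = l2_normalize(db_txt) @ l2_normalize(q_txt)
    fused = alpha * sim_img + (1.0 - alpha) * sim_txt
    order = np.argsort(-fused)[:k]
    if fused[order[0]] < min_score:
        return None
    return order.tolist(), fused[order].tolist()

# Three stored cases with 2-D toy embeddings per modality.
db_img = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
db_txt = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
hit = fused_retrieve(np.array([1.0, 0.0]), np.array([1.0, 0.0]), db_img, db_txt)
miss = fused_retrieve(np.array([0.0, -1.0]), np.array([0.0, -1.0]), db_img, db_txt)
```

The first query matches stored cases 0 and 2, which could then be cited in a grounded prompt; the second query resembles nothing in the database, so the sketch refuses rather than drafting from weak evidence.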
cs.DB
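[109] ReViSQL: Achieving Human-Level Text-to-SQL cs.DB | cs.CL
Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang
TL;DR: This paper presents ReViSQL, a framework that achieves human-level Text-to-SQL accuracy on the BIRD benchmark for the first time. It builds a verified, high-quality dataset, BIRD-Verified, and applies reinforcement learning with verifiable rewards (RLVR) together with inference-time scaling, substantially improving model performance without complex AI agent architectures.
Details
Motivation: State-of-the-art Text-to-SQL AI agents still fall short of human-level accuracy on BIRD. The authors argue that closing this gap depends on better training data quality rather than more architectural complexity.
Result: On the expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B reaches 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%; the lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.
Insight: The core message is that data quality, rather than extreme architectural complexity, is the key to better Text-to-SQL performance. Concretely, the authors design a data correction and verification workflow to build BIRD-Verified and pair RLVR with inference-time scaling (execution-based reconciliation and majority voting) in a streamlined framework.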
Abstract: Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.
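The majority-voting part of the inference-time scaling mentioned above can be illustrated with a small sketch: execute several candidate queries and keep one whose result set is the most common. The abstract does not describe ReViSQL's actual reconciliation procedure, so this is a generic illustration; the function name `vote_by_execution`, the toy schema, and the candidate queries are all hypothetical.

```python
import sqlite3
from collections import Counter

def vote_by_execution(conn, candidates):
    """Execute each candidate query and return one whose result set is most common.

    Candidates that fail to execute are skipped. Rows are sorted by their
    repr so that row order does not split votes.
    """
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winner, _ = Counter(result for _, result in executed).most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(sql for sql, result in executed if result == winner)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 40.0)]
)
candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount >= 25",
    "SELECT COUNT(*) FROM orders",
    "SELECT COUNT(*) FROM order_items",  # fails: no such table
]
best = vote_by_execution(conn, candidates)
```

Two syntactically different candidates agree on the result, so their shared answer outvotes the third, and the query against a missing table is discarded. Voting on execution results rather than on query text is what lets differently written but equivalent queries pool their votes.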
cs.SD
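[110] CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation cs.SD | cs.AI | cs.CL
Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo
TL;DR: This paper proposes CAF-Score, a new metric for reference-free audio captioning evaluation. It combines the coarse-grained semantic alignment of Contrastive Language-Audio Pretraining (CLAP) models with the fine-grained comprehension and syntactic awareness of large audio-language models (LALMs) to better detect syntactic inconsistencies and subtle hallucinations.
Details
Motivation: Existing evaluation methods fall short: reference-based metrics are expensive and struggle to assess acoustic fidelity, while CLAP-based methods often overlook syntactic errors and fine-grained details. A more robust reference-free metric is needed.
Result: Experiments on the BRACE benchmark show that CAF-Score correlates most strongly with human judgments, even outperforming reference-based baselines in challenging scenarios, a state-of-the-art result.
Insight: The core novelty is calibrating and fusing CLAP's coarse-grained semantic alignment with LALMs' fine-grained reasoning, enabling a more complete reference-free assessment of caption quality and accuracy. The idea of combining complementary model capabilities is broadly reusable.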
Abstract: While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP’s coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
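[111] Borderless Long Speech Synthesis cs.SD | cs.CL | eess.AS
Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding
TL;DR: This paper proposes Borderless Long Speech Synthesis, a framework that addresses the lack of global context understanding and paralinguistic cue capture in existing TTS systems. It adopts a “Labeling over filtering/cleaning” data strategy and a multi-level Global-Sentence-Token annotation schema, and on the model side combines a continuous tokenizer, Chain-of-Thought (CoT) reasoning, and Dimension Dropout to improve adherence to complex instructions. The system is natively agentic: its hierarchical annotation serves as a structured semantic interface between the LLM agent and the synthesis engine, enabling cross-modal control from scene semantics down to phonetic detail and extending the paradigm from text-to-speech to borderless long speech synthesis.
Details
Motivation: Existing TTS systems typically synthesize sentence by sentence and stitch the results together, or are driven by plain-text dialogue alone. Models therefore struggle with global context and paralinguistic cues and cannot capture real-world phenomena such as multi-speaker interactions, evolving emotions, and changing acoustic environments.
Result: The abstract gives no specific quantitative results or benchmark comparisons, but states that the proposed CoT reasoning and Dimension Dropout markedly improve instruction following under complex conditions.
Insight: The novelties include: 1) the “Labeling over filtering/cleaning” data strategy and the Global-Sentence-Token hierarchical annotation schema; 2) CoT reasoning and Dimension Dropout in the model to strengthen instruction understanding; 3) a natively agentic architecture that turns the hierarchical annotation into a structured semantic interface between the LLM agent and the synthesis engine, providing wide-band cross-modal control and extending the traditional Text2Speech paradigm.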
Abstract: Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a “Labeling over filtering/cleaning” strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
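[112] FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts cs.SD | cs.CV
You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He
TL;DR: This paper proposes FoleyDirector, a framework that uses Structured Temporal Scripts (STS) and Bi-Frame Sound Synthesis to address the weak fine-grained temporal control of existing video-to-audio (V2A) methods, especially in multi-event scenes or when visual cues are insufficient, balancing high-fidelity audio with precise temporal guidance.
Details
Motivation: Existing V2A methods struggle with fine-grained temporal control in complex multi-event scenes or when visual information is insufficient (e.g., small regions, off-screen sounds, occluded objects), which limits the expressiveness and controllability of the generated audio.
Result: Experiments using the constructed DirectorSound dataset and the VGGSoundDirector and DirectorBench evaluation benchmarks show that FoleyDirector substantially improves temporal controllability while maintaining high audio fidelity, advancing V2A toward more expressive and controllable generation.
Insight: The novelties include Structured Temporal Scripts (STS) that supply rich temporal information, a Script-Guided Temporal Fusion Module (with Temporal Script Attention) for feature fusion, and Bi-Frame Sound Synthesis, which handles complex multi-event scenes and supports parallel in-frame and out-of-frame audio generation for better controllability.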
Abstract: Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
cs.CY [Back]
[113] Overreliance on AI in Information-seeking from Video Content cs.CY | cs.CL | cs.HCPDF
Anders Giovanni Møller, Elisa Bassignana, Francesco Pierri, Luca Maria Aiello
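TL;DR: This paper studies how generative AI, in particular large language models, affects user accuracy, efficiency, and confidence in video information-seeking tasks. In an experiment with about 900 participants and more than 8,000 tasks, AI assistance improved accuracy and efficiency but also led users to over-rely on the AI: when the AI deliberately gave wrong answers, accuracy dropped sharply while confidence stayed unchanged, exposing fundamental safety risks in AI-mediated video information retrieval.
Details
Motivation: Generative AI and large language models are increasingly deployed as intermediaries between users and multimedia content, yet the impact of their inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains underexplored. The paper investigates how AI affects accuracy, efficiency, and user confidence when retrieving information from videos.
Result: Across three conditions (video only, video + AI assistance, video + deceiving AI), AI assistance raised accuracy by 3-7% when participants had viewed the relevant video segment and by 27-35% when they had not. Efficiency improved by 10% for short videos and 25% for longer ones. When interacting with the deceiving AI, however, accuracy dropped by up to 32%, while self-reported confidence remained stable across all conditions.
Insight: The paper's contribution is the first systematic empirical study of the dual effect of AI assistance in video information retrieval: it substantially improves performance but also induces over-reliance and blind trust, with confidence unaffected even when the AI supplies wrong information. This exposes a fundamental safety weakness in current AI-mediated information retrieval systems, namely that users do not critically evaluate AI outputs.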
Abstract: The ubiquity of multimedia content is reshaping online information spaces, particularly in social media environments. At the same time, search is being rapidly transformed by generative AI, with large language models (LLMs) routinely deployed as intermediaries between users and multimedia content to retrieve and summarize information. Despite their growing influence, the impact of LLM inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains largely unexplored. We investigate how generative AI affects accuracy, efficiency, and confidence in information retrieval from videos. We conduct an experiment with around 900 participants on 8,000+ video-based information-seeking tasks, comparing behavior across three conditions: (1) access to videos only, (2) access to videos with LLM-based AI assistance, and (3) access to videos with a deceiving AI assistant designed to provide false answers. We find that AI assistance increases accuracy by 3-7% when participants viewed the relevant video segment, and by 27-35% when they did not. Efficiency increases by 10% for short videos and 25% for longer ones. However, participants tend to over-rely on AI outputs, resulting in accuracy drops of up to 32% when interacting with the deceiving AI. Alarmingly, self-reported confidence in answers remains stable across all three conditions. Our findings expose fundamental safety risks in AI-mediated video information retrieval.
cs.HC [Back]
[114] AI Psychosis: Does Conversational AI Amplify Delusion-Related Language? cs.HC | cs.AI | cs.CL | cs.CY | cs.SIPDF
Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J. Rodriguez, Hari Sundaram
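TL;DR: This paper builds simulated users (SimUsers) that engage in multi-turn interactions with conversational AI to study how AI affects the evolution of delusion-related language. SimUsers with a history of delusion-related language show progressively rising DelusionScore over the interaction, whereas those without such a history stay stable or decline. The amplification varies by theme, with reality skepticism and compulsive reasoning showing the largest increases, and conditioning AI responses on the current DelusionScore substantially reduces the trend.
Details
Motivation: Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns that they may reinforce delusional thinking in vulnerable users, but empirical evidence is lacking. The paper examines whether conversational AI amplifies delusion-related language over multi-turn interactions.
Result: In conversations between the simulated users and models from the GPT, LLaMA, and Qwen families, users with a history of delusion-related language show clearly rising DelusionScore trajectories, while users without such a history remain stable or decline. Conditioning AI responses on DelusionScore effectively suppresses the upward trend.
Insight: The contributions include DelusionScore, a quantitative measure of the intensity of delusion-related language; an experimental framework built on simulated users derived from longitudinal data; and the finding that conversational AI can amplify a specific psychological risk, together with evidence that state-aware safety mechanisms (such as conditioning responses on DelusionScore) mitigate it.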
Abstract: Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interactions with AI may reinforce delusional thinking – a phenomenon sometimes described as AI Psychosis. However, empirical evidence on this phenomenon remains limited. In this work, we examine how delusion-related language evolves during multi-turn interactions with conversational AI. We construct simulated users (SimUsers) from Reddit users’ longitudinal posting histories and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) exhibit progressively increasing DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that this amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning AI responses on current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that conversational AI interactions can amplify delusion-related language over extended use and highlight the importance of state-aware safety mechanisms for mitigating such risks.
[115] Behavioral Engagement in VR-Based Sign Language Learning: Visual Attention as a Predictor of Performance and Temporal Dynamics cs.HC | cs.CVPDF
Davide Traini, José Manuel Alcalde-Llergo, Mariana Buenestado-Fernández, Domenico Ursino, Enrique Yeguas-Bolívar
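TL;DR: This study analyzes behavioral engagement in SONAR, a virtual reality sign language learning application, focusing on how three automatically extracted engagement indicators (visual attention, video replay frequency, and post-playback viewing time) relate to learning performance. Visual attention correlates strongly and positively with quiz scores, followed by post-playback viewing time, while video replay frequency shows no meaningful association. A temporal analysis shows that attention peaks align with information-dense video segments and reveals phase-specific engagement patterns: initial acclimatization, oscillatory attention cycles during learning, and pronounced attention peaks during assessment.
Details
Motivation: The study asks how automatically extracted behavioral engagement indicators can be used to understand and predict learning performance in VR-based sign language learning, and what temporal engagement patterns arise during the learning process.
Result: In the Training phase and Validation quiz of the SONAR VR application, visual attention and post-playback viewing time are significant predictors of learning success and jointly explain a substantial proportion of performance variance. Pearson correlation analysis shows a strong positive correlation between visual attention and quiz scores, while video replay frequency shows no meaningful association. Temporal analysis shows attention peaks aligned with information-dense segments of the training and validation videos.
Insight: The contribution lies in quantitatively linking automatically extracted behavioral engagement indicators (visual attention in particular) to VR sign language learning performance and in revealing phase-specific engagement patterns through temporal analysis. Viewed objectively, the approach offers a quantifiable behavioral-tracking framework for assessing engagement in immersive learning environments and underscores the central role of sustained, strategically allocated visual attention.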
Abstract: This study analyzes behavioral engagement in SONAR, a virtual reality application designed for sign language training and validation. We focus on three automatically derived engagement indicators (Visual Attention (VA), Video Replay Frequency (VRF), and Post-Playback Viewing Time (PPVT)) and examine their relationship with learning performance. Participants completed a self-paced Training phase, followed by a Validation quiz assessing retention. We employed Pearson correlation analysis to examine the relationships between engagement indicators and quiz performance, followed by binomial Generalized Linear Model (GLM) regression to assess their joint predictive contributions. Additionally, we conducted temporal analysis by aggregating moment-to-moment VA traces across all learners to characterize engagement dynamics during the learning session. Results show that VA exhibits a strong positive correlation with quiz performance, followed by PPVT, whereas VRF shows no meaningful association. A binomial GLM confirms that VA and PPVT are significant predictors of learning success, jointly explaining a substantial proportion of performance variance. Going beyond outcome-oriented analysis, we characterize temporal engagement patterns by aggregating moment-to-moment VA traces across all learners. The temporal profile reveals distinct attention peaks aligned with informationally dense segments of both training and validation videos, as well as phase-specific engagement dynamics, including initial acclimatization, oscillatory attention cycles during learning, and pronounced attentional peaks during assessment. Together, these findings highlight the central role of sustained and strategically allocated visual attention in VR-based sign language learning and demonstrate the value of behavioral trace data for understanding and predicting learner engagement in immersive environments.
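The Pearson correlation step named in the abstract can be illustrated with a minimal sketch. The function name `pearson_r` and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not data from the study, and the study's own analysis pipeline is not described at this level of detail.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-learner values (NOT from the paper): fraction of time
# spent looking at the video, and number of correct quiz answers.
visual_attention = [0.42, 0.55, 0.61, 0.73, 0.80]
quiz_correct = [3, 5, 5, 8, 9]
r = pearson_r(visual_attention, quiz_correct)
print(f"r = {r:.3f}")
```

A coefficient near +1 would correspond to the strong positive association the abstract reports for VA; the joint contribution of several indicators is a separate question, which the authors address with a binomial GLM.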
eess.IV [Back]
[116] Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery eess.IV | cs.CVPDF
Jan Emily Mangulabnan, Akshat Chauhan, Laura Fleig, Lalithkumar Seenivasan, Roger D. Soberanis-Mukul
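TL;DR: This paper proposes a policy-based, end-to-end approach to endoscopic camera pose recovery that aims to imitate how experts estimate trajectories from the previous camera state, directly predicting short-horizon relative motions without maintaining an explicit geometric representation at inference time.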
Details
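Motivation: Conventional vision-based navigation systems rely on feature matching and geometric optimization over keyframes, and their performance degrades under the challenging conditions of endoscopic imaging, such as low texture and rapid illumination changes. The paper explores an alternative that is closer to how surgeons reason.
Result: Evaluated on a cadaveric sinus endoscopy dataset under oracle state conditioning, the method achieves the lowest mean translation error and competitive rotational accuracy, and shows greater robustness under low-texture conditions.
Insight: The contribution is a policy-based learning framework that predicts camera motion directly, avoiding by design problems of conventional geometric methods such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. It offers a viable new formulation for endoscopic camera pose recovery.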
Abstract: In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the same principles of reasoning about new frames that make surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging, such as low texture and rapid illumination changes. Here, we pursue an alternative approach and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality to geometric baselines, achieving the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change, which indicates reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.
cs.LG [Back]
[117] Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data cs.LG | cs.AI | cs.CLPDF
Hyunji Nam, Haoran Li, Natasha Jaques
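TL;DR: This paper proposes Mutual Information Preference Optimization (MIPO), a contrastive data augmentation method for improving LLM personalization and performance on other tasks. It builds preference pairs from a positive response generated for the correct prompt and a negative response generated for a random, unrelated prompt, and trains with Direct Preference Optimization (DPO) to maximize pointwise conditional mutual information between prompts and model responses. Experiments show 3-40% gains on personalization tasks and 1-18% improvements on math and multiple-choice tasks, with no additional data or human supervision.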
Details
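Motivation: Post-training improvements for LLMs rely heavily on expensive human-labeled data or external verifiers and struggle to cover tasks that are not easily verifiable. A self-improvement framework that needs no external supervision is therefore required.
Result: In experiments with Llama- and Qwen-Instruct models, MIPO achieves 3-40% improvements over strong baselines on personalization tasks using real-user datasets. On math and multiple-choice tasks it also yields 1-18% improvements without additional data or human supervision.
Insight: The contribution is a contrastive data augmentation method that achieves self-improvement by maximizing conditional mutual information between prompts and responses. Its core insight is that preference optimization over positive/negative pairs generated by the model itself can effectively improve performance on personalization and reasoning tasks, pointing to a new direction for unsupervised self-improvement.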
Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verifiable. Therefore, we need self-improvement frameworks that allow models to improve without external oversight. We propose Mutual Information Preference Optimization (MIPO), a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization (DPO) to learn from this paired data maximizes pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Empirical results with various-sized Llama- and Qwen-Instruct models show that when used to maximize MI between user context and response, MIPO provides an effective personalization technique, achieving 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% improvements without any additional data or human supervision. These results suggest a promising direction for self-improvement.
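The pair-construction step described in the abstract can be sketched as follows, together with the standard per-pair DPO loss. This is an illustration under stated assumptions, not the authors' implementation: `build_mipo_pairs`, `dpo_loss`, the `generate` callable (a stand-in for sampling from the base LLM), and the `beta` default are all hypothetical names and values.

```python
import math
import random


def build_mipo_pairs(prompts, generate, rng):
    """Build one preference pair per prompt.

    chosen   = response generated for the prompt itself
    rejected = response generated for a different, randomly drawn prompt
    `generate` is a hypothetical stand-in for sampling from the base model.
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to draw an unrelated one")
    pairs = []
    for i, prompt in enumerate(prompts):
        # Draw the index of a different prompt for the negative response.
        j = rng.choice([k for k in range(len(prompts)) if k != i])
        pairs.append({
            "prompt": prompt,
            "chosen": generate(prompt),
            "rejected": generate(prompts[j]),
        })
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities.

    Returns -log(sigmoid(margin)), computed as a numerically stable
    softplus(-margin). The beta value is an arbitrary illustrative default.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage with a toy generator in place of a real model.
toy_generate = lambda p: "reply to: " + p
pairs = build_mipo_pairs(["user A context", "user B context", "user C context"],
                         toy_generate, random.Random(0))
print(pairs[0])
```

In a real setup the log-probabilities passed to `dpo_loss` would come from the policy and a frozen reference model scoring each response under the original prompt.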
[118] FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment cs.LG | cs.CLPDF
Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu
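TL;DR: This paper proposes FedPDPO, a personalized federated learning framework for preference alignment of large language models (LLMs). By combining a globally shared LoRA adapter, personalized client-specific LLM heads, explicit reward heads, and a bottleneck adapter, it addresses the challenges posed by non-IID preference data in federated learning and achieves communication-efficient, high-performing model alignment.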
Details
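Motivation: In federated learning (FL), aligning LLMs with human preferences is challenging because the data are decentralized, privacy-sensitive, and highly non-IID. Direct Preference Optimization (DPO) is more efficient than reinforcement learning with human feedback (RLHF), but applying it directly in FL leads to severe performance degradation under non-IID data and limited generalization of implicit rewards.
Result: Extensive experiments on multiple preference datasets show state-of-the-art performance in federated intra-domain and cross-domain settings, with average accuracy improvements of up to 4.80%.
Insight: The contributions are: 1) a parameter-efficient fine-tuning architecture that pairs a globally shared LoRA adapter with personalized client-specific LLM heads to handle non-IID heterogeneity; 2) a personalized DPO training strategy with client-specific explicit reward heads that complement implicit rewards; 3) a bottleneck adapter that balances global and local features. Viewed objectively, the work combines personalized federated learning with efficient DPO alignment, offering a systematic framework and theoretical analysis for the core difficulty of LLM alignment in FL.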
Abstract: Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
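The split between a globally shared adapter and client-specific heads can be sketched as below. The abstract does not state the aggregation rule, so sample-weighted averaging (FedAvg-style) is an assumption here, as are the function names, the `"lora."` key prefix, and the flat-list parameter layout.

```python
def split_client_state(state, shared_prefix="lora."):
    """Separate parameters sent to the server from those kept on the client.

    The "lora." prefix is a hypothetical naming convention for this sketch.
    """
    shared = {k: v for k, v in state.items() if k.startswith(shared_prefix)}
    local = {k: v for k, v in state.items() if not k.startswith(shared_prefix)}
    return shared, local


def aggregate_shared_adapter(client_updates):
    """Sample-weighted average of shared adapter parameters (assumed rule).

    client_updates: list of (num_samples, params), where params maps a
    parameter name to a flat list of floats. Personalized heads are never
    passed in, so they are never averaged.
    """
    if not client_updates:
        raise ValueError("no client updates to aggregate")
    total = sum(n for n, _ in client_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    reference = client_updates[0][1]
    aggregated = {}
    for name, values in reference.items():
        aggregated[name] = [
            sum(n * params[name][i] for n, params in client_updates) / total
            for i in range(len(values))
        ]
    return aggregated


# Usage: two clients with different data sizes and their own heads.
client_a = {"lora.A": [0.0, 4.0], "head.w": [0.5]}
client_b = {"lora.A": [4.0, 0.0], "head.w": [-0.5]}
shared_a, head_a = split_client_state(client_a)
shared_b, head_b = split_client_state(client_b)
global_adapter = aggregate_shared_adapter([(1, shared_a), (3, shared_b)])
print(global_adapter, head_a, head_b)
```

Only the low-rank adapter crosses the network in this sketch, which is what makes the scheme communication-efficient; each head stays fitted to its client's own preference distribution.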
[119] Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States cs.LG | cs.AI | cs.CLPDF
Yurun Yuan, Tengyang Xie
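TL;DR: This paper addresses the “capability ceiling” of reinforcement learning (RL) in LLM post-training by reintroducing explicit Markov states. It argues that current LLM post-training uses an ever-expanding action history as the state, whereas classical RL relies on compact, informative Markov states, and that this is the structural bottleneck that leaves RL for LLMs refining pre-trained patterns rather than discovering new strategies. Through theoretical guarantees and experiments on a suite of complex logic puzzles, it shows that introducing estimated Markov states can significantly reduce sample complexity and break the performance boundaries of standard RL post-training.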
Details
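Motivation: RL in LLM post-training faces a “capability ceiling”: it does not discover new strategies as it does in classical systems, and instead only refines patterns already present in the pre-trained weights.
Result: On a suite of complex logic puzzles, the approach with Markov states consistently breaks the performance boundaries of standard RL post-training.
Insight: The core contribution is to reintroduce the classical RL principle of explicit Markov states into LLM post-training, replacing the current “history-as-state” modeling. This offers a key structural improvement for unlocking open-ended discovery and genuinely new reasoning capabilities in generative AI.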
Abstract: Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent “capability ceiling”: unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond “history-as-state” modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
[120] Subspace Kernel Learning on Tensor Sequences cs.LG | cs.AI | cs.CVPDF
Lei Wang, Xi Ding, Yongsheng Gao, Piotr Koniusz
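TL;DR: This paper proposes Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for higher-order tensor sequence data. It compares mode-wise subspaces derived from tensor unfoldings to obtain an expressive and robust similarity measure, and uses a scalable Nyström kernel linearization to handle large-scale data. Its key innovation is an uncertainty-aware subspace weighting mechanism that adaptively down-weights unreliable mode components. It achieves state-of-the-art performance on action recognition benchmarks.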
Details
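Motivation: Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes efficiently.
Result: Extensive evaluation on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) shows that UKTL achieves state-of-the-art performance and superior generalization.
Insight: The main contributions are an uncertainty-aware subspace weighting mechanism that improves robustness and interpretability; structured kernel compositions that naturally incorporate multi-way and multi-mode interactions; and a scalable Nyström linearization with dynamically learned pivot tensors. Together these establish a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.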
Abstract: Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for $M$-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling an expressive and robust similarity measure. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft $k$-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates both multi-way and multi-mode interactions through structured kernel compositions. Extensive evaluations on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.
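For readers unfamiliar with Nyström kernel linearization, the generic textbook form is given below. The notation is an assumption made for this note: $k$ is any positive semi-definite kernel, $Z_1, \dots, Z_m$ are the pivots, and $\mathcal{X}, \mathcal{Y}$ are inputs. UKTL's actual kernel, uncertainty weighting, and pivot learning are not specified in the abstract and may differ.

$$
K_{mm} = \big[\, k(Z_i, Z_j) \,\big]_{i,j=1}^{m}, \qquad
\mathbf{k}_m(\mathcal{X}) = \big[\, k(\mathcal{X}, Z_1), \dots, k(\mathcal{X}, Z_m) \,\big]^{\top}, \qquad
\phi(\mathcal{X}) = K_{mm}^{-1/2}\, \mathbf{k}_m(\mathcal{X})
$$

$$
\phi(\mathcal{X})^{\top} \phi(\mathcal{Y})
= \mathbf{k}_m(\mathcal{X})^{\top} K_{mm}^{-1}\, \mathbf{k}_m(\mathcal{Y})
\approx k(\mathcal{X}, \mathcal{Y})
$$

The explicit $m$-dimensional feature map $\phi$ lets a linear model stand in for the kernel machine, with cost governed by the number of pivots $m$ rather than the number of training samples. When $K_{mm}$ is singular, a pseudo-inverse is used in place of $K_{mm}^{-1}$.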
cs.AI [Back]
[121] Teaching an Agent to Sketch One Part at a Time cs.AI | cs.CV | cs.GR | cs.LGPDF
Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
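TL;DR: This paper proposes a method for generating vector sketches one part at a time. It trains a multi-modal language model-based agent with supervised fine-tuning followed by a novel multi-turn process-reward reinforcement learning stage, enabling controllable text-to-vector sketch generation. The method relies on the newly built ControlSketch-Part dataset, which carries rich part-level annotations produced by an automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to them.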
Details
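Motivation: Existing text-to-vector sketch generation methods lack interpretability, controllability, and local editability. The paper aims to generate sketches part by part, making the generation process more flexible and giving users more control.
Result: Experiments indicate that the method achieves interpretable, controllable, and locally editable text-to-vector sketch generation. No specific benchmark is named, but the results highlight the contribution of structured part-level data and visual feedback.
Insight: The contributions include training the agent with multi-turn process-reward reinforcement learning, the ControlSketch-Part dataset with part-level annotations, and a generic automatic annotation pipeline. Viewed objectively, structured part decomposition and process feedback make the generation task more modular and controllable, offering a new approach to vector graphics generation.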
Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback through the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.