Table of Contents

cs.CL

[1] Rethinking LLM Human Simulation: When a Graph is What You Need

Joseph Suh,Suhong Moon,Serina Chang

Main category: cs.CL

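TL;DR: The paper asks whether large language models (LLMs) are strictly necessary for human simulation. The authors propose GEMS, a lightweight alternative based on graph neural networks (GNNs), which matches or outperforms LLMs on discrete choice simulation tasks while being more efficient, interpretable, and transparent.

Motivation: LLMs are widely used to simulate humans, but they are computationally expensive and lack transparency. The authors question whether they are needed and ask whether graph-based models can do as well or better on some tasks.

Contribution: Proposes the GEMS framework, which casts discrete choice simulation as link prediction on a graph, combining relational knowledge with language representations and demonstrating the potential of lightweight models.

Method: GEMS uses a GNN to simulate human discrete choices, modeling the task as link prediction on a graph and bringing in language representations only when needed.

Result: Experiments across three settings on three datasets show that GEMS is comparable to or better than LLMs in accuracy and far more efficient, while also improving interpretability and transparency.

Insight: For some human simulation tasks, a lightweight graph model can replace an LLM, cutting compute cost and giving clearer model explanations.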

Abstract: Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at https://github.com/schang-lab/gems.
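
To make the link-prediction framing concrete, here is a minimal sketch. It is not the GEMS implementation: the dot-product scorer, the tiny hand-written embeddings, and the name choice_probabilities are illustrative assumptions. Individuals and options are treated as nodes, and predicting a choice means scoring each candidate individual-to-option edge.

```python
import math

def choice_probabilities(person_vec, option_vecs):
    """Softmax over dot-product scores of candidate person-to-option edges."""
    scores = [sum(p * o for p, o in zip(person_vec, opt)) for opt in option_vecs]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings; a trained GNN would produce these from the graph.
person = [0.5, -1.0, 2.0]
options = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

probs = choice_probabilities(person, options)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely option
```

Because the output is a distribution over a fixed set of options, the same scorer can be evaluated with ordinary classification accuracy.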

[2] Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

Jonathan Liu,Haoling Qiu,Jonathan Lasko,Damianos Karakos,Mahsa Yarmohammadi,Mark Dredze

Main category: cs.CL

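TL;DR: LLMs used in medical contexts can give inconsistent advice because of biases and errors. The authors build automated tooling to evaluate these problems, find that LLM evaluators agree poorly with one another, and recommend using multiple LLM evaluators so that results generalize.

Motivation: LLMs in medical applications can hallucinate, omit information, and show bias, and evaluating their consistency without ground-truth data is a key problem.

Contribution: Builds infrastructure for automated query generation and evaluation, reveals the low agreement among LLM evaluators, and offers recommendations for improving evaluation practice.

Method: 1) Generate realistic medical queries by sampling along multiple dimensions (patient demographics, histories, disorders, and writing styles); 2) detect hallucinations and omissions using LLM-as-a-judge setups and agentic workflows.

Result: LLM evaluators reach an average Cohen's Kappa of only 0.118, and only specific (answering, evaluation) LLM pairs produce statistically significant differences, so such findings may not generalize.

Insight: Using multiple LLM evaluators and publishing inter-LLM agreement metrics makes results more trustworthy, especially when no ground-truth data is available.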

Abstract: Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen’s Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
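
For readers unfamiliar with the agreement statistic quoted above, the following sketch computes Cohen's kappa for two annotators using the standard definition, kappa = (p_o - p_e) / (1 - p_e). The label lists are made-up illustrations, not data from the paper, and the handling of the degenerate single-label case is an assumption of this sketch.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each annotator's label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both used one identical label; kappa is formally undefined
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from two LLM judges on eight answers.
judge_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
judge_b = ["yes", "no", "no", "yes", "yes", "no", "no", "no"]
kappa = cohens_kappa(judge_a, judge_b)  # p_o = 0.625, p_e = 0.5
```

A value near 0 means agreement is close to what chance alone would produce, which is why an average of 0.118 signals weak agreement among judges.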

[3] LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Liuhao Lin,Ke Li,Zihan Xu,Yuchen Shi,Yulei Qin,Yan Zhang,Xing Sun,Rongrong Ji

Main category: cs.CL

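TL;DR: LTD-Bench is a benchmark that has LLMs produce drawings via dot matrices or executable code, directly exposing their limitations in spatial reasoning and compensating for the shortcomings of conventional numerical metrics.

Motivation: Current LLM evaluation relies on opaque numerical metrics that hide deficiencies in spatial reasoning and offer little intuition. LTD-Bench aims to fill this gap with an intuitive, visual means of evaluation.

Contribution: 1. Designs the LTD-Bench benchmark, which evaluates LLM spatial reasoning through drawing and code generation tasks; 2. reveals severe deficiencies in LLMs' bidirectional mapping between language and spatial concepts; 3. offers a visual analysis approach for diagnosing model similarity.

Method: LTD-Bench pairs generation tasks (testing spatial imagination) with recognition tasks (assessing spatial perception) across three difficulty levels, systematically evaluating both directions of the language-spatial mapping.

Result: Even LLMs that perform well on traditional benchmarks show severe deficiencies in bidirectional spatial reasoning, exposing a fundamental limitation as world models.

Insight: Visual outputs make a model's spatial reasoning deficiencies directly observable and provide a new tool for model evaluation and diagnosis.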

Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench’s visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.

[4] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

Wongyu Kim,Hochang Lee,Sanghak Lee,Yoonsung Kim,Jaehyun Park

Main category: cs.CL

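TL;DR: The paper proposes M-Solomon, a multimodal embedder that uses adaptive query augmentation to augment queries only when necessary, reducing latency while improving performance.

Motivation: Existing LLM-based embedders augment every query, which adds unnecessary latency, and augmentation can hurt performance for some queries. Such methods have also not been explored in multimodal settings.

Contribution: Proposes M-Solomon, a universal multimodal embedder that adaptively decides when to augment a query. Dataset-level grouping and synthesized augmentations are used to balance performance and latency.

Method: 1. Divide training queries into two groups at the dataset level: those that require augmentation and those that do not; 2. use a multimodal LLM to generate suitable augmentations; 3. train adaptive query augmentation, so the model generates an augmentation with the prefix /augment for queries that need one and the string /embed for the rest.

Result: M-Solomon beats the no-augmentation baseline by a large margin and also outperforms the always-augment baseline, with much lower embedding latency.

Insight: Adaptive query augmentation is promising in multimodal settings: it trades off performance against efficiency and avoids unnecessary computation.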

Abstract: Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
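
The /augment versus /embed convention can be illustrated with a small routing function. This is only a sketch: the abstract states the two output forms, but how M-Solomon combines an augmentation with the query is not specified here, so the simple append and the fallback behavior below are assumptions, and build_embedding_input is a hypothetical name.

```python
def build_embedding_input(query: str, generated: str) -> str:
    """Decide what text to embed from the model's generated control string."""
    generated = generated.strip()
    if generated.startswith("/augment"):
        # Assumption: the augmentation text follows the prefix and is appended to the query.
        augmentation = generated[len("/augment"):].strip()
        return f"{query} {augmentation}".strip()
    # "/embed" (or any unexpected output) means: embed the original query unchanged.
    return query

augmented = build_embedding_input("red shoes", "/augment running sneakers in red")
plain = build_embedding_input("red shoes", "/embed")
```

The latency saving comes from the second branch: generating the short string /embed is much cheaper than generating a full augmentation for every query.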

[5] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Yudong Li,Zhongliang Yang,Kejiang Chen,Wenxuan Wang,Tianxin Zhang,Sifang Wan,Kecheng Wang,Haitian Li,Xu Wang,Lefan Cheng,Youdan Yang,Baocheng Chen,Ziyu Liu,Yufei Sun,Liyan Wu,Wenya Wen,Xingchi Gu,Peiru Yang

Main category: cs.CL

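TL;DR: LiveSecBench is a continuously updated safety benchmark for Chinese-language LLMs that covers six key dimensions and provides a public leaderboard.

Motivation: To provide a dynamically updated safety benchmark for LLM application scenarios in Chinese-language contexts, ensuring models are safe with respect to law, ethics, and related concerns.

Contribution: Proposes LiveSecBench, presented as the first dynamic safety benchmark for Chinese, covering six key dimensions and supporting continuous updates.

Method: Designs six evaluation dimensions grounded in Chinese legal and social frameworks, and updates the benchmark dynamically to incorporate new threat vectors (Text-to-Image Generation Safety and Agentic Safety are planned for the next update).

Result: 18 LLMs have been evaluated so far, giving a landscape of AI safety in the Chinese-language context.

Insight: A dynamic update mechanism and localized design are essential for evaluating the safety of Chinese-language LLMs.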

Abstract: In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.

[6] AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda

Mohd Nauman,Sravan Gvm,Vijay Devane,Shyam Pawar,Viraj Thakur,Kundeshwar Pundalik,Piyush Sawarkar,Rohit Saluja,Maunendra Desarkar,Ganesh Ramakrishnan

Main category: cs.CL

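TL;DR: AyurParam-2.9B is a bilingual language model for Ayurveda, fine-tuned from Param-1-2.9B on a high-quality dataset. It outperforms comparable open-source models on the BhashaBench-Ayur benchmark, showing the need for domain specialization.

Motivation: Mainstream LLMs underperform on tasks that need deep domain knowledge; traditional medical systems such as Ayurveda in particular need specialized models adapted to their cultural and clinical complexity.

Contribution: Proposes AyurParam-2.9B, a bilingual (English and Hindi) language model for Ayurveda, with performance improved through a high-quality dataset and rigorous annotation.

Method: Fine-tunes Param-1-2.9B on an expert-curated Ayurveda dataset that includes context-aware, reasoning, and objective-style Q&A.

Result: On BhashaBench-Ayur, it surpasses open-source instruction-tuned models in its size class (1.5–3B parameters) and is competitive with much larger models, demonstrating the effectiveness of domain adaptation.

Insight: Domain specialization requires high-quality datasets and supervision to ensure model reliability and cultural congruence.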

Abstract: Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam’s dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5–3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.

[7] Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance

Kentaro Ueda,François Portet,Hirohiko Suwa,Keiichi Yasumoto

Main category: cs.CL

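TL;DR: The paper studies how to build financial LLMs by merging domain-specific continual pre-training (CPT) expert models, filling a gap in research on CPT model merging. It proposes a three-stage evaluation framework and compares three merging methods, offering a new route to multi-skill LLMs.

Motivation: General-purpose LLMs struggle in specialized domains such as finance, which require domain knowledge, mathematical reasoning, and multilingual processing together. Training a multi-skill model directly is costly and unstable, so merging CPT models is studied as an alternative.

Contribution: 1) The first systematic analysis of CPT model merging; 2) a three-stage evaluation framework (knowledge recovery, complementarity, and emergence); 3) a comparison of three merging methods (Task Arithmetic, TIES, DARE-TIES), finding that merging expert models improves performance and can yield emergent cross-domain skills.

Method: Applies three merging methods (Task Arithmetic, TIES, DARE-TIES) and evaluates them on a financial benchmark curated from 18 tasks across 8 datasets, focusing on knowledge recovery, complementarity, and emergence.

Result: 1) Merging an expert with its base model recovers general knowledge lost during CPT; 2) merging experts improves performance and can yield emergent cross-domain skills; 3) Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust.

Insight: Model similarity correlates with merging success, but emergent skills depend on more complex factors. The study offers a principled framework and practical guidance for building multi-skill LLMs from existing assets.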

Abstract: While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) “experts” offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
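
As a rough illustration of the simplest of the three methods, Task Arithmetic adds scaled "task vectors" (expert weights minus base weights) back onto the base model. The sketch below uses toy two-element weight lists; the parameter name scale, the dictionary layout, and the values are assumptions for illustration and do not reflect the paper's models or settings.

```python
def task_arithmetic(base, experts, scale=1.0):
    """Merge experts into the base: merged = base + scale * sum(expert - base)."""
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + scale * sum(expert[name][i] - b for expert in experts)
            for i, b in enumerate(base_values)
        ]
    return merged

# Toy "models" with a single two-element weight tensor each (hypothetical values).
base = {"w": [1.0, 2.0]}
finance_expert = {"w": [1.5, 2.0]}   # moved the first weight during CPT
math_expert = {"w": [1.0, 3.0]}      # moved the second weight during CPT

merged = task_arithmetic(base, [finance_expert, math_expert], scale=0.5)
```

The single scaling factor is the kind of hyperparameter that has to be tuned for this method. TIES and DARE-TIES add further steps before summation (trimming and sign resolution, and random dropping with rescaling, respectively), which are not shown here.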

[8] CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Ehsan Aghazadeh,Ahmad Ghasemi,Hedyeh Beyhaghi,Hossein Pishro-Nik

Main category: cs.CL

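TL;DR: The paper proposes Confidence-Guided Early Stopping (CGES), a Bayesian framework that dynamically decides when to stop querying an LLM, greatly reducing the number of calls while preserving accuracy.

Motivation: Existing self-consistency methods (such as majority voting) require a fixed number of queries, which can waste resources and can fail when the correct answer is rare. CGES aims to improve efficiency and accuracy with a dynamic stopping strategy.

Contribution: 1. Proposes the CGES framework, which uses confidence signals to adapt the number of queries; 2. provides theoretical and experimental validation, including under noisy confidence signals; 3. across five reasoning benchmarks, cuts model calls by about 69% with minimal accuracy loss (within 0.06 percentage points).

Method: 1. Build a posterior distribution over candidate answers within a Bayesian framework; 2. use scalar confidence signals (derived from token probabilities or reward models) to guide early stopping; 3. stop sampling once a candidate's posterior mass exceeds a threshold.

Result: On reasoning tasks, CGES reduces model calls by about 69% on average (for example, from 16.0 to 4.9) while matching self-consistency accuracy within 0.06 percentage points.

Insight: 1. Dynamic stopping substantially improves efficiency; 2. the design of the confidence signal is critical to performance; 3. the Bayesian framework is robust when few samples are available.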

Abstract: Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
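
The stopping rule can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a fixed, known candidate set, a uniform prior, and a likelihood in which a sample with confidence c gives weight c to its own answer and spreads 1 - c evenly over the others. The function name cges_sketch and all parameter values are hypothetical.

```python
import math

def cges_sketch(sample_fn, candidates, threshold=0.95, max_calls=16):
    """Sample (answer, confidence) pairs until one candidate's posterior passes the threshold."""
    k = len(candidates)
    if k < 2:
        raise ValueError("need at least two candidate answers")
    log_w = {c: 0.0 for c in candidates}            # uniform prior, in log space
    for calls in range(1, max_calls + 1):
        answer, conf = sample_fn()
        conf = min(max(conf, 1e-6), 1 - 1e-6)       # keep the logarithms finite
        for c in candidates:
            # Assumed likelihood: conf for the sampled answer, the rest shared equally.
            like = conf if c == answer else (1 - conf) / (k - 1)
            log_w[c] += math.log(like)
        top = max(log_w.values())
        weights = {c: math.exp(v - top) for c, v in log_w.items()}
        total = sum(weights.values())
        best = max(weights, key=weights.get)
        posterior = weights[best] / total
        if posterior >= threshold:
            break                                   # confident enough: stop sampling early
    return best, posterior, calls

# Scripted stand-in for an LLM call that returns (answer, confidence).
samples = iter([("B", 0.9), ("B", 0.8), ("A", 0.5)])
best, posterior, calls = cges_sketch(lambda: next(samples), ["A", "B", "C", "D"])
```

In this toy run the posterior for "B" is 0.9 after the first sample and passes 0.95 after the second, so the third call is never made. For reference, the abstract's example of 16.0 calls reduced to 4.9 corresponds to (16.0 - 4.9) / 16.0, or about 69 percent fewer calls.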

[9] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation

Renfei Dang,Peng Hu,Changjiang Gao,Shujian Huang

Main category: cs.CL

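TL;DR: This study examines hallucinations that arise after new knowledge is introduced into LLMs, shows that high unfamiliarity of a particular knowledge type is the main driver, and proposes a method called KnownPatch to mitigate the problem.

Motivation: Prior work found that introducing new knowledge during LLM fine-tuning can cause hallucinations on tasks involving known information, but the specific manifestations and mechanisms had not been studied in depth. This paper aims to fill that gap.

Contribution: 1) Designs the Biography-Reasoning dataset for fine-grained analysis; 2) finds that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is the main driver of hallucinations; 3) proposes KnownPatch, which mitigates hallucinations and improves the model's attention patterns.

Method: Analyzes multiple knowledge types and task types using the Biography-Reasoning dataset, and proposes KnownPatch, which mitigates hallucinations by adding a small number of known-knowledge samples in the later stages of training.

Result: KnownPatch effectively reduces new-knowledge-induced hallucinations, improves the model's attention to key entities in the question, and improves performance.

Insight: Learning new knowledge draws the model's attention away from key entities and toward the surrounding context, raising hallucination risk; this attention pattern can also propagate to similar contexts, widening the scope of hallucinations.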

Abstract: Previous studies show that introducing new knowledge during large language model (LLM) fine-tuning can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model’s attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model’s attention on key entities, accompanied by improved performance.

[10] Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

Bowen Jin,TJ Collins,Donghan Yu,Mert Cemri,Shenao Zhang,Mengyu Li,Jay Tang,Tian Qin,Zhiyang Xu,Jiarui Lu,Guoli Yin,Jiawei Han,Zirui Wang

Main category: cs.CL

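TL;DR: This paper proposes CoRL, a reinforcement learning framework for a centralized multi-agent LLM system in which a controller LLM selectively coordinates expert models to maximize task performance while keeping cost under control.

Motivation: Existing decentralized frameworks invoke multiple LLMs for every input, leading to high and uncontrolled inference costs. The paper aims to design a cost-efficient, cost-controllable centralized multi-LLM system.

Contribution: Proposes the CoRL framework, which uses reinforcement learning to optimize the performance-cost trade-off and supports adaptive behavior under multiple budget conditions.

Method: Within a reinforcement learning framework, a controller LLM selects expert models based on the input, optimizing the dual objectives of maximizing task performance and minimizing inference cost under different budget conditions.

Result: On four benchmarks, CoRL surpasses the best expert LLM under high-budget settings and maintains strong performance under low budgets.

Insight: Centralized coordination gives multi-agent LLM systems scalability and cost efficiency.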

Abstract: Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.

[11] Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval

Hung-Ting Chen,Xiang Liu,Shauli Ravfogel,Eunsol Choi

Main category: cs.CL

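TL;DR: The paper proposes AMER, a retrieval model that generates multiple query vectors to address the limitations of conventional retrievers when the distribution of relevant documents is multimodal (has several modes), and reports substantial gains on synthetic and real-world datasets.

Motivation: Conventional retrievers typically generate a single query vector, which cannot capture a multimodal distribution of relevant documents and therefore performs poorly on multi-target retrieval.

Contribution: Proposes AMER, which autoregressively generates multiple query vectors and substantially improves retrieval when the target distribution is multimodal.

Method: AMER autoregressively generates multiple query vectors and uses all of them to retrieve documents from the corpus, capturing the diversity of target documents.

Result: On synthetic data, AMER performs 4x better than a single-embedding model; on two real-world datasets, it achieves relative gains of 4% and 21%.

Insight: Multi-query-vector retrievers help most when the embeddings of the target documents are less similar to one another, suggesting a new direction for retriever design.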

Abstract: Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4% and 21% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of the dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
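
A toy example shows why a single query vector struggles when the targets are far apart. The sketch below is not AMER itself: the query vectors are written by hand rather than generated autoregressively, and merging results by each document's best score is an assumption, since the abstract only says that all predicted vectors are used for retrieval.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vecs, doc_vecs, k=1):
    """Take the top-k documents per query vector, merged by each document's best score."""
    best = {}
    for q in query_vecs:
        ranked = sorted(range(len(doc_vecs)), key=lambda i: dot(q, doc_vecs[i]), reverse=True)
        for i in ranked[:k]:
            best[i] = max(best.get(i, float("-inf")), dot(q, doc_vecs[i]))
    return sorted(best, key=best.get, reverse=True)

# Documents 0 and 1 are two distinct targets; document 2 is a distractor lying between them.
docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]

single = retrieve([[0.7, 0.7]], docs, k=1)             # one "averaged" query vector
multi = retrieve([[1.0, 0.0], [0.0, 1.0]], docs, k=1)  # one query vector per target
```

Here the averaged query lands on the distractor, while the two separate vectors each recover one target, which mirrors the multimodal-distribution argument in the abstract.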

[12] MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan,Jie Lou,Zichao Li,Jiawei Chen,Yaojie Lu,Hongyu Lin,Le Sun,Debing Zhang,Xianpei Han

Main category: cs.CL

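TL;DR: MemSearcher is a search agent workflow that iteratively maintains a compact memory and combines it with the current turn, substantially improving the efficiency and accuracy of multi-turn interaction.

Motivation: Existing search agents either keep the entire interaction history, which incurs high computation and memory costs, or use only the current turn, which discards important information. Neither balances efficiency and accuracy well.

Contribution: 1. Proposes the MemSearcher workflow, which dynamically maintains a compact memory; 2. designs multi-context GRPO, which jointly optimizes reasoning, search, and memory management; 3. significantly outperforms baselines on multiple benchmarks.

Method: The multi-context GRPO framework samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them, enabling end-to-end reinforcement learning.

Result: MemSearcher significantly outperforms baselines on seven public benchmarks, and the 3B-based model even outperforms 7B-based baselines, showing that efficiency and accuracy can be balanced.

Insight: Combining a dynamically maintained compact memory with multi-context optimization is key to improving search agent performance and is also an effective way to reduce computational overhead.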

Abstract: Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user’s question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.

[13] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Amanda Bertsch,Adithya Pratapa,Teruko Mitamura,Graham Neubig,Matthew R. Gormley

Main category: cs.CL

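TL;DR: Oolong is a new long-context reasoning benchmark that requires models to analyze chunks of text and aggregate the results to answer distributional questions. Current frontier models achieve less than 50% accuracy at a 128K context.

Motivation: As context lengths grow, existing long-context evaluations rely mainly on retrieval tasks and cannot fully assess whether models use long contexts effectively. Oolong aims to fill this gap.

Contribution: Proposes the Oolong benchmark, comprising synthetic tasks (Oolong-synth) and real-world tasks (Oolong-real), focused on atomic-level text analysis and aggregate reasoning.

Method: Designs tasks that require classification, counting, and reasoning over temporal and user relations, and releases the data and evaluation harness.

Result: Frontier models (GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro) all score below 50% accuracy on both splits at 128K.

Insight: Current models remain markedly weak at aggregate reasoning over long contexts, and Oolong provides a benchmark for future model development.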

Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.

cs.CV

[14] iFlyBot-VLA Technical Report

Yuan Zhang,Chenyu Xue,Wenjie Xu,Chao Ji,Jiajia Wu,Jia Pan

Main category: cs.CV

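TL;DR: iFlyBot-VLA is a large-scale Vision-Language-Action (VLA) model that uses a novel dual-level action representation framework and a mixed training strategy to improve action generation and 3D perception.

Motivation: Existing vision-language models are limited in action generation and spatiotemporal reasoning and cannot directly support complex manipulation tasks. To address this, the paper proposes a dual-level representation framework that combines implicit intentions with explicit actions.

Contribution: 1) A pretrained latent action model; 2) a dual-level action representation framework (implicit intentions and explicit actions); 3) a mixed training strategy that strengthens 3D perception; 4) a plan to open-source part of the dataset to support community research.

Method: 1) Pretrain a latent action model; 2) apply dual-level action supervision (latent actions capturing implicit intentions, and discrete action tokens encoding explicit dynamics); 3) train on a mix of robot trajectories and general and spatial QA data; 4) align the representation spaces of language, vision, and action.

Result: The model performs strongly on the LIBERO Franka benchmark and achieves competitive success rates on real-world tasks, supporting its generalization and 3D reasoning abilities.

Insight: 1) Combining implicit intentions with explicit actions is key; 2) the mixed training strategy substantially improves performance; 3) open-sourcing the dataset will help advance community research.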

Abstract: We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.

[15] Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound

Edoardo Conti,Riccardo Rosati,Lorenzo Federici,Adriano Mancini,Maria Chiara Fiorentin

Main category: cs.CV

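TL;DR: This study presents the first comprehensive evaluation of foundation models in fetal ultrasound imaging under low inter-class variability, focusing on how well DINOv3 distinguishes anatomically similar regions (such as fetal brain standard planes), and argues for the importance of domain-adaptive pretraining.

Motivation: Vision foundation models such as DINOv3 transfer well to medical domains, but whether they can reliably discriminate highly similar anatomical structures has not been systematically studied. The paper fills this gap using the challenging case of fetal brain ultrasound standard planes (TT, TV, TC).

Contribution: 1. Builds FetalUS-188K, a unified multicenter fetal ultrasound benchmark dataset; 2. provides the first evaluation of DINOv3 under low inter-class variability; 3. shows that domain-adaptive pretraining is necessary for model robustness.

Method: 1. Pretrain DINOv3 with self-supervised learning to learn ultrasound-specific features; 2. evaluate with two standardized adaptation protocols, linear probing and full fine-tuning; 3. compare two initialization schemes (pretraining on fetal ultrasound versus natural-image weights).

Result: Models with domain-adaptive pretraining clearly outperform those initialized from natural images, with weighted F1-score improvements of up to 20 percent. In particular, when distinguishing intermediate planes such as TV, the model preserves key echogenic and structural cues.

Insight: Generic foundation models generalize poorly under low inter-class variability, and domain-specific pretraining is key to robustness and clinical reliability in fetal brain ultrasound imaging.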

Abstract: Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes, namely transthalamic (TT), transventricular (TV), and transcerebellar (TC), which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.

[16] Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

Tuan Truong,Guillermo Jimenez Perez,Pedro Osorio,Matthias Lenga

Main category: cs.CV

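TL;DR: The paper systematically benchmarks three large multimodal models (LMMs) on detecting burned-in PHI in medical images. LMMs outperform conventional methods at OCR, but the gain in overall PHI detection accuracy is limited; the paper also proposes LMM selection and deployment strategies for different scenarios.

Motivation: Detecting PHI in medical images is essential for patient privacy and regulatory compliance. Traditional methods rely mainly on OCR models, while emerging LMMs offer new opportunities for text extraction and semantic analysis.

Contribution: Systematically benchmarks three LMMs (GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B), provides empirical data on OCR performance and PHI detection accuracy, and gives recommendations for LMM selection and deployment.

Method: Uses two pipeline configurations, one dedicated to text analysis alone and one combining OCR with semantic analysis, and compares LMMs against conventional OCR models.

Result: LMMs clearly outperform conventional methods on OCR (WER and CER), but this does not consistently translate into higher overall PHI detection accuracy; the strongest gains appear on cases with complex imprint patterns. Where text regions are readable with sufficient contrast, the pipeline configurations perform similarly.

Insight: LMMs excel at OCR, but improvements in PHI detection depend on the scenario and model choice; scalable, modular infrastructure is key to deployment.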

Abstract: The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed- and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
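
The abstract reports OCR quality as word error rate (WER) and character error rate (CER). As a minimal sketch of how these metrics are commonly defined (edit distance divided by reference length), and not the authors' evaluation code, the following may help. The function names and the sample strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up burned-in text where the OCR output confuses the letter O with the digit 0.
truth = "PATIENT JOHN DOE 1970"
ocr_output = "PATIENT JOHN D0E 1970"
print(wer(truth, ocr_output), cer(truth, ocr_output))
```

In this example a single character error costs one of four words at the word level but only one of 21 characters at the character level, which is why the two metrics are usually reported together.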

[17] StrengthSense: A Dataset of IMU Signals Capturing Everyday Strength-Demanding Activities

Zeyu Yang,Clayton Souza Leite,Yu Xiao

Main category: cs.CV

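TL;DR: This paper introduces StrengthSense, an open dataset of IMU signals covering 11 strength-demanding activities and 2 non-strength-demanding activities, intended to fill the data gap in monitoring strength-demanding activities.

Details Motivation: Existing datasets do not capture strength-demanding activities comprehensively, which limits related research and applications.

Contribution: The StrengthSense dataset provides IMU signals for 11 strength-demanding and 2 non-strength-demanding activities, supporting research on muscular strength monitoring and activity recognition.

Method: Data were collected from 29 healthy subjects wearing 10 IMUs and annotated using video recordings. Data accuracy was checked by comparing joint-angle estimates from the two sources.

Result: Joint angles estimated from the IMUs agree closely with those extracted from video, supporting the reliability of the data.

Insight: StrengthSense offers a foundation for developing new activity recognition algorithms and health monitoring applications.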

Abstract: Tracking strength-demanding activities with wearable sensors like IMUs is crucial for monitoring muscular strength, endurance, and power. However, there is a lack of comprehensive datasets capturing these activities. To fill this gap, we introduce StrengthSense, an open dataset that encompasses IMU signals capturing 11 strength-demanding activities, such as sit-to-stand, climbing stairs, and mopping. For comparative purposes, the dataset also includes 2 non-strength demanding activities. The dataset was collected from 29 healthy subjects utilizing 10 IMUs placed on limbs and the torso, and was annotated using video recordings as references. This paper provides a comprehensive overview of the data collection, pre-processing, and technical validation. We conducted a comparative analysis between the joint angles estimated by IMUs and those directly extracted from video to verify the accuracy and reliability of the sensor data. Researchers and developers can utilize StrengthSense to advance the development of human activity recognition algorithms, create fitness and health monitoring applications, and more.

[18] Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Soham Joshi,Shwet Kamal Mishra,Viswanath Gopalakrishnan

Main category: cs.CV

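TL;DR: The paper proposes an end-to-end pipeline for automatically synthesizing text-based visual question answering (text-VQA) datasets. Using OCR, ROI detection, caption generation, and question generation, it produces a large-scale dataset of about 72K QA pairs.

Details Motivation: Building text-VQA datasets has traditionally relied on manual annotation, which is slow and laborious. With multimodal models and OCR systems now mature, an automated pipeline for synthesizing text-VQA data efficiently is needed.

Contribution: According to the authors, this is the first pipeline to automatically synthesize and validate a large-scale text-VQA dataset. It covers OCR, ROI detection, caption generation, and question generation, and yields a dataset of about 72K QA pairs.

Method: The pipeline integrates OCR text detection and recognition, ROI detection, caption generation, and question generation, with multiple models working together to synthesize and validate QA pairs automatically.

Result: A large-scale text-VQA dataset of about 72K QA pairs built on about 44K images, demonstrating the feasibility and scalability of the pipeline.

Insight: Modular design combined with cooperating multimodal models can reduce manual annotation cost and supply high-quality data for text-VQA tasks.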

Abstract: Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.

[19] Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study

Yue Yang,Fabian Necker,Christoph Leuze,Michelle Chen,Andrey Finegersh,Jake Lee,Vasu Divi,Bruce Daniel,Brian Hargreaves,Jie Ying Wu,Fred M Baik

Main category: cs.CV

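TL;DR: The paper develops and evaluates a depth-only, markerless augmented reality (AR) registration pipeline for surgical guidance on a head-mounted display, and assesses its accuracy on small or low-curvature anatomies in real operative settings.

Details Motivation: Conventional surgical navigation depends on markers. The study aims to develop a markerless AR registration method that makes guidance more convenient and accurate, particularly for small or low-curvature anatomies.

Contribution: (1) A depth-only, markerless AR registration pipeline; (2) validation of its accuracy in real surgical settings; (3) a demonstration of its applicability across several anatomies.

Method: (1) Depth-bias correction; (2) human-in-the-loop initialization; (3) global and local registration. HoloLens 2 depth data are aligned to CT-derived skin meshes, and the surface-tracing error metric is validated with an AR-tracked tool.

Result: Preclinical validation showed close agreement between AR-traced and CT distances (median |Delta d| of 0.78 to 0.80 mm, RMSE of 0.97 to 1.20 mm). Clinically, median errors by anatomy ranged from 3.2 mm to 5.3 mm, approaching typical clinical error thresholds for moderate-risk tasks.

Insight: Markerless AR registration shows promise for surgical guidance, especially on small or low-curvature anatomies. Combining human-guided initialization with global-to-local registration enabled accurate alignment on these targets.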

Abstract: Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing “skin-to-bone” relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data points per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
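
The results above are summarized as a median per-point error and a "5 mm coverage". A minimal sketch of how such a summary could be computed follows, assuming that coverage means the fraction of points whose error is at or below the threshold (the abstract does not define it). The function name and the sample values are hypothetical and are not data from the study.

```python
from statistics import median


def summarize_errors(errors_mm, threshold_mm=5.0):
    """Return (median error, fraction of points with error <= threshold)."""
    if not errors_mm:
        raise ValueError("at least one error value is required")
    within = sum(1 for e in errors_mm if e <= threshold_mm)
    return median(errors_mm), within / len(errors_mm)


# Hypothetical per-point registration errors in millimetres for one trial.
trial_errors = [1.0, 2.5, 3.5, 4.5, 7.0]
med, coverage = summarize_errors(trial_errors)
print(f"median = {med} mm, 5 mm coverage = {coverage:.0%}")
```

Reporting both numbers is useful because the median hides the tail: a trial can have a low median while still leaving a sizeable share of points outside the tolerance.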

[20] From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera

Huahua Lin,Xiaohao Cai,Mark Nixon,James M. Mulqueeney,Thomas H. G. Ezard

Main category: cs.CV

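TL;DR: The paper proposes an end-to-end pipeline that combines instance segmentation with a dedicated chamber-ordering algorithm to automatically reconstruct the 3D growth trajectories of planktonic foraminifera from high-resolution CT scans, substantially reducing manual effort.

Details Motivation: Analyzing chamber growth trajectories in planktonic foraminifera matters for environmental research, but current practice relies on manual segmentation, which is slow and subjective. The study aims to address this.

Contribution: According to the authors, the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, covering the full path from instance segmentation to 3D growth trajectory reconstruction.

Method: Several instance segmentation methods, each optimized for distinct spatial features of the chambers, are combined with a chamber-ordering algorithm to reconstruct growth trajectories. The effect of segmentation accuracy on reconstruction is also evaluated.

Result: On expert-annotated datasets, the pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. The chamber-ordering algorithm remains robust to under-segmentation of small chambers.

Insight: Although the segmentation models under-segment smaller chambers, the robustness of the chamber-ordering algorithm keeps growth trajectory reconstruction consistent, laying a foundation for large-scale ecological studies.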

Abstract: Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.

[21] Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis

Zhicheng Wang,Junbiao Pang

Main category: cs.CV

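TL;DR: The paper proposes a cascaded framework based on PCA and RPCA for efficiently extracting pavement crack width from digital images. It handles complex, non-uniform crack boundaries and reports better computational efficiency and measurement accuracy than existing techniques.

Details Motivation: Accurate quantification of pavement crack width is essential for assessing structural integrity and guiding maintenance. Conventional approaches struggle because crack boundaries are complex and non-uniform and because measurements must be fast.

Contribution: A cascaded framework combining PCA and RPCA that extracts crack width efficiently and accurately, addressing the shortcomings of existing techniques.

Method: Three stages: 1) crack segmentation using existing algorithms; 2) PCA to determine the primary orientation axis of quasi-parallel cracks; 3) RPCA to extract the main propagation axis of irregular cracks.

Result: Evaluations on three public datasets show that the method outperforms existing techniques in both computational efficiency and measurement accuracy.

Insight: Combining PCA and RPCA handles the complex morphology and non-uniformity of cracks effectively and provides a more efficient tool for pavement maintenance.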

Abstract: Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
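
Stage (2) above uses PCA to find the primary orientation axis of a crack. A minimal sketch of that idea for a binary crack mask follows, under the assumption that the crack is a roughly straight band: the leading eigenvector of the 2D pixel covariance gives the orientation, and the extent along its normal gives a width estimate. This is not the authors' algorithm and omits the RPCA stage; the function names and the pixel-centre convention (adding 1 to the extent) are illustrative assumptions.

```python
import math


def principal_axis(pixels):
    """Centroid and orientation angle (radians) of the leading PCA axis of 2D points."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), theta


def width_across(pixels):
    """Extent of the pixels measured along the normal of the principal axis."""
    (mx, my), theta = principal_axis(pixels)
    nx, ny = -math.sin(theta), math.cos(theta)  # unit normal to the crack direction
    offsets = [(x - mx) * nx + (y - my) * ny for x, y in pixels]
    # +1 converts a distance between pixel centres into a count of pixels (assumption).
    return max(offsets) - min(offsets) + 1.0


# Synthetic horizontal crack: 20 pixels long and 3 pixels wide.
crack = [(x, y) for x in range(20) for y in range(3)]
print(principal_axis(crack)[1], width_across(crack))
```

A global axis like this only suits quasi-parallel cracks; for curved or branching cracks the orientation would need to be estimated locally, which is the case the abstract assigns to RPCA.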

[22] Autobiasing Event Cameras for Flickering Mitigation

Mehdi Sefidgar Dilmaghani,Waseem Shariff,Cian Ryan,Joe Lemley,Peter Corcoran

Main category: cs.CV

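TL;DR: The paper proposes an autobiasing mechanism for flicker mitigation in event cameras. A CNN dynamically adjusts the camera biases, which significantly improves performance under different lighting conditions.

Details Motivation: Event cameras are prone to flicker when light intensity changes rapidly, and traditional remedies rely on additional hardware or software filtering. This work aims to mitigate flicker autonomously using the event camera's own bias settings.

Contribution: An autonomous bias-tuning mechanism that needs no additional hardware or software filtering and mitigates flicker across a wide frequency range, from 25 Hz to 500 Hz.

Method: A simple convolutional neural network (CNN) detects instances of flicker in the spatial domain and dynamically adjusts specific event-camera biases to reduce its impact.

Result: Tests with a YOLO face detection framework show significant gains in detection confidence and in the percentage of frames with detected faces. The average gradient, used as a flicker indicator, decreased by 38.2% in well-lit conditions and by 53.6% in low-light conditions.

Insight: Pairing the event camera's inherent bias settings with a CNN enables efficient flicker mitigation and opens a path to using event cameras in adverse lighting.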

Abstract: Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range, from 25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in the spatial domain and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.

[23] Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

Jinhwan Seo,Yoonki Cho,Junhyug Noh,Sung-eui Yoon

Main category: cs.CV

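TL;DR: This technical report proposes a three-stage framework for the Grounded Video QA task. By introducing a trigger moment, it significantly improves spatio-temporal grounding and tracking.

Details Motivation: The GVQA task requires complex reasoning over video content together with visually grounding and tracking the referenced objects, and existing methods perform poorly on it.

Contribution: The main contribution is the concept of a trigger moment: the proposed CORTEX prompt pinpoints the single most visible frame of the target object, which then serves as an anchor for grounding and tracking.

Method: The GVQA task is decomposed into three stages (video reasoning and QA, spatio-temporal grounding, and tracking), with the CORTEX prompt used to identify the trigger moment.

Result: A HOTA score of 0.4968 on the GVQA task, a significant improvement over the previous year's winning score of 0.2704.

Insight: The trigger moment gives multimodal large language models an effective anchor for spatio-temporal grounding in video tasks and improves robustness.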

Abstract: In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year’s winning score of 0.2704 on the GVQA task.

[24] MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation

Jiawen Liu,Yuanbo Zeng,Jiaming Liang,Yizhen Yang,Yiheng Zhang,Enhui Cai,Xiaoqi Sheng,Hongmin Cai

Main category: cs.CV

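TL;DR: The paper proposes MM-UNet, a new architecture for accurate retinal vessel segmentation. By introducing Morph Mamba Convolution layers and Reverse Selective State Guidance modules, it improves segmentation accuracy and robustness and outperforms existing methods on two public datasets.

Details Motivation: Retinal vessels are extremely thin and branch in highly variable ways, which sets them apart from conventional segmentation targets and challenges the precision and robustness of existing methods.

Contribution: 1. The MM-UNet architecture, which uses Morph Mamba Convolution layers to strengthen perception of branching structures; 2. Reverse Selective State Guidance modules that improve geometric boundary awareness and decoding efficiency.

Method: 1. Morph Mamba Convolution layers replace pointwise convolutions and improve topological perception through morphology-aware feature sampling; 2. Reverse Selective State Guidance modules combine reverse guidance theory with state-space modeling.

Result: F1-score gains of 1.64% on DRIVE and 1.25% on STARE, supporting the effectiveness of the method.

Insight: Combining morphology awareness with state-space modeling handles the fine-grained branching structure of retinal vessels better and suggests a direction for similar segmentation tasks.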

Abstract: Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64% on DRIVE and 1.25% on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.
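
The gains above are stated as F1-score. As a minimal sketch of the standard pixel-wise F1 for binary masks (not the authors' evaluation script), the following assumes flattened 0/1 masks of equal length. The function name and the convention of returning 1.0 when both masks are empty are illustrative assumptions.

```python
def f1_score(pred, target):
    """Pixel-wise F1 between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 1.0  # both masks empty; treated as perfect agreement (assumption)
    return 2 * tp / denom


# Toy masks: 1 marks a vessel pixel, 0 marks background.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 1, 1, 0]
print(f1_score(predicted, reference))
```

Because true negatives do not enter the formula, F1 is less dominated by the large background area than plain pixel accuracy, which is one reason it is a common choice for thin structures such as vessels.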

[25] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers

Zhengjie Zhang,Xiaoxie Mao,Qihao Guo,Shaoting Zhang,Qi Huang,Mu Zhou,Fang Xie,Mianxin Liu

Main category: cs.CV

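TL;DR: The paper proposes a language-enhanced generative model, combining a large language model (LLM) with multimodal information fusion, to synthesize PET images from MRI and blood biomarkers. It targets the high cost and limited accessibility of PET in Alzheimer's disease diagnosis.

Details Motivation: Alzheimer's disease diagnosis relies heavily on amyloid-beta PET (Abeta-PET), but high cost and limited availability restrict its clinical use. The study asks whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans.

Contribution: 1. A language-enhanced generative model that uses an LLM and multimodal information fusion to synthesize PET images. 2. A fully automated AD diagnostic pipeline built on synthetic PET, which improves diagnostic performance. 3. Experiments showing that synthetic PET is close to real PET in image quality and diagnostic consistency.

Method: 1. Abeta-PET images, T1-weighted MRI scans, and BBMs were collected from 566 participants. 2. A generative model based on an LLM and multimodal fusion was designed to synthesize PET images. 3. The synthetic images were evaluated for image quality, diagnostic consistency, and clinical applicability.

Result: 1. Synthetic PET closely matches real PET in structural detail (SSIM = 0.920) and regional patterns (Pearson's r = 0.955). 2. Diagnoses based on synthetic PET agree with real-PET-based diagnoses (accuracy = 0.80), and the synthetic-PET model (AUC = 0.78) outperforms models using MRI alone (AUC = 0.68) or BBMs alone (AUC = 0.73). 3. Ablation analysis supports the advantages of LLM integration and prompt engineering.

Insight: 1. Language-enhanced generative models offer a new approach to multimodal medical image synthesis. 2. Synthetic PET could serve as a lower-cost alternative to real PET. 3. LLM-based methods show potential for medical image generation.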

Abstract: Background: Alzheimer’s disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson’s r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer’s disease.
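
Agreement in regional patterns is reported above as Pearson's r. A minimal sketch of the coefficient itself follows, assuming the comparison is made between paired regional values from real and synthetic PET (the abstract does not give the exact procedure). The function name and the sample numbers are hypothetical, not study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical regional uptake values for one subject (real vs. synthetic PET).
real_regions = [1.10, 1.35, 1.62, 1.20, 1.48]
synthetic_regions = [1.12, 1.30, 1.58, 1.25, 1.45]
print(pearson_r(real_regions, synthetic_regions))
```

Note that a high r shows that regions rank and scale similarly, but it is insensitive to a constant offset or gain between the two images, so it complements rather than replaces a structural metric such as SSIM.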

[26] Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping

Jiajia Li,Keyi Zhu,Qianwen Zhang,Dong Chen,Qi Sun,Zhaojian Li

Main category: cs.CV

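TL;DR: The paper proposes an object-centric 3D Gaussian Splatting (3DGS) framework for 3D reconstruction and phenotyping of strawberry plants. Combining SAM-2 segmentation with background masking, it reconstructs plants efficiently and accurately and extracts key plant traits automatically.

Details Motivation: Traditional plant phenotyping is time-consuming, labor-intensive, and often destructive, while existing 3DGS applications in agriculture usually reconstruct the entire scene, adding noise and computational cost. The paper therefore proposes a reconstruction method that focuses on the plant itself.

Contribution: 1) An object-centric 3DGS framework combining SAM-2 segmentation and background masking; 2) efficient and accurate strawberry plant reconstruction; 3) a method for automatically extracting plant traits such as height and canopy width.

Method: SAM-2 segments the strawberry plant and alpha-channel masking removes the background. 3D Gaussian Splatting then reconstructs the plant, and DBSCAN clustering with PCA is used to analyze plant traits automatically.

Result: Experiments show that the method outperforms conventional pipelines in reconstruction accuracy and computational efficiency, offering a non-destructive and scalable solution for strawberry plant phenotyping.

Insight: Object-centric reconstruction with background removal reduces noise and computational load while improving phenotyping accuracy, and the idea may carry over to other agricultural settings.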

Abstract: Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.

[27] Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning

Anders Austlid Taskén,Thierry Judge,Erik Andreas Rye Berg,Jinyang Yu,Bjørnar Grenne,Frank Lindseth,Svend Aakhus,Pierre-Marc Jodoin,Nicolas Duchateau,Olivier Bernard,Gabriel Kiss

Main category: cs.CV

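TL;DR: The paper presents autoStrain, described by the authors as the first automated deep learning pipeline for estimating segmental longitudinal strain (SLS) from transesophageal echocardiography (TEE), aiming to make cardiac function assessment more efficient and precise.

Details Motivation: Current SLS estimation techniques require substantial manual work and expertise, making them inefficient and too resource-intensive for monitoring. The study therefore proposes an automated solution.

Contribution: (1) autoStrain, the first deep-learning-based automated SLS estimation pipeline; (2) a comparison of two DL approaches, TeeFlow and TeeTracker; (3) the use of simulated data (synTEE) to compensate for the scarcity of real ground-truth data.

Method: Two DL approaches are used: TeeFlow, based on the RAFT optical flow model (dense frame-to-frame predictions), and TeeTracker, based on the CoTracker point trajectory model (sparse long-sequence predictions). Both are trained and evaluated on the synthetic TEE dataset (synTEE).

Result: TeeTracker outperforms TeeFlow, with a mean distance error in motion estimation of 0.65 mm. In clinical validation, SLS estimates differed from clinical references by a mean of 1.09% (95% limits of agreement: -8.90% to 11.09%).

Insight: Combining AI-driven motion estimation with TEE can improve the precision and efficiency of cardiac function assessment, and including simulated ischemia in the training data helps the models quantify abnormal deformation.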

Abstract: Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
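
The pipeline above turns tracked myocardial motion into a strain value. A minimal sketch of the usual Lagrangian definition follows: strain is the relative change in segment length with respect to a reference frame, expressed in percent. This assumes a segment is represented by an ordered list of tracked 2D points; it is not the autoStrain implementation, and the function names and coordinates are hypothetical.

```python
import math


def polyline_length(points):
    """Total length of the polyline through an ordered list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def longitudinal_strain(reference_points, current_points):
    """Lagrangian strain in percent: 100 * (L - L0) / L0."""
    l0 = polyline_length(reference_points)
    if l0 == 0:
        raise ValueError("reference segment has zero length")
    return 100.0 * (polyline_length(current_points) - l0) / l0


# Hypothetical segment at end-diastole (reference) and at end-systole, in millimetres.
end_diastole = [(0.0, 0.0), (0.0, 10.0), (0.0, 20.0)]
end_systole = [(0.0, 0.0), (0.0, 8.0), (0.0, 16.0)]
print(longitudinal_strain(end_diastole, end_systole))  # negative means shortening
```

Because strain is a ratio of small length changes, errors of a fraction of a millimetre in the tracked points can shift the result by several percentage points, which is why the motion-estimation error is reported alongside the strain agreement.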

[28] Can Foundation Models Revolutionize Mobile AR Sparse Sensing?

Yiqin Zhao,Tian Guo

Main category: cs.CV

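TL;DR: The paper asks whether foundation models can change mobile AR sparse sensing. Experiments show significant improvements in geometry-aware image warping and leading performance in 3D scene reconstruction.

Details Motivation: Under constraints on computation and power, mobile sensing systems have long faced a trade-off between sensing quality and efficiency. Sparse sensing is a key strategy, but missing information reduces its accuracy. The central question is whether foundation models can address this.

Contribution: Experimental evidence of the potential of foundation models in mobile sparse sensing, particularly the performance gains in geometry-aware image warping and 3D scene reconstruction.

Method: Using real-world mobile AR data, the study evaluates foundation models in sparse sensing, focusing on geometry-aware image warping and the accuracy of cross-frame information reuse.

Result: Foundation models significantly improve sparse sensing performance, with leading results in 3D scene reconstruction.

Insight: The paper highlights both the promise and the open challenges of integrating foundation models into mobile sparse sensing systems, pointing to directions for future work.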

Abstract: Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.

[29] Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer’s Disease Diagnosis

Delin Ma,Menghui Zhou,Jun Qi,Yun Yang,Po Yang

Main category: cs.CV

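TL;DR: The paper proposes a collaborative attention and consistency-guided fusion framework for MRI and PET in Alzheimer's disease diagnosis, emphasizing modality-specific features and reducing the impact of distributional differences between modalities.

Details Motivation: Early diagnosis of Alzheimer's disease (AD) is important, but existing multimodal fusion methods overlook the diagnostic value of modality-specific features, and distributional differences between modalities degrade performance.

Contribution: 1. A learnable parameter representation (LPR) block that compensates for missing modality information; 2. a shared encoder and modality-independent encoders that preserve both shared and specific representations; 3. a consistency-guided mechanism that aligns latent distributions across modalities.

Method: 1. An LPR block handles missing information; 2. shared and modality-independent encoders are combined; 3. a consistency mechanism improves cross-modal alignment.

Result: On the ADNI dataset, the method outperforms existing multimodal fusion strategies.

Insight: Modality-specific features and cross-modal distribution alignment are both important for improving AD diagnostic performance.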

Abstract: Alzheimer’s disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.

[30] Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency

Hao Li,Daiwei Lu,Jesse d’Almeida,Dilara Isik,Ehsan Khodapanah Aghdam,Nick DiSanto,Ayberk Acar,Susheela Sharma,Jie Ying Wu,Robert J. Webster III,Ipek Oguz

Main category: cs.CV

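TL;DR: The paper proposes improving monocular absolute depth estimation in endoscopy through domain-invariant feature learning and latent consistency, and reports better results than existing techniques.

Details Motivation: Estimating absolute depth from monocular images during endoscopic procedures is difficult, and existing approaches that train on image-level domain-adapted synthetic data still leave a domain gap.

Contribution: A latent feature alignment method that learns domain-invariant features through adversarial learning and directional feature consistency, improving depth estimation.

Method: Using adversarial learning and directional feature consistency, the depth network is trained to learn domain-invariant features from translated synthetic images and real endoscopic images.

Result: On endoscopic videos of central airway phantoms, the method outperforms existing techniques on both absolute and relative depth metrics.

Insight: Aligning latent features on top of image-level translation further narrows the domain gap, which makes the approach suitable for depth estimation in practical settings.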

Abstract: Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.

[31] Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows?

Giorgos Sfikas,Konstantina Nikolaidou,Foteini Papadopoulou,George Retsinas,Anastasios L. Kesidis

Main category: cs.CV

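TL;DR: The paper examines whether Euler angles are a useful rotation parameterisation for pose estimation with Normalizing Flows.

Details Motivation: Pose estimation is central to 3D computer vision, but a single point estimate may be insufficient, especially when the pose is ambiguous. The paper asks whether Euler angles, despite their shortcomings, can still serve as a useful rotation parameterisation.

Contribution: A Normalizing Flows model for pose estimation built on the Euler angles parameterisation, with a discussion of its practicality relative to more complex parameterisations.

Method: Within a Normalizing Flows model, Euler angles are compared with more complex rotation parameterisations for pose estimation.

Result: The study suggests that Euler angles can be a useful rotation parameterisation in some respects, despite their limitations.

Insight: The simplicity and directness of Euler angles may be advantageous over more complex parameterisations in some settings, particularly when the pose is ambiguous or the object is symmetric.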

Abstract: Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output is related to a number of benefits when pose is not unambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
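
The "shortcomings" of Euler angles mentioned above include gimbal lock, where different angle triples describe the same rotation. A minimal sketch that builds a rotation matrix from Euler angles and demonstrates this ambiguity follows. The intrinsic Z-Y-X (yaw, pitch, roll) convention used here is an assumption for illustration, since the abstract does not state which convention the paper adopts, and the function names are hypothetical.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_zyx_to_matrix(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cy, sy = math.cos(pitch), math.sin(pitch)
    cx, sx = math.cos(roll), math.sin(roll)
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    return matmul3(matmul3(rz, ry), rx)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# At pitch = 90 degrees the matrix depends only on (roll - yaw), so shifting
# yaw and roll by the same amount leaves the rotation unchanged (gimbal lock).
r1 = euler_zyx_to_matrix(0.3, math.pi / 2, 0.5)
r2 = euler_zyx_to_matrix(0.5, math.pi / 2, 0.7)
print(max_abs_diff(r1, r2))  # close to zero
```

For a density model over rotations, this many-to-one mapping means that probability mass for a single physical pose can be spread over several angle triples, which is one reason alternative parameterisations are often preferred.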

[32] SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

Fangxun Shu,Yongjie Ye,Yue Liao,Zijian Kang,Weijie Yin,Jiacong Wang,Xiao Liang,Shuicheng Yan,Chao Feng

Main category: cs.CV

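TL;DR: SAIL-RL is a reinforcement learning framework that improves the reasoning of multimodal large language models (MLLMs) through a dual-reward mechanism (Thinking Reward and Judging Reward). It addresses the shortcomings of existing methods in reasoning quality and reasoning strategy and performs strongly in experiments.

Details Motivation: Existing methods rely on outcome-only supervision, which cannot guarantee the quality of the reasoning process, and they apply a uniform reasoning strategy, which leads to overthinking on simple tasks and underthinking on complex ones. SAIL-RL aims to address both problems and improve model reliability and adaptivity.

Contribution: Proposes a dual-reward mechanism (Thinking Reward and Judging Reward) that evaluates reasoning quality and adaptively selects the reasoning strategy, respectively, improving the reasoning and multimodal understanding of MLLMs.

Method: The SAIL-RL reinforcement learning framework combines the Thinking Reward (which assesses factual grounding, logical coherence, and answer consistency) with the Judging Reward (which adaptively decides whether deep reasoning is needed).

Result: Experiments on SAIL-VL2 show that SAIL-RL performs well on reasoning and multimodal understanding tasks, is competitive with GPT-4o, and substantially reduces hallucinations.

Insight: SAIL-RL shows the potential of using reinforcement learning to adjust the reasoning strategy dynamically, offering a new direction for building more reliable and adaptive MLLMs.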

Abstract: We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.

Cuong Tuan Nguyen,Ngoc Tuan Nguyen,Triet Hoang Minh Dao,Huy Minh Nhat,Huy Truong Dinh

Main category: cs.CV

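TL;DR: This paper proposes a graph neural network (GNN)-based method for structure recognition of handwritten mathematical expressions (HMEs), which models each expression as a graph and refines the recognition result by predicting links.

Details Motivation: Structure recognition of HMEs is a complex task, and traditional methods struggle to capture the spatial dependencies between symbols accurately. Modeling an HME as a graph and applying a GNN allows the spatial relations between symbols to be represented and refined more effectively.

Contribution: The main contributions are: (1) modeling HMEs as graphs in which nodes represent symbols and edges represent spatial dependencies; (2) a GNN link prediction model that refines the structure by removing unnecessary connections, yielding the final Symbol Label Graph.

Method: The method has three steps: (1) a deep BLSTM network performs symbol segmentation, recognition, and spatial relation classification to produce an initial graph; (2) a 2D-CFG parser generates all possible spatial relations; (3) a GNN-based link prediction model refines the graph structure.

Result: Experiments show promising performance on HME structure recognition, supporting the effectiveness of the approach.

Insight: Modeling the graph structure with a GNN captures complex spatial relations between symbols more flexibly and offers a new direction for handwritten mathematical expression recognition.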

Abstract: We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
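The pruning step described in this entry can be illustrated with a small sketch. It is not the authors' model: it assumes node embeddings are already available (here hand-written toy vectors standing in for GNN outputs) and uses a common link-prediction decoder, the sigmoid of a dot product, to score each candidate edge. The names `link_score` and `prune_edges`, the relation labels, and the 0.5 threshold are all hypothetical.

```python
import math

def link_score(u, v):
    """Score a candidate edge as the sigmoid of the embedding dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

def prune_edges(embeddings, candidate_edges, threshold=0.5):
    """Keep only candidate edges whose link score reaches the threshold."""
    return [
        (src, dst, rel)
        for (src, dst, rel) in candidate_edges
        if link_score(embeddings[src], embeddings[dst]) >= threshold
    ]

# Toy symbols for the expression x^2 + 1; the vectors are made up and
# stand in for embeddings a trained GNN would produce.
embeddings = {
    "x": [1.0, 0.0],
    "2": [1.0, -1.0],
    "+": [1.0, 1.0],
    "1": [0.0, 1.0],
}
# Over-complete set of spatial relations, as a parser might propose.
candidates = [
    ("x", "2", "Superscript"),
    ("x", "+", "Right"),
    ("2", "1", "Right"),   # spurious relation that should be removed
    ("+", "1", "Right"),
]
kept = prune_edges(embeddings, candidates)
print(kept)
```

With these toy vectors the spurious `("2", "1", "Right")` edge scores below the threshold and is dropped, while the other three relations are kept.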

[34] M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings

Jiankai Tang,Tao Zhang,Jia Li,Yiru Zhang,Mingyu Zhang,Kegang Wang,Yuming Hao,Bolin Wang,Haiyang Li,Xingyao Wang,Yuanchun Shi,Yuntao Wang,Sichong Qian

Main category: cs.CV

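TL;DR: The paper introduces M3PD, the first publicly available dual-view mobile photoplethysmography (PPG) dataset, and proposes F3Mamba, a model that fuses facial and fingertip views through Mamba-based temporal modeling, substantially reducing heart-rate error and improving robustness.

Details Motivation: Portable physiological monitoring is essential for early detection of cardiovascular disease, but existing methods require specialized equipment or fixed postures, which limits accessibility and practicality. Smartphone video PPG faces reliability challenges, and the field lacks open datasets and reliable applications for cardiovascular patients.

Contribution: (1) M3PD, the first publicly available dual-view (facial and fingertip) mobile PPG dataset; (2) F3Mamba, a model that fuses the two views, reducing heart-rate error by 21.9% to 30.2% and improving robustness.

Method: Building on the M3PD dataset, F3Mamba uses the Mamba architecture for temporal modeling and fuses facial and fingertip video data to improve heart-rate estimation.

Result: F3Mamba outperforms single-view baselines in heart-rate error (a reduction of 21.9% to 30.2%) and in robustness.

Insight: Dual-view fusion combined with Mamba-based temporal modeling can improve the accuracy and reliability of video PPG, offering a new direction for cardiovascular monitoring.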

Abstract: Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.

[35] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Jizheng Ma,Xiaofei Zhou,Yanlong Song,Han Yan

Main category: cs.CV

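TL;DR: CoCoVa proposes a continuous cross-modal reasoning framework that reasons iteratively in latent space, compensating for the limitations of discrete language processing in vision-language models and improving task performance and reasoning efficiency.

Details Motivation: Human cognition involves many tacit thought processes that are hard to put into words, whereas current vision-language models (VLMs) are confined to reasoning in the discrete space of language tokens, which limits the richness of visual perception. CoCoVa aims to close this gap and enable more natural cross-modal reasoning.

Contribution: (1) The CoCoVa framework, which uses a chain of reasoning in a continuous latent space; (2) the Latent Q-Former (LQ-Former) as a dynamic reasoning engine; (3) a dynamic token selection mechanism and a multi-task learning objective that keep latent representations aligned with the visual and textual modalities.

Method: (1) The LQ-Former iteratively refines a chain of latent thought vectors; (2) salient visual regions are selected dynamically; (3) the model is trained with a multi-task objective that combines contrastive learning and diffusion-based reconstruction.

Result: With 1.5B parameters, CoCoVa competes with or surpasses 7B-9B baseline models; scaled to 7B, it remains competitive. Qualitative analysis shows that the latent space captures interpretable, structured reasoning patterns.

Insight: Reasoning in a continuous latent space can better model the tacit thought processes of human cognition and offers a new direction for cross-modal understanding in vision-language models.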

Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.

[36] RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

Jiahe Song,Chuang Wang,Bowen Jiang,Yinfan Wang,Hao Zheng,Xingjian Wei,Chengjin Liu,Junyuan Gao,Yubin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He

Main category: cs.CV

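TL;DR: RxnCaption reformulates chemical reaction diagram parsing as a visual-prompt-guided image captioning problem. Combined with the BIVP strategy and the MolYOLO molecular detector, it significantly improves extraction quality, and the authors release the large-scale RxnCaption-11k dataset.

Details Motivation: Reaction data in the chemical literature exist as images, which machines cannot read and which cannot be used to train models. RxnCaption proposes a new parsing framework to address this.

Contribution: (1) Reformulates reaction diagram parsing as an image captioning problem; (2) proposes the BIVP strategy, which uses the MolYOLO detector and simplifies model design; (3) releases the RxnCaption-11k dataset.

Method: Uses visual-prompt-guided image captioning: MolYOLO first draws molecular bounding boxes and indices onto the image, turning parsing into a natural-language description problem.

Result: RxnCaption-VL achieves state-of-the-art performance on multiple metrics, with significantly improved extraction quality.

Insight: Combining visual prompts with an image captioning formulation is an effective way to extract structured information from the chemical literature and can advance AI applications in chemistry.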

Abstract: Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed “BBox and Index as Visual Prompt” (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.

[37] Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds

Leon Schwarzer,Matthias Zeller,Daniel Casado Herraez,Simon Dierl,Michael Heidingsfeld,Cyrill Stachniss

Main category: cs.CV

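TL;DR: The paper proposes a self-supervised learning method for segmenting moving objects in sparse and noisy radar point clouds. It combines contrastive self-supervised representation learning with supervised fine-tuning on limited annotated data, using a clustering-based loss and dynamic points removal, and improves segmentation performance and label efficiency.

Details Motivation: Autonomous driving systems need efficient and reliable moving object segmentation. Radar sensors measure Doppler velocity directly, but their point clouds are sparse and noisy and are expensive to annotate. The paper therefore studies self-supervised learning to reduce reliance on annotated data.

Contribution: (1) A self-supervised learning method that combines contrastive learning with a clustering-based loss; (2) a dynamic points removal strategy that refines the clusters; (3) improved segmentation performance through fine-tuning on limited annotated data.

Method: A two-step approach: (1) contrastive self-supervised representation learning with a clustering-based loss and cluster refinement based on dynamic points removal; (2) supervised fine-tuning of the model on a small amount of annotated data.

Result: After self-supervised pretraining, the method improves segmentation performance and label efficiency, boosting state-of-the-art performance.

Insight: Self-supervised learning can reduce reliance on annotated data; the Doppler information in radar point clouds can be used to learn motion-aware representations; dynamic points removal makes the clustering more robust.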

Abstract: Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point’s Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.

[38] Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs

Arya Shah,Vaibhav Tripathi

Main category: cs.CV

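TL;DR: Using a unified benchmark, this paper quantifies how well feline and human visual representations align across several vision models (CNNs, ViTs, and self-supervised ViTs). The self-supervised ViT (DINO) aligns best, suggesting a degree of consistency in cross-species visual computation.

Details Motivation: To study how differences between the feline and human visual systems appear in downstream visual representations, and how different vision models (CNNs, ViTs, self-supervised ViTs) perform at cross-species alignment.

Contribution: Proposes a unified frozen-encoder benchmark that quantifies cross-species representational alignment for several vision models, and finds that the self-supervised ViT (DINO) achieves the best feline-human alignment.

Method: Uses layer-wise Centered Kernel Alignment (CKA) and Representational Similarity Analysis (RSA) to compare the alignment of CNNs, supervised ViTs, windowed transformers, and self-supervised ViTs at different depths.

Result: DINO ViT-B/16 achieves the best cross-species alignment (mean CKA-RBF ≈ 0.814, mean RSA ≈ 0.698). Supervised ViTs are close on CKA but show weaker geometric correspondence, and windowed transformers underperform plain ViTs.

Insight: Self-supervision combined with the inductive biases of ViTs yields representations that align feline and human visual systems more closely, providing testable hypotheses for neuroscientific research on cross-species visual computation.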

Abstract: Cats and humans differ in ocular anatomy. Most notably, Felis catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
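To make the linear CKA scores quoted in this entry concrete, here is a minimal sketch of the standard linear CKA formula between two activation matrices with the same number of rows (stimuli). It is a generic pure-Python illustration, not the authors' benchmark code; the function names are made up, and it assumes neither matrix is constant across rows (otherwise the denominator is zero).

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(x, y)
    denominator = math.sqrt(cross_sq_norm(x, x) * cross_sq_norm(y, y))
    return numerator / denominator

# Four stimuli, two features each (toy numbers).
x = [[1.0, 2.0], [3.0, 5.0], [4.0, 1.0], [0.0, 2.0]]
# y is x rotated by 90 degrees and scaled by 3: CKA ignores both.
y = [[-3.0 * b, 3.0 * a] for a, b in x]
print(round(linear_cka(x, y), 6))  # 1.0
```

Because the score is invariant to rotations and isotropic scaling of the feature space, it compares the geometry of two representations rather than individual units, which is why it suits comparisons across different architectures.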

[39] ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

Duo Xu,Hao Cheng,Xin Lin,Zhen Xie,Hao Wang

Main category: cs.CV

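TL;DR: This paper proposes ChartM$^3$, an automated multi-stage code-driven pipeline for generating complex multi-dimensional, multi-step visual reasoning data, improving the performance of multimodal large language models on chart comprehension.

Details Motivation: Current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks, which limits the real-world applicability of multimodal large language models.

Contribution: Proposes the ChartM$^3$ framework, which uses retrieval-augmented generation (RAG) and chain-of-thought (CoT) strategies to generate a high-quality multi-dimensional reasoning dataset, significantly improving reasoning ability and cross-domain generalization.

Method: A multi-stage code-driven pipeline that combines RAG and CoT to generate reasoning code, which in turn drives chart rendering and question-related statistical computation.

Result: Produces a training set of 38K charts and 142K Q&A pairs, plus 2,871 high-quality evaluation samples. Experiments show the dataset significantly improves model performance, enabling smaller models to match larger-scale models on complex chart comprehension.

Insight: Automatically generating diverse, high-quality reasoning data can improve a model's ability to handle complex tasks and reduce reliance on large-scale models.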

Abstract: Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.

[40] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

Nicolas Schuler,Lea Dewald,Nick Baldig,Jürgen Graf

Main category: cs.CV

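TL;DR: This paper investigates how well small vision-language models (VLMs) can be deployed on edge devices for mobile robotics, focusing on their potential for zero-shot scene interpretation and action recognition.

Details Motivation: Large language models and vision-language models have made notable progress in video understanding, scene interpretation, and commonsense reasoning, but their high computational complexity limits their use on edge devices and mobile robots. Studying how small models perform on edge devices is therefore important.

Contribution: The main contribution is an evaluation of small vision-language models on edge devices for mobile robotics, with an analysis of their potential, challenges, and limitations in zero-shot scene interpretation and action recognition.

Method: The paper proposes a pipeline that is evaluated on a diverse dataset (cityscape, on-campus, and indoor scenes). The experiments focus on model performance on edge devices.

Result: The experiments show that small vision-language models have some potential on edge devices, but also reveal their inherent biases, challenges, and limitations.

Insight: Small models can be deployed on edge devices, but accuracy must be traded off against inference time, and model design needs to account for the diversity and dynamics of real-world environments.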

Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/

[41] MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer

Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shotaro Tora

Main category: cs.CV

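TL;DR: MVAFormer is a transformer-based multi-view action recognition method for multi-camera settings, designed specifically for the spatio-temporal action recognition (STAR) task. It improves performance by using feature maps that preserve spatial information and by dividing self-attention between same-view and different-view tokens.

Details Motivation: Existing multi-view action recognition methods only recognize a single action from an entire video and cannot handle the STAR setting, in which each person's action is recognized sequentially. MVAFormer aims to fill this gap.

Contribution: Proposes MVAFormer, a multi-view action recognition method for the STAR setting, with a novel transformer-based cooperation module among views that preserves spatial information and models same-view and cross-view relations separately.

Method: Uses feature maps rather than embedding vectors that have lost spatial information, and divides self-attention between the same view and different views to model multi-view relations effectively.

Result: Experiments on a newly collected dataset show that MVAFormer outperforms the baselines by approximately 4.4 points on the F-measure.

Insight: Feature maps that preserve spatial information and self-attention divided by view are key to effective cooperation among views.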

Abstract: Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person’s action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.

[42] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Zixuan Liu,Siavash H. Khajavi,Guangkai Jiang

Main category: cs.CV

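TL;DR: This paper introduces the DetectiumFire dataset, which fills the gap in multi-modal datasets for the fire domain and supports research and applications in both vision and language tasks.

Details Motivation: Existing multi-modal models are hard to apply to the fire domain, mainly because high-quality public datasets are lacking. DetectiumFire aims to address this.

Contribution: Releases DetectiumFire, a large-scale multi-modal dataset of fire images and videos with high-quality annotations, covering diverse scenes and risk levels.

Method: The dataset contains 22.5k high-resolution images and 2.5k videos, annotated with traditional vision labels and textual descriptions, and supports tasks such as object detection and image generation.

Result: Experiments validate the dataset's usefulness for object detection, diffusion-based image generation, and vision-language reasoning.

Insight: The diversity and annotation quality of DetectiumFire provide important support for fire-related research and for the development of intelligent safety systems.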

Abstract: Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890

[43] UniChange: Unifying Change Detection with Multimodal Large Language Model

Xu Zhang,Danyang Li,Xiaohang Dong,Tianhao Wu,Hualong Yu,Jianye Wang,Qicheng Li,Xiang Li

Main category: cs.CV

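TL;DR: UniChange proposes a unified change detection framework based on a multimodal large language model (MLLM). It is the first to integrate the BCD and SCD tasks in a single model, and it improves generalization through special tokens and text prompts.

Details Motivation: Current change detection models typically learn from a single type of annotated data and cannot use diverse BCD and SCD datasets together, which limits generalization and versatility.

Contribution: (1) UniChange is the first MLLM-based unified change detection model; (2) it unifies the BCD and SCD tasks by introducing three special tokens, [T1], [T2], and [CHANGE]; (3) it uses text prompts to guide the identification of change categories, removing the reliance on predefined classification heads.

Method: UniChange combines generative language abilities with specialized CD functionality. Special tokens and text prompts let it learn from multi-source data, even when class definitions conflict.

Result: Achieves SOTA performance on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND), with IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively.

Insight: UniChange shows the potential of MLLMs for unifying diverse tasks, in particular their ability to learn effectively even when class definitions conflict.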

Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
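For readers unfamiliar with the metric reported in this entry, the sketch below computes the standard intersection-over-union for a pair of binary change masks. It is a generic illustration: the abstract does not state how IoU is aggregated per benchmark (for example, for the semantic SECOND dataset), and the reported scores appear to be on a 0-100 scale, whereas this function returns a fraction. The name `binary_iou` and the toy masks are made up.

```python
def binary_iou(pred, target):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

# 1 marks a changed pixel, 0 an unchanged one.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(binary_iou(predicted, reference))  # 0.5
```

Here two pixels are marked as changed in both masks and four pixels are marked in at least one, giving 2 / 4 = 0.5.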

[44] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

Robinson Umeike,Neil Getty,Yin Xiangyu,Yi Jiang

Main category: cs.CV

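TL;DR: Using the PtychoBench benchmark, the paper compares two strategies, supervised fine-tuning (SFT) and in-context learning (ICL), on X-ray ptychography tasks in a low-data regime and finds that the best strategy depends on the task modality.

Details Motivation: General-purpose foundation models (such as LLMs and VLMs) have great potential for automating advanced microscopy workflows, but how best to adapt them to scientific tasks remains unclear.

Contribution: Introduces the PtychoBench benchmark, systematically compares the SFT and ICL strategies, and shows that the best adaptation path is task-dependent.

Method: Evaluates SFT and ICL on PtychoBench using a visual artifact detection task (VLMs) and a textual parameter recommendation task (LLMs).

Result: On the visual task, SFT and ICL are highly complementary, and a fine-tuned model guided by context-aware examples performs best. On the textual task, ICL performs better and surpasses the SFT model.

Insight: The task modality determines the best adaptation strategy. Context-aware prompting and the contextual interference observed in fine-tuned models both merit attention.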

Abstract: The automation of workflows in advanced microscopy is a key goal where foundation models like Large Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful “super-expert” SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
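The Micro-F1 scores quoted in this entry pool true positives, false positives, and false negatives over all labels and samples before computing F1. The sketch below shows that standard definition for a multi-label task; the label names and the function `micro_f1` are hypothetical, and the abstract does not describe the benchmark's exact label set or evaluation code.

```python
def micro_f1(true_label_sets, predicted_label_sets):
    """Micro-averaged F1 over samples, each given as a set of labels."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_label_sets, predicted_label_sets):
        tp += len(truth & predicted)   # labels correctly predicted
        fp += len(predicted - truth)   # labels predicted but not present
        fn += len(truth - predicted)   # labels present but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical artifact labels for three reconstructions.
truth = [{"ringing", "blur"}, {"grid"}, set()]
predicted = [{"ringing"}, {"grid", "noise"}, {"blur"}]
score = micro_f1(truth, predicted)
print(round(score, 4))  # 0.5714
```

In this toy case there are 2 true positives, 2 false positives, and 1 false negative, so the score is 2·2 / (2·2 + 2 + 1) = 4/7.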

[45] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

Yaosen Chen,Wei Wang,Xuming Wen,Han Yang,Yanru Zhang

Main category: cs.CV

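TL;DR: This paper proposes an energy-based optimization method for video shot assembly (ESA) that learns the shooting style and syntax rules of reference videos and automatically produces videos that follow a specific narrative or artistic style.

Details Motivation: Shot assembly has traditionally relied on manual editing, and existing intelligent video editing technologies struggle to capture a creator's unique artistic expression. An optimization method that can automatically learn and imitate a reference style is therefore needed.

Contribution: (1) An energy-based framework for shot assembly optimization; (2) learning the style of reference videos through visual-semantic matching and energy-based models; (3) enabling users with no editing experience to produce high-quality videos.

Method: (1) Generate a script with a large language model and match it against a video library; (2) extract shot attributes (such as shot size, camera motion, and semantics) from reference videos; (3) score candidate shot sequences with an energy-based model; (4) optimize the assembly by combining syntax rules.

Result: The method produces coherent videos that follow the style of the reference videos, so that even inexperienced users can create visually compelling work.

Insight: Energy-based models can capture stylistic features of video editing; combined with semantic matching and syntax rules, they offer a new approach to automated video editing.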

Abstract: Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
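The idea of scoring candidate shot sequences by their fit to a reference style can be sketched with a deliberately simple stand-in. The example below is not the authors' energy model: it assumes a single attribute (shot size), defines the energy of a transition as the negative log of its smoothed frequency in the reference sequences, and searches all orderings by brute force. All function names, labels, and the smoothing constant are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def fit_transition_energy(reference_sequences, labels, smoothing=1.0):
    """Learn an energy for each label-to-label transition from reference sequences.

    Frequent transitions get low energy: E(a, b) = -log P(b | a), with
    add-`smoothing` counts so unseen transitions stay finite.
    """
    counts = Counter()
    for seq in reference_sequences:
        counts.update(zip(seq, seq[1:]))
    energy = {}
    for a in labels:
        total = sum(counts[(a, b)] for b in labels) + smoothing * len(labels)
        for b in labels:
            energy[(a, b)] = -math.log((counts[(a, b)] + smoothing) / total)
    return energy

def sequence_energy(seq, energy):
    """Total energy of a sequence: the sum over consecutive transitions."""
    return sum(energy[pair] for pair in zip(seq, seq[1:]))

def best_order(shots, energy):
    """Brute-force search for the lowest-energy ordering (small inputs only)."""
    return min(permutations(shots), key=lambda s: sequence_energy(s, energy))

labels = ["wide", "medium", "close"]
references = [
    ["wide", "medium", "close"],
    ["wide", "medium", "close"],
    ["wide", "medium", "close", "medium"],
]
energy = fit_transition_energy(references, labels)
print(best_order(["close", "wide", "medium"], energy))
# ('wide', 'medium', 'close')
```

Brute-force search grows factorially with the number of shots, so a practical system would need a more scalable optimizer and richer attributes such as camera motion and semantics, as the abstract describes.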

[46] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Kevin Qinghong Lin,Yuhao Zheng,Hangyu Ran,Dantong Zhu,Dongxing Mao,Linjie Li,Philip Torr,Alex Jinpeng Wang

Main category: cs.CV

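TL;DR: VCode is a multimodal coding benchmark that uses SVG as a symbolic visual representation and evaluates models on visual-centric coding through code generation tasks.

Details Motivation: Progress in coding has largely focused on language-centric tasks (such as program synthesis and debugging), while visual-centric coding remains underexplored.

Contribution: (1) The VCode benchmark, which recasts multimodal understanding as SVG code generation; (2) the CodeVQA evaluation protocol, which checks the symbolic fidelity of the SVG through question answering; (3) the VCoder framework, which strengthens a model's SVG generation through iterative revision and visual tools.

Method: (1) An SVG-based symbolic visual representation; (2) the CodeVQA evaluation protocol; (3) the VCoder framework, with two components: iterative revision ("Thinking with Revision") and visual tools ("Acting with Visual Tools").

Result: VCoder outperforms Claude-4-Opus on the benchmark, with an overall gain of 12.3 points. Both humans and VLMs perform worse on rendered SVGs, but their consistency points to the promise of symbolic visual representation.

Insight: As a compact, interpretable, and executable visual representation, SVG shows promise for multimodal coding, but current VLMs remain limited in professional knowledge and 3D reasoning.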

Abstract: Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model’s intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.

[47] Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems

Nicolas Schuler,Lea Dewald,Jürgen Graf

Main category: cs.CV

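TL;DR: The paper proposes an automated report generation pipeline that runs on edge computing devices and produces natural-language reports from multi-modal sensor data. It targets cognitive systems such as mobile robots, preserves privacy, and requires no external services.

Details Motivation: Critical tasks such as autonomous driving and service robotics require the evaluation of large amounts of heterogeneous data. Automated report generation can facilitate the evaluation and acceptance of such systems.

Contribution: The main contribution is an automated report generation pipeline that relies solely on local models and can be deployed on edge computing devices, preserving privacy and avoiding dependence on external services.

Method: Natural-language reports are generated from multi-modal sensor data using only local models suitable for edge devices.

Result: The implementation is evaluated on a diverse dataset spanning indoor, outdoor, and urban environments, with both quantitative and qualitative results reported.

Insight: Deploying models locally preserves privacy and removes the dependence on external services, which makes the approach practical for mobile robots and other cognitive systems.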

Abstract: Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitates the evaluation of large amounts of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.

[48] LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization

Jee Won Lee,Jongseong Brad Choi

Main category: cs.CV

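TL;DR: LiteVoxel is a self-tuning training pipeline for sparse-voxel rasterization. With a revised loss and adaptive threshold-based pruning, it reduces peak VRAM by roughly 40%-60% while keeping image quality and training time comparable.

Details Motivation: Sparse-voxel rasterization (SVR) is fast and differentiable for scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. LiteVoxel aims to address these problems.

Contribution: 1. A loss with inverse-Sobel reweighting and a mid-training gamma ramp that shifts gradient budget to flat, low-frequency regions. 2. Pruning logic based on depth quantiles with EMA-hysteresis guards, together with ray-footprint-based, priority-driven subdivision. 3. Substantially lower memory use at comparable performance.

Method: 1. A low-frequency-aware loss (inverse-Sobel reweighting plus gamma ramp). 2. Adaptive pruning and subdivision (depth quantiles, EMA hysteresis, and ray-footprint priority). 3. An explicit growth budget.

Result: On Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes), LiteVoxel reduces peak VRAM by about 40%-60% while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Errors in low-frequency regions and boundary instability are also mitigated.

Insight: Adjusting gradient allocation and pruning thresholds dynamically can improve memory efficiency substantially without sacrificing quality, which suits resource-constrained scene reconstruction.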

Abstract: Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilizes. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards, and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
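The pruning logic described above (a quantile-derived threshold on maximum blending weight, guarded by EMA hysteresis) could be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: `VoxelState`, `quantile_threshold`, `update_and_should_prune`, and the `decay` and `patience` parameters are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class VoxelState:
    """Running statistics for one voxel (hypothetical structure)."""
    ema_weight: float = 1.0  # smoothed maximum blending weight
    below_count: int = 0     # consecutive checks spent under the threshold


def quantile_threshold(weights, q):
    """Nearest-rank q-quantile of the blending weights in one depth bin."""
    ordered = sorted(weights)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def update_and_should_prune(state, max_blend_weight, threshold,
                            decay=0.9, patience=3):
    """Update the EMA; report True only after a sustained low contribution."""
    state.ema_weight = decay * state.ema_weight + (1.0 - decay) * max_blend_weight
    if state.ema_weight < threshold:
        state.below_count += 1
    else:
        state.below_count = 0  # any recovery resets the hysteresis guard
    return state.below_count >= patience
```

The guard means a voxel is not removed because of a single noisy measurement: its smoothed weight has to stay under the threshold for `patience` consecutive checks.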

[49] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data

Jessica Plassmann,Nicolas Schuler,Georg von Freymann,Michael Schuth

Main category: cs.CV

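TL;DR: The paper studies unsupervised learning for industrial defect detection in shearographic images. It compares three models (a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model) and finds that the student-teacher model gives the most robust classification and enables precise defect localization.

Details Motivation: Conventional shearographic inspection depends on expert interpretation and labeled data, which limits industrial adoption. The study aims to reduce reliance on labeled data and automate defect detection through unsupervised learning.

Contribution: 1. Evaluates and compares three unsupervised architectures. 2. Develops a controlled dataset that simulates realistic inspection conditions. 3. Shows the advantage of the student-teacher model for classification and localization.

Method: The three models (fully connected autoencoder, convolutional autoencoder, and student-teacher model) are trained on defect-free data only, and feature separability is visualized with t-SNE. A YOLOv8 model trained on labeled defect data serves as a reference for localization quality.

Result: The student-teacher model achieves the best classification robustness and precise defect localization, and its feature representations are more separable than those of the autoencoder models.

Insight: Unsupervised learning shows promise for industrial defect detection, and the student-teacher model is particularly suited to inspection under realistic deformation conditions.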

Abstract: Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
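As a rough illustration of how a student-teacher feature matching model can yield both a classification score and a defect location, the sketch below scores each spatial position by the squared distance between teacher and student feature vectors. The abstract does not give the scoring rule, so this is a common formulation assumed for illustration; `anomaly_map`, `image_score`, and `locate_defect` are hypothetical names.

```python
def anomaly_map(teacher_features, student_features):
    """Squared teacher-student feature distance at each spatial position.

    Both inputs are nested lists shaped [rows][cols][channels].
    """
    return [
        [sum((t - s) ** 2 for t, s in zip(t_vec, s_vec))
         for t_vec, s_vec in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher_features, student_features)
    ]


def image_score(amap):
    """Image-level anomaly score: the largest per-position discrepancy."""
    return max(max(row) for row in amap)


def locate_defect(amap):
    """(row, col) of the position with the highest anomaly score."""
    best = max((score, r, c)
               for r, row in enumerate(amap)
               for c, score in enumerate(row))
    return best[1], best[2]
```

Because the student is trained only on defect-free data, positions where it fails to match the teacher are candidates for defects; thresholding `image_score` would give a binary decision.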

[50] The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic

Akash Sharma,Chinmay Mhatre,Sankalp Gawali,Ruthvik Bokkasam,Brij Kishore,Vishwajeet Pattanaik,Tarun Rambha,Abdul R. Pinjari,Vijay Kovvali,Anirban Chakraborty,Punit Rathore,Raghu Krishnapuram,Yogesh Simmhan

Main category: cs.CV

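TL;DR: This report introduces UVH-26, a large-scale annotated dataset of Indian traffic scenes. Consensus annotations are derived from crowdsourced labels with Majority Voting and STAPLE, and experiments with several modern detectors show the value of domain-specific data. The resulting models outperform baselines trained on COCO, providing a foundation for intelligent transportation systems in complex traffic conditions.

Details Motivation: Existing global datasets such as COCO do not adequately cover India's complex traffic scenes. The authors therefore built UVH-26, the first public release by AIM@IISc of a large-scale annotated dataset of Indian traffic-camera images, to fill this gap.

Contribution: 1. Releases UVH-26, a large-scale annotated dataset of Indian traffic scenes. 2. Derives high-quality consensus annotations from crowdsourced labels using Majority Voting and STAPLE. 3. Shows that domain-specific data improves detection performance.

Method: 1. Collects 26,646 high-resolution (1080p) images from 2,800 Bengaluru traffic cameras. 2. Obtains 1.8 million bounding boxes through crowdsourced annotation by 565 students. 3. Derives consensus annotations with Majority Voting and STAPLE. 4. Trains multiple detectors, including YOLO11, RT-DETR, and DAMO-YOLO.

Result: Models trained on UVH-26 improve mAP50:95 by 8.4-31.5% over equivalent baselines trained on COCO. RT-DETR-X performs best at 0.67 mAP50:95, compared with 0.40 for COCO-trained weights on the common classes.

Insight: Domain-specific data is essential for detection in complex traffic scenes and improves model performance markedly. Crowdsourced annotation combined with consensus algorithms such as Majority Voting is an effective way to produce high-quality labels.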

Abstract: This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2,800 of Bengaluru’s Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on the COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
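The class-label side of majority voting can be sketched as below, assuming the boxes drawn by different annotators have already been matched to the same object. The matching step, the STAPLE variant, and the tie and minimum-vote rules the authors actually use are not stated in the abstract; `majority_label` and `min_votes` are hypothetical.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the winning class label, or None if there is no clear consensus.

    `labels` holds one class name per annotator for a single matched object.
    """
    if not labels:
        return None
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tie = len(counts) > 1 and counts[1][1] == top_count
    if tie or top_count < min_votes:
        return None  # assumed policy: drop objects without agreement
    return top_label
```

For example, three annotators labelling one vehicle as `["Sedan", "Sedan", "SUV"]` would yield `"Sedan"`, while a one-to-one split yields no consensus label.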

[51] Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification

Md Rashidunnabi,Kailash A. Hambarde,Vasco Lopes,Joao C. Neves,Hugo Proenca

Main category: cs.CV

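TL;DR: The paper proposes MTF-CVReID, a parameter-efficient framework for video person re-identification. Seven complementary modules improve cross-view robustness and temporal consistency, achieving state-of-the-art results while retaining real-time performance.

Details Motivation: Cross-view video person re-identification (for example, aerial-ground surveillance) suffers from extreme viewpoint shifts, scale disparities, and temporal inconsistencies.

Contribution: Proposes the MTF-CVReID framework with seven modules (including CSFN, MRFH, and IAMM) that improve cross-view robustness and temporal consistency while adding only about 2 million parameters and 0.7 GFLOPs over the baseline.

Method: Seven modules are added on top of a ViT-B/16 backbone, including Cross-Stream Feature Normalization (CSFN), Multi-Resolution Feature Harmonization (MRFH), and the Identity-Aware Memory Module (IAMM).

Result: Achieves state-of-the-art performance on the AG-VPReID benchmark, generalizes well across datasets (G2A-VReID and MARS), and maintains real-time efficiency at 189 FPS.

Insight: Carefully designed adapter-based modules can substantially improve cross-view robustness and temporal consistency without compromising computational efficiency.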

Abstract: Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
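The abstract says only that MVICL enforces cross-view identity coherence "using a contrastive learning paradigm" and does not state the loss. As one common instance of that paradigm, the sketch below computes an InfoNCE-style loss in which the aerial and ground embeddings at the same index share an identity; the function names and the `temperature` value are assumptions, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_view_info_nce(aerial, ground, temperature=0.1):
    """Mean InfoNCE loss; aerial[i] and ground[i] are the positive pair."""
    total = 0.0
    for i, anchor in enumerate(aerial):
        logits = [cosine(anchor, g) / temperature for g in ground]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(aerial)
```

The loss is small when each aerial embedding is closest to the ground embedding of the same identity and grows when identities are confused across views.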

[52] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

Jingyu Lu,Haonan Wang,Qixiang Zhang,Xiaomeng Li

Main category: cs.CV

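TL;DR: Proposes VCFlow, a hierarchical decoding framework that reconstructs visual experiences from fMRI without subject-specific training, addressing the challenge of cross-subject generalization.

Details Motivation: A method that reconstructs visual experiences from fMRI signals without subject-specific training would have considerable potential for clinical applications.

Contribution: Proposes the VCFlow framework, which models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations and uses a contrastive learning strategy to strengthen cross-subject semantic feature extraction.

Method: A hierarchical architecture captures multi-dimensional information from the visual system, and feature-level contrastive learning improves subject invariance.

Result: At an average accuracy cost of only 7%, VCFlow generates each reconstructed video in 10 seconds without any retraining, indicating efficiency and potential for clinical use.

Insight: Combining hierarchical modeling of the visual system with contrastive learning can effectively improve cross-subject visual decoding.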

Abstract: Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.

[53] Zero-Shot Multi-Animal Tracking in the Wild

Jan Frederik Meier,Timo Lüddecke

Main category: cs.CV

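TL;DR: The paper proposes a zero-shot multi-animal tracking framework that combines a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and can be applied to new datasets without retraining or hyperparameter adaptation.

Details Motivation: Multi-animal tracking is important for understanding animal ecology and behavior, but variations in habitat, motion patterns, and species appearance make it challenging. Traditional approaches typically require extensive model fine-tuning and heuristic design for each scenario.

Contribution: The main contribution is a zero-shot multi-animal tracking framework built on vision foundation models that achieves strong and consistent performance on new datasets.

Method: The method combines a Grounding Dino object detector with the SAM 2 tracker and refines tracking with carefully designed heuristics.

Result: Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 show strong performance across diverse species and environments.

Insight: Combining existing vision foundation models can effectively address zero-shot multi-animal tracking, reducing dependence on dataset-specific training and per-scenario manual design.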

Abstract: Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.

[54] Robust Face Liveness Detection for Biometric Authentication using Single Image

Poulami Raha,Yeongnam Chae

Main category: cs.CV

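TL;DR: The paper proposes a lightweight CNN framework that detects spoofing attacks on face recognition systems (print/display, video, and wrap attacks) and supports fast biometric authentication. It also presents a new dataset of more than 500 videos.

Details Motivation: Face recognition systems are vulnerable to presentation attacks (such as print/display, video, and wrap attacks), which can lead to illegitimate access. Improving security requires an efficient liveness detection method.

Contribution: 1. Proposes a lightweight CNN framework for fast detection of several spoofing attacks. 2. Creates a new dataset of more than 500 videos. 3. Provides a demonstration of attack detection in practice.

Method: A lightweight CNN architecture detects spoofing attacks, including print/display, video, and wrap attacks, from a single input image.

Result: The framework performs biometric authentication in 1-2 seconds on CPU, indicating that it is efficient enough for practical use.

Insight: A lightweight CNN can provide fast liveness detection without sacrificing performance, and the new dataset may help make future research more reliable.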

Abstract: Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) towards presentation attacks. Using spoofing (a.k.a. presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection ensuring faster biometric authentication (1-2 seconds on CPU). Further, this paper also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed at the following link: https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa

[55] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Tianfan Peng,Yuntao Du,Pengzhou Ji,Shijie Dong,Kailin Jiang,Mingchuan Ma,Yijun Tian,Jinhe Bi,Qian Li,Wei Du,Feng Xiao,Lizhen Cui

Main category: cs.CV

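TL;DR: This paper presents UniPruneBench, a unified and extensible benchmark for visual token pruning, aimed at the inference inefficiency that redundant visual tokens cause in large multimodal models (LMMs).

Details Motivation: Current multimodal models are inefficient at inference because image encoders introduce a large number of visual tokens. Existing token compression methods are evaluated in a fragmented and inconsistent way, so a unified benchmark is needed.

Contribution: Proposes the UniPruneBench benchmark, which covers six ability dimensions and ten datasets, evaluates ten compression algorithms and three families of LMMs, and combines task accuracy with system-level metrics.

Method: A standardized protocol is used to compare compression algorithms across multimodal models, with particular attention to how the pruning ratio affects performance.

Result: The study finds that (1) random pruning is a strong baseline, (2) no single method performs best in all scenarios, (3) sensitivity to pruning varies widely across tasks, and (4) the pruning ratio is the dominant factor in performance degradation.

Insight: Future work on efficient multimodal modeling needs to balance task characteristics against the pruning ratio, and UniPruneBench offers a reliable foundation for it.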

Abstract: Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
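The random pruning baseline that finding (1) refers to can be sketched in a few lines: keep a uniformly random subset of the visual tokens and preserve their original order. This is a generic illustration rather than the benchmark's code; `random_prune`, the `seed` argument, and the choice to always keep at least one token are assumptions.

```python
import random


def random_prune(tokens, pruning_ratio, seed=0):
    """Drop roughly `pruning_ratio` of the tokens uniformly at random.

    Surviving tokens keep their original relative order.
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    if not tokens:
        return []
    keep = max(1, round(len(tokens) * (1.0 - pruning_ratio)))
    rng = random.Random(seed)  # local generator keeps the result reproducible
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]
```

With 576 visual tokens and a pruning ratio of 0.75, the call keeps 144 tokens. Any learned pruning method has to beat this at the same ratio to justify its extra cost.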

[56] Differentiable Hierarchical Visual Tokenization

Marius Aasan,Martine Hjelkrem-Tan,Nico Catalano,Changkyu Choi,Adín Ramírez Rivera

Main category: cs.CV

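TL;DR: The paper proposes an end-to-end differentiable hierarchical visual tokenizer that addresses the way fixed patch tokens in conventional ViTs ignore the spatial and semantic structure of images.

Details Motivation: Vision Transformers (ViTs) use fixed patch tokens that ignore the spatial and semantic structure of images. The authors seek a differentiable tokenizer that adapts to image content while remaining compatible with existing architectures.

Contribution: 1. Proposes an end-to-end differentiable hierarchical visual tokenization method. 2. The method supports adaptive tokenization at pixel-level granularity. 3. It remains compatible with existing pretrained models and supports both image classification and dense prediction.

Method: Hierarchical model selection with information criteria provides content-adaptive tokenization while supporting end-to-end training.

Result: Delivers competitive performance on image classification and dense prediction, and supports raster-to-vector conversion.

Insight: Hierarchical tokenization can better capture the fine structure of images while keeping the model general.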

Abstract: Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

[57] VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Zhicheng Zhang,Weicheng Wang,Yongjie Zhu,Wenyu Qin,Pengfei Wan,Di Zhang,Jufeng Yang

Main category: cs.CV

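TL;DR: The paper proposes VidEmo, an affective cues-guided reasoning framework for video emotion understanding. With two-stage tuning (emotion knowledge injection and affective-tree reinforcement learning) and a new dataset, Emo-CFG, it achieves competitive performance on 15 tasks.

Details Motivation: The dynamic and cue-dependent nature of emotions makes video emotion understanding challenging, and existing methods struggle to capture complex emotional states and the reasoning behind them.

Contribution: 1. Proposes the VidEmo framework, which combines affective-cue reasoning with instruction following. 2. Develops a two-stage tuning process (emotion knowledge injection plus affective-tree reinforcement learning). 3. Builds the Emo-CFG dataset (2.1M samples) with explainable emotional question answering and fine-grained annotations.

Method: 1. Curriculum emotion learning injects emotion knowledge. 2. Affective-tree reinforcement learning enables emotion reasoning. 3. A stage-wise framework unifies fundamental attribute perception, expression analysis, and high-level emotional understanding.

Result: Achieves competitive performance across 15 face perception tasks, which the authors describe as a new milestone for emotion understanding.

Insight: Affective-tree reasoning and multi-stage tuning can capture dynamic emotional states effectively, and Emo-CFG is an important resource for fine-grained emotion analysis.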

Abstract: Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

[58] LLEXICORP: End-user Explainability of Convolutional Neural Networks

Vojtěch Kůr,Adam Bajger,Adam Kukučka,Marek Hradil,Vít Musil,Tomáš Brázdil

Main category: cs.CV

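TL;DR: LLEXICORP is a modular pipeline that couples concept relevance propagation (CRP) with a multimodal large language model to generate natural-language explanations automatically, lowering the barrier to understanding deep neural networks.

Details Motivation: Current CRP workflows rely on experts to interpret results manually, which limits scalability and accessibility. LLEXICORP aims to make CNN models more transparent by automating naming and explanation.

Contribution: 1. Proposes a framework that combines CRP with a multimodal language model. 2. Generates natural-language explanations automatically. 3. Supports descriptions tailored to different audiences.

Method: CRP extracts concept prototypes, and a multimodal language model names them and generates explanations automatically. Example-based prompts ensure that the language model understands the semantics of CRP.

Result: A qualitative evaluation on ImageNet with VGG16 suggests that the method produces intuitive explanations and improves model interpretability.

Insight: Combining language models with concept-based explanation methods can markedly lower the barrier to interpreting deep learning models and support more transparent AI systems.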

Abstract: Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.

[59] Dynamic Reflections: Probing Video Representations with Text Alignment

Tyler Zhu,Tengda Han,Leonidas Guibas,Viorica Pătrăucean,Maks Ovsjanikov

Main category: cs.CV

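TL;DR: The paper presents the first systematic study of cross-modal alignment between video and text representations. It shows how the richness of visual and text data affects alignment and proposes parametric test-time scaling laws. It also examines the relationship between semantic alignment and downstream task performance, offering a new way to assess video representations.

Details Motivation: Although image-text alignment has progressed, the temporal nature of video data remains largely unexplored in multimodal alignment. The paper aims to fill this gap and to examine the potential and challenges of aligning video and text representations.

Contribution: 1. The first comprehensive study of video-text representation alignment. 2. Parametric test-time scaling laws that predict alignment. 3. Evidence of a link between semantic alignment and downstream task performance. 4. A zero-shot way to probe representations of spatio-temporal data.

Method: Experiments analyze the cross-modal alignment of modern video and language encoders and examine how the richness of visual data (static images vs. multi-frame videos) and text data (single caption vs. a collection) affects alignment. Scaling laws are proposed and their predictive power is validated.

Result: Alignment depends strongly on the richness of the data provided at test time, and the scaling laws predict the empirical observations well. Strong semantic alignment correlates with general-purpose video understanding, giving a new basis for evaluating models.

Insight: Video-text alignment is both an important direction for multimodal research and a tool for assessing the representation power of video encoders, especially in zero-shot settings. The link between temporal reasoning and alignment provides a challenging test-bed for vision and language models.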

Abstract: The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/

[60] PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Antonio Oroz,Matthias Nießner,Tobias Kirschstein

Main category: cs.CV

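TL;DR: PercHead is a method for single-image 3D head reconstruction and semantic 3D editing. A dual-branch encoder and a ViT-based decoder produce view-consistent 3D reconstructions, and a perceptual supervision strategy improves geometric and appearance fidelity.

Details Motivation: Single-image 3D head reconstruction and editing are challenging because of view occlusions, weak supervision signals, and the ambiguity of editing in 3D space.

Contribution: Proposes a unified 3D head reconstruction model with a perceptual supervision strategy (DINOv2 and SAM2.1), and extends it to support semantic 3D editing.

Method: A dual-branch encoder and a ViT-based decoder lift 2D features into 3D space through iterative cross-attention, and rendering uses Gaussian Splatting.

Result: Achieves state-of-the-art novel-view synthesis, is robust to extreme viewing angles, and supports intuitive 3D editing.

Insight: Perceptual supervision can effectively improve 3D reconstruction quality, and disentangling geometry from style simplifies editing.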

Abstract: We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE

[61] When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Yiyang Zhou,Haoqin Tu,Zijun Wang,Zeyu Wang,Niklas Muennighoff,Fan Nie,Yejin Choi,James Zou,Chaorui Deng,Shen Yan,Haoqi Fan,Cihang Xie,Huaxiu Yao,Qinghao Ye

Main category: cs.CV

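TL;DR: MIRA is a new benchmark that evaluates models on tasks where generating intermediate visual images is needed to support reasoning, underscoring the importance of visual chain-of-thought (Visual-CoT).

Details Motivation: Traditional chain-of-thought (CoT) methods rely solely on text and cannot handle complex reasoning tasks that need visual aids. Humans often draw to think, whereas existing models perform poorly on such tasks, which motivates MIRA.

Contribution: 1. Proposes MIRA, a benchmark focused on visual reasoning. 2. Includes 546 multimodal problems annotated with intermediate visual images and final answers. 3. Designs a unified evaluation protocol that spans several input levels. 4. Confirms the key role of visual information in model reasoning.

Method: 1. Designs complex tasks that require generating intermediate images to support reasoning. 2. Provides three input settings (direct input, text-only CoT input, and Visual-CoT input). 3. Reports pass@k and majority voting accuracy. 4. Compares text-only prompting with prompting that supplies visual cues.

Result: Existing models perform poorly with text-only prompts, but intermediate visual cues yield an average relative gain of 33.7%. Expanding the search space or designing textual prompts aligned with Visual-CoT brings only limited improvement, confirming the importance of visual information.

Insight: Visual information is indispensable in complex reasoning tasks, and imagining and generating intermediate images can improve model performance markedly. Future multimodal models need to place more weight on the ability to generate visual chains of thought.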

Abstract: We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through “drawing to think”. To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
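The two aggregate metrics mentioned above can be sketched as follows. The pass@k function uses the combinatorial estimator that is common in code-generation evaluation; whether MIRA computes pass@k this way is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k samples is correct.

    n: samples generated per problem, c: how many were correct,
    k: samples drawn without replacement (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few wrong samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote_correct(answers, reference):
    """True if the most frequent sampled answer equals the reference.

    Ties go to the answer that appeared first among the samples.
    """
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference
```

Averaging `pass_at_k` over problems gives an optimistic upper bound on capability, while averaging `majority_vote_correct` measures whether the model's most consistent answer is right.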

[62] PLUTO-4: Frontier Pathology Foundation Models

Harshith Padigela,Shima Nofallah,Atchuth Naveen Chilaparasetti,Ryun Han,Andrew Walker,Judy Shen,Chintan Shah,Blake Martin,Aashish Sood,Elliot Miller,Ben Glass,Andy Beck,Harsha Pokkalla,Syed Ashar Javed

Main category: cs.CV

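TL;DR: PLUTO-4 is a family of frontier pathology foundation models comprising the efficient PLUTO-4S and the frontier-scale PLUTO-4G. Pretrained with a self-supervised objective on a large pathology image corpus, the models achieve state-of-the-art performance on a range of pathology tasks.

Details Motivation: Foundation models trained on large-scale pathology images transfer well across tasks, but further scaling and architectural refinement are needed to serve diverse application requirements.

Contribution: 1. Introduces the PLUTO-4 family, with the compact and efficient PLUTO-4S and the frontier-scale PLUTO-4G. 2. Uses a FlexiViT setup with 2D-RoPE embeddings for multi-scale deployment. 3. Pretrains on a large multi-institutional pathology image corpus.

Method: 1. Pretraining uses a self-supervised objective derived from DINOv2. 2. PLUTO-4S adopts a FlexiViT setup to support multiple scales. 3. PLUTO-4G is trained with a single patch size to maximize representation capacity.

Result: Performs strongly on multiple public and internal benchmarks, including an 11% improvement in dermatopathology diagnosis, and PLUTO-4S offers high-throughput performance for deployment.

Insight: Through architectural refinement and large-scale pretraining, PLUTO-4 shows the broad potential of foundation models in pathology, particularly for practical deployment and research use.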

Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4’s potential to transform real-world applications as a backbone for translational research and diagnostic use cases.

[63] DenseMarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Dmitrii Pozdeev,Alexey Artemov,Ananta R. Bhattarai,Artem Sevastopolsky

Main category: cs.CV

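TL;DR: DenseMarks is a new learned representation that uses point tracks to learn canonical embeddings for human heads, enabling high-quality dense correspondences.

Details Motivation: The work targets high-quality dense correspondence between human head images, with particular attention to robustness across diverse poses and individuals.

Contribution: Proposes the DenseMarks representation, which combines a contrastive loss, multi-task learning, and a spatial continuity constraint to form an interpretable canonical space.

Method: A Vision Transformer predicts a 3D embedding for each pixel; the network is trained with a contrastive loss and multi-task learning (face landmarks and segmentation).

Result: State-of-the-art results in geometry-aware point matching and in monocular head tracking with 3D Morphable Models.

Insight: The canonical-space bottleneck keeps the representation consistent across poses and individuals, and the representation covers the entire head, including hair.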

Abstract: We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.

eess.IV [Back]

[64] Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization

Ali Almuallem,Harshana Weligampola,Abhiram Gnanasambandam,Wei Xu,Dilshan Godaliyadda,Hamid R. Sheikh,Stanley H. Chan,Qi Guo

Main category: eess.IV

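TL;DR: The paper proposes a two-stage method for designing opto-electronic convolutional neural networks that directly optimizes the kernels of the optical front-end, reducing computational and memory demands and outperforming end-to-end training on depth estimation.

Details Motivation: Conventional end-to-end optimization of opto-electronic neural networks is limited in efficiency and stability by costly simulations and large parameter spaces. This paper aims to address that limitation.

Contribution: The main contribution is a two-stage design strategy that, by directly optimizing the kernels of the optical front-end, substantially reduces computational and memory demands while improving training stability.

Method: The method has two stages: 1. train a standard electronic CNN; 2. realize the optical front-end, implemented as a metasurface array, through direct kernel optimization of the CNN's first convolutional layer.

Result: On monocular depth estimation, the method achieves twice the accuracy of end-to-end training under the same training time and resource constraints.

Insight: Separating the optimization of the optical and electronic modules can substantially reduce computational cost and improve performance, suggesting a new direction for opto-electronic neural network design.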

Abstract: Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.

[65] MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization

Yalda Zafari,Hongyi Pan,Gorkem Durak,Ulas Bagci,Essam A. Rashed,Mohamed Mabrok

Main category: eess.IV

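TL;DR: MammoClean is a framework for standardization and bias quantification that addresses heterogeneity across mammography datasets, with the goal of improving the generalization and clinical reliability of AI models.

Details Motivation: Heterogeneity across mammography datasets (differences in data quality, metadata standards, and population distributions) leads to poor model generalization and limits the clinical deployment of AI systems.

Contribution: Proposes the MammoClean framework, which standardizes case selection and image processing, unifies metadata, and systematically quantifies sources of bias.

Method: The framework covers standardized case selection, image processing (such as laterality and intensity correction), and metadata unification. It is applied to the CBIS-DDSM, TOMPEI-CMMD, and VinDr-Mammo datasets to quantify shifts in breast density and abnormality prevalence.

Result: Experiments show that AI models trained on uncurated datasets suffer significant performance degradation, whereas datasets processed with MammoClean improve cross-domain generalization.

Insight: Dataset standardization and bias quantification are key steps toward clinically reliable AI models, and MammoClean offers a practical tool for building fair and effective mammography AI systems.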

Abstract: The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection and image processing (including laterality and intensity correction) and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available at https://github.com/Minds-R-Lab/MammoClean.

[66] Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback

Alix de Langlais,Benjamin Billot,Théo Aguilar Vidal,Marc-Olivier Gauci,Hervé Delingette

Main category: eess.IV

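TL;DR: SCORE is a weakly supervised framework that learns to refine medical image segmentations from light feedback, reducing annotation requirements while performing on par with existing refinement methods.

Details Motivation: Medical image segmentation demands high accuracy, but fully supervised methods are costly to annotate and existing automated segmentation tools may not meet clinical standards. SCORE aims to reduce the supervision required.

Contribution: Proposes the SCORE framework, whose novel loss function uses region-wise quality scores and over/under-segmentation error labels to reduce dependence on dense annotations while refining segmentations.

Method: SCORE is trained on light feedback from regional evaluations, using the novel loss in place of fully supervised annotations to refine initial segmentations.

Result: In experiments on humerus CT scans, SCORE considerably improves the initial segmentations and performs on par with existing refinement methods while greatly reducing annotation time and supervision requirements.

Insight: Weakly supervised frameworks hold considerable promise for medical image segmentation: light feedback can substantially reduce the annotation burden while matching the performance of refinement methods that need heavier supervision.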

Abstract: Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.

cs.LG [Back]

[67] Retrieval-Augmented Multimodal Depression Detection

Ruibo Hou,Shiyu Teng,Jiaqing Liu,Shurong Chai,Yinhao Li,Lanfen Lin,Yen-Wei Chen

Main category: cs.LG

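TL;DR: The paper proposes a retrieval-augmented generation approach to depression detection that combines a sentiment dataset with an LLM to generate an Emotion Prompt, improving multimodal depression detection.

Details Motivation: Existing multimodal depression detection methods suffer from high computational cost, domain mismatch, and static knowledge limitations. The authors address these issues with a retrieval-augmented framework.

Contribution: Proposes a novel Retrieval-Augmented Generation (RAG) framework that uses a sentiment dataset and an LLM to generate an Emotion Prompt, enriching emotional representation and improving interpretability.

Method: The method retrieves emotionally relevant content from a sentiment dataset, has an LLM generate an Emotion Prompt as an auxiliary modality, and combines it with text, audio, and video signals for depression detection.

Result: On the AVEC 2019 dataset the approach achieves state-of-the-art performance, with a CCC of 0.593 and an MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.

Insight: Used as an auxiliary modality, the Emotion Prompt can markedly enrich emotional representation and improve interpretability, offering a new direction for multimodal depression detection.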

Abstract: Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
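
The two reported metrics can be computed as follows. This is a minimal sketch assuming the standard definitions: Lin's concordance correlation coefficient, CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), and the mean absolute error. The function names are hypothetical and this is not the authors' evaluation script.

```python
def ccc(y_true, y_pred):
    """Concordance correlation coefficient (population variances)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denom


def mae(y_true, y_pred):
    """Mean absolute error."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Unlike Pearson correlation, CCC penalizes a constant offset: predictions that are perfectly correlated with the targets but shifted by one unit score below 1.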

[68] TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding

Aditya Sridhar,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

Main category: cs.LG

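TL;DR: TapOut is a dynamic speculative decoding method based on multi-armed bandits that selects the best speculation strategy online, accelerating LLMs without hyperparameter tuning.

Details Motivation: Existing dynamic speculative decoding methods rely on hand-tuned, sensitive thresholds that are costly to set and generalize poorly. TapOut aims to remove this dependence.

Contribution: Proposes TapOut, a training-free, plug-and-play online algorithm that uses multi-armed bandits to select the best speculation strategy dynamically.

Method: A meta-algorithm selects among parameter-free dynamic speculation strategies based on past reward and exploration.

Result: Experiments show that TapOut, without any tuning, matches or exceeds existing baselines.

Insight: The multi-armed bandit framework offers a robust solution for dynamic speculative decoding that generalizes well.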

Abstract: Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach’s effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
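
The abstract describes a bandit that chooses among speculation strategies using past reward and exploration, but does not name the bandit algorithm or the reward. The sketch below uses the standard UCB1 rule as a stand-in, with each arm standing for one hypothetical speculation strategy and a simulated Bernoulli reward in place of a real speedup signal. All names are assumptions for illustration.

```python
import math
import random


class UCB1:
    """UCB1 bandit: pick the arm with the highest mean reward plus bonus."""

    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self) -> int:
        # Pull every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
            + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(mean_rewards, steps, seed=0):
    """Run UCB1 against simulated Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    bandit = UCB1(len(mean_rewards))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < mean_rewards[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts
```

In a decoding loop, select would be called before each drafting round and update afterwards with a reward derived from that round, for example the fraction of drafted tokens the target model accepted (again an assumption; the abstract does not define the reward).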

[69] Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning

Vivswan Shah,Randy Cogill,Hanwei Yue,Gopinath Chennupati,Rinat Khaziev

Main category: cs.LG

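TL;DR: The paper studies whether attaching a brief explanation, or a random token sequence (a pseudo-explanation), to each label during fine-tuning improves language model classification. It finds that even semantically empty pseudo-explanations improve robustness and accuracy through their structural effect.

Details Motivation: Conventional fine-tuning maps inputs directly to labels and overlooks the possibility that the explanation behind a label could improve model performance.

Contribution: 1. Proposes explanation-enhanced fine-tuning, using both genuine explanations and pseudo-explanations; 2. Finds that pseudo-explanations (random tokens) also improve performance through a structural effect; 3. Demonstrates the approach across multiple datasets and tasks.

Method: Using ensemble-generated data from multiple LLMs, a 7B-parameter model is fine-tuned and tested on six conversational datasets. The experiments compare genuine explanations with pseudo-explanations (such as shuffled or bag-of-words variants).

Result: Across 18 dataset-task settings, fine-tuning with explanations (including pseudo-explanations) outperforms label-only baselines. Pseudo-explanations often close much of the gap to genuine explanations.

Insight: The benefit of explanations, even pseudo-explanations, comes mainly from structure rather than semantic content: the extra tokens encourage richer intermediate computation at inference time, which reduces overfitting and improves robustness.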

Abstract: Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on a 5-point scale. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset-task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
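
To make the pseudo-explanation idea concrete, the sketch below builds a shuffled variant and a bag-of-words variant of an explanation, then formats a label-plus-explanation training target. The abstract does not specify how the variants or the target strings are constructed, so whitespace tokenization, the sorted unique-word bag, and the target template are all assumptions, and the function names are hypothetical.

```python
import random


def shuffled_explanation(explanation: str, seed: int = 0) -> str:
    """Same tokens as the original, in a random order (syntax destroyed)."""
    tokens = explanation.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)


def bag_of_words_explanation(explanation: str) -> str:
    """Unique lowercased tokens in sorted order (vocabulary preserved)."""
    return " ".join(sorted(set(explanation.lower().split())))


def build_target(label: int, explanation: str) -> str:
    """Hypothetical fine-tuning target: the label followed by explanation text."""
    return f"label: {label}\nexplanation: {explanation}"


original = "The reply stays on topic and answers the question fully"
pseudo = shuffled_explanation(original)
target = build_target(4, pseudo)
```

Both variants keep the vocabulary of the original explanation while removing its syntax, which matches the abstract's description of text that is "syntactically incoherent yet vocabulary-aligned".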

[70] OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning

Kevin Valencia,Thilina Balasooriya,Xihaier Luo,Shinjae Yoo,David Keetae Park

Main category: cs.LG

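TL;DR: OmniField is a continuity-aware framework built on neural fields. By conditioning on the available modalities and iteratively fusing cross-modal context, it addresses sparsity, noise, and missing modalities in multimodal spatiotemporal data.

Details Motivation: Real-world multimodal spatiotemporal data is often sparse, irregular, and noisy, with modalities missing, which calls for an adaptable and robust learning method.

Contribution: 1) Designs OmniField, a continuity-aware framework that conditions on multimodal data and iteratively fuses cross-modal context; 2) Proposes a multimodal crosstalk block architecture that supports unified reconstruction, interpolation, forecasting, and cross-modal prediction.

Method: OmniField learns a continuous representation with a neural field and aligns signals through multimodal crosstalk blocks and iterative cross-modal refinement, without gridding or surrogate preprocessing.

Result: Experiments show that OmniField outperforms eight strong baselines across tasks and, under heavy noise, remains close to clean-input performance.

Insight: Conditioning on and iteratively fusing cross-modal information can markedly improve the performance and robustness of multimodal spatiotemporal learning, especially when modalities are missing.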

Abstract: Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.

cs.MM [Back]

[71] An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM

Jiawei Liu,Enis Berk Çoban,Zarina Schevchenko,Hao Tang,Zhigang Zhu,Michael I Mandel,Johanna Devaney

Main category: cs.MM

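TL;DR: The paper studies how interleaved instruction tuning affects semantic reasoning in a multimodal large language model (MLLM) and finds that interleaved prompting improves reasoning performance but reduces audio labeling ability.

Details Motivation: Standard MLLM training may not integrate modalities deeply, which limits the model's reasoning ability. This paper examines the effect of interleaved instruction tuning.

Contribution: Introduces SHARD, a new semantic reasoning dataset, and experimentally evaluates the effectiveness of interleaved instruction tuning in an audio MLLM along with its performance trade-off.

Method: Using the Listen, Think, and Understand (LTU) model, the authors instruction-tune with audio tokens interleaved within the prompt and evaluate on the SHARD dataset.

Result: Interleaved prompting improves semantic reasoning both zero-shot and with a small amount of fine-tuning, but it also reduces the model's audio labeling ability.

Insight: In multimodal tasks, interleaved instruction tuning may require a trade-off between reasoning ability and modality-specific ability.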

Abstract: Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model’s ability to leverage the core language model’s reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created reasoning benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that even zero-shot interleaved prompting improves performance on our reasoning tasks, and that a small amount of fine-tuning using interleaved training prompts improves the results further, although at the expense of the MLLM’s audio labeling ability.

cs.RO [Back]

[72] A Step Toward World Models: A Survey on Robotic Manipulation

Peng-Fei Zhang,Ying Cheng,Xiaofan Sun,Shijie Wang,Lei Zhu,Heng Tao Shen

Main category: cs.RO

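TL;DR: This survey covers world models in robotic manipulation and examines how they enable agents to understand and act in complex, dynamic environments.

Details Motivation: Autonomous agents must perform tasks such as manipulation, navigation, and decision-making in complex, dynamic, and uncertain environments. Doing so requires understanding the underlying mechanisms and dynamics of the world, beyond purely reactive control or simple replication of observed states. This motivates world models that encode environmental states, capture dynamics, and support prediction, planning, and reasoning.

Contribution: The main contributions are: (1) a review of robotic manipulation methods that analyzes approaches exhibiting the core capabilities of world models; (2) a discussion of their roles across perception, prediction, and control; (3) a summary of the core components, capabilities, and functions of world models; (4) a roadmap for developing generalizable and practical world models for robotics.

Method: The paper is a survey. It analyzes work in robotic manipulation that exhibits the core capabilities of world models, focusing on its use in perception, prediction, and control.

Result: From its analysis of existing methods, the survey identifies the core capabilities a world model should have (such as state encoding, dynamics capture, and reasoning) and summarizes current challenges and solutions.

Insight: World models are more than a theoretical concept: they are a key tool for accomplishing complex tasks in practice. Future research should focus on generalizability and practicality, particularly in dynamic and uncertain environments.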

Abstract: Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision-making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond purely reactive control or simple replication of observed states. This motivates the development of world models as internal representations that encode environmental states, capture dynamics, and enable prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, rather than directly imposing a fixed definition and limiting our scope to methods explicitly labeled as world models, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a real world model should possess. Building on this analysis, we aim to outline a roadmap for developing generalizable and practical world models for robotics.

[73] TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System

Yanjie Ze,Siheng Zhao,Weizhuo Wang,Angjoo Kanazawa,Rocky Duan,Pieter Abbeel,Guanya Shi,Jiajun Wu,C. Karen Liu

Main category: cs.RO

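TL;DR: TWIST2 is a scalable, portable, and holistic humanoid data collection system that replaces expensive motion capture setups with a low-cost solution, preserves whole-body control, and demonstrates efficient data collection.

Details Motivation: Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups, which limits the scale and efficiency of data collection. TWIST2 aims to address this.

Contribution: Proposes TWIST2, a portable, mocap-free humanoid teleoperation system that provides efficient whole-body control and data collection, and open-sources both the system and the dataset.

Method: The system uses PICO4U VR to capture human motion in real time, adds a low-cost 2-DoF robot neck for egocentric vision, and develops a hierarchical visuomotor policy framework.

Result: The system collects 100 demonstrations in 15 minutes with an almost 100% success rate, and the policy demonstrates whole-body dexterous manipulation and dynamic kicking tasks.

Insight: TWIST2 shows the potential of low-cost, portable systems for efficient data collection and humanoid control, and provides open-source tools and data for related research.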

Abstract: Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .

cs.AI [Back]

[74] InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

Ziheng Geng,Jiachen Liu,Ran Cao,Lu Cheng,Dan M. Frangopol,Minghui Cheng

Main category: cs.AI

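TL;DR: The paper proposes InsurAgent, a large language model (LLM)-based agent that simulates individual flood insurance purchase behavior through five modules (perception, retrieval, reasoning, action, and memory), compensating for LLMs' weakness in estimating quantitative probabilities.

Details Motivation: Low flood insurance participation among at-risk populations in the United States points to the need to understand the behavioral mechanisms behind insurance decisions. LLMs show human-like intelligence but fall short in estimating quantitative probabilities, so a new tool is needed.

Contribution: Proposes the InsurAgent framework, which combines a retrieval-augmented generation (RAG) module with LLM common-sense reasoning to improve both quantitative and qualitative prediction and supports simulation of decisions evolving over time.

Method: The design has five modules: perception (takes in inputs), retrieval (uses RAG to ground decisions in survey data), reasoning (uses LLM common sense to extrapolate beyond the survey data), action (generates decisions), and memory (supports temporal evolution).

Result: InsurAgent estimates marginal and bivariate probabilities accurately and captures contextual information that is intractable for traditional models.

Insight: Combining LLMs with domain data can markedly improve behavioral modeling and offers a new tool for policy analysis.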

Abstract: Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.

[75] Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning

Yibo Zhao,Yang Zhao,Hongru Du,Hao Frank Yang

Main category: cs.AI

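TL;DR: The paper proposes ATHENA, a personalized decision modeling framework that combines symbolic utility theory with the textual reasoning of LLMs. It performs group-level symbolic utility discovery followed by individual-level semantic adaptation and markedly improves prediction in high-stakes decision scenarios.

Details Motivation: In high-stakes decisions such as vaccine uptake, individual choices often diverge from population-optimal predictions because each person's decision process is unique and shaped by complex linguistic influences. The paper aims to close this gap.

Contribution: Proposes the ATHENA framework, which integrates symbolic utility modeling with semantic adaptation and offers a new scheme for personalized decision modeling.

Method: ATHENA has two stages: 1) it identifies group-level utility functions through LLM-augmented symbolic discovery; 2) it adapts individual-level semantic templates guided by the optimal utility.

Result: On travel mode and vaccine choice tasks, ATHENA lifts F1 score by at least 6.5% over the strongest existing models, and ablation studies confirm that both stages are necessary.

Insight: Combining symbolic reasoning with semantic adaptation is key to modeling personalized decisions and points toward decision models that are both interpretable and accurate.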

Abstract: Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.

[76] Training Proactive and Personalized LLM Agents

Weiwei Sun,Xuhui Zhou,Weihua Du,Xingyao Wang,Sean Welleck,Graham Neubig,Maarten Sap,Yiming Yang

Main category: cs.AI

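TL;DR: The paper proposes PPP, a multi-objective reinforcement learning approach that jointly optimizes productivity, proactivity, and personalization, using the UserVille interactive environment and LLM-based user simulators to make AI agents more useful in practice.

Details Motivation: Existing work focuses mainly on task success and overlooks that real-world agents also need proactivity (asking essential questions) and personalization (adapting to user preferences).

Contribution: Proposes PPP, which jointly optimizes productivity, proactivity, and personalization, and introduces the UserVille environment to simulate diverse user preferences.

Method: Agents are trained with multi-objective reinforcement learning (PPP) against LLM-based user simulators in UserVille, optimizing interaction along all three dimensions.

Result: Experiments on software engineering and deep research tasks show that PPP outperforms strong baselines such as GPT-5 (+21.6 on average), asks strategic clarifying questions, and adapts to unseen user preferences.

Insight: Explicitly optimizing for user-centered interaction is critical for building practical AI agents; proactivity and personalization are key complements to task success.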

Abstract: While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.

[77] Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

Zhiwei Zhang,Xiaomin Li,Yudi Lin,Hui Liu,Ramraj Chandradevan,Linlin Wu,Minhua Lin,Fali Wang,Xianfeng Tang,Qi He,Suhang Wang

Main category: cs.AI

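TL;DR: The paper studies collaboration in multi-agent LLM systems on complex reasoning tasks, explains where "lazy agent" behavior comes from, and proposes a verifiable reward mechanism to promote effective collaboration.

Details Motivation: In multi-agent LLM frameworks, one agent often dominates while the other contributes little, a "lazy agent" phenomenon that limits collaboration. The paper aims to address this problem and unlock the full potential of multi-agent collaboration.

Contribution: 1. A theoretical analysis of lazy agent behavior; 2. A stable method for measuring causal influence; 3. A verifiable reward mechanism that lets the reasoning agent discard noisy outputs and restart its reasoning process.

Method: 1. A theoretical analysis shows why lazy behavior arises; 2. A causal influence measure is used to improve collaboration; 3. A verifiable reward mechanism lets the reasoning agent deliberate and restart its reasoning.

Result: Experiments show that the method alleviates lazy agent behavior and improves the multi-agent framework on complex reasoning tasks.

Insight: Lazy behavior in multi-agent collaboration can be addressed through theoretical analysis and mechanism design; restarting when necessary and handling noisy outputs are key strategies for better collaboration.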

Abstract: Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.

[78] CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Jiayu Liu,Cheng Qian,Zhaochen Su,Qing Zong,Shijue Huang,Bingxiang He,Yi R. Fung

Main category: cs.AI

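TL;DR: CostBench is a benchmark that evaluates LLM agents on cost-optimal planning in multi-turn tasks and on adaptation to dynamic environments, filling a gap in existing evaluations that overlook resource efficiency and adaptability.

Details Motivation: Existing evaluations of LLM agents focus mainly on task completion and overlook resource efficiency and dynamic adaptation, which limits agents' usefulness in changing real-world environments.

Contribution: Proposes the CostBench benchmark for systematically evaluating agents' economic reasoning and replanning abilities, with support for simulating changing environments and for cost-optimal planning.

Method: Set in the travel-planning domain, the benchmark comprises tasks solvable with a variety of atomic and composite tools, and introduces four types of dynamic blocking events (such as tool failures and cost changes) to test adaptation and planning.

Result: Experiments show that existing agents, including GPT-5, perform poorly at cost-optimal planning: the exact match rate on the hardest tasks is below 75%, and performance drops by a further 40% or so under dynamic conditions.

Insight: CostBench exposes current agents' weaknesses in cost-aware planning and dynamic adaptation and lays the groundwork for future agents that are both economically rational and robust.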

Abstract: Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
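
The notion of a cost-optimal plan over atomic and composite tools can be illustrated with a shortest-path search. The sketch below is not CostBench's implementation: it assumes each tool moves the task from one state to another at a fixed cost, and every tool name, state name, and cost is hypothetical. Replanning after a cost change or a tool failure amounts to rerunning the search on the updated tool list.

```python
import heapq


def cheapest_plan(tools, start, goal):
    """Dijkstra search over tools given as (name, src_state, dst_state, cost).

    Returns (total_cost, [tool names]) or None if the goal is unreachable.
    """
    adjacency = {}
    for name, src, dst, cost in tools:
        adjacency.setdefault(src, []).append((dst, cost, name))

    best = {start: 0.0}
    heap = [(0.0, start, [])]
    while heap:
        cost, state, plan = heapq.heappop(heap)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry
        for nxt, step_cost, name in adjacency.get(state, []):
            new_cost = cost + step_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, plan + [name]))
    return None


# Hypothetical tool set: two atomic tools and one composite shortcut.
TOOLS = [
    ("search_flight", "query", "options", 2),
    ("book_flight", "options", "booked", 5),
    ("search_and_book", "query", "booked", 6),
]

static_plan = cheapest_plan(TOOLS, "query", "booked")

# Dynamic event: the composite tool becomes more expensive, so replan.
UPDATED = [t if t[0] != "search_and_book" else (t[0], t[1], t[2], 9) for t in TOOLS]
replanned = cheapest_plan(UPDATED, "query", "booked")
```

Here the composite tool is cheapest at first (cost 6 against 2 + 5 = 7), and the two atomic tools become optimal once its cost rises to 9.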

[79] Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Huawei Lin,Yunzhi Shi,Tong Geng,Weijie Zhao,Wei Wang,Ravender Pal Singh

Main category: cs.AI

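TL;DR: Agent-Omni is a multimodal framework that requires no fine-tuning. A master agent coordinates specialized models to enable flexible multimodal reasoning, and the framework performs strongly on complex cross-modal tasks.

Details Motivation: Current multimodal large language models (MLLMs) are limited to fixed modality pairs and require costly fine-tuning, which makes flexible omni-modal understanding and reasoning hard to achieve.

Contribution: Proposes Agent-Omni, an agent-based framework that achieves multimodal reasoning by coordinating existing specialized models, with no retraining.

Method: A master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into a coherent response.

Result: Achieves strong results on text, image, audio, video, and omni benchmarks, particularly on complex cross-modal tasks.

Insight: The agent-based design makes the framework modular and extensible, so stronger models can be integrated as they become available.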

Abstract: Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.

physics.med-ph [Back]

[80] High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network

Abuobaida M. Khair,Wenjing Jiang,Yousuf Babiker M. Osman,Wenjun Xia,Xiaopeng Ma

Main category: physics.med-ph

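TL;DR: The paper proposes VRF-Net, a hybrid deep learning framework for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). It combines the global attention of a vision transformer with residual convolutional refinement and substantially improves recovery quality.

Details Motivation: MPI resolution is limited by downsampling and coil sensitivity variations, and conventional methods struggle to recover both large-scale structures and fine details.

Contribution: 1. Proposes VRF-Net, which combines a transformer with a residual network; 2. Designs a dual-stage downsampling strategy to reflect realistic MPI conditions; 3. Demonstrates the method's advantage on the public Open MPI dataset and on a simulated dataset.

Method: A vision transformer captures global features and a residual network refines local details. Training uses paired-image super-resolution.

Result: VRF-Net performs strongly on the Open MPI dataset at both 2x and 8x scaling. On average, relative to interpolation and CNN-based methods, it reduces nRMSE by 88.2%, increases peak signal-to-noise ratio (pSNR) by 44.7%, and improves structural similarity (SSIM) by 34.3%.

Insight: VRF-Net performs well at MPI system matrix recovery and offers a promising direction for future in vivo applications. A framework that combines global and local features may generalize to similar tasks.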

Abstract: This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.
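
For readers unfamiliar with the error metrics quoted above, the following is a minimal sketch of how nRMSE and pSNR are commonly computed. The normalization used here (RMSE divided by the range of the reference, and peak taken as the maximum absolute reference value) is an assumed convention, not necessarily the one used in the paper; the abstract does not state its definitions. The sample values are made up.

```python
import math

def rmse(ref, est):
    """Root-mean-square error between reference and estimate."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref))

def nrmse(ref, est):
    """RMSE normalized by the reference range (assumed convention)."""
    return rmse(ref, est) / (max(ref) - min(ref))

def psnr_db(ref, est):
    """Peak signal-to-noise ratio in dB, with the peak taken from the reference."""
    peak = max(abs(r) for r in ref)
    return 20.0 * math.log10(peak / rmse(ref, est))

# Made-up flattened system-matrix entries and a recovered estimate.
reference = [0.0, 2.0, 4.0, 6.0]
recovered = [1.0, 3.0, 5.0, 7.0]

error = nrmse(reference, recovered)      # RMSE of 1.0 over a range of 6.0
quality = psnr_db(reference, recovered)  # 20 * log10(6.0 / 1.0)
```

Lower nRMSE and higher pSNR both indicate a recovery closer to the reference, which is the direction of the improvements reported in the abstract.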

cs.HC [Back]

[81] SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model

Xingbo Wang,Samantha L. Huey,Rui Sheng,Saurabh Mehta,Fei Wang

Main category: cs.HC

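TL;DR: SciDaSynth is an interactive system built on large language models for efficiently extracting structured data from scientific literature. It supports multimodal information integration and data validation, and in a user study it produced high-quality data more efficiently than baseline methods.

Details Motivation: The explosive growth of scientific literature makes efficient extraction of structured data critical for advancing research and supporting decision-making, but existing tools struggle with multimodal and inconsistent information.

Contribution: Develops SciDaSynth, which uses large language models to generate structured data tables automatically from multiple information sources, and supports interactive validation and semantic grouping to resolve cross-document data inconsistencies.

Method: The system integrates multimodal information from text, tables, and figures, generates structured data tables in response to user queries, and provides visual summaries and semantic grouping to assist data validation.

Result: A within-subjects study with nutrition and NLP researchers shows that SciDaSynth produces high-quality structured data more efficiently than baseline methods.

Insight: Interactive design and human-AI collaboration are key to making data extraction more efficient, and multimodal integration with semantic grouping helps resolve inconsistencies in complex data.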

Abstract: The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users’ queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth’s effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks. The system code is available at https://github.com/xingbow/SciDaEx

[82] HAGI++: Head-Assisted Gaze Imputation and Generation

Chuhan Jiao,Zhiming Hu,Andreas Bulling

Main category: cs.HC

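TL;DR: HAGI++ is a multimodal diffusion-based method for gaze data imputation. It is the first to use head orientation sensors to exploit the correlation between head and eye movements, and it substantially improves imputation quality.

Details Motivation: Missing gaze data caused by blinks, pupil detection errors, or illumination changes hampers downstream analysis. HAGI++ addresses this by exploiting the correlation between head and eye movements.

Contribution: 1. Proposes the first gaze imputation method that uses head orientation sensors; 2. Uses the cross-modal learning capacity of a diffusion model to capture dependencies between head and eye movements; 3. By adding wrist motion, outperforms prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data.

Method: A transformer-based diffusion model learns cross-modal dependencies between head and eye movements and can be extended to incorporate other body movement data, such as wrist motion.

Result: Experiments on the Nymeria, Ego-Exo4D, and HOT3D datasets show that HAGI++ outperforms conventional interpolation methods and deep learning-based time-series imputation baselines, and produces gaze velocity distributions that more closely match real human behaviour.

Insight: Head motion data can serve as complementary information that markedly improves gaze imputation accuracy, and realistic gaze data can be generated even when gaze is entirely missing.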

Abstract: Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.

[83] SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration

Dan Bohus,Sean Andrist,Ann Paradiso,Nick Saw,Tim Schoonbeek,Maia Stiber

Main category: cs.HC

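TL;DR: SigmaCollab is an application-driven dataset that supports research on physically situated (real-world) human-AI collaboration. It contains multimodal data from 85 sessions, including audio, egocentric camera views, depth maps, and head, hand, and gaze tracking, and offers a more realistic testing ground for AI models.

Details Motivation: Existing datasets focus mostly on virtual or remote collaboration and lack data on human-AI collaboration in real physical environments. SigmaCollab aims to fill this gap and advance research on AI-assisted tasks in physical settings.

Contribution: Presents SigmaCollab, an application-driven multimodal dataset that supports research on human-AI collaboration in physical environments, with plans to build benchmarks on top of it.

Method: The data were collected while a mixed-reality assistive AI agent guided participants through physical tasks, and include audio, visual, depth, and tracking streams.

Result: Although the dataset is relatively small (about 14 hours), it provides a more realistic testing ground for AI models in physically situated collaboration.

Insight: Human-AI collaboration in physical environments calls for multimodal data. SigmaCollab is a valuable resource for future research, especially in mixed-reality task assistance.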

Abstract: We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.

cs.MA [Back]

[84] Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

Beyazit Yalcinkaya,Marcell Vazquez-Chanlatte,Ameesh Shah,Hanna Krasowski,Sanjit A. Seshia

Main category: cs.MA

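TL;DR: The paper proposes ACC-MARL, an automata-based multi-agent reinforcement learning framework for learning task-conditioned, decentralized team policies, addressing complex task decomposition and low sample efficiency.

Details Motivation: Existing approaches to multi-task, multi-agent cooperative learning are sample-inefficient and limited to the single-task case. The authors represent tasks as automata so that complex tasks can be decomposed and learned efficiently.

Contribution: Proposes the ACC-MARL framework, identifies and addresses the main challenges to its practical feasibility, proves the correctness of the approach, and shows how learned value functions can be used to assign tasks optimally at test time.

Method: Tasks are represented as automata, which allows complex tasks to be decomposed into sub-tasks. Task-conditioned policies are learned under centralized training and executed in a decentralized manner. The approach covers task decomposition, policy learning, and optimized task assignment.

Result: Experiments show that ACC-MARL yields task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.

Insight: Using automata for task decomposition and multi-task learning can improve the efficiency and flexibility of multi-agent reinforcement learning and offers a workable approach to complex cooperative tasks.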

Abstract: We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL’s feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
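
The idea of using learned value functions to assign tasks at test time can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the value table is made up, team value is assumed to be the sum of per-agent values, and the search is a brute-force enumeration that only suits small teams.

```python
from itertools import permutations

def best_assignment(values):
    """Pick the one-to-one agent-to-task assignment with the highest total value.

    values[i][j] is a (hypothetical) learned estimate of the value agent i
    obtains when conditioned on task j. Returns (assignment, total), where
    assignment[i] is the index of the task given to agent i.
    """
    n = len(values)

    def total(perm):
        return sum(values[agent][task] for agent, task in enumerate(perm))

    best = max(permutations(range(n)), key=total)
    return list(best), total(best)

# Made-up value estimates for two agents and two sub-tasks,
# e.g. task 0 = "press the button", task 1 = "go through the door".
estimated_values = [
    [0.9, 0.2],  # agent 0
    [0.4, 0.8],  # agent 1
]

assignment, team_value = best_assignment(estimated_values)
```

Here agent 0 is given task 0 and agent 1 task 1, since the alternative assignment has a lower summed value. For larger teams, a polynomial-time assignment solver would be a natural replacement for the enumeration.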