Table of Contents
- cs.CL [Total: 36]
- cs.CV [Total: 79]
- cs.IR [Total: 1]
- q-fin.ST [Total: 1]
- cs.GR [Total: 3]
- cs.SD [Total: 1]
- eess.AS [Total: 1]
- q-bio.NC [Total: 1]
- eess.IV [Total: 5]
- cs.LG [Total: 13]
- q-bio.BM [Total: 1]
- cs.RO [Total: 1]
- cs.DB [Total: 1]
- cs.HC [Total: 2]
- cs.AI [Total: 4]
cs.CL
[1] Conservative Bias in Large Language Models: Measuring Relation Predictions
Toyin Aguda,Erik Wilson,Allan Anzagira,Simerjot Kaur,Charese Smiley
Main category: cs.CL
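TL;DR: This paper studies the conservative bias that large language models (LLMs) show in relation extraction: models tend to choose an uninformative label (such as No_Relation) over a potentially wrong one, which avoids incorrect assignments but loses information. The study systematically evaluates this behavior across multiple prompts and datasets and introduces the notion of “Hobson’s choice” to describe it.
Details
Motivation: LLMs show a pronounced conservative bias in relation extraction, preferring safe but uninformative labels to potentially wrong ones. This reduces errors but discards information, and the authors set out to quantify the bias and its impact systematically.
Contribution: 1) Systematic measurement and analysis of conservative bias in LLM relation extraction; 2) the “Hobson’s choice” concept for describing this behavior; 3) experimental evidence that conservative bias occurs twice as often as hallucination; 4) use of SBERT and LLM prompts to quantify the semantic similarity of conservative-bias behaviors.
Method: 1) Experiments on LLM relation extraction under constrained, semi-constrained, and open-ended prompts; 2) SBERT and LLM prompts to capture the semantic similarity of conservative-bias behaviors; 3) a comparison of how often conservative bias and hallucination occur and what effect each has.
Result: Conservative bias occurs twice as often as hallucination in relation extraction. The semantic similarity analysis finds that the conservative behavior is consistent across prompt types.
Insight: 1) Conservative bias reduces errors but also loses information; 2) future model design needs to balance conservatism against hallucination; 3) the “Hobson’s choice” concept offers a new lens on conservative model behavior.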
Abstract: Large language models (LLMs) exhibit a pronounced conservative bias in relation
extraction tasks, frequently defaulting to the No_Relation label when an
appropriate option is unavailable. While this behavior helps prevent incorrect
relation assignments, our analysis reveals that it also leads to significant
information loss when reasoning is not explicitly included in the output. We
systematically evaluate this trade-off across multiple prompts, datasets, and
relation types, introducing the concept of Hobson’s choice to capture scenarios
where models opt for safe but uninformative labels over hallucinated ones. Our
findings suggest that conservative bias occurs twice as often as hallucination.
To quantify this effect, we use SBERT and LLM prompts to capture the semantic
similarity between conservative bias behaviors in constrained prompts and
labels generated from semi-constrained and open-ended prompts.
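As a rough illustration of the similarity measurement mentioned at the end of this abstract, the sketch below computes cosine similarity between embedding vectors in plain Python. The three vectors are made-up placeholders rather than real SBERT outputs, and the variable names are hypothetical; in practice the vectors would come from a sentence-embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only
open_ended_label = [0.9, 0.1, 0.3]    # label produced by an open-ended prompt
gold_label = [0.8, 0.2, 0.4]          # reference relation label
unrelated_label = [-0.1, 0.9, -0.4]   # a label with a different meaning

close = cosine_similarity(open_ended_label, gold_label)
far = cosine_similarity(open_ended_label, unrelated_label)
```

A higher score for `close` than for `far` would suggest the open-ended label is semantically nearer to the reference label.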
[2] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu,Yinzhu Quan
Main category: cs.CL
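TL;DR: EconWebArena is a benchmark for evaluating autonomous agents on complex economic tasks in realistic web environments. It contains 360 tasks spanning multiple economic domains and emphasizes authoritative data sources and web-grounded reasoning.
Details
Motivation: Existing benchmarks do not adequately evaluate economic tasks in real web environments. EconWebArena fills this gap with multimodal tasks that test agents’ navigation, reasoning, and interaction abilities.
Contribution: Introduces the EconWebArena benchmark, which emphasizes task diversity and realism, and evaluates multimodal LLMs, exposing performance bottlenecks for agents in real web environments.
Method: Candidate tasks are generated by LLMs and then curated by humans to ensure they are clear and feasible. Multiple agents are evaluated, and the effects of visual grounding, plan-based reasoning, and interaction design are analyzed.
Result: Existing agents show substantial performance gaps in grounding and navigation.
Insight: Task realism is essential for economic reasoning, and future work should focus on improving multimodal understanding and interaction design.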
Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on
complex, multimodal economic tasks in realistic web environments. The benchmark
comprises 360 curated tasks from 82 authoritative websites spanning domains
such as macroeconomics, labor, finance, trade, and public policy. Each task
challenges agents to navigate live websites, interpret structured and visual
content, interact with real interfaces, and extract precise, time-sensitive
data through multi-step workflows. We construct the benchmark by prompting
multiple large language models (LLMs) to generate candidate tasks, followed by
rigorous human curation to ensure clarity, feasibility, and source reliability.
Unlike prior work, EconWebArena emphasizes fidelity to authoritative data
sources and the need for grounded web-based economic reasoning. We evaluate a
diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure
cases, and conduct ablation studies to assess the impact of visual grounding,
plan-based reasoning, and interaction design. Our results reveal substantial
performance gaps and highlight persistent challenges in grounding, navigation,
and multimodal understanding, positioning EconWebArena as a rigorous testbed
for economic web intelligence.
[3] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
Yu-Ang Lee,Guan-Ting Yi,Mei-Yi Liu,Jui-Chao Lu,Guan-Bo Yang,Yun-Nung Chen
Main category: cs.CL
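TL;DR: This paper systematically surveys optimization methods, challenges, and future directions for compound AI systems. It focuses on the new challenges of integrating multiple components and optimizing their interactions, and covers emerging approaches such as natural language feedback.
Details
Motivation: As large language models (LLMs) and AI systems advance, compound AI systems are growing more complex. Traditional optimization methods struggle with multi-component interactions, so new approaches are needed.
Contribution: Formalizes the notion of compound AI system optimization, classifies existing methods, and summarizes the potential of emerging techniques such as language feedback.
Method: Reviews and classifies optimization methods for compound AI systems, covering both traditional methods (such as supervised fine-tuning and reinforcement learning) and newer techniques such as language feedback.
Result: Systematically maps research progress in the field and identifies open problems such as the optimization of non-differentiable systems.
Insight: Language feedback offers a new angle on optimizing compound AI systems, particularly for non-differentiable components that traditional methods handle poorly.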
Abstract: Recent advancements in large language models (LLMs) and AI systems have led
to a paradigm shift in the design and optimization of complex AI workflows. By
integrating multiple components, compound AI systems have become increasingly
adept at performing sophisticated tasks. However, as these systems grow in
complexity, new challenges arise in optimizing not only individual components
but also their interactions. While traditional optimization methods such as
supervised fine-tuning (SFT) and reinforcement learning (RL) remain
foundational, the rise of natural language feedback introduces promising new
approaches, especially for optimizing non-differentiable systems. This paper
provides a systematic review of recent progress in optimizing compound AI
systems, encompassing both numerical and language-based techniques. We
formalize the notion of compound AI system optimization, classify existing
methods along several key dimensions, and highlight open research challenges
and future directions in this rapidly evolving field. A list of surveyed papers
is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
[4] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
Shashidhar Reddy Javaji,Yupeng Cao,Haohang Li,Yangyang Yu,Nikhil Muralidhar,Zining Zhu
Main category: cs.CL
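TL;DR: The paper introduces CLAIM-BENCH, a benchmark for evaluating large language models (LLMs) on extracting and validating scientific claims and their evidence. It reveals LLMs’ limitations on complex scientific content and shows the advantages of closed-source models (such as GPT-4) and of specific prompting approaches.
Details
Motivation: LLMs are widely used for research tasks, but their grasp of the complex relationships between claims and evidence in scientific papers has not been studied in depth, so their scientific reasoning needs systematic evaluation.
Contribution: 1. Introduces the CLAIM-BENCH benchmark for evaluating LLMs’ claim-evidence reasoning. 2. Compares six LLMs under three approaches, showing the advantage of closed-source models. 3. Shows that three-pass and one-by-one prompting significantly improve accuracy, at the cost of extra computation.
Method: 1. Designs three evaluation approaches inspired by divide and conquer. 2. Evaluates on more than 300 claim-evidence pairs from multiple research domains. 3. Compares six LLMs, both open- and closed-source. 4. Uses three-pass and one-by-one prompting to improve performance.
Result: 1. Closed-source models (such as GPT-4 and Claude) outperform open-source models on claim-evidence identification. 2. Specific prompting approaches improve precision but increase computational cost. 3. LLMs still show significant limitations on complex scientific content.
Insight: 1. Scientific reasoning demands deeper comprehension than current LLMs provide. 2. Prompt design is critical to task performance. 3. CLAIM-BENCH provides a benchmark for building more reliable scientific reasoning systems.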
Abstract: Large language models (LLMs) are increasingly being used for complex research
tasks such as literature review, idea generation, and scientific paper
analysis, yet their ability to truly understand and process the intricate
relationships within complex research papers, such as the logical links between
claims and supporting evidence, remains largely unexplored. In this study, we
present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’
capabilities in scientific claim-evidence extraction and validation, a task
that reflects deeper comprehension of scientific argumentation. We
systematically compare three approaches inspired by divide-and-conquer
strategies across six diverse LLMs, highlighting model-specific
strengths and weaknesses in scientific comprehension. Through evaluation
involving over 300 claim-evidence pairs across multiple research domains, we
reveal significant limitations in LLMs’ ability to process complex scientific
content. Our results demonstrate that closed-source models like GPT-4 and
Claude consistently outperform open-source counterparts in precision and recall
across claim-evidence identification tasks. Furthermore, strategically designed
three-pass and one-by-one prompting approaches significantly improve LLMs’
abilities to accurately link dispersed evidence with claims, although this
comes at increased computational cost. CLAIM-BENCH sets a new standard for
evaluating scientific comprehension in LLMs, offering both a diagnostic tool
and a path forward for building systems capable of deeper, more reliable
reasoning across full-length papers.
[5] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments
Wanjing Anya Ma,Michael Flor,Zuowei Wang
Main category: cs.CL
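TL;DR: The paper presents a method for automatically generating inference-making questions for reading comprehension. It uses GPT-4o with few-shot prompting to produce high-quality diagnostic items and combines this with human judgment for scalable, high-quality assessment.
Details
Motivation: Inference making is a complex but essential reading comprehension skill that requires resolving references across sentences and drawing on prior knowledge. Diagnostic questions can help educators provide more targeted instruction.
Contribution: 1. A taxonomy of inference types for reading comprehension; 2. generation of high-quality inference questions with GPT-4o; 3. scalable, high-quality item generation by combining automatic generation with human judgment.
Method: 1. Prompts GPT-4o with few-shot examples to generate inference questions; 2. compares conditions with and without chain-of-thought prompts; 3. evaluates the output on item quality, inference-type accuracy, and LLM reasoning.
Result: 93.8% of the questions generated by GPT-4o were of good quality and suitable for grades 3-12, but only 42.6% accurately matched the targeted inference type.
Insight: Combining automatic generation with human judgment is an effective route to scalable, high-quality diagnostic items, but inference-type accuracy still needs improvement.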
Abstract: Inference making is an essential but complex skill in reading comprehension
(RC). Some inferences require resolving references across sentences, and some
rely on using prior knowledge to fill in the detail that is not explicitly
written in the text. Diagnostic RC questions can help educators provide more
effective and targeted reading instruction and interventions for school-age
students. We introduce a taxonomy of inference types for RC and use it to
analyze the distribution of items within a diagnostic RC item bank. Next, we
present experiments using GPT-4o to generate bridging-inference RC items for
given reading passages via few-shot prompting, comparing conditions with and
without chain-of-thought prompts. Generated items were evaluated on three
aspects: overall item quality, appropriate inference type, and LLM reasoning,
achieving high inter-rater agreements above 0.90. Our results show that GPT-4o
produced 93.8% good-quality questions suitable for operational use in grade
3-12 contexts; however, only 42.6% of the generated questions accurately
matched the targeted inference type. We conclude that combining automatic item
generation with human judgment offers a promising path toward scalable,
high-quality diagnostic RC assessments.
[6] Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
Chenlong Wang,Yuanning Feng,Dongping Chen,Zhaoyang Chu,Ranjay Krishna,Tianyi Zhou
Main category: cs.CL
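TL;DR: The paper examines whether explicit self-reflection tokens such as “Wait” are necessary in large reasoning models. It proposes NoWait, which suppresses these tokens, and shows that it substantially shortens reasoning trajectories while preserving model performance.
Details
Motivation: Overthinking in large reasoning models yields verbose, redundant output. The paper asks whether explicit self-reflection, signaled by tokens such as “Wait”, is actually required for effective reasoning.
Contribution: Proposes NoWait, which suppresses explicit self-reflection tokens, shortens reasoning trajectories by 27%-51% while preserving performance, and offers a plug-and-play solution for multimodal reasoning.
Method: NoWait suppresses explicit self-reflection tokens such as “Wait” during inference, which trims redundant output and improves efficiency.
Result: Across ten benchmarks, NoWait shortens chain-of-thought trajectories by 27%-51% without compromising model performance.
Insight: Explicit self-reflection tokens may not be necessary for reasoning. Removing them can substantially improve efficiency, which suggests a new direction for designing efficient reasoning models.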
Abstract: Recent advances in large reasoning models have enabled complex, step-by-step
reasoning but often introduce significant overthinking, resulting in verbose
and redundant outputs that hinder efficiency. In this study, we examine whether
explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is
necessary for advanced reasoning. We propose NoWait, a simple yet effective
approach that disables explicit self-reflection by suppressing these tokens
during inference. Extensive experiments on ten benchmarks across textual,
visual, and video reasoning tasks show that NoWait reduces chain-of-thought
trajectory length by up to 27%-51% in five R1-style model series, without
compromising model utility. NoWait thus offers a plug-and-play solution for
efficient and utility-preserving multimodal reasoning.
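One common way to suppress specific tokens at inference time is to set their logits to negative infinity before the next token is chosen. The sketch below shows that idea on a toy vocabulary; whether NoWait is implemented exactly this way is an assumption, and the vocabulary, logits, and function names are hypothetical.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of logits with the banned token ids set to -inf."""
    banned = set(banned_ids)
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest logit (one step of greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical toy vocabulary and next-token logits
vocab = ["Wait", "Hmm", "So", "the", "answer"]
logits = [3.2, 2.9, 2.5, 1.0, 0.4]
banned_ids = [vocab.index(t) for t in ("Wait", "Hmm")]

masked = suppress_tokens(logits, banned_ids)
next_token = vocab[greedy_pick(masked)]  # "So" instead of "Wait"
```

Because the banned logits become negative infinity, the same masking also gives those tokens zero probability under softmax sampling.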
[7] Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning
Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu
Main category: cs.CL
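TL;DR: This position paper argues that text embedding research should move beyond surface meaning and treat implicit semantics as a central modeling goal, and it proposes improvements to data, evaluation, and objectives.
Details
Motivation: Current text embedding models focus on surface-level semantics and overlook deeper meaning shaped by pragmatics, speaker intent, and sociocultural context, so they perform poorly on tasks that require interpretive reasoning or social meaning.
Contribution: Calls on the research community to shift toward modeling implicit semantics through more diverse training data, benchmarks that evaluate deeper semantic understanding, and explicit framing of implicit meaning as a modeling objective.
Method: A pilot study demonstrates the limitations of current state-of-the-art models on implicit semantics tasks, and the paper proposes a framework for improvement.
Result: Even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks.
Insight: Text embedding research needs to align more closely with the complexity of real language, supporting implicit semantics throughout, from data to evaluation.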
Abstract: This position paper argues that the text embedding research community should
move beyond surface meaning and embrace implicit semantics as a central
modeling goal. Text embedding models have become foundational in modern NLP,
powering a wide range of applications and drawing increasing research
attention. Yet, much of this progress remains narrowly focused on surface-level
semantics. In contrast, linguistic theory emphasizes that meaning is often
implicit, shaped by pragmatics, speaker intent, and sociocultural context.
Current embedding models are typically trained on data that lacks such depth
and evaluated on benchmarks that reward the capture of surface meaning. As a
result, they struggle with tasks requiring interpretive reasoning, speaker
stance, or social meaning. Our pilot study highlights this gap, showing that
even state-of-the-art models perform only marginally better than simplistic
baselines on implicit semantics tasks. To address this, we call for a paradigm
shift: embedding research should prioritize more diverse and linguistically
grounded training data, design benchmarks that evaluate deeper semantic
understanding, and explicitly frame implicit meaning as a core modeling
objective, better aligning embeddings with real-world language complexity.
[8] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs
Jash Rajesh Parekh,Pengcheng Jiang,Jiawei Han
Main category: cs.CL
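TL;DR: CC-RAG is a new retrieval-augmented generation approach that builds causal graphs to support multi-hop reasoning, improving performance on specialized-domain tasks.
Details
Motivation: Large language models (LLMs) have limitations in causal reasoning and specialized-domain tasks, and standard RAG lacks structured reasoning.
Contribution: Proposes CC-RAG, which combines zero-shot triple extraction with theme-aware graph chaining to build a DAG for multi-hop reasoning, improving answer accuracy and interpretability.
Method: Uses zero-shot triple extraction to build a causal graph (DAG) and forward/backward chaining to guide structured answer generation.
Result: In experiments on two specialized domains, Bitcoin price fluctuations and Gaucher disease, CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity.
Insight: Explicitly modeling causal structure can substantially improve LLM performance in specialized domains, especially on tasks that require complex reasoning.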
Abstract: Understanding cause and effect relationships remains a formidable challenge
for Large Language Models (LLMs), particularly in specialized domains where
reasoning requires more than surface-level correlations. Retrieval-Augmented
Generation (RAG) improves factual accuracy, but standard RAG pipelines treat
evidence as flat context, lacking the structure required to model true causal
dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that
integrates zero-shot triple extraction and theme-aware graph chaining into the
RAG pipeline, enabling structured multi-hop inference. Given a domain-specific
corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of <cause, relation,
effect> triples and uses forward/backward chaining to guide structured answer
generation. Experiments on two real-world domains, Bitcoin price fluctuations
and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot
LLMs in chain similarity, information density, and lexical diversity. Both
LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results
demonstrate that explicitly modeling causal structure enables LLMs to generate
more accurate and interpretable responses, especially in specialized domains
where flat retrieval fails.
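To make the graph-chaining idea in this abstract concrete, here is a minimal sketch that indexes <cause, relation, effect> triples and walks them forward (cause to effects) or backward (effect to causes). It is not the authors' implementation: the triples are invented for illustration, the function names are hypothetical, and the triple-extraction and theme-aware steps are omitted.

```python
from collections import defaultdict, deque

def build_graph(triples):
    """Index <cause, relation, effect> triples by cause and by effect."""
    forward = defaultdict(list)   # cause  -> [(relation, effect)]
    backward = defaultdict(list)  # effect -> [(relation, cause)]
    for cause, relation, effect in triples:
        forward[cause].append((relation, effect))
        backward[effect].append((relation, cause))
    return forward, backward

def chain(index, start):
    """Breadth-first walk; returns nodes reachable from start, nearest first."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for _relation, nxt in index.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

# Invented triples, for illustration only
triples = [
    ("interest rate hike", "reduces", "risk appetite"),
    ("risk appetite", "drives", "crypto demand"),
    ("crypto demand", "moves", "bitcoin price"),
]
forward, backward = build_graph(triples)
effects = chain(forward, "interest rate hike")  # forward chaining
causes = chain(backward, "bitcoin price")       # backward chaining
```

In a full pipeline, a chain like `causes` could be serialized into the prompt so the generator receives an ordered causal path rather than flat context.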
[9] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
Luel Hagos Beyene,Vivek Verma,Min Ma,Jesujoba O. Alabi,Fabian David Schmidt,Joyce Nakatumba-Nabende,David Ifeoluwa Adelani
Main category: cs.CL
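TL;DR: This paper introduces mSTEB, a new benchmark for evaluating large language models (LLMs) across many languages on both speech and text tasks, with a focus on standardized evaluation for low-resource languages.
Details
Motivation: LLM performance on English and a few high-resource languages has been studied extensively, but low-resource languages lack a standardized evaluation benchmark. mSTEB fills this gap.
Contribution: Introduces mSTEB, a diverse benchmark covering language identification, text classification, question answering, and translation in both speech and text modalities, providing a broad evaluation of LLMs on low-resource languages.
Method: Builds a dataset covering many languages (notably those of Africa and the Americas/Oceania) and evaluates leading models, including Gemini 2.0 Flash and GPT-4o (Audio), on mSTEB.
Result: There is a wide performance gap between high-resource and low-resource languages, especially languages of Africa and the Americas/Oceania, indicating that these languages are under-represented in LLMs.
Insight: LLMs perform poorly on low-resource languages, and more investment is needed to improve coverage.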
Abstract: Large language models (LLMs) have demonstrated impressive performance on a
wide range of tasks, including in multimodal settings such as speech. However,
their evaluation is often limited to English and a few high-resource languages.
For low-resource languages, there is no standardized evaluation benchmark. In
this paper, we address this gap by introducing mSTEB, a new benchmark to
evaluate the performance of LLMs on a wide range of tasks covering language
identification, text classification, question answering, and translation tasks
on both speech and text modalities. We evaluated the performance of leading
LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open
models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in
performance between high-resource and low-resource languages, especially for
languages spoken in Africa and the Americas/Oceania. Our findings show that more
investment is needed to address their under-representation in LLM coverage.
[10] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration
Weiya Li,Junjie Chen,Bei Li,Boyang Liu,Zichen Wen,Nuanqiao Shan,Xiaoqian Liu,Anping Liu,Huajie Liu,Youyan Wang,Wujiuge Yin,Hu Song,Bing Huang,Zhiyuan Xia,Jialiang Chen,Linfeng Zhang
Main category: cs.CL
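TL;DR: TACTIC is a multi-agent translation framework grounded in cognitive theory that improves machine translation quality by simulating the cognitive processes of human translators.
Details
Motivation: Existing LLM-based multi-agent translation frameworks neglect key insights from cognitive translation studies, which are essential for understanding human translation strategies such as balancing literal and free translation and refining output based on context.
Contribution: Proposes TACTIC, a cognitively informed multi-agent framework with six functionally distinct agents that mirror key cognitive processes in human translation, yielding substantial gains in translation quality.
Method: Designs six agents with distinct functions (such as drafting, refinement, and evaluation) that simulate the cognitive workflow of human translation and collaborate interactively to make full use of the LLM.
Result: In experiments on multiple language pairs from the FLORES-200 and WMT24 benchmarks, TACTIC outperforms GPT-4.1 and DeepSeek-R1, with clear gains in XCOMET and COMETKIWI-23 scores.
Insight: Applying cognitive theory can markedly strengthen multi-agent translation systems, and simulating the interactive workflow of human translation is an effective way to realize the translation potential of LLMs.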
Abstract: Machine translation has long been a central task in natural language
processing. With the rapid advancement of large language models (LLMs), there
has been remarkable progress in translation quality. However, fully realizing
the translation potential of LLMs remains an open challenge. Recent studies
have explored multi-agent systems to decompose complex translation tasks into
collaborative subtasks, showing initial promise in enhancing translation
quality through agent cooperation and specialization. Nevertheless, existing
multi-agent translation frameworks largely neglect foundational insights from
cognitive translation studies. These insights emphasize how human translators
employ different cognitive strategies, such as balancing literal and free
translation, refining expressions based on context, and iteratively evaluating
outputs. To address this limitation, we propose a cognitively informed
multi-agent framework called TACTIC, which stands for Translation Agents with
Cognitive-Theoretic Interactive Collaboration. The framework comprises six
functionally distinct agents that mirror key cognitive processes observed in
human translation behavior. These include agents for drafting, refinement,
evaluation, scoring, context reasoning, and external knowledge gathering. By
simulating an interactive and theory-grounded translation workflow, TACTIC
effectively leverages the full capacity of LLMs for high-quality translation.
Experimental results on diverse language pairs from the FLORES-200 and WMT24
benchmarks show that our method consistently achieves state-of-the-art
performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by
an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it
further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at
https://github.com/weiyali126/TACTIC.
[11] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens
Ziyang Ma,Qingyue Yuan,Zhenglin Wang,Deyu Zhou
Main category: cs.CL
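TL;DR: This paper studies the meta-cognitive abilities of large language models (LLMs). It proposes AutoMeco, an automated evaluation framework, and MIRA, an improvement strategy, and experiments show that they evaluate LLM meta-cognition more soundly.
Details
Motivation: Prior work has focused on LLMs’ ability to detect cognitive errors, while their meta-cognitive abilities (such as self-awareness of step errors) have received little attention even though they are crucial for reliability.
Contribution: Proposes the AutoMeco framework for evaluating LLM meta-cognition and designs MIRA, a training-free strategy for improving existing meta-cognition “lenses”.
Method: AutoMeco benchmarks existing meta-cognition lenses. MIRA is a Markovian Intrinsic Reward Adjustment strategy that improves the quality of these lenses.
Result: Experiments on three mathematical reasoning datasets and three LLMs show that AutoMeco is reasonable when compared with Best-of-N verification, and that MIRA evaluates LLM meta-cognition more effectively.
Insight: LLMs have intrinsic meta-cognition, but better evaluation tools (such as AutoMeco and MIRA) are needed to reveal it.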
Abstract: Previous research has primarily focused on the cognitive error detection
capabilities of Large Language Models (LLMs), often prompting them to analyze
mistakes in reasoning chains. However, few studies have examined the
meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors),
which are crucial for their reliability. While studies on LLM self-evaluation
present some measures, such as perplexity, which can reflect the answer
correctness and be viewed as the lens of meta-cognition, they lack step-level
analysis and adaptation. This paper studies the evaluation of LLM
meta-cognition using the current lenses and how to improve these lenses.
Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation
framework for benchmarking the existing lenses. Furthermore, a training-free
Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost
current meta-cognition lenses. Experimental results on three mathematical
reasoning datasets and three LLMs show the reasonableness of AutoMeco by
comparing it with Best-of-N verification. Moreover, the meta-cognition ability
of LLMs can be better evaluated using MIRA.
[12] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models
Jiaxiang Liu,Boxuan Xing,Chenhao Yuan,Chenxiang Zhang,Di Wu,Xiusheng Huang,Haida Yu,Chuhan Lang,Pengfei Cao,Jun Zhao,Kang Liu
Main category: cs.CL
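TL;DR: Know-MRI is an open-source tool for systematically analyzing the internal knowledge mechanisms of large language models (LLMs). Its extensible core module automatically matches different input data with interpretation methods and consolidates the results.
Details
Motivation: Current interpretation methods differ in input data formats and output results, which limits the tools built on them. A unified and flexible interpretation framework is needed.
Contribution: Presents Know-MRI, a framework that matches diverse input data with interpretation methods and makes analysis of LLM knowledge mechanisms more flexible.
Method: Designs an extensible core module that automatically matches input data with interpretation methods and consolidates the outputs, letting users diagnose a model’s knowledge mechanisms from multiple perspectives.
Result: Open-source code and a demonstration video are provided. The tool supports more flexible analysis and more comprehensive model diagnosis.
Insight: Integrating diverse interpretation methods in a unified framework gives LLM interpretability research more practical tool support.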
Abstract: As large language models (LLMs) continue to advance, there is a growing
urgency to enhance the interpretability of their internal knowledge mechanisms.
Consequently, many interpretation methods have emerged, aiming to unravel the
knowledge mechanisms of LLMs from various perspectives. However, current
interpretation methods differ in input data formats and interpreting outputs.
The tools integrating these methods are only capable of supporting tasks with
specific inputs, significantly constraining their practical applications. To
address these challenges, we present an open-source Knowledge Mechanisms
Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms
within LLMs systematically. Specifically, we have developed an extensible core
module that can automatically match different input data with interpretation
methods and consolidate the interpreting outputs. It enables users to freely
choose appropriate interpretation methods based on the inputs, making it easier
to comprehensively diagnose the model’s internal knowledge mechanisms from
multiple perspectives. Our code is available at
https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on
https://youtu.be/NVWZABJ43Bs.
[13] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models
Ziqi. Liu,Ziyang. Zhou,Mingxuan. Hu
Main category: cs.CL
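TL;DR: This paper proposes CAF-I, a multi-agent framework for improving large language models on irony detection. Through multidimensional analysis and collaborative optimization, CAF-I achieves substantial gains in the zero-shot setting.
Details
Motivation: Existing LLM methods for irony detection suffer from single-perspective limitations, insufficient understanding, and a lack of interpretability. The authors propose a collaborative multi-agent framework to address these issues.
Contribution: Proposes the CAF-I framework, which uses specialized agents (for context, semantics, and rhetoric) for multidimensional analysis and improves performance through a decision agent and a feedback-based refinement mechanism.
Method: CAF-I takes a collaborative multi-agent approach: multidimensional analysis by the context, semantics, and rhetoric agents, consolidation by a decision agent, and feedback-based refinement. Experiments are run on benchmark datasets.
Result: CAF-I achieves state-of-the-art zero-shot performance, with an average Macro-F1 of 76.31, a 4.98-point absolute improvement over the strongest prior baseline.
Insight: By simulating human-like multi-perspective analysis, CAF-I improves both the accuracy and the interpretability of irony detection, showing the potential of multi-agent frameworks for language tasks.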
Abstract: Large language models (LLMs) have become mainstream methods in the field of
sarcasm detection. However, existing LLM methods face challenges in irony
detection, including: 1. single-perspective limitations, 2. insufficient
comprehensive understanding, and 3. lack of interpretability. This paper
introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven
multi-agent system designed to overcome these issues. CAF-I employs specialized
agents for Context, Semantics, and Rhetoric, which perform multidimensional
analysis and engage in interactive collaborative optimization. A Decision Agent
then consolidates these perspectives, with a Refinement Evaluator Agent
providing conditional feedback for optimization. Experiments on benchmark
datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving
SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of
76.31, a 4.98 absolute improvement over the strongest prior baseline. This
success is attained by its effective simulation of human-like multi-perspective
analysis, enhancing detection accuracy and interpretability.
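For readers unfamiliar with the Macro-F1 metric reported above, the sketch below computes it as the unweighted mean of per-class F1 scores. The labels and predictions are invented for illustration and have no connection to the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Invented example: two classes, five items
y_true = ["ironic", "ironic", "literal", "literal", "literal"]
y_pred = ["ironic", "literal", "literal", "literal", "ironic"]
score = macro_f1(y_true, y_pred)  # (0.5 + 2/3) / 2
```

Because every class counts equally, Macro-F1 penalizes a model that does well only on the majority class, which matters when ironic examples are rare.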
[14] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning
Fengjun Pan,Anh Tuan Luu,Xiaobao Wu
Main category: cs.CL
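TL;DR: The paper proposes U-CoT+, a new framework for efficient, flexible, and explainable harmful meme detection. It converts memes into textual descriptions and applies zero-shot CoT prompting guided by human-crafted guidelines, achieving low-resource and highly adaptable classification.
Details
Motivation: Current harmful meme detection methods fall short on resource efficiency, flexibility, or explainability, which limits their practical deployment in content moderation systems.
Contribution: Proposes U-CoT+, a framework that decouples meme understanding from classification and combines it with human-guided zero-shot CoT prompting for efficient, flexible, and explainable harmful meme detection.
Method: 1. Develops a high-fidelity meme-to-text conversion pipeline; 2. uses human-crafted guidelines with zero-shot CoT prompting to guide the model’s reasoning.
Result: Experiments on seven benchmark datasets validate the framework’s effectiveness and show its potential for low-resource, explainable detection with small-scale LLMs.
Insight: Converting visual content into textual descriptions avoids reasoning directly over complex raw visual data, which enables resource-efficient and flexible detection.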
Abstract: Detecting harmful memes is essential for maintaining the integrity of online
environments. However, current approaches often struggle with resource
efficiency, flexibility, or explainability, limiting their practical deployment
in content moderation systems. To address these challenges, we introduce
U-CoT+, a novel framework for harmful meme detection. Instead of relying solely
on prompting or fine-tuning multimodal models, we first develop a high-fidelity
meme-to-text pipeline that converts visual memes into detail-preserving textual
descriptions. This design decouples meme interpretation from meme
classification, thus avoiding immediate reasoning over complex raw visual
content and enabling resource-efficient harmful meme detection with general
large language models (LLMs). Building on these textual descriptions, we
further incorporate targeted, interpretable human-crafted guidelines to guide
models’ reasoning under zero-shot CoT prompting. As such, this framework allows
for easy adaptation to different harmfulness detection criteria across
platforms, regions, and over time, offering high flexibility and
explainability. Extensive experiments on seven benchmark datasets validate the
effectiveness of our framework, highlighting its potential for explainable and
low-resource harmful meme detection using small-scale LLMs. Codes and data are
available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.
[15] CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations
Divyaksh Shukla,Ritesh Baviskar,Dwijesh Gohil,Aniket Tiwari,Atul Shree,Ashutosh Modi
Main category: cs.CL
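TL;DR: This paper introduces CoMuMDR, a multi-modal, multi-domain corpus of Hindi-English code-mixed conversations for discourse parsing. Existing datasets are limited to written English dialogues from a single domain, and CoMuMDR fills this gap.
Details
Motivation: Existing conversational discourse parsing datasets are restricted to a single domain and to English only, so they do not cover realistic multi-modal, multi-domain, code-mixed settings. A more comprehensive dataset is needed.
Contribution: Introduces the CoMuMDR corpus, which contains multi-modal (audio and text), multi-domain, Hindi-English code-mixed conversations annotated with nine discourse relations, providing a new resource for research.
Method: Collects and annotates multi-domain code-mixed conversations and runs baseline experiments with several SoTA models to show how challenging the dataset is.
Result: SoTA models perform poorly on CoMuMDR, which highlights the complexity of multi-domain code-mixed data and the need for better models.
Insight: Multi-modal, multi-domain, code-mixed data poses new challenges for discourse parsing, and future work needs to develop more robust models for these settings.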
Abstract: Discourse parsing is an important task useful for NLU applications such as
summarization, machine comprehension, and emotion recognition. The current
discourse parsing datasets based on conversations consist of written English
dialogues restricted to a single domain. In this resource paper, we introduce
CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in
conversations. The corpus (code-mixed in Hindi and English) has both audio and
transcribed text and is annotated with nine discourse relations. We experiment
with various SoTA baseline models; the poor performance of SoTA models
highlights the challenges of a multi-domain code-mixed corpus, pointing towards
the need for developing better models for such realistic settings.
[16] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models
Xinyuan Wang,Dongjie Wang,Wangyang Ying,Haoyue Bai,Nanxu Gong,Sixun Dong,Kunpeng Liu,Yanjie Fu
Main category: cs.CL
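TL;DR: Proposes a lightweight post-training framework that refines the latent reasoning trajectories of large language models through contrastive reasoning feedback and residual embedding refinement, improving performance on reasoning tasks.
Details
Motivation: Chain-of-Thought prompting improves reasoning through intermediate steps, but it incurs high token overhead and follows a fixed reasoning trajectory that cannot be refined step by step. Latent reasoning addresses these problems, but how to update reasoning embeddings effectively during post-training remains a challenge.
Contribution: Proposes two new strategies: 1) contrastive reasoning feedback, which compares embeddings against strong and weak baselines to find effective update directions; 2) residual embedding refinement, which progressively integrates gradients for stable convergence.
Method: Combines contrastive reasoning feedback and residual embedding refinement to refine reasoning trajectories directly in latent space, without producing explicit outputs.
Result: Performs well on five reasoning benchmarks, including a 5% accuracy gain on MathQA.
Insight: Post-training refinement of latent reasoning can improve model performance without additional training.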
Abstract: Reasoning is a key component of language understanding in Large Language
Models. While Chain-of-Thought prompting enhances performance via explicit
intermediate steps, it suffers from significant token overhead and a fixed
reasoning trajectory, preventing step-wise refinement. Recent advances in
latent reasoning address these limitations by refining internal reasoning
processes directly in the model’s latent space, without producing explicit
outputs. However, a key challenge remains: how to effectively update reasoning
embeddings during post-training to guide the model toward more accurate
solutions. To overcome this challenge, we propose a lightweight post-training
framework that refines latent reasoning trajectories using two novel
strategies: 1) Contrastive reasoning feedback, which compares reasoning
embeddings against strong and weak baselines to infer effective update
directions via embedding enhancement; 2) Residual embedding refinement, which
stabilizes updates by progressively integrating current and historical
gradients, enabling fast yet controlled convergence. Extensive experiments and
case studies are conducted on five reasoning benchmarks to demonstrate the
effectiveness of the proposed framework. Notably, the framework achieves a 5%
accuracy gain on MathQA without additional training.
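The abstract describes residual embedding refinement only as progressively integrating current and historical gradients. One generic way to do that is an exponential moving average of update directions added back onto the embedding, sketched below. This is an assumption for illustration, not the paper's update rule; the function name and the `lr` and `beta` parameters are hypothetical.

```python
def refine(embedding, gradients, lr=0.1, beta=0.9):
    """Apply update directions one by one, smoothing them with a running average.

    embedding: list of floats (the latent reasoning vector)
    gradients: list of update directions, each the same length as embedding
    """
    history = [0.0] * len(embedding)
    current = list(embedding)
    for grad in gradients:
        # Blend the historical direction with the current one
        history = [beta * h + (1 - beta) * g for h, g in zip(history, grad)]
        # Residual update: add a small step to the existing embedding
        current = [e + lr * h for e, h in zip(current, history)]
    return current

# Toy example with two identical update directions
refined = refine([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], lr=1.0, beta=0.5)
```

A larger `beta` gives more weight to past directions, which smooths noisy updates at the cost of slower reaction to new feedback.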
[17] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
Minhae Oh,Jeonghye Kim,Nakyung Lee,Donggeon Seo,Taeuk Kim,Jungwoo Lee
Main category: cs.CL
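TL;DR: RAISE is a framework that strengthens scientific reasoning through step-by-step retrieval. Its three steps (problem decomposition, logical query generation, and logical retrieval) let it outperform other baselines.
Details
Motivation: Scientific reasoning requires long reasoning chains, knowledge of domain-specific terminology, and adaptation to updated findings. Existing methods are limited in logical relevance and in adapting to domain knowledge.
Contribution: Proposes RAISE, a step-by-step retrieval-augmented framework that retrieves logically relevant documents and thereby improves model performance on scientific reasoning tasks.
Method: RAISE has three steps: 1) problem decomposition; 2) logical query generation; 3) logical retrieval. The emphasis is on retrieving documents that are logically relevant rather than merely similar in domain.
Result: RAISE outperforms other baselines on scientific reasoning benchmarks, and the documents it retrieves are more logically relevant.
Insight: Step-by-step retrieval that targets logical relevance, not just domain similarity, can substantially improve scientific reasoning.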
Abstract: Scientific reasoning requires not only long-chain reasoning processes, but
also knowledge of domain-specific terminologies and adaptation to updated
findings. To deal with these challenges for scientific reasoning, we introduce
RAISE, a step-by-step retrieval-augmented framework which retrieves logically
relevant documents from in-the-wild corpus. RAISE is divided into three steps:
problem decomposition, logical query generation, and logical retrieval. We
observe that RAISE consistently outperforms other baselines on scientific
reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE
retrieves documents that are not only similar in domain knowledge but also
more logically relevant.
[18] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models
Son The Nguyen,Theja Tulabandhula
Main category: cs.CL
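TL;DR: MEMETRON is a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem and uses hybrid metaheuristic algorithms to optimize responses, improving task performance without model retraining or gradient access.
Details
Motivation: Current LLM decoding strategies (such as greedy search or sampling) do not explicitly optimize for task-specific objectives, which limits control over the model. MEMETRON uses metaheuristic algorithms to optimize model output at test time.
Contribution: 1. Proposes the MEMETRON framework, which treats LLM decoding as a black-box optimization problem; 2. designs two hybrid metaheuristic algorithms (GENETRON and ANNETRON) for efficient exploration; 3. demonstrates strong performance, particularly on human preference alignment.
Method: MEMETRON combines reward models with contextual operations performed by the LLM, using GENETRON (a genetic algorithm) and ANNETRON (simulated annealing) to search for high-reward responses without retraining the model or computing gradients.
Result: On human preference alignment, MEMETRON significantly outperforms standard decoding and reranking methods, showing its potential to improve alignment without model retraining.
Insight: Optimizing the decoding process with metaheuristics can improve task performance without changing model parameters, which offers a new direction for inference-time optimization of LLMs.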
Abstract: Large language models (LLMs) are increasingly used for both open-ended and
structured tasks, yet their inference-time behavior is still largely dictated
by heuristic decoding strategies such as greedy search, sampling, or reranking.
These methods provide limited control and do not explicitly optimize for
task-specific objectives. We introduce MEMETRON, a task-agnostic framework that
formulates LLM decoding as a discrete black-box optimization problem. MEMETRON
leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the
response space, guided by reward models and contextual operations performed by
the LLM itself. This approach enables efficient discovery of high-reward
responses without requiring model retraining or gradient access. The framework
is modular and generalizes across diverse tasks, requiring only a reward
function and lightweight prompt templates. We evaluate our framework on the
critical human preference alignment task and demonstrate that it significantly
outperforms standard decoding and reranking methods, highlighting its potential
to improve alignment without model retraining.
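The idea of treating decoding as black-box search can be sketched with plain simulated annealing. This is a toy illustration, not the authors' ANNETRON: the word vocabulary, `toy_reward`, and `toy_propose` are hypothetical stand-ins for a reward model and for LLM-performed edit operations, and the cooling schedule is an arbitrary assumption.

```python
import math
import random

# Hypothetical stand-ins for a reward model and an LLM edit operator.
VOCAB = ["clear", "concise", "helpful", "polite", "vague", "rambling"]
GOOD_WORDS = {"clear", "concise", "helpful", "polite"}


def toy_reward(response: str) -> float:
    """Score +1 for each 'good' word and -1 for every other word."""
    return float(sum(1 if w in GOOD_WORDS else -1 for w in response.split()))


def toy_propose(response: str, rng: random.Random) -> str:
    """Replace one randomly chosen word (a real system would ask the LLM to edit)."""
    words = response.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def anneal(initial, reward, propose, steps=200, t0=1.0, cooling=0.98, seed=0):
    """Simulated annealing over responses; only reward values are needed, no gradients."""
    rng = random.Random(seed)
    current, current_r = initial, reward(initial)
    best, best_r = current, current_r
    temp = t0
    for _ in range(steps):
        candidate = propose(current, rng)
        cand_r = reward(candidate)
        delta = cand_r - current_r
        # Always accept improvements; accept worse candidates with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_r = candidate, cand_r
            if current_r > best_r:
                best, best_r = current, current_r
        temp = max(temp * cooling, 1e-6)  # geometric cooling with a small floor
    return best, best_r
```

Calling `anneal("vague rambling vague rambling", toy_reward, toy_propose)` returns the best response seen and its reward. The search only ever queries the reward function, which is the property that makes the approach usable without retraining or gradient access.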
[19] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning
Mingyu Zheng,Zhifan Feng,Jia Wang,Lanrui Wang,Zheng Lin,Yang Hao,Weiping Wang
Main category: cs.CL
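TL;DR: TableDreamer is a progressive, weakness-guided data synthesis framework that improves the diversity and efficiency of data for table instruction tuning. It iteratively explores the input space and generates targeted data, significantly improving the target LLM on table understanding tasks.
Details
Motivation: Existing LLM-based data synthesis methods have two problems in table instruction tuning: 1) insufficient exploration of the input space, which limits data diversity; 2) ignoring the target LLM's weaknesses in table understanding while blindly pursuing data quantity.
Contribution: Proposes TableDreamer, a progressive and weakness-guided data synthesis framework that iteratively generates diverse tables and instructions and steers data generation toward the target LLM's weak spots.
Method: 1) Synthesize diverse tables and instructions as seed data; 2) iteratively explore the input space, using newly identified weakness data to guide subsequent generation; 3) use the resulting data to fine-tune the target LLM.
Result: On 10 tabular benchmarks, with only 27K GPT-4o synthetic examples, the average accuracy of Llama3.1-8B-instruct rises from 49.07% to 60.69%, surpassing state-of-the-art baselines that use more data.
Insight: Diversity and targeting (aimed at model weaknesses) in synthetic data are more effective than simply increasing data volume.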
Abstract: Despite the commendable progress of recent LLM-based data synthesis methods,
they face two limitations in generating table instruction tuning data. First,
they cannot thoroughly explore the vast input space of table understanding
tasks, leading to limited data diversity. Second, they ignore the weaknesses in
table understanding ability of the target LLM and blindly pursue the increase
of data quantity, resulting in suboptimal data efficiency. In this paper, we
introduce a progressive and weakness-guided data synthesis framework tailored
for table instruction tuning, named TableDreamer, to mitigate the above issues.
Specifically, we first synthesize diverse tables and related instructions as
seed data, and then perform an iterative exploration of the input space under
the guidance of the newly identified weakness data, which eventually serve as
the final training data for fine-tuning the target LLM. Extensive experiments
on 10 tabular benchmarks demonstrate the effectiveness of the proposed
framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62
percentage points (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms
state-of-the-art data synthesis baselines which use more training data. The
code and data are available at https://github.com/SpursGoZmy/TableDreamer
[20] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
Yang Liu,Jiaqi Li,Zilong Zheng
Main category: cs.CL
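TL;DR: RuleReasoner is a reinforcement-learning method for rule-based reasoning that uses domain-aware dynamic sampling to strengthen small models. It significantly outperforms frontier large reasoning models and is computationally more efficient than prior dynamic sampling methods.
Details
Motivation: In real applications, the variety of rule formats, types, and complexity challenges the rule-based reasoning of small reasoning models (SRMs). The paper asks whether SRMs can learn rule-based reasoning effectively, with robust generalization, through reinforcement learning.
Contribution: Proposes RuleReasoner, which uses domain-aware dynamic sampling to substantially improve small models on rule-based reasoning, outperforming large reasoning models (LRMs) while being computationally efficient.
Method: RuleReasoner adjusts the sampling weights of different domains dynamically and trains with reinforcement learning. This removes the need for predefined mix-training recipes and enables flexible online learning schedules.
Result: RuleReasoner significantly outperforms frontier LRMs such as OpenAI-o1 on both in-distribution (ID) and out-of-distribution (OOD) tasks, by an average of 4.1 points (ID) and 10.4 points (OOD), with higher computational efficiency than prior dynamic sampling methods.
Insight: With reinforcement learning and dynamic sampling, small models can do well on complex rule-based reasoning, which challenges the reliance on large models and suggests a route to efficient reasoning.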
Abstract: Rule-based reasoning has been acknowledged as one of the fundamental problems
in reasoning, while deviations in rule formats, types, and complexity in
real-world applications pose severe challenges. Recent studies have shown that
large reasoning models (LRMs) have remarkable reasoning capabilities, and their
performance is substantially enhanced by reinforcement learning (RL). However,
it remains an open question whether small reasoning models (SRMs) can learn
rule-based reasoning effectively with robust generalization across diverse
tasks and domains. To address this, we introduce Reinforced Rule-based
Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct
rule-based reasoning via a wide collection of curated tasks and a novel
domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples
each training batch by updating the sampling weights of different domains based
on historical rewards. This facilitates domain augmentation and flexible online
learning schedules for RL, obviating the need for pre-hoc human-engineered
mix-training recipes used in existing methods. Empirical evaluations on
in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that
RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1%
average points on eight ID tasks and $\Delta$10.4% average points on three OOD
tasks over OpenAI-o1). Notably, our approach also exhibits higher computational
efficiency compared to prior dynamic sampling methods for RL.
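One way to picture domain-aware dynamic sampling is the minimal sketch below. The abstract only says that sampling weights are updated from historical rewards; the exponential moving average, the softmax over `1 - reward`, and all names and constants here are assumptions made for illustration, not the paper's actual update rule.

```python
import math
import random


def update_reward_history(history, batch_rewards, alpha=0.5):
    """Exponential moving average of the mean reward per domain (assumed rule)."""
    updated = dict(history)
    for domain, rewards in batch_rewards.items():
        if rewards:
            mean_r = sum(rewards) / len(rewards)
            previous = history.get(domain, mean_r)
            updated[domain] = (1 - alpha) * previous + alpha * mean_r
    return updated


def sampling_weights(history, temperature=0.5):
    """Softmax over (1 - reward): domains with lower past reward get sampled more."""
    scores = {d: (1.0 - r) / temperature for d, r in history.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp(s - top) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


def sample_batch_domains(weights, batch_size, rng):
    """Draw the domain of each training example in the next batch."""
    domains = list(weights)
    return rng.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
```

For example, with `history = {"logic": 0.9, "arithmetic": 0.2}` the weaker `arithmetic` domain receives the larger weight, so the next batch drawn by `sample_batch_domains` leans toward it. Because the weights are recomputed from running rewards, no fixed mixing recipe has to be chosen up front.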
[21] Brevity is the soul of sustainability: Characterizing LLM response lengths
Soham Poddar,Paramita Koley,Janardan Misra,Sanjay Podder,Navveen Balani,Niloy Ganguly,Saptarshi Ghosh
Main category: cs.CL
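TL;DR: The paper studies overly long responses produced by large language models (LLMs) at inference time and their effect on energy efficiency, and proposes prompt-engineering strategies that shorten responses and yield substantial energy savings.
Details
Motivation: LLM inference consumes a large amount of energy, yet output compression remains under-explored. The paper aims to improve energy efficiency by cutting unnecessary response length.
Contribution: 1) Benchmarks the response lengths of 12 decoder-only LLMs on 5 datasets; 2) defines six categories of information in LLM responses and shows that redundant information is present; 3) proposes simple, intuitive prompt-engineering strategies to reduce response length.
Method: The paper first benchmarks and categorizes LLM response lengths, then designs several prompt-engineering strategies (prompts targeting length reduction and control of information content) and validates them experimentally.
Result: Appropriate prompts achieve energy savings of 25-60% by reducing response length while preserving response quality.
Insight: LLM responses contain a great deal of redundant information, and simple prompt engineering can noticeably improve energy efficiency, which matters for sustainable AI.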
Abstract: A significant portion of the energy consumed by Large Language Models (LLMs)
arises from their inference processes; hence developing energy-efficient
methods for inference is crucial. While several techniques exist for inference
optimization, output compression remains relatively unexplored, with only a few
preliminary efforts addressing this aspect. In this work, we first benchmark 12
decoder-only LLMs across 5 datasets, revealing that these models often produce
responses that are substantially longer than necessary. We then conduct a
comprehensive quality assessment of LLM responses, formally defining six
information categories present in LLM responses. We show that LLMs often tend
to include redundant or additional information besides the minimal answer. To
address this issue of long responses by LLMs, we explore several simple and
intuitive prompt-engineering strategies. Empirical evaluation shows that
appropriate prompts targeting length reduction and controlling information
content can achieve significant energy savings of 25-60% by reducing
the response length while preserving the quality of LLM responses.
[22] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
Ruiran Su,Jiasheng Si,Zhijiang Guo,Janet B. Pierrehumbert
Main category: cs.CL
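TL;DR: The paper introduces ClimateViz, the first large-scale benchmark for fact-checking against scientific charts, with a large set of annotated charts and linked claims, and evaluates existing multimodal models on it.
Details
Motivation: Scientific fact-checking has focused mainly on text and tables and has overlooked scientific charts, which are key for presenting quantitative evidence and statistical reasoning.
Contribution: 1. Releases ClimateViz, the first large-scale benchmark for scientific chart fact-checking; 2. The dataset contains 2,896 visualizations and 49,862 linked claims; 3. Provides explanations in knowledge-graph form to improve interpretability.
Method: 1. Builds a dataset of charts and claims; 2. Labels each claim as support, refute, or not enough information; 3. Evaluates multimodal language models, including Gemini 2.5 and InternVL 2.5, in zero-shot and few-shot settings.
Result: Existing models struggle with chart-based reasoning: the best systems reach only 76.2 to 77.8 percent accuracy, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs help some models.
Insight: Fact-checking against scientific charts is challenging, and current multimodal models have room to improve, especially on complex statistical reasoning.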
Abstract: Scientific fact-checking has mostly focused on text and tables, overlooking
scientific charts, which are key for presenting quantitative evidence and
statistical reasoning. We introduce ClimateViz, the first large-scale benchmark
for scientific fact-checking using expert-curated scientific charts. ClimateViz
contains 49,862 claims linked to 2,896 visualizations, each labeled as support,
refute, or not enough information. To improve interpretability, each example
includes structured knowledge graph explanations covering trends, comparisons,
and causal relations. We evaluate state-of-the-art multimodal language models,
including both proprietary and open-source systems, in zero-shot and few-shot
settings. Results show that current models struggle with chart-based reasoning:
even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to
77.8 percent accuracy in label-only settings, far below human performance (89.3
and 92.7 percent). Explanation-augmented outputs improve performance in some
models. We released our dataset and code alongside the paper.
[23] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Fariz Ikhwantri,Dusica Marijan
Main category: cs.CL
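TL;DR: The paper proposes EXCLAIM, a compliance detection method based on Natural Language Inference (NLI) that uses multi-hop inference for explainable and traceable compliance detection, and generates assurance cases with large language models to address data scarcity.
Details
Motivation: Compliance detection for complex systems is hampered by complicated legal and technical texts, the need for model explanations, and the scarcity of assurance case data.
Contribution: Proposes EXCLAIM, which models the claim-argument-evidence structure of an assurance case as a multi-hop inference task to support explainable compliance detection; uses large language models to generate assurance cases to compensate for limited data; and introduces metrics for coverage and structural consistency.
Method: Converts the assurance case structure into a multi-hop natural language inference task, generates assurance cases with large language models, and assesses the quality of the generated cases with the proposed metrics.
Result: A case study shows that assurance cases generated from GDPR requirements are effective in the multi-hop inference task, supporting the potential of NLI-based approaches for automated compliance detection.
Insight: NLI and multi-hop inference offer an explainable, automated approach to complex compliance detection, and large language models show promise where data is scarce.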
Abstract: Ensuring complex systems meet regulations typically requires checking the
validity of assurance cases through a claim-argument-evidence framework. Some
challenges in this process include the complicated nature of legal and
technical texts, the need for model explanations, and limited access to
assurance case data. We propose a compliance detection approach based on
Natural Language Inference (NLI): EXplainable CompLiance detection with
Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the
claim-argument-evidence structure of an assurance case as a multi-hop inference
for explainable and traceable compliance detection. We address the limited
number of assurance cases by generating them using large language models
(LLMs). We introduce metrics that measure the coverage and structural
consistency. We demonstrate the effectiveness of the generated assurance case
from GDPR requirements in a multi-hop inference task as a case study. Our
results highlight the potential of NLI-based approaches in automating the
regulatory compliance process.
[24] Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition
Mehedi Hasan Bijoy,Dejan Porjazovski,Tamás Grósz,Mikko Kurimo
Main category: cs.CL
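TL;DR: The paper proposes a multilingual speech emotion recognition method based on multi-teacher, language-aware knowledge distillation, which distills knowledge from several monolingual teacher models into one multilingual student model. It performs well on English, Finnish, and French datasets and improves recall for emotion recognition.
Details
Motivation: Despite progress in monolingual speech emotion recognition (SER), extending it to multilingual systems remains challenging. The goal is to train a single model capable of multilingual SER to improve human-computer interaction.
Contribution: Proposes a language-aware multi-teacher knowledge distillation method that uses Wav2Vec2.0 as the foundation of the teacher models and distills their knowledge into a multilingual student model, improving multilingual SER performance.
Method: Trains several monolingual teacher models on top of Wav2Vec2.0 and merges their knowledge into a single multilingual student model through language-aware multi-teacher knowledge distillation.
Result: The method reaches a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines.
Insight: The method is particularly good at recognizing sad and neutral emotions but still struggles with anger and happiness, which reflects the complexity of multilingual SER.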
Abstract: Speech Emotion Recognition (SER) is crucial for improving human-computer
interaction. Despite strides in monolingual SER, extending them to build a
multilingual system remains challenging. Our goal is to train a single model
capable of multilingual SER by distilling knowledge from multiple teacher
models. To address this, we introduce a novel language-aware multi-teacher
knowledge distillation method to advance SER in English, Finnish, and French.
It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and
then distills their knowledge into a single multilingual student model. The
student model demonstrates state-of-the-art performance, with a weighted recall
of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish
dataset, surpassing fine-tuning and knowledge distillation baselines. Our
method excels in improving recall for sad and neutral emotions, although it
still faces challenges in recognizing anger and happiness.
[25] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
Ahmed Hasanaath,Aisha Alansari,Ahmed Ashraf,Chafik Salmane,Hamzah Luqman,Saad Ezzini
Main category: cs.CL
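TL;DR: The paper presents a comprehensive evaluation of reasoning-focused large language models (LLMs) for Arabic NLP, with emphasis on the DeepSeek models. It runs experiments on fifteen Arabic NLP tasks under several strategies (zero-shot, few-shot, and fine-tuning) and reports key findings, such as the large gains from a few in-context examples.
Details
Motivation: Arabic has rich morphology, diverse dialects, and a complex script, and LLM performance on it remains under-explored. The paper aims to fill this gap by evaluating reasoning-focused LLMs on Arabic NLP tasks.
Contribution: 1) A comprehensive performance evaluation of several reasoning-focused LLMs (especially DeepSeek models) on Arabic NLP tasks; 2) evidence that a few in-context examples improve performance substantially; 3) evidence that LoRA-based fine-tuning is effective on smaller models.
Method: Systematically evaluates fifteen Arabic NLP tasks using zero-shot, few-shot, and LoRA fine-tuning strategies, comparing models such as DeepSeek and GPT o4-mini.
Result: 1) Three in-context examples give an average uplift of over 13 F1 points on classification tasks; 2) DeepSeek outperforms GPT o4-mini by an average of 12 F1 points on complex inference tasks in the zero-shot setting; 3) LoRA-based fine-tuning adds up to 8 more points in F1 and BLEU.
Insight: The complexity of Arabic NLP tasks places higher demands on LLMs, but careful selection of in-context examples and fine-tuning strategies can improve performance substantially.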
Abstract: Large language models (LLMs) have shown remarkable progress in reasoning
abilities and general natural language processing (NLP) tasks, yet their
performance on Arabic data, characterized by rich morphology, diverse dialects,
and complex script, remains underexplored. This paper presents a comprehensive
benchmarking study of multiple reasoning-focused LLMs, with a special emphasis
on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP
tasks. We experiment with various strategies, including zero-shot, few-shot,
and fine-tuning. This allows us to systematically evaluate performance on
datasets covering a range of applications to examine their capacity for
linguistic reasoning under different levels of complexity. Our experiments
reveal several key findings. First, carefully selecting just three in-context
examples delivers an average uplift of over 13 F1 points on classification
tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection
from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures
outperform a strong GPT o4-mini baseline by an average of 12 F1 points on
complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning
yields up to an additional 8 points in F1 and BLEU compared to equivalent
increases in model scale. The code is available at
https://anonymous.4open.science/r/AraReasoner41299
[26] The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation
Francisco Vargas,Alejandro González Coene,Gaston Escalante,Exequiel Lobón,Manuel Pulido
Main category: cs.CL
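TL;DR: The paper studies how fine-tuning LLaMA models reduces hallucinations in named entity extraction, and compares several methods for extracting traffic accident information from legal documents.
Details
Motivation: Extracting traffic accident information (such as disability percentages and compensation amounts) from legal documents is crucial for quantifying insurance company costs, but the subtle arguments and reasoning in court decisions make accurate extraction hard even for experts.
Contribution: Proposes a two-step procedure of text segmentation followed by entity extraction. Compares a classic regular-expression method with a segmentation method based on semantic search, and shows that fine-tuned LLaMA models hallucinate substantially less.
Method: 1) Text segmentation: compares regular expressions with semantic search over blocks vectorized by multilingual models. 2) Entity extraction: applies several LLMs (LLaMA-2, LLaMA-3, GPT-4 Turbo) with prompting, and fine-tunes the LLaMA models with LoRA.
Result: Fine-tuned LLaMA-2 70B has the highest accuracy among open-source models (79.4%), above its base version (61.7%). The base LLaMA-3 8B (76.6%) performs comparably to the fine-tuned LLaMA-2 70B, and GPT-4 Turbo is best overall (86.1%).
Insight: Fine-tuning substantially reduces LLM hallucinations in named entity extraction; the base LLaMA-3 model already approaches the fine-tuned LLaMA-2 70B, showing how quickly models are improving; GPT-4 Turbo, a closed model, performs best overall.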
Abstract: The extraction of information about traffic accidents from legal documents is
crucial for quantifying insurance company costs. Extracting entities such as
percentages of physical and/or psychological disability and the involved
compensation amounts is a challenging process, even for experts, due to the
subtle arguments and reasoning in the court decision. A two-step procedure is
proposed: first, segmenting the document identifying the most relevant
segments, and then extracting the entities. For text segmentation, two
methodologies are compared: a classic method based on regular expressions and a
second approach that divides the document into blocks of n-tokens, which are
then vectorized using multilingual models for semantic searches
(text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models
(LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to
the selected segments for entity extraction. For the LLaMA models, fine-tuning
is performed using LoRA. LLaMA-2 7b, even with zero temperature, produces a
significant number of hallucinations in its extractions, which is a major
concern for named entity extraction. This work shows that these
hallucinations are substantially reduced after fine-tuning the model. The
performance of the methodology based on segment vectorization and subsequent
use of LLMs significantly surpasses the classic method, which achieves an
accuracy of 39.5%. Among open-source models, LLaMA-2 70B with fine-tuning
achieves the highest accuracy (79.4%), surpassing its base version (61.7%).
Notably, the base LLaMA-3 8B model already performs comparably to the fine-tuned
LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model
development. Meanwhile, GPT-4 Turbo achieves the highest accuracy overall at 86.1%.
[27] PropMEND: Hypernetworks for Knowledge Propagation in LLMs
Zeyu Leo Liu,Greg Durrett,Eunsol Choi
Main category: cs.CL
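TL;DR: PropMEND is a hypernetwork-based method for knowledge propagation in LLMs. It meta-learns how to modify gradients so that injected knowledge can be used in reasoning, and performs strongly on multi-hop questions, although generalization to unseen relations still has room to improve.
Details
Motivation: Existing knowledge editing techniques for LLMs can inject knowledge, but they do poorly on multi-hop questions that require reasoning with it, so a method that propagates knowledge effectively is needed.
Contribution: Proposes PropMEND, which meta-learns gradient modifications through a hypernetwork and substantially improves LLM performance on multi-hop questions, and introduces a new dataset, Controlled RippleEdit, to evaluate generalization.
Method: Extends the meta-objective of MEND so that a hypernetwork learns how to modify the gradients of a language modeling loss to encourage injected knowledge to propagate. The method is validated on the RippleEdit dataset.
Result: PropMEND achieves almost 2x accuracy on multi-hop questions in RippleEdit. On unseen entity-relation pairs it still outperforms existing approaches, but the gap narrows substantially.
Insight: Generalizing knowledge propagation to unseen relations needs further study; future work can look at extending it to a wider range of relations.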
Abstract: Knowledge editing techniques for large language models (LLMs) can inject
knowledge that is later reproducible verbatim, but they fall short on
propagating that knowledge: models cannot answer questions that require
reasoning with the injected knowledge. We present a hypernetwork-based approach
for knowledge propagation, named PropMEND, where we meta-learn how to modify
gradients of a language modeling loss to encourage injected information to
propagate. Our approach extends the meta-objective of MEND [29] so that
gradient updates on knowledge are transformed to enable answering multi-hop
questions involving that knowledge. We show improved performance on the
RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop
questions whose answers are not explicitly stated in the injected fact. We
further introduce a new dataset, Controlled RippleEdit, to evaluate the
generalization of our hypernetwork, testing knowledge propagation along
relations and entities unseen during hypernetwork training. PropMEND still
outperforms existing approaches in unseen entity-relation pairs, yet the
performance gap decreases substantially, suggesting future work in propagating
knowledge to a wide range of relations.
[28] Can A Gamer Train A Mathematical Reasoning Model?
Andrew Shin
Main category: cs.CL
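TL;DR: The paper shows that, by combining reinforcement learning with memory optimization techniques, a single ordinary gaming GPU (an RTX 3080 Ti) can train a solid mathematical reasoning model, challenging the assumption that high-performance AI research requires large-scale infrastructure.
Details
Motivation: Large language models (LLMs) do well on tasks such as mathematical reasoning, but training them usually demands expensive compute. The paper explores how to train a capable model in a resource-constrained setting such as an ordinary gaming GPU.
Contribution: Demonstrates that a single gaming GPU (RTX 3080 Ti) can train a 1.5B-parameter mathematical reasoning model whose performance is comparable to or better than that of models several times larger, opening high-performance AI research to researchers with limited resources.
Method: Combines reinforcement learning with memory optimization techniques to sharply reduce training resource requirements.
Result: On mathematical reasoning benchmarks, the 1.5B-parameter model performs comparably to or better than models several times larger.
Insight: Optimization techniques and algorithms can overcome resource limits in high-performance AI research, giving a wider range of researchers access.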
Abstract: While large language models (LLMs) have achieved remarkable performance in
various tasks including mathematical reasoning, their development typically
demands prohibitive computational resources. Recent advancements have reduced
costs for training capable models, yet even these approaches rely on high-end
hardware clusters. In this paper, we demonstrate that a single average gaming
GPU can train a solid mathematical reasoning model, by integrating
reinforcement learning and memory optimization techniques. Specifically, we
train a 1.5B-parameter mathematical reasoning model on an RTX 3080 Ti with 16GB
of memory that, in resource-constrained environments, achieves comparable or
better performance on mathematical reasoning benchmarks than models several
times larger. Our results challenge the paradigm that state-of-the-art
mathematical reasoning necessitates massive infrastructure, democratizing
access to high-performance AI research. The code is available at
https://github.com/shinandrew/YouronMath.
[29] FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Qinggang Zhang,Zhishang Xiang,Yilin Xiao,Le Wang,Junhui Li,Xinrun Wang,Jinsong Su
Main category: cs.CL
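TL;DR: FaithfulRAG is a new framework that explicitly models conflicts between an LLM's parametric knowledge and the retrieved context in order to generate more faithful outputs.
Details
Motivation: Retrieval-augmented LLMs produce unfaithful outputs on knowledge-intensive tasks, especially when parametric knowledge conflicts with the context. Existing methods force adherence to the context and suppress the model's internal knowledge, which raises the risk of misinterpreting the context.
Contribution: Proposes the FaithfulRAG framework, which resolves conflicts between the LLM's knowledge and the retrieved context through fact-level conflict modeling and a self-thinking process.
Method: 1. Identifies conflicting knowledge at the fact level; 2. Designs a self-thinking process that lets the LLM reason about and integrate conflicting facts before generating a response.
Result: Experiments show that FaithfulRAG outperforms state-of-the-art methods.
Insight: Explicitly modeling knowledge conflicts and adding a reasoning step can balance the LLM's internal knowledge against external context and improve generation quality.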
Abstract: Large language models (LLMs) augmented with retrieval systems have
demonstrated significant potential in handling knowledge-intensive tasks.
However, these models often struggle with unfaithfulness issues, generating
outputs that either ignore the retrieved context or inconsistently blend it
with the LLM’s parametric knowledge. This issue is particularly severe in cases
of knowledge conflict, where the retrieved context conflicts with the model’s
parametric knowledge. While existing faithful RAG approaches enforce strict
context adherence through well-designed prompts or modified decoding
strategies, our analysis reveals a critical limitation: they achieve
faithfulness by forcibly suppressing the model’s parametric knowledge, which
undermines the model’s internal knowledge structure and increases the risk of
misinterpreting the context. To this end, this paper proposes FaithfulRAG, a
novel framework that resolves knowledge conflicts by explicitly modeling
discrepancies between the model’s parametric knowledge and retrieved context.
Specifically, FaithfulRAG identifies conflicting knowledge at the fact level
and designs a self-thinking process, allowing LLMs to reason about and
integrate conflicting facts before generating responses. Extensive experiments
demonstrate that our method outperforms state-of-the-art methods. The code is
available at https://github.com/DeepLearnXMU/Faithful-RAG
[30] Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions
Clara Lachenmaier,Judith Sieker,Sina Zarrieß
Main category: cs.CL
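TL;DR: The paper studies how large language models (LLMs) handle direct questions and loaded questions in the political domain, and finds that LLMs face significant challenges in correcting false user beliefs and actively establishing common ground.
Details
Motivation: Human conversation relies on grounding, but how LLMs behave in this respect has not been studied enough. The political domain carries a high risk of misinformation, so it is important to know whether LLMs can handle such questions.
Contribution: Systematically evaluates LLMs on direct knowledge questions and on loaded questions that presuppose misinformation, and reveals their shortcomings in correcting false beliefs and establishing common ground.
Method: Designs direct knowledge questions and loaded questions that presuppose misinformation, then analyzes LLM answers with respect to their level of knowledge, their political bias, and whether they actively correct the user's false beliefs.
Result: LLMs struggle to engage in grounding and to reject false user beliefs in the political domain.
Insight: The findings raise concerns about the role of LLMs in mitigating misinformation in political discourse and underline the need for models that handle grounding better.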
Abstract: Communication among humans relies on conversational grounding, allowing
interlocutors to reach mutual understanding even when they do not have perfect
knowledge and must resolve discrepancies in each other’s beliefs. This paper
investigates how large language models (LLMs) manage common ground in cases
where they (don’t) possess knowledge, focusing on facts in the political domain
where the risk of misinformation and grounding failure is high. We examine the
ability of LLMs to answer direct knowledge questions and loaded questions that
presuppose misinformation. We evaluate whether loaded questions lead LLMs to
engage in active grounding and correct false user beliefs, in connection to
their level of knowledge and their political bias. Our findings highlight
significant challenges in LLMs’ ability to engage in grounding and reject false
user beliefs, raising concerns about their role in mitigating misinformation in
political discourse.
[31] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Marek Kadlčík,Michal Štefánik,Timothee Mickus,Michal Spiegel,Josef Kuchař
Main category: cs.CL
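TL;DR: After pre-training, language models (LMs) represent numbers accurately, and earlier probing methods missed the sinusoidal patterns in their embeddings. A newly proposed probing technique decodes numeric values almost perfectly, showing that LM number representations are precise. This precision also explains many of the LMs' errors in elementary arithmetic, which can be mitigated by aligning the embeddings with the discovered pattern.
Details
Motivation: Prior work attributes LMs' poor arithmetic to distributionally learned embeddings that cannot represent exact quantities. The paper observes that earlier probing methods do not capture the sinusoidal patterns in the embeddings and may therefore underestimate the models' numeric representations.
Contribution: 1. Proposes a new probing technique that decodes numeric values from embeddings with near-perfect accuracy; 2. Shows that pre-trained LMs represent numbers precisely; 3. Shows that embedding precision is linked to arithmetic errors and proposes an alignment approach that mitigates them.
Method: The authors design a new probe that targets the sinusoidal patterns in the embeddings and uses this structure to decode numeric values with high accuracy.
Result: The new probe decodes numeric values almost perfectly across a range of open-source LMs. Embedding precision explains a large portion of arithmetic errors, and aligning the embeddings with the discovered pattern reduces those errors.
Insight: LM embeddings may hold far more precise number representations than previously thought, which conventional probes failed to reveal. The structure of the embeddings matters for the model's arithmetic ability.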
Abstract: Pretrained language models (LMs) are prone to arithmetic errors. Existing
work showed limited success in probing numeric values from models’
representations, indicating that these errors can be attributed to the inherent
unreliability of distributionally learned embeddings in representing exact
quantities. However, we observe that previous probing methods are inadequate
for the emergent structure of learned number embeddings with sinusoidal
patterns.
In response, we propose a novel probing technique that decodes numeric values
from input embeddings with near-perfect accuracy across a range of open-source
LMs. This proves that after pre-training alone, LMs represent numbers with
remarkable precision. Finally, we find that the embeddings’ precision, as judged
by our probe’s accuracy, explains a large portion of LMs’ errors in elementary
arithmetic, and show that aligning the embeddings with the pattern discovered
by our probe can mitigate these errors.
[32] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Yuan Guo,Tingjia Miao,Zheng Wu,Pengzhou Cheng,Ming Zhou,Zhuosheng Zhang
Main category: cs.CL
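TL;DR: The paper introduces the UI-NEXUS benchmark and the AGENT-NEXUS scheduling system to evaluate and improve mobile agents on compositional tasks, addressing generalization from atomic to compositional tasks.
Details
Motivation: Existing mobile agents mainly handle atomic tasks, but real-world applications require compositional tasks, so a new benchmark and scheduling system are needed to address this generalization gap.
Contribution: 1) The UI-NEXUS benchmark; 2) the AGENT-NEXUS scheduling system; 3) experiments that document the challenges of compositional generalization and the improvements achieved.
Method: Evaluates compositional tasks with UI-NEXUS and designs AGENT-NEXUS to dynamically decompose long-horizon tasks.
Result: AGENT-NEXUS improves task success rate by 24% to 40% without significantly increasing inference overhead.
Insight: Existing agents perform poorly on compositional tasks, and dynamic decomposition is an effective way to improve generalization.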
Abstract: Autonomous agents powered by multimodal large language models have been
developed to facilitate task execution on mobile devices. However, prior work
has predominantly focused on atomic tasks – such as short-chain execution tasks
and single-screen grounding tasks – while overlooking the generalization to
compositional tasks, which are indispensable for real-world applications. This
work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile
agents on three categories of compositional operations: Simple Concatenation,
Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in
20 fully controllable local utility app environments, as well as 30 online
Chinese and English service apps. It comprises 100 interactive task templates
with an average optimal step count of 14.05. Experimental results across a
range of mobile agents with agentic workflow or agent-as-a-model show that
UI-NEXUS presents significant challenges. Specifically, existing agents
generally struggle to balance performance and efficiency, exhibiting
representative failure modes such as under-execution, over-execution, and
attention drift, causing a visible atomic-to-compositional generalization gap.
Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient
scheduling system to tackle compositional mobile tasks. AGENT-NEXUS
extrapolates the abilities of existing mobile agents by dynamically decomposing
long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS
achieves 24% to 40% task success rate improvement for existing mobile agents on
compositional operation tasks within the UI-NEXUS benchmark without
significantly increasing inference overhead. The demo video, dataset, and code
are available on the project page at https://ui-nexus.github.io.
[33] FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents
Satu Hopponen,Tomi Kinnunen,Alexandre Nikolaev,Rosa González Hautamäki,Lauri Tavi,Einar Meister
Main category: cs.CL
TL;DR: 论文介绍了FROST-EMA数据集,包含18名双语使用者在母语(L1)、二语(L2)及模仿二语口音(假外国口音)下的语音数据,并展示了两项初步研究,探讨这些语言变体对语音技术和发音行为的影响。
Details
Motivation: 研究动机是填补多语言语音数据集在发音测量技术(如电磁发音描记术)上的空白,并探索语言变体(如L2和假口音)在语音技术和发音行为研究中的潜在应用。Contribution: 主要贡献是发布了FROST-EMA数据集,包含多种语言变体的发音数据,并通过两项初步研究展示了其在语音技术和发音行为分析中的价值。
Method: 采用电磁发音描记术(EMA)采集18名双语使用者在L1、L2及模仿L2口音下的发音数据,并设计了两项研究:1)分析L2和假口音对自动说话人验证系统的影响;2)展示一位使用者在不同语言变体下的发音模式。
Result: 结果表明,L2和假口音可能影响说话人验证系统的性能,且发音模式在不同语言变体中存在差异。
Insight: 研究揭示了语言变体对语音技术和发音行为的复杂影响,为多语言语音研究提供了新的数据支持和分析视角。
Abstract: We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of
Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers,
who produced speech in their native language (L1), second language (L2), and
imitated L2 (fake foreign accent). The new corpus enables research into
language variability from phonetic and technological points of view.
Accordingly, we include two preliminary case studies to demonstrate both
perspectives. The first case study explores the impact of L2 and imitated L2 on
the performance of an automatic speaker verification system, while the second
illustrates the articulatory patterns of one speaker in L1, L2, and a fake
accent.
[34] Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi,Xi Ye,Hao Tang,Zhigang Zhu,Eunsol Choi
Main category: cs.CL
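TL;DR: The paper proposes SSA (Sample Set Aggregator), a small LLM trained to aggregate multiple reasoning samples, which improves performance on reasoning tasks such as math. Experiments show that SSA outperforms other test-time scaling methods and generalizes well.
Details
Motivation: Existing methods use multiple test-time samples through voting or verifier ranking to improve LLM performance, but no model is trained specifically to aggregate the sample set. The paper aims to design a more effective aggregation method.
Contribution: Proposes the Sample Set Aggregator (SSA), a compact LLM optimized with reinforcement learning to aggregate a set of samples, which significantly improves performance on reasoning tasks.
Method: SSA is trained with reinforcement learning; its input is the concatenated sequence of multiple samples and its output is the final answer. SSA is decoupled from the LLM that generates the samples, so it can be applied directly to the outputs of black-box models.
Result: On multiple reasoning datasets, SSA outperforms methods such as reward model-based re-ranking and generalizes across sample set sizes, model families, and tasks.
Insight: Separating the LLM that generates samples from the LLM that aggregates them improves reasoning performance and adds flexibility, including compatibility with black-box model outputs.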
Abstract: Scaling test-time compute brings substantial performance gains for large
language models (LLMs). By sampling multiple answers and heuristically
aggregating them (e.g., through majority voting or using
verifiers to rank the answers), one can achieve consistent performance gains in
math domains. In this paper, we propose a new way to leverage such sets of
multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that
takes a concatenated sequence of multiple samples and outputs the final answer,
optimizing it for the answer accuracy with reinforcement learning. Experiments
on multiple reasoning datasets show that SSA outperforms other test-time
scaling methods such as reward model-based re-ranking. Our approach also shows
a promising generalization ability, across sample set sizes, base model
families and scales, and tasks. By separating LLMs to generate answers and LLMs
to analyze and aggregate sampled answers, our approach can work with the
outputs from premier black box models easily and efficiently.
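For contrast with a trained aggregator, the heuristic baseline mentioned in the abstract (majority voting) fits in a few lines. The second function only illustrates what a "concatenated sequence of multiple samples" might look like; its prompt layout is a hypothetical format, not the one used to train SSA.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]


def build_aggregator_input(question, samples):
    """Concatenate sampled solutions into one prompt (hypothetical layout)."""
    parts = [f"Question: {question}"]
    for index, sample in enumerate(samples, start=1):
        parts.append(f"Sample {index}: {sample.strip()}")
    parts.append("Final answer:")
    return "\n".join(parts)
```

Majority voting looks only at final answers, so it cannot recover when the correct answer is in the minority. An aggregator model that reads the full samples, as the abstract describes, at least has the information needed to do so.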
[35] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
Haozhen Zhang,Tao Feng,Jiaxuan You
Main category: cs.CL
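TL;DR: The paper proposes Router-R1, a reinforcement-learning framework for multi-round routing and aggregation that dynamically invokes multiple LLMs and integrates their responses, improving both performance and cost management on complex tasks.
Details
Motivation: Existing LLM routers usually perform a single-round, one-to-one mapping and cannot exploit the complementary strengths of multiple LLMs on complex tasks.
Contribution: Proposes the Router-R1 framework, which uses reinforcement learning for multi-round routing and aggregation, dynamically invoking multiple LLMs and integrating their responses to optimize the trade-off between performance and cost.
Method: The framework formulates routing and aggregation as a sequential decision process, uses an LLM as the router, interleaves internal reasoning with dynamic model invocation, and guides learning with rule-based rewards (format reward, outcome reward, and cost reward).
Result: On seven general and multi-hop QA benchmarks, Router-R1 outperforms several strong baselines, with better performance and cost management.
Insight: Optimizing an LLM routing system with reinforcement learning can substantially improve the handling of complex tasks while balancing performance against cost.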
Abstract: The rapid emergence of diverse large language models (LLMs) has spurred the
development of LLM routers that assign user queries to the most suitable model.
However, existing LLM routers typically perform a single-round, one-to-one
mapping (\textit{i.e.}, assigning each query to a single model in isolation),
which limits their capability to tackle complex tasks that demand the
complementary strengths of multiple LLMs. In this paper, we present
Router-R1, a reinforcement learning (RL)-based framework that
formulates multi-LLM routing and aggregation as a sequential decision process.
Router-R1 instantiates the router itself as a capable LLM, leveraging its
reasoning ability to interleave “think” actions (internal deliberation) with
“route” actions (dynamic model invocation), and integrates each response into
its evolving context. To guide learning, we employ a lightweight rule-based
reward comprising format rewards, final outcome rewards, and a novel cost
reward for performance and cost trade-off optimization, opening a pathway
toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions
only on simple model descriptors such as pricing, latency, and example
performance, enabling strong generalization to unseen model selection.
Experiments on seven general and multi-hop QA benchmarks show that Router-R1
outperforms several strong baselines, achieving superior performance while
maintaining robust generalization and cost management. Code is available at
https://github.com/ulab-uiuc/Router-R1.
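The rule-based reward described in the abstract combines a format term, an outcome term, and a cost term. The sketch below shows one plausible way to combine them. It is not the authors' formula: the function name `rule_based_reward`, the weight `alpha`, the `-1.0` format penalty, and the linear cost normalization are all hypothetical choices made for illustration.

```python
def rule_based_reward(well_formatted, answer_correct, cost, max_cost, alpha=0.5):
    """Hypothetical combination of format, outcome, and cost rewards.

    well_formatted: whether the rollout followed the expected output format.
    answer_correct: whether the final answer matches the reference.
    cost:           total cost of the models invoked in this rollout.
    max_cost:       normalization constant (> 0), an assumption of this sketch.
    alpha:          weight of the cost bonus, also an assumption.
    """
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    if not well_formatted:
        # Format reward: malformed rollouts get a fixed penalty.
        return -1.0
    outcome = 1.0 if answer_correct else 0.0
    # Cost reward: 1.0 for a free rollout, 0.0 at or above max_cost.
    cost_term = 1.0 - min(cost, max_cost) / max_cost
    # The cost bonus is gated on correctness so cheap wrong answers earn nothing.
    return outcome * (1.0 + alpha * cost_term)
```

Gating the cost bonus on correctness is one design choice among several; an additive cost term would instead reward cheap rollouts even when they are wrong.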
[36] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin,Dana Arad,Yossi Gandelsman,Yonatan Belinkov
Main category: cs.CL
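TL;DR: The paper studies why vision-language models (VLMs) perform differently on visual and textual versions of the same task, and finds that the circuits used for the two modalities are functionally similar but structurally disjoint. Patching late-layer representations of visual data back into earlier layers closes a third of the performance gap between modalities.
Details
Motivation: VLMs are typically less accurate on visual tasks than on textual versions of the same tasks. The paper aims to explain this gap and find a way to reduce it.
Contribution: Shows that circuits are largely disjoint across modalities, and proposes a training-free intervention that narrows the gap by patching visual representations back into earlier layers.
Method: Analyzes the modality specificity of circuits and designs an intervention that patches late-layer representations of visual data tokens back into earlier layers.
Result: Across multiple tasks and models, the intervention closes a third of the performance gap between modalities on average.
Insight: Visual representations align with textual representations only in later layers, which causes the gap; intervening on when this alignment happens improves performance.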
Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions
on visual inputs (e.g., counting objects in an image), yet demonstrate higher
accuracies when performing an analogous task on text (e.g., counting words in a
text). We investigate this accuracy gap by identifying and comparing the
\textit{circuits} - the task-specific computational sub-graphs - in different
modalities. We show that while circuits are largely disjoint between
modalities, they implement relatively similar functionalities: the differences
lie primarily in processing modality-specific data positions (an image or a
text sequence). Zooming in on the image data representations, we observe they
become aligned with the higher-performing analogous textual representations
only towards later layers, too late in processing to effectively influence
subsequent positions. To overcome this, we patch the representations of visual
data tokens from later layers back into earlier layers. In experiments with
multiple tasks and models, this simple intervention closes a third of the
performance gap between the modalities, on average. Our analysis sheds light on
the multi-modal performance gap in VLMs and suggests a training-free approach
for reducing it.
cs.CV [Back]
[37] Towards Reliable AR-Guided Surgical Navigation: Interactive Deformation Modeling with Data-Driven Biomechanics and Prompts
Zheng Han,Jun Zhou,Jialun Pei,Jing Qin,Yingfang Fan,Qi Dou
Main category: cs.CV
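TL;DR: The paper proposes a deformation modeling method for augmented reality (AR)-guided surgical navigation that combines a data-driven biomechanics algorithm with an interactive prompting mechanism. The method preserves finite element method (FEM)-level accuracy while improving computational efficiency, and adds a human-in-the-loop mechanism for dynamically correcting anatomical misalignments. Experiments show improved navigation accuracy and reliability.
Details
Motivation: Precisely aligning preoperative organ models with the dynamically changing intraoperative anatomy is a key challenge in surgical navigation. FEM is accurate but computationally expensive, and it is hard to handle large anatomical changes (such as those caused by pneumoperitoneum or ligament dissection). Existing algorithms cannot guarantee accuracy in these scenarios, which limits the reliability of AR navigation.
Contribution: 1. Proposes a data-driven biomechanics algorithm that preserves FEM-level accuracy while improving computational efficiency;
2. Introduces an interactive human-in-the-loop mechanism that lets surgeons dynamically correct model misalignments, incorporating clinical expertise to improve navigation accuracy.
Method: 1. Data-driven biomechanics algorithm: learns deformation behavior from data with machine learning, replacing the costly computation of conventional FEM;
2. Interactive prompting mechanism: surgeons provide real-time feedback through an interactive interface to adjust the model alignment.
Result: On a publicly available dataset, the method achieves a mean target registration error of 3.42 mm, which drops further to 2.78 mm with interactive prompts, outperforming existing methods.
Insight: 1. Data-driven methods can deliver efficient and accurate deformation modeling for surgical navigation;
2. A human-in-the-loop mechanism effectively integrates clinical expertise and makes navigation more robust in complex surgical scenarios.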
Abstract: In augmented reality (AR)-guided surgical navigation, preoperative organ
models are superimposed onto the patient’s intraoperative anatomy to visualize
critical structures such as vessels and tumors. Accurate deformation modeling
is essential to maintain the reliability of AR overlays by ensuring alignment
between preoperative models and the dynamically changing anatomy. Although the
finite element method (FEM) offers physically plausible modeling, its high
computational cost limits intraoperative applicability. Moreover, existing
algorithms often fail to handle large anatomical changes, such as those induced
by pneumoperitoneum or ligament dissection, leading to inaccurate anatomical
correspondences and compromised AR guidance. To address these challenges, we
propose a data-driven biomechanics algorithm that preserves FEM-level accuracy
while improving computational efficiency. In addition, we introduce a novel
human-in-the-loop mechanism into the deformation modeling process. This enables
surgeons to interactively provide prompts to correct anatomical misalignments,
thereby incorporating clinical expertise and allowing the model to adapt
dynamically to complex surgical scenarios. Experiments on a publicly available
dataset demonstrate that our algorithm achieves a mean target registration
error of 3.42 mm. Incorporating surgeon prompts through the interactive
framework further reduces the error to 2.78 mm, surpassing state-of-the-art
methods in volumetric accuracy. These results highlight the ability of our
framework to deliver efficient and accurate deformation modeling while
enhancing surgeon-algorithm collaboration, paving the way for safer and more
reliable computer-assisted surgeries.
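The headline metric above, mean target registration error (TRE), is the average Euclidean distance between corresponding target points after registration. A minimal sketch, assuming the targets are given as matched lists of 3D coordinates in millimetres (the function name `mean_tre` and the toy points are illustrative, not taken from the paper):

```python
import math


def mean_tre(predicted, ground_truth):
    """Mean Euclidean distance between matched target points (same units as input)."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Two toy targets: one is off by 5 mm, the other is perfectly aligned.
predicted_mm = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
reference_mm = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error_mm = mean_tre(predicted_mm, reference_mm)
```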
[38] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Yongkang Li,Kaixin Xiong,Xiangyu Guo,Fang Li,Sixu Yan,Gangwei Xu,Lijun Zhou,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang
Main category: cs.CV
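TL;DR: The paper proposes ReCogDrive, a framework that combines vision-language models (VLMs) with a diffusion planner and uses three-stage training (domain adaptation, imitation learning, reinforcement learning) to improve end-to-end autonomous driving in long-tail scenarios, setting a new SOTA on the NAVSIM benchmark.
Details
Motivation: Addresses the performance degradation of end-to-end autonomous driving in rare and long-tail scenarios, while overcoming three limitations of existing methods: the domain gap between VLMs and real driving data, the dimensionality mismatch between the discrete language space and the continuous action space, and the tendency of imitation learning to capture average behavior.
Contribution: 1. Proposes a three-stage training paradigm (domain adaptation training, diffusion-planner imitation learning, reinforcement learning fine-tuning); 2. Combines VLMs with a diffusion planner to map the language space to a continuous action space; 3. Significantly outperforms existing methods on the NAVSIM benchmark.
Method: 1. Trains the VLMs on driving question-answering datasets to reduce the domain gap; 2. Uses a diffusion planner for imitation learning; 3. Fine-tunes with reinforcement learning in the NAVSIM non-reactive simulator.
Result: Achieves 89.6 PDMS on the NAVSIM benchmark, surpassing existing vision-based methods by 5.6 PDMS and setting a new SOTA.
Insight: Combining the world knowledge of VLMs with the generative capability of a diffusion planner effectively addresses long-tail scenarios in autonomous driving, and reinforcement learning fine-tuning further improves safety and the human-likeness of the driving behavior.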
Abstract: Although end-to-end autonomous driving has made remarkable progress, its
performance degrades significantly in rare and long-tail scenarios. Recent
approaches attempt to address this challenge by leveraging the rich world
knowledge of Vision-Language Models (VLMs), but these methods suffer from
several limitations: (1) a significant domain gap between the pre-training data
of VLMs and real-world driving data, (2) a dimensionality mismatch between the
discrete language space and the continuous action space, and (3) imitation
learning tends to capture the average behavior present in the dataset, which
may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an
autonomous driving system that integrates VLMs with a diffusion planner and
adopts a three-stage paradigm for training. In the first stage, we use
large-scale driving question-answering datasets to train the VLMs, mitigating
the domain discrepancy between generic content and real-world driving
scenarios. In the second stage, we employ a diffusion-based planner to perform
imitation learning, mapping representations from the latent language space to
continuous driving actions. Finally, we fine-tune the diffusion planner using
reinforcement learning with the NAVSIM non-reactive simulator, enabling the model
to generate safer, more human-like driving trajectories. We evaluate our
approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6
and setting a new state-of-the-art that surpasses the previous vision-only SOTA
by 5.6 PDMS.
[39] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Aniket Rege,Zinnia Nie,Mahesh Ramesh,Unmesh Raskar,Zhuoran Yu,Aditya Kusupati,Yong Jae Lee,Ramya Korlakai Vinayak
Main category: cs.CV
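TL;DR: CuRe is a scalable benchmarking and scoring suite for analyzing cultural representativeness biases in text-to-image (T2I) systems, focusing on the underrepresentation of Global South cultures.
Details
Motivation: Current T2I systems are trained on predominantly Amero- and Euro-centric data, which leaves Global South cultures underrepresented. CuRe aims to quantify this cultural bias with a novel evaluation method.
Contribution: 1. Introduces the CuRe benchmark dataset, containing 300 cultural artifacts across six broad cultural axes; 2. Develops marginal-utility-based scorers for evaluating the cultural representativeness of T2I systems; 3. Releases the code and dataset.
Method: CuRe quantifies cultural representativeness by analyzing how a T2I system responds as the text conditioning becomes more informative. The dataset is built from the Wikimedia knowledge graph and covers 32 cultural subcategories.
Result: Experiments show that CuRe scores correlate strongly with human judgments of perceptual similarity, image-text alignment, and cultural diversity. The evaluation covers multiple T2I systems (such as Stable Diffusion and DALL-E 3) and vision-language models.
Insight: Cultural bias in T2I systems, especially the neglect of Global South cultures, needs urgent attention. CuRe provides a scalable evaluation framework that can support the development of more inclusive models.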
Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is
heavily Amero- and Euro-centric, underrepresenting the cultures of the Global
South. To analyze these biases, we introduce CuRe, a novel and scalable
benchmarking and scoring suite for cultural representativeness that leverages
the marginal utility of attribute specification to T2I systems as a proxy for
human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy
built from the crowdsourced Wikimedia knowledge graph, with 300 cultural
artifacts across 32 cultural subcategories grouped into six broad cultural axes
(food, art, fashion, architecture, celebrations, and people). Our dataset’s
categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing
their response to increasing the informativeness of text conditioning, enabling
fine-grained cultural comparisons. We empirically observe much stronger
correlations of our class of scorers to human judgments of perceptual
similarity, image-text alignment, and cultural diversity across image encoders
(SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2,
Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three
variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0,
and DALL-E 3. The code and dataset are open-sourced and available at
https://aniketrege.github.io/cure/.
[40] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation
Oishee Bintey Hoque,Abhijin Adiga,Aniruddha Adiga,Siddharth Chaudhary,Madhav V. Marathe,S. S. Ravi,Kirti Rajagopalan,Amanda Wilson,Samarth Swarup
Main category: cs.CV
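TL;DR: IGraSS is an iterative framework that combines semantic segmentation with graph constraints to accurately identify infrastructure networks (such as canals and roads) from satellite imagery. By exploiting graph properties such as connectivity and reachability, IGraSS refines imperfect annotations and significantly improves segmentation performance.
Details
Motivation: Existing semantic-segmentation methods for identifying infrastructure networks rely on large, high-quality annotated datasets, but real annotations are often incomplete or noisy. Infrastructure networks (such as canals and roads) have graph-level properties (such as connectivity and reachability) that have been underused. IGraSS aims to improve segmentation by iteratively refining the annotations.
Contribution: 1. Proposes IGraSS, an iterative framework that combines a semantic segmentation module with a graph-constraint module to refine annotations of infrastructure networks; 2. Improves segmentation with multimodal inputs including RGB, NDWI, and DEM; 3. Demonstrates the effectiveness and generalizability of IGraSS on canal and road networks.
Method: 1. The semantic segmentation module processes satellite image patches, with RGB, NDWI, and DEM data as input; 2. The graph-constraint module treats the infrastructure network as a graph and uses connectivity and reachability to refine the annotations; 3. The two modules alternate iteratively, progressively improving both segmentation and annotation quality.
Result: IGraSS reduces the share of unreachable canal segments from about 18% to 3%, and training with the refined annotations significantly improves canal identification. IGraSS also proves general and effective on road networks.
Insight: 1. Graph-level properties (such as connectivity and reachability) can serve as strong priors that improve semantic segmentation models; 2. Iteratively refining annotations and the segmentation model together is an effective strategy, especially when annotations are incomplete or noisy.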
Abstract: Accurate canal network mapping is essential for water management, including
irrigation planning and infrastructure maintenance. State-of-the-art semantic
segmentation models for infrastructure mapping, such as roads, rely on large,
well-annotated remote sensing datasets. However, incomplete or inadequate
ground truth can hinder these learning approaches. Many infrastructure networks
have graph-level properties such as reachability to a source (like canals) or
connectivity (roads) that can be leveraged to improve the existing ground
truth. This paper develops a novel iterative framework, IGraSS, combining a
semantic segmentation module, which incorporates RGB and additional modalities
(NDWI, DEM), with a graph-based ground-truth refinement module. The segmentation module
processes satellite imagery patches, while the refinement module operates on
the entire data, viewing the infrastructure network as a graph. Experiments show
that IGraSS reduces unreachable canal segments from around 18% to 3%, and
training with refined ground truth significantly improves canal identification.
IGraSS serves as a robust framework for both refining noisy ground truth and
mapping canal networks from remote sensing imagery. We also demonstrate the
effectiveness and generalizability of IGraSS using road networks as an example,
applying a different graph-theoretic constraint to complete road networks.
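The reachability constraint mentioned above ("reachability to a source") can be checked with a plain breadth-first search. The sketch below computes the fraction of canal segments that cannot be reached from any water source. The graph encoding (an adjacency dictionary in which every node appears as a key) and the function name `unreachable_fraction` are assumptions of this illustration, not the paper's implementation.

```python
from collections import deque


def unreachable_fraction(adjacency, sources):
    """Fraction of nodes not reachable from any source node.

    adjacency: dict mapping each node to a list of neighbouring nodes;
               every node is assumed to appear as a key.
    sources:   iterable of source nodes (e.g. canal inlets).
    """
    if not adjacency:
        raise ValueError("graph must contain at least one node")
    seen = {s for s in sources if s in adjacency}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return (len(adjacency) - len(seen)) / len(adjacency)


# Toy canal network: segments c and d are disconnected from the source.
canals = {
    "src": ["a"],
    "a": ["src", "b"],
    "b": ["a"],
    "c": ["d"],
    "d": ["c"],
}
fraction = unreachable_fraction(canals, ["src"])
```

A refinement step in the spirit of the abstract would then try to add missing segments that reconnect the unreachable components, and re-evaluate this fraction after each iteration.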
[41] Spectral Domain Neural Reconstruction for Passband FMCW Radars
Harshvardhan Takawale,Nirupam Roy
Main category: cs.CV
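TL;DR: SpINRv2 is a neural framework for high-fidelity volumetric reconstruction, aimed at high-frequency FMCW radar. It resolves phase aliasing and sub-bin ambiguity and improves 3D imaging performance.
Details
Motivation: Conventional methods perform poorly with high-frequency FMCW radar because of phase aliasing and sub-bin ambiguity. SpINRv2 aims to solve these problems with a neural framework and frequency-domain modeling.
Contribution: Proposes a fully differentiable frequency-domain forward model paired with an implicit neural representation (INR), and introduces sparsity and smoothness regularization to resolve the sub-bin ambiguity that arises in high-frequency radar.
Method: Uses frequency-domain modeling and an INR with sparsity and smoothness regularization, and supervises the complex frequency spectrum directly, avoiding the computational overhead of time-domain baselines.
Result: SpINRv2 significantly outperforms both classical and learning-based baselines in high-frequency regimes, setting a new benchmark for neural radar-based 3D imaging.
Insight: Combining frequency-domain modeling with implicit neural representations can resolve the difficulties of high-frequency radar while reducing the computational burden.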
Abstract: We present SpINRv2, a neural framework for high-fidelity volumetric
reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar.
Extending our prior work (SpINR), this version introduces enhancements that
allow accurate learning under high start frequencies, where phase aliasing and
sub-bin ambiguity become prominent. Our core contribution is a fully
differentiable frequency-domain forward model that captures the complex radar
response using closed-form synthesis, paired with an implicit neural
representation (INR) for continuous volumetric scene modeling. Unlike
time-domain baselines, SpINRv2 directly supervises the complex frequency
spectrum, preserving spectral fidelity while drastically reducing computational
overhead. Additionally, we introduce sparsity and smoothness regularization to
disambiguate sub-bin ambiguities that arise at fine range resolutions.
Experimental results show that SpINRv2 significantly outperforms both classical
and learning-based baselines, especially under high-frequency regimes,
establishing a new benchmark for neural radar-based 3D imaging.
[42] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework
Huixin Zhan,Jason H. Moore
Main category: cs.CV
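TL;DR: The paper proposes a method based on discrete diffusion models and a vision-language-action framework for modeling surgeons' individual operating styles. Using multimodal inputs (such as endoscopic video and surgical intent language), it generates personalized gesture sequences while quantifying privacy risk.
Details
Motivation: Current AI systems for surgery often ignore surgeons' individual operating styles, even though these stylistic differences are critical to surgical outcomes. The paper aims to model each surgeon's distinctive behavioral patterns from multimodal data and to study the associated privacy risk.
Contribution: 1. Proposes a new method for surgeon style modeling that combines a discrete diffusion model with a vision-language-action framework; 2. Encodes personalized style through natural language prompts, avoiding direct exposure of identity information; 3. Quantifies the privacy risk, revealing a trade-off between performance gains and identity leakage.
Method: 1. Uses a discrete diffusion model for structured denoising of surgeons' gesture sequences; 2. Combines multimodal data such as endoscopic video and surgical intent language; 3. Encodes personalization information as natural language prompts using third-party language models.
Result: Experiments on the JIGSAWS dataset show that the method accurately reconstructs gesture sequences and learns behavioral fingerprints unique to each surgeon. More personalized embeddings improve task performance but also increase the risk of identity leakage.
Insight: Personalized embeddings improve performance in surgical AI but also introduce privacy risk, so the two must be balanced during design and deployment.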
Abstract: Surgeons exhibit distinct operating styles due to differences in training,
experience, and motor behavior - yet current AI systems often ignore this
personalization signal. We propose a novel approach to model fine-grained,
surgeon-specific fingerprinting in robotic surgery using a discrete diffusion
framework integrated with a vision-language-action (VLA) pipeline. Our method
formulates gesture prediction as a structured sequence denoising task,
conditioned on multimodal inputs including endoscopic video, surgical intent
language, and a privacy-aware embedding of surgeon identity and skill.
Personalized surgeon fingerprinting is encoded through natural language prompts
using third-party language models, allowing the model to retain individual
behavioral style without exposing explicit identity. We evaluate our method on
the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture
sequences while learning meaningful motion fingerprints unique to each surgeon.
To quantify the privacy implications of personalization, we perform membership
inference attacks and find that more expressive embeddings improve task
performance but simultaneously increase susceptibility to identity leakage.
These findings demonstrate that while personalized embeddings improve
performance, they also increase vulnerability to identity leakage, revealing
the importance of balancing personalization with privacy risk in surgical
modeling. Code is available at:
https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
[43] Open World Scene Graph Generation using Vision Language Models
Amartya Dutta,Kazi Sajeed Mehrab,Medha Sawhney,Abhilash Neog,Mridul Khurana,Sepideh Fatemi,Aanish Pradhan,M. Maruf,Ismini Lourentzou,Arka Daw,Anuj Karpatne
Main category: cs.CV
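TL;DR: The paper proposes Open-World SGG, a training-free framework that taps the existing knowledge of pretrained vision-language models (VLMs) to generate scene graphs without fine-tuning, supporting the detection of novel objects and relations in open-world settings.
Details
Motivation: Existing scene graph generation (SGG) methods typically rely on dataset-specific supervised learning, which limits their ability to handle new objects or relations in open-world settings. The paper aims to achieve zero-shot scene graph generation by leveraging the knowledge of pretrained VLMs, broadening the applicability of SGG.
Contribution: Proposes a training-free, efficient, model-agnostic framework that directly uses the zero-shot reasoning ability of pretrained VLMs to generate scene graphs, with no task-level fine-tuning.
Method: Casts SGG as a zero-shot structured-reasoning problem and combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy to infer unseen objects and relations.
Result: Experiments on the Visual Genome, Open Images V6, and Panoptic Scene Graph datasets show that pretrained VLMs can perform relational understanding without task-level training.
Insight: Pretrained VLMs have strong zero-shot structured-reasoning ability that can be applied directly to open-world scene graph generation, offering a new direction for SGG without supervised learning.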
Abstract: Scene-Graph Generation (SGG) seeks to recognize objects in an image and
distill their salient pairwise relationships. Most methods depend on
dataset-specific supervision to learn the variety of interactions, restricting
their usefulness in open-world settings, involving novel objects and/or
relations. Even methods that leverage large Vision Language Models (VLMs)
typically require benchmark-specific fine-tuning. We introduce Open-World SGG,
a training-free, efficient, model-agnostic framework that taps directly into
the pretrained knowledge of VLMs to produce scene graphs with zero additional
learning. Casting SGG as a zero-shot structured-reasoning problem, our method
combines multimodal prompting, embedding alignment, and a lightweight
pair-refinement strategy, enabling inference over unseen object vocabularies
and relation sets. To assess this setting, we formalize an Open-World
evaluation protocol that measures performance when no SGG-specific data have
been observed either in terms of objects or relations. Experiments on Visual
Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate
the capacity of pretrained VLMs to perform relational understanding without
task-level training.
[44] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
Mateusz Michalkiewicz,Anekha Sokhal,Tadeusz Michalkiewicz,Piotr Pawlikowski,Mahsa Baktashmotlagh,Varun Jampani,Guha Balakrishnan
Main category: cs.CV
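TL;DR: GIQ is a benchmark designed to evaluate the geometric reasoning of vision and vision-language foundation models. It contains synthetic and real images of polyhedra and exposes the shortcomings of current models on 3D symmetry detection and other geometric tasks.
Details
Motivation: Monocular 3D reconstruction methods and vision-language models perform well on standard benchmarks, but how well they truly understand geometric properties remains unclear and calls for systematic evaluation.
Contribution: Introduces the GIQ benchmark, covering 224 polyhedra, and systematically evaluates models on multiple tasks (such as 3D symmetry detection and mental rotation tests), revealing the limitations of existing methods.
Method: Uses synthetic and real images of polyhedra with multiple geometric tasks (3D reconstruction, symmetry detection, mental rotation, and others) to evaluate geometric reasoning.
Result: Current models perform poorly at reconstructing basic geometric shapes and at geometric differentiation tasks, and vision-language assistants have very low accuracy when judging the properties of complex polyhedra.
Insight: Existing models have a limited understanding of geometric properties. GIQ provides a benchmark and a direction for improving geometry-aware representation learning.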
Abstract: Monocular 3D reconstruction methods and vision-language models (VLMs)
demonstrate impressive results on standard benchmarks, yet their true
understanding of geometric properties remains unclear. We introduce GIQ, a
comprehensive benchmark specifically designed to evaluate the geometric
reasoning capabilities of vision and vision-language foundation models. GIQ
comprises synthetic and real-world images of 224 diverse polyhedra - including
Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and
compound shapes - covering varying levels of complexity and symmetry. Through
systematic experiments involving monocular 3D reconstruction, 3D symmetry
detection, mental rotation tests, and zero-shot shape classification tasks, we
reveal significant shortcomings in current models. State-of-the-art
reconstruction algorithms trained on extensive 3D datasets struggle to
reconstruct even basic geometric forms accurately. While foundation models
effectively detect specific 3D symmetry elements via linear probing, they
falter significantly in tasks requiring detailed geometric differentiation,
such as mental rotation. Moreover, advanced vision-language assistants exhibit
remarkably low accuracy on complex polyhedra, systematically misinterpreting
basic properties like face geometry, convexity, and compound structures. GIQ is
publicly available, providing a structured platform to highlight and address
critical gaps in geometric intelligence, facilitating future progress in
robust, geometry-aware representation learning.
[45] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Andrew Z. Wang,Songwei Ge,Tero Karras,Ming-Yu Liu,Yogesh Balaji
Main category: cs.CV
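TL;DR: The paper studies the use of modern decoder-only large language models (LLMs) as text encoders in text-to-image generation models. Using a standardized training and evaluation pipeline, the authors analyze how embeddings from 12 different text encoders affect generation. They find that the conventional last-layer embeddings perform poorly, while layer-normalized averaging of embeddings across layers significantly improves alignment with complex prompts.
Details
Motivation: Existing text-to-image models still use the somewhat outdated T5 and CLIP as text encoders, while modern decoder-only LLMs excel in natural language processing. The authors study LLMs as text encoders in order to improve text-to-image models.
Contribution: 1. Builds a standardized training and evaluation pipeline to analyze the effect of different text embeddings on generation. 2. Proposes layer-normalized averaging of embeddings across layers, which significantly improves alignment with complex prompts. 3. Shows experimentally that most LLMs with this conditioning outperform the conventional T5 baseline.
Method: 1. Trains 27 text-to-image models with 12 different text encoders. 2. Compares different ways of extracting embeddings (such as last-layer and cross-layer embeddings). 3. Evaluates the generation models, particularly their alignment with complex prompts.
Result: Experiments show that conventional last-layer embeddings perform poorly, while layer-normalized averaging across layers significantly improves understanding of complex prompts and generation quality. Most LLMs with this conditioning outperform the T5 baseline.
Insight: 1. In text-to-image generation, the embedding extraction method is critical to model performance. 2. With a suitable extraction method, modern decoder-only LLMs can significantly improve generation models. 3. Cross-layer embeddings may better capture the complex semantics of language.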
Abstract: Both text-to-image generation and large language models (LLMs) have made
significant advancements. However, many text-to-image models still employ the
somewhat outdated T5 and CLIP as their text encoders. In this work, we
investigate the effectiveness of using modern decoder-only LLMs as text
encoders for text-to-image diffusion models. We build a standardized training
and evaluation pipeline that allows us to isolate and evaluate the effect of
different text embeddings. We train a total of 27 text-to-image models with 12
different text encoders to analyze the critical aspects of LLMs that could
impact text-to-image generation, including the approaches to extract
embeddings, different LLM variants, and model sizes. Our experiments reveal
that the de facto way of using last-layer embeddings as conditioning leads to
inferior performance. Instead, we explore embeddings from various layers and
find that using layer-normalized averaging across all layers significantly
improves alignment with complex prompts. Most LLMs with this conditioning
outperform the baseline T5 model, showing enhanced performance in advanced
visio-linguistic reasoning skills.
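The "layer-normalized averaging across all layers" idea can be illustrated with a small pure-Python sketch: each layer's per-token hidden state is normalized to zero mean and unit variance, and the normalized vectors are then averaged over layers. The abstract does not give the exact recipe, so the affine-free normalization, the `eps` value, and the function names here are assumptions made for illustration.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def layer_normalized_average(hidden_states):
    """Average layer-normalized hidden states over layers.

    hidden_states[layer][token] is a list of floats (the feature vector).
    Returns one averaged vector per token.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    averaged = [[0.0] * dim for _ in range(num_tokens)]
    for layer in hidden_states:
        for t, vec in enumerate(layer):
            normed = layer_norm(vec)
            for d in range(dim):
                averaged[t][d] += normed[d] / num_layers
    return averaged


# Two layers, one token, two features; the second layer has a 10x larger scale.
states = [[[1.0, 3.0]], [[10.0, 30.0]]]
conditioning = layer_normalized_average(states)
```

Normalizing before averaging keeps layers with large activation magnitudes from dominating the result, which is one motivation for not averaging raw hidden states.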
[46] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation
Ioannis Iakovidis,Zahra Kalantari,Amir Hossein Payberah,Fernando Jaramillo,Francisco Pena Escobar
Main category: cs.CV
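TL;DR: The paper proposes a method that combines self-supervised learning and deep clustering to detect water hidden under vegetation in radar satellite images, without manually annotated data.
Details
Motivation: Conventional models depend on large amounts of manually annotated data, which is expensive and slow to produce. The authors aim to reduce this dependence through self-supervised learning.
Contribution: 1. Proposes an annotation-free self-supervised training method that combines deep clustering and negative sampling; 2. Proposes an ensemble model to reduce variance and improve performance.
Method: Trains the model in a self-supervised manner with deep clustering and negative sampling, and improves the results with an ensemble of models.
Result: On the test set, the ensemble improves the IoU metric by 0.02 over a single fully supervised model.
Insight: Self-supervised learning shows promise for remote sensing image analysis: it reduces annotation cost while maintaining performance.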
Abstract: In recent years the wide availability of high-resolution radar satellite
images along with the advancement of computer vision models has enabled the
remote monitoring of the surface area of wetlands. However, these models
require large amounts of manually annotated satellite images, which are slow
and expensive to produce. To overcome this problem, self-supervised training
methods have been deployed to train models without using annotated data. In
this paper we use a combination of deep clustering and negative sampling to
train a model to segment radar satellite images into areas that separate water
from land without the use of any manual annotations. Furthermore, we implement
an ensemble version of the model to reduce variance and improve performance.
Compared to a single fully-supervised model using the same architecture, our
ensemble of self-supervised models achieves a 0.02 improvement in the
Intersection Over Union metric over our test dataset.
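The reported gain is measured in Intersection over Union (IoU), which for binary water/land masks is the number of pixels labelled water in both masks divided by the number labelled water in either. The sketch below computes IoU on flattened masks and adds a pixel-wise majority vote as one possible way to combine ensemble members. The abstract does not say how the ensemble is combined, so `ensemble_vote` is an assumption of this illustration.

```python
def iou(pred, target):
    """Intersection over Union of two flattened binary masks (1 = water, 0 = land)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


def ensemble_vote(masks):
    """Pixel-wise strict majority vote over several flattened binary masks."""
    n = len(masks)
    return [1 if 2 * sum(pixels) > n else 0 for pixels in zip(*masks)]


# Three toy model outputs for a three-pixel image.
member_masks = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
combined = ensemble_vote(member_masks)
score = iou(combined, [1, 0, 0])
```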
[47] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence
Octave Mariotti,Zhipeng Du,Yash Bhalgat,Oisin Mac Aodha,Hakan Bilen
Main category: cs.CV
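TL;DR: The paper exposes the limited ability of supervised semantic correspondence methods to generalize beyond sparsely annotated keypoints, and proposes lifting 2D keypoints into a canonical 3D space using monocular depth estimation.
Details
Motivation: Existing supervised semantic correspondence methods perform well on sparsely annotated keypoints but generalize poorly to unseen keypoints. The paper aims to address this problem with a method that learns dense correspondences.
Contribution: 1. Proposes a method that maps 2D keypoints into a canonical 3D space via monocular depth estimation; 2. Introduces a new dataset, SPair-U, to better evaluate generalization; 3. Shows experimentally that the method significantly outperforms supervised baselines on unseen keypoints.
Method: The method lifts 2D keypoints into a canonical 3D space using monocular depth estimation, constructing a continuous canonical manifold that captures object geometry without explicit 3D supervision or camera annotations.
Result: The method significantly outperforms supervised baselines on unseen keypoints. In addition, unsupervised baselines outperform supervised ones when generalizing across datasets.
Insight: The paper reveals the generalization limits of supervised semantic correspondence methods and shows the importance of a 3D spatial representation for learning dense correspondences.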
Abstract: Semantic correspondence (SC) aims to establish semantically meaningful
matches across different instances of an object category. We illustrate how
recent supervised SC methods remain limited in their ability to generalize
beyond sparsely annotated training keypoints, effectively acting as keypoint
detectors. To address this, we propose a novel approach for learning dense
correspondences by lifting 2D keypoints into a canonical 3D space using
monocular depth estimation. Our method constructs a continuous canonical
manifold that captures object geometry without requiring explicit 3D
supervision or camera annotations. Additionally, we introduce SPair-U, an
extension of SPair-71k with novel keypoint annotations, to better assess
generalization. Experiments not only demonstrate that our model significantly
outperforms supervised baselines on unseen keypoints, highlighting its
effectiveness in learning robust correspondences, but also that unsupervised
baselines outperform supervised counterparts when generalized across different
datasets.
[48] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Vishaal Udandarao,Mehdi Cherti,Shyamgopal Karthik,Jenia Jitsev,Samuel Albanie,Matthias Bethge
Main category: cs.CV
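TL;DR: The paper analyzes 17 benchmarks commonly used to evaluate the compositional understanding of vision-language models (VLMs), such as SugarCREPE and VALSE, and uncovers several inherent biases in their design and construction. Simple heuristics (such as token length or log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. The underlying cause is a distribution asymmetry between positive and negative images/captions. The authors give recommendations for building more robust benchmarks.
Details
Motivation: Existing vision-language compositional understanding benchmarks have design biases that can make evaluation inaccurate. The study aims to expose these biases and improve benchmark construction so that benchmarks better measure a model's true capability.
Contribution: 1. Systematically analyzes the design choices and biases of 17 commonly used benchmarks; 2. Finds that simple heuristics perform on par with complex models, which shows the benchmarks are flawed; 3. Offers recommendations for reducing these biases.
Method: The study analyzes benchmark construction procedures (such as data source selection and negative sample generation), quantifies the distribution asymmetry, and compares the performance of heuristics against CLIP models.
Result: The distribution asymmetry between positive and negative samples is the benchmarks' main flaw and makes evaluation results unreliable.
Insight: When building compositional understanding benchmarks, positive and negative samples should be kept symmetric so that simple heuristics cannot easily break the benchmark. Future benchmarks should be more robust and able to separate genuine understanding from surface feature matching.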
Abstract: We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for
measuring compositional understanding capabilities of vision-language models
(VLMs). We scrutinize design choices in their construction, including data
source (e.g. MS-COCO) and curation procedures (e.g. constructing negative
images/captions), uncovering several inherent biases across most benchmarks. We
find that blind heuristics (e.g. token-length, log-likelihood under a language
model) perform on par with CLIP models, indicating that these benchmarks do not
effectively measure compositional understanding. We demonstrate that the
underlying factor is a distribution asymmetry between positive and negative
images/captions, induced by the benchmark construction procedures. To mitigate
these issues, we provide a few key recommendations for constructing more robust
vision-language compositional understanding benchmarks, that would be less
prone to such simple attacks.
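A "blind" token-length heuristic of the kind mentioned above never looks at the image: it just picks a caption based on its length. The sketch below measures how often such a rule selects the positive caption. The whitespace tokenization, the choice of preferring the shorter caption, the treatment of ties as misses, and the toy caption pairs are all assumptions made for illustration and are not taken from any of the 17 benchmarks.

```python
def blind_accuracy(pairs, prefer_shorter=True):
    """Accuracy of a length-only heuristic on (positive, negative) caption pairs.

    The heuristic never sees the image. Ties in length yield no decision
    and are counted as misses (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one caption pair")
    hits = 0
    for positive, negative in pairs:
        len_pos = len(positive.split())
        len_neg = len(negative.split())
        if len_pos == len_neg:
            continue
        picked_positive = (len_pos < len_neg) if prefer_shorter else (len_pos > len_neg)
        hits += int(picked_positive)
    return hits / len(pairs)


# Toy pairs in which negatives tend to be longer than positives.
toy_pairs = [
    ("a dog on grass", "a dog on the green grass"),
    ("two cats", "two cats sitting together"),
    ("a red car parked outside", "a car"),
]
accuracy = blind_accuracy(toy_pairs)
```

If a rule like this scores well above chance on a benchmark, the positives and negatives differ in a surface statistic, which is the distribution asymmetry the abstract describes.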
[49] Highly Compressed Tokenizer Can Generate Without Training
L. Lao Beyer,T. Li,X. Chen,S. Karaman,K. He
Main category: cs.CV
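TL;DR: The paper finds that a highly compressed 1D image tokenizer enables image editing and generation through heuristic manipulation of tokens, without training a generative model.
Details
Motivation: Most existing image tokenizers produce a 2D grid of spatially arranged tokens, whereas 1D tokenizers compress an image into very few discrete tokens. The authors find that this highly compressed token space is highly expressive, which motivates image editing and generation without training a generative model.
Contribution: Proposes an image generation pipeline built on a 1D tokenizer that uses gradient-based optimization with plug-and-play loss functions (such as reconstruction or CLIP similarity) to edit and generate images, without training a generative model.
Method: Compresses images into very few discrete tokens with a 1D tokenizer, then edits and generates images through heuristic manipulation (such as copying and replacing tokens) or gradient-based methods (such as test-time optimization).
Result: The method supports fine-grained image editing (such as transferring appearance and semantic attributes) and diverse, realistic image generation, and is applied to inpainting and text-guided editing.
Insight: The highly compressed 1D token space is highly expressive, so simple manipulation or optimization is enough for complex image generation and editing tasks, which suggests a new direction for lightweight generative models.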
Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged
tokens. In contrast, so-called 1D image tokenizers represent images as highly
compressed one-dimensional sequences of as few as 32 discrete tokens. We find
that the high degree of compression achieved by a 1D tokenizer with vector
quantization enables image editing and generative capabilities through
heuristic manipulation of tokens, demonstrating that even very crude
manipulations – such as copying and replacing tokens between latent
representations of images – enable fine-grained image editing by transferring
appearance and semantic attributes. Motivated by the expressivity of the 1D
tokenizer’s latent space, we construct an image generation pipeline leveraging
gradient-based test-time optimization of tokens with plug-and-play loss
functions such as reconstruction or CLIP similarity. Our approach is
demonstrated for inpainting and text-guided image editing use cases, and can
generate diverse and realistic samples without requiring training of any
generative model.
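The "copying and replacing tokens between latent representations" operation amounts to overwriting selected positions of one token sequence with the tokens of another. A minimal sketch on plain integer sequences follows; encoding images to tokens and decoding them back requires a trained 1D tokenizer, which is outside this sketch, and the function name `transfer_tokens` and the toy token values are hypothetical.

```python
def transfer_tokens(source_tokens, target_tokens, positions):
    """Copy the tokens at `positions` from source into a copy of target.

    Both sequences are lists of discrete token ids of equal length
    (for example 32 ids per image for a highly compressed 1D tokenizer).
    """
    if len(source_tokens) != len(target_tokens):
        raise ValueError("token sequences must have the same length")
    edited = list(target_tokens)  # leave the original target untouched
    for i in positions:
        edited[i] = source_tokens[i]
    return edited


# Toy 4-token latents; positions 0 and 2 are taken from the source image.
source = [1, 2, 3, 4]
target = [9, 9, 9, 9]
edited = transfer_tokens(source, target, [0, 2])
```

Decoding `edited` with the tokenizer's decoder would then yield the edited image; which positions carry which attribute has to be found empirically.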
[50] Seeing Voices: Generating A-Roll Video from Audio with Mirage
Aditi Sundararaman,Amogh Adishesha,Andrew Jaegle,Dan Bigioi,Hyoung-Kyu Song,Jon Kyl,Justin Mao,Kevin Lan,Mojtaba Komeili,ShahRukh Athar,Sheila Babayan,Stanislau Beliasau,William Buchwalter
Main category: cs.CV
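TL;DR: Mirage is an audio-to-video foundation model that generates realistic, expressive video imagery from an audio input. It combines a self-attention architecture with a general training method and outperforms existing methods, excelling in particular at generating video of people talking (A-roll).
Details
Motivation: Existing video generation methods either ignore audio or target restricted domains (such as re-dubbing), and lack a general audio-to-video capability. Mirage aims to fill this gap with general-purpose, audio-driven video generation.
Contribution: Proposes the Mirage model, which uses self-attention and a unified training method to achieve high-quality audio-to-video generation and is especially strong at generating video of people talking.
Method: Builds the model on a self-attention architecture and trains it with a general method, without extra architectures or loss functions designed for specific tasks (such as speech or people).
Result: The generated videos have higher subjective quality than those of existing methods and believably render the performance implied by the audio.
Insight: A general training method combined with self-attention may be promising for multimodal generation tasks, without heavy reliance on task-specific design.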
Abstract: From professional filmmaking to user-generated content, creators and
consumers have long recognized that the power of video depends on the
harmonious integration of what we hear (the video’s audio track) with what we
see (the video’s image sequence). Current approaches to video generation either
ignore sound to focus on general-purpose but silent image sequence generation
or address both visual and audio elements but focus on restricted application
domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation
model that excels at generating realistic, expressive output imagery from
scratch given an audio input. When integrated with existing methods for speech
synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal
video. When trained on audio-video footage of people talking (A-roll) and
conditioned on audio containing speech, Mirage generates video of people
delivering a believable interpretation of the performance implicit in input
audio. Our central technical contribution is a unified method for training
self-attention-based audio-to-video generation models, either from scratch or
given existing weights. This methodology allows Mirage to retain generality as
an approach to audio-to-video generation while producing outputs of superior
subjective quality to methods that incorporate audio-specific architectures or
loss components specific to people, speech, or details of how images or audio
are captured. We encourage readers to watch and listen to the results of Mirage
for themselves (see paper and comments for links).
[51] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging
Nhat Thanh Tran,Fanghui Xue,Shuai Zhang,Jiancheng Lyu,Yunling Zheng,Yingyong Qi,Jack Xin
Main category: cs.CV
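TL;DR: The paper proposes SEMA, a new attention mechanism that uses token localization and arithmetic averaging to address the computational complexity and focusing problems of conventional attention, and outperforms existing vision Mamba models on ImageNet-1k classification.
Details
Motivation: Conventional attention has high (quadratic) computational complexity, and its linear attention variants cannot focus effectively, which limits their use in computer vision tasks. The paper proposes SEMA to address both problems.
Contribution: 1. Gives a mathematical definition of generalized attention and places vanilla softmax attention and linear attention in a unified framework. 2. Proves that generalized attention disperses. 3. Designs SEMA, which combines token localization and arithmetic averaging into an efficient and scalable attention mechanism.
Method: SEMA uses token localization to avoid dispersion and maintain focus, and arithmetic averaging to capture the global aspect of attention. The method is theoretically consistent.
Result: On ImageNet-1k classification, SEMA outperforms existing vision Mamba models at increasingly larger image scales and similar model parameter sizes.
Insight: The paper reveals the dispersion property of generalized attention, which provides a theoretical basis for designing new attention mechanisms. SEMA suggests a new direction for efficient and scalable attention design.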
Abstract: Attention is the critical component of a transformer. Yet the quadratic
computational complexity of vanilla full attention in the input size and the
inability of its linear attention variant to focus have been challenges for
computer vision tasks. We provide a mathematical definition of generalized
attention and formulate both vanilla softmax attention and linear attention
within the general framework. We prove that generalized attention disperses,
that is, as the number of keys tends to infinity, the query assigns equal
weights to all keys. Motivated by the dispersion property and the recent
development of the Mamba form of attention, we design Scalable and Efficient Mamba
like Attention (SEMA) which utilizes token localization to avoid dispersion and
maintain focusing, complemented by theoretically consistent arithmetic
averaging to capture the global aspect of attention. We support our approach on
Imagenet-1k where classification results show that SEMA is a scalable and
effective alternative beyond linear attention, outperforming recent vision
Mamba models on increasingly larger scales of images at similar model parameter
sizes.
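The dispersion property described in the abstract above can be illustrated numerically for the softmax case. The following is a minimal sketch, not the paper's proof or code: it assumes query-key scores are bounded (drawn uniformly from [-1, 1]), and the helper name `max_softmax_weight` is hypothetical. Under that assumption the largest weight is at most e^2 / n, so it shrinks toward the uniform value as the number of keys n grows.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_weight(num_keys, seed=0):
    """Largest attention weight one query assigns across num_keys keys.

    Assumption (illustrative only): scores are bounded in [-1, 1].
    """
    rng = random.Random(seed)
    scores = [rng.uniform(-1.0, 1.0) for _ in range(num_keys)]
    return max(softmax(scores))

# The maximum weight decays roughly like 1/n as keys are added.
for n in (10, 100, 1000, 10000):
    print(n, max_softmax_weight(n))
```

Restricting each query to a local window of keys, as token localization does, keeps n small for each softmax and so avoids this decay.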
[52] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal
Kangning Yang,Ling Ouyang,Huiming Sun,Jie Cai,Lan Fu,Jiaming Ding,Chiu Man Ho,Zibo Meng
Main category: cs.CV
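TL;DR: The paper presents OpenRR-1k, a high-quality, aligned, and diverse reflection removal dataset that addresses the lack of high-quality in-the-wild data for existing techniques.
Details
Motivation: Existing reflection removal techniques lack high-quality real-world datasets, which limits their effectiveness in real environments.
Contribution: Proposes a new data collection paradigm and builds the OpenRR-1k dataset, which contains 1,000 high-quality, aligned, and diverse transmission-reflection image pairs.
Method: A convenient, low-cost, and scalable data collection approach that keeps the collected data high quality and diverse.
Result: Experiments show that the OpenRR-1k dataset improves the robustness of reflection removal methods in challenging real-world environments.
Insight: High-quality, diverse datasets are key to making reflection removal practical.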
Abstract: Reflection removal technology plays a crucial role in photography and
computer vision applications. However, existing techniques are hindered by the
lack of high-quality in-the-wild datasets. In this paper, we propose a novel
paradigm for collecting reflection datasets from a fresh perspective. Our
approach is convenient, cost-effective, and scalable, while ensuring that the
collected data pairs are of high quality, perfectly aligned, and represent
natural and diverse scenarios. Following this paradigm, we collect a
Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which
contains 1,000 high-quality transmission-reflection image pairs collected in
the wild. Through the analysis of several reflection removal methods and
benchmark evaluation experiments on our dataset, we demonstrate its
effectiveness in improving robustness in challenging real-world environments.
Our dataset is available at https://github.com/caijie0620/OpenRR-1k.
[53] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating
Guandong Li,Mengxia Ye
Main category: cs.CV
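TL;DR: The paper proposes a network architecture called STNet that improves the accuracy and generalization of hyperspectral image classification by decoupling spectral and spatial attention and adding adaptive gating mechanisms.
Details
Motivation: Hyperspectral image classification faces high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which lead to overfitting and limited generalization. The paper proposes STNet to address these problems.
Contribution: Proposes the STNet architecture, whose core innovations are: 1) decoupling spectral and spatial attention; 2) an adaptive attention fusion gate and a feature transformation gate.
Method: A Transformer module explicitly decouples spatial and spectral attention and is combined with adaptive gating mechanisms (attention fusion gating and GFFN) for regulation.
Result: Performs well on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification methods.
Insight: With decoupling and gating, the model extracts and fuses spatial and spectral information more effectively while reducing the risk of overfitting.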
Abstract: Deep neural networks face several challenges in hyperspectral image
classification, including high-dimensional data, sparse distribution of ground
objects, and spectral redundancy, which often lead to classification
overfitting and limited generalization capability. To more effectively extract
and fuse spatial context with fine spectral information in hyperspectral image
(HSI) classification, this paper proposes a novel network architecture called
STNet. The core advantage of STNet stems from the dual innovative design of its
Spatial-Spectral Transformer module: first, the fundamental explicit decoupling
of spatial and spectral attention ensures targeted capture of key information
in HSI; second, two functionally distinct gating mechanisms perform intelligent
regulation at both the fusion level of attention flows (adaptive attention
fusion gating) and the internal level of feature transformation (GFFN). This
characteristic demonstrates superior feature extraction and fusion capabilities
compared to traditional convolutional neural networks, while reducing
overfitting risks in small-sample and high-noise scenarios. STNet enhances
model representation capability without increasing network depth or width. The
proposed method demonstrates superior performance on IN, UP, and KSC datasets,
outperforming mainstream hyperspectral image classification approaches.
[54] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera
Yuto Kase,Kai Ishibe,Ryoma Yasuda,Yudai Washida,Sakiko Hashimoto
Main category: cs.CV
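TL;DR: The paper proposes a method that uses an event camera to locate the tennis ball impact point on the racket in real time, addressing the high memory consumption of high-speed cameras and the inefficiency of manual annotation.
Details
Motivation: In racket sports such as tennis, accurately locating the impact point matters for analyzing player performance and designing personalized equipment. Conventional high-speed cameras consume a lot of memory and manual processing is slow, which limits long-duration capture and analysis.
Contribution: 1. Proposes a real-time impact localization method based on an event camera; 2. Combines conventional computer vision techniques with an original event processing algorithm (PATS) to detect the timing of impact; 3. Achieves high-speed motion capture with low memory consumption and microsecond accuracy.
Method: The method has three steps: 1. identify the time range of the swing; 2. detect the timing of impact with PATS; 3. extract the contours of the ball and racket. It combines conventional computer vision with event-based processing.
Result: Experimental errors were within the permissible range for measuring tennis player performance, and the computation time was short enough for real-time applications.
Insight: Event cameras offer low memory consumption and high temporal accuracy in high-speed motion scenes, which opens a new route to real-time motion analysis.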
Abstract: In racket sports, such as tennis, locating the ball’s position at impact is
important in clarifying player and equipment characteristics, thereby aiding in
personalized equipment design. High-speed cameras are used to measure the
impact location; however, their excessive memory consumption limits prolonged
scene capture, and manual digitization for position detection is time-consuming
and prone to human error. These limitations make it difficult to effectively
capture the entire playing scene, hindering the ability to analyze the player’s
performance. We propose a method for locating the tennis ball impact on the
racket in real time using an event camera. Event cameras efficiently measure
brightness changes (called 'events') with microsecond accuracy under high-speed
motion while consuming less memory. These cameras enable users to
continuously monitor their performance over extended periods. Our method
consists of three identification steps: time range of swing, timing at impact,
and contours of ball and racket. Conventional computer vision techniques are
utilized along with an original event-based processing method to detect the timing at
impact (PATS: the amount of polarity asymmetry in time symmetry). The results
of the experiments were within the permissible range for measuring tennis
players’ performance. Moreover, the computation time was sufficiently short for
real-time applications.
[55] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models
Huixuan Zhang,Junzhe Zhang,Xiaojun Wan
Main category: cs.CV
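TL;DR: The paper revisits adaptive guidance for classifier-free guidance in text-to-vision diffusion models and proposes Step AG, a general adaptive guidance strategy that improves efficiency while preserving generation quality.
Details
Motivation: Classifier-free guidance is the prevailing conditioning method for text-to-vision diffusion models, but it requires twice as many model forward steps, which is costly. Earlier adaptive guidance methods lack thorough analysis and generality.
Contribution: Proposes Step AG, a simple and general adaptive guidance strategy that restricts classifier-free guidance to the first several denoising steps, yielding an average speedup of 20% to 30% while preserving generation quality and text alignment.
Method: Step AG applies classifier-free guidance in the early stage of denoising and turns it off in the later stage. It applies across different numbers of inference steps and across models, including image and video generation models.
Result: Experiments show that Step AG performs well on image quality and text alignment with an average speedup of 20% to 30%. The improvement is consistent across inference step settings and models.
Insight: The key effect of classifier-free guidance is concentrated in the early denoising stage. Guidance can be turned off later to save computation without significantly affecting generation quality.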
Abstract: With the rapid development of text-to-vision generation diffusion models,
classifier-free guidance has emerged as the most prevalent method for
conditioning. However, this approach inherently requires twice as many steps
for model forwarding compared to unconditional generation, resulting in
significantly higher costs. While a previous study has introduced the concept of
adaptive guidance, it lacks solid analysis and empirical results, so the
previous method cannot be applied to general diffusion models. In this work,
we present another perspective on applying adaptive guidance and propose Step
AG, which is a simple, universally applicable adaptive guidance strategy. Our
evaluations focus on both image quality and image-text alignment, and the results
indicate that restricting classifier-free guidance to the first several
denoising steps is sufficient for generating high-quality, well-conditioned
images, achieving an average speedup of 20% to 30%. Such improvement is
consistent across different settings such as inference steps, and various
models including video generation models, highlighting the superiority of our
method.
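The cost argument in the abstract above can be sketched with a toy sampling loop. This is an illustration under stated assumptions, not the authors' implementation: `toy_model` stands in for a diffusion network on a scalar, the update rule is a placeholder rather than a real sampler, the cutoff of 20 guided steps is arbitrary, and using the conditional prediction alone after the cutoff is an assumption about how guidance is switched off.

```python
def guided_prediction(cond, uncond, scale):
    """Standard classifier-free guidance combination."""
    return uncond + scale * (cond - uncond)

def denoise(model, x, num_steps, guided_steps, scale=7.5):
    """Apply guidance only for the first `guided_steps` steps.

    Returns the final sample and the number of model forward calls.
    """
    calls = 0
    for step in range(num_steps):
        cond = model(x, step, conditional=True)
        calls += 1
        if step < guided_steps:
            # Guided step: a second (unconditional) forward pass is needed.
            uncond = model(x, step, conditional=False)
            calls += 1
            eps = guided_prediction(cond, uncond, scale)
        else:
            # Unguided step: a single conditional forward pass.
            eps = cond
        x = x - eps / num_steps  # placeholder update, not a real sampler
    return x, calls

def toy_model(x, step, conditional):
    """Hypothetical stand-in for a diffusion network."""
    return 0.1 * x + (0.05 if conditional else 0.0)

_, full_calls = denoise(toy_model, 1.0, num_steps=50, guided_steps=50)
_, step_ag_calls = denoise(toy_model, 1.0, num_steps=50, guided_steps=20)
print(full_calls, step_ag_calls)  # 100 70
```

In this toy setting, guiding 20 of 50 steps removes 30 of the 100 forward calls. The actual speedup depends on the model and on how many steps are guided.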
[56] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding
Shivang Chopra,Lingchao Mao,Gabriela Sanchez-Rodriguez,Andrew J Feola,Jing Li,Zsolt Kira
Main category: cs.CV
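TL;DR: MedMoE is a Mixture-of-Experts (MoE) framework for medical vision-language understanding that dynamically adapts visual representations to the specific needs of different medical imaging modalities.
Details
Motivation: Existing medical vision-language frameworks usually apply a uniform local feature extraction strategy and overlook modality-specific demands, which limits performance.
Contribution: Proposes the MedMoE framework, which routes multi-scale image features through modality-specialized expert modules to improve the alignment between visual representations and text.
Method: Uses a Swin Transformer backbone together with an MoE module and feature pyramids to extract multi-scale features and capture modality-specific visual semantics.
Result: On multiple medical benchmarks, MedMoE improves alignment and retrieval performance across imaging modalities.
Insight: Modality-specialized visual representations are valuable for clinical vision-language systems, and dynamic routing can capture diagnostic information at different resolutions.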
Abstract: Different medical imaging modalities capture diagnostic information at
varying spatial resolutions, from coarse global patterns to fine-grained
localized structures. However, most existing vision-language frameworks in the
medical domain apply a uniform strategy for local feature extraction,
overlooking the modality-specific demands. In this work, we present MedMoE, a
modular and extensible vision-language processing framework that dynamically
adapts visual representation based on the diagnostic context. MedMoE
incorporates a Mixture-of-Experts (MoE) module conditioned on the report type,
which routes multi-scale image features through specialized expert branches
trained to capture modality-specific visual semantics. These experts operate
over feature pyramids derived from a Swin Transformer backbone, enabling
spatially adaptive attention to clinically relevant regions. This framework
produces localized visual representations aligned with textual descriptions,
without requiring modality-specific supervision at inference. Empirical results
on diverse medical benchmarks demonstrate that MedMoE improves alignment and
retrieval performance across imaging modalities, underscoring the value of
modality-specialized visual representations in clinical vision-language
systems.
[57] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
Woohyeon Park,Woojin Kim,Jaeik Kim,Jaeyoung Do
Main category: cs.CV
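TL;DR: The paper proposes SECOND (Selective and Contrastive Decoding), a method that reduces perceptual hallucination in vision-language models and improves the accuracy of image understanding.
Details
Motivation: Existing vision-language models (VLMs) are limited by object hallucination, which prevents accurate image understanding, so a new approach is needed.
Contribution: Proposes SECOND, which selects multi-scale visual information and applies contrastive decoding to reduce perceptual hallucination and improve model performance.
Method: SECOND works in an object-centric manner, progressively selecting and integrating multi-scale visual information and contrasting it iteratively to refine understanding.
Result: Experiments show that SECOND outperforms existing methods on multiple benchmarks, supporting the potential of multi-scale visual information in VLMs.
Insight: Prioritizing and contrasting information across scales is a promising way to improve VLMs, and it remains largely unexplored.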
Abstract: Despite significant advancements in Vision-Language Models (VLMs), the
performance of existing VLMs remains hindered by object hallucination, a
critical challenge to achieving accurate visual understanding. To address this
issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach
that enables VLMs to effectively leverage multi-scale visual information in
an object-centric manner, closely aligning with human visual perception. SECOND
progressively selects and integrates multi-scale visual information,
facilitating a more precise interpretation of images. By contrasting this
visual information iteratively, SECOND significantly reduces perceptual
hallucinations and outperforms existing methods on a wide range of benchmarks. Our theoretical
analysis and experiments highlight the largely unexplored potential of
multi-scale application in VLMs, showing that prioritizing and contrasting
across scales outperforms existing methods.
[58] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation
Taiqin Chen,Zikun Zhou,Zheng Fang,Wenzhen Zou,Kanjun Liu,Ke Chen,Yongbing Zhang,Yaowei Wang
Main category: cs.CV
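TL;DR: RadioDUN is a physics-inspired deep unfolding network that incorporates the physical characteristics of radio propagation models to estimate dense radio maps from sparse samples, outperforming existing methods.
Details
Motivation: Existing methods are hard to integrate with the physical characteristics of radio maps, which hurts dense radio map estimation from sparse samples.
Contribution: Proposes RadioDUN, a deep unfolding network based on a physical propagation model, with a dynamic reweighting module and a shadowing loss that improve performance.
Method: Casts radio map estimation as a sparse signal recovery problem, uses a physical propagation model to decompose it into factor optimization sub-problems, and designs a dynamic reweighting module and a shadowing loss.
Result: Experiments show that RadioDUN outperforms existing methods on radio map estimation.
Insight: Introducing a physical model and a dynamic reweighting mechanism improves the accuracy of sparse signal recovery, and the shadowing loss further improves model performance.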
Abstract: The radio map represents the spatial distribution of spectrum resources
within a region, supporting efficient resource allocation and interference
mitigation. However, it is difficult to construct a dense radio map as a
limited number of samples can be measured in practical scenarios. While
existing works have used deep learning to estimate dense radio maps from sparse
samples, they are hard to integrate with the physical characteristics of the
radio map. To address this challenge, we cast radio map estimation as the
sparse signal recovery problem. A physical propagation model is further
incorporated to decompose the problem into multiple factor optimization
sub-problems, thereby reducing recovery complexity. Inspired by the existing
compressive sensing methods, we propose the Radio Deep Unfolding Network
(RadioDUN) to unfold the optimization process, achieving adaptive parameter
adjusting and prior fitting in a learnable manner. To account for the radio
propagation characteristics, we develop a dynamic reweighting module (DRM) to
adaptively model the importance of each factor for the radio map. Inspired by
the shadowing factor in the physical propagation model, we integrate
obstacle-related factors to express the obstacle-induced signal stochastic
decay. The shadowing loss is further designed to constrain the factor
prediction and act as a supplementary supervised objective, which enhances the
performance of RadioDUN. Extensive experiments have been conducted to
demonstrate that the proposed method outperforms the state-of-the-art methods.
Our code will be made publicly available upon publication.
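The general idea of unfolding a sparse recovery solver can be sketched with ISTA (iterative shrinkage-thresholding), a standard compressive sensing iteration. This is a generic illustration, not RadioDUN itself: the abstract does not say which solver is unfolded, the measurement matrix `A` and observations `y` are made up, and the per-layer `(step, threshold)` pairs are fixed by hand here where an unfolded network would learn them.

```python
def soft_threshold(value, threshold):
    """Shrinkage operator that promotes sparsity."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def unfolded_layer(x, y, A, step, threshold):
    """One unfolded iteration: move along A^T (y - A x), then shrink."""
    rows, cols = len(A), len(A[0])
    residual = [y[i] - sum(A[i][j] * x[j] for j in range(cols)) for i in range(rows)]
    correction = [sum(A[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
    return [soft_threshold(x[j] + step * correction[j], threshold) for j in range(cols)]

def unfolded_recover(y, A, layer_params):
    """Run a fixed number of layers, one (step, threshold) pair per layer."""
    x = [0.0] * len(A[0])
    for step, threshold in layer_params:
        x = unfolded_layer(x, y, A, step, threshold)
    return x

def residual_norm(x, y, A):
    """Euclidean norm of the data-fit residual y - A x."""
    return sum(
        (y[i] - sum(A[i][j] * x[j] for j in range(len(x)))) ** 2
        for i in range(len(y))
    ) ** 0.5

# Toy problem: 2 measurements of a 3-entry signal (illustrative values only).
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [2.0, 2.0]
layer_params = [(0.3, 0.01)] * 5  # hand-set; learnable in an unfolded network
x_hat = unfolded_recover(y, A, layer_params)
print(x_hat, residual_norm(x_hat, y, A))
```

Because the number of layers is fixed, the whole recovery becomes a feed-forward computation whose step sizes and thresholds can be trained end to end.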
[59] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring
Mingjie Xu,Andrew Estornell,Hongzheng Yang,Yuzhi Zhao,Zhaowei Zhu,Qi Xuan,Jiaheng Wei
Main category: cs.CV
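TL;DR: The paper proposes SCALE, which uses a cross-modality assessment framework to improve data selection for VLM instruction tuning datasets, addressing noisy image-text alignment and ambiguous text.
Details
Motivation: The performance of vision-language models (VLMs) depends on large-scale, high-quality datasets, but noisy image-text alignment and ambiguous text limit model performance, so better data selection is needed.
Contribution: Proposes the SCALE pipeline, which integrates cross-modality assessment: it assigns each entry to a task, generates diverse captions, and evaluates data quality in a unified way, with the aim of improving robustness and task fit.
Method: SCALE combines task classification, general and task-specific caption generation, and multi-dimensional evaluation (alignment, clarity, task rarity, text coherence, and image clarity) to select high-quality data.
Result: Reveals the shortcomings of current unimodal quality assessment and shows that generated captions are an efficient way to transfer image-text multimodal tasks into a unified text modality.
Insight: Multimodal tasks can be handled through a unified text modality, and data quality assessment should account for both task fit and robustness.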
Abstract: The application of visual instruction tuning and other post-training
techniques has significantly enhanced the capabilities of Large Language Models
(LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with
more comprehensive visual language datasets. However, the effectiveness of VLMs
is highly dependent on large-scale, high-quality datasets that ensure precise
recognition and accurate reasoning. Two key challenges hinder progress: (1)
noisy alignments between images and the corresponding text, which leads to
misinterpretation, and (2) ambiguous or misleading text, which obscures visual
content. To address these challenges, we propose SCALE (Single modality data
quality and Cross modality Alignment Evaluation), a novel quality-driven data
selection pipeline for VLM instruction tuning datasets. Specifically, SCALE
integrates a cross-modality assessment framework that first assigns each data
entry to its appropriate vision-language task, generates general and
task-specific captions (covering scenes, objects, style, etc.), and evaluates
the alignment, clarity, task rarity, text coherence, and image clarity of each
entry based on the generated captions. We reveal that: (1) current unimodal
quality assessment methods evaluate one modality while overlooking the rest,
which can underestimate samples essential for specific tasks and discard the
lower-quality instances that help build model robustness; and (2) appropriately
generated image captions provide an efficient way to transfer the image-text
multimodal task into a unified text modality.
[60] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
June Suk Choi,Kyungmin Lee,Sihyun Yu,Yisol Choi,Jinwoo Shin,Kimin Lee
Main category: cs.CV
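TL;DR: The paper proposes adaptive low-pass guidance (ALG), which addresses the lack of motion in videos generated by image-to-video (I2V) models by applying adaptive low-pass filtering to the input image in the early stage of denoising.
Details
Motivation: I2V models obtained by fine-tuning tend to generate more static videos than their text-to-video (T2V) counterparts, because high-frequency details of the input image are exposed too early.
Contribution: Proposes ALG, which modulates the frequency content of the input image to make generated videos more dynamic while preserving image quality and text alignment.
Method: Applies adaptive low-pass filtering to the input image in the early stage of denoising so that high-frequency details do not bias the sampling process prematurely.
Result: Experiments show that ALG significantly improves the dynamics of generated videos (an average improvement of 36% in dynamic degree on the VBench-I2V test suite) without a significant drop in video quality or image fidelity.
Insight: Controlling when high-frequency information from the input image is exposed during generation balances motion dynamics against fidelity to the static image.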
Abstract: Recent text-to-video (T2V) models have demonstrated strong capabilities in
producing high-quality, dynamic videos. To improve the visual controllability,
recent works have considered fine-tuning pre-trained T2V models to support
image-to-video (I2V) generation. However, such adaptation frequently suppresses
motion dynamics of generated outputs, resulting in more static videos compared
to their T2V counterparts. In this work, we analyze this phenomenon and
identify that it stems from the premature exposure to high-frequency details in
the input image, which biases the sampling process toward a shortcut trajectory
that overfits to the static appearance of the reference image. To address this,
we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model
sampling procedure to generate more dynamic videos without compromising
per-frame image quality. Specifically, ALG adaptively modulates the frequency
content of the conditioning image by applying low-pass filtering at the early
stage of denoising. Extensive experiments demonstrate that ALG significantly
improves the temporal dynamics of generated videos, while preserving image
fidelity and text alignment. In particular, under the VBench-I2V test suite, ALG
achieves an average improvement of 36% in dynamic degree without a significant
drop in video quality or image fidelity.
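The filtering idea in the abstract above can be sketched on a one-dimensional "image". This is a minimal illustration, not the paper's procedure: the box filter, the linear decay of the filter radius, and the parameters `early_steps` and `max_radius` are all assumptions, since the abstract does not specify the filter or its schedule.

```python
def low_pass(signal, radius):
    """Moving-average (box) low-pass filter with windows clipped at the edges."""
    if radius <= 0:
        return list(signal)
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def filter_radius(step, early_steps=15, max_radius=3):
    """Hypothetical schedule: strongest filtering at step 0, none after the early phase."""
    if step >= early_steps:
        return 0
    return max(1, round(max_radius * (1 - step / early_steps)))

# High-frequency toy conditioning signal (alternating pixels).
conditioning = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
for step in (0, 7, 14, 15, 40):
    radius = filter_radius(step)
    print(step, radius, low_pass(conditioning, radius))
```

Early steps see a smoothed conditioning signal with its fine detail suppressed, and later steps see the original signal, so per-frame detail is still available once the coarse motion has been decided.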
[61] MARMOT: Masked Autoencoder for Modeling Transient Imaging
Siyuan Shen,Ziheng Wang,Xingyue Peng,Suan Xia,Ruiqian Li,Shiying Li,Jingyi Yu
Main category: cs.CV
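TL;DR: The paper proposes MARMOT, a self-supervised masked autoencoder pretrained on large-scale, diverse non-line-of-sight (NLOS) transient imaging datasets to support NLOS applications.
Details
Motivation: Most previous work optimizes volume density or surfaces to reconstruct hidden objects and does not transfer priors learned from datasets. MARMOT aims to fill this gap with self-supervised learning.
Contribution: Proposes MARMOT, which applies self-supervised pretraining to NLOS transient imaging and learns features from masked transients with a Transformer-based encoder-decoder.
Method: A masked autoencoder learns features from partially masked transients via a scanning pattern mask (SPM) and predicts full measurements. It is pretrained on the synthesized TransVerse dataset.
Result: Comprehensive comparisons with state-of-the-art methods demonstrate the efficiency of MARMOT in both quantitative and qualitative results.
Insight: With self-supervised pretraining and a masking strategy, MARMOT transfers learned priors to downstream tasks, which offers a new approach to NLOS transient imaging.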
Abstract: Pretrained models have demonstrated impressive success in many modalities
such as language and vision. Recent works facilitate the pretraining paradigm
in imaging research. Transients are a novel modality, which are captured for an
object as photon counts versus arrival times using a precisely time-resolved
sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of
hidden objects are measured beyond the sensor’s direct line of sight. Using
NLOS transients, the majority of previous works optimize volume density or
surfaces to reconstruct the hidden objects and do not transfer priors learned
from datasets. In this work, we present a masked autoencoder for modeling
transient imaging, or MARMOT, to facilitate NLOS applications. Our MARMOT is a
self-supervised model pretrained on massive and diverse NLOS transient
datasets. Using a Transformer-based encoder-decoder, MARMOT learns features
from partially masked transients via a scanning pattern mask (SPM), where the
unmasked subset is functionally equivalent to arbitrary sampling, and predicts
full measurements. Pretrained on TransVerse, a synthesized transient dataset of
500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature
transfer or decoder finetuning. Comprehensive experiments are carried out in
comparisons with state-of-the-art methods. Quantitative and qualitative results
demonstrate the efficiency of our MARMOT.
[62] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
Qilin Yin,Wei Lu,Xiangyang Luo,Xiaochun Cao
Main category: cs.CV
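TL;DR: The paper proposes a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization (TFL), the task of locating locally tampered segments in videos.
Details
Motivation: Multimedia forensics research has focused on detecting forged audio-visual content, but it mostly treats deepfake detection as a classification task and ignores the case where only some video segments are tampered with. Temporal forgery localization is closer to real application scenarios but remains challenging.
Contribution: 1. Proposes a universal context-aware contrastive learning framework (UniCaCLF) for TFL; 2. Introduces a context-aware perception layer that builds a context-aware contrastive objective from a heterogeneous activation operation and an adaptive context updater; 3. Proposes an efficient context-aware contrastive coding that further separates genuine and forged instant features.
Method: 1. Uses supervised contrastive learning to find forged instants by means of anomaly detection; 2. Uses the context-aware perception layer, with its heterogeneous activation operation and adaptive context updater, to make features more discriminative; 3. Applies context-aware contrastive coding sample by sample to further separate the features.
Result: Experiments on five public datasets show that UniCaCLF significantly outperforms state-of-the-art algorithms.
Insight: Context-aware contrastive learning improves temporal forgery localization, particularly for locally tampered segments.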
Abstract: Most research efforts in the multimedia forensics domain have focused on
detecting forged audio-visual content and have achieved solid results. However,
these works only consider deepfake detection as a classification task and
ignore the case where partial segments of the video are tampered with. Temporal
forgery localization (TFL) of small fake audio-visual clips embedded in real
videos is still challenging and more in line with realistic application
scenarios. To resolve this issue, we propose a universal context-aware
contrastive learning framework (UniCaCLF) for TFL. Our approach leverages
supervised contrastive learning to discover and identify forged instants by
means of anomaly detection, allowing for the precise localization of temporal
forged segments. To this end, we propose a novel context-aware perception layer
that utilizes a heterogeneous activation operation and an adaptive context
updater to construct a context-aware contrastive objective, which enhances the
discriminability of forged instant features by contrasting them with genuine
instant features in terms of their distances to the global context. An
efficient context-aware contrastive coding is introduced to further push the
limit of instant feature distinguishability between genuine and forged instants
in a supervised sample-by-sample manner, suppressing the cross-sample influence
to improve temporal forgery localization performance. Extensive experimental
results over five public datasets demonstrate that our proposed UniCaCLF
significantly outperforms the state-of-the-art competing algorithms.
[63] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Zhiyi Zhu,Xiaoyu Wu,Zihao Liu,Linlin Yang
Main category: cs.CV
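TL;DR: MLVTG is a new multi-modal video temporal grounding framework built on Mamba and an LLM. Its MambaAligner and LLMRefiner modules address the weak multi-modal alignment of existing Transformer-based methods.
Details
Motivation: Existing Transformer-based video temporal grounding methods suffer from redundant attention and suboptimal multi-modal alignment, so a more efficient model is needed.
Contribution: 1. Introduces the MambaAligner module, which uses Vision Mamba blocks instead of Transformers to model temporal dependencies; 2. Proposes the LLMRefiner module, which uses a frozen layer of a pre-trained LLM to transfer semantic priors implicitly; 3. Combines the two in a dual alignment strategy that improves localization precision.
Method: MLVTG combines MambaAligner (temporal modeling based on Vision Mamba) with LLMRefiner (semantic priors from an LLM) to achieve efficient multi-modal alignment and temporal grounding.
Result: Achieves state-of-the-art performance on the QVHighlights, Charades-STA, and TVSum datasets, significantly outperforming existing baselines.
Insight: Mamba-based temporal modeling and LLM priors can improve both the efficiency and the precision of multi-modal alignment, which suggests a new direction for video temporal grounding.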
Abstract: Video Temporal Grounding (VTG), which aims to localize video clips
corresponding to natural language queries, is a fundamental yet challenging
task in video understanding. Existing Transformer-based methods often suffer
from redundant attention and suboptimal multi-modal alignment. To address these
limitations, we propose MLVTG, a novel framework that integrates two key
modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba
blocks as a backbone instead of Transformers to model temporal dependencies and
extract robust video representations for multi-modal alignment. LLMRefiner
leverages a specific frozen layer of a pre-trained Large Language Model (LLM)
to implicitly transfer semantic priors, enhancing multi-modal alignment without
fine-tuning. This dual alignment strategy, temporal modeling via structured
state-space dynamics and semantic purification via textual priors, enables more
precise localization. Extensive experiments on QVHighlights, Charades-STA, and
TVSum demonstrate that MLVTG achieves state-of-the-art performance and
significantly outperforms existing baselines.
[64] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer
Zhongtao Tian,Wenhao Huang,Zhidong Chen,Xiao Wei Sun
Main category: cs.CV
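TL;DR: The paper proposes a framework that combines multi-scale feature learning with semantic scene understanding, using a hierarchical Transformer with cross-scale attention to make visual localization more robust in dynamic environments.
Details
Motivation: In dynamic environments, lighting changes, adverse weather, and moving objects degrade conventional visual localization, and existing absolute pose regression methods struggle to stay consistent.
Contribution: Proposes a framework that combines a multi-scale Transformer with semantic supervision to improve visual localization in dynamic environments.
Method: A hierarchical Transformer with cross-scale attention fuses geometric details with contextual cues, and semantic supervision during training guides the network to learn view-invariant features.
Result: On the TartanAir dataset, the method outperforms existing pose regression methods, especially in scenes with dynamic objects, illumination changes, and occlusions.
Insight: Combining multi-scale processing with semantic guidance is an effective strategy for robust visual localization in dynamic environments.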
Abstract: Visual localization remains challenging in dynamic environments where
fluctuating lighting, adverse weather, and moving objects disrupt appearance
cues. Despite advances in feature representation, current absolute pose
regression methods struggle to maintain consistency under varying conditions.
To address this challenge, we propose a framework that synergistically combines
multi-scale feature learning with semantic scene understanding. Our approach
employs a hierarchical Transformer with cross-scale attention to fuse geometric
details and contextual cues, preserving spatial precision while adapting to
environmental changes. We improve the performance of this architecture with
semantic supervision via neural scene representation during training, guiding
the network to learn view-invariant features that encode persistent structural
information while suppressing complex environmental interference. Experiments
on TartanAir demonstrate that our approach outperforms existing pose regression
methods in challenging scenarios with dynamic objects, illumination changes,
and occlusions. Our findings show that integrating multi-scale processing with
semantic guidance offers a promising strategy for robust visual localization in
real-world dynamic environments.
[65] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s
Xijun Wang,Xin Li,Bingchen Li,Zhibo Chen
Main category: cs.CV
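TL;DR: LiftVSR is an efficient video super-resolution framework that uses a hybrid temporal modeling mechanism to reach state-of-the-art results with only 4 RTX 4090 GPUs, balancing long-term consistency and computational efficiency.
Details
Motivation: Existing video super-resolution methods are limited in temporal consistency and computational cost, and long videos in particular require expensive hardware. LiftVSR aims to address this with an image diffusion prior and efficient temporal modeling.
Contribution: 1. Proposes the LiftVSR framework, which achieves efficient video super-resolution using only 4 RTX 4090 GPUs; 2. Designs a hybrid temporal modeling mechanism (DTA and AMC) that balances fine-grained temporal modeling and long-term consistency; 3. Introduces an asymmetric sampling strategy that stabilizes cache interaction.
Method: 1. Dynamic Temporal Attention (DTA): identifies token flows across frames within multi-head query and key tokens to capture fine-grained temporal relations within short segments; 2. Attention Memory Cache (AMC): caches historical segment information for long-term consistency; 3. An asymmetric sampling strategy reduces feature mismatches during cache interaction.
Result: On several mainstream VSR benchmarks, LiftVSR achieves strong performance at significantly lower computational cost.
Insight: 1. An image diffusion prior can be transferred efficiently to video super-resolution; 2. Hybrid temporal modeling is an effective way to balance computational efficiency and consistency; 3. The cache mechanism and asymmetric sampling are important for long videos.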
Abstract: Diffusion models have significantly advanced video super-resolution (VSR) by
enhancing perceptual quality, largely through elaborately designed temporal
modeling to ensure inter-frame consistency. However, existing methods usually
suffer from limited temporal coherence and prohibitively high computational
costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for
long videos. In this work, we propose LiftVSR, an efficient VSR framework that
leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$,
achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To
balance long-term consistency and efficiency, we introduce a hybrid temporal
modeling mechanism that decomposes temporal learning into two complementary
components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal
modeling within short frame segments ($\textit{i.e.}$, low complexity), and (ii)
Attention Memory Cache (AMC) for long-term temporal modeling across segments
($\textit{i.e.}$, consistency). Specifically, DTA identifies multiple token
flows across frames within multi-head query and key tokens to warp inter-frame
contexts in the value tokens. AMC adaptively aggregates historical segment
information via a cache unit, ensuring long-term coherence with minimal
overhead. To further stabilize the cache interaction during inference, we
introduce an asymmetric sampling strategy that mitigates feature mismatches
arising from different diffusion sampling steps. Extensive experiments on
several typical VSR benchmarks have demonstrated that LiftVSR achieves
impressive performance with significantly lower computational costs.
[66] TrajFlow: Multi-modal Motion Prediction via Flow Matching
Qi Yan,Brian Zhang,Yutong Zhang,Daniel Yang,Joshua White,Di Chen,Jiachao Liu,Langechuan Liu,Binnan Zhuang,Shaoshuai Shi,Renjie Liao
Main category: cs.CV
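TL;DR: TrajFlow is a flow matching-based framework for multi-modal motion prediction that generates multiple plausible future trajectories in a single inference pass, significantly reducing computational overhead, and that further improves performance with a ranking loss and a self-conditioning training technique.
Details
Motivation: Existing generative trajectory prediction methods need multiple inference passes to capture diverse outcomes, which is computationally expensive and inefficient. TrajFlow aims to solve this problem in support of safe, informed decision-making in autonomous driving.
Contribution: 1. Proposes the flow matching framework TrajFlow, which generates multi-modal trajectories in a single pass; 2. Proposes a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation; 3. Designs a self-conditioning training technique that improves generalization and accelerates inference.
Method: 1. Formulates trajectory generation with flow matching; 2. Uses a Plackett-Luce ranking loss for uncertainty estimation; 3. Self-conditioning training reuses the model's own predictions to construct noisy inputs.
Result: On the large-scale Waymo Open Motion Dataset (WOMD), TrajFlow achieves state-of-the-art performance across various key metrics.
Insight: Flow matching is efficient and scalable for motion prediction, and combining it with a ranking loss and self-conditioning training improves model performance.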
Abstract: Efficient and accurate motion prediction is crucial for ensuring safety and
informed decision-making in autonomous driving, particularly under dynamic
real-world conditions that necessitate multi-modal forecasts. We introduce
TrajFlow, a novel flow matching-based motion prediction framework that
addresses the scalability and efficiency challenges of existing generative
trajectory prediction methods. Unlike conventional generative approaches that
employ i.i.d. sampling and require multiple inference passes to capture diverse
outcomes, TrajFlow predicts multiple plausible future trajectories in a single
pass, significantly reducing computational overhead while maintaining coherence
across predictions. Moreover, we propose a ranking loss based on the
Plackett-Luce distribution to improve uncertainty estimation of predicted
trajectories. Additionally, we design a self-conditioning training technique
that reuses the model’s own predictions to construct noisy inputs during a
second forward pass, thereby improving generalization and accelerating
inference. Extensive experiments on the large-scale Waymo Open Motion Dataset
(WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across
various key metrics, underscoring its effectiveness for safety-critical
autonomous driving applications. The code and other details are available on
the project website https://traj-flow.github.io/.
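The Plackett-Luce distribution mentioned in the abstract above assigns a probability to a full ranking by repeatedly choosing the next item with a softmax over the items that remain. The following is a minimal sketch of its negative log-likelihood, assuming each predicted trajectory has a scalar confidence logit and a target ranking is given (for example, ordered by closeness to the ground truth, which is an assumption here). How TrajFlow builds the target ranking and weights the loss is not stated in the abstract.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` under a Plackett-Luce model.

    scores:  one logit per item (e.g. per predicted trajectory).
    ranking: item indices ordered from best to worst.
    """
    nll = 0.0
    for position in range(len(ranking)):
        remaining = [scores[j] for j in ranking[position:]]
        peak = max(remaining)
        # Stable log-sum-exp over the items not yet placed.
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        nll -= scores[ranking[position]] - log_norm
    return nll

scores = [2.0, 1.0, 0.0]
print(plackett_luce_nll(scores, [0, 1, 2]))  # scores agree with the ranking
print(plackett_luce_nll(scores, [2, 1, 0]))  # scores contradict the ranking
```

Minimizing this loss pushes the confidence logits to order trajectories the same way as the target ranking, which is one way to make the predicted confidences more informative.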
[67] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs
Bowei Tian,Xuntao Lyu,Meng Liu,Hongyi Wang,Ang Li
Main category: cs.CV
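TL;DR: The paper proposes the Input-Space Linearity Hypothesis (ISLH) and introduces the Spectral Principal Path (SPP) framework to explain how deep networks progressively distill linear representations from noisy inputs. It also examines the robustness of these representations in multimodal vision-language models.
Details
Motivation: Motivated by the Linear Representation Hypothesis (LRH), the work explores how deep networks extract linear directions from the input space that align with human-interpretable concepts, with the goal of improving AI transparency and control.
Contribution: Proposes the Input-Space Linearity Hypothesis (ISLH) and the Spectral Principal Path (SPP) framework, and demonstrates the multimodal robustness of the resulting representations, advancing a structured theory of representation formation.
Method: The core of the method is the SPP framework, which analyzes the dominant spectral directions of the network to show how semantic directions in the input space are progressively amplified with depth.
Result: The paper demonstrates the framework on multimodal vision-language models and shows that the distilled linear representations are robust.
Insight: Deep networks form structured high-level representations by selectively amplifying linear directions in the input space, a mechanism that can help improve model transparency and robustness.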
Abstract: High-level representations have become a central focus in enhancing AI
transparency and control, shifting attention from individual neurons or
circuits to structured semantic directions that align with human-interpretable
concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose
the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned
directions originate in the input space and are selectively amplified with
increasing depth. We then introduce the Spectral Principal Path (SPP)
framework, which formalizes how deep networks progressively distill linear
representations along a small set of dominant spectral directions. Building on
this framework, we further demonstrate the multimodal robustness of these
representations in Vision-Language Models (VLMs). By bridging theoretical
insights with empirical validation, this work advances a structured theory of
representation formation in deep networks, paving the way for improving AI
robustness, fairness, and transparency.
[68] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge
Agnese Taluzzi,Davide Gesualdi,Riccardo Santambrogio,Chiara Plizzari,Francesca Palermo,Simone Mentasti,Matteo Matteucci
Main category: cs.CV
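TL;DR: The paper presents SceneNet and KnowledgeNet for the visual question answering task of the HD-EPIC VQA Challenge 2025. Combining scene graphs generated by a multimodal large language model (MLLM) with external commonsense knowledge substantially improves task performance.
Details
Motivation: Complex egocentric visual question answering (VQA) requires capturing fine-grained object interactions, spatial relationships, and temporal dynamics, while also reasoning with external commonsense knowledge.
Contribution: 1. SceneNet, which uses an MLLM to generate scene graphs that capture visual detail; 2. KnowledgeNet, which integrates commonsense knowledge from ConceptNet to support higher-level semantic reasoning; 3. The combination of the two reaches 44.21% accuracy on the HD-EPIC benchmark.
Method: 1. SceneNet generates scene graphs with an MLLM to extract object interactions and spatio-temporal information; 2. KnowledgeNet uses ConceptNet to extend the semantic connections between entities; 3. The two approaches are combined for reasoning.
Result: Across the seven categories of the HD-EPIC VQA challenge, the hybrid framework performs well, with a final accuracy of 44.21%.
Insight: Combining visual scene graphs with external commonsense knowledge can substantially improve performance on complex VQA tasks, which underlines the importance of multimodal and knowledge fusion.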
Abstract: This report presents SceneNet and KnowledgeNet, our approaches developed for
the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with
a multi-modal large language model (MLLM) to capture fine-grained object
interactions, spatial relationships, and temporally grounded events. In
parallel, KnowledgeNet incorporates ConceptNet’s external commonsense knowledge
to introduce high-level semantic connections between entities, enabling
reasoning beyond directly observable visual evidence. Each method demonstrates
distinct strengths across the seven categories of the HD-EPIC benchmark, and
their combination within our framework results in an overall accuracy of 44.21%
on the challenge, highlighting its effectiveness for complex egocentric VQA
tasks.
[69] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement
Xinyue Niu,Akira Furui
Main category: cs.CV
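TL;DR: The paper proposes a dual-branch adversarial feature disentanglement method for cross-subject EMG pattern recognition that requires no calibration data and generalizes well to new users.
Details
Motivation: Cross-subject EMG pattern recognition is hampered by differences in signal characteristics, electrode placement, and anatomy. Traditional methods rely on user-specific calibration, which is time-consuming and impractical.
Contribution: An end-to-end dual-branch adversarial neural network that disentangles EMG features into pattern-specific and subject-specific components, enabling calibration-free cross-subject generalization.
Method: The dual-branch adversarial network performs pattern recognition and individual identification simultaneously, using feature disentanglement to separate pattern-related features from subject-related ones.
Result: Experiments show that the model performs well on data from unseen users and outperforms several baseline methods.
Insight: Feature disentanglement offers a new perspective on cross-subject EMG recognition and also supports task-independent biometric applications.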
Abstract: Cross-subject electromyography (EMG) pattern recognition faces significant
challenges due to inter-subject variability in muscle anatomy, electrode
placement, and signal characteristics. Traditional methods rely on
subject-specific calibration data to adapt models to new users, an approach
that is both time-consuming and impractical for large-scale, real-world
deployment. This paper presents an approach to eliminate calibration
requirements through feature disentanglement, enabling effective cross-subject
generalization. We propose an end-to-end dual-branch adversarial neural network
that simultaneously performs pattern recognition and individual identification
by disentangling EMG features into pattern-specific and subject-specific
components. The pattern-specific components facilitate robust pattern
recognition for new users without model calibration, while the subject-specific
components enable downstream applications such as task-invariant biometric
identification. Experimental results demonstrate that the proposed model
achieves robust performance on data from unseen users, outperforming various
baseline methods in cross-subject scenarios. Overall, this study offers a new
perspective for cross-subject EMG pattern recognition without model calibration
and highlights the proposed model’s potential for broader applications, such as
task-independent biometric systems.
[70] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection
Duc Thanh Pham,Hong Dang Nguyen,Nhat Minh Nguyen Quoc,Linh Ngo Van,Sang Dinh Viet,Duc Anh Nguyen
Main category: cs.CV
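TL;DR: Proposes Hier-DETR, an incremental object detection framework that combines Neural Collapse with hierarchical class relations to improve efficiency and performance.
Details
Motivation: Incremental object detection (IOD) suffers from limited performance and long inference times, which restrict its practical use. Hier-DETR aims to address both problems.
Contribution: Introduces the Hier-DETR framework, which exploits Neural Collapse and the class hierarchy for efficient incremental object detection.
Method: A transformer-based detector that uses Neural Collapse to handle data imbalance and hierarchical label relations to guide learning.
Result: The framework is reported to be competitive in both performance and efficiency.
Insight: Neural Collapse and hierarchical class relations are key factors in improving incremental object detection.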
Abstract: Recently, object detection models have witnessed notable performance
improvements, particularly with transformer-based models. However, new objects
frequently appear in the real world, requiring detection models to continually
learn without suffering from catastrophic forgetting. Although Incremental
Object Detection (IOD) has emerged to address this challenge, these existing
models are still not practical due to their limited performance and prolonged
inference time. In this paper, we introduce a novel framework for IOD, called
Hier-DETR: Hierarchical Neural Collapse Detection Transformer, ensuring both
efficiency and competitive performance by leveraging Neural Collapse for
imbalanced datasets and the hierarchical relations among class labels.
[71] Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui,Liang Xie,Yu Zhao,Jiawei Sun,Erwei Yin
Main category: cs.CV
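TL;DR: The paper proposes FCA-NIG, a generative framework that automatically constructs navigation instructions with fine-grained cross-modal alignment annotations, addressing the lack of sub-instruction-level and entity-landmark alignments in existing datasets.
Details
Motivation: Existing vision-language navigation (VLN) datasets focus on global instruction-trajectory matching and neglect sub-instruction-level and entity-landmark alignment, which hurts the accuracy of navigation actions.
Contribution: 1. The FCA-NIG framework, which automatically generates navigation instructions annotated with sub-instruction-trajectory and entity-landmark alignments. 2. The FCA-R2R dataset, the first to provide large-scale fine-grained alignment annotations.
Method: 1. Divide an augmented trajectory into sub-trajectories. 2. Detect landmarks with GLIP, generate R2R-like instructions with OFA-Speaker, and select entities with CLIP to produce sub-instruction-trajectory pairs. 3. Aggregate the sub-pairs into a complete instruction-trajectory pair.
Result: Experiments show that FCA-R2R significantly improves several VLN agents (SF, EnvDrop, RecBERT, and HAMT), in particular their state awareness and the accuracy of their navigation decisions.
Insight: Fine-grained sub-instruction-trajectory and entity-landmark alignments are key to VLN performance. FCA-NIG generates high-quality data without manual annotation, advancing cross-modal learning.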
Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate
environments by integrating visual perception and natural language
instructions, yet faces significant challenges due to the scarcity of
fine-grained cross-modal alignment annotations. Existing datasets primarily
focus on global instruction-trajectory matching, neglecting
sub-instruction-level and entity-level alignments critical for accurate
navigation action decision-making. To address this limitation, we propose
FCA-NIG, a generative framework that automatically constructs navigation
instructions with dual-level fine-grained cross-modal annotations. In this
framework, an augmented trajectory is first divided into sub-trajectories,
which are then processed through GLIP-based landmark detection, crafted
instruction construction, OFA-Speaker based R2R-like instruction generation,
and CLIP-powered entity selection, generating sub-instruction-trajectory pairs
with entity-landmark annotations. Finally, these sub-pairs are aggregated to
form a complete instruction-trajectory pair. The framework generates the
FCA-R2R dataset, the first large-scale augmentation dataset featuring precise
sub-instruction-sub-trajectory and entity-landmark alignments. Extensive
experiments demonstrate that training with FCA-R2R significantly improves the
performance of multiple state-of-the-art VLN agents, including SF, EnvDrop,
RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances
agents’ state awareness and decision accuracy, while entity-landmark alignment
further boosts navigation performance and generalization. These results
highlight the effectiveness of FCA-NIG in generating high-quality, scalable
training data without manual annotation, advancing fine-grained cross-modal
learning in complex navigation tasks.
[72] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers
Chengchao Shen,Hourun Zhu,Gongfan Fang,Jianxin Wang,Xinchao Wang
Main category: cs.CV
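TL;DR: The paper proposes Diversity-Guided MLP Reduction (DGMR), which removes redundant neurons in multilayer perceptron (MLP) modules to substantially reduce the parameters and computation of large vision transformers, while distillation keeps performance nearly lossless.
Details
Motivation: Large transformer models have excessive parameter counts and compute costs, and MLP modules account for most of the parameters. The paper aims to obtain efficient large vision transformers by pruning redundant MLP parameters.
Contribution: Proposes DGMR, which uses a Gram-Schmidt weight pruning strategy to remove redundant neurons in MLP hidden layers while preserving weight diversity to aid distillation. Experiments on several state-of-the-art large vision transformers show a reduction of more than 57% in parameters and FLOPs with minimal performance loss.
Method: Gram-Schmidt weight pruning removes redundant hidden neurons in the MLP, and diversity preservation helps recover performance during distillation. Only 0.06% of the unlabeled LAION-2B data is needed to recover the original performance.
Result: On EVA-CLIP-E (4.4B), the method achieves a 71.5% reduction in parameters and FLOPs without performance degradation. Overall, parameters and FLOPs are reduced by more than 57% with nearly lossless performance.
Insight: MLP modules hold most of the model parameters. A diversity-preserving pruning strategy can remove redundancy while maintaining performance, offering a new route to efficient large vision transformers.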
Abstract: Transformer models exhibit an excellent scaling property: performance
improves as model capacity grows. However, large-scale model
parameters lead to an unaffordable cost in computation and memory. We analyze
popular transformer architectures and find that multilayer perceptron (MLP)
modules take up the majority of model parameters. To this end, we focus on the
recoverability of the compressed models and propose a Diversity-Guided MLP
Reduction (DGMR) method to significantly reduce the parameters of large vision
transformers with only negligible performance degradation. Specifically, we
apply a Gram-Schmidt weight pruning strategy to eliminate redundant neurons
in the MLP hidden layer while preserving weight diversity for better performance
recovery during distillation. Compared to the model trained from scratch, our
pruned model only requires 0.06% data of LAION-2B (for the training of large
vision transformers) without labels (ImageNet-1K) to recover the original
performance. Experimental results on several state-of-the-art large vision
transformers demonstrate that our method achieves a more than 57.0% parameter
and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B),
our method accomplishes a 71.5% parameter and FLOPs reduction without
performance degradation. The source code and trained weights are available at
https://github.com/visresearch/DGMR.
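To make the idea of Gram-Schmidt-style neuron selection concrete, here is a minimal sketch that greedily keeps the hidden neurons whose weight vectors add the most new direction to those already kept. The function name `gram_schmidt_select`, the greedy largest-residual rule, and the toy weights are assumptions for illustration; the abstract does not spell out DGMR's exact selection criterion, so this should not be read as the authors' implementation.

```python
import math

def gram_schmidt_select(weights, keep):
    """Greedily pick up to `keep` rows of `weights` (one row per hidden neuron).

    At each step the row with the largest residual norm, after projecting out
    the span of the rows chosen so far, is kept. Rows that are (nearly) linear
    combinations of kept rows end up with zero residual and are never chosen.
    """
    residuals = [list(row) for row in weights]
    selected = []
    for _ in range(keep):
        norms = [
            0.0 if i in selected else math.sqrt(sum(x * x for x in r))
            for i, r in enumerate(residuals)
        ]
        best = max(range(len(weights)), key=lambda i: norms[i])
        if norms[best] < 1e-12:
            break  # every remaining neuron is redundant
        selected.append(best)
        basis = [x / norms[best] for x in residuals[best]]
        # Gram-Schmidt step: remove the new basis direction from all rows.
        for i, r in enumerate(residuals):
            dot = sum(a * b for a, b in zip(r, basis))
            residuals[i] = [a - dot * b for a, b in zip(r, basis)]
    return selected

# Row 0 points in the same direction as row 1, so it is treated as redundant.
toy_weights = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
kept = gram_schmidt_select(toy_weights, keep=4)
```

In a real MLP the kept indices would be used to slice both the first layer's rows and the second layer's matching columns, after which distillation recovers any lost accuracy.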
[73] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective
Robert-Jan Bruintjes,Attila Lengyel,Osman Semih Kayhan,Davide Zambrano,Nergis Tömen,Hadi Jamali-Rad,Jan van Gemert
Main category: cs.CV
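TL;DR: The paper looks back at the data-efficient Visual Inductive Priors challenges and discusses how prior knowledge can improve deep learning models when data is scarce.
Details
Motivation: Deep learning performance degrades when data is scarce. The work aims to stimulate the development of methods that use data more efficiently.
Contribution: By organizing data-deficient challenges, the authors promoted research on data-efficient deep learning and showed the importance of prior knowledge.
Method: Challenge participants must train models from scratch on a limited number of training samples, without any form of transfer learning. Entries typically rely on model ensembles and data augmentation.
Result: Successful entries used large ensembles mixing Transformers and CNNs together with heavy data augmentation, and some also introduced new prior-knowledge-based methods.
Insight: Combining prior knowledge with model ensembling substantially improves performance when data is scarce.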
Abstract: Deep Learning requires large amounts of data to train models that work well.
In data-deficient settings, performance can be degraded. We investigate which
Deep Learning methods benefit training models in a data-deficient setting, by
organizing the “VIPriors: Visual Inductive Priors for Data-Efficient Deep
Learning” workshop series, featuring four editions of data-impaired challenges.
These challenges address the problem of training deep learning models for
computer vision tasks with limited data. Participants are limited to training
models from scratch using a low number of training samples and are not allowed
to use any form of transfer learning. We aim to stimulate the development of
novel approaches that incorporate prior knowledge to improve the data
efficiency of deep learning models. Successful challenge entries make use of
large model ensembles that mix Transformers and CNNs, as well as heavy data
augmentation. Novel prior knowledge-based methods contribute to success in some
entries.
[74] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything
Joost van Dalen,Yuki M. Asano,Marc Russwurm
Main category: cs.CV
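TL;DR: SAMSelect is an algorithm that uses the Segment Anything Model to produce a salient three-channel visualization of multispectral images, helping marine scientists visually interpret marine debris in Sentinel-2 imagery. It selects the best combination of bands or spectral indices using a small annotated dataset, improving classification accuracy and the quality of visual information.
Details
Motivation: Marine debris is hard to visualize in medium-resolution imagery because of its compositional heterogeneity, and domain experts typically rely on experience and heuristics to choose bands and spectral indices. SAMSelect aims to automate this choice to improve visual interpretation.
Contribution: Proposes the SAMSelect algorithm, which uses the Segment Anything Model to automatically select the best combination of bands or spectral indices and produce a salient three-channel visualization of multispectral images, improving the identification of marine debris.
Method: The Segment Anything Model is used to test the classification accuracy of different band or spectral index combinations on a small annotated dataset, and the best-performing combination is used to build the three-channel visualization.
Result: On Sentinel-2 scenes from Accra, Ghana, and Durban, South Africa, SAMSelect identified previously unused band combinations (such as a normalized difference index of B8 and B2) that outperform traditional indices from the literature.
Insight: Automated band selection combined with the Segment Anything Model can markedly improve the visual interpretation of marine debris and gives domain experts a more efficient visual analysis tool.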
Abstract: This work proposes SAMSelect, an algorithm to obtain a salient three-channel
visualization for multispectral images. We develop SAMSelect and show its use
for marine scientists visually interpreting floating marine debris in
Sentinel-2 imagery. These debris are notoriously difficult to visualize due to
their compositional heterogeneity in medium-resolution imagery. Despite these
difficulties, visual interpretation of imagery showing marine debris remains
common practice among domain experts, who select bands and spectral indices on a
case-by-case basis informed by common practices and heuristics. SAMSelect
selects the band or index combination that achieves the best classification
accuracy on a small annotated dataset through the Segment Anything Model. Its
central assumption is that the three-channel visualization that achieves the most
accurate segmentation results also provides good visual information for
photo-interpretation.
We evaluate SAMSelect in three Sentinel-2 scenes containing generic marine
debris in Accra, Ghana, and Durban, South Africa, and deployed plastic targets
from the Plastic Litter Project. This reveals the potential of new previously
unused band combinations (e.g., a normalized difference index of B8, B2), which
demonstrate improved performance compared to literature-based indices. We
describe the algorithm in this paper and provide an open-source code repository
that will be helpful for domain scientists doing visual photo interpretation,
especially in the marine field.
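The band search described above can be sketched as follows: compute a normalized difference index for each pair of bands, score each candidate, and keep the best pair. The helper names and the `score` callable are hypothetical; in SAMSelect the score would come from segmentation accuracy obtained with the Segment Anything Model on the annotated set, which is replaced here by a simple stand-in.

```python
from itertools import combinations

def normalized_difference(band_a, band_b):
    """Per-pixel (a - b) / (a + b) for two equally sized 2D bands."""
    out = []
    for row_a, row_b in zip(band_a, band_b):
        out.append([
            (a - b) / (a + b) if (a + b) != 0 else 0.0
            for a, b in zip(row_a, row_b)
        ])
    return out

def select_best_pair(bands, score):
    """Return the band-name pair whose index image maximizes `score`.

    bands: dict mapping a band name to a 2D list of reflectances.
    score: callable taking an index image and returning a float
           (a stand-in for SAM-based accuracy on annotated data).
    """
    return max(
        combinations(sorted(bands), 2),
        key=lambda pair: score(normalized_difference(bands[pair[0]], bands[pair[1]])),
    )

# Toy reflectances and a toy score (total absolute contrast), for illustration only.
toy_bands = {
    "B2": [[0.1, 0.6]],
    "B4": [[0.2, 0.2]],
    "B8": [[0.3, 0.2]],
}
contrast = lambda image: sum(abs(v) for row in image for v in row)
best_pair = select_best_pair(toy_bands, contrast)
```

An exhaustive search like this is affordable because a multispectral sensor has only a handful of bands, so the number of pairs stays small.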
[75] ECMNet:Lightweight Semantic Segmentation with Efficient CNN-Mamba Network
Feixiang Du,Shengkun Wu
Main category: cs.CV
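TL;DR: The paper proposes ECMNet, a lightweight semantic segmentation network that combines the strengths of CNNs and Mamba. With an EDAB module, an MSAU unit, and a Mamba-enhanced FFM module, it markedly improves the balance between segmentation accuracy and efficiency.
Details
Motivation: Although CNNs and Transformers perform well in semantic segmentation, their global context modeling is still inadequate. Mamba has shown advantages in modeling long-range dependencies in vision tasks, so it is combined with a CNN to offset the weaknesses of each.
Contribution: 1. Proposes the lightweight Efficient CNN-Mamba Network (ECMNet); 2. Designs the Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck; 3. Proposes the Multi-Scale Attention Unit (MSAU) to strengthen feature representation; 4. Introduces a Mamba-enhanced Feature Fusion Module (FFM) for multi-scale feature fusion.
Method: ECMNet combines a CNN and Mamba in a capsule-based framework, using EDAB for computational efficiency, MSAU for multi-scale feature aggregation, and FFM for stronger feature fusion.
Result: Achieves 70.6% mIoU on Cityscapes and 73.6% mIoU on the CamVid test set, with 0.87M parameters and 8.27G FLOPs.
Insight: Combining Mamba with a CNN provides efficient long-range dependency modeling for semantic segmentation while keeping the network lightweight and computationally efficient.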
Abstract: In the past decade, Convolutional Neural Networks (CNNs) and Transformers
have achieved wide application in semantic segmentation tasks. Although CNNs
combined with Transformer models greatly improve performance, global context
modeling remains inadequate. Recently, Mamba has shown great potential in vision
tasks, with advantages in modeling long-range dependencies. In this paper,
we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation,
dubbed ECMNet. ECMNet combines a CNN with Mamba in a capsule-based
framework to address their complementary weaknesses. Specifically, we design an
Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck. To
improve the representational ability of features, we devise a Multi-Scale
Attention Unit (MSAU) that integrates multi-scale feature aggregation, spatial
aggregation, and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion
Module (FFM) merges features from different levels, significantly enhancing
segmentation accuracy. Extensive experiments on two representative datasets demonstrate that
the proposed model achieves a strong balance between accuracy and efficiency, reaching 70.6%
mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87M
parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.
[76] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
Yang Bai,Liudi Yang,George Eskandar,Fengyi Shen,Dong Chen,Mohammad Altillawi,Ziyuan Liu,Gitta Kutyniok
Main category: cs.CV
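TL;DR: RoboSwap is a video editing framework that combines GANs and diffusion models for unsupervised robot arm swapping, addressing data scarcity in cross-platform robot learning.
Details
Motivation: The scarcity of diverse, high-quality datasets limits the cross-platform generalization of video-conditioned robot learning. RoboSwap aims to swap robot arms without supervision, reducing the need for data collection.
Contribution: Proposes the RoboSwap framework, which combines GANs and diffusion models for unsupervised robot arm swapping and addresses cross-platform data generation for cross-embodiment learning.
Method: 1. A GAN translates robot arms from different environments into the target arm; 2. A diffusion model improves the coherence and motion realism of the translated video. The GAN and the diffusion model are trained separately.
Result: On three benchmarks, RoboSwap outperforms existing video and image editing models in structural coherence and motion consistency.
Insight: GANs and diffusion models have complementary strengths. Combining them yields high-quality cross-platform data for robot learning and reduces the dependence on paired data.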
Abstract: Recent advancements in generative models have revolutionized video synthesis
and editing. However, the scarcity of diverse, high-quality datasets continues
to hinder video-conditioned robotic learning, limiting cross-platform
generalization. In this work, we address the challenge of swapping a robotic
arm in one video with another: a key step for cross-embodiment learning. Unlike
previous methods that depend on paired video demonstrations in the same
environmental settings, our proposed framework, RoboSwap, operates on unpaired
data from diverse environments, alleviating the data collection needs. RoboSwap
introduces a novel video editing pipeline integrating both GANs and diffusion
models, combining their isolated advantages. Specifically, we segment robotic
arms from their backgrounds and train an unpaired GAN model to translate one
robotic arm to another. The translated arm is blended with the original video
background and refined with a diffusion model to enhance coherence, motion
realism and object interaction. The GAN and diffusion stages are trained
independently. Our experiments demonstrate that RoboSwap outperforms
state-of-the-art video and image editing models on three benchmarks in terms of
both structural coherence and motion consistency, thereby offering a robust
solution for generating reliable, cross-embodiment data in robotic learning.
[77] SurfR: Surface Reconstruction with Multi-scale Attention
Siddhant Ranade,Gonçalo Dias Pais,Ross Tyler Whitaker,Jacinto C. Nascimento,Pedro Miraldo,Srikumar Ramalingam
Main category: cs.CV
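TL;DR: Proposes a fast and accurate surface reconstruction algorithm that handles unorganized point clouds with an implicit representation, resolving the trade-off between detail and speed in existing methods.
Details
Motivation: Existing learning methods either need per-object training (small models, high detail) or use generalized representations (large models, low detail, slow inference). A new method is needed that preserves detail while keeping inference fast.
Contribution: 1. A "lazy query" approach that speeds up feature extraction; 2. A parallel multi-scale grid representation that improves robustness to noise and input resolution; 3. Cross-scale attention that improves reconstruction results.
Method: An implicit representation combined with lazy query (the early stages do not depend on the query point), multi-scale grid feature extraction, and cross-scale attention.
Result: The algorithm is faster than all baselines, with accuracy close to the state of the art, giving the best accuracy-speed trade-off.
Insight: Lazy query and multi-scale attention markedly improve the efficiency and robustness of implicit representations and suggest a new direction for surface reconstruction.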
Abstract: We propose a fast and accurate surface reconstruction algorithm for
unorganized point clouds using an implicit representation. Recent learning
methods are either single-object representations with small neural models that
allow for high surface details but require per-object training or generalized
representations that require larger models and generalize to newer shapes but
lack details, and inference is slow. We propose a new implicit representation
for general 3D shapes that is faster than all the baselines at their optimum
resolution, with only a marginal loss in performance compared to the
state-of-the-art. We achieve the best accuracy-speed trade-off using three key
contributions. Many implicit methods extract features from the point cloud to
classify whether a query point is inside or outside the object. First, to speed
up the reconstruction, we show that this feature extraction does not need to
use the query point at an early stage (lazy query). Second, we use a parallel
multi-scale grid representation to develop robust features for different noise
levels and input resolutions. Finally, we show that attention across scales can
provide improved reconstruction results.
[78] Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization
Zhiyi Zhu,Xiaoyu Wu,Youwei Lu
Main category: cs.CV
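TL;DR: The paper proposes the Text-Motion Cross-modal Contrastive Loss (TMCCL) to improve video memorability prediction, and demonstrates an application of memorability prediction through a new video summarization method (MWCVS).
Details
Motivation: Existing models do not fully exploit motion features when predicting video memorability, and the lack of labeled data weakens the motion feature representation. The paper aims to strengthen motion features through cross-modal contrastive learning and to explore practical applications of memorability prediction.
Contribution: 1. TMCCL, which strengthens motion feature representation through text-motion cross-modal contrastive learning; 2. MWCVS, which uses memorability prediction to reduce subjectivity in video summarization.
Method: 1. TMCCL: uses the similarity of video text descriptions to build positive and negative sample sets, then optimizes motion features with contrastive learning; 2. MWCVS: applies a memorability-weighted correction to video summarization labels.
Result: The approach achieves state-of-the-art performance on both video memorability prediction and video summarization, supporting its effectiveness.
Insight: Cross-modal contrastive learning can effectively improve motion feature representation, and memorability prediction has practical value in video editing tasks.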
Abstract: Video memorability refers to the ability of videos to be recalled after
viewing, playing a crucial role in creating content that remains memorable.
Existing models typically focus on extracting multimodal features to predict
video memorability scores but often fail to fully utilize motion cues. The
representation of motion features is compromised during the fine-tuning phase
of the motion feature extractor due to a lack of labeled data. In this paper,
we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal
video memorability prediction model designed to enhance the representation of
motion features. We tackle the challenge of improving motion feature
representation by leveraging text description similarities across videos to
establish positive and negative motion sample sets for a given target. This
enhancement allows the model to learn similar feature representations for
semantically related motion content, resulting in more accurate memorability
predictions. Our model achieves state-of-the-art performance on two video
memorability prediction datasets. Moreover, the potential applications of video
memorability prediction have been underexplored. To address this gap, we
present Memorability Weighted Correction for Video Summarization (MWCVS), using
video memorability prediction to reduce subjectivity in video summarization
labels. Experimental results on two video summarization datasets demonstrate
the effectiveness of MWCVS, showcasing the promising applications of video
memorability prediction.
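The following sketch illustrates the general shape of a contrastive loss whose positive and negative motion samples are chosen by text similarity, as the abstract describes. The InfoNCE-style formula, the similarity threshold, the temperature, and all function names are assumptions for illustration; the abstract does not give TMCCL's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_by_text_similarity(text_sims, threshold=0.5):
    """Videos whose captions resemble the target's become positives (assumed rule)."""
    positives = [i for i, s in enumerate(text_sims) if s >= threshold]
    negatives = [i for i, s in enumerate(text_sims) if s < threshold]
    return positives, negatives

def contrastive_loss(anchor, motion_feats, positives, negatives, temperature=0.1):
    """InfoNCE-style loss pulling the anchor towards positive motion features."""
    pos = [math.exp(cosine(anchor, motion_feats[i]) / temperature) for i in positives]
    neg = [math.exp(cosine(anchor, motion_feats[i]) / temperature) for i in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)

# Two other videos: the first has a similar caption, the second does not.
feats = [[1.0, 0.0], [0.0, 1.0]]
pos_idx, neg_idx = split_by_text_similarity([0.9, 0.1])
aligned = contrastive_loss([1.0, 0.0], feats, pos_idx, neg_idx)
misaligned = contrastive_loss([0.0, 1.0], feats, pos_idx, neg_idx)
```

The loss is small when the anchor's motion feature sits close to the features of videos with similar descriptions, which is the behavior the text-guided sample sets are meant to encourage.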
[79] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping
Peter Grönquist,Stepan Tulyakov,Dengxin Dai
Main category: cs.CV
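TL;DR: The paper proposes the lightweight Neural Physical Model (NPM) for the challenging task of converting raw images between cameras. It adapts to different illumination conditions and outperforms existing methods on public datasets.
Details
Motivation: Color consistency across cameras is essential for image fusion and ISP compatibility, but existing methods suffer from poor adaptability to illumination or high computational cost.
Contribution: Proposes NPM, which uses physically informed learning to simulate raw images under specified illumination, enabling efficient and adaptable cross-device raw conversion.
Method: NPM combines physical measurements with deep learning, supports training with or without paired data, adapts to different illumination conditions, and simulates raw images to estimate the conversion.
Result: Experiments on the NUS and BeyondRGB datasets show that NPM outperforms existing methods in color consistency and adaptability.
Insight: By combining physical information with data-driven methods, NPM achieves high-quality raw conversion in a lightweight design and offers a practical solution for multi-camera systems.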
Abstract: Achieving consistent color reproduction across multiple cameras is essential
for seamless image fusion and Image Processing Pipeline (ISP) compatibility in
modern devices, but it is a challenging task due to variations in sensors and
optics. Existing raw-to-raw conversion methods face limitations such as poor
adaptability to changing illumination, high computational costs, or impractical
requirements such as simultaneous camera operation and overlapping
fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight,
physically-informed approach that simulates raw images under specified
illumination to estimate transformations between devices. The NPM effectively
adapts to varying illumination conditions, can be initialized with physical
measurements, and supports training with or without paired data. Experiments on
public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent
state-of-the-art methods, providing robust chromatic consistency across
different sensors and optical systems.
[80] LLaVA-c: Continual Improved Visual Instruction Tuning
Wenzhuo Liu,Fei Zhu,Haiyang Guo,Longhui Wei,Cheng-Lin Liu
Main category: cs.CV
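TL;DR: LLaVA-c improves continual learning with spectral-aware consolidation and unsupervised inquiry regularization, balancing task-specific performance against general capabilities and even surpassing multitask joint learning.
Details
Motivation: Multimodal models such as LLaVA-1.5 face task-balancing and expansion-cost challenges in multitask learning, while conventional continual learning methods overlook the degradation of the base model.
Contribution: Proposes spectral-aware consolidation and unsupervised inquiry regularization, and shows for the first time that task-by-task continual learning can match or surpass joint learning.
Method: Building on LLaVA-1.5, spectral-aware consolidation improves task balance and unsupervised inquiry regularization prevents base model degradation.
Result: Across continual pretraining and fine-tuning, LLaVA-c improves benchmark performance while preserving general capabilities.
Insight: With a suitable design, continual learning can avoid base model degradation, so multitask joint learning is not the only efficient path.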
Abstract: Multimodal models like LLaVA-1.5 achieve state-of-the-art visual
understanding through visual instruction tuning on multitask datasets, enabling
strong instruction-following and multimodal performance. However, multitask
learning faces challenges such as task balancing, requiring careful adjustment
of data proportions, and expansion costs, where new tasks risk catastrophic
forgetting and need costly retraining. Continual learning provides a promising
alternative to acquiring new knowledge incrementally while preserving existing
capabilities. However, current methods prioritize task-specific performance,
neglecting base model degradation from overfitting to specific instructions,
which undermines general capabilities. In this work, we propose a simple but
effective method with two modifications on LLaVA-1.5: spectral-aware
consolidation for improved task balance and unsupervised inquiry regularization
to prevent base model degradation. We evaluate both general and task-specific
performance across continual pretraining and fine-tuning. Experiments
demonstrate that LLaVA-c consistently enhances standard benchmark performance
and preserves general capabilities. For the first time, we show that
task-by-task continual learning can achieve results that match or surpass
multitask joint learning. The code will be publicly released.
[81] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
Juan Yeo,Soonwoo Cha,Jiwoo Song,Hyunbin Jin,Taesup Kim
Main category: cs.CV
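TL;DR: ATAS is a self-distillation method that uses a model's own knowledge across representation levels to improve semantic coherence and fine-grained vision-language alignment at the same time. It strengthens CLIP models on open-vocabulary dense prediction without extra modules or supervised fine-tuning.
Details
Motivation: CLIP performs well on open-vocabulary dense prediction but still falls short in fine-grained, region-level understanding, and existing methods often gain fine-grained alignment at the expense of semantic coherence.
Contribution: Proposes ATAS, which improves semantic coherence and fine-grained alignment simultaneously through self-distillation, without extra modules or supervised fine-tuning, and substantially improves open-vocabulary dense prediction.
Method: Using unlabeled images and an internal self-distillation process, ATAS distills knowledge across representation levels to refine the representations of the CLIP vision encoder, preserving local semantic consistency while sharpening the recognition of detail.
Result: On open-vocabulary object detection and semantic segmentation benchmarks, ATAS substantially outperforms baseline CLIP models.
Insight: Jointly maintaining semantic coherence and fine-grained alignment is key to open-vocabulary dense prediction, and self-distillation is an efficient unsupervised way to achieve it.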
Abstract: Vision-language models such as CLIP have recently propelled open-vocabulary
dense prediction tasks by enabling recognition of a broad range of visual
concepts. However, CLIP still struggles with fine-grained, region-level
understanding, hindering its effectiveness on these dense prediction tasks. We
identify two pivotal factors required to address this limitation: semantic
coherence and fine-grained vision-language alignment. Current adaptation
methods often improve fine-grained alignment at the expense of semantic
coherence, and often rely on extra modules or supervised fine-tuning. To
overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel
approach that simultaneously enhances semantic coherence and fine-grained
alignment by leveraging a model's own knowledge across all representation
levels. Unlike prior methods, ATAS uses only unlabeled images and an internal
self-distillation process to refine representations of CLIP vision encoders,
preserving local semantic consistency while sharpening local detail
recognition. On open-vocabulary object detection and semantic segmentation
benchmarks, ATAS achieves substantial performance gains, outperforming baseline
CLIP models. These results validate the effectiveness of our approach and
underscore the importance of jointly maintaining semantic coherence and
fine-grained alignment for advanced open-vocabulary dense prediction.
[82] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities
Hugo Porta,Emanuele Dalsasso,Jessica L. McCarty,Devis Tuia
Main category: cs.CV
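TL;DR: The paper presents CanadaFireSat, a high-resolution (100 m) wildfire forecasting dataset, together with baseline methods. Using multi-modal data (high-resolution satellite imagery and environmental factors), it shows the potential of multi-modal deep learning models for wildfire forecasting at continental scale.
Details
Motivation: Canada experienced a severe wildfire season in 2023, creating an urgent need for high-resolution forecasting models that make wildfire management more efficient and accurate.
Contribution: 1) The CanadaFireSat dataset; 2) Multi-modal deep learning baselines; 3) Evidence that multi-modal inputs are superior for high-resolution wildfire forecasting.
Method: Combines high-resolution satellite imagery (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental data (ERA5), and runs experiments with two major deep learning architectures.
Result: Multi-modal inputs perform best on the 2023 wildfire season, reaching an F1 score of 60.3% and showing the potential of high-resolution models.
Insight: Fusing multi-modal data can markedly improve the accuracy and resolution of wildfire forecasts, providing a new tool for continental-scale wildfire management.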
Abstract: In 2023, Canada experienced one of the most severe wildfire seasons in recent
history, causing damage across ecosystems, destroying communities, and emitting
large quantities of CO2. This extreme wildfire season is symptomatic of a
climate-change-induced increase in the length and severity of the fire season
that affects the boreal ecosystem. Therefore, it is critical to empower
wildfire management in boreal communities with better mitigation solutions.
Wildfire probability maps represent an important tool for understanding the
likelihood of wildfire occurrence and the potential severity of future
wildfires. The massive increase in the availability of Earth observation data
has enabled the development of deep learning-based wildfire forecasting models,
aiming at providing precise wildfire probability maps at different spatial and
temporal scales. A main limitation of such methods is their reliance on
coarse-resolution environmental drivers and satellite products, leading to
wildfire occurrence predictions of reduced resolution, typically around
$0.1^\circ$. This paper presents a benchmark dataset, CanadaFireSat, and
baseline methods for high-resolution (100 m) wildfire forecasting across Canada,
leveraging multi-modal data from high-resolution multi-spectral satellite
images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and
environmental factors (ERA5 reanalysis data). Our experiments consider two
major deep learning architectures. We observe that using multi-modal temporal
inputs outperforms single-modal temporal inputs across all metrics, achieving a
peak performance of 60.3% in F1 score for the 2023 wildfire season, a season
never seen during model training. This demonstrates the potential of
multi-modal deep learning models for wildfire forecasting at high-resolution
and continental scale.
[83] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism
Congzhi Zhang,Jiawei Peng,Zhenglin Wang,Yilong Lai,Haowen Sun,Heng Chang,Fei Ma,Weijiang Yu
Main category: cs.CV
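TL;DR: VReST uses Monte Carlo Tree Search and a self-reward mechanism to improve large vision-language models (LVLMs) on complex visual reasoning tasks without any training, achieving state-of-the-art results on multimodal mathematical reasoning benchmarks.
Details
Motivation: Existing LVLMs perform well on multimodal tasks but remain limited in complex visual reasoning, especially when using Chain-of-Thought prompting.
Contribution: The main contribution is VReST, which uses Monte Carlo Tree Search and a self-reward mechanism to markedly improve LVLM reasoning and achieves the best results on several benchmarks.
Method: VReST builds a search tree to traverse the reasoning space. Each node represents a reasoning step and each path a complete reasoning sequence, and a multimodal self-reward mechanism assesses reasoning quality.
Result: VReST surpasses current state-of-the-art prompting methods on multimodal mathematical reasoning benchmarks and supports the effectiveness of test-time scaling laws in multimodal tasks.
Insight: VReST offers a new direction for improving LVLM reasoning without additional models and shows the potential of test-time optimization in multimodal tasks.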
Abstract: Large Vision-Language Models (LVLMs) have shown exceptional performance in
multimodal tasks, but their effectiveness in complex visual reasoning is still
constrained, especially when employing Chain-of-Thought prompting techniques.
In this paper, we propose VReST, a novel training-free approach that enhances
Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms.
VReST meticulously traverses the reasoning landscape by establishing a search
tree, where each node encapsulates a reasoning step, and each path delineates a
comprehensive reasoning sequence. Our innovative multimodal Self-Reward
mechanism assesses the quality of reasoning steps by integrating the utility of
sub-questions, answer correctness, and the relevance of vision-language clues,
all without the need for additional models. VReST surpasses current prompting
methods and secures state-of-the-art performance across three multimodal
mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy
of test-time scaling laws in multimodal tasks, offering a promising direction
for future research.
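For readers unfamiliar with Monte Carlo Tree Search, the sketch below shows two of its core steps as they might apply to a tree of reasoning steps: choosing which child step to expand with the UCT rule, and propagating a reward back along the path. The dictionary-based node layout, the exploration constant, and the scalar `reward` (standing in for the paper's multimodal self-reward) are assumptions for illustration, not details taken from VReST.

```python
import math

def uct_select(children, exploration=1.4):
    """Return the index of the child reasoning step with the highest UCT score.

    Each child is a dict with 'visits' and 'value_sum'. Unvisited children
    are always tried first.
    """
    total_visits = sum(child["visits"] for child in children)

    def uct(child):
        if child["visits"] == 0:
            return float("inf")
        mean_value = child["value_sum"] / child["visits"]
        bonus = exploration * math.sqrt(math.log(total_visits) / child["visits"])
        return mean_value + bonus

    return max(range(len(children)), key=lambda i: uct(children[i]))

def backpropagate(path, reward):
    """Add a reward (assumed to come from a self-reward score) to every node on the path."""
    for node in path:
        node["visits"] += 1
        node["value_sum"] += reward

# Two candidate next steps with equal visit counts but different average rewards.
steps = [
    {"visits": 10, "value_sum": 9.0},
    {"visits": 10, "value_sum": 2.0},
]
chosen = uct_select(steps)
```

Repeating selection, expansion, evaluation, and backpropagation lets extra inference-time compute be spent on the most promising reasoning paths, which is how tree search connects to test-time scaling.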
[84] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
Mohammadreza Salehi,Shashanka Venkataramanan,Ioana Simion,Efstratios Gavves,Cees G. M. Snoek,Yuki M Asano
Main category: cs.CV
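TL;DR: Proposes MoSiC, a motion-trajectory-based self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations, improving robustness in dynamic scenes.
Details
Motivation: Existing self-supervised methods rely on static augmentations and struggle with object deformation, occlusion, and camera motion, leading to inconsistent feature learning. A motion-guided approach is needed to learn more robust spatiotemporal representations.
Contribution: 1. The MoSiC framework, which uses motion trajectories as a supervisory signal to learn spatiotemporally consistent representations; 2. Feature clustering optimized with a momentum encoder and an optimal transport mechanism; 3. Gains of 1% to 6% over existing methods on several datasets.
Method: 1. Extract long-range motion trajectories with an off-the-shelf point tracker; 2. Optimize feature clustering with a momentum encoder; 3. Propagate cluster assignments through an optimal transport mechanism to enforce feature consistency.
Result: Improves on existing methods by 1% to 6% across six image and video datasets and four evaluation benchmarks.
Insight: Motion trajectories are an effective self-supervised signal that helps models learn more robust features in dynamic scenes and under occlusion.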
Abstract: Dense self-supervised learning has shown great promise for learning pixel-
and patch-level representations, but extending it to videos remains challenging
due to the complexity of motion dynamics. Existing approaches struggle as they
rely on static augmentations that fail under object deformations, occlusions,
and camera movement, leading to inconsistent feature learning over time. We
propose a motion-guided self-supervised learning framework that clusters dense
point tracks to learn spatiotemporally consistent representations. By
leveraging an off-the-shelf point tracker, we extract long-range motion
trajectories and optimize feature clustering through a momentum-encoder-based
optimal transport mechanism. To ensure temporal coherence, we propagate cluster
assignments along tracked points, enforcing feature consistency across views
despite viewpoint changes. Integrating motion as an implicit supervisory
signal, our method learns representations that generalize across frames,
improving robustness in dynamic scenes and challenging occlusion scenarios. By
initializing from strong image-pretrained models and leveraging video data for
training, we improve state-of-the-art by 1% to 6% on six image and video
datasets and four evaluation benchmarks. The implementation is publicly
available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
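The abstract above mentions an optimal transport mechanism for feature clustering. One common way to compute such balanced cluster assignments is the Sinkhorn-Knopp iteration; the sketch below assumes that choice (the abstract does not confirm it), and the temperature `eps`, the iteration count, and the toy score matrix are illustrative only.

```python
import math

def sinkhorn(scores, eps=0.05, n_iters=50):
    """Balanced soft assignment of N points to K clusters via Sinkhorn-Knopp.

    scores[i][k] is the similarity of point i to cluster prototype k
    (assumed to lie in a small range, e.g. cosine similarities).
    Returns Q where each row sums to 1 and each column carries about N/K mass.
    """
    n, k = len(scores), len(scores[0])
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - m) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives total mass 1/K.
        for j in range(k):
            col = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] /= col * k
        # Row step: every point carries total mass 1/N.
        for i in range(n):
            row = sum(q[i])
            for j in range(k):
                q[i][j] /= row * n
    return [[v * n for v in row] for row in q]  # rescale so each row sums to 1

# Three points prefer cluster 0 and one prefers cluster 1; the balancing
# constraint still forces both clusters to receive about two points of mass.
assignments = sinkhorn([[1, 0], [1, 0], [1, 0], [0, 1]], eps=0.5)
```

The equal-mass constraint is what prevents the degenerate solution in which every point collapses into a single cluster, which is the usual reason for preferring optimal transport over a plain softmax in clustering-based self-supervision.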
[85] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering
Xiaohan Zhang,Sitong Wang,Yushen Yan,Yi Yang,Mingda Xu,Qi Liu
Main category: cs.CV
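TL;DR: TraGraph-GS proposes a trajectory-graph-based method for high-quality novel view synthesis of large-scale scenes, addressing the limitations of existing methods in adapting to camera trajectories and in handling Gaussian overlap.
Details
Motivation: Novel view synthesis for large-scale scenes is challenging because camera trajectories are arbitrary and Gaussians overlap. Existing methods reconstruct the scene per partition and merge the partitions for rendering, with unsatisfactory results. Contribution: 1. Proposes a trajectory-graph-based spatial partitioning method that adapts to arbitrary camera trajectories. 2. Introduces a regularization constraint and a progressive rendering strategy to address Gaussian overlap.
Method: Partition space with a trajectory graph and combine a regularization constraint with a progressive rendering strategy to improve the rendering of textures and distant objects.
Result: On four aerial and four ground datasets, average PSNR improves by 1.86 dB and 1.62 dB, respectively, over the current best methods.
Insight: Flexible trajectory-graph partitioning and progressive rendering can effectively resolve Gaussian overlap and texture distortion in large-scale scene rendering.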
Abstract: High-quality novel view synthesis for large-scale scenes presents a
challenging dilemma in 3D computer vision. Existing methods typically partition
large scenes into multiple regions, reconstruct a 3D representation using
Gaussian splatting for each region, and eventually merge them for novel view
rendering. They can accurately render specific scenes, yet they do not
generalize effectively for two reasons: (1) rigid spatial partition techniques
struggle with arbitrary camera trajectories, and (2) the merging of regions
results in Gaussian overlap that distorts texture details. To address these
challenges, we propose TraGraph-GS, leveraging a trajectory graph to enable
high-precision rendering for arbitrarily large-scale scenes. We present a
spatial partitioning method for large-scale scenes based on graphs, which
incorporates a regularization constraint to enhance the rendering of textures
and distant objects, as well as a progressive rendering strategy to mitigate
artifacts caused by Gaussian overlap. Experimental results demonstrate its
superior performance both on four aerial and four ground datasets and highlight
its remarkable efficiency: our method achieves an average improvement of 1.86
dB in PSNR on aerial datasets and 1.62 dB on ground datasets compared to
state-of-the-art approaches.
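To put the PSNR gains quoted in the abstract above into perspective, the standard PSNR definition can be inverted to see what a gain in dB means for mean squared error. This is a small sketch for a single image with pixel values in [0, 1]; the function names are illustrative, and an average gain over datasets does not map exactly onto an average MSE ratio.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

aerial_ratio = mse_ratio_for_gain(1.86)  # gain reported on aerial datasets
ground_ratio = mse_ratio_for_gain(1.62)  # gain reported on ground datasets
```

For a single image, a 1.86 dB gain corresponds to roughly a 35% reduction in MSE and a 1.62 dB gain to roughly a 31% reduction.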
[86] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
Mengjiao Ma,Qi Ma,Yue Li,Jiahuan Cheng,Runyi Yang,Bin Ren,Nikola Popovic,Mingqiang Wei,Nicu Sebe,Luc Van Gool,Theo Gevers,Martin R. Oswald,Danda Pani Paudel
Main category: cs.CV
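TL;DR: The paper presents SceneSplat++, a large-scale dataset and comprehensive benchmark for Language Gaussian Splatting that addresses the limitations of existing work in 3D scene understanding and demonstrates the advantages of generalizable methods.
Details
Motivation: Existing Language Gaussian Splatting methods are evaluated mainly on a small number of scenes and viewpoints and give little insight into holistic 3D scene understanding, so a large-scale benchmark and dataset are needed to advance research. Contribution: 1) Proposes the first large-scale benchmark, covering 1060 scenes across four datasets; 2) Introduces the GaussianWorld-49K dataset, containing around 49K diverse scenes; 3) Shows the advantages of generalizable methods in fast inference and segmentation performance.
Method: Proposes a comprehensive benchmark that systematically evaluates three groups of Language Gaussian Splatting methods (per-scene optimization-based, per-scene optimization-free, and generalizable) and validates the generalizable approach on the new dataset.
Result: Benchmark results show that the generalizable approach performs best in relaxing scene-specific limitations, in fast inference, and in segmentation performance.
Insight: Generalizable methods can substantially improve 3D scene understanding by exploiting strong data priors; large-scale datasets and benchmarks are key to advancing the field.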
Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient
encoding of scene geometry, appearance, and semantics. Moreover, grounding
language in 3D scenes has proven to be an effective strategy for 3D scene
understanding. Current work on Language Gaussian Splatting falls into three
main groups: (i) per-scene optimization-based, (ii) per-scene
optimization-free, and (iii) generalizable approaches. However, most of them are
evaluated only on rendered 2D views of a handful of scenes and viewpoints close
to the training views, limiting ability and insight into holistic 3D
understanding. To address this gap, we propose the first large-scale benchmark
that systematically assesses these three groups of methods directly in 3D
space, evaluating on 1060 scenes across three indoor datasets and one outdoor
dataset. Benchmark results demonstrate a clear advantage of the generalizable
paradigm, particularly in relaxing the scene-specific limitation, enabling fast
feed-forward inference on novel scenes, and achieving superior segmentation
performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS
dataset comprising around 49K diverse indoor and outdoor scenes obtained from
multiple sources, with which we demonstrate that the generalizable approach can
harness strong data priors. Our code, benchmark, and datasets will be made
public to accelerate research in generalizable 3DGS scene understanding.
[87] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces
Dieuwertje Alblas,Patryk Rygiel,Julian Suk,Kaj O. Kappe,Marieke Hofman,Christoph Brune,Kak Khee Yeung,Jelmer M. Wolterink
Main category: cs.CV
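TL;DR: The paper proposes a geometric deep learning model based on an SE(3)-symmetric transformer to predict local growth of abdominal aortic aneurysms (AAAs), with the aim of improving personalized surveillance strategies.
Details
Motivation: Current clinical guidelines set surveillance intervals based only on the maximum AAA diameter and ignore the relation between 3D shape and growth, which may make surveillance suboptimal. A method that predicts local growth is therefore needed. Contribution: Proposes a model based on multi-physical features on the vascular surface and an SE(3)-symmetric transformer that preserves the vessel's anatomical structure and geometric fidelity and enables local growth prediction.
Method: An SE(3)-symmetric transformer is trained on longitudinal CTA scans, taking the vascular surface as input and predicting AAA growth. The model is trained on data from 24 patients and tested on an external validation set.
Result: The model predicts AAA growth with a median diameter error of 1.18 mm and predicts with 93% accuracy whether a patient will become eligible for surgical repair within two years. Results on the external validation set also indicate generalization ability.
Insight: Local growth prediction combined with 3D shape information can support more personalized surveillance strategies; the SE(3)-symmetric design helps maintain geometric consistency and improves prediction accuracy.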
Abstract: Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the
abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current
clinical guidelines recommend elective surgical repair when the maximum AAA
diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet
these criteria are periodically monitored, with surveillance intervals based on
the maximum AAA diameter. However, this diameter does not take into account the
complex relation between the 3D AAA shape and its growth, making standardized
intervals potentially unfit. Personalized AAA growth predictions could improve
monitoring strategies. We propose to use an SE(3)-symmetric transformer model
to predict AAA growth directly on the vascular model surface enriched with
local, multi-physical features. In contrast to other works which have
parameterized the AAA shape, this representation preserves the vascular
surface’s anatomical structure and geometric fidelity. We train our model using
a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24
AAA patients at irregularly sampled intervals. After training, our model
predicts AAA growth to the next scan moment with a median diameter error of
1.18 mm. We further demonstrate our model’s utility to identify whether a
patient will become eligible for elective repair within two years (acc = 0.93).
Finally, we evaluate our model’s generalization on an external validation set
consisting of 25 CTAs from 7 AAA patients from a different hospital. Our
results show that local directional AAA growth prediction from the vascular
surface is feasible and may contribute to personalized surveillance strategies.
[88] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba
Yuhang Wang,Jun Li,Zhijian Wu,Jianhua Xu
Main category: cs.CV
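TL;DR: The paper proposes InceptionMamba, an efficient hybrid network architecture that improves the local and global modeling ability of InceptionNeXt through large band convolutions and a bottleneck Mamba module, and performs strongly on image classification and downstream tasks.
Details
Motivation: Although InceptionNeXt performs well with its one-dimensional strip convolutions, it is limited in modeling spatial dependencies and in exploring local neighborhoods, and the locality constraint of convolution hinders global context modeling. The paper therefore aims to propose a more efficient architecture that addresses these issues. Contribution: Proposes the InceptionMamba architecture, which improves spatial modeling with orthogonal band convolutions and introduces a bottleneck Mamba module for global context modeling and cross-channel information fusion.
Method: Replace the traditional one-dimensional strip convolutions with orthogonal band convolutions to strengthen spatial modeling; introduce a bottleneck Mamba module to enlarge the receptive field and promote global information fusion.
Result: InceptionMamba performs strongly on image classification and several downstream tasks, with better parameter and computational efficiency than existing methods.
Insight: A hybrid design that combines band convolutions with a Mamba module can substantially improve local and global modeling while remaining computationally efficient. The source code will be released.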
Abstract: Within the family of convolutional neural networks, InceptionNeXt has shown
excellent competitiveness in image classification and a number of downstream
tasks. Built on parallel one-dimensional strip convolutions, however, it
suffers from a limited ability to capture spatial dependencies along different
dimensions and fails to fully explore spatial modeling in local neighborhoods.
Besides, inherent locality constraints of convolution operations are
detrimental to effective global context modeling. To overcome these
limitations, we propose a novel backbone architecture termed InceptionMamba in
this study. More specifically, the traditional one-dimensional strip
convolutions are replaced by orthogonal band convolutions in our InceptionMamba
to achieve cohesive spatial modeling. Furthermore, global contextual modeling
can be achieved via a bottleneck Mamba module, facilitating enhanced
cross-channel information fusion and enlarged receptive field. Extensive
evaluations on classification and various downstream tasks demonstrate that the
proposed InceptionMamba achieves state-of-the-art performance with superior
parameter and computational efficiency. The source code will be available at
https://github.com/Wake1021/InceptionMamba.
[89] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation
Jiayi Song,Kaiyu Li,Xiangyong Cao,Deyu Meng
Main category: cs.CV
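TL;DR: RS-MTDF proposes a semi-supervised remote sensing semantic segmentation framework based on multi-teacher distillation and fusion. It uses pre-trained Vision Foundation Models (VFMs) as teachers and improves the student model through feature-level distillation and knowledge fusion, achieving state-of-the-art results on multiple datasets.
Details
Motivation: Remote sensing semantic segmentation depends on large amounts of high-quality annotated data, which are expensive to obtain. Semi-supervised learning can ease this problem, but existing methods handle the distribution mismatch between labeled and unlabeled data poorly. Vision Foundation Models (VFMs) generalize well and can provide semantic priors for semi-supervised learning. Contribution: 1. Proposes the RS-MTDF framework, the first to apply multi-teacher (VFM) distillation and fusion to semi-supervised remote sensing semantic segmentation; 2. Substantially improves model performance through feature-level distillation and knowledge fusion; 3. Validates the method on three challenging datasets.
Method: 1. Use multiple pre-trained VFMs (e.g., DINOv2 and CLIP) as teacher models; 2. Align student and teacher feature representations through feature-level distillation; 3. Fuse the distilled knowledge into the decoder to strengthen discriminative power.
Result: Achieves state-of-the-art performance on ISPRS Potsdam, LoveDA, and DeepGlobe. On LoveDA in particular, it outperforms existing methods across label ratios and attains the highest IoU in most semantic categories.
Insight: The generalization ability of Vision Foundation Models can substantially improve semi-supervised learning, and multi-teacher distillation with knowledge fusion is an effective semi-supervised strategy.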
Abstract: Semantic segmentation in remote sensing images is crucial for various
applications, yet its performance is heavily reliant on large-scale,
high-quality pixel-wise annotations, which are notoriously expensive and
time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a
promising alternative to mitigate this data dependency. However, existing SSS
methods often struggle with the inherent distribution mismatch between limited
labeled data and abundant unlabeled data, leading to suboptimal generalization.
We propose that Vision Foundation Models (VFMs), pre-trained on vast and
diverse datasets, possess robust generalization capabilities that can
effectively bridge this distribution gap and provide strong semantic priors for
SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and
Fusion), a novel framework that leverages the powerful semantic knowledge
embedded in VFMs to guide semi-supervised learning in remote sensing.
Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and
CLIP) as expert teachers, utilizing feature-level distillation to align student
features with their robust representations. To further enhance discriminative
power, the distilled knowledge is seamlessly fused into the student decoder.
Extensive experiments on three challenging remote sensing datasets (ISPRS
Potsdam, LoveDA, and DeepGlobe) demonstrate that RS-MTDF consistently achieves
state-of-the-art performance. Notably, our method outperforms existing
approaches across various label ratios on LoveDA and secures the highest IoU in
the majority of semantic categories. These results underscore the efficacy of
multi-teacher VFM guidance in significantly enhancing both generalization and
semantic understanding for remote sensing segmentation. Ablation studies
further validate the contribution of each proposed module.
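The feature-level distillation described in the abstract above can be sketched as a loss that pulls student features toward the features of several frozen teachers. The cosine-distance form, the equal weighting of teachers, and the assumption that all features were already projected to a common dimension are illustrative choices, not details confirmed by the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multi_teacher_distill_loss(student_feats, teacher_feats_list):
    """Mean (1 - cosine similarity) between student features and each teacher's features.

    student_feats: list of feature vectors, one per spatial location.
    teacher_feats_list: one such list per frozen teacher (hypothetical
    stand-ins for models like DINOv2 or CLIP), aligned location by location.
    """
    total, count = 0.0, 0
    for teacher_feats in teacher_feats_list:
        for s, t in zip(student_feats, teacher_feats):
            total += 1.0 - cosine(s, t)
            count += 1
    return total / count

student = [[1.0, 0.0], [0.0, 1.0]]
teacher_a = [[2.0, 0.0], [0.0, 3.0]]   # same directions as the student
teacher_b = [[0.0, 1.0], [0.0, 1.0]]   # disagrees at the first location
loss = multi_teacher_distill_loss(student, [teacher_a, teacher_b])
```

Because cosine distance ignores feature magnitude, only the direction of the student features is aligned with the teachers; an L2 loss would be a reasonable alternative if magnitudes matter.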
[90] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting
Keyi Liu,Weidong Yang,Ben Fei,Ying He
Main category: cs.CV
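TL;DR: Proposes Gaussian2Scene, a scene-level self-supervised learning framework based on 3D Gaussian Splatting that improves 3D geometric understanding and cross-modal alignment through a two-stage training strategy and outperforms existing methods.
Details
Motivation: Address the reliance of existing scene-level self-supervised methods on implicit representations and their high memory demands, as well as their difficulty in capturing the underlying 3D geometric structure. Contribution: 1. Uses efficient 3D Gaussian Splatting (3DGS) to reduce the computational burden and support direct 3D scene reconstruction; 2. Proposes a two-stage training strategy that combines a masked autoencoder with geometric supervision to strengthen geometric and cross-modal learning.
Method: 1. Stage one: a dual-branch masked autoencoder learns 2D and 3D scene representations; 2. Stage two: training is initialized with reconstructed point clouds and supervised by the geometric locations of Gaussian primitives and rendered RGB images.
Result: Outperforms existing pre-training methods on several 3D object detection tasks.
Insight: The explicit representation of 3D Gaussian Splatting provides an efficient geometric prior for scene-level pre-training, and the two-stage strategy effectively combines geometric and visual information.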
Abstract: Self-supervised learning (SSL) for point cloud pre-training has become a
cornerstone for many 3D vision tasks, enabling effective learning from
large-scale unannotated data. At the scene level, existing SSL methods often
incorporate volume rendering into the pre-training framework, using RGB-D
images as reconstruction signals to facilitate cross-modal learning. This
strategy promotes alignment between 2D and 3D modalities and enables the model
to benefit from rich visual cues in the RGB-D inputs. However, these approaches
are limited by their reliance on implicit scene representations and high memory
demands. Furthermore, since their reconstruction objectives are applied only in
2D space, they often fail to capture underlying 3D geometric structures. To
address these challenges, we propose Gaussian2Scene, a novel scene-level SSL
framework that leverages the efficiency and explicit nature of 3D Gaussian
Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the
computational burden associated with volume rendering but also supports direct
3D scene reconstruction, thereby enhancing the geometric understanding of the
backbone network. Our approach follows a progressive two-stage training
strategy. In the first stage, a dual-branch masked autoencoder learns both 2D
and 3D scene representations. In the second stage, we initialize training with
reconstructed point clouds and further supervise learning using the geometric
locations of Gaussian primitives and rendered RGB images. This process
reinforces both geometric and cross-modal learning. We demonstrate the
effectiveness of Gaussian2Scene across several downstream 3D object detection
tasks, showing consistent improvements over existing pre-training methods.
[91] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation
Ziyao Huang,Zixiang Zhou,Juan Cao,Yifeng Ma,Yi Chen,Zejing Rao,Zhiyong Xu,Hongmei Wang,Qin Lin,Yuan Zhou,Qinglin Lu,Fan Tang
Main category: cs.CV
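TL;DR: HunyuanVideo-HOMA is a weakly conditioned, multimodal-driven framework for generating generic human-object interaction (HOI) videos. Sparse, decoupled motion guidance and dual-input-space encoding improve the temporal consistency and physical plausibility of the generated videos.
Details
Motivation: Existing HOI video generation methods depend on carefully curated motion data, generalize poorly to novel objects or scenarios, and have limited accessibility. HunyuanVideo-HOMA aims to address these issues with a weakly conditioned, multimodal-driven approach. Contribution: 1) Proposes a weakly conditioned multimodal-driven framework that improves controllability and reduces dependence on precise inputs; 2) Encodes signals into the dual input space of a multimodal diffusion transformer (MMDiT) to generate temporally consistent interactions; 3) Designs a parameter-space HOI adapter and a facial cross-attention adapter to optimize training and improve lip synchronization accuracy.
Method: 1) Use sparse, decoupled motion guidance; 2) Encode appearance and motion signals into the dual input space of the MMDiT and fuse them within a shared context space; 3) Optimize training with the HOI adapter and the facial cross-attention adapter.
Result: Experiments show that HunyuanVideo-HOMA achieves state-of-the-art performance in interaction naturalness and in generalization under weak supervision, while also supporting text-conditioned generation and interactive object manipulation.
Insight: Decoupled, sparse motion guidance combined with multimodal signal encoding can improve the flexibility and generalization of HOI video generation while preserving physical plausibility.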
Abstract: To address key limitations in human-object interaction (HOI) video generation
– specifically the reliance on curated motion data, limited generalization to
novel objects/scenarios, and restricted accessibility – we introduce
HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework.
HunyuanVideo-HOMA enhances controllability and reduces dependency on precise
inputs through sparse, decoupled motion guidance. It encodes appearance and
motion signals into the dual input space of a multimodal diffusion transformer
(MMDiT), fusing them within a shared context space to synthesize temporally
consistent and physically plausible interactions. To optimize training, we
integrate a parameter-space HOI adapter initialized from pretrained MMDiT
weights, preserving prior knowledge while enabling efficient adaptation, and a
facial cross-attention adapter for anatomically accurate audio-driven lip
synchronization. Extensive experiments confirm state-of-the-art performance in
interaction naturalness and generalization under weak supervision. Finally,
HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and
interactive object manipulation, supported by a user-friendly demo interface.
The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.
[92] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought
Shuyi Zhang,Xiaoshuai Hao,Yingbo Tang,Lingfeng Zhang,Pengwei Wang,Zhongyuan Wang,Hongxuan Ma,Shanghang Zhang
Main category: cs.CV
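TL;DR: Video-CoT is a new dataset for spatiotemporal video understanding. It uses Chain-of-Thought (CoT) methods to provide fine-grained question-answer pairs and annotated samples, with the aim of improving large-scale vision-language models (VLMs) in video analysis.
Details
Motivation: Existing large-scale vision-language models struggle to capture complex spatiotemporal details in video analysis, so a more comprehensive dataset and evaluation standard are needed to advance research in this area. Contribution: Introduces the Video-CoT dataset, containing 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, together with a comprehensive benchmark.
Method: Fine-grained spatiotemporal question-answer pairs are generated with Chain-of-Thought (CoT) methods; the benchmark features 750 images per task and tailored evaluation metrics.
Result: Experiments show that current VLMs perform poorly on spatiotemporal video understanding tasks, which highlights how challenging the task is.
Insight: Video-CoT provides a new foundation for research on multimedia video understanding and paves the way for intelligent systems that require high-precision video analysis.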
Abstract: Video content comprehension is essential for various applications, ranging
from video analysis to interactive systems. Despite advancements in large-scale
vision-language models (VLMs), these models often struggle to capture the
nuanced, spatiotemporal details essential for thorough video analysis. To
address this gap, we introduce Video-CoT, a groundbreaking dataset designed to
enhance spatiotemporal understanding using Chain-of-Thought (CoT)
methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal
question-answer pairs and 23,000 high-quality CoT-annotated samples, providing
a solid foundation for evaluating spatiotemporal understanding in video
comprehension. Additionally, we provide a comprehensive benchmark for assessing
these tasks, with each task featuring 750 images and tailored evaluation
metrics. Our extensive experiments reveal that current VLMs face significant
challenges in achieving satisfactory performance, highlighting the
difficulties of effective spatiotemporal understanding. Overall, the Video-CoT
dataset and benchmark open new avenues for research in multimedia understanding
and support future innovations in intelligent systems requiring advanced video
analysis capabilities. By making these resources publicly available, we aim to
encourage further exploration in this critical area. Project
website: https://video-cot.github.io/
[93] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak,Mehar Bhatia,Xiaofeng Zhang,Verena Rieser,Lisa Anne Hendricks,Sjoerd van Steenkiste,Yash Goyal,Karolina Stańczak,Aishwarya Agrawal
Main category: cs.CV
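TL;DR: The paper is the first to systematically quantify how well text-to-image (T2I) models and evaluation metrics align with explicit and implicit cultural expectations, and it introduces the CulturalFrames benchmark for this evaluation. T2I models show clear shortcomings in cultural representation, and existing evaluation metrics correlate poorly with human judgments.
Details
Motivation: As T2I models become widespread in visual content generation, their accuracy across diverse cultural contexts has raised concerns. The paper aims to fill the research gap on the alignment of T2I models with cultural expectations. Contribution: Introduces the CulturalFrames benchmark, which spans 10 countries and 5 socio-cultural domains and comprises 983 prompts, 3637 images, and over 10k human annotations; exposes the shortcomings of T2I models in cultural representation and the limitations of evaluation metrics.
Method: Using the CulturalFrames benchmark with human annotation and statistical analysis, the paper quantifies how well T2I models align with explicit and implicit cultural expectations and assesses the reliability of existing metrics.
Result: T2I models miss cultural expectations 44% of the time on average; among these failures, explicit expectations are missed at an average rate of 68% and implicit expectations at 49%. Existing metrics correlate poorly with human judgments.
Insight: The study exposes the lack of cultural sensitivity in T2I models and evaluation methods and points to directions for developing more culturally aware models and evaluation methods.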
Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual
content generation raises concerns about their ability to accurately represent
diverse cultural contexts. In this work, we present the first study to
systematically quantify the alignment of T2I models and evaluation metrics with
respect to both explicit as well as implicit cultural expectations. To this
end, we introduce CulturalFrames, a novel benchmark designed for rigorous human
evaluation of cultural representation in visual generations. Spanning 10
countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts,
3637 corresponding images generated by 4 state-of-the-art T2I models, and over
10k detailed human annotations. We find that T2I models fail to meet not only
the more challenging implicit expectations but also the less challenging
explicit expectations. Across models and countries, cultural expectations are
missed an average of 44% of the time. Among these failures, explicit
expectations are missed at a surprisingly high average rate of 68%, while
implicit expectation failures are also significant, averaging 49%. Furthermore,
we demonstrate that existing T2I evaluation metrics correlate poorly with human
judgments of cultural alignment, irrespective of their internal reasoning.
Collectively, our findings expose critical gaps, providing actionable
directions for developing more culturally informed T2I models and evaluation
methodologies.
[94] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Jingguo Qu,Xinyang Han,Tonghuan Xiao,Jia Ai,Juan Wu,Tong Zhao,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying
Main category: cs.CV
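TL;DR: The paper proposes a domain adaptation approach for medical ultrasound image analysis that fine-tunes vision-language foundation models, using a large language model as a text refiner together with specially designed task-driven heads, and substantially improves model performance.
Details
Motivation: Annotating medical ultrasound images is time-consuming and requires expertise, and off-the-shelf vision-language foundation models show a performance gap between natural and medical images. Domain adaptation methods are therefore needed to improve their medical image analysis ability. Contribution: 1. Proposes a domain adaptation approach based on vision-language foundation models; 2. Uses a large language model to refine the text input; 3. Designs task-driven adaptation strategies and head structures.
Method: Fine-tune vision-language foundation models, refine the text with a large language model, and design task-driven adaptation strategies and heads (e.g., segmentation and classification heads); performance is validated on six ultrasound datasets.
Result: Experiments show that the method substantially improves vision-language foundation models on ultrasound image segmentation and classification and outperforms existing models.
Insight: With domain adaptation, vision-language foundation models can transfer effectively to medical image analysis; text refinement and task-driven design are the key factors.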
Abstract: Medical ultrasonography is an essential imaging technique for examining
superficial organs and tissues, including lymph nodes, breast, and thyroid. It
employs high-frequency ultrasound waves to generate detailed images of the
internal structures of the human body. However, manually contouring regions of
interest in these images is a labor-intensive task that demands expertise and
often results in inconsistent interpretations among individuals.
Vision-language foundation models, which have excelled in various computer
vision applications, present new opportunities for enhancing ultrasound image
analysis. Yet, their performance is hindered by the significant differences
between natural and medical imaging domains. This research seeks to overcome
these challenges by developing domain adaptation methods for vision-language
foundation models. In this study, we explore the fine-tuning pipeline for
vision-language foundation models by utilizing a large language model as a text
refiner with specially designed adaptation strategies and task-driven heads. Our
approach has been extensively evaluated on six ultrasound datasets and two
tasks: segmentation and classification. The experimental results show that our
method can effectively improve the performance of vision-language foundation
models for ultrasound image analysis, and outperform the existing
state-of-the-art vision-language and pure foundation models. The source code of
this study is available at https://github.com/jinggqu/NextGen-UIA.
[95] Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning
Junzhuo Liu,Markus Eckstein,Zhixiang Wang,Friedrich Feuerhake,Dorit Merhof
Main category: cs.CV
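TL;DR: The paper proposes a contrastive learning-based deep learning method that predicts spatial transcriptomics expression from whole-slide images. It improves gene expression prediction accuracy, applies to datasets with limited samples, and shows potential for cancer tissue localization.
Details
Motivation: Spatial transcriptomics data are expensive to acquire and hard to obtain at scale, so predicting gene expression levels from histopathology images is needed to reduce cost and broaden research potential. Contribution: 1. Proposes a deep learning method based on cross-modal mask reconstruction and contrastive learning; 2. Validates improved prediction of highly expressed genes, highly variable genes, and marker genes on six disease datasets.
Method: Combines cross-modal mask reconstruction with contrastive learning to predict gene expression from whole-slide images.
Result: The Pearson Correlation Coefficient (PCC) of the predictions improves by 6.27%, 6.11%, and 11.26% for highly expressed genes, highly variable genes, and marker genes, respectively.
Insight: The method preserves gene-gene correlations, applies to small-sample datasets, and shows potential for cancer tissue localization.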
Abstract: Spatial transcriptomics is a technology that captures gene expression levels
at different spatial locations, widely used in tumor microenvironment analysis
and molecular profiling of histopathology, providing valuable insights into
resolving gene expression and clinical diagnosis of cancer. Due to the high
cost of data acquisition, large-scale spatial transcriptomics data remain
challenging to obtain. In this study, we develop a contrastive learning-based
deep learning method to predict spatially resolved gene expression from
whole-slide images. Evaluation across six different disease datasets
demonstrates that, compared to existing studies, our method improves Pearson
Correlation Coefficient (PCC) in the prediction of highly expressed genes,
highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26%
respectively. Further analysis indicates that our method preserves gene-gene
correlations and applies to datasets with limited samples. Additionally, our
method exhibits potential in cancer tissue localization based on biomarker
expression.
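The evaluation metric named in the abstract above, the Pearson Correlation Coefficient, can be computed per gene across spatial spots and then averaged over a gene set. The sketch below shows that computation; the per-gene averaging and the toy values are assumptions about a typical protocol, not details taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_gene_pcc(predicted, measured):
    """Average PCC over genes; predicted[g] and measured[g] hold gene g's expression per spot."""
    scores = [pearson(p, m) for p, m in zip(predicted, measured)]
    return sum(scores) / len(scores)

# Toy data: gene 0 is predicted perfectly (up to scale), gene 1 is anti-correlated.
predicted = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
measured = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
score = mean_gene_pcc(predicted, measured)
```

Because PCC is invariant to scale and offset, it rewards getting the spatial pattern of a gene right even when absolute expression levels are off, which is one reason it is widely used for this task.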
[96] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
Zike Wu,Qi Yan,Xuanyu Yi,Lele Wang,Renjie Liao
Main category: cs.CV
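TL;DR: StreamSplat is the first feed-forward framework that converts uncalibrated video streams into dynamic 3D Gaussian Splatting representations in real time, addressing three key challenges of online dynamic 3D scene reconstruction.
Details
Motivation: Existing methods struggle to handle uncalibrated inputs, dynamic scene modeling, and long-term stability at the same time, which limits the use of real-time 3D reconstruction. Contribution: Proposes StreamSplat, the first method to perform online dynamic 3D reconstruction from uncalibrated video streams, and improves performance with probabilistic sampling and a bidirectional deformation field.
Method: Combines a probabilistic sampling mechanism in the static encoder with a bidirectional deformation field in the dynamic decoder for efficient dynamic modeling.
Result: Performs strongly on static and dynamic benchmarks and supports online reconstruction of video streams of arbitrary length.
Insight: A feed-forward architecture with temporally local dynamic modeling addresses the real-time and stability requirements of dynamic 3D reconstruction.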
Abstract: Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams
is crucial for numerous real-world applications. However, existing methods
struggle to jointly address three key challenges: 1) processing uncalibrated
inputs in real time, 2) accurately modeling dynamic scene evolution, and 3)
maintaining long-term stability and computational efficiency. To this end, we
introduce StreamSplat, the first fully feed-forward framework that transforms
uncalibrated video streams of arbitrary length into dynamic 3D Gaussian
Splatting (3DGS) representations in an online manner, capable of recovering
scene dynamics from temporally local observations. We propose two key technical
innovations: a probabilistic sampling mechanism in the static encoder for 3DGS
position prediction, and a bidirectional deformation field in the dynamic
decoder that enables robust and efficient dynamic modeling. Extensive
experiments on static and dynamic benchmarks demonstrate that StreamSplat
consistently outperforms prior works in both reconstruction quality and dynamic
scene modeling, while uniquely supporting online reconstruction of arbitrarily
long video streams. Code and models are available at
https://github.com/nickwzk/StreamSplat.
[97] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
Leqi Shen,Guoqiang Gong,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding
Main category: cs.CV
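TL;DR: DiscoVLA proposes a parameter-efficient video-text retrieval method that addresses discrepancies in vision, language, and alignment simultaneously and substantially improves performance.
Details
Motivation: Existing methods focus mainly on the vision discrepancy and neglect the language and alignment discrepancies, so transfer from the image level to the video level is suboptimal. Contribution: DiscoVLA is the first to address the vision, language, and alignment discrepancies together, improving video-text retrieval through feature fusion, pseudo image caption generation, and alignment distillation.
Method: 1. Image-video feature fusion; 2. Generation of pseudo image captions to learn fine-grained alignment; 3. Image-to-video alignment distillation.
Result: On MSRVTT with CLIP (ViT-B/16), DiscoVLA improves R@1 by 1.5% over previous methods, reaching 50.5%.
Insight: Comprehensive discrepancy reduction across vision, language, and alignment is crucial for video-text retrieval.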
Abstract: The parameter-efficient adaptation of the image-text pretraining model CLIP
for video-text retrieval is a prominent area of research. While CLIP is focused
on image-level vision-language matching, video-text retrieval demands
comprehensive understanding at the video level. Three key discrepancies emerge
in the transfer from image-level to video-level: vision, language, and
alignment. However, existing methods mainly focus on vision while neglecting
language and alignment. In this paper, we propose Discrepancy Reduction in
Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all
three discrepancies. Specifically, we introduce Image-Video Features Fusion to
integrate image-level and video-level features, effectively tackling both
vision and language discrepancies. Additionally, we generate pseudo image
captions to learn fine-grained image-level alignment. To mitigate alignment
discrepancies, we propose Image-to-Video Alignment Distillation, which
leverages image-level alignment knowledge to enhance video-level alignment.
Extensive experiments demonstrate the superiority of our DiscoVLA. In
particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous
methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is
available at https://github.com/LunarShen/DsicoVLA.
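The R@1 figure quoted in the abstract above is a retrieval recall metric. A minimal sketch of how Recall@K is typically computed from a text-to-video similarity matrix follows; it assumes the common convention that query i's ground-truth video is video i, and the toy similarity values are made up.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j;
    the matching video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy 3x3 similarity matrix: queries 0 and 2 retrieve their own video first,
# query 1 ranks its own video second.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(sim, k=1)
r_at_2 = recall_at_k(sim, k=2)
```

Under this definition, an R@1 of 50.5% means that for about half of the text queries the correct video is the single highest-scoring candidate.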
[98] Product of Experts for Visual Generation
Yunzhi Zhang,Carson Murtuza-Lanier,Zizhang Li,Yilun Du,Jiajun Wu
Main category: cs.CV
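TL;DR: The paper proposes a training-free method based on the Product of Experts (PoE) framework that composes knowledge from heterogeneous models via Annealed Importance Sampling (AIS) for visual generation, offering better controllability than a single model.
Details
Motivation: Current neural models hold rich priors and complementary knowledge over shared data domains such as images and videos. How to integrate diverse knowledge from visual generative models, visual language models, and sources of human-crafted knowledge (such as graphics engines and physics simulators) remains under-explored. Contribution: Proposes a training-free Product of Experts (PoE) framework that composes knowledge from heterogeneous models at inference time, with the composition realized through Annealed Importance Sampling (AIS).
Method: Within the PoE framework, samples are drawn from the product distribution over heterogeneous experts using AIS, composing their knowledge for visual generation tasks.
Result: Shows better controllability than monolithic models on image and video synthesis tasks and provides flexible user interfaces for specifying generation goals.
Insight: Composing knowledge from heterogeneous models can improve the controllability and flexibility of visual generation without additional training.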
Abstract: Modern neural models capture rich priors and have complementary knowledge
over shared data domains, e.g., images and videos. Integrating diverse
knowledge from multiple sources – including visual generative models, visual
language models, and sources with human-crafted knowledge such as graphics
engines and physics simulators – remains under-explored. We propose a Product
of Experts (PoE) framework that performs inference-time knowledge composition
from heterogeneous models. This training-free approach samples from the product
distribution across experts via Annealed Importance Sampling (AIS). Our
framework shows practical benefits in image and video synthesis tasks, yielding
better controllability than monolithic methods and additionally providing
flexible user interfaces for specifying visual generation goals.
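The product distribution mentioned in the abstract above multiplies the scores that each expert assigns to an outcome and renormalizes. The sketch below does this exactly over a tiny discrete space to show the effect; it is not Annealed Importance Sampling, which the paper uses because exact enumeration is infeasible for images and videos. The expert names and probabilities are hypothetical.

```python
def product_of_experts(*experts):
    """Combine experts by multiplying their scores per outcome and renormalizing.

    Each expert is a dict mapping an outcome to a nonnegative score.
    Only outcomes scored by every expert are kept.
    """
    support = set(experts[0])
    for expert in experts[1:]:
        support &= set(expert)
    unnormalized = {}
    for outcome in support:
        score = 1.0
        for expert in experts:
            score *= expert[outcome]
        unnormalized[outcome] = score
    z = sum(unnormalized.values())  # normalizing constant
    return {outcome: score / z for outcome, score in unnormalized.items()}

# A hypothetical generative prior and a hypothetical constraint expert
# (for example, a language-model check that the object should be blue).
generator = {"red cube": 0.5, "blue cube": 0.3, "red ball": 0.2}
constraint = {"red cube": 0.1, "blue cube": 0.8, "red ball": 0.1}
combined = product_of_experts(generator, constraint)
best = max(combined, key=combined.get)
```

An outcome needs reasonable support from every expert to keep probability mass in the product, which is why the combined distribution here favors "blue cube" even though the generator alone prefers "red cube".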
[99] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos
Negin Ghamsarian,Raphael Sznitman,Klaus Schoeffmann,Jens Kowal
Main category: cs.CV
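TL;DR: The paper introduces WetCat, a dataset of wetlab cataract surgery videos built specifically for automated skill assessment, filling a gap left by existing datasets.
Details
Motivation: Traditional wetlab training relies on manual evaluation, which is inefficient and subjective. Computer vision makes automated skill assessment possible, but existing datasets mostly cover real surgeries or isolated tasks. Contribution: Introduces WetCat, the first dataset of wetlab cataract surgery videos, with high-resolution recordings, phase annotations, and semantic segmentations of key anatomical structures, supporting standardized skill assessment frameworks.
Method: Surgeries performed by trainees on artificial eyes are recorded and annotated with detailed phase labels and semantic segmentations, focusing on skill assessment during the capsulorhexis and phacoemulsification phases.
Result: WetCat lays a foundation for developing interpretable, AI-driven evaluation tools and for more objective, scalable ophthalmic surgical training.
Insight: Annotating the critical surgical phases enables more precise skill assessment, and standardized frameworks help advance automated surgical education.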
Abstract: To meet the growing demand for systematic surgical training, wetlab
environments have become indispensable platforms for hands-on practice in
ophthalmology. Yet, traditional wetlab training depends heavily on manual
performance evaluations, which are labor-intensive, time-consuming, and often
subject to variability. Recent advances in computer vision offer promising
avenues for automated skill assessment, enhancing both the efficiency and
objectivity of surgical education. Despite notable progress in ophthalmic
surgical datasets, existing resources predominantly focus on real surgeries or
isolated tasks, falling short of supporting comprehensive skill evaluation in
controlled wetlab settings. To address these limitations, we introduce WetCat,
the first dataset of wetlab cataract surgery videos specifically curated for
automated skill assessment. WetCat comprises high-resolution recordings of
surgeries performed by trainees on artificial eyes, featuring comprehensive
phase annotations and semantic segmentations of key anatomical structures.
These annotations are meticulously designed to facilitate skill assessment
during the critical capsulorhexis and phacoemulsification phases, adhering to
standardized surgical skill assessment frameworks. By focusing on these
essential phases, WetCat enables the development of interpretable, AI-driven
evaluation tools aligned with established clinical metrics. This dataset lays a
strong foundation for advancing objective, scalable surgical education and sets
a new benchmark for automated workflow analysis and skill assessment in
ophthalmology training. The dataset and annotations are publicly available on
Synapse: https://www.synapse.org/Synapse:syn66401174/files.
[100] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
José Morano,Botond Fazekas,Emese Sükei,Ronald Fecso,Taha Emre,Markus Gumpinger,Georg Faustmann,Marzieh Oghbaie,Ursula Schmidt-Erfurth,Hrvoje Bogunović
Main category: cs.CV
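TL;DR: MIRAGE is a multimodal foundation model for comprehensive analysis of retinal OCT and SLO images, and a new evaluation benchmark demonstrates its advantage over existing models.
Details
Motivation: Existing foundation models for ophthalmology lack multimodal support and are insufficiently validated. MIRAGE was developed to address these limitations.
Contribution: Proposes the MIRAGE multimodal foundation model and an evaluation benchmark covering classification and segmentation tasks.
Method: Trains a multimodal foundation model on large-scale unlabeled data, combining OCT and SLO images for analysis.
Result: On classification and segmentation tasks, MIRAGE outperforms general and specialized foundation models as well as segmentation methods.
Insight: Multimodal foundation models show promise for medical image analysis, performing especially well when data diversity is limited.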
Abstract: Artificial intelligence (AI) has become a fundamental tool for assisting
clinicians in analyzing ophthalmic images, such as optical coherence tomography
(OCT). However, developing AI models often requires extensive annotation, and
existing models tend to underperform on independent, unseen data. Foundation
models (FMs), large AI models trained on vast unlabeled datasets, have shown
promise in overcoming these challenges. Nonetheless, available FMs for
ophthalmology lack extensive validation, especially for segmentation tasks, and
focus on a single imaging modality. In this context, we propose MIRAGE, a novel
multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO)
images. Additionally, we propose a new evaluation benchmark with OCT/SLO
classification and segmentation tasks. The comparison with general and
specialized FMs and segmentation methods shows the superiority of MIRAGE in
both types of tasks, highlighting its suitability as a basis for the
development of robust AI systems for retinal OCT image analysis. Both MIRAGE
and the evaluation benchmark are publicly available:
https://github.com/j-morano/MIRAGE.
[101] Inherently Faithful Attention Maps for Vision Transformers
Ananthu Aniraj,Cassio F. Dantas,Dino Ienco,Diego Marcos
Main category: cs.CV
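TL;DR: This paper proposes an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction, improving robustness to spurious correlations and out-of-distribution backgrounds.
Details
Motivation: Context strongly affects object perception and can lead to biased representations, especially when objects appear against out-of-distribution backgrounds. At the same time, many image-level tasks require identifying relevant regions, which often depends on context. This tension needs to be resolved.
Contribution: Proposes a two-stage framework: stage 1 processes the full image to discover object parts and task-relevant regions, and stage 2 uses input attention masking to restrict its receptive field, focusing the analysis and filtering out spurious information.
Method: The two stages are trained jointly: stage 1 identifies task-relevant regions, stage 2 focuses on them through attention masks, and both are optimized together to improve robustness.
Result: Significantly improves robustness to spurious correlations and out-of-distribution backgrounds across diverse benchmarks.
Insight: Restricting the receptive field with attention masks can filter out spurious information while retaining the context the task needs.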
Abstract: We introduce an attention-based method that uses learned binary attention
masks to ensure that only attended image regions influence the prediction.
Context can strongly affect object perception, sometimes leading to biased
representations, particularly when objects appear in out-of-distribution
backgrounds. At the same time, many image-level object-centric tasks require
identifying relevant regions, often requiring context. To address this
conundrum, we propose a two-stage framework: stage 1 processes the full image
to discover object parts and identify task-relevant regions, while stage 2
leverages input attention masking to restrict its receptive field to these
regions, enabling a focused analysis while filtering out potentially spurious
information. Both stages are trained jointly, allowing stage 2 to refine
stage 1. Extensive experiments across diverse benchmarks demonstrate that our
approach significantly improves robustness against spurious correlations and
out-of-distribution backgrounds.
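The input attention masking used in stage 2 can be illustrated with a minimal sketch. It assumes a single attention head over a handful of image patches and a hand-written binary keep-mask standing in for the learned one; the function name `masked_attention_weights` and the toy numbers are hypothetical and not taken from the paper.

```python
import math

def masked_attention_weights(scores, keep_mask):
    """Softmax over attention logits, restricted to patches where keep_mask is 1.

    scores: raw attention logits, one per image patch.
    keep_mask: 0/1 flags (a stand-in for the learned binary mask).
    Masked-out patches get exactly zero weight, so they cannot
    influence the attended output.
    """
    if len(scores) != len(keep_mask):
        raise ValueError("scores and keep_mask must have the same length")
    kept = [s for s, m in zip(scores, keep_mask) if m]
    if not kept:
        raise ValueError("at least one patch must be kept")
    top = max(kept)  # subtract the max logit for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, keep_mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: patches 1 and 3 are treated as background and masked out.
weights = masked_attention_weights([2.0, 0.5, 3.0, 1.0], [1, 0, 1, 0])
```

Because the masked patches receive a weight of exactly zero rather than a small one, nothing from those regions can leak into the prediction, which is the sense in which such attention maps are faithful.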
[102] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
David Acuna,Ximing Lu,Jaehun Jung,Hyunwoo Kim,Amlan Kar,Sanja Fidler,Yejin Choi
Main category: cs.CV
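TL;DR: The paper proposes Socratic-MCTS, which injects subquestion-subanswer pairs into non-reasoning models and uses Monte Carlo Tree Search (MCTS) to elicit hidden knowledge and induce long reasoning traces without additional training.
Details
Motivation: Most existing vision-language models (VLMs) are non-reasoning models that are already widely deployed, and simply abandoning them is unrealistic. The paper explores how a search mechanism can elicit their latent reasoning ability without additional training.
Contribution: Proposes Socratic-MCTS, which uses an MCTS framework with subquestions as latent decision points to help the model produce long reasoning traces, improving the performance of non-reasoning models.
Method: An MCTS algorithm inserts subquestion-subanswer pairs into the model output stream and treats them as latent decision points during search, guiding the model through the reasoning task.
Result: Delivers consistent improvements across three benchmarks, including a 2% overall gain on MMMU-PRO and a 9% gain in Liberal Arts.
Insight: Framing reasoning as a search problem makes effective use of the fragmented knowledge in existing non-reasoning models; guiding them with subquestions yields long reasoning traces and suggests a new direction for improving models.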
Abstract: Recent research in vision-language models (VLMs) has centered around the
possibility of equipping them with implicit long-form chain-of-thought
reasoning – akin to the success observed in language models – via
distillation and reinforcement learning. But what about the non-reasoning
models already trained and deployed across the internet? Should we simply
abandon them, or is there hope for a search mechanism that can elicit hidden
knowledge and induce long reasoning traces – without any additional training
or supervision? In this paper, we explore this possibility using a Monte Carlo
Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer
pairs into the model’s output stream. We show that framing reasoning as a
search process – where subquestions act as latent decisions within a broader
inference trajectory – helps the model “connect the dots” between fragmented
knowledge and produce extended reasoning traces in non-reasoning models. We
evaluate our method across three benchmarks and observe consistent
improvements. Notably, our approach yields a 2% overall improvement on
MMMU-PRO, including a significant 9% gain in Liberal Arts.
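To make the idea of subquestions as decision points in a search tree concrete, here is a minimal sketch of the selection step of standard MCTS using the UCT rule. The abstract only says the algorithm is MCTS-inspired and does not state its selection rule, so UCT, the exploration constant, and the dictionary layout of a candidate subquestion are all assumptions made for illustration.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: average value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try an unvisited subquestion first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_subquestion(children, parent_visits):
    """Pick the candidate subquestion with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits),
    )

# Hypothetical candidates; "value" is an accumulated reward and
# "visits" counts how often the branch has been explored.
candidates = [
    {"question": "What objects are in the image?", "value": 3.0, "visits": 4},
    {"question": "What does the chart axis measure?", "value": 1.0, "visits": 1},
]
chosen = select_subquestion(candidates, parent_visits=5)
```

In this toy case the less-visited subquestion is selected because its exploration bonus outweighs the higher average value of the other branch.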
[103] What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
Wendong Bu,Yang Wu,Qifan Yu,Minghe Gao,Bingchen Miao,Zhenkui Zhang,Kaihang Pan,Yunfei Li,Mengze Li,Wei Ji,Juncheng Li,Siliang Tang,Yueting Zhuang
Main category: cs.CV
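TL;DR: The paper proposes the OmniBench benchmark and the OmniEval evaluation framework for multidimensional evaluation of virtual agents based on multimodal large language models (MLLMs), addressing limitations of existing benchmarks.
Details
Motivation: Existing benchmarks fall short in controlling task complexity, in manual annotation cost, and in multidimensional evaluation, which limits the application and development of virtual agents.
Contribution: 1. Proposes OmniBench, a self-generating, cross-platform, graph-based benchmark; 2. Develops the OmniEval framework, supporting multidimensional evaluation across 10 capabilities; 3. Produces a dataset of 36k graph-structured tasks with a 91% human acceptance rate.
Method: 1. Controls task complexity through subtask composition; 2. Generates the graph-based benchmark automatically; 3. Uses a multidimensional evaluation framework that includes subtask-level evaluation and graph-based metrics.
Result: Experiments show that graph-structured data guides agents more efficiently than manually annotated data; open-source and closed-source models show marked performance differences in the multidimensional evaluation.
Insight: Graph-structured tasks and multidimensional evaluation give a fuller picture of virtual agent capabilities and point to new directions for future research.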
Abstract: As multimodal large language models (MLLMs) advance, MLLM-based virtual
agents have demonstrated remarkable performance. However, existing benchmarks
face significant limitations, including uncontrollable task complexity,
extensive manual annotation with limited scenarios, and a lack of
multidimensional evaluation. In response to these challenges, we introduce
OmniBench, a self-generating, cross-platform, graph-based benchmark with an
automated pipeline for synthesizing tasks of controllable complexity through
subtask composition. To evaluate the diverse capabilities of virtual agents on
the graph, we further present OmniEval, a multidimensional evaluation framework
that includes subtask-level evaluation, graph-based metrics, and comprehensive
tests across 10 capabilities. Our synthesized dataset contains 36k
graph-structured tasks across 20 scenarios, achieving a 91% human acceptance
rate. Training on our graph-structured data shows that it can more efficiently
guide agents compared to manually annotated data. We conduct multidimensional
evaluations for various open-source and closed-source models, revealing their
performance across various capabilities and paving the way for future
advancements. Our project is available at https://omni-bench.github.io/.
[104] SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation
Hongjie Zhu,Xiwei Liu,Rundong Xue,Zeyu Zhang,Yong Xu,Daji Ergu,Ying Cai,Yang Zhao
Main category: cs.CV
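TL;DR: Proposes SSS, a semi-supervised learning method that combines the strong feature extraction of SAM-2 with consistency regularization and a feature enhancement mechanism to improve medical image segmentation, with strong experimental results.
Details
Motivation: Medical image annotation is expensive, so using large amounts of unlabeled data to improve model performance is a key challenge; semi-supervised learning is a promising direction.
Contribution: 1. Proposes SSS, a semi-supervised framework built on SAM-2; 2. Designs a Discriminative Feature Enhancement (DFE) mechanism; 3. Develops a prompt generator that integrates Physical Constraints with a Sliding Window (PCSW).
Method: Builds on a consistency regularization framework, introduces the DFE mechanism to explore feature discrepancies, and designs PCSW to generate the prompts that SAM-2 requires.
Result: Performs strongly on the ACDC and BHSD datasets, reaching a Dice score of 53.15 on BHSD, +3.65 over the previous method.
Insight: Combining vision foundation models such as SAM-2 with semi-supervised learning can effectively mine knowledge from unlabeled data and improve medical image segmentation.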
Abstract: In the era of information explosion, efficiently leveraging large-scale
unlabeled data while minimizing the reliance on high-quality pixel-level
annotations remains a critical challenge in the field of medical imaging.
Semi-supervised learning (SSL) enhances the utilization of unlabeled data by
facilitating knowledge transfer, significantly improving the performance of
fully supervised models and emerging as a highly promising research direction
in medical image analysis. Inspired by the ability of Vision Foundation Models
(e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised
SAM-2), a novel approach that leverages SAM-2’s robust feature extraction
capabilities to uncover latent knowledge in unlabeled medical images, thus
effectively enhancing feature support for fully supervised medical image
segmentation. Specifically, building upon the single-stream “weak-to-strong”
consistency regularization framework, this paper introduces a Discriminative
Feature Enhancement (DFE) mechanism to further explore the feature
discrepancies introduced by various data augmentation strategies across
multiple views. By leveraging feature similarity and dissimilarity across
multi-scale augmentation techniques, the method reconstructs and models the
features, thereby effectively optimizing the salient regions. Furthermore, a
prompt generator is developed that integrates Physical Constraints with a
Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data,
fulfilling SAM-2’s requirement for additional prompts. Extensive experiments
demonstrate the superiority of the proposed method for semi-supervised medical
image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably,
SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous
state-of-the-art method by +3.65 Dice. Code will be available at
https://github.com/AIGeeksGroup/SSS.
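For readers unfamiliar with the Dice score reported above, the following sketch computes it for a pair of flat binary masks using the standard definition, twice the overlap divided by the total foreground count. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ, and its figures appear to be on a 0 to 100 scale.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total

# One shared foreground pixel out of two in each mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

On multi-label datasets such as ACDC and BHSD, a score like this is typically computed per class and then averaged.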
[105] Cross-Spectral Body Recognition with Side Information Embedding: Benchmarks on LLCM and Analyzing Range-Induced Occlusions on IJB-MDF
Anirudh Nanduri,Siyuan Huang,Rama Chellappa
Main category: cs.CV
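TL;DR: The paper studies cross-spectral body recognition, proposing a Vision Transformer (ViT)-based approach combined with Side Information Embedding (SIE). Experiments show that encoding only camera information achieves state-of-the-art performance on the LLCM dataset. The paper also examines the effect of range-induced occlusions on visible-infrared (VI) person re-identification.
Details
Motivation: Cross-spectral body recognition, in particular matching visible and infrared images, is a challenging problem. Conventional ViT models have limited performance on cross-spectral tasks, and existing datasets lack coverage of occluded scenes, so the model needs improvement and this research gap needs filling.
Contribution: 1. Proposes a ViT model with Side Information Embedding (SIE), adapted to cross-spectral body recognition. 2. Finds that encoding only camera information achieves state-of-the-art performance. 3. Uses the IJB-MDF dataset to analyze, for the first time, the effect of range-induced occlusions on VI-ReID.
Method: 1. Uses a pretrained ViT model and introduces additional camera information embeddings through SIE. 2. Validates cross-spectral matching performance on the LLCM dataset. 3. Uses the IJB-MDF dataset to study the effect of occlusions on VI-ReID.
Result: On the LLCM dataset, SIE-ViT encoding only camera information achieves state-of-the-art performance. The analysis on IJB-MDF also highlights the importance of occlusion in VI-ReID.
Insight: 1. In cross-spectral tasks, camera information may be more discriminative than domain information (visible vs. infrared). 2. Existing VI-ReID datasets lack occlusion diversity, which limits model generalization.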
Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a
wide range of biometric tasks, including face and body recognition. In this
work, we adapt a ViT model pretrained on visible (VIS) imagery to the
challenging problem of cross-spectral body recognition, which involves matching
images captured in the visible and infrared (IR) domains. Recent ViT
architectures have explored incorporating additional embeddings beyond
traditional positional embeddings. Building on this idea, we integrate Side
Information Embedding (SIE) and examine the impact of encoding domain and
camera information to enhance cross-spectral matching. Surprisingly, our
results show that encoding only camera information - without explicitly
incorporating domain information - achieves state-of-the-art performance on the
LLCM dataset. While occlusion handling has been extensively studied in
visible-spectrum person re-identification (Re-ID), occlusions in
visible-infrared (VI) Re-ID remain largely underexplored - primarily because
existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly
feature full-body, unoccluded images. To address this gap, we analyze the
impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain
Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared
images captured at various distances, enabling cross-range, cross-spectral
evaluations.
[106] Segment Concealed Objects with Incomplete Supervision
Chunming He,Kai Li,Yachao Zhang,Ziyun Yang,Youwei Pang,Longxiang Tang,Chengyu Fang,Yulun Zhang,Linghe Kong,Xiu Li,Sina Farsiu
Main category: cs.CV
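TL;DR: The paper proposes SEE, a unified method for incompletely supervised concealed object segmentation. It combines a mean-teacher framework with SAM to generate high-quality pseudo-labels and designs a hybrid-granularity feature grouping module to address the difficulty of distinguishing concealed objects from the background.
Details
Motivation: Concealed object segmentation is highly challenging because the annotations are incomplete and the objects closely resemble the background; a method is needed that makes effective use of weakly and semi-annotated data.
Contribution: 1. Proposes the SEE framework, which combines a mean-teacher model with SAM to generate pseudo-labels; 2. Designs strategies for pseudo-label generation, storage, and supervision; 3. Proposes a hybrid-granularity feature grouping module to improve segmentation coherence.
Method: 1. Generates pseudo-labels with the mean-teacher framework and SAM; 2. Optimizes training through the pseudo-label strategies; 3. Designs a hybrid-granularity feature grouping module to improve feature aggregation.
Result: Experiments show that SEE achieves state-of-the-art performance on multiple incompletely supervised concealed object segmentation tasks and can serve as a plug-and-play solution that improves existing models.
Insight: Combining a pretrained vision foundation model such as SAM with a mean-teacher framework makes effective use of weakly annotated data, and the feature grouping module helps with the similarity problem in concealed object segmentation.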
Abstract: Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves
segmenting objects that seamlessly blend into their surrounding environments,
utilizing incompletely annotated data, such as weak and semi-annotations, for
model training. This task remains highly challenging due to (1) the limited
supervision provided by the incompletely annotated training data, and (2) the
difficulty of distinguishing concealed objects from the background, which
arises from the intrinsic similarities in concealed scenarios. In this paper,
we introduce the first unified method for ISCOS to address these challenges. To
tackle the issue of incomplete supervision, we propose a unified mean-teacher
framework, SEE, that leverages the vision foundation model Segment Anything
Model (SAM) to generate pseudo-labels using coarse masks produced
by the teacher model as prompts. To mitigate the effect of low-quality
segmentation masks, we introduce a series of strategies for pseudo-label
generation, storage, and supervision. These strategies aim to produce
informative pseudo-labels, store the best pseudo-labels generated, and select
the most reliable components to guide the student model, thereby ensuring
robust network training. Additionally, to tackle the issue of intrinsic
similarity, we design a hybrid-granularity feature grouping module that groups
features at different granularities and aggregates these results. By clustering
similar features, this module promotes segmentation coherence, facilitating
more complete segmentation for both single-object and multiple-object images.
We validate the effectiveness of our approach across multiple ISCOS tasks, and
experimental results demonstrate that our method achieves state-of-the-art
performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing
the performance of existing models.
[107] Data Augmentation For Small Object using Fast AutoAugment
DaeEun Yoon,Semin Kim,SangWook Yoo,Jongha Lee
Main category: cs.CV
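TL;DR: The paper proposes a data augmentation method based on Fast AutoAugment that substantially improves small object detection, achieving a 20% performance improvement on the DOTA dataset.
Details
Motivation: Although object detection has made tremendous progress in recent years, detection performance for small objects remains far below that for large objects. Small object detection is one of the most challenging and important problems in computer vision.
Contribution: Proposes an optimized data augmentation method based on Fast AutoAugment that quickly finds augmentation policies to overcome degraded performance on small objects.
Method: Uses Fast AutoAugment to search automatically for data augmentation policies suited to small object detection, improving model generalization.
Result: Achieves a 20% improvement in small object detection performance on the DOTA dataset.
Insight: Automated optimization of data augmentation policies can effectively ease the performance bottleneck in small object detection.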
Abstract: In recent years, there has been tremendous progress in object detection
performance. However, despite these advances, the detection performance for
small objects is significantly inferior to that of large objects. Detecting
small objects is one of the most challenging and important problems in computer
vision. To improve the detection performance for small objects, we propose an
optimal data augmentation method using Fast AutoAugment. Through our proposed
method, we can quickly find optimal augmentation policies that can overcome
degradation when detecting small objects, and we achieve a 20% performance
improvement on the DOTA dataset.
[108] ORIDa: Object-centric Real-world Image Composition Dataset
Jinwoo Kim,Sangmin Han,Jinho Jeong,Jiwoo Choi,Dongyoung Kim,Seon Joo Kim
Main category: cs.CV
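TL;DR: ORIDa is a large-scale, real-captured dataset of over 30,000 images and 200 unique objects for object compositing. It provides two data types, factual-counterfactual sets and factual-only scenes, addressing the limited diversity of existing datasets.
Details
Motivation: Current object compositing datasets lack diversity and scale and cannot comprehensively explore real-world scenarios; ORIDa is proposed to address this.
Contribution: Proposes the ORIDa dataset, which is large and diverse, containing over 30,000 images and 200 objects; it is the first publicly available dataset of this scale and complexity for real-world image composition.
Method: The dataset is split into factual-counterfactual sets (five images each) and factual-only scenes (a single image each), with varied scene and object combinations for broad coverage.
Result: The richness and diversity of ORIDa provide strong support for object compositing, and experiments and analysis confirm its potential as a research resource.
Insight: Large-scale, diverse datasets are key to advancing object compositing research; ORIDa fills a gap in existing datasets and provides an important resource for the further development of generative models.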
Abstract: Object compositing, the task of placing and harmonizing objects in images of
diverse visual scenes, has become an important task in computer vision with the
rise of generative models. However, existing datasets lack the diversity and
scale required to comprehensively explore real-world scenarios. We introduce
ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale,
real-captured dataset containing over 30,000 images featuring 200 unique
objects, each of which is presented across varied positions and scenes. ORIDa
has two types of data: factual-counterfactual sets and factual-only scenes. The
factual-counterfactual sets consist of four factual images showing an object in
different positions within a scene and a single counterfactual (or background)
image of the scene without the object, resulting in five images per scene. The
factual-only scenes include a single image containing an object in a specific
context, expanding the variety of environments. To our knowledge, ORIDa is the
first publicly available dataset with its scale and complexity for real-world
image composition. Extensive analysis and experiments highlight the value of
ORIDa as a resource for advancing further research in object compositing.
[109] ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations
Amirreza Rouhi,Solmaz Arezoomandan,Knut Peterson,Joseph T. Woods,David K. Han
Main category: cs.CV
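TL;DR: ADAM is a training-free, self-refining annotation framework that uses LLMs and CLIP to generate context-aware labels for unknown objects in open-world scenarios.
Details
Motivation: Conventional object detection models rely on predefined categories and struggle to identify novel objects in the open world; ADAM aims to overcome this limitation.
Contribution: Proposes the ADAM framework, which combines LLMs and CLIP for open-world object labeling without category supervision, and designs a self-refinement mechanism to improve label consistency.
Method: LLMs generate candidate labels, which are paired with CLIP visual embeddings to build an Embedding-Label Repository (ELR); labels are assigned by frequency-based voting and cross-modal re-ranking, and a self-refinement loop improves label quality.
Result: On the COCO and PASCAL datasets, ADAM successfully annotates novel categories without fine-tuning or retraining.
Insight: Combining LLMs with vision models can effectively address open-world object labeling, and the self-refinement mechanism markedly improves label consistency.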
Abstract: Object detection models typically rely on predefined categories, limiting
their ability to identify novel objects in open-world scenarios. To overcome
this constraint, we introduce ADAM: Autonomous Discovery and Annotation Model,
a training-free, self-refining framework for open-world object labeling. ADAM
leverages large language models (LLMs) to generate candidate labels for unknown
objects based on contextual information from known entities within a scene.
These labels are paired with visual embeddings from CLIP to construct an
Embedding-Label Repository (ELR) that enables inference without category
supervision. For a newly encountered unknown object, ADAM retrieves visually
similar instances from the ELR and applies frequency-based voting and
cross-modal re-ranking to assign a robust label. To further enhance
consistency, we introduce a self-refinement loop that re-evaluates repository
labels using visual cohesion analysis and k-nearest-neighbor-based majority
re-labeling. Experimental results on the COCO and PASCAL datasets demonstrate
that ADAM effectively annotates novel categories using only visual and
contextual signals, without requiring any fine-tuning or retraining.
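The retrieve-then-vote step described above can be sketched in a few lines. This is a minimal illustration, assuming the repository is a plain list of (embedding, label) pairs, cosine similarity for retrieval, and a simple majority vote over the top k neighbours; the names `cosine` and `vote_label`, the two-dimensional toy embeddings, and the omission of cross-modal re-ranking are all simplifications not taken from the paper.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vote_label(query, repository, k=3):
    """Return the most frequent label among the k nearest repository entries."""
    ranked = sorted(repository, key=lambda item: cosine(query, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy stand-in for an Embedding-Label Repository; real entries would be
# CLIP embeddings paired with LLM-proposed labels.
repository = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.8, 0.2], "tiger"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
]
label = vote_label([1.0, 0.05], repository, k=3)
```

Here the three nearest entries carry the labels cat, cat, and tiger, so the vote returns cat; a self-refinement pass in the same spirit would re-run this vote over the repository entries themselves.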
[110] Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models
Chenyu Lian,Hong-Yu Zhou,Dongyun Liang,Jing Qin,Liansheng Wang
Main category: cs.CV
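TL;DR: ALTA is an efficient medical vision-language alignment method that adapts masked vision models, requiring only a small number of trainable parameters and little compute while performing strongly on retrieval and zero-shot classification.
Details
Motivation: Conventional cross-modal contrastive learning methods (such as CLIP) have suboptimal visual representations, which limits their vision-language alignment; multimodal masked modeling produces strong visual representations but struggles with cross-modal matching. ALTA aims to resolve this contradiction.
Contribution: 1. Proposes ALTA, which adapts masked vision models for efficient vision-language alignment; 2. Integrates temporal-multiview radiograph inputs to strengthen image-text consistency; 3. Shows experimentally that ALTA significantly outperforms existing methods on several tasks.
Method: 1. Achieves vision-language alignment by adapting a pretrained masked vision model; 2. Introduces temporal-multiview radiograph inputs to improve cross-modal consistency; 3. Keeps the design efficient, requiring only a small number of trainable parameters and little compute.
Result: ALTA improves text-to-image accuracy by over 4 absolute percentage points and image-to-text retrieval accuracy by about 6, and is significantly more computationally efficient than the compared methods.
Insight: Adapting masked vision models not only improves vision-language alignment but also promotes deeper understanding of both vision and language.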
Abstract: Medical vision-language alignment through cross-modal contrastive learning
shows promising performance in image-text matching tasks, such as retrieval and
zero-shot classification. However, conventional cross-modal contrastive
learning (CLIP-based) methods suffer from suboptimal visual representation
capabilities, which also limits their effectiveness in vision-language
alignment. In contrast, although the models pretrained via multimodal masked
modeling struggle with direct cross-modal matching, they excel in visual
representation. To address this contradiction, we propose ALTA (ALign Through
Adapting), an efficient medical vision-language alignment method that utilizes
only about 8% of the trainable parameters and less than 1/5 of the
computational consumption required for masked record modeling. ALTA achieves
superior performance in vision-language matching tasks like retrieval and
zero-shot classification by adapting the pretrained vision model from masked
record modeling. Additionally, we integrate temporal-multiview radiograph
inputs to enhance the information consistency between radiographs and their
corresponding descriptions in reports, further improving the vision-language
alignment. Experimental evaluations show that ALTA outperforms the
best-performing counterpart by over 4% absolute points in text-to-image
accuracy and approximately 6% absolute points in image-to-text retrieval
accuracy. The adaptation of vision-language models during efficient alignment
also promotes better vision and language understanding. Code is publicly
available at https://github.com/DopamineLcy/ALTA.
[111] Do MIL Models Transfer?
Daniel Shao,Richard J. Chen,Andrew H. Song,Joel Runevic,Ming Y. Lu,Tong Ding,Faisal Mahmood
Main category: cs.CV
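TL;DR: The paper systematically evaluates 11 MIL models across 21 pretraining tasks and finds that pretrained MIL models outperform models trained from scratch even when pretrained on different organs, confirming the strong transferability of MIL models.
Details
Motivation: Although transfer learning is widely used in NLP and conventional computer vision, the transferability of MIL models in computational pathology has not been well studied. This work aims to fill that gap.
Contribution: 1. Systematically evaluates the transfer learning capabilities of MIL models; 2. Shows that pretrained MIL models substantially improve performance; 3. Provides an open-source resource with standardized MIL model implementations and pretrained weights.
Method: Evaluates 11 MIL models across 21 pretraining tasks, comparing pretrained models against models trained from scratch, with particular attention to cross-organ transfer.
Result: Pretrained MIL models consistently outperform models trained from scratch, and models pretrained on pan-cancer data generalize especially well.
Insight: MIL models transfer well, and transfer learning and pretraining strategies can effectively mitigate data scarcity in computational pathology.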
Abstract: Multiple Instance Learning (MIL) is a cornerstone approach in computational
pathology (CPath) for generating clinically meaningful slide-level embeddings
from gigapixel tissue images. However, MIL often struggles with small, weakly
supervised clinical datasets. In contrast to fields such as NLP and
conventional computer vision, where transfer learning is widely used to address
data scarcity, the transferability of MIL models remains poorly understood. In
this study, we systematically evaluate the transfer learning capabilities of
pretrained MIL models by assessing 11 models across 21 pretraining tasks for
morphological and molecular subtype prediction. Our results show that
pretrained MIL models, even when trained on different organs than the target
task, consistently outperform models trained from scratch. Moreover,
pretraining on pancancer datasets enables strong generalization across organs
and tasks, outperforming slide foundation models while using substantially less
pretraining data. These findings highlight the robust adaptability of MIL
models and demonstrate the benefits of leveraging transfer learning to boost
performance in CPath. Lastly, we provide a resource that standardizes the
implementation of MIL models and offers a collection of pretrained model
weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab.
[112] Princeton365: A Diverse Dataset with Accurate Camera Pose
Karhan Kayan,Stamatis Alexandropoulos,Rishabh Jain,Yiming Zuo,Erich Liang,Jia Deng
Main category: cs.CV
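TL;DR: Princeton365 is a large-scale, diverse dataset of 365 videos with accurate camera pose, bridging the gap between accuracy and data diversity in current SLAM benchmarks.
Details
Motivation: Current SLAM benchmarks typically lack either data diversity or high-accuracy camera pose; Princeton365 addresses this with a new framework that combines calibration boards and a 360-camera.
Contribution: 1) Introduces the Princeton365 dataset; 2) proposes a scene scale-aware evaluation metric based on the optical flow induced by camera pose estimation error; 3) introduces an NVS benchmark that covers non-Lambertian scenes.
Method: Collects data with calibration boards and a 360-camera, providing synchronized monocular and stereo RGB video together with IMU data. The new evaluation metric is based on the optical flow induced by camera pose error.
Result: The dataset supports diverse evaluation of SLAM and NVS tasks, and the new metric allows performance comparison across scenes.
Insight: Diverse data and high-accuracy pose are essential to the development of SLAM methods, and the new metric helps identify the failure modes of a method.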
Abstract: We introduce Princeton365, a large-scale diverse dataset of 365 videos with
accurate camera pose. Our dataset bridges the gap between accuracy and data
diversity in current SLAM benchmarks by introducing a novel ground truth
collection framework that leverages calibration boards and a 360-camera. We
collect indoor, outdoor, and object scanning videos with synchronized monocular
and stereo RGB video outputs as well as IMU. We further propose a new scene
scale-aware evaluation metric for SLAM based on the optical flow induced by
the camera pose estimation error. Unlike existing metrics such as Average
Trajectory Error (ATE), our new metric allows the performance of SLAM methods
to be compared across scenes, helping researchers analyze the failure modes of
their methods. We also
propose a challenging Novel View Synthesis benchmark that covers cases not
covered by current NVS benchmarks, such as fully non-Lambertian scenes with
360-degree camera trajectories. Please visit
https://princeton365.cs.princeton.edu for the dataset, code, videos, and
submission.
[113] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Dianyi Wang,Wei Song,Yikun Wang,Siyuan Wang,Kaicheng Yu,Zhongyu Wei,Jiaqi Wang
Main category: cs.CV
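TL;DR: The paper proposes ASVR, which jointly learns visual and textual modalities through autoregressive semantic visual reconstruction. It addresses the under-use of visual information in existing large vision-language models and markedly improves performance on multimodal understanding tasks.
Details
Motivation: Existing large vision-language models (LVLMs) apply autoregressive supervision only to text sequences and do not fully integrate the visual modality. As a result, they cannot use images without captions, may miss critical visual details, and cannot adequately express some visual content through text.
Contribution: Proposes Autoregressive Semantic Visual Reconstruction (ASVR), described as the first approach to jointly learn visual and textual modalities in a unified autoregressive framework, and shows experimentally that reconstructing semantic representations improves multimodal understanding more than reconstructing raw appearance.
Method: ASVR autoregressively reconstructs the semantic representation of images rather than their raw appearance, and the paper verifies that models can effectively reconstruct discrete semantic tokens from continuous image features. Experiments cover different data scales and types of LLM backbones.
Result: ASVR improves LLaVA-1.5 by 5% in average score across 14 multimodal benchmarks and is stable across data scales and models.
Insight: Reconstructing semantic representations improves multimodal understanding more effectively than reconstructing raw appearance, and an autoregressive framework can handle visual and textual information in a unified way, addressing a shortcoming of existing methods.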
Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision
solely to textual sequences, without fully incorporating the visual modality
into the learning process. This results in three key limitations: (1) an
inability to utilize images without accompanying captions, (2) the risk that
captions omit critical visual details, and (3) the challenge that certain
vision-centric content cannot be adequately conveyed through text. As a result,
current LVLMs often prioritize vision-to-language alignment while potentially
overlooking fine-grained visual information. While some prior works have
explored autoregressive image generation, effectively leveraging autoregressive
visual supervision to enhance image understanding remains an open challenge. In
this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR),
which enables joint learning of visual and textual modalities within a unified
autoregressive framework. We show that autoregressively reconstructing the raw
visual appearance of images does not enhance and may even impair multimodal
understanding. In contrast, autoregressively reconstructing the semantic
representation of images consistently improves comprehension. Notably, we find
that even when models are given continuous image features as input, they can
effectively reconstruct discrete semantic tokens, resulting in stable and
consistent improvements across a wide range of multimodal understanding
benchmarks. Our approach delivers significant performance gains across varying
data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves
LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is
available at https://github.com/AlenjandroWang/ASVR.
[114] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
Xuanchi Ren,Yifan Lu,Tianshi Cao,Ruiyuan Gao,Shengyu Huang,Amirmojtaba Sabour,Tianchang Shen,Tobias Pfaff,Jay Zhangjie Wu,Runjian Chen,Seung Wook Kim,Jun Gao,Laura Leal-Taixe,Mike Chen,Sanja Fidler,Huan Ling
Main category: cs.CV
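TL;DR: Cosmos-Drive-Dreams is a synthetic data generation (SDG) pipeline that produces high-fidelity, challenging driving scenarios to address the shortage of long-tail and edge-case data for autonomous driving systems.
Details
Motivation: Collecting and annotating real-world autonomous driving data is costly and time-consuming, and edge cases, which are critical for training and testing, are especially hard to capture. The authors therefore turn to synthetic data generation to fill this gap.
Contribution: The main contribution is the Cosmos-Drive-Dreams pipeline, built on the NVIDIA Cosmos world foundation model, which enables controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation; the toolkit, dataset, and model weights are open-sourced.
Method: The pipeline is powered by Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain, which generates multi-view driving videos with spatiotemporal consistency. The synthetic data is used to increase the diversity and scale of driving datasets.
Result: Experiments show that the generated synthetic data helps mitigate long-tail distribution problems and improves generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning.
Insight: Synthetic data generation can effectively supplement scarce real data, with particular promise for high-fidelity and edge-case generation, offering a new approach to training and testing autonomous driving systems.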
Abstract: Collecting and annotating real-world data for safety-critical physical AI
systems, such as Autonomous Vehicles (AVs), is time-consuming and costly. It is
especially challenging to capture rare edge cases, which play a critical role
in training and testing of an AV system. To address this challenge, we
introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline
that aims to generate challenging scenarios to facilitate downstream tasks such
as perception and driving policy training. Powering this pipeline is
Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world
foundation model for the driving domain and capable of controllable,
high-fidelity, multi-view, and spatiotemporally consistent driving video
generation. We
showcase the utility of these models by applying Cosmos-Drive-Dreams to scale
the quantity and diversity of driving datasets with high-fidelity and
challenging scenarios. Experimentally, we demonstrate that our generated data
helps in mitigating long-tail distribution problems and enhances generalization
in downstream tasks such as 3D lane detection, 3D object detection and driving
policy learning. We open source our pipeline toolkit, dataset and model weights
through NVIDIA’s Cosmos platform.
Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams
[115] MagCache: Fast Video Generation with Magnitude-Aware Cache
Zehong Ma,Longhui Wei,Feng Wang,Shiliang Zhang,Qi Tian
Main category: cs.CV
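TL;DR: MagCache accelerates video generation with a magnitude-aware cache. Based on an observed unified law in the magnitude of residual outputs, it adaptively skips unimportant timesteps, substantially speeding up generation while preserving visual quality, without requiring many calibration samples.
Details
Motivation: Existing acceleration techniques for video diffusion models typically rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features, but these approaches require extensive calibration and risk inconsistent outputs due to prompt-specific overfitting.
Contribution: Identifies a unified law governing the magnitude ratio of residual outputs (the magnitude law) and builds MagCache on it, achieving efficient acceleration through an error modeling mechanism and an adaptive caching strategy.
Method: 1. Observes that the magnitude ratio of successive residual outputs decreases monotonically over timesteps; 2. Designs MagCache to adaptively skip unimportant timesteps, requiring only a single sample for calibration; 3. Combines error modeling with the caching strategy to improve performance.
Result: Achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while outperforming existing methods on LPIPS, SSIM, and PSNR.
Insight: The regularity of the residual-output magnitude ratio is a predictable, general property of video diffusion models and can guide the design of adaptive acceleration strategies.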
Abstract: Existing acceleration techniques for video diffusion models often rely on
uniform heuristics or time-embedding variants to skip timesteps and reuse
cached features. These approaches typically require extensive calibration with
curated prompts and risk inconsistent outputs due to prompt-specific
overfitting. In this paper, we introduce a novel and robust discovery: a
unified magnitude law observed across different models and prompts.
Specifically, the magnitude ratio of successive residual outputs decreases
monotonically, steadily over most timesteps and then rapidly in the last
several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache)
that adaptively skips unimportant timesteps using an error modeling mechanism
and adaptive caching strategy. Unlike existing methods requiring dozens of
curated samples for calibration, MagCache only requires a single sample for
calibration. Experimental results show that MagCache achieves 2.1x and 2.68x
speedups on Open-Sora and Wan 2.1, respectively, while preserving superior
visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM,
and PSNR, under comparable computational budgets.
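The skipping idea described in the abstract can be illustrated with a small planning routine. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the error model, so the accumulated-error rule, the function name `plan_skips`, and the parameters `err_threshold` and `max_consecutive` are all hypothetical. The input `ratios` stands for calibrated magnitude ratios of successive residual outputs, one per timestep after the first.

```python
def plan_skips(ratios, err_threshold=0.1, max_consecutive=2):
    """Decide, per timestep, whether to reuse the cached residual.

    ratios[i] is the (calibrated) magnitude ratio between the residual at
    this step and the previous one. A ratio close to 1 suggests the residual
    barely changed, so reusing the cache is cheap in terms of error.
    Returns a list of booleans: True means "skip and reuse the cache".
    """
    decisions = []
    acc_ratio = 1.0  # product of ratios since the last full computation
    acc_err = 0.0    # assumed error proxy: accumulated |1 - acc_ratio|
    streak = 0       # number of consecutive skipped steps
    for r in ratios:
        cand_ratio = acc_ratio * r
        cand_err = acc_err + abs(1.0 - cand_ratio)
        if cand_err <= err_threshold and streak < max_consecutive:
            decisions.append(True)
            acc_ratio, acc_err = cand_ratio, cand_err
            streak += 1
        else:
            # Run the model for real and reset the error bookkeeping.
            decisions.append(False)
            acc_ratio, acc_err, streak = 1.0, 0.0, 0
    return decisions


if __name__ == "__main__":
    print(plan_skips([0.99, 0.99, 0.99, 0.5]))
```

With ratios near 1 the plan skips until the consecutive-skip cap is reached, and a sharp drop in the ratio (as reported for the last several steps) forces a full computation.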
cs.IR [Back]
[116] Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval
Abdellah Ghassel,Ian Robinson,Gabriel Tanase,Hal Cooper,Bryan Thompson,Zhen Han,Vassilis N. Ioannidis,Soji Adeshina,Huzefa Rangwala
Main category: cs.IR
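TL;DR: The paper proposes the Hierarchical Lexical Graph (HLG), a three-tier index structure for multi-hop retrieval, together with two complementary retrievers (StatementGraphRAG and TopicGraphRAG) that substantially improve retrieval recall and correctness.
Details
Motivation: Existing retrieval-augmented generation (RAG) methods perform poorly on complex questions whose answers must be pieced together across documents, especially when those documents are semantically distant.
Contribution: 1. Proposes the HLG index, which supports retrieval at multiple levels, from atomic propositions to latent topics; 2. Designs two complementary retrievers (StatementGraphRAG and TopicGraphRAG); 3. Introduces a synthetic data generation pipeline for evaluating multi-hop retrieval systems.
Method: 1. A three-tier HLG index: atomic propositions, latent topic clusters, and entity and relation links; 2. Two retrievers: StatementGraphRAG for fine-grained, entity-aware search and TopicGraphRAG for coarse-grained, topic-based expansion.
Result: Experiments on five datasets show an average relative improvement of 23.1% in retrieval recall and correctness over naive chunk-based RAG.
Insight: A hierarchical index combined with multi-granularity retrieval can address cross-document multi-hop retrieval, and synthetic data offers a new way to evaluate complex retrieval systems.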
Abstract: Retrieval-Augmented Generation (RAG) grounds large language models in
external evidence, yet it still falters when answers must be pieced together
across semantically distant documents. We close this gap with the Hierarchical
Lexical Graph (HLG), a three-tier index that (i) traces every atomic
proposition to its source, (ii) clusters propositions into latent topics, and
(iii) links entities and relations to expose cross-document paths. On top of
HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG,
which performs fine-grained entity-aware beam search over propositions for
high-precision factoid questions, and TopicGraphRAG, which selects coarse
topics before expanding along entity links to supply broad yet relevant context
for exploratory queries. Additionally, existing benchmarks lack the complexity
required to rigorously evaluate multi-hop summarization systems, often focusing
on single-document queries or limited datasets. To address this, we introduce a
synthetic dataset generation pipeline that curates realistic, multi-document
question-answer pairs, enabling robust evaluation of multi-hop retrieval
systems. Extensive experiments across five datasets demonstrate that our
methods outperform naive chunk-based RAG, achieving an average relative
improvement of 23.1% in retrieval recall and correctness. An open-source Python
library is available at https://github.com/awslabs/graphrag-toolkit.
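To make the three tiers concrete, here is a toy in-memory index. It is a sketch under stated assumptions and does not reflect the graphrag-toolkit API: the class `ToyLexicalGraph`, the `Proposition` fields, and the method names are hypothetical, and topics and entities are assumed to be given rather than extracted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    pid: str
    text: str
    source: str           # tier (i): document the statement is traced to
    topic: str            # tier (ii): latent topic label (assumed given)
    entities: frozenset   # tier (iii): entities mentioned (assumed given)


class ToyLexicalGraph:
    """Hypothetical three-tier index: sources, topics, entity links."""

    def __init__(self, propositions):
        self.props = {p.pid: p for p in propositions}
        self.by_topic = defaultdict(set)
        self.by_entity = defaultdict(set)
        for p in propositions:
            self.by_topic[p.topic].add(p.pid)
            for entity in p.entities:
                self.by_entity[entity].add(p.pid)

    def neighbors(self, pid):
        """Propositions that share at least one entity with `pid`."""
        linked = set()
        for entity in self.props[pid].entities:
            linked |= self.by_entity[entity]
        linked.discard(pid)
        return linked

    def cross_document_hops(self, pid):
        """Entity-linked propositions that come from a different source."""
        origin = self.props[pid].source
        return sorted(q for q in self.neighbors(pid)
                      if self.props[q].source != origin)


if __name__ == "__main__":
    graph = ToyLexicalGraph([
        Proposition("p1", "Acme opened an office in Berlin.", "docA",
                    "expansion", frozenset({"Acme", "Berlin"})),
        Proposition("p2", "Acme reported record revenue.", "docB",
                    "finance", frozenset({"Acme"})),
        Proposition("p3", "Berlin hosts many startups.", "docA",
                    "expansion", frozenset({"Berlin"})),
    ])
    print(graph.cross_document_hops("p1"))
```

Following entity links out of a proposition and filtering by source is the simplest form of the cross-document path the abstract refers to; a real retriever would add scoring and beam search on top.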
q-fin.ST [Back]
[117] EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
Issa Sugiura,Takashi Ishida,Taro Makino,Chieko Tazuke,Takanori Nakagawa,Kosuke Nakago,David Ha
Main category: q-fin.ST
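TL;DR: The paper introduces EDINET-Bench, an open-source Japanese financial benchmark for evaluating large language models (LLMs) on complex financial tasks such as fraud detection and earnings forecasting. Results show that applying current LLMs to finance remains challenging.
Details
Motivation: Complex tasks in financial analysis could benefit from LLM capabilities, but the lack of challenging datasets for Japanese financial data impedes academic research and the application of LLMs in finance.
Contribution: Proposes EDINET-Bench, the first open-source benchmark dataset for Japanese financial data, and releases the evaluation code to support research on LLMs in finance.
Method: Downloads ten years of annual reports from Japan's EDINET system, automatically assigns task labels (fraud detection, earnings forecasting, and others), builds the dataset, and evaluates LLMs on it.
Result: Even state-of-the-art LLMs perform only slightly better than logistic regression on tasks such as fraud detection and earnings forecasting, highlighting the difficulty of applying them in finance.
Insight: Real-world financial applications of LLMs require domain-specific adaptation; current techniques still have significant limitations.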
Abstract: Financial analysis presents complex challenges that could leverage large
language model (LLM) capabilities. However, the scarcity of challenging
financial datasets, particularly for Japanese financial data, impedes academic
innovation in financial analytics. As LLMs advance, this lack of accessible
research resources increasingly hinders their development and evaluation in
this specialized domain. To address this gap, we introduce EDINET-Bench, an
open-source Japanese financial benchmark designed to evaluate the performance
of LLMs on challenging financial tasks including accounting fraud detection,
earnings forecasting, and industry prediction. EDINET-Bench is constructed by
downloading annual reports from the past 10 years from Japan’s Electronic
Disclosure for Investors’ NETwork (EDINET) and automatically assigning labels
corresponding to each evaluation task. Our experiments reveal that even
state-of-the-art LLMs struggle, performing only slightly better than logistic
regression in binary classification for fraud detection and earnings
forecasting. These results highlight significant challenges in applying LLMs to
real-world financial applications and underscore the need for domain-specific
adaptation. Our dataset, benchmark construction code, and evaluation code are
publicly available to facilitate future research in finance with LLMs.
cs.GR [Back]
[118] A Real-time 3D Desktop Display
Livio Tenze,Enrique Canessa
Main category: cs.GR
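TL;DR: This paper presents an extended version of the altiro3D C++ library that processes 2D images or video streams in real time and generates light fields for a 3D experience. At its core, the MiDaS CNN extracts a depth map from a single 2D image, and AI techniques are used to improve performance.
Details
Motivation: Conventional 3D display technology requires complex hardware support. This work aims to achieve real-time 3D display in software, supporting multiple input sources, including regions of the desktop screen.
Contribution: Extends the altiro3D library to handle 2D images, video streams, and desktop screen regions, and achieves real-time 3D rendering through deep learning and light-field techniques.
Method: Extracts depth maps with the MiDaS CNN, applies AI techniques to optimize the processing pipeline, converts 2D to 3D in real time, and supports multiple input formats.
Result: Achieves real-time 3D rendering of 2D images, video streams, and desktop applications, with direct output to light-field 3D devices.
Insight: Combining software with deep learning can simplify 3D display and broaden the potential of 3D technology on ordinary devices.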
Abstract: We introduce a new extended version of the altiro3D C++ library,
initially developed to produce glasses-free holographic displays from 2D
images, which now handles 3D video streams derived from either 2D webcam
images or flat video files. These streams are processed in real time to
synthesize light fields (in Native format) and deliver realistic 3D
experiences. The core function needed to recreate multiviews relies on the
MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a
single 2D image. Artificial Intelligence (AI) computing techniques are applied
to improve the overall performance of the extended altiro3D library. Thus,
altiro3D can now handle standard images, video streams, or screen portions of
a desktop where other apps may also be running (web browsers, video chats,
etc.) and render them in 3D. To achieve the latter, a screen region needs to
be selected in order to feed the output directly into a light-field 3D device
such as the Looking Glass (LG) Portrait. To simplify the acquisition of a
desktop screen area by the user, a multi-platform graphical user interface has
also been implemented. Sources are available at:
https://github.com/canessae/altiro3D/releases/tag/2.0.0
[119] Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
Weikun Peng,Jun Lv,Cewu Lu,Manolis Savva
Main category: cs.GR
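TL;DR: The paper proposes a coarse-to-fine framework for reconstructing articulated objects from dynamic RGBD videos, addressing the challenges posed by interaction and scene changes, and significantly outperforms existing methods on synthetic and real datasets.
Details
Motivation: Articulated objects are common in daily life and in robotics applications, but existing methods need carefully captured data, which limits their scalability and generalization in practice. The work aims to reconstruct articulated objects from RGBD videos casually captured with hand-held devices.
Contribution: 1) Proposes a coarse-to-fine framework that infers joint parameters and segments movable parts of an articulated object from a dynamic RGBD video; 2) Builds a larger synthetic dataset of 784 videos containing 284 objects across 11 categories.
Method: Uses a coarse-to-fine framework that analyzes object motion in dynamic RGBD video and progressively refines the joint parameters and the segmentation of movable parts.
Result: Experiments on synthetic and real datasets show that the method significantly outperforms existing methods and reconstructs articulated objects across categories.
Insight: Casually captured RGBD videos are a more practical data source for articulated object reconstruction, but simultaneous object and camera motion and occlusions must be handled. A coarse-to-fine framework is an effective way to do so.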
Abstract: Articulated objects are prevalent in daily life. Understanding their
kinematic structure and reconstructing them have numerous applications in
embodied AI and robotics. However, current methods require carefully captured
data for training or inference, preventing practical, scalable, and
generalizable reconstruction of articulated objects. We focus on reconstruction
of an articulated object from a casually captured RGBD video shot with a
hand-held camera. A casually captured video of an interaction with an
articulated object is easy to acquire at scale using smartphones. However, this
setting is quite challenging, as the object and camera move simultaneously and
there are significant occlusions as the person interacts with the object. To
tackle these challenges, we introduce a coarse-to-fine framework that infers
joint parameters and segments movable parts of the object from a dynamic RGBD
video. To evaluate our method under this new setting, we build a 20x
larger synthetic dataset of 784 videos containing 284 objects across 11
categories. We compare our approach with existing methods that also take video
as input. Experiments show that our method can reconstruct synthetic and real
articulated objects across different categories from dynamic RGBD videos,
outperforming existing methods significantly.
[120] Fine-Grained Spatially Varying Material Selection in Images
Julia Guerrero-Viu,Michael Fischer,Iliyan Georgiev,Elena Garces,Diego Gutierrez,Belen Masia,Valentin Deschaintre
Main category: cs.GR
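TL;DR: This paper proposes a fine-grained, spatially varying material selection method based on vision transformers (ViT). It is robust to lighting and reflectance variations and supports selection at two levels: texture and subtexture.
Details
Motivation: In conventional image editing, material selection is hampered by lighting and reflectance variations and lacks fine-grained control. The paper proposes a more robust and finer-grained material selection method to address this.
Contribution: 1. Proposes a ViT-based multi-resolution processing strategy that yields finer and more stable material selection. 2. Introduces a new two-level material selection dataset (DuMaS) with dense annotations for over 800,000 synthetic images. 3. Supports selection at both the texture and subtexture levels.
Method: 1. Extracts features with ViT models. 2. Refines the selection with a multi-resolution processing strategy. 3. Performs fine-grained selection at the texture and subtexture levels.
Result: The method is robust to lighting and reflectance variations and produces finer and more stable selections than existing methods.
Insight: ViT features extracted under multi-resolution processing are strong enough to cope with the complex material and lighting variations encountered in image editing.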
Abstract: Selection is the first step in many image editing processes, enabling faster
and simpler modifications of all pixels sharing a common modality. In this
work, we present a method for material selection in images, robust to lighting
and reflectance variations, which can be used for downstream editing tasks. We
rely on vision transformer (ViT) models and leverage their features for
selection, proposing a multi-resolution processing strategy that yields finer
and more stable selection results than prior methods. Furthermore, we enable
selection at two levels: texture and subtexture, leveraging a new two-level
material selection (DuMaS) dataset which includes dense annotations for over
800,000 synthetic images, both on the texture and subtexture levels.
cs.SD [Back]
[121] Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Ailin Huang,Bingxin Li,Bruce Wang,Boyong Wu,Chao Yan,Chengli Feng,Heng Wang,Hongyu Zhou,Hongyuan Wang,Jingbei Li,Jianjian Sun,Joanna Wang,Mingrui Chen,Peng Liu,Ruihang Miao,Shilei Jiang,Tian Fei,Wang You,Xi Chen,Xuerui Yang,Yechang Huang,Yuxiang Zhang,Zheng Ge,Zheng Gong,Zhewei Huang,Zixin Zhang,Bin Wang,Bo Li,Buyun Ma,Changxin Miao,Changyi Wan,Chen Xu,Dapeng Shi,Dingyuan Hu,Enle Liu,Guanzhe Huang,Gulin Yan,Hanpeng Hu,Haonan Jia,Jiahao Gong,Jiaoren Wu,Jie Wu,Jie Yang,Junzhe Lin,Kaixiang Li,Lei Xia,Longlong Gu,Ming Li,Nie Hao,Ranchen Ming,Shaoliang Pang,Siqi Liu,Song Yuan,Tiancheng Cao,Wen Li,Wenqing He,Xu Zhao,Xuelin Zhang,Yanbo Yu,Yinmin Zhong,Yu Zhou,Yuanwei Liang,Yuanwei Lu,Yuxiang Yang,Zidong Yang,Zili Zhang,Binxing Jiao,Heung-Yeung Shum,Jiansheng Chen,Jing Li,Xiangyu Zhang,Xinhao Zhang,Yibo Zhu,Daxin Jiang,Shuchang Zhou,Chen Hu
Main category: cs.SD
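TL;DR: The paper proposes Step-Audio-AQAA, a fully end-to-end audio question-answering model that combines a dual-codebook audio tokenizer, a 130-billion-parameter LLM, and a neural vocoder to enable efficient audio interaction, and that excels in speech control.
Details
Motivation: Existing audio-language models rely on text outputs and cannot generate natural speech directly, which limits the fluency of audio interaction.
Contribution: 1. Proposes a fully end-to-end AQAA model; 2. Combines a dual-codebook audio tokenizer with a large-parameter model; 3. Improves performance through DPO and model merging.
Method: 1. A dual-codebook audio tokenizer extracts features; 2. A 130-billion-parameter LLM serves as the backbone; 3. A neural vocoder synthesizes speech; 4. Post-training interleaves text and audio token output to improve coherence.
Result: On the StepEval-Audio-360 benchmark, the model stands out in speech control and surpasses existing LALMs.
Insight: 1. The end-to-end design markedly improves the efficiency of audio interaction; 2. The token-based vocoder is critical to performance.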
Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent
human-computer interaction, yet their reliance on text-based outputs limits
their ability to generate natural speech responses directly, hindering seamless
audio interactions. To address this, we introduce Step-Audio-AQAA, a fully
end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model
integrates a dual-codebook audio tokenizer for linguistic and semantic feature
extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for
high-fidelity speech synthesis. Our post-training approach employs interleaved
token output of text and audio to enhance semantic coherence and combines
Direct Preference Optimization (DPO) with model merging to improve performance.
Evaluations on the StepEval-Audio-360 benchmark demonstrate that
Step-Audio-AQAA excels especially in speech control, outperforming
state-of-the-art LALMs in key areas. This work contributes a promising solution
for end-to-end LALMs and highlights the critical role of the token-based vocoder in
enhancing overall performance for AQAA tasks.
eess.AS [Back]
[122] Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs
Šimon Sedláček,Bolaji Yusuf,Ján Švec,Pradyoth Hegde,Santosh Kesiraju,Oldřich Plchot,Jan Černocký
Main category: eess.AS
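TL;DR: The paper improves spoken dialogue state tracking (DST) by aligning the representation spaces of speech encoders and LLMs through a small connector module, reaching state-of-the-art performance on the SpokenWOZ dataset (42.17% JGA).
Details
Motivation: To address the mismatch between the representation spaces of speech encoders and large language models (LLMs) in spoken dialogue state tracking, the authors propose an alignment approach that aims to improve performance while using only open-source components.
Contribution: 1. Proposes aligning speech encoders and LLMs through a small connector module; 2. Analyzes the effect of different system components (such as adapter fine-tuning and agent turns in the dialogue history); 3. Introduces fuzzy-matching post-processing to improve performance on named entities.
Method: 1. Uses WavLM-large as the speech encoder and OLMo as the LLM; 2. Designs a small connector module to align their representations; 3. Experiments cover full versus LoRA fine-tuning, the effect of agent turns, and fuzzy-matching post-processing.
Result: On the SpokenWOZ test set, the best model (WavLM + connector + OLMo-1B) reaches 34.66% JGA, and the model using Gemma-2-9B-instruct improves this further to 42.17% JGA.
Insight: Aligning the representation spaces of speech encoders and LLMs significantly improves spoken DST; fuzzy-matching post-processing matters most for named entities.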
Abstract: In this work, we approach spoken Dialogue State Tracking (DST) by bridging
the representation spaces of speech encoders and LLMs via a small connector
module, with a focus on fully open-sourced and open-data components
(WavLM-large, OLMo). We focus on ablating different aspects of such systems
including full/LoRA adapter fine-tuning, the effect of agent turns in the
dialogue history, as well as fuzzy matching-based output post-processing, which
greatly improves performance of our systems on named entities in the dialogue
slot values. We conduct our experiments on the SpokenWOZ dataset, and
additionally utilize the Speech-Aware MultiWOZ dataset to augment our training
data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned
models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our
system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17%
JGA on SpokenWOZ test.
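The fuzzy-matching post-processing mentioned in the abstract can be sketched with the Python standard library. The abstract does not say which matching algorithm the authors use, so `difflib` is a stand-in here, and the function name `snap_slot_value`, the candidate list, and the `cutoff` value are hypothetical choices made for illustration.

```python
import difflib


def snap_slot_value(predicted, candidates, cutoff=0.8):
    """Map a predicted slot value to the closest known value, if any.

    `candidates` is an assumed list of valid values for the slot (for
    example, hotel names from an ontology). If nothing is similar enough,
    the prediction is returned unchanged.
    """
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        predicted.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else predicted


if __name__ == "__main__":
    hotels = ["Alexander Bed and Breakfast", "Acorn Guest House"]
    print(snap_slot_value("alexander bed and breakfest", hotels))
```

Named entities are where transcription-style misspellings hurt exact-match metrics such as JGA, which is consistent with the abstract's remark that this step helps most on named entities in slot values.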
q-bio.NC [Back]
[123] Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain
Subba Reddy Oota,Khushbu Pahwa,Prachi Jindal,Satya Sai Srinath Namburi,Maneesh Singh,Tanmoy Chakraborty,Bapi S. Raju,Manish Gupta
Main category: q-bio.NC
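TL;DR: This paper studies the brain alignment of instruction-tuned multimodal large language models (MLLMs) on video and audio tasks, finding that task-specific instructions significantly improve the alignment between MLLMs and brain activity and revealing a hierarchical alignment pattern.
Details
Motivation: Prior work has focused on the brain alignment of non-instruction-tuned MLLMs under unimodal or multimodal stimuli and has overlooked the role of task-specific instructions. This paper fills that gap by examining instruction-tuned MLLMs under naturalistic video-audio stimuli.
Contribution: 1. Shows that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal models and unimodal models (by 15% and 20%, respectively). 2. Shows that task-specific MLLM representations precisely differentiate multimodal functional processing in the brain. 3. Finds that the layer hierarchy of MLLMs aligns with the hierarchy of processing regions in the brain.
Method: Extracts instruction-specific embeddings from six video and two audio instruction-tuned MLLMs, using 13 video task-specific instructions, and evaluates brain alignment by predicting neural activity recorded under naturalistic video-audio stimuli.
Result: Instruction-tuned MLLMs significantly outperform non-instruction-tuned models, and their layers align hierarchically with the brain: early sensory areas align with early layers, while higher-level visual and language regions align with middle to late layers.
Insight: Task-specific instructions significantly improve the alignment between MLLMs and the brain, offering a new perspective on joint information processing in the two systems. The code is open source.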
Abstract: Recent voxel-wise multimodal brain encoding studies have shown that
multimodal large language models (MLLMs) exhibit a higher degree of brain
alignment compared to unimodal models in both unimodal and multimodal stimulus
settings. More recently, instruction-tuned multimodal models have been shown to
generate task-specific representations that align strongly with brain activity.
However, prior work evaluating the brain alignment of MLLMs has primarily
focused on unimodal settings or relied on non-instruction-tuned multimodal
models for multimodal stimuli. To address this gap, we investigated brain
alignment, that is, measuring the degree of predictivity of neural activity
recorded while participants were watching naturalistic movies (video along with
audio) with representations derived from MLLMs. We utilized
instruction-specific embeddings from six video and two audio instruction-tuned
MLLMs. Experiments with 13 video task-specific instructions show that
instruction-tuned video MLLMs significantly outperform non-instruction-tuned
multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for
both video and audio tasks using language-guided instructions shows clear
disentanglement in task-specific representations from MLLMs, leading to precise
differentiation of multimodal functional processing in the brain. We also find
that MLLM layers align hierarchically with the brain, with early sensory areas
showing strong alignment with early layers, while higher-level visual and
language regions align more with middle to late layers. These findings provide
clear evidence for the role of task-specific instructions in improving the
alignment between brain activity and MLLMs, and open new avenues for mapping
joint information processing in both systems. We make the code publicly
available [https://github.com/subbareddy248/mllm_videos].
eess.IV [Back]
[124] A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation
Isha Puri,David Cox
Main category: eess.IV
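TL;DR: This paper proposes a convolutional neural network (CNN) based biomedical image segmentation method for accurately tracking rodent eye movements, addressing the fact that existing techniques ignore the distinctive characteristics of rodent eyes.
Details
Motivation: Rodents are widely used in neuroscience and vision science research, but existing eye-tracking techniques mostly target human eyes and do not account for the particularities of rodent eyes (small size, abundant surrounding hair, and so on).
Contribution: Proposes a flexible, robust, and highly accurate CNN architecture for identifying the pupil and corneal reflection in rodents, and demonstrates its use together with an automated infrared video system.
Method: Applies a convolutional neural network for biomedical image segmentation, with incremental training to adapt to the variability of rodent eye parameters.
Result: The method shows high accuracy and practicality for rodent eye tracking and represents the state of the art.
Insight: The method gives rodent research a more accurate tool and fills a gap left by existing techniques.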
Abstract: Research in neuroscience and vision science relies heavily on careful
measurements of animal subjects’ gaze direction. Rodents are the most widely
studied animal subjects for such research because of their economic advantage
and hardiness. Recently, video-based eye trackers that use image processing
techniques have become a popular option for gaze tracking because they are easy
to use and are completely noninvasive. Although significant progress has been
made in improving the accuracy and robustness of eye tracking algorithms,
almost all of the techniques have focused on human eyes and do not account for
the unique characteristics of rodent eye images, e.g., variability in eye
parameters, abundance of surrounding hair, and small size. To overcome these
unique challenges, this work presents a flexible,
robust, and highly accurate model for pupil and corneal reflection
identification in rodent gaze determination that can be incrementally trained
to account for variability in eye parameters encountered in the field. To the
best of our knowledge, this is the first paper that demonstrates a highly
accurate and practical biomedical image segmentation based convolutional neural
network architecture for pupil and corneal reflection identification in eye
images. This new method, in conjunction with our automated infrared video-based
eye recording system, offers state-of-the-art technology in eye tracking
for neuroscience and vision science research on rodents.
[125] Snap-and-tune: combining deep learning and test-time optimization for high-fidelity cardiovascular volumetric meshing
Daniel H. Pak,Shubh Thaker,Kyle Baylous,Xiaoran Zhang,Danny Bluestein,James S. Duncan
Main category: eess.IV
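TL;DR: The paper proposes a snap-and-tune strategy that combines deep learning with test-time optimization for high-quality cardiovascular volumetric meshing, significantly improving spatial accuracy and mesh quality.
Details
Motivation: High-quality volumetric meshing is a key bottleneck for physics-based simulation in personalized medicine, and existing deep learning methods are limited at high-curvature areas and in inter-part distances.
Contribution: Proposes the snap-and-tune strategy, which combines DL with test-time optimization to achieve fast initial shape fitting and detailed sample-specific mesh corrections without additional training labels.
Method: First fits an initial shape quickly with DL, then applies test-time optimization for detailed corrections.
Result: Significantly improves spatial accuracy and mesh quality, and demonstrates practical usefulness on two software platforms.
Insight: Combining the speed of DL with the flexibility of optimization enables high-quality mesh generation for complex medical structures.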
Abstract: High-quality volumetric meshing from medical images is a key bottleneck for
physics-based simulations in personalized medicine. For volumetric meshing of
complex medical structures, recent studies have often utilized deep learning
(DL)-based template deformation approaches to enable fast test-time generation
with high spatial accuracy. However, these approaches still exhibit
limitations, such as limited flexibility at high-curvature areas and
unrealistic inter-part distances. In this study, we introduce a simple yet
effective snap-and-tune strategy that sequentially applies DL and test-time
optimization, which combines fast initial shape fitting with more detailed
sample-specific mesh corrections. Our method provides significant improvements
in both spatial accuracy and mesh quality, while being fully automated and
requiring no additional training labels. Finally, we demonstrate the
versatility and usefulness of our newly generated meshes via solid mechanics
simulations in two different software platforms. Our code is available at
https://github.com/danpak94/Deep-Cardiac-Volumetric-Mesh.
[126] Plug-and-Play Linear Attention for Pre-trained Image and Video Restoration Models
Srinivasan Kidambi,Pravin Nair
Main category: eess.IV
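TL;DR: PnP-Nystra is a Nyström-based linear approximation of self-attention. As a plug-and-play module it can be integrated into pre-trained image and video restoration models without retraining, significantly improving computational efficiency.
Details
Motivation: Multi-head self-attention (MHSA) has computational complexity quadratic in the input length, which makes it a bottleneck in real-time and resource-constrained environments.
Contribution: Proposes PnP-Nystra, a linear approximation of self-attention, and is the first to show that a linear attention module can serve as a training-free substitute for MHSA in restoration models.
Method: A Nyström-based linear approximation, integrated into pre-trained models as a plug-and-play module that replaces MHSA without retraining.
Result: Experiments show speed-ups of 2-4x on GPU and 2-5x on CPU, with a maximum PSNR drop of only 1.5 dB.
Insight: Linear attention can effectively replace MHSA, substantially improving computational efficiency while keeping performance stable.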
Abstract: Multi-head self-attention (MHSA) has become a core component in modern
computer vision models. However, its quadratic complexity with respect to input
length poses a significant computational bottleneck in real-time and
resource-constrained environments. We propose PnP-Nystra, a Nyström-based linear
approximation of self-attention, developed as a plug-and-play (PnP) module that
can be integrated into pre-trained image and video restoration models
without retraining. As a drop-in replacement for MHSA, PnP-Nystra enables
efficient acceleration in various window-based transformer architectures,
including SwinIR, Uformer, and RVRT. Our experiments across diverse image and
video restoration tasks, including denoising, deblurring, and super-resolution,
demonstrate that PnP-Nystra achieves a 2-4x speed-up on an NVIDIA RTX 4090 GPU
and a 2-5x speed-up on CPU inference. Despite these significant gains, the
method incurs a maximum PSNR drop of only 1.5 dB across all evaluated tasks. To
the best of our knowledge, we are the first to demonstrate a linear attention
module functioning as a training-free substitute for MHSA in restoration models.
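For readers unfamiliar with the Nyström idea, the sketch below approximates softmax attention through a small set of landmark tokens, so the cost grows linearly with the number of tokens for a fixed number of landmarks. It is a generic, single-head illustration that assumes NumPy is installed; it is not the PnP-Nystra implementation. The segment-mean landmarks and the function names are assumptions, since the abstract does not describe how landmarks are chosen.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def exact_attention(q, k, v):
    """Standard attention: builds the full (n, n) matrix, O(n^2 d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def nystrom_attention(q, k, v, num_landmarks):
    """Nystrom approximation using segment means as landmarks (assumed)."""
    n, d = q.shape
    if n % num_landmarks != 0:
        raise ValueError("n must be divisible by num_landmarks")
    scale = 1.0 / np.sqrt(d)
    seg = n // num_landmarks
    q_l = q.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    k_l = k.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    f = softmax(q @ k_l.T * scale)    # (n, m): tokens vs. key landmarks
    a = softmax(q_l @ k_l.T * scale)  # (m, m): landmarks vs. landmarks
    b = softmax(q_l @ k.T * scale)    # (m, n): query landmarks vs. tokens
    # Multiply right-to-left so no (n, n) matrix is ever formed.
    return f @ (np.linalg.pinv(a) @ (b @ v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
    err = np.abs(nystrom_attention(q, k, v, 8) - exact_attention(q, k, v)).max()
    print(f"max abs error with 8 landmarks: {err:.4f}")
```

When the number of landmarks equals the number of tokens, the three factors coincide with the full attention matrix and the approximation reduces to exact attention, which is a convenient sanity check.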
[127] Biologically Inspired Deep Learning Approaches for Fetal Ultrasound Image Classification
Rinat Prochii,Elizaveta Dakhova,Pavel Birulin,Maxim Sharaev
Main category: eess.IV
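TL;DR: The paper proposes a biologically inspired deep learning ensemble framework for classifying second-trimester fetal ultrasound images. It distinguishes 16 fetal structures simultaneously and is more lightweight than existing approaches while remaining competitive in performance.
Details
Motivation: Accurate classification is difficult because fetal ultrasound images have low quality, high intra-class variability, and class imbalance. Existing methods target only a few anatomical structures and fall short of clinical needs.
Contribution: Proposes a lightweight, biologically inspired deep learning ensemble framework that is the first to classify 16 fetal structures simultaneously, with performance competitive with more complex models.
Method: Uses a two-branch design (a “shallow” path for coarse, low-resolution cues and a “detailed” path for fine, high-resolution features), combined with LDAM-Focal loss and EfficientNet-B0/B6.
Result: Trained and evaluated on 5,298 clinical images; 90% of organs are classified with accuracy > 0.75 and 75% with accuracy > 0.85, comparable to more complex models.
Insight: Biologically inspired modular stacking is robust and scalable in complex clinical settings and offers a new direction for medical image classification.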
Abstract: Accurate classification of second-trimester fetal ultrasound images remains
challenging due to low image quality, high intra-class variability, and
significant class imbalance. In this work, we introduce a simple yet powerful,
biologically inspired deep learning ensemble framework that, unlike prior
studies focused on only a handful of anatomical targets, simultaneously
distinguishes 16 fetal structures. Drawing on the hierarchical, modular
organization of biological vision systems, our model stacks two complementary
branches (a “shallow” path for coarse, low-resolution cues and a “detailed”
path for fine, high-resolution features), concatenating their outputs for final
prediction. To our knowledge, no existing method has addressed such a large
number of classes with a comparably lightweight architecture. We trained and
evaluated on 5,298 routinely acquired clinical images (annotated by three
experts and reconciled via Dawid-Skene), reflecting real-world noise and
variability rather than a “cleaned” dataset. Despite this complexity, our
ensemble (EfficientNet-B0 + EfficientNet-B6 with LDAM-Focal loss) identifies
90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85,
performance competitive with more elaborate models applied to far fewer
categories. These results demonstrate that biologically inspired modular
stacking can yield robust, scalable fetal anatomy recognition in challenging
clinical settings.
[128] Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment
Maximilian Tschuchnig,Lukas Lamminger,Philipp Steininger,Michael Gadermayr
Main category: eess.IV
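TL;DR: The paper combines CBCT with preoperative CT through multimodal learning to improve synthetic CT quality, with the largest gains when the volumes are well aligned and the CBCT quality is low.
Details
Motivation: CBCT is fast and low-dose but suffers from artifacts, which synthetic CT (sCT) generation can mitigate. The paper further improves sCT generation through multimodal fusion, with particular attention to the effects of CBCT quality and alignment.
Contribution: Proposes an sCT generation method based on multimodal fusion of CBCT and preoperative CT, shows that it outperforms unimodal baselines, and analyzes how CBCT quality and alignment affect the results.
Method: Uses multimodal learning to integrate intraoperative CBCT with preoperative CT, and uses a synthetic dataset to examine how CBCT-CT alignment and CBCT quality affect sCT quality.
Result: Multimodal sCT outperforms unimodal baselines on real-world datasets, with the most significant gains for well-aligned, low-quality CBCT-CT cases.
Insight: CBCT quality and alignment are critical for sCT generation. Multimodal fusion is an effective way to improve sCT quality, and the findings are reproducible on real-world clinical datasets.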
Abstract: Cone-Beam Computed Tomography (CBCT) is widely used for real-time
intraoperative imaging due to its low radiation dose and high acquisition
speed. However, despite its high resolution, CBCT suffers from significant
artifacts and thereby lower visual quality, compared to conventional Computed
Tomography (CT). A recent approach to mitigate these artifacts is synthetic CT
(sCT) generation, translating CBCT volumes into the CT domain. In this work, we
enhance sCT generation through multimodal learning, integrating intraoperative
CBCT with preoperative CT. Beyond validation on two real-world datasets, we use
a versatile synthetic dataset to analyze how CBCT-CT alignment and CBCT
quality affect sCT quality. The results demonstrate that multimodal sCT
consistently outperforms unimodal baselines, with the most significant gains
observed in well-aligned, low-quality CBCT-CT cases. Finally, we demonstrate
that these findings are highly reproducible in real-world clinical datasets.
cs.LG [Back]
[129] Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
Chenxi Liu,Tianyi Xiong,Ruibo Chen,Yihan Wu,Junfeng Guo,Tianyi Zhou,Heng Huang
Main category: cs.LG
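TL;DR: The paper proposes MBPO, a new preference learning framework that addresses modality imbalance in large multimodal models (LMMs) by generating adversarial negative samples and using online verified rewards, improving performance on vision-language tasks and reducing hallucinations.
Details
Motivation: Existing preference optimization methods do not effectively restrain the internal biases of the large language model (LLM) backbone, and they rely on offline data that cannot adapt to distribution shifts during training, which leaves LMMs with modality imbalance during reasoning.
Contribution: Proposes the MBPO framework, which combines adversarial negative sample generation with online verified rewards to balance modality bias, and trains on hybrid data with GRPO.
Method: Generates hard negatives (incorrect responses that under-use visual information) by adversarially perturbing input images, generates diverse online responses with verified rewards, and trains on the offline-online hybrid data with GRPO.
Result: Experiments show that MBPO improves LMM performance on challenging vision-language tasks and reduces hallucinations.
Insight: Adversarial methods combined with online data can address modality imbalance in LMMs, suggesting a new direction for multimodal alignment.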
Abstract: The task adaptation and alignment of Large Multimodal Models (LMMs) have been
significantly advanced by instruction tuning and further strengthened by recent
preference optimization. Yet, most LMMs still suffer from severe modality
imbalance during reasoning, i.e., language prior biases outweigh visual
inputs, which bottlenecks their generalization to downstream tasks and causes
hallucinations. However, existing preference optimization approaches for LMMs
do not focus on restraining the internal biases of their Large Language Model
(LLM) backbones when curating the training data. Moreover, they heavily rely on
offline data and lack the capacity to explore diverse responses adaptive to
dynamic distributional shifts during training. Meanwhile, Group Relative Policy
Optimization (GRPO), a recent method using online-generated data and verified
rewards to improve reasoning capabilities, remains largely underexplored in LMM
alignment. In this paper, we propose a novel preference learning framework,
Modality-Balancing Preference Optimization (MBPO), to address the modality
imbalance in LMMs. MBPO constructs a more effective offline preference dataset
by generating hard negatives, i.e., rejected responses misled by LLM biases due
to limited usage of visual information, through adversarial perturbation of
input images. Moreover, MBPO leverages the easy-to-verify nature of closed-ended
tasks to generate online responses with verified rewards. GRPO is then employed
to train the model with offline-online hybrid data. Extensive experiments
demonstrate that MBPO can enhance LMM performance on challenging
vision-language tasks and effectively reduce hallucinations.
[130] Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning
Hanbing Liu,Lang Cao,Yuanyi Ren,Mengyu Zhou,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
Main category: cs.LG
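TL;DR: Bingo is a reinforcement learning framework that improves the reasoning efficiency and accuracy of large language models through dynamic and significance-aware length reward design.
Details
Motivation: Large language models often produce verbose or redundant output when reasoning, which is inefficient. Prior work has used reinforcement learning to improve reasoning accuracy but has paid little attention to efficiency, and simple length-based rewards tend to reduce accuracy.
Contribution: Proposes the Bingo framework, which introduces a significance-aware length reward and a dynamic length reward to balance reasoning efficiency and accuracy.
Method: 1. Significance-aware length reward: encourages the model to reduce only insignificant, redundant tokens. 2. Dynamic length reward: encourages elaborate reasoning early on and decays over time to improve efficiency.
Result: Across multiple reasoning benchmarks, Bingo outperforms baseline methods in both efficiency and accuracy, achieving a better trade-off.
Insight: Training large language models explicitly for efficient reasoning has clear potential, and dynamic reward design is key.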
Abstract: Large language models have demonstrated impressive reasoning capabilities,
yet they often suffer from inefficiencies due to unnecessarily verbose or
redundant outputs. While many works have explored reinforcement learning (RL)
to enhance reasoning abilities, most primarily focus on improving accuracy,
with limited attention to reasoning efficiency. Some existing approaches
introduce direct length-based rewards to encourage brevity, but this often
leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL
framework that advances length-based reward design to boost efficient
reasoning. Bingo incorporates two key mechanisms: a significance-aware length
reward, which gradually guides the model to reduce only insignificant tokens,
and a dynamic length reward, which initially encourages elaborate reasoning for
hard questions but decays over time to improve overall efficiency. Experiments
across multiple reasoning benchmarks show that Bingo improves both accuracy and
efficiency. It outperforms the vanilla reward and several other length-based
reward baselines in RL, achieving a favorable trade-off between accuracy and
efficiency. These results underscore the potential of training LLMs explicitly
for efficient reasoning.
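The two reward mechanisms described above can be illustrated with a toy shaping function. The abstract does not give the actual formulas, so everything here is an assumption: the linear decay schedule, the `budget` and `weight` parameters, the `is_hard` flag, and the function names are hypothetical, and token significance is assumed to be computed elsewhere.

```python
def significance_penalty(n_insignificant, budget):
    """Penalize only tokens judged insignificant (count supplied by caller)."""
    return -min(n_insignificant / budget, 1.0)


def dynamic_length_bonus(n_tokens, budget, step, total_steps, is_hard):
    """Reward longer reasoning on hard questions, decaying over training."""
    if not is_hard:
        return 0.0
    decay = max(0.0, 1.0 - step / total_steps)  # 1 at the start, 0 at the end
    return decay * min(n_tokens / budget, 1.0)


def shaped_reward(correct, n_tokens, n_insignificant, step, total_steps,
                  is_hard, budget=1000, weight=0.2):
    """Correctness reward plus a weighted sum of the two length terms."""
    base = 1.0 if correct else 0.0
    shaping = (significance_penalty(n_insignificant, budget)
               + dynamic_length_bonus(n_tokens, budget, step, total_steps,
                                      is_hard))
    return base + weight * shaping


if __name__ == "__main__":
    early = shaped_reward(True, 500, 100, step=0, total_steps=100, is_hard=True)
    late = shaped_reward(True, 500, 100, step=100, total_steps=100, is_hard=True)
    print(early, late)
```

Early in training the same long answer to a hard question earns a bonus; by the end only the penalty on insignificant tokens remains, which mirrors the "elaborate first, compress later" behavior the abstract describes.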
[131] Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints
Yaswanth Chittepu,Blossom Metevier,Will Schwarzer,Austin Hoag,Scott Niekum,Philip S. Thomas
Main category: cs.LG
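TL;DR: HC-RLHF is a method for reinforcement learning from human feedback with high-confidence safety guarantees. It explicitly decouples human preferences into helpfulness and harmlessness and optimizes the reward function under a pessimistic cost constraint to keep models safe in sensitive domains.
Details
Motivation: Existing language model alignment methods often trade safety off against helpfulness, which can produce unacceptable responses in sensitive domains. HC-RLHF aims to provide high-confidence safety guarantees while maximizing helpfulness.
Contribution: 1) Proposes HC-RLHF, which, like previous methods, decouples human preferences into helpfulness and harmlessness; 2) Provides a theoretical proof that the method returns an unsafe solution with probability no greater than a user-specified threshold; 3) Validates the method empirically.
Method: 1) Trains a reward model (helpfulness) and a cost model (harmlessness); 2) Optimizes the reward function under a pessimistic cost constraint; 3) Runs a safety test to verify the trained model.
Result: HC-RLHF is used to align three language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) and improves harmlessness and helpfulness compared to previous methods.
Insight: Explicit decoupling combined with pessimistic optimization gives a more reliable alignment framework, suited to applications with strict safety requirements.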
Abstract: Existing approaches to language model alignment often treat safety as a
tradeoff against helpfulness, which can lead to unacceptable responses in
sensitive domains. To ensure reliable performance in such settings, we propose
High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a
method that provides high-confidence safety guarantees while maximizing
helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human
preferences into helpfulness and harmlessness (safety), which are learned by
training a reward model and a cost model, respectively. It then employs a
two-step process to find safe solutions. In the first step, it optimizes the
reward function under an intentionally pessimistic version of the cost
constraint. In the second step, the trained model undergoes a safety test to
verify whether its performance stays within an upper-confidence bound of the
actual cost constraint. We provide a theoretical analysis of HC-RLHF, including
a proof that it will not return an unsafe solution with a probability greater
than a user-specified threshold. For our empirical analysis, we apply HC-RLHF
to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and
LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF
produces safe models with high probability and can improve harmlessness and
helpfulness compared to previous methods.
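The safety test in the second step can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes per-response costs scaled to [0, 1] and uses a one-sided Hoeffding bound, whereas the abstract does not say which confidence bound HC-RLHF uses. The function names and sample values are hypothetical.

```python
import math


def hoeffding_upper_bound(costs, delta):
    """One-sided (1 - delta) upper confidence bound on the mean of costs in [0, 1]."""
    n = len(costs)
    mean = sum(costs) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def passes_safety_test(costs, cost_limit, delta):
    """Accept the model only if the upper bound on expected cost is within the limit."""
    return hoeffding_upper_bound(costs, delta) <= cost_limit


# Hypothetical held-out costs from a cost model (assumed scaled to [0, 1]).
held_out_costs = [0.1] * 400
print(passes_safety_test(held_out_costs, cost_limit=0.2, delta=0.05))
```

A smaller `delta` or fewer held-out samples widens the bound, so a candidate model is more likely to be rejected even when its average cost looks acceptable.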
[132] From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium
Xie Yi,Zhanke Zhou,Chentao Cao,Qiyu Niu,Tongliang Liu,Bo Han
Main category: cs.LG
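TL;DR: This paper proposes ECON, a method for efficient multi-agent LLM reasoning via Bayesian Nash equilibrium (BNE), addressing the high computational cost and missing convergence guarantees of conventional multi-agent frameworks.
Details
Motivation: Multi-agent frameworks can substantially strengthen LLM reasoning, but conventional approaches incur heavy computational costs and lack convergence guarantees. A more efficient and theoretically grounded method is needed.
Contribution: 1. Recasts multi-LLM coordination as an incomplete-information game and seeks a Bayesian Nash equilibrium; 2. proposes the ECON framework, which combines distributed reasoning with a centralized final output; 3. proves a tighter regret bound than non-equilibrium multi-agent schemes.
Method: ECON is a hierarchical reinforcement-learning paradigm. Each LLM independently selects the response that maximizes its expected reward given its probabilistic beliefs about the other agents, without costly inter-agent exchanges. Coordination is achieved through the Bayesian Nash equilibrium.
Result: ECON outperforms existing multi-LLM approaches by 11.2% on average across six complex reasoning and planning benchmarks, and further experiments confirm its scalability and flexibility.
Insight: Bayesian Nash equilibrium gives multi-agent LLM collaboration a theoretical footing, and ECON shows the potential of pairing distributed reasoning with centralized output for larger multi-LLM ensembles.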
Abstract: Multi-agent frameworks can substantially boost the reasoning power of large
language models (LLMs), but they typically incur heavy computational costs and
lack convergence guarantees. To overcome these challenges, we recast multi-LLM
coordination as an incomplete-information game and seek a Bayesian Nash
equilibrium (BNE), in which each agent optimally responds to its probabilistic
beliefs about the strategies of others. We introduce Efficient Coordination via
Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that
marries distributed reasoning with centralized final output. Under ECON, each
LLM independently selects responses that maximize its expected reward,
conditioned on its beliefs about co-agents, without requiring costly
inter-agent exchanges. We mathematically prove that ECON attains a markedly
tighter regret bound than non-equilibrium multi-agent schemes. Empirically,
ECON outperforms existing multi-LLM approaches by 11.2% on average across six
benchmarks spanning complex reasoning and planning tasks. Further experiments
demonstrate ECON’s ability to flexibly incorporate additional models,
confirming its scalability and paving the way toward larger, more powerful
multi-LLM ensembles. The code is publicly available at:
https://github.com/tmlr-group/ECON.
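The idea that each agent responds optimally to its probabilistic beliefs about the others can be illustrated with a toy best-response computation. This is a sketch of the general game-theoretic notion only, not of ECON itself; the actions, co-agent strategies, and payoff values are hypothetical.

```python
def best_response(actions, beliefs, payoff):
    """Pick the action with the highest expected payoff under the agent's beliefs.

    beliefs: dict mapping a co-agent strategy to its believed probability.
    payoff:  function (action, co_strategy) -> float.
    """
    def expected(action):
        return sum(p * payoff(action, s) for s, p in beliefs.items())

    return max(actions, key=expected)


# Hypothetical payoffs for one agent against one co-agent.
toy_payoff = {
    ("detailed", "terse"): 1.0, ("detailed", "verbose"): 0.2,
    ("brief", "terse"): 0.4, ("brief", "verbose"): 0.6,
}
beliefs = {"terse": 0.7, "verbose": 0.3}
choice = best_response(["detailed", "brief"], beliefs, lambda a, s: toy_payoff[(a, s)])
print(choice)
```

In a Bayesian Nash equilibrium every agent's strategy is such a best response at the same time, so no agent gains by deviating given its beliefs.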
[133] From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
Zhanke Zhou,Xiao Feng,Zhaocheng Zhu,Jiangchao Yao,Sanmi Koyejo,Bo Han
Main category: cs.LG
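TL;DR: This paper introduces AR-Bench, a benchmark for evaluating the active reasoning ability of large language models (LLMs) under incomplete information, and shows that current LLMs fall well short on active reasoning.
Details
Motivation: Existing benchmarks mainly assess passive reasoning and overlook whether an LLM can actively acquire missing information, a gap that limits the potential of LLMs in real-world applications.
Contribution: Designs AR-Bench, covering three task families (detective cases, situation puzzles, and guessing numbers), to systematically evaluate the active reasoning ability of LLMs.
Method: Uses three task families that simulate real-world scenarios to measure LLM performance on commonsense, logical, and symbolic reasoning challenges, with ablation studies on the effect of improvement strategies.
Result: Current LLMs perform poorly on active reasoning; even advanced strategies such as tree-based search or post-training yield only modest gains.
Insight: Highlights the need to advance active reasoning methods, e.g., interactive learning, real-time feedback loops, and environment-aware training objectives.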
Abstract: While existing benchmarks probe the reasoning abilities of large language
models (LLMs) across diverse domains, they predominantly assess passive
reasoning, providing models with all the information needed to reach a
solution. By contrast, active reasoning, where an LLM must interact with
external systems to acquire missing evidence or data, has received little
systematic attention. To address this shortfall, we present AR-Bench, a novel
benchmark designed explicitly to evaluate an LLM’s active reasoning skills.
AR-Bench comprises three task families (detective cases, situation puzzles, and
guessing numbers) that together simulate real-world, agentic scenarios and
measure performance across commonsense, logical, and symbolic reasoning
challenges. Empirical evaluation on AR-Bench demonstrates that contemporary
LLMs exhibit pronounced difficulties with active reasoning: they frequently
fail to acquire or leverage the information needed to solve tasks. This gap
highlights a stark divergence between their passive and active reasoning
abilities. Moreover, ablation studies indicate that even advanced strategies,
such as tree-based searching or post-training approaches, yield only modest
gains and fall short of the levels required for real-world deployment.
Collectively, these findings highlight the critical need to advance methodology
for active reasoning, e.g., incorporating interactive learning, real-time
feedback loops, and environment-aware objectives for training. The benchmark is
publicly available at: https://github.com/tmlr-group/AR-Bench.
[134] Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan,Tengyang Xie
Main category: cs.LG
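TL;DR: This paper strengthens LLM reasoning through multi-agent reflection, using the reinforcement learning algorithm DPSDP to train an actor-critic LLM system that iteratively refines its answers.
Details
Motivation: Existing verify-and-improve approaches suffer from restricted feedback spaces and a lack of coordinated training of the different parties, leading to suboptimal performance.
Contribution: Proposes DPSDP (Direct Policy Search by Dynamic Programming), which trains an actor-critic LLM system via direct preference learning on self-generated data to refine answers over multiple turns.
Method: Models multi-turn refinement as a Markov Decision Process, trains with DPSDP, and instantiates the method with various base models.
Result: On MATH 500, majority voting over five refinement steps raises first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
Insight: Multi-agent collaboration with iterative refinement can improve LLM reasoning and shows potential for out-of-distribution generalization.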
Abstract: Leveraging more test-time computation has proven to be an effective way to
boost the reasoning capabilities of large language models (LLMs). Among various
methods, the verify-and-improve paradigm stands out for enabling dynamic
solution exploration and feedback incorporation. However, existing approaches
often suffer from restricted feedback spaces and lack of coordinated training
of different parties, leading to suboptimal performance. To address this, we
model this multi-turn refinement process as a Markov Decision Process and
introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement
learning algorithm that trains an actor-critic LLM system to iteratively refine
answers via direct preference learning on self-generated data. Theoretically,
DPSDP can match the performance of any policy within the training distribution.
Empirically, we instantiate DPSDP with various base models and show
improvements on both in- and out-of-distribution benchmarks. For example, on
benchmark MATH 500, majority voting over five refinement steps increases
first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An
ablation study further confirms the benefits of multi-agent collaboration and
out-of-distribution generalization.
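The majority vote over refinement steps mentioned in the abstract can be sketched in a few lines. The tie-breaking rule (earliest answer wins) and the sample answers are assumptions for illustration; the abstract does not specify how ties are handled.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical final answers from five refinement steps on one problem.
refined = ["12", "15", "12", "12", "9"]
print(majority_vote(refined))
```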
[135] Reinforcement Learning Teachers of Test Time Scaling
Edoardo Cetin,Tianyu Zhao,Yujin Tang
Main category: cs.LG
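TL;DR: This paper proposes Reinforcement-Learned Teachers (RLTs), a framework that trains teacher models to produce detailed explanations for students. It avoids RL's exploration challenge and performs strongly on several tasks.
Details
Motivation: Training reasoning language models with conventional RL depends on the model's ability to explore at initialization, and such models are often used to distill new students rather than being deployed directly. The RLT framework targets these limitations.
Contribution: Introduces the RLT framework, which trains teacher models with dense rewards to generate detailed explanations, yielding student performance beyond existing distillation pipelines.
Method: An RLT is given both the question and its solution and generates a detailed explanation. The student's understanding of the solution, measured after reading the explanation, provides the dense reward. This sidesteps exploration and optimizes directly for downstream distillation.
Result: A 7B RLT yields higher final performance than distillation pipelines built on much larger language models, and it remains effective when applied zero-shot.
Insight: RLTs improve the efficiency and reusability of RL for reasoning and remain effective for larger students and out-of-distribution tasks.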
Abstract: Training reasoning language models (LMs) with reinforcement learning (RL) for
one-hot correctness inherently relies on the LM being able to explore and solve
its task with some chance at initialization. Furthermore, a key use case of
reasoning LMs is to act as teachers for distilling new students and
cold-starting future RL iterations rather than being deployed themselves. From
these considerations, we introduce a new framework that avoids RL’s exploration
challenge by training a new class of Reinforcement-Learned Teachers (RLTs)
focused on yielding the most effective downstream distillation. RLTs are
prompted with both the question and solution to each problem, and tasked to
simply “connect-the-dots” with detailed explanations tailored for their
students. We train RLTs with dense rewards obtained by feeding each explanation
to the student and testing its understanding of the problem’s solution. In
practice, the raw outputs of a 7B RLT provide higher final performance on
competition and graduate-level tasks than existing distillation and
cold-starting pipelines that collect and postprocess the reasoning traces of
orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness
when training larger students and when applied zero-shot to out-of-distribution
tasks, unlocking new levels of efficiency and re-usability for the RL reasoning
framework.
[136] The Geometries of Truth Are Orthogonal Across Tasks
Waiss Azizian,Michael Kirchhof,Eugene Ndiaye,Louis Bethune,Michal Klein,Pierre Ablin,Marco Cuturi
Main category: cs.LG
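TL;DR: This study shows that the so-called “geometry of truth” in large language models (LLMs) is task-dependent and does not transfer across tasks. Linear classifiers trained on different tasks share little similarity, and under sparsity-enforcing regularization their supports are almost disjoint. More sophisticated approaches (e.g., mixtures of probes and tasks) do not overcome this limitation.
Details
Motivation: LLMs generalize well, but their reliability remains in question. Prior work reports that a “geometry of truth” can be learned from LLM activations to separate correct from incorrect answers. Whether this carries across tasks is unclear, and this paper examines that limitation.
Contribution: Shows that the “geometry of truth” is task-dependent and fails to transfer across tasks: linear classifiers trained on distinct tasks share little similarity and have almost disjoint supports, exposing a limitation of current approaches.
Method: Trains linear classifiers on LLM activation vectors from different tasks to separate correct from incorrect answers, compares the classifiers across tasks, repeats the comparison with sparsity-enforcing regularizers, and evaluates mixtures of probes and tasks.
Result: The “geometries of truth” share little similarity across tasks, and the supports of the sparse linear classifiers are almost disjoint. Mixtures of probes and tasks do not resolve the limitation.
Insight: The “geometry of truth” is task-specific, and current linear probing struggles to generalize across tasks. Future work needs to explore more flexible classification strategies or activation representations.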
Abstract: Large Language Models (LLMs) have demonstrated impressive generalization
capabilities across various tasks, but their claim to practical relevance is
still marred by concerns about their reliability. Recent works have proposed
examining the activations produced by an LLM at inference time to assess
whether its answer to a question is correct. Some works claim that a “geometry
of truth” can be learned from examples, in the sense that the activations that
generate correct answers can be distinguished from those leading to mistakes
with a linear classifier. In this work, we underline a limitation of these
approaches: we observe that these “geometries of truth” are intrinsically
task-dependent and fail to transfer across tasks. More precisely, we show that
linear classifiers trained across distinct tasks share little similarity and,
when trained with sparsity-enforcing regularizers, have almost disjoint
supports. We show that more sophisticated approaches (e.g., using mixtures of
probes and tasks) fail to overcome this limitation, likely because activation
vectors commonly used to classify answers form clearly separated clusters when
examined across tasks.
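Two simple quantities capture the comparison described above: the cosine similarity between two probes' weight vectors and the overlap of their supports (the indices with non-zero weight). The sketch below is illustrative only. The weight vectors are made up, and the Jaccard index is one possible overlap measure, not necessarily the one used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two probe weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def support(weights, tol=1e-8):
    """Indices whose weight is non-zero (up to a tolerance)."""
    return {i for i, w in enumerate(weights) if abs(w) > tol}


def support_overlap(u, v):
    """Jaccard index of the two supports: 0.0 means disjoint, 1.0 means identical."""
    su, sv = support(u), support(v)
    union = su | sv
    return len(su & sv) / len(union) if union else 1.0


# Hypothetical sparse probes trained on two different tasks.
probe_task_a = [0.9, 0.0, -0.4, 0.0]
probe_task_b = [0.0, 0.7, 0.0, 0.5]
print(cosine_similarity(probe_task_a, probe_task_b), support_overlap(probe_task_a, probe_task_b))
```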
[137] SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
Xiao Liang,Zhong-Zhi Li,Yeyun Gong,Yang Wang,Hengyuan Zhang,Yelong Shen,Ying Nian Wu,Weizhu Chen
Main category: cs.LG
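TL;DR: This paper proposes SwS, a Self-aware Weakness-driven problem Synthesis framework that improves reinforcement learning for complex reasoning in large language models (LLMs). It identifies the model's weaknesses during training and synthesizes targeted problems, yielding notable performance gains.
Details
Motivation: Existing RL methods depend on high-quality problem sets, but well-crafted human-labeled problems are scarce and synthetic datasets are not targeted, which makes training inefficient. The paper lets the model identify its own weaknesses and synthesizes problems that address them.
Contribution: Proposes the SwS framework, which systematically identifies model deficiencies and synthesizes targeted problems without external knowledge distillation, improving performance on reasoning tasks.
Method: Defines weaknesses as questions the model consistently fails to learn during RL training, extracts the core concepts from these failure cases, and synthesizes new problems for subsequent augmented training.
Result: Average gains of 10.0% and 7.7% on 7B and 32B models, respectively, across eight mainstream reasoning benchmarks.
Insight: A model can improve on its weak areas through self-identification and targeted training, which strengthens generalization and suggests a new direction for problem synthesis in RL.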
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective
for training large language models (LLMs) on complex reasoning tasks, such as
mathematical problem solving. A prerequisite for the scalability of RLVR is a
high-quality problem set with precise and verifiable answers. However, the
scarcity of well-crafted human-labeled math problems and limited-verification
answers in existing distillation-oriented synthetic datasets limit their
effectiveness in RL. Additionally, most problem synthesis strategies
indiscriminately expand the problem set without considering the model’s
capabilities, leading to low efficiency in generating useful questions. To
mitigate this issue, we introduce a Self-aware Weakness-driven problem
Synthesis framework (SwS) that systematically identifies model deficiencies and
leverages them for problem augmentation. Specifically, we define weaknesses as
questions that the model consistently fails to learn through its iterative
sampling during RL training. We then extract the core concepts from these
failure cases and synthesize new problems to strengthen the model’s weak areas
in subsequent augmented training, enabling it to focus on and gradually
overcome its weaknesses. Without relying on external knowledge distillation,
our framework enables robust generalization by empowering the model to
self-identify and address its weaknesses in RL, yielding average performance
gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning
benchmarks.
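The weakness definition above (questions the model keeps failing across sampling rounds) can be sketched as a simple filter over per-epoch success rates. The 0.5 threshold, the data layout, and the question IDs are assumptions for illustration; the abstract does not give the exact criterion.

```python
def find_weaknesses(history, max_rate=0.5):
    """Return question IDs whose success rate stayed below max_rate in every epoch.

    history: dict mapping question ID -> list of per-epoch success rates in [0, 1].
    """
    return [
        question
        for question, rates in history.items()
        if rates and all(rate < max_rate for rate in rates)
    ]


# Hypothetical sampling results over three RL epochs.
history = {
    "q1": [0.0, 0.1, 0.0],  # never learned
    "q2": [0.2, 0.6, 0.9],  # learned during training
    "q3": [0.4, 0.4, 0.3],  # consistently weak
}
print(find_weaknesses(history))
```

The returned questions would then be the source of core concepts for synthesizing new problems.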
[138] e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Setlur,Matthew Y. R. Yang,Charlie Snell,Jeremy Greer,Ian Wu,Virginia Smith,Max Simchowitz,Aviral Kumar
Main category: cs.LG
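TL;DR: The paper proposes e3, a recipe that trains LLMs to perform in-context exploration at inference time so that performance extrapolates beyond the training token budget. e3 combines three key ingredients and markedly improves a small model.
Details
Motivation: Most existing reasoning models extrapolate poorly beyond their training token budget. A method is needed that makes full use of test-time compute on hard problems.
Contribution: Proposes e3, which trains the LLM for in-context exploration, achieves extrapolation, and produces the best known model at the 1.7B scale.
Method: 1. Chain skills in which the base LLM has asymmetric competence (e.g., verification with generation) to implement in-context search; 2. use “negative” gradients from incorrect traces to amplify exploration during RL; 3. design a curriculum that couples task difficulty with the training token budget.
Result: The e3-1.7B model is the best known 1.7B model on AIME’25 and HMMT’25, extrapolates to 2x the training token budget, attains high pass@1 scores, and improves pass@k over the base model.
Insight: In-context exploration that chains asymmetric skills can improve LLM reasoning, particularly when test-time compute exceeds the training budget, and shows the potential of small models.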
Abstract: Test-time scaling offers a promising path to improve LLM reasoning by
utilizing more compute at inference time; however, the true promise of this
paradigm lies in extrapolation (i.e., improvement in performance on hard
problems as LLMs keep “thinking” for longer, beyond the maximum token budget
they were trained on). Surprisingly, we find that most existing reasoning
models do not extrapolate well. We show that one way to enable extrapolation is
by training the LLM to perform in-context exploration: training the LLM to
effectively spend its test time budget by chaining operations (such as
generation, verification, refinement, etc.), or testing multiple hypotheses
before it commits to an answer. To enable in-context exploration, we identify
three key ingredients as part of our recipe e3: (1) chaining skills that the
base LLM has asymmetric competence in, e.g., chaining verification (easy) with
generation (hard), as a way to implement in-context search; (2) leveraging
“negative” gradients from incorrect traces to amplify exploration during RL,
resulting in longer search traces that chain additional asymmetries; and (3)
coupling task difficulty with training token budget during training via a
specifically-designed curriculum to structure in-context exploration. Our
recipe e3 produces the best known 1.7B model according to AIME’25 and HMMT’25
scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not
only attains high pass@1 scores, but also improves pass@k over the base model.
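For readers unfamiliar with the pass@k metric named above, the sketch below shows the commonly used unbiased estimator computed from n samples of which c are correct. Whether the e3 authors use exactly this estimator is an assumption; the abstract only names the metric.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn per problem, c: number of correct samples, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples for one problem, 3 of them correct.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 is often read as plain accuracy.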
[139] An Adaptive Method Stabilizing Activations for Enhanced Generalization
Hyunseok Seung,Jaewoo Lee,Hyunsuk Ko
Main category: cs.LG
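TL;DR: AdaAct is a novel optimization algorithm that adjusts learning rates according to activation variance to stabilize neuron outputs and thereby improve generalization. It is evaluated on CIFAR and ImageNet.
Details
Motivation: Conventional activation regularization has limits in improving generalization, so a method that stabilizes activations dynamically during training is needed.
Contribution: Proposes AdaAct, which adapts learning rates neuron-wise, improves generalization, and bridges the convergence speed of Adam and the generalization of SGD.
Method: AdaAct adjusts learning rates according to the variance of neuron activations, tracking activation statistics during training in order to stabilize neuron outputs.
Result: On CIFAR and ImageNet, AdaAct is competitive with other state-of-the-art methods, bridging Adam's convergence speed and SGD's generalization while keeping execution times competitive. The code is open source.
Insight: Activation stability matters for generalization, and variance-based learning-rate adaptation complements conventional activation regularization while combining the strengths of different optimizers.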
Abstract: We introduce AdaAct, a novel optimization algorithm that adjusts learning
rates according to activation variance. Our method enhances the stability of
neuron outputs by incorporating neuron-wise adaptivity during the training
process, which subsequently leads to better generalization – a complementary
approach to conventional activation regularization methods. Experimental
results demonstrate AdaAct’s competitive performance across standard image
classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing
it with other state-of-the-art methods. Importantly, AdaAct effectively bridges
the gap between the convergence speed of Adam and the strong generalization
capabilities of SGD, all while maintaining competitive execution times. Code is
available at https://github.com/hseung88/adaact.
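To make the idea of neuron-wise adaptivity concrete, the sketch below scales a base learning rate per neuron by the spread of that neuron's activations over a batch. This is an illustrative stand-in, not AdaAct's update rule: the scaling `base_lr / (1 + std)` and all names are assumptions, and the actual algorithm is in the linked repository.

```python
import math


def activation_variance(activations):
    """Population variance of one neuron's activations over a batch."""
    mean = sum(activations) / len(activations)
    return sum((a - mean) ** 2 for a in activations) / len(activations)


def neuronwise_learning_rates(batch_activations, base_lr=0.1):
    """Give neurons with higher activation variance a smaller step size.

    batch_activations: list of samples, each a list of per-neuron activations.
    The 1 / (1 + std) scaling is an assumed, illustrative choice.
    """
    per_neuron = zip(*batch_activations)  # transpose: one tuple per neuron
    return [base_lr / (1.0 + math.sqrt(activation_variance(a))) for a in per_neuron]


# Hypothetical batch of three samples and two neurons.
batch = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
print(neuronwise_learning_rates(batch))
```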
[140] HSG-12M: A Large-Scale Spatial Multigraph Dataset
Xianquan Yan,Hakan Akgün,Kenji Kawaguchi,N. Duane Loh,Ching Hua Lee
Main category: cs.LG
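TL;DR: HSG-12M is the first large-scale spatial multigraph dataset. It retains multiple geometrically distinct paths between nodes as separate edges. Derived from physical spectral data, it offers rich topologies and exposes the challenges of learning from multi-edge geometry.
Details
Motivation: Existing graph datasets typically ignore spatial embedding and multiple paths, whereas real-world graphs such as physical systems need both. HSG-12M fills this gap.
Contribution: 1) Introduces HSG-12M, the first large-scale spatial multigraph dataset; 2) releases Poly2Graph, a high-performance open-source tool; 3) shows that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices.
Method: Extracts Hamiltonian spectral graphs from the energy spectra of 1-D crystals, retaining multiple geometric paths, with Poly2Graph performing the mapping.
Result: The dataset contains 11.6M static and 5.1M dynamic graphs across 1401 characteristic-polynomial classes. Benchmarks show that popular GNNs face new challenges in learning from multi-edge geometry.
Insight: Spectral graphs are not only an effective representation of physical systems but also topological fingerprints of polynomials, vectors, and matrices, linking graph learning with scientific discovery.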
Abstract: Existing graph benchmarks assume non-spatial, simple edges, collapsing
physically distinct paths into a single link. We introduce HSG-12M, the first
large-scale dataset of $\textbf{spatial multigraphs}$: graphs embedded in a
metric space where multiple geometrically distinct trajectories between two
nodes are retained as separate edges. HSG-12M contains 11.6 million static and
5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401
characteristic-polynomial classes, derived from 177 TB of spectral potential
data. Each graph encodes the full geometry of a 1-D crystal’s energy spectrum
on the complex plane, producing diverse, physics-grounded topologies that
transcend conventional node-coordinate datasets. To enable future extensions,
we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that
maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with
popular GNNs expose new challenges in learning from multi-edge geometry at
scale. Beyond its practical utility, we show that spectral graphs serve as
universal topological fingerprints of polynomials, vectors, and matrices,
forging a new algebra-to-graph link. HSG-12M lays the groundwork for
geometry-aware graph learning and new opportunities of data-driven scientific
discovery in condensed matter physics and beyond.
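The defining property of a spatial multigraph, several geometrically distinct edges between the same pair of nodes, can be shown with a small data structure. This is a generic sketch and does not reflect the HSG-12M file format or the Poly2Graph API; the class, method names, and coordinates are hypothetical.

```python
import math


def polyline_length(path):
    """Total Euclidean length of a path given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


class SpatialMultigraph:
    """Undirected graph whose parallel edges each keep their own geometry."""

    def __init__(self):
        self.positions = {}  # node -> (x, y)
        self.edges = {}      # (u, v) with u <= v -> list of paths

    def add_node(self, node, xy):
        self.positions[node] = xy

    def add_edge(self, u, v, path):
        key = tuple(sorted((u, v)))
        self.edges.setdefault(key, []).append(list(path))

    def edge_lengths(self, u, v):
        key = tuple(sorted((u, v)))
        return [polyline_length(p) for p in self.edges.get(key, [])]


g = SpatialMultigraph()
g.add_node("a", (0.0, 0.0))
g.add_node("b", (2.0, 0.0))
g.add_edge("a", "b", [(0.0, 0.0), (2.0, 0.0)])              # straight segment
g.add_edge("a", "b", [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])  # detour, kept as its own edge
print(g.edge_lengths("a", "b"))
```

A simple-graph representation would collapse both paths into one link and lose the second geometry, which is the loss the abstract describes.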
[141] Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
Simon Roschmann,Quentin Bouniot,Vasilii Feofanov,Ievgen Redko,Zeynep Akata
Main category: cs.LG
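TL;DR: The paper proposes TiViT, a framework that converts time series into images to exploit the representations of pretrained Vision Transformers (ViTs). It achieves state-of-the-art time series classification and shows that vision models can be reused in a non-visual domain.
Details
Motivation: Time series classification matters in healthcare and industry, but the development of time series foundation models (TSFMs) is limited by the scarcity of public datasets. The paper converts time series into images so that ViTs pretrained on large-scale image datasets can compensate for this shortage.
Contribution: 1) Proposes TiViT, which converts time series into images to leverage frozen pretrained ViTs; 2) analyzes the benefit of ViT 2D patching for time series; 3) reaches state-of-the-art performance on standard time series classification benchmarks; 4) shows the potential of reusing vision models in non-visual domains.
Method: 1) Convert time series into images; 2) extract features with a frozen pretrained ViT (e.g., OpenCLIP models); 3) identify intermediate layers with high intrinsic dimension as the most effective for classification; 4) combine TiViT and TSFM representations for further gains.
Result: TiViT achieves state-of-the-art performance on standard time series classification benchmarks, with intermediate-layer representations working best. Combining TiViT and TSFM representations improves performance further.
Insight: Vision Transformer representations transfer to non-visual domains such as time series classification, intermediate layers are especially important, and vision and time series representation spaces are complementary.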
Abstract: Time series classification is a fundamental task in healthcare and industry,
yet the development of time series foundation models (TSFMs) remains limited by
the scarcity of publicly available time series datasets. In this work, we
propose Time Vision Transformer (TiViT), a framework that converts time series
into images to leverage the representational power of frozen Vision
Transformers (ViTs) pretrained on large-scale image datasets. First, we
theoretically motivate our approach by analyzing the 2D patching of ViTs for
time series, showing that it can increase the number of label-relevant tokens
and reduce the sample complexity. Second, we empirically demonstrate that TiViT
achieves state-of-the-art performance on standard time series classification
benchmarks by utilizing the hidden representations of large OpenCLIP models. We
explore the structure of TiViT representations and find that intermediate
layers with high intrinsic dimension are the most effective for time series
classification. Finally, we assess the alignment between TiViT and TSFM
representation spaces and identify a strong complementarity, with further
performance gains achieved by combining their features. Our findings reveal yet
another direction for reusing vision representations in a non-visual domain.
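One simple way to turn a 1-D series into a 2-D array that a ViT could patch is to normalize it and stack fixed-length segments as rows. The sketch below illustrates that idea only; the abstract does not describe TiViT's exact transform, so the min-max scaling, zero padding, and `width` parameter are assumptions.

```python
def series_to_image(series, width):
    """Min-max scale a series to [0, 1] and stack segments of `width` values as rows."""
    lo, hi = min(series), max(series)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant series
    normalized = [(x - lo) / scale for x in series]
    pad = (-len(normalized)) % width  # zero-pad the last row if needed
    normalized += [0.0] * pad
    return [normalized[i:i + width] for i in range(0, len(normalized), width)]


# Hypothetical series of five readings arranged into rows of two values.
image = series_to_image([0, 1, 2, 3, 4], width=2)
print(image)
```

In this layout a 2-D patch covers neighbouring values within a segment as well as values one segment apart, which is one reading of why 2D patching can expose more structure per token.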
q-bio.BM
[142] Aligning Proteins and Language: A Foundation Model for Protein Retrieval
Qifeng Wu,Zhengzhe Liu,Han Zhu,Yizhou Zhao,Daisuke Kihara,Min Xu
Main category: q-bio.BM
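TL;DR: The paper proposes a CLIP-style contrastive learning framework that aligns 3D protein structures with functional annotations, enabling semantic retrieval of protein structures.
Details
Motivation: Inspired by progress in vision-language models (VLMs), the paper aims to advance protein structure-function understanding with a multimodal foundation model.
Contribution: 1. Proposes a CLIP-style framework for protein retrieval; 2. builds a large-scale dataset of approximately 200,000 protein-caption pairs.
Method: Uses contrastive learning to align 3D protein structures with functional descriptions, supporting zero-shot retrieval.
Result: Shows promising zero-shot retrieval performance on the PDB and EMDB datasets.
Insight: Multimodal foundation models show potential for protein structure-function understanding and could be extended to other biomolecular domains.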
Abstract: This paper aims to retrieve proteins with similar structures and semantics
from a large-scale protein dataset, facilitating the functional interpretation of
protein structures derived by structural determination methods like
cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of
vision-language models (VLMs), we propose a CLIP-style framework for aligning
3D protein structures with functional annotations using contrastive learning.
For model training, we propose a large-scale dataset of approximately 200,000
protein-caption pairs with rich functional descriptors. We evaluate our model
in both in-domain and more challenging cross-database retrieval on the Protein Data
Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In
both cases, our approach demonstrates promising zero-shot retrieval
performance, highlighting the potential of multimodal foundation models for
structure-function understanding in protein biology.
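The CLIP-style objective referred to above is usually a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The sketch below implements that standard form in plain Python; the temperature value and the toy two-dimensional embeddings are assumptions, and the paper's encoders and exact loss details are not given in the abstract.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(structure_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of structures matches pair i of texts."""
    s = [_normalize(v) for v in structure_embs]
    t = [_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Hypothetical embeddings: aligned pairs give a much lower loss than swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```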
cs.RO
[143] PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Liang Ma,Jiajun Wen,Min Lin,Rongtao Xu,Xiwen Liang,Bingqian Lin,Jun Ma,Yongxin Wang,Ziming Wei,Haokun Lin,Mingfei Han,Meng Cao,Bokui Chen,Ivan Laptev,Xiaodan Liang
Main category: cs.RO
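TL;DR: PhyBlock is a progressive benchmark that evaluates the physical understanding and planning abilities of vision-language models (VLMs) on 3D block assembly tasks, revealing their limitations in spatial reasoning and high-level planning.
Details
Motivation: Although VLMs show promise in reasoning and planning, their understanding of physical phenomena in structured 3D environments remains very limited. A new benchmark is needed to evaluate and improve VLMs on these tasks.
Contribution: Proposes PhyBlock, a progressive benchmark of 2600 tasks that evaluates physical understanding, spatial reasoning, and planning in 3D block assembly, with a comprehensive performance analysis of 21 state-of-the-art VLMs.
Method: PhyBlock combines a four-level cognitive hierarchy assembly task with targeted VQA samples and evaluates models along three dimensions: partial completion, failure diagnosis, and planning robustness.
Result: VLMs are markedly limited in high-level planning and spatial reasoning, and performance drops noticeably on complex tasks. Error analysis reveals persistent difficulties with spatial orientation and dependency reasoning.
Insight: Spatial tasks rely heavily on intuitive understanding, and chain-of-thought prompting brings only minimal improvement. PhyBlock offers a unified testbed for improving the physical problem-solving abilities of VLMs.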
Abstract: While vision-language models (VLMs) have demonstrated promising capabilities
in reasoning and planning for embodied agents, their ability to comprehend
physical phenomena, particularly within structured 3D environments, remains
severely limited. To close this gap, we introduce PhyBlock, a progressive
benchmark designed to assess VLMs on physical understanding and planning
through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level
cognitive hierarchy assembly task alongside targeted Visual Question Answering
(VQA) samples, collectively aimed at evaluating progressive spatial reasoning
and fundamental physical comprehension, including object properties, spatial
relationships, and holistic scene understanding. PhyBlock includes 2600 block
tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three
key dimensions: partial completion, failure diagnosis, and planning robustness.
We benchmark 21 state-of-the-art VLMs, highlighting their strengths and
limitations in physically grounded, multi-step planning. Our empirical findings
indicate that VLMs exhibit pronounced limitations in high-level planning and
reasoning, with performance declining notably as task complexity grows.
Error analysis reveals
persistent difficulties in spatial orientation and dependency reasoning.
Surprisingly, chain-of-thought prompting offers minimal improvements,
suggesting spatial tasks heavily rely on intuitive model comprehension. We
position PhyBlock as a unified testbed to advance embodied reasoning, bridging
vision-language understanding and real-world physical problem-solving.
cs.DB
[144] RADAR: Benchmarking Language Models on Imperfect Tabular Data
Ken Gu,Zhihan Zhang,Kate Lin,Yuwei Zhang,Akshay Paruchuri,Hong Yu,Mehran Kazemi,Kumar Ayush,A. Ali Heydari,Maxwell A. Xu,Girish Narayanswamy,Yun Liu,Ming-Zher Poh,Yuzhe Yang,Mark Malhotra,Shwetak Patel,Hamid Palangi,Xuhai Xu,Daniel McDuff,Tim Althoff,Xin Liu
Main category: cs.DB
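TL;DR: RADAR is a benchmark for systematically evaluating how well language models reason over tabular data that contains data artifacts. It shows that frontier models degrade when such artifacts are present.
Details
Motivation: Language models are increasingly used for data analysis, but their ability to recognize and reason about data artifacts (such as missing values and outliers) remains underexplored, and mishandling them can invalidate conclusions in real-world analyses.
Contribution: Proposes the RADAR benchmark of 2980 table-query pairs that simulates several types of data artifacts and supports targeted evaluation of model behavior, extending research on tabular reasoning.
Method: Simulates data artifacts through programmatic perturbations, builds a benchmark spanning multiple domains and artifact types, and varies table size in a controlled way to study its effect on performance.
Result: Frontier models perform well on clean tables but degrade significantly once data artifacts are introduced, exposing gaps in their data awareness.
Insight: RADAR's flexibility and extensibility make it a valuable resource for advancing tabular reasoning research, and it exposes the limits of current language models in real-world data analysis.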
Abstract: Language models (LMs) are increasingly being deployed to perform autonomous
data analyses. However, their data awareness – the ability to recognize,
reason over, and appropriately handle data artifacts such as missing values,
outliers, and logical inconsistencies – remains underexplored. These artifacts
are especially common in real-world tabular data and, if mishandled, can
significantly compromise the validity of analytical conclusions. To address
this gap, we present RADAR, a benchmark for systematically evaluating
data-aware reasoning on tabular data. We develop a framework to simulate data
artifacts via programmatic perturbations to enable targeted evaluation of model
behavior. RADAR comprises 2980 table query pairs, grounded in real-world data
spanning 9 domains and 5 data artifact types. In addition to evaluating
artifact handling, RADAR systematically varies table size to study how
reasoning performance holds up as table size increases. Our evaluation reveals
that, despite decent performance on tables without data artifacts, frontier
models degrade significantly when data artifacts are introduced, exposing
critical gaps in their capacity for robust, data-aware analysis. Designed to be
flexible and extensible, RADAR supports diverse perturbation types and
controllable table sizes, offering a valuable resource for advancing tabular
reasoning.
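A programmatic perturbation of the kind described above can be as simple as blanking a fraction of values in one column while leaving the source table untouched. The sketch below is an assumption about what such a perturbation might look like, not RADAR's implementation; the function name, column names, and rows are hypothetical.

```python
import random


def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of `rows` with `fraction` of the values in `column` set to None."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = [dict(row) for row in rows]
    count = round(fraction * len(perturbed))
    for index in rng.sample(range(len(perturbed)), count):
        perturbed[index][column] = None
    return perturbed


# Hypothetical clean table of ten daily step counts.
clean = [{"day": d, "steps": 1000 * d} for d in range(1, 11)]
noisy = inject_missing(clean, "steps", fraction=0.3)
print(sum(row["steps"] is None for row in noisy))
```

Because the clean table is preserved, the same query can be scored on both versions to isolate the effect of the artifact.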
cs.HC
[145] SakugaFlow: A Stagewise Illustration Framework Emulating the Human Drawing Process and Providing Interactive Tutoring for Novice Drawing Skills
Kazuki Kawamura,Jun Rekimoto
Main category: cs.HC
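TL;DR: SakugaFlow is a four-stage AI illustration framework that emulates the human drawing process and provides real-time interactive tutoring to help novices improve their drawing skills.
Details
Motivation: Existing AI illustration tools can generate high-quality images but do not reveal the step-by-step process that human artists follow, which limits what novices can learn from them.
Contribution: Proposes SakugaFlow, which pairs diffusion-based image generation with a large-language-model tutor to provide stagewise drawing instruction and real-time feedback.
Method: A four-stage pipeline. Each stage generates an image and gives feedback on anatomy, perspective, and composition, with support for non-linear revision and branching alternative versions.
Result: Turns a black-box generator into a learning environment that supports both skill acquisition and creative exploration.
Insight: By exposing intermediate outputs and embedding interactive tutoring, AI tools can better support skill learning.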
Abstract: While current AI illustration tools can generate high-quality images from
text prompts, they rarely reveal the step-by-step procedure that human artists
follow. We present SakugaFlow, a four-stage pipeline that pairs diffusion-based
image generation with a large-language-model tutor. At each stage, novices
receive real-time feedback on anatomy, perspective, and composition, revise any
step non-linearly, and branch alternative versions. By exposing intermediate
outputs and embedding pedagogical dialogue, SakugaFlow turns a black-box
generator into a scaffolded learning environment that supports both creative
exploration and skills acquisition.
[146] MOSAIC-F: A Framework for Enhancing Students’ Oral Presentation Skills through Personalized Feedback
Alvaro Becerra,Daniel Andres,Pablo Villegas,Roberto Daza,Ruth Cobos
Main category: cs.HC
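TL;DR: The paper proposes MOSAIC-F, a multimodal feedback framework that improves students' oral presentation skills through personalized feedback, combining human assessment with multimodal data analysis.
Details
Motivation: Conventional feedback often lacks personalization and multiple perspectives and does not fully reflect student performance. MOSAIC-F aims to close this gap by integrating multimodal data with AI.
Contribution: Proposes a framework that combines Multimodal Learning Analytics (MMLA), sensor data, and AI to generate more accurate and personalized feedback.
Method: The framework has four steps: standardized assessment, multimodal data collection, AI-generated personalized feedback, and student self-assessment with feedback visualization.
Result: The framework was tested in the context of improving oral presentation skills.
Insight: Combining multimodal data with AI can deepen and personalize educational feedback and suggests a new direction for learning analytics.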
Abstract: In this article, we present a novel multimodal feedback framework called
MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal
Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI),
and Collaborative assessments for generating personalized feedback on student
learning activities. This framework consists of four key steps. First, peers’
and professors’ assessments are conducted through standardized rubrics (that
include both quantitative and qualitative evaluations). Second, multimodal data
are collected during learning activities, including video recordings, audio
capture, gaze tracking, physiological signals (heart rate, motion data), and
behavioral interactions. Third, personalized feedback is generated using AI,
synthesizing human-based evaluations and data-based multimodal insights such as
posture, speech patterns, stress levels, and cognitive load, among others.
Finally, students review their own performance through video recordings and
engage in self-assessment and feedback visualization, comparing their own
evaluations with peers’ and professors’ assessments, class averages, and
AI-generated recommendations. By combining human-based and data-based
evaluation techniques, this framework enables more accurate, personalized and
actionable feedback. We tested MOSAIC-F in the context of improving oral
presentation skills.
cs.AI
[147] A Survey on Large Language Models for Mathematical Reasoning
Peng-Yuan Wang,Tian-Shuo Liu,Chenyang Wang,Yi-Di Wang,Shu Yan,Cheng-Xing Jia,Xu-Hui Liu,Xin-Wei Chen,Jia-Cheng Xu,Ziniu Li,Yang Yu
Main category: cs.AI
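TL;DR: This survey reviews the development of large language models (LLMs) for mathematical reasoning, organized into a comprehension phase and an answer generation phase, and discusses methods for improving reasoning along with open challenges.
Details
Motivation: Mathematical reasoning is a major challenge in AI research. LLMs have made significant advances in this area in recent years, but challenges remain in capacity, efficiency, and generalization.
Contribution: Reviews recent progress of LLMs in mathematical reasoning, from pretraining to answer generation, and proposes future research directions.
Method: Reviews methods ranging from training-free prompting to fine-tuning approaches such as supervised fine-tuning and reinforcement learning, and covers extended Chain-of-Thought reasoning and test-time scaling.
Result: Despite notable progress, LLMs remain limited in capacity, efficiency, and generalization for mathematical reasoning.
Insight: Promising directions include advanced pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms.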
Abstract: Mathematical reasoning has long represented one of the most fundamental and
challenging frontiers in artificial intelligence research. In recent years,
large language models (LLMs) have achieved significant advances in this area.
This survey examines the development of mathematical reasoning abilities in
LLMs through two high-level cognitive phases: comprehension, where models gain
mathematical understanding via diverse pretraining strategies, and answer
generation, which has progressed from direct prediction to step-by-step
Chain-of-Thought (CoT) reasoning. We review methods for enhancing mathematical
reasoning, ranging from training-free prompting to fine-tuning approaches such
as supervised fine-tuning and reinforcement learning, and discuss recent work
on extended CoT and “test-time scaling”. Despite notable progress, fundamental
challenges remain in terms of capacity, efficiency, and generalization. To
address these issues, we highlight promising research directions, including
advanced pretraining and knowledge augmentation techniques, formal reasoning
frameworks, and meta-generalization through principled learning paradigms. This
survey tries to provide some insights for researchers interested in enhancing
reasoning capabilities of LLMs and for those seeking to apply these techniques
to other domains.
[148] Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
Kongcheng Zhang,Qi Yao,Shunyu Liu,Yingjie Wang,Baisheng Lai,Jieping Ye,Mingli Song,Dacheng Tao
Main category: cs.AI
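TL;DR: The paper proposes CoVo, a self-rewarding reinforcement learning framework that improves LLM reasoning by exploiting the consistency of intermediate reasoning states across different trajectories, without external supervision.
Details
Motivation: Conventional RL for complex reasoning relies on external supervision, which limits its broader applicability. The paper addresses this by exploiting the consistency of reasoning paths that lead to correct answers.
Contribution: Proposes CoVo, a self-rewarding mechanism based on consistency and volatility that uses a vector-space aggregation strategy and a curiosity bonus to enable reasoning RL without external supervision.
Method: CoVo builds its reward from the consistency (high for correct responses) and volatility (low for correct responses) of intermediate reasoning states, and adds a curiosity bonus to promote diverse exploration.
Result: On diverse reasoning benchmarks, CoVo achieves performance comparable to or even surpassing supervised RL.
Insight: The intermediate states of correct reasoning paths tend to converge toward their own final answer with little deviation toward other candidates, a property that can drive self-supervised RL.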
Abstract: Recent advances in Reinforcement Learning (RL) have highlighted its potential
in complex reasoning tasks, yet effective training often relies on external
supervision, which limits the broader applicability. In this work, we propose a
novel self-rewarding reinforcement learning framework to enhance Large Language
Model (LLM) reasoning by leveraging the consistency of intermediate reasoning
states across different reasoning trajectories. Our key insight is that correct
responses often exhibit consistent trajectory patterns in terms of model
likelihood: their intermediate reasoning states tend to converge toward their
own final answers (high consistency) with minimal deviation toward other
candidates (low volatility). Inspired by this observation, we introduce CoVo,
an intrinsic reward mechanism that integrates Consistency and Volatility via a
robust vector-space aggregation strategy, complemented by a curiosity bonus to
promote diverse exploration. CoVo enables LLMs to perform RL in a
self-rewarding manner, offering a scalable pathway for learning to reason
without external supervision. Extensive experiments on diverse reasoning
benchmarks show that CoVo achieves performance comparable to or even surpassing
supervised RL. Our code is available at https://github.com/sastpg/CoVo.
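The consistency and volatility idea described in the abstract can be illustrated with a toy sketch. This is not the CoVo implementation: the `lik` table, the two definitions below, and the reward `consistency - volatility` are assumptions made for illustration, and the vector-space aggregation and curiosity bonus are omitted. Here `lik[t][k]` is a hypothetical likelihood of candidate answer `k` given the first `t + 1` reasoning steps of one trajectory.

```python
from typing import Sequence, Tuple


def consistency_and_volatility(
    lik: Sequence[Sequence[float]], own: int
) -> Tuple[float, float]:
    """Toy scores for one reasoning trajectory (illustrative definitions only).

    lik[t][k]: assumed likelihood of candidate answer k after step t.
    own: index of the answer this trajectory finally produced.
    """
    # A step is "on track" when the trajectory's own answer is the most likely one.
    on_track = [
        max(range(len(row)), key=row.__getitem__) == own for row in lik
    ]
    # Consistency (assumed): fraction of steps that favor the own answer.
    consistency = sum(on_track) / len(on_track)
    # Volatility (assumed): how late the trajectory last favored another candidate.
    off_steps = [t for t, ok in enumerate(on_track) if not ok]
    volatility = (off_steps[-1] + 1) / len(on_track) if off_steps else 0.0
    return consistency, volatility


def toy_reward(lik: Sequence[Sequence[float]], own: int) -> float:
    """Assumed scalar reward: high consistency and low volatility score best."""
    consistency, volatility = consistency_and_volatility(lik, own)
    return consistency - volatility


# Four reasoning steps, two candidate answers; the trajectory ends on answer 0.
lik = [[0.2, 0.5], [0.6, 0.3], [0.7, 0.2], [0.9, 0.1]]
print(toy_reward(lik, own=0))  # 0.75 - 0.25 = 0.5
```

In this example the trajectory favors the other candidate only at its first step, so it scores well; a trajectory that keeps switching between candidates late in its reasoning would receive a lower reward under these assumed definitions.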
[149] Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery
Yuni Susanti,Michael Färber
Main category: cs.AI
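TL;DR: The paper proposes a new approach that combines knowledge graphs (KGs) with large language models (LLMs) to improve knowledge-based causal discovery. It identifies informative subgraphs and refines their selection with Learning-to-Rank models, which significantly improves the stability and accuracy of causal inference.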
Details
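Motivation: Traditional causal discovery methods rely on observational data, which limits where they can be applied, while LLM-based knowledge reasoning is flexible but produces unstable results. The paper therefore combines KGs with LLMs to provide a more reliable framework for causal inference.
Contribution: 1. A causal inference framework that combines KGs with LLMs; 2. A metapath-based method for selecting and ranking subgraphs; 3. Validation on biomedical and open-domain datasets, with F1 gains of up to 44.4 points.
Method: 1. Extract informative metapath-based subgraphs from KGs; 2. Refine the subgraph selection with Learning-to-Rank models; 3. Incorporate the top-ranked subgraphs into zero-shot prompts to help LLMs infer causal relationships.
Result: On biomedical and open-domain datasets, the proposed method outperforms most baselines by up to 44.4 F1 points across diverse LLMs and KGs, supporting its stability and generalization.
Insight: Combining structured knowledge (KGs) with the reasoning ability of LLMs can markedly improve the stability and accuracy of causal discovery, offering a new approach to analyzing complex systems.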
Abstract: Inferring causal relationships between variable pairs is crucial for
understanding multivariate interactions in complex systems. Knowledge-based
causal discovery – which involves inferring causal relationships by reasoning
over the metadata of variables (e.g., names or textual context) – offers a
compelling alternative to traditional methods that rely on observational data.
However, existing methods using Large Language Models (LLMs) often produce
unstable and inconsistent results, compromising their reliability for causal
inference. To address this, we introduce a novel approach that integrates
Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery.
Our approach identifies informative metapath-based subgraphs within KGs and
further refines the selection of these subgraphs using Learning-to-Rank-based
models. The top-ranked subgraphs are then incorporated into zero-shot prompts,
improving the effectiveness of LLMs in inferring the causal relationship.
Extensive experiments on biomedical and open-domain datasets demonstrate that
our method outperforms most baselines by up to 44.4 points in F1 scores,
evaluated across diverse LLMs and KGs. Our code and datasets are available on
GitHub: https://github.com/susantiyuni/path-to-causality
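The last step of the pipeline, placing top-ranked subgraphs into a zero-shot prompt, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `build_prompt`, the prompt wording, the textual metapath format, and the scores are all hypothetical, and the scores stand in for the output of a Learning-to-Rank model that is not shown.

```python
from typing import List, Tuple


def build_prompt(
    var_a: str,
    var_b: str,
    scored_subgraphs: List[Tuple[str, float]],
    top_k: int = 2,
) -> str:
    """Build a zero-shot prompt from the top_k highest-scoring subgraphs.

    scored_subgraphs: (textual subgraph, relevance score) pairs; the scores
    are assumed to come from a separately trained ranking model.
    """
    # Keep only the top_k subgraphs, highest score first.
    ranked = sorted(scored_subgraphs, key=lambda item: item[1], reverse=True)
    context = "\n".join(f"- {text}" for text, _ in ranked[:top_k])
    return (
        "Knowledge graph context:\n"
        f"{context}\n"
        f"Question: Does {var_a} cause {var_b}? Answer yes or no."
    )


# Placeholder metapaths between hypothetical variables X and Y.
subgraphs = [
    ("X -[r1]-> M -[r2]-> Y", 0.91),
    ("X -[r3]-> N", 0.15),
    ("X -[r4]-> P -[r5]-> Y", 0.64),
]
print(build_prompt("X", "Y", subgraphs, top_k=2))
```

Limiting the context to the highest-ranked subgraphs keeps the prompt short and drops low-relevance paths, such as the `r3` path here, that might otherwise distract the model.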
[150] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang,Xiufeng Song,Heng Zhou,Yiran Qin,Jie Yang,Xiaohong Liu,Philip Torr,Lei Bai,Zhenfei Yin
Main category: cs.AI
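TL;DR: The paper introduces VIKI-Bench and VIKI-R. The former is a hierarchical benchmark for embodied multi-agent cooperation; the latter is a two-stage framework that combines a vision-language model with reinforcement learning and significantly improves multi-agent cooperation performance.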
Details
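Motivation: Current approaches to multi-agent cooperation based on vision-language models offer limited support for diverse embodiment types, so a unified testbed and method are needed to advance vision-driven multi-agent cooperation research.
Contribution: 1) Introduces VIKI-Bench, the first hierarchical benchmark, which supports three-level evaluation of multi-agent cooperation; 2) Proposes the VIKI-R framework, which combines a vision-language model with reinforcement learning to improve cooperation performance; 3) Shows that reinforcement learning enables compositional cooperation patterns to emerge among heterogeneous agents.
Method: VIKI-R has two stages: 1) fine-tune a pretrained vision-language model on Chain-of-Thought annotated demonstrations; 2) apply reinforcement learning under multi-level reward signals.
Result: Experiments show that VIKI-R significantly outperforms baseline methods at all task levels, and that reinforcement learning enables compositional cooperation among heterogeneous agents.
Insight: A multi-level benchmark, together with a framework that combines vision-language models and reinforcement learning, offers a new direction for research on vision-driven embodied multi-agent cooperation.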
Abstract: Coordinating multiple embodied agents in dynamic environments remains a core
challenge in artificial intelligence, requiring both perception-driven
reasoning and scalable cooperation strategies. While recent works have
leveraged large language models (LLMs) for multi-agent planning, a few have
begun to explore vision-language models (VLMs) for visual reasoning. However,
these VLM-based approaches remain limited in their support for diverse
embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical
benchmark tailored for embodied multi-agent cooperation, featuring three
structured levels: agent activation, task planning, and trajectory perception.
VIKI-Bench includes diverse robot embodiments, multi-view visual observations,
and structured supervision signals to evaluate reasoning grounded in visual
inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a
two-stage framework that fine-tunes a pretrained vision-language model (VLM)
using Chain-of-Thought annotated demonstrations, followed by reinforcement
learning under multi-level reward signals. Our extensive experiments show that
VIKI-R significantly outperforms baseline methods across all task levels.
Furthermore, we show that reinforcement learning enables the emergence of
compositional cooperation patterns among heterogeneous agents. Together,
VIKI-Bench and VIKI-R offer a unified testbed and method for advancing
multi-agent, visual-driven cooperation in embodied AI systems.