cs.CL

[1] Conservative Bias in Large Language Models: Measuring Relation Predictions

Toyin Aguda,Erik Wilson,Allan Anzagira,Simerjot Kaur,Charese Smiley

Main category: cs.CL

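TL;DR: The paper examines conservative bias in large language models (LLMs) on relation extraction: models tend to choose the No_Relation label rather than an incorrect label, at the cost of information loss. Across multiple experiments, the study finds conservative bias to be more common than hallucination and introduces the concept of Hobson’s choice to capture the phenomenon.

Motivation: To study the conservative bias that LLMs exhibit in relation extraction and its effect on information extraction.

Method: A systematic evaluation across multiple prompts, datasets, and relation types, using SBERT and LLM prompts to quantify the semantic similarity of conservative-bias behaviors.

Result: Conservative bias occurs twice as often as hallucination, and conservative behavior under constrained prompts is semantically similar to the labels produced under open-ended prompts.

Conclusion: Conservative bias can cause information loss, but this must be weighed against its benefit of avoiding incorrect labels. The study offers a new perspective on LLM behavior in relation extraction.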

Abstract: Large language models (LLMs) exhibit pronounced conservative bias in relation
extraction tasks, frequently defaulting to the No_Relation label when an
appropriate option is unavailable. While this behavior helps prevent incorrect
relation assignments, our analysis reveals that it also leads to significant
information loss when reasoning is not explicitly included in the output. We
systematically evaluate this trade-off across multiple prompts, datasets, and
relation types, introducing the concept of Hobson’s choice to capture scenarios
where models opt for safe but uninformative labels over hallucinated ones. Our
findings suggest that conservative bias occurs twice as often as hallucination.
To quantify this effect, we use SBERT and LLM prompts to capture the semantic
similarity between conservative bias behaviors in constrained prompts and
labels generated from semi-constrained and open-ended prompts.
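
The similarity comparison mentioned in the abstract can be illustrated with cosine similarity between embedding vectors. The following is a minimal sketch, not the authors' pipeline: it assumes the embeddings have already been computed, and the three vectors are made-up toy numbers standing in for SBERT sentence embeddings. The function name `cosine_similarity` and the variable names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
open_ended_label = [0.9, 0.1, 0.3]  # label produced by an open-ended prompt
candidate_a = [0.8, 0.2, 0.4]       # a semantically close relation label
candidate_b = [-0.1, 0.9, 0.0]      # an unrelated relation label

score_a = cosine_similarity(open_ended_label, candidate_a)
score_b = cosine_similarity(open_ended_label, candidate_b)
```

In this toy setup `score_a` is higher than `score_b`, which is the kind of signal one would use to decide which label an open-ended answer is closest to.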

[2] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu,Yinzhu Quan

Main category: cs.CL

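TL;DR: EconWebArena is a benchmark for evaluating autonomous agents on complex economic tasks in realistic web environments. It comprises 360 tasks and emphasizes authoritative data sources and multimodal reasoning.

Motivation: To fill the gap in existing work on economic reasoning tasks grounded in authoritative data sources and realistic web environments.

Method: LLMs generate candidate tasks, which are then curated by humans to ensure quality; multimodal LLMs are evaluated on the resulting tasks.

Result: Existing models show substantial performance gaps in multimodal understanding and task execution.

Conclusion: EconWebArena provides a rigorous testbed for economic web intelligence and points to directions for future improvement.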

Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on
complex, multimodal economic tasks in realistic web environments. The benchmark
comprises 360 curated tasks from 82 authoritative websites spanning domains
such as macroeconomics, labor, finance, trade, and public policy. Each task
challenges agents to navigate live websites, interpret structured and visual
content, interact with real interfaces, and extract precise, time-sensitive
data through multi-step workflows. We construct the benchmark by prompting
multiple large language models (LLMs) to generate candidate tasks, followed by
rigorous human curation to ensure clarity, feasibility, and source reliability.
Unlike prior work, EconWebArena emphasizes fidelity to authoritative data
sources and the need for grounded web-based economic reasoning. We evaluate a
diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure
cases, and conduct ablation studies to assess the impact of visual grounding,
plan-based reasoning, and interaction design. Our results reveal substantial
performance gaps and highlight persistent challenges in grounding, navigation,
and multimodal understanding, positioning EconWebArena as a rigorous testbed
for economic web intelligence.

[3] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions

Yu-Ang Lee,Guan-Ting Yi,Mei-Yi Liu,Jui-Chao Lu,Guan-Bo Yang,Yun-Nung Chen

Main category: cs.CL

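TL;DR: Advances in large language models (LLMs) and AI systems have transformed the design of complex AI workflows, and compound AI systems are increasingly capable of sophisticated tasks. The paper systematically reviews recent progress in optimizing these systems and outlines open research challenges.

Motivation: As compound AI systems grow more complex, traditional optimization methods such as SFT and RL are no longer sufficient to handle interactions between components, while new approaches based on natural language feedback show promise.

Method: Systematically reviews and classifies existing optimization methods, covering both numerical and language-based techniques, and formalizes the notion of compound AI system optimization.

Result: Summarizes progress in current optimization techniques, proposes classification dimensions, and identifies open research challenges.

Conclusion: Compound AI system optimization is a rapidly evolving field, and optimization methods for non-differentiable systems need further exploration.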

Abstract: Recent advancements in large language models (LLMs) and AI systems have led
to a paradigm shift in the design and optimization of complex AI workflows. By
integrating multiple components, compound AI systems have become increasingly
adept at performing sophisticated tasks. However, as these systems grow in
complexity, new challenges arise in optimizing not only individual components
but also their interactions. While traditional optimization methods such as
supervised fine-tuning (SFT) and reinforcement learning (RL) remain
foundational, the rise of natural language feedback introduces promising new
approaches, especially for optimizing non-differentiable systems. This paper
provides a systematic review of recent progress in optimizing compound AI
systems, encompassing both numerical and language-based techniques. We
formalize the notion of compound AI system optimization, classify existing
methods along several key dimensions, and highlight open research challenges
and future directions in this rapidly evolving field. A list of surveyed papers
is publicly available at https://github.com/MiuLab/AISysOpt-Survey.

[4] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning

Shashidhar Reddy Javaji,Yupeng Cao,Haohang Li,Yangyang Yu,Nikhil Muralidhar,Zining Zhu

Main category: cs.CL

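TL;DR: CLAIM-BENCH is a benchmark for evaluating how well large language models extract and validate claims and evidence in scientific papers. It shows that closed-source models outperform open-source ones and offers approaches for improving models’ scientific comprehension.

Motivation: To explore LLMs’ ability to handle complex logical relationships in scientific papers, particularly the links between claims and evidence.

Method: Using the CLAIM-BENCH benchmark, systematically compares three divide-and-conquer-inspired strategies across six LLMs.

Result: Closed-source models such as GPT-4 and Claude perform better on the task, and the three-pass and one-by-one prompting strategies significantly improve accuracy at increased computational cost.

Conclusion: CLAIM-BENCH sets a new standard for evaluating and improving LLMs’ scientific comprehension and lays a foundation for future systems.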

Abstract: Large language models (LLMs) are increasingly being used for complex research
tasks such as literature review, idea generation, and scientific paper
analysis, yet their ability to truly understand and process the intricate
relationships within complex research papers, such as the logical links between
claims and supporting evidence, remains largely unexplored. In this study, we
present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’
capabilities in scientific claim-evidence extraction and validation, a task
that reflects deeper comprehension of scientific argumentation. We
systematically compare three approaches inspired by divide-and-conquer
strategies across six diverse LLMs, highlighting model-specific
strengths and weaknesses in scientific comprehension. Through evaluation
involving over 300 claim-evidence pairs across multiple research domains, we
reveal significant limitations in LLMs’ ability to process complex scientific
content. Our results demonstrate that closed-source models like GPT-4 and
Claude consistently outperform open-source counterparts in precision and recall
across claim-evidence identification tasks. Furthermore, strategically designed
three-pass and one-by-one prompting approaches significantly improve LLMs’
abilities to accurately link dispersed evidence with claims, although this
comes at increased computational cost. CLAIM-BENCH sets a new standard for
evaluating scientific comprehension in LLMs, offering both a diagnostic tool
and a path forward for building systems capable of deeper, more reliable
reasoning across full-length papers.

[5] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

Wanjing Anya Ma,Michael Flor,Zuowei Wang

Main category: cs.CL

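TL;DR: The paper studies inference-making skills in reading comprehension, proposes a taxonomy of inference types, and uses GPT-4o to generate diagnostic reading comprehension items, supporting the combination of automatic generation with human review.

Motivation: Improving reading comprehension requires effective diagnostic assessment tools, and inference making is key to understanding text.

Method: Proposes a taxonomy of inference types and uses GPT-4o with few-shot prompting to generate items, comparing conditions with and without chain-of-thought prompts.

Result: 93.8% of the GPT-4o-generated questions were of good quality, but only 42.6% accurately matched the targeted inference type.

Conclusion: Combining automatic generation with human review can scale high-quality diagnostic reading comprehension assessment.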

Abstract: Inference making is an essential but complex skill in reading comprehension
(RC). Some inferences require resolving references across sentences, and some
rely on using prior knowledge to fill in the detail that is not explicitly
written in the text. Diagnostic RC questions can help educators provide more
effective and targeted reading instruction and interventions for school-age
students. We introduce a taxonomy of inference types for RC and use it to
analyze the distribution of items within a diagnostic RC item bank. Next, we
present experiments using GPT-4o to generate bridging-inference RC items for
given reading passages via few-shot prompting, comparing conditions with and
without chain-of-thought prompts. Generated items were evaluated on three
aspects: overall item quality, appropriate inference type, and LLM reasoning,
achieving high inter-rater agreements above 0.90. Our results show that GPT-4o
produced 93.8% good-quality questions suitable for operational use in grade
3-12 contexts; however, only 42.6% of the generated questions accurately
matched the targeted inference type. We conclude that combining automatic item
generation with human judgment offers a promising path toward scalable,
high-quality diagnostic RC assessments.

[6] Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency

Chenlong Wang,Yuanning Feng,Dongping Chen,Zhaoyang Chu,Ranjay Krishna,Tianyi Zhou

Main category: cs.CL

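TL;DR: NoWait reduces redundant output during reasoning by disabling explicit self-reflection tokens such as “Wait” and “Hmm”, improving efficiency without hurting model performance.

Motivation: Large reasoning models often overthink during complex reasoning, producing redundant output that hurts efficiency. The paper examines whether explicit self-reflection tokens are necessary.

Method: Proposes NoWait, which achieves efficient reasoning by suppressing self-reflection tokens such as “Wait” and “Hmm”.

Result: Across multiple benchmarks, NoWait shortens reasoning trajectories by 27%-51% without compromising model performance.

Conclusion: NoWait is a plug-and-play solution for efficient, utility-preserving multimodal reasoning.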

Abstract: Recent advances in large reasoning models have enabled complex, step-by-step
reasoning but often introduce significant overthinking, resulting in verbose
and redundant outputs that hinder efficiency. In this study, we examine whether
explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is
necessary for advanced reasoning. We propose NoWait, a simple yet effective
approach that disables explicit self-reflection by suppressing these tokens
during inference. Extensive experiments on ten benchmarks across textual,
visual, and video reasoning tasks show that NoWait reduces chain-of-thought
trajectory length by 27%-51% in five R1-style model series, without
compromising model utility. NoWait thus offers a plug-and-play solution for
efficient and utility-preserving multimodal reasoning.
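
Suppressing tokens during inference is commonly done by masking their logits before the next token is chosen. The sketch below illustrates that general idea with a toy five-word vocabulary and greedy selection; the abstract does not describe the authors' implementation, so the vocabulary, the logit values, and the helper names `suppress_tokens` and `greedy_pick` are all assumptions made for illustration.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of the logits with banned token ids set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked

def greedy_pick(logits):
    """Index of the highest logit (greedy decoding for a single step)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and one decoding step (hypothetical values).
vocab = ["Wait", "Hmm", "So", "the", "answer"]
banned_ids = {vocab.index("Wait"), vocab.index("Hmm")}
step_logits = [3.2, 2.9, 2.5, 0.4, 1.1]

unmasked_choice = vocab[greedy_pick(step_logits)]
masked_choice = vocab[greedy_pick(suppress_tokens(step_logits, banned_ids))]
```

Without masking, the step would emit "Wait"; with the two reflection tokens masked, the next-best token "So" is chosen instead, so generation continues without the reflective detour.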

[7] Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu

Main category: cs.CL

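TL;DR: This position paper argues that text embedding research should move beyond surface meaning and treat implicit semantics as a central modeling goal, recommending more diverse data and improved evaluation benchmarks.

Motivation: Current text embedding models focus mainly on surface semantics and neglect implicit semantics (such as pragmatics, speaker intent, and sociocultural context), so they perform poorly on tasks that require deeper understanding.

Method: Uses a pilot study to show the limitations of existing models on implicit semantics tasks and proposes a shift in research paradigm, including improved training data and evaluation benchmarks.

Result: The pilot study shows that even state-of-the-art models perform only marginally better than simple baselines on implicit semantics tasks, highlighting the research gap.

Conclusion: Calls for implicit semantics to become a core modeling goal and recommends data diversification and benchmark improvements to better reflect the complexity of real-world language.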

Abstract: This position paper argues that the text embedding research community should
move beyond surface meaning and embrace implicit semantics as a central
modeling goal. Text embedding models have become foundational in modern NLP,
powering a wide range of applications and drawing increasing research
attention. Yet, much of this progress remains narrowly focused on surface-level
semantics. In contrast, linguistic theory emphasizes that meaning is often
implicit, shaped by pragmatics, speaker intent, and sociocultural context.
Current embedding models are typically trained on data that lacks such depth
and evaluated on benchmarks that reward the capture of surface meaning. As a
result, they struggle with tasks requiring interpretive reasoning, speaker
stance, or social meaning. Our pilot study highlights this gap, showing that
even state-of-the-art models perform only marginally better than simplistic
baselines on implicit semantics tasks. To address this, we call for a paradigm
shift: embedding research should prioritize more diverse and linguistically
grounded training data, design benchmarks that evaluate deeper semantic
understanding, and explicitly frame implicit meaning as a core modeling
objective, better aligning embeddings with real-world language complexity.

[8] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs

Jash Rajesh Parekh,Pengcheng Jiang,Jiawei Han

Main category: cs.CL

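TL;DR: CC-RAG augments the RAG pipeline with zero-shot triple extraction and theme-aware graph chaining to enable structured multi-hop reasoning, significantly improving the accuracy of LLMs’ causal reasoning in specialized domains.

Motivation: To address LLMs’ weaknesses in causal reasoning in specialized domains by improving the RAG pipeline to better model causal dependencies.

Method: Develops CC-RAG, which combines zero-shot triple extraction and theme-aware graph chaining to build a DAG and performs structured reasoning via forward and backward chaining.

Result: In experiments on two real-world domains, Bitcoin price fluctuations and Gaucher disease, CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity.

Conclusion: Explicitly modeling causal structure significantly improves the accuracy and interpretability of LLM-generated answers in specialized domains.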

Abstract: Understanding cause-and-effect relationships remains a formidable challenge
for Large Language Models (LLMs), particularly in specialized domains where
reasoning requires more than surface-level correlations. Retrieval-Augmented
Generation (RAG) improves factual accuracy, but standard RAG pipelines treat
evidence as flat context, lacking the structure required to model true causal
dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that
integrates zero-shot triple extraction and theme-aware graph chaining into the
RAG pipeline, enabling structured multi-hop inference. Given a domain-specific
corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of <cause, relation,
effect> triples and uses forward/backward chaining to guide structured answer
generation. Experiments on two real-world domains, Bitcoin price fluctuations
and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot
LLMs in chain similarity, information density, and lexical diversity. Both
LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results
demonstrate that explicitly modeling causal structure enables LLMs to generate
more accurate and interpretable responses, especially in specialized domains
where flat retrieval fails.
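
The graph structure described above can be sketched as follows: <cause, relation, effect> triples are stored as a directed graph, and chains are enumerated forward (from a cause to its downstream effects) or backward (from an effect to its upstream causes). This is an illustrative sketch only. The triples use placeholder node names rather than extracted facts, the helper names `build_graph` and `chains` are hypothetical, and the abstract does not specify how CC-RAG selects or ranks chains.

```python
from collections import defaultdict

def build_graph(triples):
    """Index (cause, relation, effect) triples in both directions."""
    forward = defaultdict(list)   # cause  -> [(relation, effect), ...]
    backward = defaultdict(list)  # effect -> [(relation, cause), ...]
    for cause, relation, effect in triples:
        forward[cause].append((relation, effect))
        backward[effect].append((relation, cause))
    return forward, backward

def chains(adjacency, start):
    """Enumerate every maximal path from `start`; assumes the graph is acyclic."""
    neighbours = adjacency.get(start, [])
    if not neighbours:
        return [[start]]
    paths = []
    for _relation, next_node in neighbours:
        for tail in chains(adjacency, next_node):
            paths.append([start] + tail)
    return paths

# Placeholder triples (hypothetical; a real system would extract these from a corpus).
triples = [
    ("A", "causes", "B"),
    ("B", "causes", "C"),
    ("A", "causes", "D"),
]
forward, backward = build_graph(triples)

forward_chains = chains(forward, "A")    # effects reachable from cause "A"
backward_chains = chains(backward, "C")  # causes leading up to effect "C"
```

Forward chaining from "A" yields the paths A, B, C and A, D, while backward chaining from "C" recovers C, B, A. Such chains could then be passed to the generator as structured context instead of flat retrieved passages.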

[9] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Luel Hagos Beyene,Vivek Verma,Min Ma,Jesujoba O. Alabi,Fabian David Schmidt,Joyce Nakatumba-Nabende,David Ifeoluwa Adelani

Main category: cs.CL

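TL;DR: The paper introduces mSTEB, a benchmark for evaluating LLMs on multilingual and multimodal tasks, and reveals a performance gap between high-resource and low-resource languages.

Motivation: Existing LLM evaluations focus mainly on English and high-resource languages, and there is no standardized evaluation for low-resource languages.

Method: Introduces the mSTEB benchmark, covering a range of tasks (such as language identification and translation) in both speech and text modalities, and evaluates several leading and open-source LLMs.

Result: The evaluation shows a wide performance gap between high-resource and low-resource languages, especially languages spoken in Africa and the Americas/Oceania.

Conclusion: More investment is needed to address the under-representation of low-resource languages in LLM coverage.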

Abstract: Large language models (LLMs) have demonstrated impressive performance on a
wide range of tasks, including in multimodal settings such as speech. However,
their evaluation is often limited to English and a few high-resource languages.
For low-resource languages, there is no standardized evaluation benchmark. In
this paper, we address this gap by introducing mSTEB, a new benchmark to
evaluate the performance of LLMs on a wide range of tasks covering language
identification, text classification, question answering, and translation tasks
on both speech and text modalities. We evaluated the performance of leading
LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open
models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in
performance between high-resource and low-resource languages, especially for
languages spoken in Africa and the Americas/Oceania. Our findings show that more
investment is needed to address their under-representation in LLM coverage.

[10] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Weiya Li,Junjie Chen,Bei Li,Boyang Liu,Zichen Wen,Nuanqiao Shan,Xiaoqian Liu,Anping Liu,Huajie Liu,Youyan Wang,Wujiuge Yin,Hu Song,Bing Huang,Zhiyuan Xia,Jialiang Chen,Linfeng Zhang

Main category: cs.CL

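TL;DR: TACTIC is a multi-agent translation framework grounded in cognitive theory. It improves translation quality by simulating the cognitive strategies of human translators, and experiments show it outperforms existing models on multiple benchmarks.

Motivation: Existing multi-agent translation frameworks overlook key insights from cognitive translation studies and do not fully realize the translation potential of LLMs.

Method: Proposes the TACTIC framework, comprising six functionally distinct agents that mirror cognitive processes in human translation, such as drafting, refinement, and evaluation, to decompose translation tasks and collaborate on high-quality output.

Result: On the FLORES-200 and WMT24 benchmarks, TACTIC outperforms GPT-4.1 and DeepSeek-R1 on average, with notable gains in XCOMET and COMETKIWI-23 scores.

Conclusion: Through cognitive-theory-driven agent collaboration, TACTIC significantly improves machine translation quality and achieves state-of-the-art performance.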

Abstract: Machine translation has long been a central task in natural language
processing. With the rapid advancement of large language models (LLMs), there
has been remarkable progress in translation quality. However, fully realizing
the translation potential of LLMs remains an open challenge. Recent studies
have explored multi-agent systems to decompose complex translation tasks into
collaborative subtasks, showing initial promise in enhancing translation
quality through agent cooperation and specialization. Nevertheless, existing
multi-agent translation frameworks largely neglect foundational insights from
cognitive translation studies. These insights emphasize how human translators
employ different cognitive strategies, such as balancing literal and free
translation, refining expressions based on context, and iteratively evaluating
outputs. To address this limitation, we propose a cognitively informed
multi-agent framework called TACTIC, which stands for Translation Agents with
Cognitive-Theoretic Interactive Collaboration. The framework comprises six
functionally distinct agents that mirror key cognitive processes observed in
human translation behavior. These include agents for drafting, refinement,
evaluation, scoring, context reasoning, and external knowledge gathering. By
simulating an interactive and theory-grounded translation workflow, TACTIC
effectively leverages the full capacity of LLMs for high-quality translation.
Experimental results on diverse language pairs from the FLORES-200 and WMT24
benchmarks show that our method consistently achieves state-of-the-art
performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by
an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it
further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at
https://github.com/weiyali126/TACTIC.

[11] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Ziyang Ma,Qingyue Yuan,Zhenglin Wang,Deyu Zhou

Main category: cs.CL

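TL;DR: The paper proposes the AutoMeco framework and the MIRA strategy for evaluating and improving the meta-cognitive abilities of large language models (LLMs), and the results demonstrate their effectiveness.

Motivation: Existing research focuses mainly on LLMs’ ability to detect cognitive errors but overlooks their meta-cognitive abilities (such as self-awareness of step errors), which are crucial for model reliability.

Method: Proposes the AutoMeco framework to benchmark existing meta-cognition measures and designs the training-free MIRA strategy to improve them.

Result: Experiments on three mathematical reasoning datasets and three LLMs support the reasonableness of AutoMeco through comparison with Best-of-N verification and show that MIRA enables a more accurate evaluation of LLMs’ meta-cognitive abilities.

Conclusion: AutoMeco and MIRA provide effective tools for evaluating and improving LLM meta-cognition, helping improve model reliability.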

Abstract: Previous research has primarily focused on the cognitive error detection
capabilities of Large Language Models (LLMs), often prompting them to analyze
mistakes in reasoning chains. However, few studies have examined the
meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors),
which are crucial for their reliability. While studies on LLM self-evaluation
present some measures, such as perplexity, which can reflect the answer
correctness and be viewed as the lens of meta-cognition, they lack step-level
analysis and adaptation. This paper studies the evaluation of LLM
meta-cognition using the current lenses and how to improve these lenses.
Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation
framework for benchmarking the existing lenses. Furthermore, a training-free
Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost
current meta-cognition lenses. Experimental results on three mathematical
reasoning datasets and three LLMs show the reasonableness of AutoMeco by
comparing it with Best-of-N verification. Moreover, the meta-cognition ability
of LLMs can be better evaluated using MIRA.

[12] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Jiaxiang Liu,Boxuan Xing,Chenhao Yuan,Chenxiang Zhang,Di Wu,Xiusheng Huang,Haida Yu,Chuhan Lang,Pengfei Cao,Jun Zhao,Kang Liu

Main category: cs.CL

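TL;DR: The paper presents Know-MRI, an open-source tool for systematically analyzing the knowledge mechanisms of large language models, addressing the inconsistent inputs and outputs of existing methods.

Motivation: Current LLM interpretation methods differ in input formats and outputs, which limits their practical application and calls for a systematic solution.

Method: Develops Know-MRI, which includes an extensible core module that automatically matches input data with interpretation methods and consolidates their outputs.

Result: The tool lets users freely choose interpretation methods based on the input data and comprehensively diagnose a model’s knowledge mechanisms.

Conclusion: Know-MRI offers a flexible and comprehensive solution for analyzing LLM knowledge mechanisms and advances research on model interpretability.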

Abstract: As large language models (LLMs) continue to advance, there is a growing
urgency to enhance the interpretability of their internal knowledge mechanisms.
Consequently, many interpretation methods have emerged, aiming to unravel the
knowledge mechanisms of LLMs from various perspectives. However, current
interpretation methods differ in input data formats and interpreting outputs.
The tools integrating these methods are only capable of supporting tasks with
specific inputs, significantly constraining their practical applications. To
address these challenges, we present an open-source Knowledge Mechanisms
Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms
within LLMs systematically. Specifically, we have developed an extensible core
module that can automatically match different input data with interpretation
methods and consolidate the interpreting outputs. It enables users to freely
choose appropriate interpretation methods based on the inputs, making it easier
to comprehensively diagnose the model’s internal knowledge mechanisms from
multiple perspectives. Our code is available at
https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on
https://youtu.be/NVWZABJ43Bs.

[13] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Ziqi. Liu,Ziyang. Zhou,Mingxuan. Hu

Main category: cs.CL

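TL;DR: The paper proposes CAF-I, a multi-agent framework that addresses the limitations of existing LLMs in irony detection, namely single-perspective analysis, insufficient understanding, and lack of interpretability. Through multidimensional analysis and collaborative optimization, CAF-I significantly improves detection performance.

Motivation: Existing LLM-based irony detection methods suffer from single-perspective limitations, insufficient understanding, and lack of interpretability. The paper aims to address these challenges with the multi-agent collaborative framework CAF-I.

Method: CAF-I is a multi-agent system in which Context, Semantics, and Rhetoric agents perform multidimensional analysis and refine it through interactive collaboration. A Decision Agent consolidates these perspectives, and a Refinement Evaluator Agent provides feedback for optimization.

Result: Experiments on benchmark datasets show that CAF-I achieves state-of-the-art zero-shot performance, with an average Macro-F1 of 76.31, 4.98 points above the strongest prior baseline.

Conclusion: By simulating human-like multi-perspective analysis, CAF-I significantly improves the accuracy and interpretability of irony detection, validating its effectiveness in addressing existing problems.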

Abstract: Large language models (LLMs) have become mainstream methods in the field of
sarcasm detection. However, existing LLM methods face challenges in irony
detection, including: 1. single-perspective limitations, 2. insufficient
comprehensive understanding, and 3. lack of interpretability. This paper
introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven
multi-agent system designed to overcome these issues. CAF-I employs specialized
agents for Context, Semantics, and Rhetoric, which perform multidimensional
analysis and engage in interactive collaborative optimization. A Decision Agent
then consolidates these perspectives, with a Refinement Evaluator Agent
providing conditional feedback for optimization. Experiments on benchmark
datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving
SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of
76.31, a 4.98 absolute improvement over the strongest prior baseline. This
success is attained by its effective simulation of human-like multi-perspective
analysis, enhancing detection accuracy and interpretability.
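
Macro-F1, the metric reported above, is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many examples it has. A minimal sketch of the computation follows; the label names and the four toy predictions are made up for illustration, and the function returns a fraction in [0, 1], whereas the abstract's figure appears to be on a 0 to 100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0.0:
            f1_scores.append(0.0)  # class never predicted correctly
        else:
            f1_scores.append(2 * precision * recall / (precision + recall))
    return sum(f1_scores) / len(f1_scores)

# Toy gold labels and predictions (hypothetical).
y_true = ["ironic", "ironic", "literal", "literal"]
y_pred = ["ironic", "literal", "literal", "literal"]
score = macro_f1(y_true, y_pred)
```

Here the "ironic" class has F1 = 2/3 and the "literal" class has F1 = 4/5, so the macro average is 11/15, roughly 0.733.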

[14] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning

Fengjun Pan,Anh Tuan Luu,Xiaobao Wu

Main category: cs.CL

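TL;DR: U-CoT+ is a new framework that converts visual memes into textual descriptions and combines them with human-crafted guidelines to achieve efficient, flexible, and explainable harmful meme detection.

Motivation: Current harmful meme detection methods fall short in resource efficiency, flexibility, and explainability, making them hard to deploy in content moderation systems.

Method: Develops a meme-to-text pipeline that converts memes into textual descriptions, then incorporates human-crafted guidelines to guide model reasoning under zero-shot CoT prompting.

Result: Experiments on seven benchmark datasets validate the framework’s effectiveness and show its potential for low-resource, explainable detection.

Conclusion: U-CoT+ detects harmful memes efficiently and flexibly, adapts to changing requirements across platforms and regions, and offers a new approach to content moderation.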

Abstract: Detecting harmful memes is essential for maintaining the integrity of online
environments. However, current approaches often struggle with resource
efficiency, flexibility, or explainability, limiting their practical deployment
in content moderation systems. To address these challenges, we introduce
U-CoT+, a novel framework for harmful meme detection. Instead of relying solely
on prompting or fine-tuning multimodal models, we first develop a high-fidelity
meme-to-text pipeline that converts visual memes into detail-preserving textual
descriptions. This design decouples meme interpretation from meme
classification, thus avoiding immediate reasoning over complex raw visual
content and enabling resource-efficient harmful meme detection with general
large language models (LLMs). Building on these textual descriptions, we
further incorporate targeted, interpretable human-crafted guidelines to guide
models’ reasoning under zero-shot CoT prompting. As such, this framework allows
for easy adaptation to different harmfulness detection criteria across
platforms, regions, and over time, offering high flexibility and
explainability. Extensive experiments on seven benchmark datasets validate the
effectiveness of our framework, highlighting its potential for explainable and
low-resource harmful meme detection using small-scale LLMs. Codes and data are
available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.

[15] CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations

Divyaksh Shukla,Ritesh Baviskar,Dwijesh Gohil,Aniket Tiwari,Atul Shree,Ashutosh Modi

Main category: cs.CL

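TL;DR: CoMuMDR is a multi-domain, multi-modal (audio and text), code-mixed (Hindi and English) corpus for discourse parsing in conversations, highlighting the challenges this setting poses for current models.

Motivation: Existing discourse parsing datasets are usually limited to written English dialogues in a single domain and cannot support realistic multi-domain, code-mixed scenarios.

Method: Introduces the CoMuMDR corpus, which contains Hindi-English code-mixed data annotated with nine discourse relations, and evaluates several state-of-the-art baseline models.

Result: State-of-the-art baselines perform poorly on CoMuMDR, indicating that the complexity of multi-domain, code-mixed data challenges existing models.

Conclusion: Better models are needed for multi-domain, code-mixed discourse parsing.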

Abstract: Discourse parsing is an important task useful for NLU applications such as
summarization, machine comprehension, and emotion recognition. The current
discourse parsing datasets based on conversations consist of written English
dialogues restricted to a single domain. In this resource paper, we introduce
CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in
conversations. The corpus (code-mixed in Hindi and English) has both audio and
transcribed text and is annotated with nine discourse relations. We experiment
with various SoTA baseline models; the poor performance of SoTA models
highlights the challenges of multi-domain code-mixed corpora, pointing towards
the need for developing better models for such realistic settings.

[16] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Xinyuan Wang,Dongjie Wang,Wangyang Ying,Haoyue Bai,Nanxu Gong,Sixun Dong,Kunpeng Liu,Yanjie Fu

Main category: cs.CL

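TL;DR: The paper proposes a lightweight post-training framework that refines latent reasoning trajectories through contrastive reasoning feedback and residual embedding refinement, significantly improving accuracy on reasoning tasks.

Motivation: To address limitations of the reasoning process in large language models, in particular how to effectively update reasoning embeddings during post-training to improve accuracy.

Method: Refines latent reasoning trajectories with two strategies: contrastive reasoning feedback and residual embedding refinement.

Result: The framework’s effectiveness is validated on five reasoning benchmarks, including a 5% accuracy gain on MathQA.

Conclusion: The proposed framework significantly improves performance on reasoning tasks without additional training cost.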

Abstract: Reasoning is a key component of language understanding in Large Language
Models. While Chain-of-Thought prompting enhances performance via explicit
intermediate steps, it suffers from substantial token overhead and a fixed
reasoning trajectory, preventing step-wise refinement. Recent advances in
latent reasoning address these limitations by refining internal reasoning
processes directly in the model’s latent space, without producing explicit
outputs. However, a key challenge remains: how to effectively update reasoning
embeddings during post-training to guide the model toward more accurate
solutions. To overcome this challenge, we propose a lightweight post-training
framework that refines latent reasoning trajectories using two novel
strategies: 1) Contrastive reasoning feedback, which compares reasoning
embeddings against strong and weak baselines to infer effective update
directions via embedding enhancement; 2) Residual embedding refinement, which
stabilizes updates by progressively integrating current and historical
gradients, enabling fast yet controlled convergence. Extensive experiments and
case studies are conducted on five reasoning benchmarks to demonstrate the
effectiveness of the proposed framework. Notably, the framework achieves a 5%
accuracy gain on MathQA without additional training.

[17] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Minhae Oh,Jeonghye Kim,Nakyung Lee,Donggeon Seo,Taeuk Kim,Jungwoo Lee

Main category: cs.CL

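TL;DR: RAISE is a three-step retrieval-augmented framework that improves scientific reasoning by retrieving logically relevant documents step by step.

Motivation: To address challenges in scientific reasoning: long-chain reasoning, knowledge of domain-specific terminology, and adaptation to updated findings.

Method: Three steps: problem decomposition, logical query generation, and logical retrieval.

Result: Outperforms other baselines on scientific reasoning benchmarks.

Conclusion: RAISE retrieves documents that are not only similar in domain knowledge but also more logically relevant.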

Abstract: Scientific reasoning requires not only long-chain reasoning processes, but
also knowledge of domain-specific terminologies and adaptation to updated
findings. To deal with these challenges for scientific reasoning, we introduce
RAISE, a step-by-step retrieval-augmented framework which retrieves logically
relevant documents from an in-the-wild corpus. RAISE is divided into three steps:
problem decomposition, logical query generation, and logical retrieval. We
observe that RAISE consistently outperforms other baselines on scientific
reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE
retrieves documents that are not only similar in domain knowledge but also
more logically relevant.

[18] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models

Son The Nguyen,Theja Tulabandhula

Main category: cs.CL

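TL;DR: MEMETRON is a task-agnostic framework that treats LLM decoding as a discrete black-box optimization problem and uses hybrid metaheuristic algorithms (GENETRON and ANNETRON) to search the response space, significantly outperforming standard decoding and reranking methods.

Motivation: Existing LLM decoding strategies (such as greedy search or sampling) do not explicitly optimize for task-specific objectives and offer limited control.

Method: MEMETRON uses hybrid metaheuristic algorithms (GENETRON and ANNETRON) together with reward models to efficiently discover high-reward responses without model retraining or gradient access.

Result: On the human preference alignment task, MEMETRON significantly outperforms standard decoding and reranking methods.

Conclusion: MEMETRON shows the potential to improve alignment without retraining the model and is a modular, general framework.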

Abstract: Large language models (LLMs) are increasingly used for both open-ended and
structured tasks, yet their inference-time behavior is still largely dictated
by heuristic decoding strategies such as greedy search, sampling, or reranking.
These methods provide limited control and do not explicitly optimize for
task-specific objectives. We introduce MEMETRON, a task-agnostic framework that
formulates LLM decoding as a discrete black-box optimization problem. MEMETRON
leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the
response space, guided by reward models and contextual operations performed by
the LLM itself. This approach enables efficient discovery of high-reward
responses without requiring model retraining or gradient access. The framework
is modular and generalizes across diverse tasks, requiring only a reward
function and lightweight prompt templates. We evaluate our framework on the
critical human preference alignment task and demonstrate that it significantly
outperforms standard decoding and reranking methods, highlighting its potential
to improve alignment without model retraining.
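
Treating decoding as black-box optimization means searching over whole responses using only a reward function, with no gradients. The sketch below shows a generic genetic search of that kind and is not the GENETRON algorithm itself: in the paper the variation operators are contextual operations performed by the LLM, whereas here simple word-level crossover and mutation stand in for them. The reward function, vocabulary, candidate responses, and all function names are hypothetical.

```python
import random

def toy_reward(response):
    """Hypothetical reward: number of distinct target words present."""
    targets = {"clear", "polite", "concise"}
    return len(targets & set(response.split()))

def crossover(a, b, rng):
    """Join a prefix of one response with a suffix of another."""
    words_a, words_b = a.split(), b.split()
    cut_a = rng.randint(0, len(words_a))
    cut_b = rng.randint(0, len(words_b))
    return " ".join(words_a[:cut_a] + words_b[cut_b:])

def mutate(response, vocabulary, rng):
    """Replace one randomly chosen word with a word from the vocabulary."""
    words = response.split()
    if not words:
        return rng.choice(vocabulary)
    words[rng.randrange(len(words))] = rng.choice(vocabulary)
    return " ".join(words)

def genetic_search(population, reward, vocabulary, generations=20, seed=0):
    """Elitist genetic search: keep the best half, refill with mutated offspring."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(generations):
        ranked = sorted(population, key=reward, reverse=True)
        elite = ranked[: max(2, len(ranked) // 2)]
        children = []
        while len(elite) + len(children) < len(population):
            parent_a, parent_b = rng.sample(elite, 2)
            child = crossover(parent_a, parent_b, rng)
            children.append(mutate(child, vocabulary, rng))
        population = elite + children
    return max(population, key=reward)

# Toy starting candidates (hypothetical; a real system would sample them from an LLM).
initial = ["a clear reply", "a polite reply", "some concise text", "nothing useful here"]
vocabulary = ["clear", "polite", "concise", "reply"]
best = genetic_search(initial, toy_reward, vocabulary)
```

Because the best candidates are carried over unchanged each generation, the reward of the returned response can never be lower than that of the best starting candidate. Swapping `toy_reward` for a learned reward model is the only change needed to target a different objective.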

[19] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Mingyu Zheng,Zhifan Feng,Jia Wang,Lanrui Wang,Zheng Lin,Yang Hao,Weiping Wang

Main category: cs.CL

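TL;DR: TableDreamer proposes a progressive, weakness-guided data synthesis framework for generating table instruction tuning data, addressing the shortcomings of existing methods in data diversity and efficiency.

Motivation: Existing LLM-based data synthesis methods have two main problems when generating table instruction tuning data: they cannot fully explore the input space, which limits data diversity, and they blindly pursue data quantity while ignoring the target LLM’s weaknesses in table understanding.

Method: TableDreamer synthesizes diverse tables and instructions as seed data, then iteratively explores the input space under the guidance of newly identified weakness data, which ultimately serves as the training data for fine-tuning the target LLM.

Result: On 10 tabular benchmarks, TableDreamer raises the average accuracy of Llama3.1-8B-instruct from 49.07% to 60.69%, outperforming existing methods that use more training data.

Conclusion: Through its progressive, weakness-guided data synthesis strategy, TableDreamer significantly improves table instruction tuning and provides an efficient data generation framework for related tasks.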

Abstract: Despite the commendable progress of recent LLM-based data synthesis methods,
they face two limitations in generating table instruction tuning data. First,
they cannot thoroughly explore the vast input space of table understanding
tasks, leading to limited data diversity. Second, they ignore the weaknesses in
table understanding ability of the target LLM and blindly pursue the increase
of data quantity, resulting in suboptimal data efficiency. In this paper, we
introduce a progressive and weakness-guided data synthesis framework tailored
for table instruction tuning, named TableDreamer, to mitigate the above issues.
Specifically, we first synthesize diverse tables and related instructions as
seed data, and then perform an iterative exploration of the input space under
the guidance of the newly identified weakness data, which eventually serve as
the final training data for fine-tuning the target LLM. Extensive experiments
on 10 tabular benchmarks demonstrate the effectiveness of the proposed
framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62
percentage points (49.07% to 60.69%) with 27K GPT-4o synthetic data and
outperforms state-of-the-art data synthesis baselines which use more training
data. The code and data are available at
https://github.com/SpursGoZmy/TableDreamer

[20] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu,Jiaqi Li,Zilong Zheng

Main category: cs.CL

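TL;DR: RuleReasoner is a reinforcement learning method enhanced with dynamic sampling that performs rule-based reasoning efficiently and outperforms large reasoning models on both in-distribution and out-of-distribution tasks.

Details Motivation: To explore whether small reasoning models can learn rule-based reasoning effectively through reinforcement learning, with robust generalization across tasks and domains.

Method: Introduces RuleReasoner, which performs reinforcement learning with a domain-aware dynamic sampling strategy and a diverse collection of tasks.

Result: On in-distribution (ID) and out-of-distribution (OOD) tasks, RuleReasoner significantly outperforms frontier large reasoning models (an average gain of 4.1% on ID tasks and 10.4% on OOD tasks), with higher computational efficiency than prior dynamic sampling methods.

Conclusion: RuleReasoner shows that small reasoning models can achieve effective rule-based reasoning through dynamic sampling and reinforcement learning, with better generalization and computational efficiency.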

Abstract: Rule-based reasoning has been acknowledged as one of the fundamental problems
in reasoning, while deviations in rule formats, types, and complexity in
real-world applications pose severe challenges. Recent studies have shown that
large reasoning models (LRMs) have remarkable reasoning capabilities, and their
performance is substantially enhanced by reinforcement learning (RL). However,
it remains an open question whether small reasoning models (SRMs) can learn
rule-based reasoning effectively with robust generalization across diverse
tasks and domains. To address this, we introduce Reinforced Rule-based
Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct
rule-based reasoning via a wide collection of curated tasks and a novel
domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples
each training batch by updating the sampling weights of different domains based
on historical rewards. This facilitates domain augmentation and flexible online
learning schedules for RL, obviating the need for pre-hoc human-engineered
mix-training recipes used in existing methods. Empirical evaluations on
in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that
RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1%
average points on eight ID tasks and $\Delta$10.4% average points on three OOD
tasks over OpenAI-o1). Notably, our approach also exhibits higher computational
efficiency compared to prior dynamic sampling methods for RL.
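
The abstract says only that RuleReasoner updates per-domain sampling weights from historical rewards. A minimal sketch of one way this could work follows. The exponential moving average, the softmax, the choice to favor low-reward domains, and the names `alpha` and `temperature` are all assumptions made for illustration, not details from the paper.

```python
import math
import random


def update_domain_weights(avg_reward, batch_rewards, alpha=0.5, temperature=0.5):
    """Update per-domain reward averages in place and return new sampling weights.

    avg_reward:    dict mapping domain -> running average reward in [0, 1]
    batch_rewards: dict mapping domain -> list of rewards from the last batch
    alpha, temperature: hypothetical hyperparameters, not taken from the paper
    """
    for domain, rewards in batch_rewards.items():
        if rewards:
            batch_mean = sum(rewards) / len(rewards)
            # Exponential moving average over the reward history.
            avg_reward[domain] = (1 - alpha) * avg_reward[domain] + alpha * batch_mean

    # Assumption: domains with lower historical reward receive more weight.
    scores = {d: (1.0 - r) / temperature for d, r in avg_reward.items()}
    max_score = max(scores.values())
    exp_scores = {d: math.exp(s - max_score) for d, s in scores.items()}
    total = sum(exp_scores.values())
    return {d: v / total for d, v in exp_scores.items()}


def sample_batch(pools, weights, batch_size, rng):
    """Draw a batch by sampling a domain per slot, then an example from that domain."""
    domains = list(pools)
    probs = [weights[d] for d in domains]
    chosen = rng.choices(domains, weights=probs, k=batch_size)
    return [(d, rng.choice(pools[d])) for d in chosen]


avg_reward = {"logic": 0.5, "arithmetic": 0.5}
weights = update_domain_weights(
    avg_reward, {"logic": [1.0, 1.0], "arithmetic": [0.0, 0.0]}
)
pools = {"logic": ["l1", "l2"], "arithmetic": ["a1", "a2"]}
batch = sample_batch(pools, weights, batch_size=4, rng=random.Random(0))
```

In this toy run the "arithmetic" domain earned lower rewards, so it ends up with the larger sampling weight for the next batch. A real schedule could just as well use a different rule.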

[21] Brevity is the soul of sustainability: Characterizing LLM response lengths

Soham Poddar,Paramita Koley,Janardan Misra,Sanjay Podder,Navveen Balani,Niloy Ganguly,Saptarshi Ghosh

Main category: cs.CL

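TL;DR: The paper studies the energy consumed by large language model (LLM) inference and proposes reducing response length through output compression and prompt-engineering strategies, achieving energy savings of 25-60%.

Details Motivation: LLM inference consumes substantial energy, yet output compression has received little attention among existing optimization methods, so energy-efficient inference methods are needed.

Method: Benchmarks 12 decoder-only LLMs, analyzes redundant information in their responses, and proposes prompt-engineering strategies to shorten responses.

Result: Experiments show that prompts targeting length reduction and controlling information content significantly reduce energy consumption (25-60%) while preserving response quality.

Conclusion: Prompt engineering is an effective way to improve LLM energy efficiency; future work can explore output compression further.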

Abstract: A significant portion of the energy consumed by Large Language Models (LLMs)
arises from their inference processes; hence developing energy-efficient
methods for inference is crucial. While several techniques exist for inference
optimization, output compression remains relatively unexplored, with only a few
preliminary efforts addressing this aspect. In this work, we first benchmark 12
decoder-only LLMs across 5 datasets, revealing that these models often produce
responses that are substantially longer than necessary. We then conduct a
comprehensive quality assessment of LLM responses, formally defining six
information categories present in LLM responses. We show that LLMs often tend
to include redundant or additional information besides the minimal answer. To
address this issue of long responses by LLMs, we explore several simple and
intuitive prompt-engineering strategies. Empirical evaluation shows that
appropriate prompts targeting length reduction and controlling information
content can achieve significant energy optimization between 25-60% by reducing
the response length while preserving the quality of LLM responses.

[22] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Ruiran Su,Jiasheng Si,Zhijiang Guo,Janet B. Pierrehumbert

Main category: cs.CL

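TL;DR: ClimateViz is the first large-scale scientific chart benchmark for scientific fact-checking, with about 49K annotated claims; evaluation shows that current multimodal models fall well short of human performance on chart reasoning.

Details Motivation: Scientific charts are key for presenting quantitative evidence, yet existing scientific fact-checking has largely overlooked them.

Method: Builds the ClimateViz dataset of expert-curated charts with knowledge graph explanations, and evaluates a range of multimodal models.

Result: The best models reach 76.2-77.8% accuracy, far below human performance (89.3-92.7%); explanation augmentation improves some models.

Conclusion: Existing models are inadequate at chart reasoning, and providing explanations can help improve performance.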

Abstract: Scientific fact-checking has mostly focused on text and tables, overlooking
scientific charts, which are key for presenting quantitative evidence and
statistical reasoning. We introduce ClimateViz, the first large-scale benchmark
for scientific fact-checking using expert-curated scientific charts. ClimateViz
contains 49,862 claims linked to 2,896 visualizations, each labeled as support,
refute, or not enough information. To improve interpretability, each example
includes structured knowledge graph explanations covering trends, comparisons,
and causal relations. We evaluate state-of-the-art multimodal language models,
including both proprietary and open-source systems, in zero-shot and few-shot
settings. Results show that current models struggle with chart-based reasoning:
even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to
77.8 percent accuracy in label-only settings, far below human performance (89.3
and 92.7 percent). Explanation-augmented outputs improve performance in some
models. We released our dataset and code alongside the paper.

[23] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Fariz Ikhwantri,Dusica Marijan

Main category: cs.CL

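TL;DR: The paper proposes EXCLAIM, a compliance detection method based on natural language inference (NLI) that uses multi-hop reasoning to explain the compliance of assurance cases, and uses large language models to generate assurance cases to address data scarcity.

Details Motivation: Compliance detection for assurance cases faces challenges including complex legal and technical texts, the need for model explanations, and limited data.

Method: Uses NLI within a multi-hop inference framework, generates assurance cases with large language models, and introduces metrics for coverage and structural consistency.

Result: A case study on GDPR requirements demonstrates the method's effectiveness and indicates the potential of NLI-based approaches for automating compliance.

Conclusion: EXCLAIM offers an explainable and traceable solution for automated compliance detection.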

Abstract: Ensuring complex systems meet regulations typically requires checking the
validity of assurance cases through a claim-argument-evidence framework. Some
challenges in this process include the complicated nature of legal and
technical texts, the need for model explanations, and limited access to
assurance case data. We propose a compliance detection approach based on
Natural Language Inference (NLI): EXplainable CompLiance detection with
Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the
claim-argument-evidence structure of an assurance case as a multi-hop inference
for explainable and traceable compliance detection. We address the limited
number of assurance cases by generating them using large language models
(LLMs). We introduce metrics that measure the coverage and structural
consistency. We demonstrate the effectiveness of the generated assurance case
from GDPR requirements in a multi-hop inference task as a case study. Our
results highlight the potential of NLI-based approaches in automating the
regulatory compliance process.

[24] Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition

Mehedi Hasan Bijoy,Dejan Porjazovski,Tamás Grósz,Mikko Kurimo

Main category: cs.CL

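TL;DR: Proposes a multi-teacher knowledge distillation method to improve multilingual speech emotion recognition in English, Finnish, and French, outperforming the baselines.

Details Motivation: Despite progress in monolingual speech emotion recognition (SER), multilingual SER remains challenging; the goal is to train a single model that handles multiple languages.

Method: Uses Wav2Vec2.0 as the foundation of monolingual teacher models and distills their knowledge into a single multilingual student model via language-aware multi-teacher knowledge distillation.

Result: The student model performs strongly, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines.

Conclusion: The method excels at improving recognition of sad and neutral emotions but still struggles with anger and happiness.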

Abstract: Speech Emotion Recognition (SER) is crucial for improving human-computer
interaction. Despite strides in monolingual SER, extending them to build a
multilingual system remains challenging. Our goal is to train a single model
capable of multilingual SER by distilling knowledge from multiple teacher
models. To address this, we introduce a novel language-aware multi-teacher
knowledge distillation method to advance SER in English, Finnish, and French.
It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and
then distills their knowledge into a single multilingual student model. The
student model demonstrates state-of-the-art performance, with a weighted recall
of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish
dataset, surpassing fine-tuning and knowledge distillation baselines. Our
method excels in improving recall for sad and neutral emotions, although it
still faces challenges in recognizing anger and happiness.

[25] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Ahmed Hasanaath,Aisha Alansari,Ahmed Ashraf,Chafik Salmane,Hamzah Luqman,Saad Ezzini

Main category: cs.CL

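TL;DR: This paper comprehensively evaluates several reasoning-focused large language models (LLMs) on Arabic NLP tasks, with particular attention to DeepSeek models, using zero-shot, few-shot, and fine-tuning strategies, and reports key findings such as the large gains from few-shot examples.

Details Motivation: Although LLMs perform well on reasoning and general NLP tasks, their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored.

Method: Systematically evaluates multiple LLMs on 15 Arabic NLP tasks using zero-shot, few-shot, and fine-tuning (including LoRA) strategies.

Result: Carefully selecting three few-shot examples raises classification F1 by 13 points on average; DeepSeek outperforms GPT o4-mini by 12 points in the zero-shot setting; LoRA fine-tuning yields up to 8 additional points in F1 and BLEU compared with equivalent increases in model scale.

Conclusion: The study demonstrates the potential of LLMs on complex Arabic tasks and offers practical strategies for improving performance and efficiency.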

Abstract: Large language models (LLMs) have shown remarkable progress in reasoning
abilities and general natural language processing (NLP) tasks, yet their
performance on Arabic data, characterized by rich morphology, diverse dialects,
and complex script, remains underexplored. This paper presents a comprehensive
benchmarking study of multiple reasoning-focused LLMs, with a special emphasis
on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP
tasks. We experiment with various strategies, including zero-shot, few-shot,
and fine-tuning. This allows us to systematically evaluate performance on
datasets covering a range of applications to examine their capacity for
linguistic reasoning under different levels of complexity. Our experiments
reveal several key findings. First, carefully selecting just three in-context
examples delivers an average uplift of over 13 F1 points on classification
tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection
from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures
outperform a strong GPT o4-mini baseline by an average of 12 F1 points on
complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning
yields up to an additional 8 points in F1 and BLEU compared to equivalent
increases in model scale. The code is available at
https://anonymous.4open.science/r/AraReasoner41299

Francisco Vargas,Alejandro González Coene,Gaston Escalante,Exequiel Lobón,Manuel Pulido

Main category: cs.CL

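TL;DR: The paper proposes a two-step method for extracting traffic accident information from legal documents: text segmentation followed by entity extraction, which substantially improves accuracy over the classic method.

Details Motivation: Extracting traffic accident information from legal documents is crucial for quantifying insurance company costs, but traditional methods are inefficient and error-prone.

Method: A two-step procedure: text segmentation (regular expressions or vectorization) and LLM-based entity extraction, with the LLaMA models fine-tuned.

Result: The fine-tuned LLaMA-2 70B reaches 79.4% accuracy, LLaMA-3 8B comes close at 76.6%, and GPT-4 Turbo is highest at 86.1%.

Conclusion: Vectorized segmentation combined with LLMs significantly outperforms the classic method; models are improving rapidly, and GPT-4 Turbo performs best.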

Abstract: The extraction of information about traffic accidents from legal documents is
crucial for quantifying insurance company costs. Extracting entities such as
percentages of physical and/or psychological disability and the involved
compensation amounts is a challenging process, even for experts, due to the
subtle arguments and reasoning in the court decision. A two-step procedure is
proposed: first, segmenting the document identifying the most relevant
segments, and then extracting the entities. For text segmentation, two
methodologies are compared: a classic method based on regular expressions and a
second approach that divides the document into blocks of n-tokens, which are
then vectorized using multilingual models for semantic searches
(text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models
(LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to
the selected segments for entity extraction. For the LLaMA models, fine-tuning
is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a
significant number of hallucinations in extractions which are an important
contention point for named entity extraction. This work shows that these
hallucinations are substantially reduced after finetuning the model. The
performance of the methodology based on segment vectorization and subsequent
use of LLMs significantly surpasses the classic method which achieves an
accuracy of 39.5%. Among open-source models, LLaMA-2 70B with finetuning
achieves the highest accuracy, 79.4%, surpassing its base version at 61.7%.
Notably, the base LLaMA-3 8B model already performs comparably to the finetuned
LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model
development. Meanwhile, GPT-4 Turbo achieves the highest accuracy at 86.1%.
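
The second segmentation approach in this abstract (split the document into n-token blocks, vectorize them, and search semantically) can be sketched as below. This is an illustration only: the bag-of-words `embed` function is a toy stand-in for the multilingual embedding models named in the abstract, whitespace splitting stands in for a real tokenizer, and the function names and sample text are hypothetical.

```python
import math
from collections import Counter


def split_into_blocks(text, n_tokens):
    """Split text into consecutive blocks of at most n_tokens whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n_tokens]) for i in range(0, len(tokens), n_tokens)]


def embed(text):
    """Toy stand-in for an embedding model: a lowercase bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_blocks(document, query, n_tokens=8, k=1):
    """Return the k blocks most similar to the query, best first."""
    blocks = split_into_blocks(document, n_tokens)
    q = embed(query)
    ranked = sorted(blocks, key=lambda b: cosine(embed(b), q), reverse=True)
    return ranked[:k]


document = (
    "the court reviewed the procedural history of the case "
    "the expert assessed a physical disability of twelve percent "
    "the hearing was adjourned until the following month"
)
best = top_blocks(document, "percentage of physical disability", n_tokens=9, k=1)
```

The selected blocks would then be passed to an LLM prompt for entity extraction; that step is omitted here because it depends on the model and prompt used.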

[27] PropMEND: Hypernetworks for Knowledge Propagation in LLMs

Zeyu Leo Liu,Greg Durrett,Eunsol Choi

Main category: cs.CL

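TL;DR: PropMEND is a hypernetwork-based knowledge propagation method that modifies gradients to encourage injected knowledge to propagate, addressing the shortcomings of conventional knowledge editing on multi-hop reasoning questions.

Details Motivation: Conventional knowledge editing techniques can inject knowledge but do not support reasoning with it. PropMEND aims to learn, via a hypernetwork, how to adjust gradients so that knowledge propagates and supports multi-hop reasoning.

Method: PropMEND extends the meta-objective of MEND, using a hypernetwork to learn to modify the gradients of a language modeling loss so that injected knowledge can be used to answer multi-hop questions.

Result: On the RippleEdit dataset, PropMEND improves performance markedly, nearly doubling accuracy on multi-hop questions. On unseen entity-relation pairs, PropMEND still outperforms existing methods, but the performance gap narrows.

Conclusion: PropMEND performs well at knowledge propagation, especially on multi-hop reasoning tasks, but leaves room for improvement, particularly in generalizing to a wider range of relations.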

Abstract: Knowledge editing techniques for large language models (LLMs) can inject
knowledge that is later reproducible verbatim, but they fall short on
propagating that knowledge: models cannot answer questions that require
reasoning with the injected knowledge. We present a hypernetwork-based approach
for knowledge propagation, named PropMEND, where we meta-learn how to modify
gradients of a language modeling loss to encourage injected information to
propagate. Our approach extends the meta-objective of MEND [29] so that
gradient updates on knowledge are transformed to enable answering multi-hop
questions involving that knowledge. We show improved performance on the
RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop
questions whose answers are not explicitly stated in the injected fact. We
further introduce a new dataset, Controlled RippleEdit, to evaluate the
generalization of our hypernetwork, testing knowledge propagation along
relations and entities unseen during hypernetwork training. PropMEND still
outperforms existing approaches in unseen entity-relation pairs, yet the
performance gap decreases substantially, suggesting future work in propagating
knowledge to a wide range of relations.

[28] Can A Gamer Train A Mathematical Reasoning Model?

Andrew Shin

Main category: cs.CL

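TL;DR: This paper shows how to train a capable mathematical reasoning model on a single ordinary gaming GPU (RTX 3080 Ti) by combining reinforcement learning with memory optimization techniques, challenging the view that high-performance AI requires large-scale infrastructure.

Details Motivation: Developing large language models (LLMs) currently requires expensive compute; this paper aims to lower training costs through technical innovation and make high-performance AI research more accessible.

Method: Trains a 1.5B-parameter mathematical reasoning model using reinforcement learning and memory optimization techniques.

Result: In a resource-constrained environment, the model performs comparably to or better than larger models on mathematical reasoning benchmarks.

Conclusion: The results indicate that high-performance mathematical reasoning models need not depend on large-scale infrastructure, offering a viable path for researchers with limited resources.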

Abstract: While large language models (LLMs) have achieved remarkable performance in
various tasks including mathematical reasoning, their development typically
demands prohibitive computational resources. Recent advancements have reduced
costs for training capable models, yet even these approaches rely on high-end
hardware clusters. In this paper, we demonstrate that a single average gaming
GPU can train a solid mathematical reasoning model, by integrating
reinforcement learning and memory optimization techniques. Specifically, we
train a 1.5B parameter mathematical reasoning model on RTX 3080 Ti of 16GB
memory that achieves comparable or better performance on mathematical reasoning
benchmarks than models several times larger, in resource-constrained
environments. Our results challenge the paradigm that state-of-the-art
mathematical reasoning necessitates massive infrastructure, democratizing
access to high-performance AI research.
https://github.com/shinandrew/YouronMath.

[29] FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Qinggang Zhang,Zhishang Xiang,Yilin Xiao,Le Wang,Junhui Li,Xinrun Wang,Jinsong Su

Main category: cs.CL

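TL;DR: The paper proposes FaithfulRAG, a framework that explicitly models conflicts between the retrieved context and the model's parametric knowledge to address the unfaithfulness of large language models on knowledge-intensive tasks.

Details Motivation: Existing methods achieve faithfulness by forcibly suppressing the model's parametric knowledge, which damages the model's internal knowledge structure and raises the risk of misinterpreting the context.

Method: FaithfulRAG identifies knowledge conflicts at the fact level and designs a self-thinking process that lets the model reason about and integrate conflicting facts before generating a response.

Result: Experiments show that the method outperforms state-of-the-art methods.

Conclusion: FaithfulRAG effectively resolves knowledge conflicts and improves the model's faithfulness and performance.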

Abstract: Large language models (LLMs) augmented with retrieval systems have
demonstrated significant potential in handling knowledge-intensive tasks.
However, these models often struggle with unfaithfulness issues, generating
outputs that either ignore the retrieved context or inconsistently blend it
with the LLM's parametric knowledge. This issue is particularly severe in cases
of knowledge conflict, where the retrieved context conflicts with the model's
parametric knowledge. While existing faithful RAG approaches enforce strict
context adherence through well-designed prompts or modified decoding
strategies, our analysis reveals a critical limitation: they achieve
faithfulness by forcibly suppressing the model's parametric knowledge, which
undermines the model's internal knowledge structure and increases the risk of
misinterpreting the context. To this end, this paper proposes FaithfulRAG, a
novel framework that resolves knowledge conflicts by explicitly modeling
discrepancies between the model's parametric knowledge and retrieved context.
Specifically, FaithfulRAG identifies conflicting knowledge at the fact level
and designs a self-thinking process, allowing LLMs to reason about and
integrate conflicting facts before generating responses. Extensive experiments
demonstrate that our method outperforms state-of-the-art methods. The code is
available at https://github.com/DeepLearnXMU/Faithful-RAG

[30] Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions

Clara Lachenmaier,Judith Sieker,Sina Zarrieß

Main category: cs.CL

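TL;DR: The study finds that large language models struggle to correct users' false beliefs when handling factual information in the political domain, and that their ability to establish common ground with users faces significant challenges.

Details Motivation: To investigate how large language models manage common ground when they do or do not possess knowledge, particularly in the political domain where the risk of misinformation is high.

Method: Uses direct knowledge questions and loaded questions that presuppose misinformation to evaluate how well large language models answer and whether they actively correct false user beliefs.

Result: Large language models perform poorly at correcting false user beliefs and establishing common ground, particularly in connection with their political bias and level of knowledge.

Conclusion: The study highlights the limited ability of large language models to correct errors and mitigate the spread of misinformation in political discourse, raising concerns about their real-world use.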

Abstract: Communication among humans relies on conversational grounding, allowing
interlocutors to reach mutual understanding even when they do not have perfect
knowledge and must resolve discrepancies in each other’s beliefs. This paper
investigates how large language models (LLMs) manage common ground in cases
where they (don’t) possess knowledge, focusing on facts in the political domain
where the risk of misinformation and grounding failure is high. We examine the
ability of LLMs to answer direct knowledge questions and loaded questions that
presuppose misinformation. We evaluate whether loaded questions lead LLMs to
engage in active grounding and correct false user beliefs, in connection to
their level of knowledge and their political bias. Our findings highlight
significant challenges in LLMs’ ability to engage in grounding and reject false
user beliefs, raising concerns about their role in mitigating misinformation in
political discourse.

[31] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Marek Kadlčík,Michal Štefánik,Timothee Mickus,Michal Spiegel,Josef Kuchař

Main category: cs.CL

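TL;DR: Pretrained language models are prone to arithmetic errors, and existing methods have failed to probe the precision of numeric embeddings effectively. This paper proposes a new probing technique that decodes numbers with high accuracy and shows that models represent numbers precisely after pre-training. Aligning the embeddings can reduce arithmetic errors.

Details Motivation: Pretrained language models perform poorly on arithmetic tasks, and existing methods fail to capture the sinusoidal structure of their numeric embeddings, so they cannot accurately assess how well the models represent numbers.

Method: Proposes a new probing technique that decodes numeric values from input embeddings with high accuracy, revealing the precision of number representations in pretrained language models.

Result: The new probe shows that models represent numbers with high precision after pre-training, that embedding precision is correlated with arithmetic errors, and that aligning the embeddings reduces those errors.

Conclusion: The improved probing technique reveals the precision of numeric embeddings in language models, and aligning the embeddings can mitigate arithmetic errors.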

Abstract: Pretrained language models (LMs) are prone to arithmetic errors. Existing
work showed limited success in probing numeric values from models’
representations, indicating that these errors can be attributed to the inherent
unreliability of distributionally learned embeddings in representing exact
quantities. However, we observe that previous probing methods are inadequate
for the emergent structure of learned number embeddings with sinusoidal
patterns.
In response, we propose a novel probing technique that decodes numeric values
from input embeddings with near-perfect accuracy across a range of open-source
LMs. This shows that, after pre-training alone, LMs represent numbers with
remarkable precision. Finally, we find that the embeddings’ precision, as judged
by our probe’s accuracy, explains a large portion of LMs’ errors in elementary
arithmetic, and show that aligning the embeddings with the pattern discovered
by our probe can mitigate these errors.

[32] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Yuan Guo,Tingjia Miao,Zheng Wu,Pengzhou Cheng,Ming Zhou,Zhuosheng Zhang

Main category: cs.CL

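TL;DR: The paper introduces UI-NEXUS, a benchmark for evaluating mobile agents on compositional tasks, and proposes the AGENT-NEXUS scheduling system to raise task success rates.

Details Motivation: Prior work has focused mostly on atomic tasks and overlooked compositional tasks, which are essential for real-world applications.

Method: Proposes the UI-NEXUS benchmark, covering three categories of compositional operations and evaluated on 20 local apps and 30 online service apps. It then proposes the AGENT-NEXUS scheduling system, which dynamically decomposes tasks.

Result: Existing agents perform poorly on compositional tasks; AGENT-NEXUS improves task success rates by 24% to 40% without significantly increasing overhead.

Conclusion: UI-NEXUS exposes the challenges of compositional tasks, and AGENT-NEXUS offers an efficient solution for mobile agents.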

Abstract: Autonomous agents powered by multimodal large language models have been
developed to facilitate task execution on mobile devices. However, prior work
has predominantly focused on atomic tasks – such as short-chain execution tasks
and single-screen grounding tasks – while overlooking the generalization to
compositional tasks, which are indispensable for real-world applications. This
work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile
agents on three categories of compositional operations: Simple Concatenation,
Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in
20 fully controllable local utility app environments, as well as 30 online
Chinese and English service apps. It comprises 100 interactive task templates
with an average optimal step count of 14.05. Experimental results across a
range of mobile agents with agentic workflow or agent-as-a-model show that
UI-NEXUS presents significant challenges. Specifically, existing agents
generally struggle to balance performance and efficiency, exhibiting
representative failure modes such as under-execution, over-execution, and
attention drift, causing a visible atomic-to-compositional generalization gap.
Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient
scheduling system to tackle compositional mobile tasks. AGENT-NEXUS
extrapolates the abilities of existing mobile agents by dynamically decomposing
long-horizon tasks to a series of self-contained atomic subtasks. AGENT-NEXUS
achieves 24% to 40% task success rate improvement for existing mobile agents on
compositional operation tasks within the UI-NEXUS benchmark without
significantly increasing inference overhead. The demo video, dataset, and code
are available on the project page at https://ui-nexus.github.io.

[33] FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents

Satu Hopponen,Tomi Kinnunen,Alexandre Nikolaev,Rosa González Hautamäki,Lauri Tavi,Einar Meister

Main category: cs.CL

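TL;DR: Introduces FROST-EMA, a new bilingual speech dataset for research in phonetics and automatic speaker verification.

Details Motivation: To study language variability, in particular the effects of L2 and imitated L2 speech on phonetics and speech technology.

Method: Collects speech from 18 bilingual speakers in their native language (L1), second language (L2), and imitated L2, and demonstrates the dataset through two case studies.

Result: Preliminary studies indicate that L2 and imitated L2 speech affect the performance of an automatic speaker verification system and also reveal differences in articulatory patterns.

Conclusion: The FROST-EMA dataset provides a new resource for research on language variability and shows potential for phonetics and speech technology.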

Abstract: We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of
Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers,
who produced speech in their native language (L1), second language (L2), and
imitated L2 (fake foreign accent). The new corpus enables research into
language variability from phonetic and technological points of view.
Accordingly, we include two preliminary case studies to demonstrate both
perspectives. The first case study explores the impact of L2 and imitated L2 on
the performance of an automatic speaker verification system, while the second
illustrates the articulatory patterns of one speaker in L1, L2, and a fake
accent.

[34] Learning to Reason Across Parallel Samples for LLM Reasoning

Jianing Qi,Xi Ye,Hao Tang,Zhigang Zhu,Eunsol Choi

Main category: cs.CL

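TL;DR: Trains a compact LLM, the Sample Set Aggregator (SSA), to aggregate the answers in a set of multiple samples; the method outperforms other test-time scaling methods in math domains and generalizes well.

Details Motivation: SSA is proposed to make more effective use of multiple sampled outputs and improve LLM performance on reasoning tasks.

Method: Trains a compact SSA model and optimizes its aggregation of multiple samples with reinforcement learning to improve answer accuracy.

Result: On multiple reasoning datasets, SSA outperforms other test-time scaling methods such as reward model-based re-ranking.

Conclusion: SSA generalizes across sample set sizes, base models, and tasks, and works efficiently with black-box models.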

Abstract: Scaling test-time compute brings substantial performance gains for large
language models (LLMs). By sampling multiple answers and heuristically
aggregating them (e.g., either through majority voting or using
verifiers to rank the answers), one can achieve consistent performance gains in
math domains. In this paper, we propose a new way to leverage such multiple
sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that
takes a concatenated sequence of multiple samples and outputs the final answer,
optimizing it for the answer accuracy with reinforcement learning. Experiments
on multiple reasoning datasets show that SSA outperforms other test-time
scaling methods such as reward model-based re-ranking. Our approach also shows
a promising generalization ability, across sample set sizes, base model
families and scales, and tasks. By separating LLMs to generate answers and LLMs
to analyze and aggregate sampled answers, our approach can work with the
outputs from premier black box models easily and efficiently.
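
For contrast with SSA, the majority-voting baseline mentioned in this abstract can be sketched in a few lines, along with one possible way to concatenate samples into a single aggregator input. The prompt template, the tie-breaking rule, and the function names are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer that appears first."""
    counts = Counter(answers)
    best_count = max(counts.values())
    for answer in answers:
        if counts[answer] == best_count:
            return answer


def build_aggregator_input(question, samples):
    """Concatenate a question and its sampled solutions into one prompt string.

    The template below is hypothetical, not the format used by SSA.
    """
    parts = [f"Question: {question}"]
    for i, sample in enumerate(samples, start=1):
        parts.append(f"Sample {i}: {sample}")
    parts.append("Final answer:")
    return "\n".join(parts)


samples = ["6 * 7 = 42", "6 * 7 = 41", "7 * 6 = 42"]
final_answers = ["42", "41", "42"]
voted = majority_vote(final_answers)
prompt = build_aggregator_input("What is 6 * 7?", samples)
```

Majority voting sees only the extracted final answers, whereas an aggregator model reading the concatenated prompt also sees the reasoning in each sample.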

[35] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Haozhen Zhang,Tao Feng,Jiaxuan You

Main category: cs.CL

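TL;DR: Proposes Router-R1, a reinforcement learning-based multi-LLM routing framework that optimizes the performance-cost trade-off through dynamic model invocation and response integration.

Details Motivation: Existing LLM routers support only single-round, one-to-one mapping and cannot exploit the complementary strengths of multiple models on complex tasks.

Method: Uses reinforcement learning to formulate routing and aggregation as a sequential decision process; the router is itself an LLM that interleaves internal deliberation with dynamic model invocation.

Result: On seven benchmarks, Router-R1 outperforms the baselines while balancing performance and cost.

Conclusion: Router-R1 achieves effective routing through reinforcement learning, generalizes to unseen model selection, and opens a new route to optimizing the performance-cost trade-off.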

Abstract: The rapid emergence of diverse large language models (LLMs) has spurred the
development of LLM routers that assign user queries to the most suitable model.
However, existing LLM routers typically perform a single-round, one-to-one
mapping (i.e., assigning each query to a single model in isolation),
which limits their capability to tackle complex tasks that demand the
complementary strengths of multiple LLMs. In this paper, we present
Router-R1, a reinforcement learning (RL)-based framework that
formulates multi-LLM routing and aggregation as a sequential decision process.
Router-R1 instantiates the router itself as a capable LLM, leveraging its
reasoning ability to interleave “think” actions (internal deliberation) with
“route” actions (dynamic model invocation), and integrates each response into
its evolving context. To guide learning, we employ a lightweight rule-based
reward comprising format rewards, final outcome rewards, and a novel cost
reward for performance and cost trade-off optimization, opening a pathway
toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions
only on simple model descriptors such as pricing, latency, and example
performance, enabling strong generalization to unseen model selection.
Experiments on seven general and multi-hop QA benchmarks show that Router-R1
outperforms several strong baselines, achieving superior performance while
maintaining robust generalization and cost management. Code is available at
https://github.com/ulab-uiuc/Router-R1.
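
The rule-based reward described in this abstract has three parts: a format reward, a final outcome reward, and a cost reward. The sketch below shows one way such parts could be combined. The `<answer>` tag convention, the exact-match check, the linear cost term, and the weights are all hypothetical choices for illustration and are not taken from the paper or its code.

```python
def format_reward(response):
    """1.0 if the response holds exactly one well-ordered <answer>...</answer> span."""
    ok = (
        response.count("<answer>") == 1
        and response.count("</answer>") == 1
        and response.index("<answer>") < response.index("</answer>")
    )
    return 1.0 if ok else 0.0


def extract_answer(response):
    """Return the text between the answer tags (assumes the format check passed)."""
    start = response.index("<answer>") + len("<answer>")
    end = response.index("</answer>")
    return response[start:end].strip()


def outcome_reward(response, gold):
    """1.0 on a case-insensitive exact match with the gold answer, else 0.0."""
    if format_reward(response) == 0.0:
        return 0.0
    return 1.0 if extract_answer(response).lower() == gold.strip().lower() else 0.0


def cost_reward(cost, max_cost):
    """1.0 for a free rollout, falling linearly to 0.0 at or above max_cost."""
    return max(0.0, 1.0 - cost / max_cost)


def total_reward(response, gold, cost, max_cost,
                 w_format=0.1, w_outcome=1.0, w_cost=0.2):
    """Weighted sum of the three parts; the weights are illustrative only."""
    return (
        w_format * format_reward(response)
        + w_outcome * outcome_reward(response, gold)
        + w_cost * cost_reward(cost, max_cost)
    )


cheap = total_reward("<answer>Paris</answer>", "paris", cost=1.0, max_cost=10.0)
expensive = total_reward("<answer>Paris</answer>", "paris", cost=9.0, max_cost=10.0)
unformatted = total_reward("Paris", "paris", cost=1.0, max_cost=10.0)
```

With these weights a correct answer always scores higher than an incorrect one, and among correct answers the cheaper rollout scores higher, which is the trade-off the cost reward is meant to express.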

[36] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Yaniv Nikankin,Dana Arad,Yossi Gandelsman,Yonatan Belinkov

Main category: cs.CL

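TL;DR: Studies the performance gap of vision-language models (VLMs) between visual and textual tasks by analyzing computational sub-graphs (circuits), and proposes a training-free method to narrow the gap.

Details Motivation: To explore why VLMs perform better on text tasks than on visual tasks, and to try to narrow this gap.

Method: Compares the circuits for visual and textual tasks, finds that the representations align only in later layers, and proposes patching visual data token representations from later layers back into earlier layers.

Result: Experiments show that the method closes a third of the performance gap between the two modalities on average.

Conclusion: Sheds light on the cause of the multimodal performance gap and proposes a training-free method that partially addresses it.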

Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions
on visual inputs (e.g., counting objects in an image), yet demonstrate higher
accuracies when performing an analogous task on text (e.g., counting words in a
text). We investigate this accuracy gap by identifying and comparing the
circuits - the task-specific computational sub-graphs - in different
modalities. We show that while circuits are largely disjoint between
modalities, they implement relatively similar functionalities: the differences
lie primarily in processing modality-specific data positions (an image or a
text sequence). Zooming in on the image data representations, we observe they
become aligned with the higher-performing analogous textual representations
only towards later layers, too late in processing to effectively influence
subsequent positions. To overcome this, we patch the representations of visual
data tokens from later layers back into earlier layers. In experiments with
multiple tasks and models, this simple intervention closes a third of the
performance gap between the modalities, on average. Our analysis sheds light on
the multi-modal performance gap in VLMs and suggests a training-free approach
for reducing it.

cs.CV

[37] Towards Reliable AR-Guided Surgical Navigation: Interactive Deformation Modeling with Data-Driven Biomechanics and Prompts

Zheng Han,Jun Zhou,Jialun Pei,Jing Qin,Yingfang Fan,Qi Dou

Main category: cs.CV

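TL;DR: The paper proposes a data-driven biomechanics algorithm combined with a human-in-the-loop mechanism, improving the efficiency and accuracy of deformation modeling in AR-guided surgery.

Details Motivation: Existing intraoperative deformation modeling methods are computationally expensive and struggle with large anatomical changes, which undermines the reliability of AR surgical navigation.

Method: Uses a data-driven biomechanics algorithm together with a human-in-the-loop mechanism that lets surgeons dynamically correct anatomical misalignments.

Result: On a public dataset the algorithm achieves a mean target registration error of 3.42 mm, which drops to 2.78 mm with the interactive framework, outperforming existing methods.

Conclusion: The framework delivers efficient and accurate deformation modeling, strengthens surgeon-algorithm collaboration, and lays the groundwork for safer computer-assisted surgery.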

Abstract: In augmented reality (AR)-guided surgical navigation, preoperative organ
models are superimposed onto the patient’s intraoperative anatomy to visualize
critical structures such as vessels and tumors. Accurate deformation modeling
is essential to maintain the reliability of AR overlays by ensuring alignment
between preoperative models and the dynamically changing anatomy. Although the
finite element method (FEM) offers physically plausible modeling, its high
computational cost limits intraoperative applicability. Moreover, existing
algorithms often fail to handle large anatomical changes, such as those induced
by pneumoperitoneum or ligament dissection, leading to inaccurate anatomical
correspondences and compromised AR guidance. To address these challenges, we
propose a data-driven biomechanics algorithm that preserves FEM-level accuracy
while improving computational efficiency. In addition, we introduce a novel
human-in-the-loop mechanism into the deformation modeling process. This enables
surgeons to interactively provide prompts to correct anatomical misalignments,
thereby incorporating clinical expertise and allowing the model to adapt
dynamically to complex surgical scenarios. Experiments on a publicly available
dataset demonstrate that our algorithm achieves a mean target registration
error of 3.42 mm. Incorporating surgeon prompts through the interactive
framework further reduces the error to 2.78 mm, surpassing state-of-the-art
methods in volumetric accuracy. These results highlight the ability of our
framework to deliver efficient and accurate deformation modeling while
enhancing surgeon-algorithm collaboration, paving the way for safer and more
reliable computer-assisted surgeries.
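
The mean target registration error reported in this abstract is commonly computed as the average Euclidean distance between corresponding target points after registration. The sketch below assumes that definition and uses made-up coordinates in millimetres; the paper's exact evaluation protocol may differ.

```python
import math


def mean_tre(predicted, reference):
    """Mean Euclidean distance between corresponding 3D target points.

    predicted, reference: equal-length sequences of (x, y, z) tuples, in mm.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)


# Hypothetical landmark positions: the first is off by 5 mm, the second is exact.
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
reference = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error_mm = mean_tre(predicted, reference)
```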

[38] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li,Kaixin Xiong,Xiangyu Guo,Fang Li,Sixu Yan,Gangwei Xu,Lijun Zhou,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang

Main category: cs.CV

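TL;DR: ReCogDrive combines vision-language models (VLMs) with a diffusion planner to address the weak performance of autonomous driving in long-tail scenarios, and markedly improves benchmark metrics.

Details Motivation: Current end-to-end autonomous driving performs poorly in rare and long-tail scenarios, and existing methods suffer from a domain gap, a dimensionality mismatch, and the limitations of imitation learning.

Method: Three-stage training: (1) train the VLMs on driving question-answering datasets; (2) perform imitation learning with a diffusion planner; (3) fine-tune the planner with reinforcement learning.

Result: On the NAVSIM benchmark, ReCogDrive reaches a PDMS of 89.6, surpassing the previous vision-only state of the art by 5.6 PDMS.

Conclusion: By integrating VLMs with a diffusion planner, ReCogDrive markedly improves autonomous driving in complex scenarios and sets a new state of the art.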

Abstract: Although end-to-end autonomous driving has made remarkable progress, its
performance degrades significantly in rare and long-tail scenarios. Recent
approaches attempt to address this challenge by leveraging the rich world
knowledge of Vision-Language Models (VLMs), but these methods suffer from
several limitations: (1) a significant domain gap between the pre-training data
of VLMs and real-world driving data, (2) a dimensionality mismatch between the
discrete language space and the continuous action space, and (3) imitation
learning tends to capture the average behavior present in the dataset, which
may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an
autonomous driving system that integrates VLMs with a diffusion planner and
adopts a three-stage paradigm for training. In the first stage, we use
large-scale driving question-answering datasets to train the VLMs, mitigating
the domain discrepancy between generic content and real-world driving
scenarios. In the second stage, we employ a diffusion-based planner to perform
imitation learning, mapping representations from the latent language space to
continuous driving actions. Finally, we fine-tune the diffusion planner using
reinforcement learning with the NAVSIM non-reactive simulator, enabling the model
to generate safer, more human-like driving trajectories. We evaluate our
approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6
and setting a new state-of-the-art that surpasses the previous vision-only SOTA
by 5.6 PDMS.

[39] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

Aniket Rege,Zinnia Nie,Mahesh Ramesh,Unmesh Raskar,Zhuoran Yu,Aditya Kusupati,Yong Jae Lee,Ramya Korlakai Vinayak

Main category: cs.CV

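TL;DR: The paper analyzes how text-to-image (T2I) systems underrepresent Global South cultures because of their training data, and introduces CuRe, a novel and scalable benchmarking and scoring suite.

Details Motivation: Current T2I systems are trained mainly on Amero- and Euro-centric data and overlook the diversity of Global South cultures. To address this, the authors developed the CuRe suite to quantify cultural representativeness.

Method: CuRe uses the Wikimedia knowledge graph to build a dataset with a categorical hierarchy covering 300 cultural artifacts across 32 subcategories, and uses the marginal utility of attribute specification as a proxy for human judgments.

Result: CuRe scores correlate strongly with human judgments of perceptual similarity, image-text alignment, and cultural diversity, validated across multiple T2I systems and vision-language models.

Conclusion: CuRe offers an effective tool for benchmarking and improving the cultural representativeness of T2I systems; the code and dataset are open-sourced.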

Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is
heavily Amero- and Euro-centric, underrepresenting the cultures of the Global
South. To analyze these biases, we introduce CuRe, a novel and scalable
benchmarking and scoring suite for cultural representativeness that leverages
the marginal utility of attribute specification to T2I systems as a proxy for
human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy
built from the crowdsourced Wikimedia knowledge graph, with 300 cultural
artifacts across 32 cultural subcategories grouped into six broad cultural axes
(food, art, fashion, architecture, celebrations, and people). Our dataset’s
categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing
their response to increasing the informativeness of text conditioning, enabling
fine-grained cultural comparisons. We empirically observe much stronger
correlations of our class of scorers to human judgments of perceptual
similarity, image-text alignment, and cultural diversity across image encoders
(SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2,
Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three
variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0,
and DALL-E 3. The code and dataset are open-sourced and available at
https://aniketrege.github.io/cure/.

[40] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

Oishee Bintey Hoque,Abhijin Adiga,Aniruddha Adiga,Siddharth Chaudhary,Madhav V. Marathe,S. S. Ravi,Kirti Rajagopalan,Amanda Wilson,Samarth Swarup

Main category: cs.CV

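TL;DR: IGraSS is a novel framework that combines semantic segmentation with graph-based refinement and markedly improves the accuracy of canal network mapping.

Details Motivation: Existing semantic segmentation models rely on large annotated datasets, and incomplete annotations degrade their results. IGraSS uses graph-level properties such as reachability to refine the ground truth.

Method: A semantic segmentation module that fuses RGB, NDWI, and DEM is paired with a graph-based ground-truth refinement module, and the results are refined iteratively.

Result: IGraSS reduces unreachable canal segments from around 18% to 3% and significantly improves canal identification. It also generalizes to other infrastructure such as road networks.

Conclusion: IGraSS offers a general and efficient solution for refining noisy ground truth and mapping infrastructure from remote sensing imagery.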

Abstract: Accurate canal network mapping is essential for water management, including
irrigation planning and infrastructure maintenance. State-of-the-art semantic
segmentation models for infrastructure mapping, such as roads, rely on large,
well-annotated remote sensing datasets. However, incomplete or inadequate
ground truth can hinder these learning approaches. Many infrastructure networks
have graph-level properties such as reachability to a source (like canals) or
connectivity (roads) that can be leveraged to improve this existing ground
truth. This paper develops a novel iterative framework, IGraSS, combining a
semantic segmentation module, incorporating RGB and additional modalities (NDWI,
DEM), with a graph-based ground-truth refinement module. The segmentation module
processes satellite imagery patches, while the refinement module operates on
the entire data, viewing the infrastructure network as a graph. Experiments show
that IGraSS reduces unreachable canal segments from around 18% to 3%, and
training with refined ground truth significantly improves canal identification.
IGraSS serves as a robust framework for both refining noisy ground truth and
mapping canal networks from remote sensing imagery. We also demonstrate the
effectiveness and generalizability of IGraSS using road networks as an example,
applying a different graph-theoretic constraint to complete road networks.
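
The reachability property mentioned in the abstract can be illustrated with a breadth-first search that measures what fraction of network nodes cannot be reached from any water source. This is only a sketch of the measurement, not the IGraSS refinement procedure itself; the function name, the undirected-graph assumption, and the toy network are all hypothetical.

```python
from collections import deque

def unreachable_fraction(nodes, edges, sources):
    """Fraction of nodes not reachable from any source (undirected graph)."""
    adjacency = {node: [] for node in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    # Standard breadth-first search starting from all sources at once.
    seen = set(sources)
    queue = deque(sources)
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return 1.0 - len(seen) / len(nodes)

# Toy canal network: segments "c" and "d" are cut off from the source.
nodes = ["src", "a", "b", "c", "d"]
edges = [("src", "a"), ("a", "b"), ("c", "d")]
fraction = unreachable_fraction(nodes, edges, ["src"])
```

In this toy network two of the five nodes are disconnected, so the unreachable fraction is 0.4; a refinement step would aim to add the missing links that reconnect them.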

[41] Spectral Domain Neural Reconstruction for Passband FMCW Radars

Harshvardhan Takawale,Nirupam Roy

Main category: cs.CV

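TL;DR: SpINRv2 is a neural framework for high-fidelity volumetric reconstruction with FMCW radar; its enhancements address phase aliasing and sub-bin ambiguity at high frequencies.

Details Motivation: At high radar frequencies, phase aliasing and sub-bin ambiguity limit reconstruction accuracy; SpINRv2 aims to solve these problems.

Method: A fully differentiable frequency-domain forward model is paired with an implicit neural representation (INR), with sparsity and smoothness regularization.

Result: SpINRv2 significantly outperforms classical and learning-based baselines in high-frequency regimes, setting a new benchmark for neural radar-based 3D imaging.

Conclusion: With its improved model and regularization, SpINRv2 achieves more accurate volumetric reconstruction in high-frequency radar scenarios.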

Abstract: We present SpINRv2, a neural framework for high-fidelity volumetric
reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar.
Extending our prior work (SpINR), this version introduces enhancements that
allow accurate learning under high start frequencies, where phase aliasing and
sub-bin ambiguity become prominent. Our core contribution is a fully
differentiable frequency-domain forward model that captures the complex radar
response using closed-form synthesis, paired with an implicit neural
representation (INR) for continuous volumetric scene modeling. Unlike
time-domain baselines, SpINRv2 directly supervises the complex frequency
spectrum, preserving spectral fidelity while drastically reducing computational
overhead. Additionally, we introduce sparsity and smoothness regularization to
disambiguate sub-bin ambiguities that arise at fine range resolutions.
Experimental results show that SpINRv2 significantly outperforms both classical
and learning-based baselines, especially under high-frequency regimes,
establishing a new benchmark for neural radar-based 3D imaging.

[42] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework

Huixin Zhan,Jason H. Moore

Main category: cs.CV

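TL;DR: The paper proposes a method for modeling personalized surgeon fingerprints that combines a discrete diffusion framework with a vision-language-action (VLA) pipeline, learning distinctive gesture sequences while weighing performance against privacy risk.

Details Motivation: Current AI systems often ignore the personalization signal arising from surgeons' differences in training, experience, and motor behavior; a method is needed that accounts for both personalization and privacy protection.

Method: Using a discrete diffusion framework and a VLA pipeline, gesture prediction is cast as a structured sequence denoising task conditioned on video, surgical intent language, and a privacy-aware embedding of surgeon identity.

Result: On the JIGSAWS dataset, the method accurately reconstructs gesture sequences and learns distinctive motion fingerprints, but more personalized embeddings increase the risk of identity leakage.

Conclusion: Personalized embeddings improve performance but also raise privacy risk, so surgical modeling must balance the two.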

Abstract: Surgeons exhibit distinct operating styles due to differences in training,
experience, and motor behavior - yet current AI systems often ignore this
personalization signal. We propose a novel approach to model fine-grained,
surgeon-specific fingerprinting in robotic surgery using a discrete diffusion
framework integrated with a vision-language-action (VLA) pipeline. Our method
formulates gesture prediction as a structured sequence denoising task,
conditioned on multimodal inputs including endoscopic video, surgical intent
language, and a privacy-aware embedding of surgeon identity and skill.
Personalized surgeon fingerprinting is encoded through natural language prompts
using third-party language models, allowing the model to retain individual
behavioral style without exposing explicit identity. We evaluate our method on
the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture
sequences while learning meaningful motion fingerprints unique to each surgeon.
To quantify the privacy implications of personalization, we perform membership
inference attacks and find that more expressive embeddings improve task
performance but simultaneously increase susceptibility to identity leakage.
These findings demonstrate that while personalized embeddings improve
performance, they also increase vulnerability to identity leakage, revealing
the importance of balancing personalization with privacy risk in surgical
modeling. Code is available at:
https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.

[43] Open World Scene Graph Generation using Vision Language Models

Amartya Dutta,Kazi Sajeed Mehrab,Medha Sawhney,Abhilash Neog,Mridul Khurana,Sepideh Fatemi,Aanish Pradhan,M. Maruf,Ismini Lourentzou,Arka Daw,Anuj Karpatne

Main category: cs.CV

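TL;DR: Proposes Open-World SGG, a training-free, model-agnostic framework that uses pretrained vision-language models (VLMs) to generate scene graphs directly, with no additional learning.

Details Motivation: Traditional scene graph generation (SGG) methods require dataset-specific supervision, which limits their use in open-world settings, and existing VLM-based methods still need fine-tuning.

Method: SGG is cast as a zero-shot structured-reasoning problem, combining multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy.

Result: Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph dataset show that pretrained VLMs can perform relational understanding without task-level training.

Conclusion: The method offers an efficient, training-free solution for scene graph generation in open-world settings.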

Abstract: Scene-Graph Generation (SGG) seeks to recognize objects in an image and
distill their salient pairwise relationships. Most methods depend on
dataset-specific supervision to learn the variety of interactions, restricting
their usefulness in open-world settings, involving novel objects and/or
relations. Even methods that leverage large Vision Language Models (VLMs)
typically require benchmark-specific fine-tuning. We introduce Open-World SGG,
a training-free, efficient, model-agnostic framework that taps directly into
the pretrained knowledge of VLMs to produce scene graphs with zero additional
learning. Casting SGG as a zero-shot structured-reasoning problem, our method
combines multimodal prompting, embedding alignment, and a lightweight
pair-refinement strategy, enabling inference over unseen object vocabularies
and relation sets. To assess this setting, we formalize an Open-World
evaluation protocol that measures performance when no SGG-specific data have
been observed, either in terms of objects or relations. Experiments on Visual
Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate
the capacity of pretrained VLMs to perform relational understanding without
task-level training.

[44] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Mateusz Michalkiewicz,Anekha Sokhal,Tadeusz Michalkiewicz,Piotr Pawlikowski,Mahsa Baktashmotlagh,Varun Jampani,Guha Balakrishnan

Main category: cs.CV

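TL;DR: GIQ is a benchmark for evaluating the geometric reasoning of vision and vision-language foundation models; it reveals shortcomings in current models' geometric understanding.

Details Motivation: To assess how well models truly understand geometric properties.

Method: Models are tested on synthetic and real-world images of diverse polyhedra (such as Platonic and Archimedean solids).

Result: Existing models struggle with tasks such as 3D reconstruction, mental rotation, and zero-shot shape classification, although foundation models do detect specific 3D symmetry elements via linear probing.

Conclusion: GIQ provides an important benchmark for future research on geometric intelligence.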

Abstract: Monocular 3D reconstruction methods and vision-language models (VLMs)
demonstrate impressive results on standard benchmarks, yet their true
understanding of geometric properties remains unclear. We introduce GIQ, a
comprehensive benchmark specifically designed to evaluate the geometric
reasoning capabilities of vision and vision-language foundation models. GIQ
comprises synthetic and real-world images of 224 diverse polyhedra - including
Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and
compound shapes - covering varying levels of complexity and symmetry. Through
systematic experiments involving monocular 3D reconstruction, 3D symmetry
detection, mental rotation tests, and zero-shot shape classification tasks, we
reveal significant shortcomings in current models. State-of-the-art
reconstruction algorithms trained on extensive 3D datasets struggle to
reconstruct even basic geometric forms accurately. While foundation models
effectively detect specific 3D symmetry elements via linear probing, they
falter significantly in tasks requiring detailed geometric differentiation,
such as mental rotation. Moreover, advanced vision-language assistants exhibit
remarkably low accuracy on complex polyhedra, systematically misinterpreting
basic properties like face geometry, convexity, and compound structures. GIQ is
publicly available, providing a structured platform to highlight and address
critical gaps in geometric intelligence, facilitating future progress in
robust, geometry-aware representation learning.

[45] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

Andrew Z. Wang,Songwei Ge,Tero Karras,Ming-Yu Liu,Yogesh Balaji

Main category: cs.CV

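TL;DR: Studies modern decoder-only LLMs as text encoders for text-to-image diffusion models and finds that layer-normalized averaged embeddings outperform the T5 baseline.

Details Motivation: Many text-to-image models still use the somewhat outdated T5 and CLIP as text encoders; the authors investigate whether modern LLMs improve performance.

Method: Builds a standardized training and evaluation pipeline and trains 27 models with 12 text encoders to analyze how the embedding extraction approach, LLM variant, and model size affect generation.

Result: Layer-normalized averaging across all layers outperforms the conventional last-layer embeddings, and LLMs do better on complex prompt alignment and visio-linguistic reasoning.

Conclusion: With this conditioning, most modern LLMs used as text encoders outperform the T5 baseline in text-to-image generation.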

Abstract: Both text-to-image generation and large language models (LLMs) have made
significant advancements. However, many text-to-image models still employ the
somewhat outdated T5 and CLIP as their text encoders. In this work, we
investigate the effectiveness of using modern decoder-only LLMs as text
encoders for text-to-image diffusion models. We build a standardized training
and evaluation pipeline that allows us to isolate and evaluate the effect of
different text embeddings. We train a total of 27 text-to-image models with 12
different text encoders to analyze the critical aspects of LLMs that could
impact text-to-image generation, including the approaches to extract
embeddings, different LLM variants, and model sizes. Our experiments reveal
that the de facto way of using last-layer embeddings as conditioning leads to
inferior performance. Instead, we explore embeddings from various layers and
find that using layer-normalized averaging across all layers significantly
improves alignment with complex prompts. Most LLMs with this conditioning
outperform the baseline T5 model, showing enhanced performance in advanced
visio-linguistic reasoning skills.
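
One plausible reading of "layer-normalized averaging across all layers" is sketched below: each layer's token vector is normalized to zero mean and unit variance over its feature dimension, and the normalized vectors are then averaged across layers. This is an assumption about the procedure, not the authors' implementation; the function names and the tiny hidden states are hypothetical.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no affine)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def layer_normalized_average(hidden_states):
    """hidden_states[layer][token] is a feature vector (list of floats).

    Returns one pooled vector per token, averaged over all layers after
    per-layer normalization, so no single layer dominates by scale.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    pooled = [[0.0] * dim for _ in range(num_tokens)]
    for layer in hidden_states:
        for t, token_vector in enumerate(layer):
            normed = layer_norm(token_vector)
            for d in range(dim):
                pooled[t][d] += normed[d] / num_layers
    return pooled

# Two layers, two tokens, two features; layer 1 has a much larger scale.
hidden_states = [
    [[1.0, 3.0], [0.0, 2.0]],
    [[100.0, 300.0], [50.0, 10.0]],
]
embeddings = layer_normalized_average(hidden_states)
```

Because each layer is normalized before averaging, the large-magnitude second layer contributes on the same footing as the first: the first token pools to roughly [-1, 1], while the second token's two layers point in opposite directions and cancel to roughly [0, 0].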

[46] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation

Ioannis Iakovidis,Zahra Kalantari,Amir Hossein Payberah,Fernando Jaramillo,Francisco Pena Escobar

Main category: cs.CV

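TL;DR: The paper proposes a self-supervised method that trains a model with deep clustering and negative sampling to segment radar satellite images into water and land without manual annotations, and uses an ensemble to improve performance.

Details Motivation: Conventional methods need large amounts of manually annotated high-resolution radar satellite images, which are slow and expensive to produce. Self-supervised learning can address this.

Method: Trains a self-supervised model with a combination of deep clustering and negative sampling, and uses an ensemble to reduce variance.

Result: Compared with a single fully supervised model of the same architecture, the self-supervised ensemble improves Intersection over Union (IoU) by 0.02 on the test dataset.

Conclusion: Self-supervised learning shows promise for wetland monitoring and can substantially reduce reliance on manual annotation.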

Abstract: In recent years the wide availability of high-resolution radar satellite
images along with the advancement of computer vision models have enabled the
remote monitoring of the surface area of wetlands. However, these models
require large amounts of manually annotated satellite images, which are slow
and expensive to produce. To overcome this problem, self-supervised training
methods have been deployed to train models without using annotated data. In
this paper we use a combination of deep clustering and negative sampling to
train a model to segment radar satellite images into areas that separate water
from land without the use of any manual annotations. Furthermore, we implement
an ensemble version of the model to reduce variance and improve performance.
Compared to a single fully-supervised model using the same architecture, our
ensemble of self-supervised models achieves a 0.02 improvement in the
Intersection Over Union metric over our test dataset.
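
For reference, the Intersection over Union metric for binary water/land masks can be computed as below. This is a generic sketch of the metric; the function name and the small masks are hypothetical, and how the authors aggregate the score over their test dataset is not stated in the abstract.

```python
def intersection_over_union(pred_mask, true_mask):
    """IoU of two binary masks given as equally sized lists of rows."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return intersection / union if union else 1.0

# 1 marks water, 0 marks land (illustrative values only).
pred = [[1, 1, 0],
        [0, 1, 0]]
true = [[1, 0, 0],
        [0, 1, 1]]
iou = intersection_over_union(pred, true)
```

The masks share two water pixels out of four pixels marked as water in either mask, giving an IoU of 0.5.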

[47] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence

Octave Mariotti,Zhipeng Du,Yash Bhalgat,Oisin Mac Aodha,Hakan Bilen

Main category: cs.CV

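TL;DR: Proposes a new method that lifts 2D keypoints into 3D space using monocular depth estimation to learn dense correspondences, significantly outperforming supervised baselines on unseen keypoints.

Details Motivation: Existing supervised semantic correspondence methods generalize poorly: with only sparsely annotated keypoints, they struggle to learn dense correspondences.

Method: Uses monocular depth estimation to map 2D keypoints into a canonical 3D space and build a continuous manifold, without explicit 3D supervision or camera annotations.

Result: The method significantly outperforms supervised baselines on unseen keypoints, and unsupervised baselines outperform supervised ones when generalizing across datasets.

Conclusion: The proposed method learns more robust correspondences, and the results show the potential of unsupervised methods for generalization.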

Abstract: Semantic correspondence (SC) aims to establish semantically meaningful
matches across different instances of an object category. We illustrate how
recent supervised SC methods remain limited in their ability to generalize
beyond sparsely annotated training keypoints, effectively acting as keypoint
detectors. To address this, we propose a novel approach for learning dense
correspondences by lifting 2D keypoints into a canonical 3D space using
monocular depth estimation. Our method constructs a continuous canonical
manifold that captures object geometry without requiring explicit 3D
supervision or camera annotations. Additionally, we introduce SPair-U, an
extension of SPair-71k with novel keypoint annotations, to better assess
generalization. Experiments demonstrate not only that our model significantly
outperforms supervised baselines on unseen keypoints, highlighting its
effectiveness in learning robust correspondences, but also that unsupervised
baselines outperform supervised counterparts when generalized across different
datasets.

[48] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Vishaal Udandarao,Mehdi Cherti,Shyamgopal Karthik,Jenia Jitsev,Samuel Albanie,Matthias Bethge

Main category: cs.CV

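TL;DR: This paper examines 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used to evaluate the compositional understanding of vision-language models (VLMs), reveals pervasive design biases, and recommends improvements.

Details Motivation: Existing benchmarks have design flaws (such as in data sources and negative-sample construction) that prevent them from effectively measuring compositional understanding.

Method: Analyzes the benchmarks' design choices (such as data source and negative-sample construction) and compares blind heuristics (such as token length and log-likelihood under a language model) against CLIP models.

Result: Because of a distribution asymmetry between positive and negative samples, the benchmarks are vulnerable to simple heuristics and do not effectively test compositional understanding.

Conclusion: Benchmark design must be improved to reduce bias; the paper offers a few key recommendations.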

Abstract: We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for
measuring compositional understanding capabilities of vision-language models
(VLMs). We scrutinize design choices in their construction, including data
source (e.g. MS-COCO) and curation procedures (e.g. constructing negative
images/captions), uncovering several inherent biases across most benchmarks. We
find that blind heuristics (e.g. token-length, log-likelihood under a language
model) perform on par with CLIP models, indicating that these benchmarks do not
effectively measure compositional understanding. We demonstrate that the
underlying factor is a distribution asymmetry between positive and negative
images/captions, induced by the benchmark construction procedures. To mitigate
these issues, we provide a few key recommendations for constructing more robust
vision-language compositional understanding benchmarks that would be less
prone to such simple attacks.
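
To make the idea of a blind heuristic concrete, the sketch below scores a caption-selection benchmark using token length alone, never looking at the image. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the whitespace tokenization, the choice to prefer the shorter caption, and the three caption pairs are all made up for the example.

```python
def token_length(caption):
    """Number of whitespace-separated tokens (a stand-in for a real tokenizer)."""
    return len(caption.split())

def blind_accuracy(pairs, prefer_shorter=True):
    """Accuracy of picking the positive caption by length alone.

    pairs holds (positive_caption, negative_caption) tuples. The image is
    never consulted, and ties count as wrong.
    """
    correct = 0
    for positive, negative in pairs:
        pos_len = token_length(positive)
        neg_len = token_length(negative)
        if prefer_shorter:
            correct += 1 if pos_len < neg_len else 0
        else:
            correct += 1 if pos_len > neg_len else 0
    return correct / len(pairs)

# Made-up caption pairs for illustration only.
pairs = [
    ("a dog on a couch", "a dog on a couch that is not red"),
    ("two cats sleeping", "two cats sleeping under no blanket"),
    ("a red car beside a tree", "a tree beside a car"),
]
accuracy = blind_accuracy(pairs)
```

If negatives are systematically longer or shorter than positives, a rule like this scores well above chance without any visual or compositional understanding, which is the kind of distribution asymmetry the abstract describes.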

[49] Highly Compressed Tokenizer Can Generate Without Training

L. Lao Beyer,T. Li,X. Chen,S. Karaman,K. He

Main category: cs.CV

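TL;DR: This paper studies 1D image tokenizers, which compress images into highly compact one-dimensional sequences, and shows through heuristic token manipulation that they enable editing and generation.

Details Motivation: To study the high compression of 1D image tokenizers and its potential for image editing and generation.

Method: Builds an image generation pipeline from gradient-based test-time optimization of tokens with plug-and-play loss functions (such as reconstruction or CLIP similarity).

Result: Simple manipulations of 1D tokenizer tokens (such as copying and replacing) enable fine-grained image editing, and the pipeline generates diverse and realistic samples.

Conclusion: The latent space of 1D tokenizers is highly expressive, enabling image editing and generation without training any generative model.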

Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged
tokens. In contrast, so-called 1D image tokenizers represent images as highly
compressed one-dimensional sequences of as few as 32 discrete tokens. We find
that the high degree of compression achieved by a 1D tokenizer with vector
quantization enables image editing and generative capabilities through
heuristic manipulation of tokens, demonstrating that even very crude
manipulations – such as copying and replacing tokens between latent
representations of images – enable fine-grained image editing by transferring
appearance and semantic attributes. Motivated by the expressivity of the 1D
tokenizer’s latent space, we construct an image generation pipeline leveraging
gradient-based test-time optimization of tokens with plug-and-play loss
functions such as reconstruction or CLIP similarity. Our approach is
demonstrated for inpainting and text-guided image editing use cases, and can
generate diverse and realistic samples without requiring training of any
generative model.

[50] Seeing Voices: Generating A-Roll Video from Audio with Mirage

Aditi Sundararaman,Amogh Adishesha,Andrew Jaegle,Dan Bigioi,Hyoung-Kyu Song,Jon Kyl,Justin Mao,Kevin Lan,Mojtaba Komeili,ShahRukh Athar,Sheila Babayan,Stanislau Beliasau,William Buchwalter

Main category: cs.CV

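TL;DR: Mirage is an audio-to-video foundation model that generates realistic, expressive video from an audio input.

Details Motivation: The power of video lies in the harmonious integration of audio and visuals, yet existing approaches either ignore sound or are confined to restricted domains (such as re-dubbing).

Method: Introduces Mirage, with a unified method for training self-attention-based audio-to-video generation models, either from scratch or from existing weights.

Result: Mirage's videos are of higher subjective quality than those of other methods, and it can be combined with speech synthesis to produce multimodal video.

Conclusion: Mirage offers a general approach to audio-to-video generation that outperforms existing techniques.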

Abstract: From professional filmmaking to user-generated content, creators and
consumers have long recognized that the power of video depends on the
harmonious integration of what we hear (the video’s audio track) with what we
see (the video’s image sequence). Current approaches to video generation either
ignore sound to focus on general-purpose but silent image sequence generation
or address both visual and audio elements but focus on restricted application
domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation
model that excels at generating realistic, expressive output imagery from
scratch given an audio input. When integrated with existing methods for speech
synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal
video. When trained on audio-video footage of people talking (A-roll) and
conditioned on audio containing speech, Mirage generates video of people
delivering a believable interpretation of the performance implicit in input
audio. Our central technical contribution is a unified method for training
self-attention-based audio-to-video generation models, either from scratch or
given existing weights. This methodology allows Mirage to retain generality as
an approach to audio-to-video generation while producing outputs of superior
subjective quality to methods that incorporate audio-specific architectures or
loss components specific to people, speech, or details of how images or audio
are captured. We encourage readers to watch and listen to the results of Mirage
for themselves (see paper and comments for links).

[51] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

Nhat Thanh Tran,Fanghui Xue,Shuai Zhang,Jiancheng Lyu,Yunling Zheng,Yingyong Qi,Jack Xin

Main category: cs.CV

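TL;DR: The paper gives a mathematical definition of generalized attention and designs SEMA, which addresses the quadratic computational complexity of full attention and the dispersion problem, outperforming linear attention and Mamba models on image classification.

Details Motivation: To address the high computational complexity of conventional attention and the inability of linear attention to focus, the paper introduces the concept of generalized attention.

Method: Designs SEMA using token localization and arithmetic averaging, avoiding dispersion while maintaining focus.

Result: On ImageNet-1k, SEMA outperforms linear attention and recent vision Mamba models at increasingly larger image scales.

Conclusion: SEMA is an efficient and scalable alternative to attention for vision tasks.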

Abstract: Attention is the critical component of a transformer. Yet the quadratic
computational complexity of vanilla full attention in the input size and the
inability of its linear attention variant to focus have been challenges for
computer vision tasks. We provide a mathematical definition of generalized
attention and formulate both vanilla softmax attention and linear attention
within the general framework. We prove that generalized attention disperses,
that is, as the number of keys tends to infinity, the query assigns equal
weights to all keys. Motivated by the dispersion property and the recent
development of the Mamba form of attention, we design Scalable and Efficient Mamba
like Attention (SEMA) which utilizes token localization to avoid dispersion and
maintain focusing, complemented by theoretically consistent arithmetic
averaging to capture the global aspect of attention. We support our approach on
ImageNet-1k where classification results show that SEMA is a scalable and
effective alternative beyond linear attention, outperforming recent vision
Mamba models on increasingly larger scales of images at similar model parameter
sizes.
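
The dispersion property can be checked numerically for ordinary softmax attention. The sketch below assumes the query-key scores stay within a fixed range [-B, B] (an assumption made here for illustration; the paper's theorem and its conditions are not reproduced in the abstract). Under that assumption every softmax weight is at most e^(2B)/n for n keys, so the largest weight shrinks toward the uniform value as n grows. The function names and the sinusoidal scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(num_keys, bound=1.0):
    """Largest softmax weight when scores oscillate within [-bound, bound]."""
    scores = [bound * math.sin(i) for i in range(num_keys)]
    return max(softmax(scores))

# Largest weight a single key receives as the number of keys grows.
weights_by_n = {n: max_attention_weight(n) for n in (10, 100, 1000)}
```

With bounded scores the largest weight falls roughly in proportion to 1/n, so no key can stay in focus; localizing attention to a limited set of tokens, as SEMA does, keeps n small enough to avoid this.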

[52] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal

Kangning Yang,Ling Ouyang,Huiming Sun,Jie Cai,Lan Fu,Jiaming Ding,Chiu Man Ho,Zibo Meng

Main category: cs.CV

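TL;DR: The paper proposes a new way to collect reflection datasets and builds the high-quality, diverse OpenRR-1k dataset, improving the robustness of reflection removal.

Details Motivation: Existing reflection removal techniques are held back by the lack of high-quality in-the-wild datasets.

Method: Proposes a convenient, cost-effective, and scalable data collection paradigm and uses it to build the OpenRR-1k dataset.

Result: The dataset improves the robustness of reflection removal methods in challenging real-world environments.

Conclusion: OpenRR-1k provides effective support for reflection removal research.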

Abstract: Reflection removal technology plays a crucial role in photography and
computer vision applications. However, existing techniques are hindered by the
lack of high-quality in-the-wild datasets. In this paper, we propose a novel
paradigm for collecting reflection datasets from a fresh perspective. Our
approach is convenient, cost-effective, and scalable, while ensuring that the
collected data pairs are of high quality, perfectly aligned, and represent
natural and diverse scenarios. Following this paradigm, we collect a
Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which
contains 1,000 high-quality transmission-reflection image pairs collected in
the wild. Through the analysis of several reflection removal methods and
benchmark evaluation experiments on our dataset, we demonstrate its
effectiveness in improving robustness in challenging real-world environments.
Our dataset is available at https://github.com/caijie0620/OpenRR-1k.

[53] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating

Guandong Li,Mengxia Ye

Main category: cs.CV

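TL;DR: STNet is a new network architecture whose Spatial-Spectral Transformer module addresses overfitting and limited generalization in hyperspectral image classification.

Details Motivation: Hyperspectral image classification faces high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which lead to overfitting and limited generalization.

Method: Proposes STNet, built around a Spatial-Spectral Transformer module with decoupled spatial and spectral attention and two gating mechanisms, to fuse spatial and spectral information effectively.

Result: STNet performs strongly on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral classification methods without increasing network depth or width.

Conclusion: Through its module design and gating mechanisms, STNet improves feature extraction and fusion and reduces overfitting risk in small-sample and high-noise scenarios.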

Abstract: Deep neural networks face several challenges in hyperspectral image
classification, including high-dimensional data, sparse distribution of ground
objects, and spectral redundancy, which often lead to classification
overfitting and limited generalization capability. To more effectively extract
and fuse spatial context with fine spectral information in hyperspectral image
(HSI) classification, this paper proposes a novel network architecture called
STNet. The core advantage of STNet stems from the dual innovative design of its
Spatial-Spectral Transformer module: first, the fundamental explicit decoupling
of spatial and spectral attention ensures targeted capture of key information
in HSI; second, two functionally distinct gating mechanisms perform intelligent
regulation at both the fusion level of attention flows (adaptive attention
fusion gating) and the internal level of feature transformation (GFFN). This
characteristic demonstrates superior feature extraction and fusion capabilities
compared to traditional convolutional neural networks, while reducing
overfitting risks in small-sample and high-noise scenarios. STNet enhances
model representation capability without increasing network depth or width. The
proposed method demonstrates superior performance on IN, UP, and KSC datasets,
outperforming mainstream hyperspectral image classification approaches.

[54] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera

Yuto Kase,Kai Ishibe,Ryoma Yasuda,Yudai Washida,Sakiko Hashimoto

Main category: cs.CV

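TL;DR: Proposes a method that uses an event camera to locate the tennis ball's impact on the racket in real time, avoiding the heavy memory consumption of high-speed cameras and time-consuming manual digitization.

Details Motivation: In racket sports such as tennis, locating the impact position is important for characterizing players and equipment, but high-speed cameras consume excessive memory and manual processing is slow and error-prone.

Method: Uses an event camera to capture brightness changes efficiently, combining conventional computer vision techniques with an original event-based processing method (PATS) in three identification steps to achieve real-time localization.

Result: The experimental results fall within the permissible range for measuring tennis players' performance, and the computation time is short enough for real-time applications.

Conclusion: The method locates the impact position efficiently in real time, supporting analysis of player performance.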

Abstract: In racket sports, such as tennis, locating the ball’s position at impact is
important in clarifying player and equipment characteristics, thereby aiding in
personalized equipment design. High-speed cameras are used to measure the
impact location; however, their excessive memory consumption limits prolonged
scene capture, and manual digitization for position detection is time-consuming
and prone to human error. These limitations make it difficult to effectively
capture the entire playing scene, hindering the ability to analyze the player’s
performance. We propose a method for locating the tennis ball impact on the
racket in real time using an event camera. Event cameras efficiently measure
brightness changes (called 'events') with microsecond accuracy under high-speed
motion while using lower memory consumption. These cameras enable users to
continuously monitor their performance over extended periods. Our method
consists of three identification steps: time range of swing, timing at impact,
and contours of ball and racket. Conventional computer vision techniques are
utilized along with an original event-based processing method to detect the timing at
impact (PATS: the amount of polarity asymmetry in time symmetry). The results
of the experiments were within the permissible range for measuring tennis
players’ performance. Moreover, the computation time was sufficiently short for
real-time applications.

[55] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

Huixuan Zhang,Junzhe Zhang,Xiaojun Wan

Main category: cs.CV

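TL;DR: The paper proposes Step AG, an adaptive guidance strategy that applies classifier-free guidance only in the first several denoising steps, speeding up generation by 20% to 30% while preserving image quality and text alignment.

Details Motivation: Classifier-free guidance requires twice as many model forward passes in text-to-vision diffusion models, which raises cost significantly. Earlier work introduced adaptive guidance but lacked solid analysis and practical applicability. This paper aims for a generally applicable adaptive guidance strategy.

Method: Proposes Step AG, which applies classifier-free guidance only in the first several denoising steps to cut computation.

Result: Experiments show that Step AG achieves an average speedup of 20% to 30% while preserving image quality and text alignment, consistently across models and settings.

Conclusion: Step AG is a simple, universally applicable adaptive guidance strategy that improves generation efficiency across a wide range of diffusion models.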

Abstract: With the rapid development of text-to-vision generation diffusion models,
classifier-free guidance has emerged as the most prevalent method for
conditioning. However, this approach inherently requires twice as many steps
for model forwarding compared to unconditional generation, resulting in
significantly higher costs. While a previous study has introduced the concept of
adaptive guidance, it lacks solid analysis and empirical results, making the
previous method unable to be applied to general diffusion models. In this work,
we present another perspective of applying adaptive guidance and propose Step
AG, which is a simple, universally applicable adaptive guidance strategy. Our
evaluations focus on both image quality and image-text alignment. The results
indicate that restricting classifier-free guidance to the first several
denoising steps is sufficient for generating high-quality, well-conditioned
images, achieving an average speedup of 20% to 30%. Such improvement is
consistent across different settings such as inference steps, and various
models including video generation models, highlighting the superiority of our
method.
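
The cost saving from restricting guidance to early steps can be sketched by counting model forward passes. The loop below applies the standard classifier-free guidance combination for the first `guided_steps` steps and then, as an assumption made here, falls back to the conditional prediction alone. It is not the authors' implementation: the function name, the placeholder update rule, and the toy models are hypothetical, and a real sampler would use a proper noise schedule.

```python
def denoise_with_step_guidance(x, num_steps, guided_steps, scale,
                               cond_model, uncond_model):
    """Toy denoising loop with guidance only in the first guided_steps steps.

    Returns the final value and the number of model forward passes used.
    """
    forward_passes = 0
    for step in range(num_steps):
        eps_cond = cond_model(x, step)
        forward_passes += 1
        if step < guided_steps:
            # Standard classifier-free guidance needs a second forward pass.
            eps_uncond = uncond_model(x, step)
            forward_passes += 1
            eps = eps_uncond + scale * (eps_cond - eps_uncond)
        else:
            # Assumed fallback: conditional prediction only, one pass.
            eps = eps_cond
        x = x - eps / num_steps  # placeholder update, not a real sampler
    return x, forward_passes

# Hypothetical stand-ins for a diffusion model's two prediction modes.
cond_model = lambda x, step: 0.5 * x
uncond_model = lambda x, step: 0.1 * x

_, full_cost = denoise_with_step_guidance(1.0, 50, 50, 7.5, cond_model, uncond_model)
_, early_cost = denoise_with_step_guidance(1.0, 50, 10, 7.5, cond_model, uncond_model)
```

With N steps and guidance in the first k, the pass count drops from 2N to N + k (here from 100 to 60). The step counts are illustrative and are not the settings behind the speedup reported in the abstract.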

[56] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

Shivang Chopra,Lingchao Mao,Gabriela Sanchez-Rodriguez,Andrew J Feola,Jing Li,Zsolt Kira

Main category: cs.CV

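TL;DR: MedMoE is a vision-language processing framework that dynamically adapts to medical imaging modalities, using an MoE module and multi-scale feature extraction to improve diagnostic performance.

Details Motivation: Existing medical vision-language frameworks apply a uniform strategy for local feature extraction and overlook modality-specific demands, so a dynamic approach that adapts to multiple modalities is needed.

Method: MedMoE builds on a Swin Transformer backbone and uses an MoE module (which routes features according to report type) and multi-scale expert branches to extract modality-specific visual semantics.

Result: On a range of medical benchmarks, MedMoE improves alignment and retrieval performance across modalities.

Conclusion: MedMoE demonstrates the value of modality-specialized visual representations in clinical vision-language systems, without requiring modality-specific supervision at inference.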

Abstract: Different medical imaging modalities capture diagnostic information at
varying spatial resolutions, from coarse global patterns to fine-grained
localized structures. However, most existing vision-language frameworks in the
medical domain apply a uniform strategy for local feature extraction,
overlooking the modality-specific demands. In this work, we present MedMoE, a
modular and extensible vision-language processing framework that dynamically
adapts visual representation based on the diagnostic context. MedMoE
incorporates a Mixture-of-Experts (MoE) module conditioned on the report type,
which routes multi-scale image features through specialized expert branches
trained to capture modality-specific visual semantics. These experts operate
over feature pyramids derived from a Swin Transformer backbone, enabling
spatially adaptive attention to clinically relevant regions. This framework
produces localized visual representations aligned with textual descriptions,
without requiring modality-specific supervision at inference. Empirical results
on diverse medical benchmarks demonstrate that MedMoE improves alignment and
retrieval performance across imaging modalities, underscoring the value of
modality-specialized visual representations in clinical vision-language
systems.

[57] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Woohyeon Park,Woojin Kim,Jaeik Kim,Jaeyoung Do

Main category: cs.CV

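TL;DR: The paper proposes SECOND, which uses selective and contrastive decoding to address object hallucination in VLMs and improve the accuracy of visual understanding.

Details Motivation: Existing VLMs are limited by object hallucination and need more precise visual understanding.

Method: SECOND is object-centric: it progressively selects and integrates multi-scale visual information and reduces hallucination through contrast.

Result: SECOND significantly reduces hallucination and performs strongly on multiple benchmarks.

Conclusion: Multi-scale processing has great potential in VLMs, and SECOND outperforms existing techniques.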

Abstract: Despite significant advancements in Vision-Language Models (VLMs), the
performance of existing VLMs remains hindered by object hallucination, a
critical challenge to achieving accurate visual understanding. To address this
issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach
that enables VLMs to effectively leverage multi-scale visual information in
an object-centric manner, closely aligning with human visual perception. SECOND
progressively selects and integrates multi-scale visual information,
facilitating a more precise interpretation of images. By iteratively contrasting
this visual information, SECOND significantly reduces perceptual
hallucinations and performs strongly across a wide range of benchmarks. Our theoretical
analysis and experiments highlight the largely unexplored potential of
multi-scale application in VLMs, showing that prioritizing and contrasting
across scales outperforms existing methods.

[58] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation

Taiqin Chen,Zikun Zhou,Zheng Fang,Wenzhen Zou,Kanjun Liu,Ke Chen,Yongbing Zhang,Yaowei Wang

Main category: cs.CV

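TL;DR: This paper proposes RadioDUN, which estimates dense radio maps by casting the task as a sparse signal recovery problem and combines a physical propagation model with a dynamic reweighting module to improve estimation performance.

Details Motivation: Existing deep learning methods are hard to integrate with the physical characteristics of radio maps, which leads to poor dense-map estimation from sparse samples.

Method: Casts radio map estimation as a sparse signal recovery problem, uses a physical propagation model to decompose it into optimization sub-problems, proposes the RadioDUN network and a dynamic reweighting module, and designs a shadowing loss.

Result: Experiments show that RadioDUN outperforms existing methods.

Conclusion: By combining physical characteristics with deep learning optimization, RadioDUN effectively improves radio map estimation performance; the code will be made public in the future.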

Abstract: The radio map represents the spatial distribution of spectrum resources
within a region, supporting efficient resource allocation and interference
mitigation. However, it is difficult to construct a dense radio map as a
limited number of samples can be measured in practical scenarios. While
existing works have used deep learning to estimate dense radio maps from sparse
samples, they are hard to integrate with the physical characteristics of the
radio map. To address this challenge, we cast radio map estimation as a
sparse signal recovery problem. A physical propagation model is further
incorporated to decompose the problem into multiple factor optimization
sub-problems, thereby reducing recovery complexity. Inspired by the existing
compressive sensing methods, we propose the Radio Deep Unfolding Network
(RadioDUN) to unfold the optimization process, achieving adaptive parameter
adjusting and prior fitting in a learnable manner. To account for the radio
propagation characteristics, we develop a dynamic reweighting module (DRM) to
adaptively model the importance of each factor for the radio map. Inspired by
the shadowing factor in the physical propagation model, we integrate
obstacle-related factors to express the obstacle-induced signal stochastic
decay. The shadowing loss is further designed to constrain the factor
prediction and act as a supplementary supervised objective, which enhances the
performance of RadioDUN. Extensive experiments have been conducted to
demonstrate that the proposed method outperforms the state-of-the-art methods.
Our code will be made publicly available upon publication.

[59] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring

Mingjie Xu,Andrew Estornell,Hongzheng Yang,Yuzhi Zhao,Zhaowei Zhu,Qi Xuan,Jiaheng Wei

Main category: cs.CV

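TL;DR: The paper proposes SCALE, which addresses data quality and alignment problems in vision-language models (VLMs) and improves data selection through a cross-modality assessment framework.

Details Motivation: VLM performance depends on high-quality datasets, but existing datasets suffer from image-text misalignment and ambiguous text, which limit model performance.

Method: Proposes the SCALE pipeline, which integrates a cross-modality assessment framework and selects high-quality data by generating task-specific captions and evaluating metrics such as alignment and clarity.

Result: The study finds that existing unimodal assessment methods underestimate samples that are important for certain tasks, and shows that appropriately generated image captions help unify multimodal tasks.

Conclusion: By jointly assessing data quality and alignment, SCALE provides a higher-quality data selection method for VLM instruction tuning.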

Abstract: The application of visual instruction tuning and other post-training
techniques has significantly enhanced the capabilities of Large Language Models
(LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with
more comprehensive visual language datasets. However, the effectiveness of VLMs
is highly dependent on large-scale, high-quality datasets that ensure precise
recognition and accurate reasoning. Two key challenges hinder progress: (1)
noisy alignments between images and the corresponding text, which leads to
misinterpretation, and (2) ambiguous or misleading text, which obscures visual
content. To address these challenges, we propose SCALE (Single modality data
quality and Cross modality Alignment Evaluation), a novel quality-driven data
selection pipeline for VLM instruction tuning datasets. Specifically, SCALE
integrates a cross-modality assessment framework that first assigns each data
entry to its appropriate vision-language task, generates general and
task-specific captions (covering scenes, objects, style, etc.), and evaluates
the alignment, clarity, task rarity, text coherence, and image clarity of each
entry based on the generated captions. We reveal that: (1) current unimodal
quality assessment methods evaluate one modality while overlooking the rest,
which can underestimate samples essential for specific tasks and discard the
lower-quality instances that help build model robustness; and (2) appropriately
generated image captions provide an efficient way to transfer the image-text
multimodal task into a unified text modality.

[60] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance

June Suk Choi,Kyungmin Lee,Sihyun Yu,Yisol Choi,Jinwoo Shin,Kimin Lee

Main category: cs.CV

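TL;DR: The paper proposes adaptive low-pass guidance (ALG) to address the reduced motion dynamics in image-to-video (I2V) generation caused by high-frequency details, significantly improving dynamics without sacrificing image quality.

Details Motivation: Existing methods support I2V generation by fine-tuning pre-trained text-to-video (T2V) models, but the generated videos lack motion dynamics because high-frequency details in the input image influence the sampling process too early.

Method: Proposes adaptive low-pass guidance (ALG), which adaptively modulates the frequency content of the input image (low-pass filtering) in the early stage of denoising to keep the model from overfitting to the static appearance.

Result: Experiments show that ALG significantly improves the dynamics of generated videos (an average 36% improvement in dynamic degree on VBench-I2V) while preserving image quality and text alignment.

Conclusion: ALG is a simple and effective fix for insufficient motion dynamics in I2V generation and does not sacrifice image quality.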

Abstract: Recent text-to-video (T2V) models have demonstrated strong capabilities in
producing high-quality, dynamic videos. To improve the visual controllability,
recent works have considered fine-tuning pre-trained T2V models to support
image-to-video (I2V) generation. However, such adaptation frequently suppresses
motion dynamics of generated outputs, resulting in more static videos compared
to their T2V counterparts. In this work, we analyze this phenomenon and
identify that it stems from the premature exposure to high-frequency details in
the input image, which biases the sampling process toward a shortcut trajectory
that overfits to the static appearance of the reference image. To address this,
we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model
sampling procedure to generate more dynamic videos without compromising
per-frame image quality. Specifically, ALG adaptively modulates the frequency
content of the conditioning image by applying low-pass filtering at the early
stage of denoising. Extensive experiments demonstrate that ALG significantly
improves the temporal dynamics of generated videos, while preserving image
fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG
achieves an average improvement of 36% in dynamic degree without a significant
drop in video quality or image fidelity.
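
The early-stage low-pass filtering described above can be sketched in plain Python as follows. The box filter, the linear radius schedule, and the names `blur_radius`, `early_fraction`, and `max_radius` are assumptions chosen for illustration; the abstract does not specify the filter or the schedule that ALG actually uses.

```python
from typing import List

Image = List[List[float]]


def blur_radius(step: int, num_steps: int, early_fraction: float, max_radius: int) -> int:
    """Hypothetical schedule: strongest blur at step 0, no blur after the early phase."""
    cutoff = int(early_fraction * num_steps)
    if cutoff <= 0 or step >= cutoff:
        return 0
    return round(max_radius * (1.0 - step / cutoff))


def box_blur(image: Image, radius: int) -> Image:
    """Simple low-pass filter: average over a (2r+1) x (2r+1) window, cropped at borders."""
    if radius <= 0:
        return [row[:] for row in image]
    height, width = len(image), len(image[0])
    out: Image = []
    for i in range(height):
        row = []
        for j in range(width):
            total, count = 0.0, 0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        total += image[ii][jj]
                        count += 1
            row.append(total / count)
        out.append(row)
    return out


def conditioning_for_step(
    image: Image, step: int, num_steps: int, early_fraction: float = 0.2, max_radius: int = 3
) -> Image:
    """Return the conditioning image to use at a given denoising step."""
    return box_blur(image, blur_radius(step, num_steps, early_fraction, max_radius))
```

Early steps see a blurred reference image, so fine detail cannot anchor the sample to a static appearance, while later steps receive the unmodified image and can restore per-frame detail.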

[61] MARMOT: Masked Autoencoder for Modeling Transient Imaging

Siyuan Shen,Ziheng Wang,Xingyue Peng,Suan Xia,Ruiqian Li,Shiying Li,Jingyi Yu

Main category: cs.CV

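TL;DR: The paper proposes MARMOT, a masked-autoencoder-based pretrained model for non-line-of-sight (NLOS) transient imaging that learns features from large-scale data through self-supervision and performs well on downstream tasks.

Details Motivation: Pretrained models have achieved remarkable success in language and vision but are not yet widely used in transient imaging. This work brings the pretraining paradigm to NLOS transient imaging to address existing methods' lack of data-driven prior learning.

Method: The authors propose MARMOT, a Transformer-based encoder-decoder that learns features from partially masked transients in a self-supervised manner via a scanning pattern mask (SPM) and predicts full measurements. The model is pretrained on TransVerse, a synthesized dataset of 500K 3D models.

Result: Quantitative and qualitative comparisons show that MARMOT is efficient on NLOS transient imaging tasks and outperforms existing methods.

Conclusion: Through self-supervised pretraining and feature transfer, MARMOT provides an efficient solution for NLOS transient imaging and confirms the potential of pretrained models in this field.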

Abstract: Pretrained models have demonstrated impressive success in many modalities
such as language and vision. Recent works facilitate the pretraining paradigm
in imaging research. Transients are a novel modality, which are captured for an
object as photon counts versus arrival times using a precisely time-resolved
sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of
hidden objects are measured beyond the sensor’s direct line of sight. Using
NLOS transients, the majority of previous works optimize volume density or
surfaces to reconstruct the hidden objects and do not transfer priors learned
from datasets. In this work, we present a masked autoencoder for modeling
transient imaging, or MARMOT, to facilitate NLOS applications. Our MARMOT is a
self-supervised model pretrained on massive and diverse NLOS transient
datasets. Using a Transformer-based encoder-decoder, MARMOT learns features
from partially masked transients via a scanning pattern mask (SPM), where the
unmasked subset is functionally equivalent to arbitrary sampling, and predicts
full measurements. Pretrained on TransVerse, a synthesized transient dataset of
500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature
transfer or decoder finetuning. Comprehensive experiments are carried out in
comparison with state-of-the-art methods. Quantitative and qualitative results
demonstrate the efficiency of our MARMOT.

[62] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

Qilin Yin,Wei Lu,Xiangyang Luo,Xiaochun Cao

Main category: cs.CV

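TL;DR: The paper proposes a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization (TFL) in multimedia forensics, discovering and identifying forged segments by means of anomaly detection.

Details Motivation: Current multimedia forensics research focuses mainly on detecting forged audio-visual content but ignores the case where only some segments of a video are tampered with. Temporal forgery localization (TFL) is more challenging in practical applications.

Method: Proposes a framework based on supervised contrastive learning that combines a context-aware perception layer with an adaptive context updater to construct a context-aware contrastive objective, enhancing the discriminability of forged-segment features.

Result: Experiments on five public datasets show that UniCaCLF significantly outperforms existing competing algorithms.

Conclusion: UniCaCLF provides an efficient and universal solution for temporal forgery localization and can precisely localize forged segments.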

Abstract: Most research efforts in the multimedia forensics domain have focused on
detecting forged audio-visual content and have achieved solid results. However,
these works only consider deepfake detection as a classification task and
ignore the case where partial segments of the video are tampered with. Temporal
forgery localization (TFL) of small fake audio-visual clips embedded in real
videos is still challenging and more in line with realistic application
scenarios. To resolve this issue, we propose a universal context-aware
contrastive learning framework (UniCaCLF) for TFL. Our approach leverages
supervised contrastive learning to discover and identify forged instants by
means of anomaly detection, allowing for the precise localization of temporal
forged segments. To this end, we propose a novel context-aware perception layer
that utilizes a heterogeneous activation operation and an adaptive context
updater to construct a context-aware contrastive objective, which enhances the
discriminability of forged instant features by contrasting them with genuine
instant features in terms of their distances to the global context. An
efficient context-aware contrastive coding is introduced to further push the
limit of instant feature distinguishability between genuine and forged instants
in a supervised sample-by-sample manner, suppressing the cross-sample influence
to improve temporal forgery localization performance. Extensive experimental
results over five public datasets demonstrate that our proposed UniCaCLF
significantly outperforms the state-of-the-art competing algorithms.

[63] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

Zhiyi Zhu,Xiaoyu Wu,Zihao Liu,Linlin Yang

Main category: cs.CV

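TL;DR: The paper proposes MLVTG, a new framework whose two modules, MambaAligner and LLMRefiner, address the redundant attention and multi-modal alignment problems of existing Transformer-based methods for video temporal grounding.

Details Motivation: Video temporal grounding (VTG) is a fundamental task in video understanding, but existing methods suffer from redundant attention and insufficient multi-modal alignment.

Method: Uses the MambaAligner module (built on Vision Mamba blocks) to model temporal dependencies and the LLMRefiner module (which uses a specific frozen layer of a pre-trained LLM) to enhance multi-modal alignment.

Result: On QVHighlights, Charades-STA, and TVSum, MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.

Conclusion: Through its dual alignment strategy (temporal modeling and semantic purification), MLVTG achieves more precise localization and offers a new approach to VTG.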

Abstract: Video Temporal Grounding (VTG), which aims to localize video clips
corresponding to natural language queries, is a fundamental yet challenging
task in video understanding. Existing Transformer-based methods often suffer
from redundant attention and suboptimal multi-modal alignment. To address these
limitations, we propose MLVTG, a novel framework that integrates two key
modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba
blocks as a backbone instead of Transformers to model temporal dependencies and
extract robust video representations for multi-modal alignment. LLMRefiner
leverages a specific frozen layer of a pre-trained Large Language Model (LLM)
to implicitly transfer semantic priors, enhancing multi-modal alignment without
fine-tuning. This dual alignment strategy, temporal modeling via structured
state-space dynamics and semantic purification via textual priors, enables more
precise localization. Extensive experiments on QVHighlights, Charades-STA, and
TVSum demonstrate that MLVTG achieves state-of-the-art performance and
significantly outperforms existing baselines.

[64] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer

Zhongtao Tian,Wenhao Huang,Zhidong Chen,Xiao Wei Sun

Main category: cs.CV

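TL;DR: The paper proposes a framework that combines multi-scale feature learning with semantic scene understanding to improve visual localization in dynamic environments.

Details Motivation: In dynamic environments, lighting changes, adverse weather, and moving objects disrupt the appearance cues used for visual localization, and existing methods struggle to maintain consistent performance.

Method: Uses a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual information, and learns view-invariant features through semantic supervision.

Result: On the TartanAir dataset, the method outperforms existing pose regression methods in challenging scenarios with dynamic objects, illumination changes, and occlusions.

Conclusion: Combining multi-scale processing with semantic guidance offers an effective strategy for robust visual localization in real-world dynamic environments.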

Abstract: Visual localization remains challenging in dynamic environments where
fluctuating lighting, adverse weather, and moving objects disrupt appearance
cues. Despite advances in feature representation, current absolute pose
regression methods struggle to maintain consistency under varying conditions.
To address this challenge, we propose a framework that synergistically combines
multi-scale feature learning with semantic scene understanding. Our approach
employs a hierarchical Transformer with cross-scale attention to fuse geometric
details and contextual cues, preserving spatial precision while adapting to
environmental changes. We improve the performance of this architecture with
semantic supervision via neural scene representation during training, guiding
the network to learn view-invariant features that encode persistent structural
information while suppressing complex environmental interference. Experiments
on TartanAir demonstrate that our approach outperforms existing pose regression
methods in challenging scenarios with dynamic objects, illumination changes,
and occlusions. Our findings show that integrating multi-scale processing with
semantic guidance offers a promising strategy for robust visual localization in
real-world dynamic environments.

[65] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s

Xijun Wang,Xin Li,Bingchen Li,Zhibo Chen

Main category: cs.CV

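TL;DR: Proposes LiftVSR, which improves video super-resolution with Dynamic Temporal Attention and an Attention Memory Cache and significantly reduces computational cost.

Details Motivation: Existing diffusion models improve perceptual quality in video super-resolution but have high computational cost and insufficient temporal consistency.

Method: Combines Dynamic Temporal Attention (DTA) and an Attention Memory Cache (AMC), decomposing temporal modeling into fine-grained modeling within short frame segments and long-term consistency modeling across segments.

Result: LiftVSR performs strongly on multiple benchmarks at significantly lower computational cost.

Conclusion: LiftVSR balances efficiency and consistency, offering a better solution for video super-resolution.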

Abstract: Diffusion models have significantly advanced video super-resolution (VSR) by
enhancing perceptual quality, largely through elaborately designed temporal
modeling to ensure inter-frame consistency. However, existing methods usually
suffer from limited temporal coherence and prohibitively high computational
costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for
long videos. In this work, we propose LiftVSR, an efficient VSR framework that
leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$,
achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To
balance long-term consistency and efficiency, we introduce a hybrid temporal
modeling mechanism that decomposes temporal learning into two complementary
components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal
modeling within short frame segments ($\textit{i.e.}$, low complexity), and (ii)
Attention Memory Cache (AMC) for long-term temporal modeling across segments
($\textit{i.e.}$, consistency). Specifically, DTA identifies multiple token
flows across frames within multi-head query and key tokens to warp inter-frame
contexts in the value tokens. AMC adaptively aggregates historical segment
information via a cache unit, ensuring long-term coherence with minimal
overhead. To further stabilize the cache interaction during inference, we
introduce an asymmetric sampling strategy that mitigates feature mismatches
arising from different diffusion sampling steps. Extensive experiments on
several typical VSR benchmarks have demonstrated that LiftVSR achieves
impressive performance with significantly lower computational costs.

[66] TrajFlow: Multi-modal Motion Prediction via Flow Matching

Qi Yan,Brian Zhang,Yutong Zhang,Daniel Yang,Joshua White,Di Chen,Jiachao Liu,Langechuan Liu,Binnan Zhuang,Shaoshuai Shi,Renjie Liao

Main category: cs.CV

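TL;DR: TrajFlow is a flow-matching-based motion prediction framework that predicts multi-modal trajectories in a single inference pass, addresses the computational inefficiency of existing methods, and performs strongly on the Waymo dataset.

Details Motivation: Efficient and accurate motion prediction is crucial for safety and decision-making in autonomous driving, especially in dynamic multi-modal scenarios. Existing generative trajectory prediction methods are computationally inefficient and fall short of these needs.

Method: Proposes the TrajFlow framework, which uses flow matching to predict multi-modal trajectories in a single inference pass; introduces a Plackett-Luce ranking loss to improve uncertainty estimation; and designs a self-conditioning training technique to improve generalization.

Result: Achieves state-of-the-art performance on the Waymo Open Motion Dataset, significantly reducing computational overhead while maintaining coherence across predictions.

Conclusion: TrajFlow provides an efficient, high-performance solution for safety-critical autonomous driving; the code has been released.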

Abstract: Efficient and accurate motion prediction is crucial for ensuring safety and
informed decision-making in autonomous driving, particularly under dynamic
real-world conditions that necessitate multi-modal forecasts. We introduce
TrajFlow, a novel flow matching-based motion prediction framework that
addresses the scalability and efficiency challenges of existing generative
trajectory prediction methods. Unlike conventional generative approaches that
employ i.i.d. sampling and require multiple inference passes to capture diverse
outcomes, TrajFlow predicts multiple plausible future trajectories in a single
pass, significantly reducing computational overhead while maintaining coherence
across predictions. Moreover, we propose a ranking loss based on the
Plackett-Luce distribution to improve uncertainty estimation of predicted
trajectories. Additionally, we design a self-conditioning training technique
that reuses the model’s own predictions to construct noisy inputs during a
second forward pass, thereby improving generalization and accelerating
inference. Extensive experiments on the large-scale Waymo Open Motion Dataset
(WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across
various key metrics, underscoring its effectiveness for safety-critical
autonomous driving applications. The code and other details are available on
the project website https://traj-flow.github.io/.
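
To make the ranking loss mentioned in this abstract concrete, the sketch below computes the negative log-likelihood of a ranking under the Plackett-Luce distribution, given one score per predicted trajectory. How TrajFlow derives the target ranking is not stated here; ranking trajectories by closeness to the ground truth is an assumption, and the function name `plackett_luce_nll` is hypothetical.

```python
import math
from typing import List, Sequence


def plackett_luce_nll(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Negative log-likelihood of `ranking` under a Plackett-Luce model.

    scores[i] is the model's confidence logit for trajectory i.
    ranking lists trajectory indices from best to worst (assumed target order).
    """
    nll = 0.0
    for position in range(len(ranking)):
        # Items not yet placed compete for the current position.
        remaining: List[float] = [scores[j] for j in ranking[position:]]
        top = max(remaining)
        # Numerically stable log-sum-exp over the remaining items.
        log_norm = top + math.log(sum(math.exp(s - top) for s in remaining))
        nll -= scores[ranking[position]] - log_norm
    return nll


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0]
    # The loss is lower when the scores agree with the target ordering.
    print(plackett_luce_nll(scores, [0, 1, 2]))
    print(plackett_luce_nll(scores, [2, 1, 0]))
```

Minimizing this quantity pushes the confidence scores to order the predicted trajectories the same way as the target ranking, which is one way a ranking loss can sharpen uncertainty estimates.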

[67] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs

Bowei Tian,Xuntao Lyu,Meng Liu,Hongyi Wang,Ang Li

Main category: cs.CV

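TL;DR: The paper proposes the Input-Space Linearity Hypothesis (ISLH) and the Spectral Principal Path (SPP) framework to study how linear representations form in deep networks, and experimentally verifies their multimodal robustness in vision-language models, advancing AI transparency and robustness.

Details Motivation: Studies how high-level representations can improve AI transparency and control, shifting attention from individual neurons to structured semantic directions aligned with human-interpretable concepts.

Method: Proposes the Input-Space Linearity Hypothesis (ISLH) and the Spectral Principal Path (SPP) framework to analyze how deep networks progressively distill linear representations.

Result: Experiments show that these representations are multimodally robust in vision-language models.

Conclusion: The work lays a foundation for a structured theory of representation formation in deep networks and may help improve AI robustness, fairness, and transparency.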

Abstract: High-level representations have become a central focus in enhancing AI
transparency and control, shifting attention from individual neurons or
circuits to structured semantic directions that align with human-interpretable
concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose
the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned
directions originate in the input space and are selectively amplified with
increasing depth. We then introduce the Spectral Principal Path (SPP)
framework, which formalizes how deep networks progressively distill linear
representations along a small set of dominant spectral directions. Building on
this framework, we further demonstrate the multimodal robustness of these
representations in Vision-Language Models (VLMs). By bridging theoretical
insights with empirical validation, this work advances a structured theory of
representation formation in deep networks, paving the way for improving AI
robustness, fairness, and transparency.

[68] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Agnese Taluzzi,Davide Gesualdi,Riccardo Santambrogio,Chiara Plizzari,Francesca Palermo,Simone Mentasti,Matteo Matteucci

Main category: cs.CV

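TL;DR: SceneNet and KnowledgeNet are two approaches for the HD-EPIC VQA Challenge 2025 that improve visual question answering through scene graphs and external commonsense knowledge, respectively; combined, they reach 44.21% accuracy.

Details Motivation: Addresses the need to reason about object interactions, spatial relationships, and commonsense knowledge in complex visual question answering tasks.

Method: SceneNet uses a multi-modal large language model to generate scene graphs that capture object interactions; KnowledgeNet integrates commonsense knowledge from ConceptNet for high-level semantic reasoning.

Result: Each method shows distinct strengths across the seven categories of the HD-EPIC benchmark, and the combined framework reaches 44.21% accuracy.

Conclusion: The combined SceneNet and KnowledgeNet framework is effective for complex visual question answering tasks.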

Abstract: This report presents SceneNet and KnowledgeNet, our approaches developed for
the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with
a multi-modal large language model (MLLM) to capture fine-grained object
interactions, spatial relationships, and temporally grounded events. In
parallel, KnowledgeNet incorporates ConceptNet’s external commonsense knowledge
to introduce high-level semantic connections between entities, enabling
reasoning beyond directly observable visual evidence. Each method demonstrates
distinct strengths across the seven categories of the HD-EPIC benchmark, and
their combination within our framework results in an overall accuracy of 44.21%
on the challenge, highlighting its effectiveness for complex egocentric VQA
tasks.

[69] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement

Xinyue Niu,Akira Furui

Main category: cs.CV

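TL;DR: Proposes a method that eliminates calibration requirements through feature disentanglement, enabling effective cross-subject EMG pattern recognition.

Details Motivation: Addresses the model calibration problem in cross-subject EMG pattern recognition caused by differences in anatomy, electrode placement, and signal characteristics.

Method: Uses a dual-branch adversarial neural network to disentangle EMG features into pattern-specific and subject-specific components, enabling calibration-free cross-subject recognition.

Result: The model performs robustly on data from unseen users and outperforms various baseline methods.

Conclusion: Offers a new perspective on calibration-free cross-subject EMG recognition and shows the model's potential for applications such as biometric identification.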

Abstract: Cross-subject electromyography (EMG) pattern recognition faces significant
challenges due to inter-subject variability in muscle anatomy, electrode
placement, and signal characteristics. Traditional methods rely on
subject-specific calibration data to adapt models to new users, an approach
that is both time-consuming and impractical for large-scale, real-world
deployment. This paper presents an approach to eliminate calibration
requirements through feature disentanglement, enabling effective cross-subject
generalization. We propose an end-to-end dual-branch adversarial neural network
that simultaneously performs pattern recognition and individual identification
by disentangling EMG features into pattern-specific and subject-specific
components. The pattern-specific components facilitate robust pattern
recognition for new users without model calibration, while the subject-specific
components enable downstream applications such as task-invariant biometric
identification. Experimental results demonstrate that the proposed model
achieves robust performance on data from unseen users, outperforming various
baseline methods in cross-subject scenarios. Overall, this study offers a new
perspective for cross-subject EMG pattern recognition without model calibration
and highlights the proposed model’s potential for broader applications, such as
task-independent biometric systems.

[70] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection

Duc Thanh Pham,Hong Dang Nguyen,Nhat Minh Nguyen Quoc,Linh Ngo Van,Sang Dinh Viet,Duc Anh Nguyen

Main category: cs.CV

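TL;DR: Hier-DETR is a new incremental object detection framework that combines Neural Collapse with hierarchical label relations to improve efficiency and performance.

Details Motivation: Addresses the limited performance and long inference time of existing incremental object detection models.

Method: Uses Neural Collapse to handle dataset imbalance and exploits the hierarchical relations among class labels.

Result: Achieves efficient and competitive performance.

Conclusion: Hier-DETR offers a more practical solution for incremental object detection.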

Abstract: Recently, object detection models have witnessed notable performance
improvements, particularly with transformer-based models. However, new objects
frequently appear in the real world, requiring detection models to continually
learn without suffering from catastrophic forgetting. Although Incremental
Object Detection (IOD) has emerged to address this challenge, these existing
models are still not practical due to their limited performance and prolonged
inference time. In this paper, we introduce a novel framework for IOD, called
Hier-DETR: Hierarchical Neural Collapse Detection Transformer, ensuring both
efficiency and competitive performance by leveraging Neural Collapse for
imbalanced datasets and the hierarchical relations among class labels.

[71] Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations

Yibo Cui,Liang Xie,Yu Zhao,Jiawei Sun,Erwei Yin

Main category: cs.CV

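TL;DR: Proposes the FCA-NIG framework, which automatically generates navigation instructions with fine-grained cross-modal annotations, and builds the FCA-R2R dataset, significantly improving the performance of several VLN agents.

Details Motivation: Existing datasets lack fine-grained cross-modal alignment annotations and cannot meet the needs of navigation decision-making, so more finely annotated data must be generated.

Method: Generates sub-instruction-trajectory pairs and entity-landmark annotations through trajectory division, landmark detection, instruction generation, and entity selection, and finally assembles the complete dataset.

Result: The FCA-R2R dataset significantly improves the performance of VLN agents such as SF and EnvDrop and enhances state awareness and navigation generalization.

Conclusion: FCA-NIG generates high-quality training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.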

Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate
environments by integrating visual perception and natural language
instructions, yet faces significant challenges due to the scarcity of
fine-grained cross-modal alignment annotations. Existing datasets primarily
focus on global instruction-trajectory matching, neglecting
sub-instruction-level and entity-level alignments critical for accurate
navigation action decision-making. To address this limitation, we propose
FCA-NIG, a generative framework that automatically constructs navigation
instructions with dual-level fine-grained cross-modal annotations. In this
framework, an augmented trajectory is first divided into sub-trajectories,
which are then processed through GLIP-based landmark detection, crafted
instruction construction, OFA-Speaker based R2R-like instruction generation,
and CLIP-powered entity selection, generating sub-instruction-trajectory pairs
with entity-landmark annotations. Finally, these sub-pairs are aggregated to
form a complete instruction-trajectory pair. The framework generates the
FCA-R2R dataset, the first large-scale augmentation dataset featuring precise
sub-instruction-sub-trajectory and entity-landmark alignments. Extensive
experiments demonstrate that training with FCA-R2R significantly improves the
performance of multiple state-of-the-art VLN agents, including SF, EnvDrop,
RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances
agents’ state awareness and decision accuracy, while entity-landmark alignment
further boosts navigation performance and generalization. These results
highlight the effectiveness of FCA-NIG in generating high-quality, scalable
training data without manual annotation, advancing fine-grained cross-modal
learning in complex navigation tasks.

[72] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Chengchao Shen,Hourun Zhu,Gongfan Fang,Jianxin Wang,Xinchao Wang

Main category: cs.CV

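TL;DR: The paper proposes DGMR (Diversity-Guided MLP Reduction), which reduces the parameters of multilayer perceptron (MLP) modules to cut the compute and memory cost of large vision Transformers with almost no loss in performance.

Details Motivation: Large Transformer models have so many parameters that compute and memory costs are high. The paper finds that MLP modules account for most of the parameters and therefore proposes a compression method to reduce cost.

Method: Uses a Gram-Schmidt-based weight pruning strategy to eliminate redundant neurons in MLP hidden layers while preserving weight diversity, which improves performance recovery during distillation.

Result: Experiments show that the method reduces parameters and FLOPs by more than 57% on large vision Transformers in a nearly lossless manner; on EVA-CLIP-E (4.4B) it achieves a 71.5% reduction.

Conclusion: DGMR significantly reduces the parameters and compute cost of Transformer models while maintaining performance, offering a practical route to deploying large models.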

Abstract: Transformer models exhibit an excellent scaling property: performance
improves as model capacity increases. However, large-scale model
parameters lead to an unaffordable cost of computing and memory. We analyze
popular transformer architectures and find that multilayer perceptron (MLP)
modules take up the majority of model parameters. To this end, we focus on the
recoverability of the compressed models and propose a Diversity-Guided MLP
Reduction (DGMR) method to significantly reduce the parameters of large vision
transformers with only negligible performance degradation. Specifically, we
apply a Gram-Schmidt weight pruning strategy to eliminate redundant neurons
in the MLP hidden layers, while preserving weight diversity for better performance
recovery during distillation. Compared to the model trained from scratch, our
pruned model only requires 0.06% data of LAION-2B (for the training of large
vision transformers) without labels (ImageNet-1K) to recover the original
performance. Experimental results on several state-of-the-art large vision
transformers demonstrate that our method achieves a more than 57.0% parameter
and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B),
our method accomplishes a 71.5% parameter and FLOPs reduction without
performance degradation. The source code and trained weights are available at
https://github.com/visresearch/DGMR.
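
One plausible reading of Gram-Schmidt-based pruning is a greedy selection that keeps the hidden neurons whose weight vectors add the most new direction to those already kept. The sketch below implements that reading in plain Python. It is an assumption for illustration, not the DGMR implementation, and the function name `select_diverse_neurons` is hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_diverse_neurons(weights: Matrix, keep: int, tol: float = 1e-12) -> List[int]:
    """Greedy pivoted Gram-Schmidt over neuron weight vectors (one row per neuron).

    At each round, keep the neuron with the largest residual norm, then remove
    its direction from every remaining neuron. Stops early if nothing new is left.
    """
    residuals = [row[:] for row in weights]
    selected: List[int] = []
    for _ in range(min(keep, len(weights))):
        candidates = [i for i in range(len(residuals)) if i not in selected]
        best = max(candidates, key=lambda i: _dot(residuals[i], residuals[i]))
        best_norm_sq = _dot(residuals[best], residuals[best])
        if best_norm_sq <= tol:
            break  # remaining neurons lie in the span of those already kept
        selected.append(best)
        norm = math.sqrt(best_norm_sq)
        basis = [v / norm for v in residuals[best]]
        for i in candidates:
            if i == best:
                continue
            # Project out the newly selected direction (Gram-Schmidt step).
            coeff = _dot(residuals[i], basis)
            residuals[i] = [v - coeff * b for v, b in zip(residuals[i], basis)]
    return selected


if __name__ == "__main__":
    # Neurons 0 and 1 are parallel, so only one of them is worth keeping.
    hidden_weights = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
    print(select_diverse_neurons(hidden_weights, keep=2))
```

In a real MLP block, the kept indices would be used to slice both the first layer's rows and the second layer's matching columns before distillation; that step is omitted here.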

[73] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective

Robert-Jan Bruintjes,Attila Lengyel,Osman Semih Kayhan,Davide Zambrano,Nergis Tömen,Hadi Jamali-Rad,Jan van Gemert

Main category: cs.CV

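TL;DR: The paper studies effective ways to train deep learning models when data is scarce, organizing the “VIPriors” challenges to explore how prior knowledge can improve data efficiency without transfer learning.

Details Motivation: Addresses the drop in deep learning performance when data is scarce and encourages new methods that incorporate prior knowledge.

Method: In the “VIPriors” challenges, participants must train models from scratch on a small amount of data, with transfer learning prohibited, in order to find effective methods.

Result: Successful entries combine model ensembles that mix Transformers and CNNs, heavy data augmentation, and prior-knowledge-based methods.

Conclusion: Combining model ensembles, data augmentation, and prior knowledge effectively improves deep learning performance when data is scarce.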

Abstract: Deep Learning requires large amounts of data to train models that work well.
In data-deficient settings, performance can be degraded. We investigate which
Deep Learning methods benefit training models in a data-deficient setting, by
organizing the “VIPriors: Visual Inductive Priors for Data-Efficient Deep
Learning” workshop series, featuring four editions of data-impaired challenges.
These challenges address the problem of training deep learning models for
computer vision tasks with limited data. Participants are limited to training
models from scratch using a low number of training samples and are not allowed
to use any form of transfer learning. We aim to stimulate the development of
novel approaches that incorporate prior knowledge to improve the data
efficiency of deep learning models. Successful challenge entries make use of
large model ensembles that mix Transformers and CNNs, as well as heavy data
augmentation. Novel prior knowledge-based methods contribute to success in some
entries.

[74] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything

Joost van Dalen,Yuki M. Asano,Marc Russwurm

Main category: cs.CV

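TL;DR: SAMSelect is an algorithm that selects the best three-channel visualization from multispectral imagery, aimed in particular at marine scientists interpreting marine debris. By using the Segment Anything Model to reach high classification accuracy on a small annotated dataset, SAMSelect helps experts interpret imagery more intuitively.

Details Motivation: Marine debris in multispectral imagery is hard to interpret visually because of its compositional heterogeneity and the medium resolution of the imagery. Experts currently pick bands or spectral indices by hand, based on experience and heuristics, without systematic optimization. SAMSelect aims to optimize this process algorithmically.

Method: SAMSelect uses the Segment Anything Model to evaluate the classification accuracy of different band or spectral-index combinations on a small annotated dataset and selects the best three-channel visualization. It assumes that combinations with high classification accuracy also support good visual interpretation.

Result: In tests on Sentinel-2 imagery from Accra, Ghana, and Durban, South Africa, among others, SAMSelect found band combinations not previously documented in the literature (e.g., a normalized difference index of B8 and B2) that outperform traditional choices.

Conclusion: SAMSelect gives marine scientists an efficient, open-source tool that markedly improves the visual interpretation of marine debris in multispectral imagery.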

Abstract: This work proposes SAMSelect, an algorithm to obtain a salient three-channel
visualization for multispectral images. We develop SAMSelect and show its use
for marine scientists visually interpreting floating marine debris in
Sentinel-2 imagery. Such debris is notoriously difficult to visualize due to
its compositional heterogeneity in medium-resolution imagery. Despite these
difficulties, visual interpretation of imagery showing marine debris remains
a common practice among domain experts, who select bands and spectral indices on a
case-by-case basis informed by common practices and heuristics. SAMSelect
selects the band or index combination that achieves the best classification
accuracy on a small annotated dataset through the Segment Anything Model. Its
central assumption is that the three-channel visualization that achieves the most
accurate segmentation results also provides good visual information for
photo-interpretation.
We evaluate SAMSelect in three Sentinel-2 scenes containing generic marine
debris in Accra, Ghana, and Durban, South Africa, and deployed plastic targets
from the Plastic Litter Project. This reveals the potential of new, previously
unused band combinations (e.g., a normalized difference index of B8, B2), which
demonstrate improved performance compared to literature-based indices. We
describe the algorithm in this paper and provide an open-source code repository
that will be helpful for domain scientists doing visual photo interpretation,
especially in the marine field.
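
For readers unfamiliar with the index mentioned above, a normalized difference index of two bands is conventionally computed as (A - B) / (A + B). The sketch below applies that standard formula to two small arrays standing in for Sentinel-2 bands B8 and B2; the band order, the function name `normalized_difference`, and the sample reflectance values are assumptions, not taken from the SAMSelect code.

```python
import numpy as np

def normalized_difference(band_a, band_b, eps=1e-12):
    """Standard normalized difference index: (a - b) / (a + b)."""
    a = np.asarray(band_a, dtype=float)
    b = np.asarray(band_b, dtype=float)
    # eps avoids division by zero where both bands are zero.
    return (a - b) / (a + b + eps)

# Made-up reflectance values for two pixels (illustrative only).
b8 = np.array([[0.30, 0.10]])
b2 = np.array([[0.10, 0.10]])
ndi = normalized_difference(b8, b2)
```

The result lies in [-1, 1] for non-negative reflectances, so it can be rescaled directly into one channel of a three-channel visualization.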

[75] ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network

Feixiang Du,Shengkun Wu

Main category: cs.CV

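TL;DR: Proposes a lightweight Efficient CNN-Mamba Network (ECMNet) that combines the strengths of CNNs and Mamba to improve global context modeling in semantic segmentation.

Details Motivation: Address the inadequate global context modeling of current CNN and Transformer models in semantic segmentation.

Method: Designs an Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck, a Multi-Scale Attention Unit (MSAU) for multi-scale feature aggregation, and a Mamba-enhanced Feature Fusion Module (FFM).

Result: Achieves 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid, with 0.87M parameters and 8.27G FLOPs.

Conclusion: ECMNet strikes a strong balance between accuracy and efficiency.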

Abstract: In the past decade, Convolutional Neural Networks (CNNs) and Transformers
have achieved wide application in semantic segmentation tasks. Although CNNs
combined with Transformer models greatly improve performance, their global context
modeling remains inadequate. Recently, Mamba has shown great potential in vision
tasks, with clear advantages in modeling long-range dependencies. In this paper,
we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation,
dubbed ECMNet. ECMNet combines CNN with Mamba in a capsule-based
framework to address their complementary weaknesses. Specifically, we design an
Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck. To
improve the representational ability of features, we devise a Multi-Scale
Attention Unit (MSAU) to integrate multi-scale feature aggregation, spatial
aggregation and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion
Module (FFM) merges features from different levels, significantly improving
segmentation accuracy. Extensive experiments on two representative datasets demonstrate that
the proposed model excels in accuracy and efficiency balance, achieving 70.6%
mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87M
parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.

[76] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

Yang Bai,Liudi Yang,George Eskandar,Fengyi Shen,Dong Chen,Mohammad Altillawi,Ziyuan Liu,Gitta Kutyniok

Main category: cs.CV

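TL;DR: RoboSwap combines GANs and diffusion models to swap robot arms in videos using unpaired data, improving the quality of generated data for cross-platform robot learning.

Details Motivation: Address the scarcity of diverse, high-quality data for video-conditioned robot learning, in particular the challenge of swapping robot arms for cross-platform learning.

Method: Segments the robot arm from the background, translates the arm with an unpaired GAN, and then uses a diffusion model to improve video coherence and motion realism.

Result: On three benchmarks, RoboSwap outperforms existing methods in structural coherence and motion consistency.

Conclusion: RoboSwap offers a reliable, cross-platform data generation solution for robot learning.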

Abstract: Recent advancements in generative models have revolutionized video synthesis
and editing. However, the scarcity of diverse, high-quality datasets continues
to hinder video-conditioned robotic learning, limiting cross-platform
generalization. In this work, we address the challenge of swapping a robotic
arm in one video with another: a key step for cross-embodiment learning. Unlike
previous methods that depend on paired video demonstrations in the same
environmental settings, our proposed framework, RoboSwap, operates on unpaired
data from diverse environments, alleviating the data collection needs. RoboSwap
introduces a novel video editing pipeline integrating both GANs and diffusion
models, combining their isolated advantages. Specifically, we segment robotic
arms from their backgrounds and train an unpaired GAN model to translate one
robotic arm to another. The translated arm is blended with the original video
background and refined with a diffusion model to enhance coherence, motion
realism and object interaction. The GAN and diffusion stages are trained
independently. Our experiments demonstrate that RoboSwap outperforms
state-of-the-art video and image editing models on three benchmarks in terms of
both structural coherence and motion consistency, thereby offering a robust
solution for generating reliable, cross-embodiment data in robotic learning.

[77] SurfR: Surface Reconstruction with Multi-scale Attention

Siddhant Ranade,Gonçalo Dias Pais,Ross Tyler Whitaker,Jacinto C. Nascimento,Pedro Miraldo,Srikumar Ramalingam

Main category: cs.CV

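TL;DR: Proposes a fast and accurate implicit-representation algorithm for surface reconstruction from unorganized point clouds, achieving the best accuracy-speed trade-off through lazy query, a multi-scale grid representation, and cross-scale attention.

Details Motivation: Resolve the trade-off in existing learning methods between single-object representations, which use small models but require per-object training, and generalized representations, which generalize well but need large models and are slow at inference.

Method: Uses a lazy query strategy, a parallel multi-scale grid representation, and a cross-scale attention mechanism to improve the speed and accuracy of the implicit representation.

Result: The method is faster than all baselines at their optimum resolution, with performance only marginally below the state of the art, giving the best accuracy-speed trade-off.

Conclusion: With three key contributions, the method achieves efficient and accurate implicit surface reconstruction for general 3D shapes.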

Abstract: We propose a fast and accurate surface reconstruction algorithm for
unorganized point clouds using an implicit representation. Recent learning
methods are either single-object representations, which use small neural models that
capture fine surface detail but require per-object training, or generalized
representations, which require larger models and generalize to new shapes but
lack detail and are slow at inference. We propose a new implicit representation
for general 3D shapes that is faster than all the baselines at their optimum
resolution, with only a marginal loss in performance compared to the
state-of-the-art. We achieve the best accuracy-speed trade-off using three key
contributions. Many implicit methods extract features from the point cloud to
classify whether a query point is inside or outside the object. First, to speed
up the reconstruction, we show that this feature extraction does not need to
use the query point at an early stage (lazy query). Second, we use a parallel
multi-scale grid representation to develop robust features for different noise
levels and input resolutions. Finally, we show that attention across scales can
provide improved reconstruction results.

[78] Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization

Zhiyi Zhu,Xiaoyu Wu,Youwei Lu

Main category: cs.CV

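TL;DR: The paper proposes a video memorability prediction model that strengthens motion feature representation with a Text-Motion Cross-modal Contrastive Loss (TMCCL), and uses memorability prediction to reduce the subjectivity of video summarization labels (MWCVS).

Details Motivation: Existing video memorability models do not fully exploit motion cues, and the motion feature extractor yields poor representations after fine-tuning because labeled data is scarce. The application potential of memorability prediction is also underexplored.

Method: Proposes TMCCL, which builds positive and negative sample sets from the similarity of text descriptions to strengthen motion feature representation, and designs MWCVS, which uses memorability prediction to reduce subjectivity in video summarization.

Result: TMCCL achieves state-of-the-art performance on two video memorability prediction datasets; MWCVS is shown to be effective on two video summarization datasets.

Conclusion: The paper improves the accuracy and practical value of video memorability prediction and demonstrates its potential for video summarization.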

Abstract: Video memorability refers to the ability of videos to be recalled after
viewing, playing a crucial role in creating content that remains memorable.
Existing models typically focus on extracting multimodal features to predict
video memorability scores but often fail to fully utilize motion cues. The
representation of motion features is compromised during the fine-tuning phase
of the motion feature extractor due to a lack of labeled data. In this paper,
we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal
video memorability prediction model designed to enhance the representation of
motion features. We tackle the challenge of improving motion feature
representation by leveraging text description similarities across videos to
establish positive and negative motion sample sets for a given target. This
enhancement allows the model to learn similar feature representations for
semantically related motion content, resulting in more accurate memorability
predictions. Our model achieves state-of-the-art performance on two video
memorability prediction datasets. Moreover, the potential applications of video
memorability prediction have been underexplored. To address this gap, we
present Memorability Weighted Correction for Video Summarization (MWCVS), using
video memorability prediction to reduce subjectivity in video summarization
labels. Experimental results on two video summarization datasets demonstrate
the effectiveness of MWCVS, showcasing the promising applications of video
memorability prediction.
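
The text-guided construction of positive and negative motion sets can be pictured with a small multi-positive contrastive loss. The sketch below is only an assumption-laden illustration, not the TMCCL formulation: the function name `text_guided_contrastive_loss`, the cosine-similarity threshold `pos_threshold`, the `temperature`, and the InfoNCE-style form are all hypothetical choices.

```python
import numpy as np

def text_guided_contrastive_loss(motion, text, pos_threshold=0.8, temperature=0.1):
    """Pull together motion features of videos whose text descriptions are similar.

    Hypothetical sketch: video j is a positive for video i when the cosine
    similarity of their text embeddings is at least `pos_threshold`.
    """
    def l2_normalize(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    m, t = l2_normalize(motion), l2_normalize(text)
    n = m.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    positives = (t @ t.T >= pos_threshold) & off_diag
    logits = (m @ m.T) / temperature
    logits = np.where(off_diag, logits, -np.inf)  # exclude self-pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average the negative log-probability over each anchor's positives.
    losses = [-log_prob[i][positives[i]].mean() for i in range(n) if positives[i].any()]
    return float(np.mean(losses)) if losses else 0.0
```

With this form, motion features that agree with the text-derived grouping give a lower loss than features that contradict it.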

[79] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping

Peter Grönquist,Stepan Tulyakov,Dengxin Dai

Main category: cs.CV

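TL;DR: Proposes the Neural Physical Model (NPM), a lightweight, physically informed method for consistent color reproduction in multi-camera systems.

Details Motivation: Address the color consistency challenge caused by sensor and optics variations in multi-camera systems.

Method: Introduces NPM, which simulates raw images under specified illumination to estimate transformations between devices; it can be initialized with physical measurements and trained with or without paired data.

Result: On the public NUS and BeyondRGB datasets, NPM outperforms existing methods and delivers robust color consistency across sensors and optical systems.

Conclusion: NPM is an efficient and adaptable solution for consistent color reproduction in multi-camera systems.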

Abstract: Achieving consistent color reproduction across multiple cameras is essential
for seamless image fusion and Image Processing Pipeline (ISP) compatibility in
modern devices, but it is a challenging task due to variations in sensors and
optics. Existing raw-to-raw conversion methods face limitations such as poor
adaptability to changing illumination, high computational costs, or impractical
requirements such as simultaneous camera operation and overlapping
fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight,
physically-informed approach that simulates raw images under specified
illumination to estimate transformations between devices. The NPM effectively
adapts to varying illumination conditions, can be initialized with physical
measurements, and supports training with or without paired data. Experiments on
public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent
state-of-the-art methods, providing robust chromatic consistency across
different sensors and optical systems.

[80] LLaVA-c: Continual Improved Visual Instruction Tuning

Wenzhuo Liu,Fei Zhu,Haiyang Guo,Longhui Wei,Cheng-Lin Liu

Main category: cs.CV

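TL;DR: The paper proposes LLaVA-c, which adds spectral-aware consolidation and unsupervised inquiry regularization to LLaVA-1.5 to address task balancing in multitask learning and base model degradation, allowing continual learning to match multitask joint learning.

Details Motivation: Multitask learning faces task-balancing and expansion-cost challenges, while continual learning acquires new knowledge incrementally but may overlook the degradation of the base model's general capabilities. The paper aims to address these issues.

Method: Building on LLaVA-1.5, uses spectral-aware consolidation to improve task balance and introduces unsupervised inquiry regularization to prevent base model degradation.

Result: LLaVA-c improves benchmark performance in both continual pretraining and fine-tuning while preserving general capabilities, showing for the first time that continual learning can match multitask joint learning.

Conclusion: With simple but effective modifications, LLaVA-c addresses key problems in continual learning and offers new ideas for future research.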

Abstract: Multimodal models like LLaVA-1.5 achieve state-of-the-art visual
understanding through visual instruction tuning on multitask datasets, enabling
strong instruction-following and multimodal performance. However, multitask
learning faces challenges such as task balancing, requiring careful adjustment
of data proportions, and expansion costs, where new tasks risk catastrophic
forgetting and need costly retraining. Continual learning provides a promising
alternative, acquiring new knowledge incrementally while preserving existing
capabilities. However, current methods prioritize task-specific performance,
neglecting base model degradation from overfitting to specific instructions,
which undermines general capabilities. In this work, we propose a simple but
effective method with two modifications on LLaVA-1.5: spectral-aware
consolidation for improved task balance and unsupervised inquiry regularization
to prevent base model degradation. We evaluate both general and task-specific
performance across continual pretraining and fine-tuning. Experiments
demonstrate that LLaVA-c consistently enhances standard benchmark performance
and preserves general capabilities. For the first time, we show that
task-by-task continual learning can achieve results that match or surpass
multitask joint learning. The code will be publicly released.

[81] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

Juan Yeo,Soonwoo Cha,Jiwoo Song,Hyunbin Jin,Taesup Kim

Main category: cs.CV

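TL;DR: The paper proposes ATAS, a self-distillation method that improves the fine-grained vision-language alignment and semantic coherence of CLIP models without extra modules or supervised fine-tuning.

Details Motivation: Current CLIP models struggle with fine-grained, region-level understanding, which limits their effectiveness on dense prediction tasks.

Method: Proposes Any-to-Any Self-Distillation (ATAS), which refines the representations of the CLIP vision encoder using unlabeled images and an internal self-distillation process, preserving semantic coherence while sharpening detail recognition.

Result: ATAS substantially outperforms baseline CLIP models on open-vocabulary object detection and semantic segmentation.

Conclusion: The effectiveness of ATAS confirms the importance of jointly maintaining semantic coherence and fine-grained alignment for open-vocabulary dense prediction.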

Abstract: Vision-language models such as CLIP have recently propelled open-vocabulary
dense prediction tasks by enabling recognition of a broad range of visual
concepts. However, CLIP still struggles with fine-grained, region-level
understanding, hindering its effectiveness on these dense prediction tasks. We
identify two pivotal factors required to address this limitation: semantic
coherence and fine-grained vision-language alignment. Current adaptation
methods often improve fine-grained alignment at the expense of semantic
coherence, and often rely on extra modules or supervised fine-tuning. To
overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel
approach that simultaneously enhances semantic coherence and fine-grained
alignment by leveraging a model's own knowledge across all representation
levels. Unlike prior methods, ATAS uses only unlabeled images and an internal
self-distillation process to refine representations of CLIP vision encoders,
preserving local semantic consistency while sharpening local detail
recognition. On open-vocabulary object detection and semantic segmentation
benchmarks, ATAS achieves substantial performance gains, outperforming baseline
CLIP models. These results validate the effectiveness of our approach and
underscore the importance of jointly maintaining semantic coherence and
fine-grained alignment for advanced open-vocabulary dense prediction.

[82] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities

Hugo Porta,Emanuele Dalsasso,Jessica L. McCarty,Devis Tuia

Main category: cs.CV

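TL;DR: In 2023 Canada experienced one of the most severe wildfire seasons in recent history, highlighting how climate change is increasing the length and severity of fire seasons. The paper proposes a high-resolution wildfire forecasting approach that uses multi-modal data to improve prediction accuracy.

Details Motivation: Climate change is lengthening and intensifying wildfire seasons, so boreal ecosystems need better wildfire management tools. High-resolution wildfire probability maps are an important tool, but existing methods rely on coarse-resolution data.

Method: Presents the CanadaFireSat benchmark dataset and baseline methods that use multi-modal data from Sentinel-2, MODIS, and ERA5, and tests two deep learning architectures for high-resolution wildfire forecasting.

Result: Multi-modal inputs outperform single-modal inputs on all metrics, reaching an F1 score of 60.3% on the 2023 wildfire season and showing the potential of high-resolution, continental-scale wildfire forecasting.

Conclusion: Multi-modal deep learning models perform well for high-resolution wildfire forecasting and provide a strong tool for future wildfire management.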

Abstract: In 2023, Canada experienced one of the most severe wildfire seasons in recent
history, causing damage across ecosystems, destroying communities, and emitting
large quantities of CO2. This extreme wildfire season is symptomatic of a
climate-change-induced increase in the length and severity of the fire season
that affects the boreal ecosystem. Therefore, it is critical to empower
wildfire management in boreal communities with better mitigation solutions.
Wildfire probability maps represent an important tool for understanding the
likelihood of wildfire occurrence and the potential severity of future
wildfires. The massive increase in the availability of Earth observation data
has enabled the development of deep learning-based wildfire forecasting models,
aiming at providing precise wildfire probability maps at different spatial and
temporal scales. A main limitation of such methods is their reliance on
coarse-resolution environmental drivers and satellite products, leading to
wildfire occurrence prediction of reduced resolution, typically around
0.1°. This paper presents a benchmark dataset, CanadaFireSat, and
baseline methods for high-resolution (100 m) wildfire forecasting across Canada,
leveraging multi-modal data from high-resolution multi-spectral satellite
images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and
environmental factors (ERA5 reanalysis data). Our experiments consider two
major deep learning architectures. We observe that using multi-modal temporal
inputs outperforms single-modal temporal inputs across all metrics, achieving a
peak performance of 60.3% in F1 score for the 2023 wildfire season, a season
never seen during model training. This demonstrates the potential of
multi-modal deep learning models for wildfire forecasting at high-resolution
and continental scale.

[83] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

Congzhi Zhang,Jiawei Peng,Zhenglin Wang,Yilong Lai,Haowen Sun,Heng Chang,Fei Ma,Weijiang Yu

Main category: cs.CV

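TL;DR: VReST is a training-free method that uses Monte Carlo Tree Search and a self-reward mechanism to improve large vision-language models on complex visual reasoning, achieving state-of-the-art results on multimodal mathematical reasoning benchmarks.

Details Motivation: Current large vision-language models remain limited in complex visual reasoning, especially when using Chain-of-Thought prompting.

Method: Uses Monte Carlo Tree Search with a novel multimodal self-reward mechanism that assesses the quality of reasoning steps without additional models.

Result: Surpasses existing prompting methods on multimodal mathematical reasoning benchmarks and achieves state-of-the-art performance.

Conclusion: VReST offers a promising direction for future research on multimodal tasks and supports the effectiveness of test-time scaling laws.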

Abstract: Large Vision-Language Models (LVLMs) have shown exceptional performance in
multimodal tasks, but their effectiveness in complex visual reasoning is still
constrained, especially when employing Chain-of-Thought prompting techniques.
In this paper, we propose VReST, a novel training-free approach that enhances
Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms.
VReST meticulously traverses the reasoning landscape by establishing a search
tree, where each node encapsulates a reasoning step, and each path delineates a
comprehensive reasoning sequence. Our innovative multimodal Self-Reward
mechanism assesses the quality of reasoning steps by integrating the utility of
sub-questions, answer correctness, and the relevance of vision-language clues,
all without the need for additional models. VReST surpasses current prompting
methods and secures state-of-the-art performance across three multimodal
mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy
of test-time scaling laws in multimodal tasks, offering a promising direction
for future research.
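
To make the search procedure more concrete, the sketch below shows the standard UCT selection score commonly used in Monte Carlo Tree Search, together with a toy reward that averages the three signals named in the abstract. The abstract does not state VReST's selection rule or how the signals are weighted, so the equal-weight average, the exploration constant `c`, and both function names are assumptions.

```python
import math

def self_reward(sub_question_utility, answer_correctness, clue_relevance):
    """Hypothetical reward: equal-weight average of three scores in [0, 1]."""
    return (sub_question_utility + answer_correctness + clue_relevance) / 3.0

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")  # unvisited reasoning steps are explored first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# A reasoning step visited twice with two identical self-rewards.
step_reward = self_reward(0.9, 1.0, 0.8)
score = uct_score(total_reward=2 * step_reward, visits=2, parent_visits=10)
```

During selection, the child with the highest score is expanded next, which balances following promising reasoning paths against trying untested ones.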

[84] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

Mohammadreza Salehi,Shashanka Venkataramanan,Ioana Simion,Efstratios Gavves,Cees G. M. Snoek,Yuki M Asano

Main category: cs.CV

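TL;DR: Proposes a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent features and performs well in dynamic scenes.

Details Motivation: Existing methods handle video dynamics poorly because their static augmentations break down, so better ways of learning spatiotemporally consistent features are needed.

Method: Extracts motion trajectories with an off-the-shelf point tracker, optimizes clustering with a momentum-encoder-based optimal transport mechanism, and propagates cluster assignments to keep features consistent.

Result: Improves on the state of the art by 1% to 6% across six image and video datasets.

Conclusion: Used as an auxiliary signal, motion markedly improves feature learning in dynamic scenes.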

Abstract: Dense self-supervised learning has shown great promise for learning pixel-
and patch-level representations, but extending it to videos remains challenging
due to the complexity of motion dynamics. Existing approaches struggle as they
rely on static augmentations that fail under object deformations, occlusions,
and camera movement, leading to inconsistent feature learning over time. We
propose a motion-guided self-supervised learning framework that clusters dense
point tracks to learn spatiotemporally consistent representations. By
leveraging an off-the-shelf point tracker, we extract long-range motion
trajectories and optimize feature clustering through a momentum-encoder-based
optimal transport mechanism. To ensure temporal coherence, we propagate cluster
assignments along tracked points, enforcing feature consistency across views
despite viewpoint changes. Integrating motion as an implicit supervisory
signal, our method learns representations that generalize across frames,
improving robustness in dynamic scenes and challenging occlusion scenarios. By
initializing from strong image-pretrained models and leveraging video data for
training, we improve state-of-the-art by 1% to 6% on six image and video
datasets and four evaluation benchmarks. The implementation is publicly
available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
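
Optimal-transport clustering of this kind is often implemented with Sinkhorn-Knopp iterations, which turn point-to-cluster scores into balanced soft assignments. The sketch below shows that generic procedure only; whether MoSiC uses exactly this variant is not stated in the abstract, and the function name `sinkhorn_assignments`, the `epsilon` and `n_iters` values, and the toy scores are assumptions.

```python
import numpy as np

def sinkhorn_assignments(scores, epsilon=0.05, n_iters=50):
    """Balanced soft cluster assignments from an (n_points, n_clusters) score matrix.

    Generic Sinkhorn-Knopp sketch: alternately normalize columns (clusters)
    and rows (points) so every cluster receives roughly equal total mass.
    """
    s = np.asarray(scores, dtype=float)
    q = np.exp((s - s.max()) / epsilon)  # subtract the max for numerical stability
    n, k = q.shape
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each cluster gets mass 1/k
        q /= q.sum(axis=1, keepdims=True) * n  # each point gets mass 1/n
    return q * n  # rescale so each row sums to 1

# Four tracked points, two clusters (illustrative scores).
toy_scores = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
assignments = sinkhorn_assignments(toy_scores)
```

In a track-based setup, the assignment computed for a point in one frame could then be reused as the target for the same tracked point in later frames, which is the propagation step the abstract describes.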

[85] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering

Xiaohan Zhang,Sitong Wang,Yushen Yan,Yi Yang,Mingda Xu,Qi Liu

Main category: cs.CV

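TL;DR: The paper proposes TraGraph-GS, which uses a trajectory graph to address spatial partitioning and Gaussian overlap in large-scale novel view synthesis, substantially improving rendering quality.

Details Motivation: Existing methods for large-scale novel view synthesis suffer from inflexible partitioning and texture distortion; TraGraph-GS aims to solve these problems with a trajectory graph and a regularization constraint.

Method: Uses graph-based spatial partitioning combined with a regularization constraint and a progressive rendering strategy to improve Gaussian rendering quality for large scenes.

Result: Clearly outperforms existing methods on multiple datasets, with average PSNR gains of 1.86 dB (aerial) and 1.62 dB (ground).

Conclusion: With its spatial partitioning and rendering strategy, TraGraph-GS offers an efficient, high-quality solution for large-scale novel view synthesis.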

Abstract: High-quality novel view synthesis for large-scale scenes presents a
challenging dilemma in 3D computer vision. Existing methods typically partition
large scenes into multiple regions, reconstruct a 3D representation using
Gaussian splatting for each region, and eventually merge them for novel view
rendering. They can accurately render specific scenes, yet they do not
generalize effectively for two reasons: (1) rigid spatial partition techniques
struggle with arbitrary camera trajectories, and (2) merging regions
produces Gaussian overlap that distorts texture details. To address these
challenges, we propose TraGraph-GS, leveraging a trajectory graph to enable
high-precision rendering for arbitrarily large-scale scenes. We present a
spatial partitioning method for large-scale scenes based on graphs, which
incorporates a regularization constraint to enhance the rendering of textures
and distant objects, as well as a progressive rendering strategy to mitigate
artifacts caused by Gaussian overlap. Experimental results demonstrate its
superior performance both on four aerial and four ground datasets and highlight
its remarkable efficiency: our method achieves an average improvement of 1.86
dB in PSNR on aerial datasets and 1.62 dB on ground datasets compared to
state-of-the-art approaches.
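
The PSNR gains quoted above are easier to interpret with the metric's definition at hand. The sketch below computes PSNR from the mean squared error in the usual way; the function name `psnr`, the assumption that images are scaled to [0, 1], and the toy images are illustrative choices rather than details of the paper's evaluation code.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=float)
    out = np.asarray(rendered, dtype=float)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
reference = np.zeros((4, 4))
rendered = np.full((4, 4), 0.1)
value = psnr(reference, rendered)

# MSE ratio implied by a 1.86 dB gain at a fixed peak value.
mse_ratio = 10 ** (-1.86 / 10)
```

Under this definition, a 1.86 dB improvement corresponds to roughly a 35% lower mean squared error for the same peak value.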

[86] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

Mengjiao Ma,Qi Ma,Yue Li,Jiahuan Cheng,Runyi Yang,Bin Ren,Nikola Popovic,Mingqiang Wei,Nicu Sebe,Luc Van Gool,Theo Gevers,Martin R. Oswald,Danda Pani Paudel

Main category: cs.CV

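TL;DR: The paper presents the first large-scale benchmark that systematically evaluates three groups of Language Gaussian Splatting methods in 3D space, and introduces GaussianWorld-49K, a new dataset of about 49K scenes, demonstrating the advantages of generalizable methods.

Details Motivation: Current Language Gaussian Splatting methods are evaluated only on rendered 2D views of a few scenes and viewpoints, which limits insight into holistic 3D understanding. The paper aims to fill this gap and advance research on generalizable 3D scene understanding.

Method: Proposes a large-scale benchmark that evaluates three groups of Language Gaussian Splatting methods on more than 1,000 indoor and outdoor scenes, and introduces the new GaussianWorld-49K dataset.

Result: Generalizable methods show clear advantages in relaxing scene-specific limitations, fast inference, and segmentation performance, and demonstrate the potential of exploiting data priors.

Conclusion: By releasing code, benchmark, and datasets, the paper aims to accelerate research on generalizable 3DGS scene understanding.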

Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient
encoding of scene geometry, appearance, and semantics. Moreover, grounding
language in 3D scenes has proven to be an effective strategy for 3D scene
understanding. Current work on Language Gaussian Splatting falls into three
main groups: (i) per-scene optimization-based, (ii) per-scene
optimization-free, and (iii) generalizable approaches. However, most of them are
evaluated only on rendered 2D views of a handful of scenes and viewpoints close
to the training views, limiting insight into holistic 3D
understanding. To address this gap, we propose the first large-scale benchmark
that systematically assesses these three groups of methods directly in 3D
space, evaluating on 1060 scenes across three indoor datasets and one outdoor
dataset. Benchmark results demonstrate a clear advantage of the generalizable
paradigm, particularly in relaxing the scene-specific limitation, enabling fast
feed-forward inference on novel scenes, and achieving superior segmentation
performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS
dataset comprising around 49K diverse indoor and outdoor scenes obtained from
multiple sources, with which we demonstrate that the generalizable approach can
harness strong data priors. Our codes, benchmark, and datasets will be made
public to accelerate research in generalizable 3DGS scene understanding.

[87] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces

Dieuwertje Alblas,Patryk Rygiel,Julian Suk,Kaj O. Kappe,Marieke Hofman,Christoph Brune,Kak Khee Yeung,Jelmer M. Wolterink

Main category: cs.CV

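TL;DR: The paper proposes an SE(3)-symmetric transformer model that predicts local growth of abdominal aortic aneurysms (AAAs) while preserving the anatomical structure and geometric fidelity of the vascular surface, with the aim of improving personalized surveillance strategies.

Details Motivation: Current AAA surveillance is based on diameter thresholds and ignores the complex relation between 3D shape and growth, so standardized surveillance intervals may be unsuitable. Personalized growth prediction could improve monitoring strategies.

Method: Uses an SE(3)-symmetric transformer model to predict AAA growth directly from the vascular surface model and local multi-physical features, trained on 113 CTA scans of 24 patients.

Result: When predicting AAA growth to the next scan, the model has a median diameter error of 1.18 mm and identifies whether a patient will become eligible for elective repair within two years with an accuracy of 0.93. Generalization is also evaluated on an external validation set.

Conclusion: Local directional AAA growth prediction from the vascular surface is feasible and may support the development of personalized surveillance strategies.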

Abstract: Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the
abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current
clinical guidelines recommend elective surgical repair when the maximum AAA
diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet
these criteria are periodically monitored, with surveillance intervals based on
the maximum AAA diameter. However, this diameter does not take into account the
complex relation between the 3D AAA shape and its growth, making standardized
intervals potentially unfit. Personalized AAA growth predictions could improve
monitoring strategies. We propose to use an SE(3)-symmetric transformer model
to predict AAA growth directly on the vascular model surface enriched with
local, multi-physical features. In contrast to other works which have
parameterized the AAA shape, this representation preserves the vascular
surface’s anatomical structure and geometric fidelity. We train our model using
a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24
AAA patients at irregularly sampled intervals. After training, our model
predicts AAA growth to the next scan moment with a median diameter error of
1.18 mm. We further demonstrate our model’s utility to identify whether a
patient will become eligible for elective repair within two years (acc = 0.93).
Finally, we evaluate our model’s generalization on an external validation set
consisting of 25 CTAs from 7 AAA patients from a different hospital. Our
results show that local directional AAA growth prediction from the vascular
surface is feasible and may contribute to personalized surveillance strategies.

[88] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba

Yuhang Wang,Jun Li,Zhijian Wu,Jianhua Xu

Main category: cs.CV

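TL;DR: Proposes InceptionMamba, a new backbone architecture that uses orthogonal band convolutions and a bottleneck Mamba module to address InceptionNeXt's weaknesses in spatial dependency and global context modeling, achieving strong performance on classification and downstream tasks.

Details Motivation: InceptionNeXt performs well on image classification and downstream tasks, but its one-dimensional strip convolutions limit its ability to capture spatial dependencies along different dimensions and to model local neighborhoods. The locality of convolution also limits global context modeling.

Method: Replaces the traditional one-dimensional strip convolutions with orthogonal band convolutions for more cohesive spatial modeling, and introduces a bottleneck Mamba module to enhance cross-channel information fusion and enlarge the receptive field.

Result: InceptionMamba achieves state-of-the-art performance on classification and several downstream tasks, with superior parameter and computational efficiency.

Conclusion: By improving spatial modeling and global context handling, InceptionMamba markedly improves performance and offers an efficient solution for image classification and downstream tasks.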

Abstract: Within the family of convolutional neural networks, InceptionNeXt has shown
excellent competitiveness in image classification and a number of downstream
tasks. Built on parallel one-dimensional strip convolutions, however, it
has a limited ability to capture spatial dependencies along different
dimensions and fails to fully explore spatial modeling in local neighborhoods.
Besides, inherent locality constraints of convolution operations are
detrimental to effective global context modeling. To overcome these
limitations, we propose a novel backbone architecture termed InceptionMamba in
this study. More specifically, the traditional one-dimensional strip
convolutions are replaced by orthogonal band convolutions in our InceptionMamba
to achieve cohesive spatial modeling. Furthermore, global contextual modeling
can be achieved via a bottleneck Mamba module, facilitating enhanced
cross-channel information fusion and enlarged receptive field. Extensive
evaluations on classification and various downstream tasks demonstrate that the
proposed InceptionMamba achieves state-of-the-art performance with superior
parameter and computational efficiency. The source code will be available at
https://github.com/Wake1021/InceptionMamba.

[89] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation

Jiayi Song,Kaiyu Li,Xiangyong Cao,Deyu Meng

Main category: cs.CV

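TL;DR: The paper proposes RS-MTDF, a semi-supervised semantic segmentation framework that uses multi-teacher knowledge distillation and fusion from pre-trained Vision Foundation Models (VFMs) to substantially improve semantic segmentation of remote sensing images.

Details Motivation: Semantic annotation of remote sensing images is expensive and time-consuming, and existing semi-supervised methods generalize poorly because of the distribution mismatch between labeled and unlabeled data.

Method: Uses feature-level distillation from multiple teachers (e.g., DINOv2 and CLIP) to align student features with the robust VFM representations, and fuses the distilled knowledge into the decoder.

Result: Achieves state-of-the-art performance on ISPRS Potsdam, LoveDA, and DeepGlobe, and on LoveDA outperforms existing methods across label ratios.

Conclusion: Multi-teacher VFM guidance markedly improves generalization and semantic understanding for remote sensing segmentation; ablation studies validate the contribution of each module.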

Abstract: Semantic segmentation in remote sensing images is crucial for various
applications, yet its performance is heavily reliant on large-scale,
high-quality pixel-wise annotations, which are notoriously expensive and
time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a
promising alternative to mitigate this data dependency. However, existing SSS
methods often struggle with the inherent distribution mismatch between limited
labeled data and abundant unlabeled data, leading to suboptimal generalization.
We propose that Vision Foundation Models (VFMs), pre-trained on vast and
diverse datasets, possess robust generalization capabilities that can
effectively bridge this distribution gap and provide strong semantic priors for
SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and
Fusion), a novel framework that leverages the powerful semantic knowledge
embedded in VFMs to guide semi-supervised learning in remote sensing.
Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and
CLIP) as expert teachers, utilizing feature-level distillation to align student
features with their robust representations. To further enhance discriminative
power, the distilled knowledge is seamlessly fused into the student decoder.
Extensive experiments on three challenging remote sensing datasets (ISPRS
Potsdam, LoveDA, and DeepGlobe) demonstrate that RS-MTDF consistently achieves
state-of-the-art performance. Notably, our method outperforms existing
approaches across various label ratios on LoveDA and secures the highest IoU in
the majority of semantic categories. These results underscore the efficacy of
multi-teacher VFM guidance in significantly enhancing both generalization and
semantic understanding for remote sensing segmentation. Ablation studies
further validate the contribution of each proposed module.
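
Feature-level distillation from several frozen teachers can be sketched as a weighted sum of per-teacher alignment losses. The example below uses a cosine-distance loss and equal teacher weights; these choices, the function names, and the assumption that student and teacher features already share the same dimensionality (no projection layer) are illustrative and not taken from the RS-MTDF code.

```python
import numpy as np

def cosine_distill_loss(student, teacher):
    """Mean cosine distance between matching student and teacher feature vectors."""
    s = np.asarray(student, dtype=float)
    t = np.asarray(teacher, dtype=float)
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def multi_teacher_loss(student, teachers, weights=None):
    """Weighted sum of per-teacher losses (equal weights unless given)."""
    if weights is None:
        weights = [1.0 / len(teachers)] * len(teachers)
    return sum(w * cosine_distill_loss(student, t) for w, t in zip(weights, teachers))

# Two "teachers": one agrees with the student, one is orthogonal to it.
student_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
teacher_a = np.array([[2.0, 0.0], [0.0, 3.0]])
teacher_b = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = multi_teacher_loss(student_feats, [teacher_a, teacher_b])
```

In a semi-supervised pipeline, a term like this would typically be added to the segmentation loss so that unlabeled images still provide a training signal through the frozen teachers.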

[90] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting

Keyi Liu,Weidong Yang,Ben Fei,Ying He

Main category: cs.CV

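TL;DR: The paper proposes Gaussian2Scene, a self-supervised learning framework that uses 3D Gaussian Splatting (3DGS) for point cloud pre-training. It addresses existing methods' reliance on implicit scene representations and their high memory demands, and improves results on several downstream tasks.

Details Motivation: Existing self-supervised methods use RGB-D images as reconstruction signals, but they are limited by computational cost and high memory demands and struggle to capture 3D geometric structure. Gaussian2Scene uses the efficiency and explicit nature of 3DGS to improve geometric understanding and cross-modal learning.

Method: A progressive two-stage training strategy: in the first stage, a dual-branch masked autoencoder learns 2D and 3D scene representations; in the second stage, learning is supervised with the geometric locations of Gaussian primitives and rendered RGB images.

Result: Across several downstream 3D object detection tasks, Gaussian2Scene consistently improves over existing pre-training methods.

Conclusion: By exploiting the efficiency and explicit nature of 3DGS, Gaussian2Scene improves point cloud pre-training and offers a new approach for 3D vision tasks.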

Abstract: Self-supervised learning (SSL) for point cloud pre-training has become a
cornerstone for many 3D vision tasks, enabling effective learning from
large-scale unannotated data. At the scene level, existing SSL methods often
incorporate volume rendering into the pre-training framework, using RGB-D
images as reconstruction signals to facilitate cross-modal learning. This
strategy promotes alignment between 2D and 3D modalities and enables the model
to benefit from rich visual cues in the RGB-D inputs. However, these approaches
are limited by their reliance on implicit scene representations and high memory
demands. Furthermore, since their reconstruction objectives are applied only in
2D space, they often fail to capture underlying 3D geometric structures. To
address these challenges, we propose Gaussian2Scene, a novel scene-level SSL
framework that leverages the efficiency and explicit nature of 3D Gaussian
Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the
computational burden associated with volume rendering but also supports direct
3D scene reconstruction, thereby enhancing the geometric understanding of the
backbone network. Our approach follows a progressive two-stage training
strategy. In the first stage, a dual-branch masked autoencoder learns both 2D
and 3D scene representations. In the second stage, we initialize training with
reconstructed point clouds and further supervise learning using the geometric
locations of Gaussian primitives and rendered RGB images. This process
reinforces both geometric and cross-modal learning. We demonstrate the
effectiveness of Gaussian2Scene across several downstream 3D object detection
tasks, showing consistent improvements over existing pre-training methods.

[91] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

Ziyao Huang,Zixiang Zhou,Juan Cao,Yifeng Ma,Yi Chen,Zejing Rao,Zhiyong Xu,Hongmei Wang,Qin Lin,Yuan Zhou,Qinglin Lu,Fan Tang

Main category: cs.CV

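TL;DR: HunyuanVideo-HOMA is a weakly conditioned, multimodal-driven framework for human-object interaction (HOI) video generation that addresses reliance on curated motion data, limited generalization, and restricted accessibility.

Details Motivation: Address the reliance on curated motion data, limited generalization, and restricted accessibility in HOI video generation.

Method: Uses sparse, decoupled motion guidance and a dual-input multimodal diffusion transformer (MMDiT), combined with an HOI adapter and a facial cross-attention adapter.

Result: Achieves state-of-the-art interaction naturalness and generalization under weak supervision, and demonstrates versatility in text-conditioned generation and interactive object manipulation.

Conclusion: HunyuanVideo-HOMA improves controllability and reduces dependency on precise inputs, and shows broad application potential.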

Abstract: To address key limitations in human-object interaction (HOI) video generation
– specifically the reliance on curated motion data, limited generalization to
novel objects/scenarios, and restricted accessibility – we introduce
HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework.
HunyuanVideo-HOMA enhances controllability and reduces dependency on precise
inputs through sparse, decoupled motion guidance. It encodes appearance and
motion signals into the dual input space of a multimodal diffusion transformer
(MMDiT), fusing them within a shared context space to synthesize temporally
consistent and physically plausible interactions. To optimize training, we
integrate a parameter-space HOI adapter initialized from pretrained MMDiT
weights, preserving prior knowledge while enabling efficient adaptation, and a
facial cross-attention adapter for anatomically accurate audio-driven lip
synchronization. Extensive experiments confirm state-of-the-art performance in
interaction naturalness and generalization under weak supervision. Finally,
HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and
interactive object manipulation, supported by a user-friendly demo interface.
The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.

[92] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

Shuyi Zhang,Xiaoshuai Hao,Yingbo Tang,Lingfeng Zhang,Pengwei Wang,Zhongyuan Wang,Hongxuan Ma,Shanghang Zhang

Main category: cs.CV

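TL;DR: Video-CoT is a dataset designed to improve the capture of spatiotemporal detail in video understanding through Chain-of-Thought (CoT) methods. It contains a large set of question-answer pairs and offers a new direction for research.

Details Motivation: Current large-scale vision-language models fall short in capturing spatiotemporal details in video content, which hinders in-depth video analysis.

Method: Introduces the Video-CoT dataset, with 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 CoT-annotated samples, together with a benchmark.

Result: Experiments show that existing models perform poorly on spatiotemporal understanding tasks, which highlights how challenging the task is.

Conclusion: Video-CoT opens new avenues for multimedia understanding research and supports innovation in intelligent systems that require advanced video analysis.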

Abstract: Video content comprehension is essential for various applications, ranging
from video analysis to interactive systems. Despite advancements in large-scale
vision-language models (VLMs), these models often struggle to capture the
nuanced, spatiotemporal details essential for thorough video analysis. To
address this gap, we introduce Video-CoT, a groundbreaking dataset designed to
enhance spatiotemporal understanding using Chain-of-Thought (CoT)
methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal
question-answer pairs and 23,000 high-quality CoT-annotated samples, providing
a solid foundation for evaluating spatiotemporal understanding in video
comprehension. Additionally, we provide a comprehensive benchmark for assessing
these tasks, with each task featuring 750 images and tailored evaluation
metrics. Our extensive experiments reveal that current VLMs face significant
challenges in achieving satisfactory performance, highlighting the
difficulties of effective spatiotemporal understanding. Overall, the Video-CoT
dataset and benchmark open new avenues for research in multimedia understanding
and support future innovations in intelligent systems requiring advanced video
analysis capabilities. By making these resources publicly available, we aim to
encourage further exploration in this critical area. Project
website: https://video-cot.github.io/.

[93] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Shravan Nayak,Mehar Bhatia,Xiaofeng Zhang,Verena Rieser,Lisa Anne Hendricks,Sjoerd van Steenkiste,Yash Goyal,Karolina Stańczak,Aishwarya Agrawal

Main category: cs.CV

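TL;DR: Using the CulturalFrames benchmark, this paper systematically evaluates how well T2I models meet cultural expectations. Among the observed failures, explicit expectations are missed at an average rate of 68% and implicit expectations at 49%, and existing evaluation metrics correlate poorly with human judgments.

Details Motivation: Study how T2I models represent diverse cultural contexts, filling the gap in systematic quantification of cultural alignment.

Method: Builds the CulturalFrames benchmark, spanning 10 countries and 5 socio-cultural domains, with 3,637 generated images and over 10k human annotations, and evaluates 4 T2I models.

Result: Among failures, T2I models miss explicit expectations (68%) and implicit expectations (49%) at high rates, and existing evaluation metrics correlate poorly with human judgments.

Conclusion: Exposes shortcomings in the cultural representation of T2I models and points to directions for improving cultural alignment in modeling and evaluation.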

Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual
content generation raises concerns about their ability to accurately represent
diverse cultural contexts. In this work, we present the first study to
systematically quantify the alignment of T2I models and evaluation metrics with
respect to both explicit as well as implicit cultural expectations. To this
end, we introduce CulturalFrames, a novel benchmark designed for rigorous human
evaluation of cultural representation in visual generations. Spanning 10
countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts,
3637 corresponding images generated by 4 state-of-the-art T2I models, and over
10k detailed human annotations. We find that T2I models not only fail to meet
the more challenging implicit expectations but also the less challenging
explicit expectations. Across models and countries, cultural expectations are
missed an average of 44% of the time. Among these failures, explicit
expectations are missed at a surprisingly high average rate of 68%, while
implicit expectation failures are also significant, averaging 49%. Furthermore,
we demonstrate that existing T2I evaluation metrics correlate poorly with human
judgments of cultural alignment, irrespective of their internal reasoning.
Collectively, our findings expose critical gaps, providing actionable
directions for developing more culturally informed T2I models and evaluation
methodologies.

[94] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Jingguo Qu,Xinyang Han,Tonghuan Xiao,Jia Ai,Juan Wu,Tong Zhao,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: cs.CV

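TL;DR: This paper explores domain adaptation methods for vision-language foundation models to improve ultrasound image analysis, and reports strong results on segmentation and classification tasks.

Details Motivation: Manual annotation in ultrasound image analysis is laborious and subjective. Vision-language foundation models are applicable but limited by the gap between natural and medical imaging domains, so domain adaptation is needed.

Method: Uses a large language model as a text refiner, together with specially designed adaptation strategies and task-driven heads, to build the fine-tuning pipeline for vision-language foundation models.

Result: Experiments on six ultrasound datasets and two tasks show that the method improves model performance and outperforms existing state-of-the-art vision-language and pure foundation models.

Conclusion: The study develops domain adaptation methods that improve vision-language foundation models for ultrasound image analysis.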

Abstract: Medical ultrasonography is an essential imaging technique for examining
superficial organs and tissues, including lymph nodes, breast, and thyroid. It
employs high-frequency ultrasound waves to generate detailed images of the
internal structures of the human body. However, manually contouring regions of
interest in these images is a labor-intensive task that demands expertise and
often results in inconsistent interpretations among individuals.
Vision-language foundation models, which have excelled in various computer
vision applications, present new opportunities for enhancing ultrasound image
analysis. Yet, their performance is hindered by the significant differences
between natural and medical imaging domains. This research seeks to overcome
these challenges by developing domain adaptation methods for vision-language
foundation models. In this study, we explore the fine-tuning pipeline for
vision-language foundation models by utilizing a large language model as a text
refiner with specially designed adaptation strategies and task-driven heads. Our
approach has been extensively evaluated on six ultrasound datasets and two
tasks: segmentation and classification. The experimental results show that our
method can effectively improve the performance of vision-language foundation
models for ultrasound image analysis, and outperform the existing
state-of-the-art vision-language and pure foundation models. The source code of
this study is available at
https://github.com/jinggqu/NextGen-UIA.

[95] Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning

Junzhuo Liu,Markus Eckstein,Zhixiang Wang,Friedrich Feuerhake,Dorit Merhof

Main category: cs.CV

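TL;DR: This study proposes a contrastive learning-based deep learning method that predicts spatial transcriptomics data from whole-slide images and improves the accuracy of gene expression prediction.

Details Motivation: Because spatial transcriptomics data is costly to acquire, large-scale data is hard to obtain, which motivates an efficient prediction method.

Method: A contrastive learning-based deep learning framework that predicts gene expression from whole-slide images.

Result: Across six disease datasets, the PCC for predicting highly expressed genes, highly variable genes, and marker genes improves by 6.27%, 6.11%, and 11.26%, respectively.

Conclusion: The method preserves gene-gene correlations, applies to datasets with limited samples, and shows potential for cancer tissue localization.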

Abstract: Spatial transcriptomics is a technology that captures gene expression levels
at different spatial locations, widely used in tumor microenvironment analysis
and molecular profiling of histopathology, providing valuable insights into
resolving gene expression and clinical diagnosis of cancer. Due to the high
cost of data acquisition, large-scale spatial transcriptomics data remain
challenging to obtain. In this study, we develop a contrastive learning-based
deep learning method to predict spatially resolved gene expression from
whole-slide images. Evaluation across six different disease datasets
demonstrates that, compared to existing studies, our method improves Pearson
Correlation Coefficient (PCC) in the prediction of highly expressed genes,
highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26%
respectively. Further analysis indicates that our method preserves gene-gene
correlations and applies to datasets with limited samples. Additionally, our
method exhibits potential in cancer tissue localization based on biomarker
expression.
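
For readers unfamiliar with the metric, the Pearson Correlation Coefficient (PCC) reported above can be computed per gene between predicted and measured expression across spots. The sketch below is a plain-Python version of the standard formula; the function name and the toy inputs are illustrative, and how the paper aggregates PCC across genes and datasets is not specified in the abstract.

```python
import math


def pearson_corr(pred, target):
    """Pearson correlation between two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Toy example: predicted vs. measured expression of one gene at four spots.
predicted = [0.2, 0.5, 0.9, 1.4]
measured = [0.1, 0.6, 1.0, 1.3]
pcc = pearson_corr(predicted, measured)
```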

[96] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

Zike Wu,Qi Yan,Xuanyu Yi,Lele Wang,Renjie Liao

Main category: cs.CV

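TL;DR: StreamSplat is a framework for real-time reconstruction of dynamic 3D scenes. Through innovations in its static encoder and dynamic decoder, it addresses uncalibrated inputs, dynamic scene modeling, and long-term stability.

Details Motivation: Existing methods struggle to jointly handle three challenges: uncalibrated inputs, dynamic scene modeling, and long-term stability. StreamSplat is proposed to address them.

Method: Uses a probabilistic sampling mechanism in the static encoder and a bidirectional deformation field in the dynamic decoder to achieve online reconstruction.

Result: On static and dynamic benchmarks, StreamSplat outperforms prior methods and supports reconstruction of arbitrarily long video streams.

Conclusion: StreamSplat shows better reconstruction quality and dynamic scene modeling than prior work, and is presented as the first fully feed-forward framework for online reconstruction from uncalibrated video streams.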

Abstract: Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams
is crucial for numerous real-world applications. However, existing methods
struggle to jointly address three key challenges: 1) processing uncalibrated
inputs in real time, 2) accurately modeling dynamic scene evolution, and 3)
maintaining long-term stability and computational efficiency. To this end, we
introduce StreamSplat, the first fully feed-forward framework that transforms
uncalibrated video streams of arbitrary length into dynamic 3D Gaussian
Splatting (3DGS) representations in an online manner, capable of recovering
scene dynamics from temporally local observations. We propose two key technical
innovations: a probabilistic sampling mechanism in the static encoder for 3DGS
position prediction, and a bidirectional deformation field in the dynamic
decoder that enables robust and efficient dynamic modeling. Extensive
experiments on static and dynamic benchmarks demonstrate that StreamSplat
consistently outperforms prior works in both reconstruction quality and dynamic
scene modeling, while uniquely supporting online reconstruction of arbitrarily
long video streams. Code and models are available at
https://github.com/nickwzk/StreamSplat.

[97] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Leqi Shen,Guoqiang Gong,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding

Main category: cs.CV

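TL;DR: This paper proposes DiscoVLA, which jointly addresses discrepancies in vision, language, and alignment to improve video-text retrieval.

Details Motivation: When transferring the image-level pretrained model CLIP to video-text retrieval, existing methods mainly focus on the vision discrepancy and neglect the language and alignment discrepancies, which limits performance.

Method: DiscoVLA reduces vision and language discrepancies through Image-Video Features Fusion, generates pseudo image captions to learn image-level alignment, and strengthens video-level alignment through Image-to-Video Alignment Distillation.

Result: On MSRVTT, DiscoVLA reaches 50.5% R@1, outperforming previous methods by 1.5%.

Conclusion: By jointly addressing vision, language, and alignment discrepancies, DiscoVLA improves video-text retrieval and offers a new direction for related work.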

Abstract: The parameter-efficient adaptation of the image-text pretraining model CLIP
for video-text retrieval is a prominent area of research. While CLIP is focused
on image-level vision-language matching, video-text retrieval demands
comprehensive understanding at the video level. Three key discrepancies emerge
in the transfer from image-level to video-level: vision, language, and
alignment. However, existing methods mainly focus on vision while neglecting
language and alignment. In this paper, we propose Discrepancy Reduction in
Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all
three discrepancies. Specifically, we introduce Image-Video Features Fusion to
integrate image-level and video-level features, effectively tackling both
vision and language discrepancies. Additionally, we generate pseudo image
captions to learn fine-grained image-level alignment. To mitigate alignment
discrepancies, we propose Image-to-Video Alignment Distillation, which
leverages image-level alignment knowledge to enhance video-level alignment.
Extensive experiments demonstrate the superiority of our DiscoVLA. In
particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous
methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is
available at https://github.com/LunarShen/DsicoVLA.
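
The R@1 figure above is a Recall@K retrieval metric. As a minimal sketch, assuming a text-to-video similarity matrix in which the correct video for query i sits at column i (a common evaluation convention, not something the abstract states), Recall@K can be computed as follows; the function name and matrix values are illustrative.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose ground-truth item ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    The ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix for three text queries and three videos.
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(similarity, k=1)
r2 = recall_at_k(similarity, k=2)
```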

[98] Product of Experts for Visual Generation

Yunzhi Zhang,Carson Murtuza-Lanier,Zizhang Li,Yilun Du,Jiajun Wu

Main category: cs.CV

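TL;DR: Proposes a Product of Experts (PoE) framework that composes knowledge from heterogeneous models with a training-free approach and shows better controllability than monolithic methods in image and video synthesis.

Details Motivation: Integrate diverse knowledge from multiple sources (such as visual generative models, visual language models, and human-crafted knowledge) to improve visual generation.

Method: Uses a Product of Experts framework with Annealed Importance Sampling (AIS) to compose knowledge from heterogeneous models.

Result: Shows better controllability in image and video synthesis tasks and provides flexible user interfaces.

Conclusion: The framework shows promise for visual generation and can flexibly integrate knowledge from multiple sources.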

Abstract: Modern neural models capture rich priors and have complementary knowledge
over shared data domains, e.g., images and videos. Integrating diverse
knowledge from multiple sources – including visual generative models, visual
language models, and sources with human-crafted knowledge such as graphics
engines and physics simulators – remains under-explored. We propose a Product
of Experts (PoE) framework that performs inference-time knowledge composition
from heterogeneous models. This training-free approach samples from the product
distribution across experts via Annealed Importance Sampling (AIS). Our
framework shows practical benefits in image and video synthesis tasks, yielding
better controllability than monolithic methods and additionally providing
flexible user interfaces for specifying visual generation goals.
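
To make the idea of a product distribution concrete, the toy sketch below multiplies several discrete distributions over the same outcomes and renormalizes. This is only the definition of a product of experts on a small enumerable space; the paper samples from such a product over images and videos with AIS, which is not reproduced here, and the function name and numbers are illustrative assumptions.

```python
def product_of_experts(expert_probs):
    """Normalized elementwise product of discrete distributions.

    expert_probs: list of probability lists over the same outcomes.
    Assumes at least one outcome has non-zero probability under every expert.
    """
    n = len(expert_probs[0])
    unnormalized = [1.0] * n
    for probs in expert_probs:
        for i, p in enumerate(probs):
            unnormalized[i] *= p
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Expert 1 rules out the third outcome; expert 2 prefers the second.
combined = product_of_experts([[0.5, 0.5, 0.0], [0.2, 0.6, 0.2]])
```

An outcome survives only if every expert assigns it non-zero probability, which is why a product of experts behaves like a conjunction of constraints.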

[99] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos

Negin Ghamsarian,Raphael Sznitman,Klaus Schoeffmann,Jens Kowal

Main category: cs.CV

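TL;DR: WetCat is the first dataset of wetlab cataract surgery videos curated specifically for automated skill assessment, aiming to address the inefficiency and subjectivity of manual evaluation.

Details Motivation: Traditional wetlab training relies on manual evaluation, which is inefficient and subjective; automated tools are needed to improve efficiency and objectivity.

Method: Introduces the WetCat dataset, with high-resolution surgical videos, detailed phase annotations, and semantic segmentations of key structures, to support standardized skill assessment.

Result: WetCat lays a foundation for interpretable AI evaluation tools and for objective, scalable surgical training.

Conclusion: WetCat fills the gap in automated wetlab skill assessment and sets a new benchmark for ophthalmology training.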

Abstract: To meet the growing demand for systematic surgical training, wetlab
environments have become indispensable platforms for hands-on practice in
ophthalmology. Yet, traditional wetlab training depends heavily on manual
performance evaluations, which are labor-intensive, time-consuming, and often
subject to variability. Recent advances in computer vision offer promising
avenues for automated skill assessment, enhancing both the efficiency and
objectivity of surgical education. Despite notable progress in ophthalmic
surgical datasets, existing resources predominantly focus on real surgeries or
isolated tasks, falling short of supporting comprehensive skill evaluation in
controlled wetlab settings. To address these limitations, we introduce WetCat,
the first dataset of wetlab cataract surgery videos specifically curated for
automated skill assessment. WetCat comprises high-resolution recordings of
surgeries performed by trainees on artificial eyes, featuring comprehensive
phase annotations and semantic segmentations of key anatomical structures.
These annotations are meticulously designed to facilitate skill assessment
during the critical capsulorhexis and phacoemulsification phases, adhering to
standardized surgical skill assessment frameworks. By focusing on these
essential phases, WetCat enables the development of interpretable, AI-driven
evaluation tools aligned with established clinical metrics. This dataset lays a
strong foundation for advancing objective, scalable surgical education and sets
a new benchmark for automated workflow analysis and skill assessment in
ophthalmology training. The dataset and annotations are publicly available in
Synapse https://www.synapse.org/Synapse:syn66401174/files.

[100] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis

José Morano,Botond Fazekas,Emese Sükei,Ronald Fecso,Taha Emre,Markus Gumpinger,Georg Faustmann,Marzieh Oghbaie,Ursula Schmidt-Erfurth,Hrvoje Bogunović

Main category: cs.CV

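TL;DR: Proposes MIRAGE, a multimodal foundation model for analyzing OCT and SLO images, and establishes a new evaluation benchmark on which it outperforms existing methods.

Details Motivation: Existing ophthalmic AI models require extensive annotated data and underperform on independent data; MIRAGE aims to address these problems.

Method: Proposes the multimodal foundation model MIRAGE and designs an evaluation benchmark with OCT/SLO classification and segmentation tasks.

Result: MIRAGE outperforms general and specialized foundation models on both classification and segmentation tasks.

Conclusion: MIRAGE provides a more robust basis for AI analysis of ophthalmic images; the model and benchmark are publicly available.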

Abstract: Artificial intelligence (AI) has become a fundamental tool for assisting
clinicians in analyzing ophthalmic images, such as optical coherence tomography
(OCT). However, developing AI models often requires extensive annotation, and
existing models tend to underperform on independent, unseen data. Foundation
models (FMs), large AI models trained on vast unlabeled datasets, have shown
promise in overcoming these challenges. Nonetheless, available FMs for
ophthalmology lack extensive validation, especially for segmentation tasks, and
focus on a single imaging modality. In this context, we propose MIRAGE, a novel
multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO)
images. Additionally, we propose a new evaluation benchmark with OCT/SLO
classification and segmentation tasks. The comparison with general and
specialized FMs and segmentation methods shows the superiority of MIRAGE in
both types of tasks, highlighting its suitability as a basis for the
development of robust AI systems for retinal OCT image analysis. Both MIRAGE
and the evaluation benchmark are publicly available:
https://github.com/j-morano/MIRAGE.

[101] Inherently Faithful Attention Maps for Vision Transformers

Ananthu Aniraj,Cassio F. Dantas,Dino Ienco,Diego Marcos

Main category: cs.CV

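TL;DR: Proposes an attention-based method that uses binary attention masks to ensure that only attended image regions influence the prediction, addressing context bias.

Details Motivation: Context can strongly affect object perception and lead to biased representations, especially when objects appear in out-of-distribution backgrounds.

Method: A two-stage framework: stage 1 processes the full image to discover object parts and task-relevant regions; stage 2 uses attention masking to restrict its receptive field to these regions, enabling focused analysis while filtering out potentially spurious information.

Result: Across multiple benchmarks, the method significantly improves robustness against spurious correlations and out-of-distribution backgrounds.

Conclusion: The jointly trained two-stage framework mitigates context bias and improves model performance.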

Abstract: We introduce an attention-based method that uses learned binary attention
masks to ensure that only attended image regions influence the prediction.
Context can strongly affect object perception, sometimes leading to biased
representations, particularly when objects appear in out-of-distribution
backgrounds. At the same time, many image-level object-centric tasks require
identifying relevant regions, often requiring context. To address this
conundrum, we propose a two-stage framework: stage 1 processes the full image
to discover object parts and identify task-relevant regions, while stage 2
leverages input attention masking to restrict its receptive field to these
regions, enabling a focused analysis while filtering out potentially spurious
information. Both stages are trained jointly, allowing stage 2 to refine
stage 1. Extensive experiments across diverse benchmarks demonstrate that our
approach significantly improves robustness against spurious correlations and
out-of-distribution backgrounds.
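
The effect of a binary attention mask can be illustrated with a masked softmax: positions outside the mask receive exactly zero weight, so they cannot influence the output however large their raw scores are. The sketch below is a generic illustration under that assumption, not the authors' stage 2 implementation; the function name and values are hypothetical.

```python
import math


def masked_attention_weights(scores, keep_mask):
    """Softmax over scores, with zero weight wherever keep_mask is 0.

    Assumes at least one position is kept.
    """
    kept = [s for s, m in zip(scores, keep_mask) if m]
    max_s = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) if m else 0.0
            for s, m in zip(scores, keep_mask)]
    z = sum(exps)
    return [e / z for e in exps]


# The middle token has the highest raw score but is masked out.
weights = masked_attention_weights([1.0, 5.0, 1.0], [1, 0, 1])
```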

[102] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

David Acuna,Ximing Lu,Jaehun Jung,Hyunwoo Kim,Amlan Kar,Sanja Fidler,Yejin Choi

Main category: cs.CV

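TL;DR: The paper explores a search mechanism that elicits hidden knowledge and long reasoning traces in vision-language models without additional training.

Details Motivation: Investigate whether hidden knowledge can be elicited from already-deployed non-reasoning models without retraining or supervision.

Method: Uses a Monte Carlo Tree Search (MCTS)-inspired algorithm that injects subquestion-subanswer pairs to guide the model's reasoning.

Result: Consistent improvements across three benchmarks, including a 2% overall improvement on MMMU-PRO with a 9% gain in Liberal Arts.

Conclusion: Framing reasoning as a search process helps non-reasoning models connect fragmented knowledge and produce long reasoning traces.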

Abstract: Recent research in vision-language models (VLMs) has centered around the
possibility of equipping them with implicit long-form chain-of-thought
reasoning – akin to the success observed in language models – via
distillation and reinforcement learning. But what about the non-reasoning
models already trained and deployed across the internet? Should we simply
abandon them, or is there hope for a search mechanism that can elicit hidden
knowledge and induce long reasoning traces – without any additional training
or supervision? In this paper, we explore this possibility using a Monte Carlo
Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer
pairs into the model’s output stream. We show that framing reasoning as a
search process – where subquestions act as latent decisions within a broader
inference trajectory – helps the model “connect the dots” between fragmented
knowledge and produce extended reasoning traces in non-reasoning models. We
evaluate our method across three benchmarks and observe consistent
improvements. Notably, our approach yields a 2% overall improvement on
MMMU-PRO, including a significant 9% gain in Liberal Arts.
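
For background on the search component, standard MCTS usually picks which branch to expand with the UCT rule, which trades off a branch's average value against how rarely it has been visited. The sketch below shows that generic rule applied to candidate subquestions; the abstract only says the algorithm is MCTS-inspired, so whether the paper uses UCT, and the names and constant used here, are assumptions.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Generic UCT score: mean value plus an exploration bonus.

    Assumes parent_visits >= 1. Unvisited children score infinity.
    """
    if visits == 0:
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """children: list of (label, total_value, visits) tuples.

    Returns the label of the child with the highest UCT score.
    """
    best = max(children,
               key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```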

cs.IR [Back]

[103] Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval

Abdellah Ghassel,Ian Robinson,Gabriel Tanase,Hal Cooper,Bryan Thompson,Zhen Han,Vassilis N. Ioannidis,Soji Adeshina,Huzefa Rangwala

Main category: cs.IR

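TL;DR: The paper proposes the Hierarchical Lexical Graph (HLG) index, which improves retrieval-augmented generation (RAG) when answers are spread across semantically distant documents, and validates it with two retrievers and a synthetic dataset.

Details Motivation: Conventional RAG performs poorly when retrieval must span semantically distant documents; the paper aims to address this.

Method: Designs the HLG index structure, builds two retrievers (StatementGraphRAG and TopicGraphRAG), and proposes a synthetic dataset generation pipeline.

Result: Experiments on five datasets show an average relative improvement of 23.1% in retrieval recall and correctness over naive chunk-based RAG.

Conclusion: The HLG index and its retrievers improve RAG for multi-document retrieval, and an open-source library is available.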

Abstract: Retrieval-Augmented Generation (RAG) grounds large language models in
external evidence, yet it still falters when answers must be pieced together
across semantically distant documents. We close this gap with the Hierarchical
Lexical Graph (HLG), a three-tier index that (i) traces every atomic
proposition to its source, (ii) clusters propositions into latent topics, and
(iii) links entities and relations to expose cross-document paths. On top of
HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG,
which performs fine-grained entity-aware beam search over propositions for
high-precision factoid questions, and TopicGraphRAG, which selects coarse
topics before expanding along entity links to supply broad yet relevant context
for exploratory queries. Additionally, existing benchmarks lack the complexity
required to rigorously evaluate multi-hop summarization systems, often focusing
on single-document queries or limited datasets. To address this, we introduce a
synthetic dataset generation pipeline that curates realistic, multi-document
question-answer pairs, enabling robust evaluation of multi-hop retrieval
systems. Extensive experiments across five datasets demonstrate that our
methods outperform naive chunk-based RAG, achieving an average relative
improvement of 23.1% in retrieval recall and correctness. An open-source Python
library is available at https://github.com/awslabs/graphrag-toolkit.

cs.LG [Back]

[104] Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

Chenxi Liu,Tianyi Xiong,Ruibo Chen,Yihan Wu,Junfeng Guo,Tianyi Zhou,Heng Huang

Main category: cs.LG

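TL;DR: The paper proposes MBPO, a preference learning framework that addresses modality imbalance in large multimodal models (LMMs) by generating hard negatives and incorporating online data.

Details Motivation: Existing preference optimization methods do not effectively restrain the internal biases of the LLM backbone, and they rely on offline data, so they cannot adapt to distributional shifts during training.

Method: MBPO generates hard negatives through adversarial perturbation and trains with GRPO on a combination of offline and online data.

Result: Experiments show that MBPO improves LMM performance on vision-language tasks and reduces hallucinations.

Conclusion: MBPO addresses modality imbalance in LMMs by balancing the use of modality inputs.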

Abstract: The task adaptation and alignment of Large Multimodal Models (LMMs) have been
significantly advanced by instruction tuning and further strengthened by recent
preference optimization. Yet, most LMMs still suffer from severe modality
imbalance during reasoning, i.e., outweighing language prior biases over visual
inputs, which bottlenecks their generalization to downstream tasks and causes
hallucinations. However, existing preference optimization approaches for LMMs
do not focus on restraining the internal biases of their Large Language Model
(LLM) backbones when curating the training data. Moreover, they heavily rely on
offline data and lack the capacity to explore diverse responses adaptive to
dynamic distributional shifts during training. Meanwhile, Group Relative Policy
Optimization (GRPO), a recent method using online-generated data and verified
rewards to improve reasoning capabilities, remains largely underexplored in LMM
alignment. In this paper, we propose a novel preference learning framework,
Modality-Balancing Preference Optimization (MBPO), to address the modality
imbalance in LMMs. MBPO constructs a more effective offline preference dataset
by generating hard negatives, i.e., rejected responses misled by LLM biases due
to limited usage of visual information, through adversarial perturbation of
input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended
tasks to generate online responses with verified rewards. GRPO is then employed
to train the model with offline-online hybrid data. Extensive experiments
demonstrate that MBPO can enhance LMM performance on challenging
vision-language tasks and effectively reduce hallucinations.
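
As background on GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, typically by standardizing rewards within the group. The sketch below shows that normalization only; the population standard deviation, the epsilon, and the function name are illustrative choices, and how MBPO mixes offline and online data is not shown.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to a closed-ended question: 1 = verified correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers receive positive advantages and incorrect ones negative, without needing a separate value model.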

[105] Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning

Hanbing Liu,Lang Cao,Yuanyi Ren,Mengyu Zhou,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang

Main category: cs.LG

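TL;DR: The paper proposes Bingo, a framework that refines length-based reward design to improve efficient reasoning in language models, balancing accuracy and efficiency.

Details Motivation: Large language models show strong reasoning abilities, but their outputs are often verbose and redundant. Existing reinforcement learning methods mostly target accuracy and pay little attention to reasoning efficiency. Bingo aims to address this.

Method: Bingo introduces two mechanisms: 1) a significance-aware length reward that gradually reduces insignificant tokens; 2) a dynamic length reward that initially encourages elaborate reasoning and later improves efficiency.

Result: Across multiple reasoning benchmarks, Bingo outperforms baseline methods in both accuracy and efficiency, achieving a favorable balance between the two.

Conclusion: Bingo's results point to the potential of explicitly training language models for efficient reasoning and suggest directions for future work.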

Abstract: Large language models have demonstrated impressive reasoning capabilities,
yet they often suffer from inefficiencies due to unnecessarily verbose or
redundant outputs. While many works have explored reinforcement learning (RL)
to enhance reasoning abilities, most primarily focus on improving accuracy,
with limited attention to reasoning efficiency. Some existing approaches
introduce direct length-based rewards to encourage brevity, but this often
leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL
framework that advances length-based reward design to boost efficient
reasoning. Bingo incorporates two key mechanisms: a significance-aware length
reward, which gradually guides the model to reduce only insignificant tokens,
and a dynamic length reward, which initially encourages elaborate reasoning for
hard questions but decays over time to improve overall efficiency. Experiments
across multiple reasoning benchmarks show that Bingo improves both accuracy and
efficiency. It outperforms the vanilla reward and several other length-based
reward baselines in RL, achieving a favorable trade-off between accuracy and
efficiency. These results underscore the potential of training LLMs explicitly
for efficient reasoning.
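
To illustrate what a length reward that decays over training could look like, the sketch below adds a length bonus whose weight shrinks exponentially with the training step. This is a hypothetical schedule written for illustration only: the abstract does not give Bingo's formula, and the exponential decay, the parameters `w0` and `decay`, and the function names are all assumptions. The sketch also ignores question difficulty and the significance-aware component.

```python
import math


def dynamic_length_bonus(length, max_len, step, w0=1.0, decay=0.01):
    """Hypothetical length bonus whose weight decays with the training step."""
    weight = w0 * math.exp(-decay * step)
    return weight * min(length / max_len, 1.0)


def total_reward(correct, length, max_len, step):
    """Accuracy reward plus the (decaying) length bonus."""
    accuracy = 1.0 if correct else 0.0
    return accuracy + dynamic_length_bonus(length, max_len, step)


early = dynamic_length_bonus(length=50, max_len=100, step=0)
late = dynamic_length_bonus(length=50, max_len=100, step=100)
```

Early in training a long answer earns a larger bonus than the same answer later on, which matches the qualitative behavior the abstract describes.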