Table of Contents

cs.CL

[1] Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Tandin Wangchuk,Tad Gonsalves

Main category: cs.CL

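TL;DR: This paper compares three tokenization algorithms (BPE, WordPiece, and SentencePiece) on Dzongkha, Bhutan's low-resource national language, and finds that SentencePiece performs best, giving Dzongkha NLP research a useful reference point.

Motivation: Existing tokenizers mainly target high-resource languages such as English. Dzongkha is linguistically complex and under-researched, so tokenization algorithms need to be evaluated and improved for it.

Contribution: Compares three tokenization algorithms on Dzongkha, shows that SentencePiece is the most suitable, and lays groundwork for building Dzongkha large language models.

Method: Evaluates Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram), measuring performance with metrics such as Subword Fertility.

Result: SentencePiece performs best for Dzongkha tokenization, ahead of BPE and WordPiece.

Insight: Low-resource languages need tailored tokenization approaches; SentencePiece's adaptability suggests a workable option for similar languages.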

Abstract: Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model’s understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan’s national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.
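
The intrinsic metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions (fertility as average tokens per word, continued words as the share of words split into two or more tokens, normalized sequence length as token count relative to a baseline tokenizer); the paper's exact definitions may differ. The fixed-width `chunk_tokenizer` is a hypothetical stand-in for a trained BPE, WordPiece, or Unigram model, and the input is assumed to be already segmented into words.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

def subword_fertility(words: List[str], tokenize: Tokenizer) -> float:
    """Average number of subword tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def proportion_continued(words: List[str], tokenize: Tokenizer) -> float:
    """Fraction of words that are split into two or more tokens."""
    return sum(1 for w in words if len(tokenize(w)) > 1) / len(words)

def normalized_sequence_length(words: List[str], tokenize: Tokenizer,
                               baseline: Tokenizer) -> float:
    """Token count relative to a baseline tokenizer on the same text."""
    ours = sum(len(tokenize(w)) for w in words)
    base = sum(len(baseline(w)) for w in words)
    return ours / base

def chunk_tokenizer(width: int) -> Tokenizer:
    """Hypothetical toy tokenizer: cuts each word into fixed-width chunks."""
    return lambda w: [w[i:i + width] for i in range(0, len(w), width)]

words = ["token", "izer", "evaluation"]
tok3 = chunk_tokenizer(3)  # finer splits, more tokens per word
tok5 = chunk_tokenizer(5)  # coarser splits, used here as the baseline

print(subword_fertility(words, tok3), subword_fertility(words, tok5))
print(proportion_continued(words, tok3), proportion_continued(words, tok5))
print(normalized_sequence_length(words, tok3, tok5))
```

Lower fertility and a lower share of continued words generally indicate that a tokenizer keeps more words intact.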

[2] PolBiX: Detecting LLMs’ Political Bias in Fact-Checking through X-phemisms

Charlott Jakob,David Harbecke,Patrick Parschan,Pia Wenzel Neves,Vera Schmitt

Main category: cs.CL

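TL;DR: This paper studies political bias in LLMs on a German fact-checking task by swapping words for euphemisms or dysphemisms (X-phemisms). Judgmental wording affects the models' truthfulness assessments more than political leaning does, and asking for objectivity in the prompt does not mitigate the bias.

Motivation: LLMs are increasingly used in applications that require objective assessment, which political bias could compromise. Earlier studies found that LLMs prefer left-leaning positions, but the effect on downstream tasks such as fact-checking remains underexplored.

Contribution: Systematically investigates political bias induced by X-phemisms in German fact-checking and finds that judgmental words influence truthfulness assessment more than political leaning does.

Method: Constructs minimal pairs of factually equivalent claims that differ in political connotation (by substituting euphemisms or dysphemisms) and evaluates how consistently six LLMs classify them as true or false.

Result: The presence of judgmental words significantly influences the LLMs' truthfulness assessments, while political leaning matters less; explicitly asking for objectivity in the prompt does not mitigate the bias.

Insight: When using LLMs for fact-checking, attention should go to judgmental wording in the claims, not only to political leaning; prompt design alone may not remove bias.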

Abstract: Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts.
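
The minimal-pair consistency check described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the verdicts are made-up placeholders, and `consistency_rate` is a hypothetical helper name.

```python
from typing import List, Tuple

def consistency_rate(verdicts: List[Tuple[bool, bool]]) -> float:
    """Share of minimal pairs where both variants received the same verdict.

    Each tuple holds the model's true/false verdict for the two factually
    equivalent variants of a claim (for example, euphemism vs. dysphemism).
    """
    agree = sum(1 for first, second in verdicts if first == second)
    return agree / len(verdicts)

# Placeholder verdicts for four hypothetical minimal pairs.
toy_verdicts = [(True, True), (True, False), (False, False), (False, True)]
print(consistency_rate(toy_verdicts))
```

Since the two variants of a pair are factually equivalent, any disagreement between their verdicts points to sensitivity to wording and not to facts.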

[3] Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing

Zichen Wu,Hsiu-Yuan Huang,Yunfang Wu

Main category: cs.CL

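TL;DR: Proposes a causal-mediation-based debiasing framework that uses counterfactual examples to separate core semantics from spurious context and a Mixture-of-Experts (MoE) architecture to select experts dynamically, significantly improving the robustness and generalization of multimodal LLMs on complex reasoning tasks.

Motivation: Multimodal Large Language Models (MLLMs) often rely on spurious correlations when processing visual and textual information, which hurts their performance on complex reasoning tasks. The authors aim to address this and improve robustness.

Contribution: A causal-mediation-based debiasing framework that combines counterfactual inference with a dynamically routed MoE architecture, reducing the model's reliance on spurious signals.

Method: (1) Distinguish core semantics from spurious context via counterfactual examples; (2) use an MoE architecture to select debiasing experts dynamically, allowing flexible multimodal debiasing.

Result: Experiments on sentiment analysis and multimodal sarcasm detection show that the framework significantly outperforms unimodal debiasing strategies and existing state-of-the-art models.

Insight: Combining causal inference with dynamic routing addresses spurious signals in multimodal tasks more effectively and suggests a direction for future work.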

Abstract: Multimodal Large Language Models (MLLMs) have shown substantial capabilities in integrating visual and textual information, yet frequently rely on spurious correlations, undermining their robustness and generalization in complex multimodal reasoning tasks. This paper addresses the critical challenge of superficial correlation bias in MLLMs through a novel causal mediation-based debiasing framework. Specifically, we distinguish core semantics from spurious textual and visual contexts via counterfactual examples to activate training-stage debiasing, and employ a Mixture-of-Experts (MoE) architecture with dynamic routing to selectively engage modality-specific debiasing experts. Empirical evaluation on multimodal sarcasm detection and sentiment analysis tasks demonstrates that our framework significantly surpasses unimodal debiasing strategies and existing state-of-the-art models.

[4] Speech Language Models for Under-Represented Languages: Insights from Wolof

Yaya Sy,Dioula Doucouré,Christophe Cerisara,Irina Illina

Main category: cs.CL

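TL;DR: This paper describes the challenges and results of training a speech language model for Wolof, an underrepresented West African language. It stresses the importance of high-quality speech data and reports strong results on speech recognition and translation.

Motivation: Building speech language models for underrepresented languages such as Wolof faces data scarcity and technical challenges; this work aims to fill that gap.

Contribution: 1. Trains the first Speech LLM for Wolof; 2. shows the advantage of continued pretraining of HuBERT for an underrepresented language; 3. explores multi-step Chain-of-Thought for speech tasks.

Method: 1. Collect large-scale, high-quality Wolof speech data; 2. continue pretraining HuBERT on this data; 3. integrate the speech encoder into a Wolof LLM, supporting speech translation and Chain-of-Thought tasks.

Result: The continued-pretrained HuBERT outperforms both the base model and African-centric models on ASR, and the Speech LLM improves speech recognition while also performing well in speech translation.

Insight: High-quality speech data and continued pretraining of existing models are key to improving speech tasks for underrepresented languages.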

Abstract: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality speech data, and show that continued pretraining HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.

[5] Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Zhu Li,Xiyuan Gao,Yuqing Zhang,Shekhar Nayak,Matt Coler

Main category: cs.CL

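TL;DR: Systematically evaluates large language models (LLMs) and multimodal LLMs (MLLMs) on spoken sarcasm detection in English and Chinese under zero-shot, few-shot, and LoRA fine-tuning settings, and finds that text-audio and audio-vision combinations outperform unimodal and trimodal models.

Motivation: Sarcasm detection is challenging for natural language understanding, especially in spoken settings, where sarcastic intent often depends on subtle cross-modal cues across text, speech, and vision. Prior work has focused mainly on textual or visual-textual sarcasm, leaving comprehensive audio-visual-textual understanding underexplored.

Contribution: 1. Systematically evaluates LLMs and MLLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) datasets. 2. Explores the effectiveness of different modality combinations and integrates model representations through a collaborative gating fusion module. 3. Finds that MLLMs such as Qwen-Omni perform competitively in zero-shot and fine-tuned settings.

Method: 1. Evaluate LLMs and MLLMs in zero-shot, few-shot, and LoRA fine-tuning settings. 2. Use the models as feature encoders and integrate the representations of different modalities through a collaborative gating module. 3. Compare unimodal (audio, text, vision) and multimodal (text-audio, audio-vision, trimodal) combinations.

Result: 1. Audio-based models achieve the strongest unimodal performance. 2. Text-audio and audio-vision combinations outperform unimodal and trimodal models. 3. MLLMs such as Qwen-Omni are competitive in zero-shot and fine-tuned settings.

Insight: 1. Cross-modal fusion matters for spoken sarcasm understanding. 2. The role of audio in sarcasm detection is underestimated. 3. MLLMs show potential for cross-lingual, multimodal understanding.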

Abstract: Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.

[6] Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren,Casey Ford,Emily Dix

Main category: cs.CL

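TL;DR: This study evaluates the safety of four multimodal large language models (MLLMs) under adversarial conditions. It finds significant differences in vulnerability across models and modalities, with text-only prompts slightly more effective at bypassing safety mechanisms than multimodal ones.

Motivation: MLLMs are increasingly deployed in real-world applications, yet their safety under adversarial conditions remains underexplored.

Contribution: A safety evaluation of four leading MLLMs under adversarial prompts, showing that both model type and input modality significantly affect harmfulness.

Method: 26 red teamers generated 726 adversarial prompts, and 17 annotators rated the harmfulness of 2,904 model outputs.

Result: Pixtral 12B had the highest rate of harmful responses (about 62%), while Claude Sonnet 3.5 was the most resistant (about 10%). Text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal prompts.

Insight: Robust multimodal safety benchmarks are urgently needed as MLLMs are deployed more widely.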

Abstract: Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (62%), while Claude Sonnet 3.5 was the most resistant (10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.
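
The reported counts are mutually consistent: 726 prompts submitted to each of four models gives 2,904 outputs. The sketch below checks that arithmetic and shows one way a harmful-response rate could be derived from 5-point ratings. The cut-off `threshold=3` and the toy ratings are assumptions for illustration only; the abstract does not state how ratings were mapped to "harmful".

```python
from typing import List

PROMPTS = 726
MODELS = ["GPT-4o", "Claude Sonnet 3.5", "Pixtral 12B", "Qwen VL Plus"]

# Each prompt is submitted to every model once.
total_outputs = PROMPTS * len(MODELS)
print(total_outputs)  # 2904

def harmful_rate(ratings: List[int], threshold: int = 3) -> float:
    """Share of outputs rated at or above a cut-off on a 1-5 scale.

    The threshold is a hypothetical choice, not taken from the paper.
    """
    return sum(1 for r in ratings if r >= threshold) / len(ratings)

toy_ratings = [1, 1, 4, 5, 2]  # placeholder ratings, not real data
print(harmful_rate(toy_ratings))
```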

[7] Relevance to Utility: Process-Supervised Rewrite for RAG

Jaeyoung Kim,Jongho Kim,Seung-won Hwang,Seoho Song,Young-In Song

Main category: cs.CL

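TL;DR: Proposes R2U, which uses process supervision to directly optimize the probability of generating a correct answer, addressing the gap between retrieval relevance and generative utility in RAG systems.

Motivation: In RAG systems, retrieved content can be topically relevant yet fail to support generation, and existing "bridge" modules do not capture true document utility.

Contribution: Proposes R2U, which optimizes generative utility directly through process supervision, together with an efficient distillation pipeline for scaling the supervision.

Method: Uses process supervision to maximize the probability of generating a correct answer, and scales the supervision from LLMs to help the smaller rewriter model generalize better.

Result: On multiple open-domain question-answering benchmarks, R2U consistently improves over existing bridging baselines.

Insight: Combining process supervision with scaled supervision improves how well retrieved content supports generation in RAG systems.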

Abstract: Retrieval-Augmented Generation systems often suffer from a gap between optimizing retrieval relevance and generative utility: retrieved documents may be topically relevant but still lack the content needed for effective reasoning during generation. While existing “bridge” modules attempt to rewrite the retrieved text for better generation, we show how they fail to capture true document utility. In this work, we propose R2U, with a key distinction of directly optimizing to maximize the probability of generating a correct answer through process supervision. As such direct observation is expensive, we also propose approximating an efficient distillation pipeline by scaling the supervision from LLMs, which helps the smaller rewriter model generalize better. We evaluate our method across multiple open-domain question-answering benchmarks. The empirical results demonstrate consistent improvements over strong bridging baselines.

[8] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Tsz Ting Chung,Lemao Liu,Mo Yu,Dit-Yan Yeung

Main category: cs.CL

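TL;DR: Proposes DivLogicEval, a new classical-logic benchmark for evaluating the logical reasoning of Large Language Models (LLMs). Its diverse statements, composed in counterintuitive ways, are intended to give a fairer evaluation.

Motivation: Existing logical reasoning benchmarks are limited in language diversity and have skewed distributions, which can bias evaluation results, so a more reliable benchmark and metric are needed.

Contribution: The DivLogicEval benchmark and a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs, giving a more accurate assessment of logical reasoning.

Method: Builds a benchmark of natural sentences composed of diverse statements in a counterintuitive way, and introduces a new evaluation metric that reduces the effect of bias and randomness.

Result: Experiments show the extent to which logical reasoning is required to answer DivLogicEval questions and reveal performance differences among popular LLMs on logical reasoning.

Insight: Evaluating logical reasoning needs to account for language diversity and distributional bias; the new benchmark and metric offer a more reliable basis for assessing LLMs.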

Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations of the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity, and their distributions deviate from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper therefore proposes a new classical logic benchmark, DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.

[9] SciEvent: Benchmarking Multi-domain Scientific Event Extraction

Bofu Dong,Pritesh Shah,Sumedh Sonawane,Tiyasha Banerjee,Erin Brady,Xinya Du,Ming Jiang

Main category: cs.CL

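TL;DR: SciEvent is a multi-domain benchmark for scientific event extraction that aims to improve contextual understanding of scientific content through a unified event extraction schema. It contains 500 abstracts from five research domains, annotated with event segments, triggers, and fine-grained arguments. Experiments show that current models struggle in domains such as sociology and humanities.

Motivation: Scientific information extraction (SciIE) has mostly relied on entity-relation extraction in narrow domains. This limits its applicability to interdisciplinary research and makes it hard to capture the context of scientific information, often yielding fragmented or conflicting statements. A unified event extraction schema is needed to support understanding of scientific content across domains.

Contribution: Introduces SciEvent, a multi-domain scientific event extraction benchmark; designs a unified event extraction schema that supports structured, context-aware extraction of scientific information; and evaluates existing models against human annotators.

Method: Defines SciIE as a multi-stage event extraction pipeline: (1) segment abstracts into four core scientific activities (Background, Method, Result, and Conclusion); (2) extract the corresponding triggers and arguments. Experiments use fine-tuned event extraction models, large language models (LLMs), and human annotators.

Result: The experiments reveal a performance gap, with current models struggling particularly in sociology and humanities. SciEvent serves as a challenging benchmark for multi-domain SciIE.

Insight: Cross-domain scientific event extraction remains challenging, especially in sociology and humanities; stronger models and a unified schema are needed to support multi-domain scientific information understanding.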

Abstract: Scientific information extraction (SciIE) has primarily relied on entity-relation extraction in narrow domains, limiting its applicability to interdisciplinary research and struggling to capture the necessary context of scientific information, often resulting in fragmented or conflicting statements. In this paper, we introduce SciEvent, a novel multi-domain benchmark of scientific abstracts annotated via a unified event extraction (EE) schema designed to enable structured and context-aware understanding of scientific content. It includes 500 abstracts across five research domains, with manual annotations of event segments, triggers, and fine-grained arguments. We define SciIE as a multi-stage EE pipeline: (1) segmenting abstracts into core scientific activities–Background, Method, Result, and Conclusion; and (2) extracting the corresponding triggers and arguments. Experiments with fine-tuned EE models, large language models (LLMs), and human annotators reveal a performance gap, with current models struggling in domains such as sociology and humanities. SciEvent serves as a challenging benchmark and a step toward generalizable, multi-domain SciIE.
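
The two-stage annotation described above (segment type, then trigger and arguments) can be pictured as a small record type. This is a hypothetical sketch of what one annotated event might look like, not the SciEvent release format: the field names, the argument role names, and the example sentence are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict

# The four core scientific activities named in the abstract.
SEGMENT_TYPES = ("Background", "Method", "Result", "Conclusion")

@dataclass
class Event:
    segment_type: str              # stage 1: which activity the segment expresses
    text: str                      # the segment's text span
    trigger: str                   # stage 2: word or phrase that evokes the event
    arguments: Dict[str, str] = field(default_factory=dict)  # role -> text span

    def __post_init__(self) -> None:
        if self.segment_type not in SEGMENT_TYPES:
            raise ValueError(f"unknown segment type: {self.segment_type}")
        if self.trigger not in self.text:
            raise ValueError("trigger must be a span of the segment text")

# Made-up example; the role names "Agent" and "Object" are hypothetical.
example = Event(
    segment_type="Method",
    text="We train a classifier on survey responses.",
    trigger="train",
    arguments={"Agent": "We", "Object": "a classifier"},
)
print(example.segment_type, example.trigger, example.arguments)
```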

[10] VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

Dimitrios Damianos,Leon Voukoutis,Georgios Paraskevopoulos,Vassilis Katsouros

Main category: cs.CL

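TL;DR: Proposes VOX-KRIKRI, a multimodal fusion framework that combines pre-trained large language models (LLMs) with acoustic encoder-decoder architectures such as Whisper to build speech-enabled LLMs. Alignment goes through an intermediate audio-conditioned text space, with fusion via cross-modal attention in continuous text representation spaces, in both offline and streaming modes. It performs strongly as a Greek speech LLM.

Motivation: Existing approaches to fusing speech and language models often use audio embeddings directly and lack an effective alignment mechanism. The paper uses an intermediate audio-conditioned text space for more effective alignment, offering a new direction for multilingual and low-resource speech LLMs.

Contribution: 1. A new multimodal fusion framework that aligns speech and language models through an intermediate audio-conditioned text space. 2. VoxKrikri, the first Greek speech LLM, with state-of-the-art results on Greek automatic speech recognition.

Method: 1. Fuses Whisper's hidden decoder states with those of an LLM through cross-modal attention in continuous text representation spaces. 2. Supports both offline and streaming modes.

Result: An average relative improvement of about 20% on Greek automatic speech recognition.

Insight: Continuous space fusion is a promising path for multilingual and low-resource speech LLMs, and an intermediate audio-conditioned text space makes cross-modal alignment more effective.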

Abstract: We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLM) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper’s hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average ~20% relative improvement across benchmarks.

[11] Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

Ke Wang,Wenning Wei,Yan Deng,Lei He,Sheng Zhao

Main category: cs.CL

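TL;DR: This paper studies fine-tuning Large Multimodal Models (LMMs) for Automatic Pronunciation Assessment (APA). The fine-tuned model performs well at the word and sentence levels, but phoneme-level assessment remains challenging.

Motivation: APA is important for Computer-Assisted Language Learning (CALL), but how well LMMs handle fine-grained assessment is unclear. This paper explores their potential and limitations.

Contribution: 1. Shows that fine-tuning LMMs significantly improves APA performance; 2. demonstrates competitive results at the word and sentence levels; 3. highlights the difficulty of phoneme-level assessment and the gap between correlation metrics.

Method: Fine-tunes LMMs on the Speechocean762 dataset and a private corpus, compares against zero-shot settings, and evaluates performance at the word, sentence, and phoneme levels.

Result: Fine-tuned LMMs significantly outperform zero-shot settings on APA and perform well at the word and sentence levels, while phoneme-level assessment remains challenging. The Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) stays around 0.6.

Insight: SCC reflects ordinal consistency better than PCC. LMMs show strong potential for APA, but fine-grained modeling and rank-aware evaluation need further study.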

Abstract: Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman’s rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.
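
The gap between PCC and SCC can be reproduced on toy data. The sketch below implements both coefficients with the standard library only; the score lists are made-up placeholders, chosen so that one extreme score keeps the linear correlation high while several neighbouring items are ranked in the wrong order.

```python
from math import sqrt
from typing import List

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation (SCC): Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))

human = [1, 2, 3, 4, 10]   # placeholder reference scores
model = [2, 1, 4, 3, 10]   # placeholder predictions with swapped neighbours

print(round(pearson(human, model), 2))   # 0.96
print(round(spearman(human, model), 2))  # 0.8
```

Here PCC stays high because the outlier dominates the covariance, while SCC drops because two pairs of items are ordered incorrectly.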

[12] Once Upon a Time: Interactive Learning for Storytelling with Small Language Models

Jonas Mayer Martins,Ali Hamza Bashir,Muhammad Rehan Khalid,Lisa Beinborn

Main category: cs.CL

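TL;DR: Explores training small language models for story generation through interactive learning driven by high-level, cognitively inspired feedback. With just 1 M words of input in interactive learning, the improvement matches that of 410 M words of next-word prediction training.

Motivation: Children learn language efficiently through social interaction, whereas large language models are typically trained with next-word prediction on massive amounts of text. The authors ask whether adding high-level, cognitively inspired feedback (on readability, narrative coherence, and creativity) can make language model training more efficient.

Contribution: An interactive learning method in which a teacher model rates the student model's output (high-level, cognitively inspired feedback), improving the storytelling ability of small language models, together with evidence that the approach is highly data efficient.

Method: (1) Train a student model to generate stories; (2) have a teacher model rate the generated stories on readability, narrative coherence, and creativity; (3) vary the amount of pretraining data to assess the effect of interactive learning on linguistic competence.

Result: With just 1 M words of input in interactive learning, storytelling skills improve as much as with 410 M words of next-word prediction.

Insight: High-level cognitive feedback is key to improving the data efficiency of language models, and interactive learning may open a new route for training small language models.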

Abstract: Children efficiently acquire language not just by listening, but by interacting with others in their social environment. Conversely, large language models are typically trained with next-word prediction on massive amounts of text. Motivated by this contrast, we investigate whether language models can be trained with less data by learning not only from next-word prediction but also from high-level, cognitively inspired feedback. We train a student model to generate stories, which a teacher model rates on readability, narrative coherence, and creativity. By varying the amount of pretraining before the feedback loop, we assess the impact of this interactive learning on formal and functional linguistic competence. We find that the high-level feedback is highly data efficient: With just 1 M words of input in interactive learning, storytelling skills can improve as much as with 410 M words of next-word prediction.

[13] REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting

Nannan Huang,Haytham M. Fayek,Xiuzhen Zhang

Main category: cs.CL

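TL;DR: Proposes REFER, which uses frequency framed prompting to reduce bias in opinion summarisation with large language models (LLMs). Experiments show that it significantly improves fairness, with the effect more pronounced in larger models and with stronger reasoning instructions.

Motivation: Existing approaches to fairness in opinion summarisation rely on hyperparameter tuning or on supplying ground-truth distributional information, but in practice end-users rarely modify model parameters and distributional information is hard to obtain. Cognitive science research shows that frequency-based representations reduce biases in human statistical reasoning, so the study asks whether a similar approach can improve fairness in LLMs.

Contribution: Proposes REFER, which brings frequency framed prompting to LLM opinion summarisation and significantly improves the fairness of the summaries.

Method: Uses frequency framed prompting, a technique known to improve human reasoning, and compares it against abstract probabilistic representations to study how different prompting frameworks affect the model's information processing.

Result: REFER significantly improves fairness in opinion summarisation, with a stronger effect in larger models and with stronger reasoning instructions.

Insight: Frequency-based prompt framing can reduce bias in language models and suggests a new direction for improving the fairness of generated content.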

Abstract: Individuals express diverse opinions, and a fair summary should represent these viewpoints comprehensively. Previous research on fairness in opinion summarisation using large language models (LLMs) relied on hyperparameter tuning or providing ground truth distributional information in prompts. However, these methods face practical limitations: end-users rarely modify default model parameters, and accurate distributional information is often unavailable. Building upon cognitive science research demonstrating that frequency-based representations reduce systematic biases in human statistical reasoning by making reference classes explicit and reducing cognitive load, this study investigates whether frequency framed prompting (REFER) can similarly enhance fairness in LLM opinion summarisation. Through systematic experimentation with different prompting frameworks, we adapted techniques known to improve human reasoning to elicit more effective information processing in language models compared to abstract probabilistic representations. Our results demonstrate that REFER enhances fairness in language models when summarising opinions. This effect is particularly pronounced in larger language models and when using stronger reasoning instructions.
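
The contrast between an abstract probabilistic statement and a frequency framed one can be made concrete with a small sketch. The wording below is hypothetical and is not taken from the REFER prompts; it only illustrates how a frequency framing makes the reference class explicit.

```python
def probability_framing(share: float, opinion: str) -> str:
    """Abstract probabilistic phrasing of an opinion share."""
    return f"{share:.0%} of the reviews {opinion}."

def frequency_framing(share: float, opinion: str, reference_class: int = 100) -> str:
    """Frequency phrasing: a count out of an explicit reference class."""
    count = round(share * reference_class)
    return f"{count} out of every {reference_class} reviews {opinion}."

# Hypothetical opinion share and wording, for illustration only.
print(probability_framing(0.3, "praise the battery life"))
print(frequency_framing(0.3, "praise the battery life"))
print(frequency_framing(0.3, "praise the battery life", reference_class=10))
```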

[14] Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics

Reza Sanayei,Srdjan Vesic,Eduardo Blanco,Mihai Surdeanu

Main category: cs.CL

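TL;DR: Examines whether Large Language Models (LLMs) can reason over the non-linear structure of natural debates, grounded in Computational Argumentation Theory (CAT), by ranking arguments according to QuAD semantics. LLMs do better on shorter inputs, and performance degrades with longer inputs or disrupted discourse flow. Advanced prompting strategies such as Chain-of-Thought and In-Context Learning help mitigate these effects.

Motivation: LLMs excel at linear reasoning tasks, but their behaviour on non-linear structures such as the argument graphs underlying natural debates remains underexplored. The paper evaluates whether LLMs can approximate structured reasoning from CAT, in particular without direct access to the underlying argument graph.

Contribution: (1) Systematically studies how well LLMs handle non-linear reasoning over natural debates; (2) proposes an evaluation that ranks arguments according to QuAD semantics; (3) shows that advanced prompting strategies reduce model biases and improve performance.

Method: (1) Use QuAD semantics to assign acceptability scores to arguments; (2) feed debates to LLMs in dialogue format and ask them to rank the arguments without providing the underlying argument graph; (3) test several LLMs and prompting strategies, including Chain-of-Thought and In-Context Learning.

Result: LLMs show moderate alignment with QuAD rankings on shorter inputs, but performance degrades with longer inputs or disrupted discourse flow. Advanced prompting reduces biases related to argument length and position.

Insight: (1) LLMs show promise in modeling formal argumentation semantics but still have limitations; (2) future work should explore more flexible graph-aware reasoning in LLMs, especially for complex non-linear structures.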

Abstract: Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.
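
To give a feel for how a quantitative semantics turns attack and support relations into acceptability scores, here is a minimal sketch on a tree-shaped debate. It assumes the aggregation usually attributed to QuAD: attackers scale the base score down multiplicatively, supporters scale it up, and the two results are averaged when both are present. Treat these formulas, the default base score of 0.5, and the example debate as assumptions; the paper's exact setup is not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    name: str
    base: float = 0.5  # assumed default base score
    attackers: List["Argument"] = field(default_factory=list)
    supporters: List["Argument"] = field(default_factory=list)

def _attacked(base: float, scores: List[float]) -> float:
    """Each attacker of strength s shrinks the score by a factor (1 - s)."""
    value = base
    for s in scores:
        value *= 1 - s
    return value

def _supported(base: float, scores: List[float]) -> float:
    """Each supporter of strength s closes a fraction s of the gap to 1."""
    gap = 1 - base
    for s in scores:
        gap *= 1 - s
    return 1 - gap

def quad_score(arg: Argument) -> float:
    """Recursive acceptability score under the assumed QuAD-style aggregation."""
    att = [quad_score(a) for a in arg.attackers]
    sup = [quad_score(s) for s in arg.supporters]
    if not att and not sup:
        return arg.base
    if not sup:
        return _attacked(arg.base, att)
    if not att:
        return _supported(arg.base, sup)
    return (_attacked(arg.base, att) + _supported(arg.base, sup)) / 2

# Hypothetical three-argument debate.
claim_attacked = Argument("claim", attackers=[Argument("objection")])
claim_mixed = Argument("claim", attackers=[Argument("objection")],
                       supporters=[Argument("evidence")])
print(quad_score(claim_attacked))  # 0.25
print(quad_score(claim_mixed))     # 0.5
```

Ranking arguments by such scores requires the graph, which is exactly what the models in the study do not see.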

[15] Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning

Sara Rajaee,Rochelle Choenni,Ekaterina Shutova,Christof Monz

Main category: cs.CL

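TL;DR: Studies the cross-lingual complementarity of mathematical reasoning in multilingual LLMs and proposes a cross-lingual reward model to improve reasoning performance.

Motivation: Investigates how mathematical reasoning ability varies across languages in multilingual large language models (LLMs) and whether reasoning paths generated in different languages complement each other.

Contribution: A cross-lingual reward model for improving the mathematical reasoning of multilingual LLMs, revealing the complementary benefit of cross-lingual sampling.

Method: Trains a cross-lingual reward model to rank responses generated in multiple languages, and validates it by comparing against reward modeling within a single language.

Result: The cross-lingual reward model substantially improves mathematical reasoning performance, benefiting even high-resource languages; under low sampling budgets, cross-lingual sampling particularly benefits English.

Insight: Multilingual reasoning can be improved further by using the complementary strengths of different languages, and cross-lingual reward modeling is an effective way to do so.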

Abstract: While the reasoning abilities of large language models (LLMs) continue to advance, it remains unclear how such ability varies across languages in multilingual LLMs and whether different languages produce reasoning paths that complement each other. To investigate this question, we train a reward model to rank generated responses for a given question across languages. Our results show that our cross-lingual reward model substantially improves mathematical reasoning performance compared to using reward modeling within a single language, benefiting even high-resource languages. While English often exhibits the highest performance in multilingual models, we find that cross-lingual sampling particularly benefits English under low sampling budgets. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.
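
The selection step can be sketched as best-of-N over a pooled candidate set. In the sketch below, the candidate answers and the `toy_reward` scores are placeholders standing in for sampled responses and a trained reward model; nothing here reflects the paper's actual model or data.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (language code, generated response)

def best_of(candidates: List[Candidate],
            reward: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate the reward model scores highest."""
    return max(candidates, key=reward)

# Placeholder samples for one question, generated in three languages.
samples: List[Candidate] = [("en", "x = 12"), ("de", "x = 14"), ("sw", "x = 14")]

# Hypothetical reward scores keyed by the final answer.
toy_scores = {"x = 12": 0.41, "x = 14": 0.87}

def toy_reward(candidate: Candidate) -> float:
    _language, answer = candidate
    return toy_scores[answer]

english_only = [c for c in samples if c[0] == "en"]
print(best_of(english_only, toy_reward))  # monolingual pool
print(best_of(samples, toy_reward))       # cross-lingual pool
```

With the same sampling budget per language, the pooled set can contain a correct reasoning path that the monolingual set missed, which is the complementarity the abstract refers to.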

[16] Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Zhongze Luo,Zhenshuai Yin,Yongxin Guo,Zhichao Wang,Jionghao Zhu,Xiaoying Tang

Main category: cs.CL

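TL;DR: Introduces Multi-Physics, a comprehensive benchmark for evaluating the reasoning of multimodal LLMs (MLLMs) on Chinese multi-subject physics problems, addressing current gaps in fine-grained subject coverage, step-by-step reasoning evaluation, and analysis of visual information.

Motivation: Existing benchmarks fall short on multimodal reasoning in scientific domains such as physics, particularly in fine-grained subject coverage, evaluation of the step-by-step reasoning process, and systematic analysis in a Chinese-language setting.

Contribution: Multi-Physics, a Chinese physics reasoning benchmark with 1,412 image-associated multiple-choice questions spanning 11 high-school physics subjects at 5 difficulty levels, plus a dual evaluation framework that analyzes both final answer accuracy and the integrity of the step-by-step reasoning chain.

Method: Evaluates 20 MLLMs with the dual evaluation framework and, by changing the input mode, analyzes how difficulty level and visual information affect model performance.

Result: The benchmark gives the community a fine-grained resource and reveals performance differences among MLLMs in multimodal reasoning; the code and dataset are open-sourced.

Insight: The paper shows how visual information affects MLLM reasoning at different difficulty levels and offers a methodology for improving multimodal models in scientific domains.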

Abstract: While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce Multi-Physics for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work provides not only a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced: https://github.com/luozhongze/Multi-Physics.

[17] Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Senkang Hu,Xudong Han,Jinqi Jiang,Yihang Tao,Zihan Fang,Sam Tak Wu Kwong,Yuguang Fang

Main category: cs.CL

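TL;DR: Proposes Steering Vector Decoding (SVD), a lightweight method that adapts large language models (LLMs) to tasks by aligning the output distribution instead of updating weights.

Motivation: Existing parameter-efficient fine-tuning (PEFT) methods are still costly; the paper explores adapting to the task distribution directly during decoding.

Contribution: SVD, a lightweight, PEFT-compatible, theoretically grounded method that extracts a task-aware steering vector from the KL divergence gradient to guide decoding.

Method: After a short warm-start fine-tune, extracts a task-aware steering vector and applies it during decoding; proves that the method is first-order equivalent to the gradient step of full fine-tuning.

Result: Across multiple task benchmarks, SVD improves multiple-choice accuracy and open-ended truthfulness without adding trainable parameters beyond the PEFT adapter.

Insight: Directly steering the output distribution is an efficient route to task adaptation, with both theoretical grounding and a practical recipe for lightweight LLM adaptation.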

Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model’s output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.
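
The idea of steering logits along a KL gradient can be illustrated on a three-token vocabulary. The sketch relies on one standard identity: for p = softmax(z) and a fixed target q, the gradient of KL(q || p) with respect to z is p - q, so q - p is a descent direction. Which KL direction the paper uses, the space the vector lives in, and how the strength is chosen are not specified in the abstract, so the choices below (and the toy logits) are assumptions, not the authors' algorithm.

```python
from math import exp, log
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def kl(q: List[float], p: List[float]) -> float:
    """KL(q || p) for two distributions over the same vocabulary."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def steering_vector(pre_logits: List[float], warm_logits: List[float]) -> List[float]:
    """Negative gradient of KL(warm || pre) with respect to the pre-trained logits."""
    p = softmax(pre_logits)
    q = softmax(warm_logits)
    return [qi - pi for qi, pi in zip(q, p)]

def steer(pre_logits: List[float], vector: List[float], strength: float) -> List[float]:
    """Shift the pre-trained logits along the steering vector at decoding time."""
    return [z + strength * v for z, v in zip(pre_logits, vector)]

pre = [0.0, 0.0, 0.0]    # toy logits from the pre-trained model
warm = [2.0, 0.0, -1.0]  # toy logits from the warm-started model

vec = steering_vector(pre, warm)
target = softmax(warm)
before = kl(target, softmax(pre))
after = kl(target, softmax(steer(pre, vec, strength=1.0)))
print(before > after)  # steering moves the output toward the task distribution
```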

[18] Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Frederic Kirstein,Sonu Kumar,Terry Ruas,Bela Gipp

Main category: cs.CL

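TL;DR: FRAME is a modular pipeline that reframes meeting summarization as a semantic enrichment task. Combined with SCOPE, a reasoning protocol that improves personalization, and evaluated with the P-MESA framework, it significantly reduces hallucination and omission.

Motivation: LLM meeting summaries often contain hallucinations, omissions, and irrelevant content, so a more controllable, faithful, and personalized approach is needed.

Contribution: 1. FRAME, a modular pipeline that reframes summarization as semantic enrichment; 2. SCOPE, a reasoning protocol that improves personalization; 3. P-MESA, a reference-free evaluation framework.

Method: 1. FRAME extracts and scores salient facts, organizes them thematically, and uses them to enrich the summary; 2. SCOPE has the model build a reasoning trace by answering nine questions; 3. P-MESA assesses summary quality.

Result: Hallucination and omission drop by 2 out of 5 points, and SCOPE improves knowledge fit and goal alignment. P-MESA agrees closely with human annotations (>= 89% balanced accuracy).

Insight: Summarization should be rethought with an emphasis on control, faithfulness, and personalization.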

Abstract: Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
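
P-MESA's agreement with human annotations is reported as balanced accuracy, which averages the recall on each class and so is not inflated when error instances are rare. A minimal sketch with placeholder labels (not the paper's data):

```python
from typing import List

def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2

# Placeholder labels: 1 = summary contains an error, 0 = no error.
y_true = [1, 1] + [0] * 8
y_pred = [1, 0] + [0] * 8

plain_accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In this toy case plain accuracy looks strong even though half of the error instances were missed, which balanced accuracy makes visible.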

[19] Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning

Hong-Yun Lin,Jhen-Ke Lin,Chung-Chun Wang,Hao-Chien Lu,Berlin Chen

Main category: cs.CL

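TL;DR: The paper proposes a multimodal foundation model approach for session-level spoken language assessment (SLA) that combines multi-target learning with a frozen Whisper ASR model for acoustic-aware calibration, substantially improving assessment performance.

Details Motivation: The growing number of L2 English learners has increased the demand for reliable spoken language assessment. Existing approaches are mostly cascaded pipelines, which propagate errors, or end-to-end models with short audio windows, which can miss discourse-level evidence.

Contribution: The main contribution is a novel multimodal foundation model approach that couples multi-target learning with acoustic-aware calibration from a Whisper ASR model to perform session-level SLA.

Method: Multi-target learning is combined with a frozen Whisper ASR model; holistic and trait-level objectives are learned jointly, with no handcrafted feature extraction.

Result: On the Speak & Improve benchmark, the approach outperforms the previous state-of-the-art cascaded system and shows robust cross-part generalization.

Insight: Session-level processing and acoustic-aware calibration are key to better spoken language assessment. The approach provides a compact, efficient grader for Computer Assisted Language Learning (CALL).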

Abstract: Spoken Language Assessment (SLA) estimates a learner’s oral proficiency from spontaneous speech. The growing population of L2 English speakers has intensified the demand for reliable SLA, a critical component of Computer Assisted Language Learning (CALL). Existing efforts often rely on cascaded pipelines, which are prone to error propagation, or end-to-end models that often operate on a short audio window, which might miss discourse-level evidence. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass. Our approach couples multi-target learning with a frozen, Whisper ASR model-based speech prior for acoustic-aware calibration, allowing for jointly learning holistic and trait-level objectives of SLA without resorting to handcrafted features. By coherently processing the entire response session of an L2 speaker, the model excels at predicting holistic oral proficiency. Experiments conducted on the Speak & Improve benchmark demonstrate that our proposed approach outperforms the previous state-of-the-art cascaded system and exhibits robust cross-part generalization, producing a compact deployable grader that is tailored for CALL applications.

[20] Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech

Sang Hoon Woo,Sehun Lee,Kang-wook Kim,Gunhee Kim

Main category: cs.CL

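TL;DR: The paper proposes Think-Verbalize-Speak, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of large language models (LLMs), using an intermediate verbalizing step that turns thoughts into natural, speech-ready text.

Details Motivation: Applying LLMs directly in spoken dialogue systems gives suboptimal results because textual and spoken delivery are mismatched. The paper addresses this problem and also examines how such adaptation affects reasoning performance.

Contribution: (1) The Think-Verbalize-Speak framework, which separates reasoning from spoken delivery; (2) ReVerT, a low-latency verbalizer based on incremental and asynchronous summarization.

Method: Verbalizing serves as an intermediate step that translates the LLM's complex thoughts into natural text suited to speech, with ReVerT handling this step efficiently.

Result: Across multiple benchmarks, the method improves speech naturalness and conciseness with minimal impact on the model's reasoning.

Insight: Decoupling reasoning from spoken delivery is an effective strategy: it preserves LLM reasoning ability while improving speech output.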

Abstract: Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yields suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose Think-Verbalize-Speak, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is verbalizing, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce ReVerT, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at https://yhytoto12.github.io/TVS-ReVerT

[21] Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

Fangyi Yu,Nabeel Seedat,Dasha Herrmannova,Frank Schilder,Jonathan Richard Schwarz

Main category: cs.CL

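TL;DR: The paper proposes DeCE, a framework for decomposed evaluation of long-form LLM answers that measures precision and recall separately. It clearly outperforms traditional metrics and existing LLM evaluators and correlates strongly with expert ratings.

Details Motivation: Standard metrics such as BLEU and ROUGE fail to capture the semantic correctness of long-form answers in domains such as law or medicine, and existing LLM evaluators usually give a single score that hides the multiple dimensions of answer quality.

Contribution: DeCE evaluates LLM answers by separating precision (factual accuracy and relevance) from recall (concept coverage), automatically extracting evaluation criteria from gold answers, with no predefined taxonomies or handcrafted rubrics.

Method: An LLM automatically extracts instance-specific evaluation criteria, answers are scored along the two dimensions of precision and recall, and the framework is applied to a real-world legal QA task.

Result: DeCE correlates much more strongly with expert judgments (r=0.78) than traditional metrics (r=0.12), pointwise LLM scoring (r=0.35), and multidimensional evaluators (r=0.48). In addition, only 11.95% of the generated criteria required expert revision.

Insight: DeCE reveals a trade-off: generalist models favor recall, while specialized models favor precision. The method is scalable and interpretable and is suited to expert domains.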

Abstract: Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE’s scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
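The precision/recall split can be made concrete with a small sketch. It assumes, purely for illustration, that a judge has already produced one boolean per claim in the answer (is the claim accurate and relevant?) and one boolean per gold criterion (is the criterion covered?). The function name and input format are hypothetical, and DeCE's actual scoring may differ.

```python
def decomposed_scores(claim_supported, criteria_covered):
    """Return (precision, recall) from per-item boolean judgments.

    claim_supported: one bool per claim made in the model's answer.
    criteria_covered: one bool per criterion extracted from the gold answer.
    """
    precision = sum(claim_supported) / len(claim_supported) if claim_supported else 0.0
    recall = sum(criteria_covered) / len(criteria_covered) if criteria_covered else 0.0
    return precision, recall
```

An answer with three of four claims supported that covers one of two required criteria scores (0.75, 0.5): fairly precise, but incomplete. A single pointwise score would blur that distinction.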

[22] It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Lukas Ellinger,Georg Groh

Main category: cs.CL

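TL;DR: The paper studies whether large language models (LLMs) can use commonsense to resolve referential ambiguity in multi-turn conversations and finds that current models tend to commit to a single interpretation or cover all possibilities rather than seek clarification. Simplification prompts weaken commonsense reasoning. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution.

Details Motivation: Referential ambiguity and underspecified references have to be resolved through context and commonsense. The study tests whether LLMs can use commonsense to resolve ambiguity in multi-turn conversations and analyzes their behavior when ambiguity persists.

Contribution: 1. A systematic study of LLM performance on ambiguity resolution; 2. Evidence that simplification prompts harm commonsense reasoning; 3. A substantial performance improvement through fine-tuning.

Method: 1. Test several LLMs on a multilingual evaluation dataset; 2. Combine LLM-as-Judge with human annotations; 3. Fine-tune Llama-3.1-8B with Direct Preference Optimization.

Result: Current LLMs resolve ambiguity poorly, and simplification prompts weaken their reasoning further. The fine-tuned Llama-3.1-8B improves substantially across all request types.

Insight: Advanced fine-tuning is essential for making LLMs robust in handling ambiguity and across diverse communication styles.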

Abstract: Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs’ handling of ambiguity and to ensure robust performance across diverse communication styles.
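For readers unfamiliar with Direct Preference Optimization, the standard per-example DPO loss can be sketched as below. This is the generic DPO objective, not code from the paper; the argument names are illustrative, and which responses count as "chosen" (for example, a clarifying question) versus "rejected" is an assumption about how such preference pairs might be built.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.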

[23] CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion

Sheng Zhang,Yifan Ding,Shuquan Lian,Shun Song,Hui Li

Main category: cs.CL

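TL;DR: CodeRAG is a framework for retrieval-augmented repository-level code completion that substantially improves completion performance through better query construction, multi-path code retrieval, and preference-aligned BestFit reranking.

Details Motivation: Current repository-level code completion methods suffer from inappropriate query construction, single-path retrieval, and misalignment between the code retriever and the code LLM, so a more effective framework is needed.

Contribution: CodeRAG introduces three components: log probability guided query construction, multi-path code retrieval, and BestFit reranking, which together improve the accuracy and efficiency of code completion.

Method: 1. Guide query construction with log probabilities; 2. Retrieve code along multiple paths; 3. Align with the model's preferences through BestFit reranking.

Result: On the ReccEval and CCEval benchmarks, CodeRAG significantly outperforms existing methods.

Insight: Multi-path retrieval and preference alignment are crucial to code completion performance, although the computational overhead has to be balanced.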

Abstract: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at https://github.com/KDEGroup/CodeRAG.

[24] CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs

Jinghao Zhang,Sihang Jiang,Shiwei Guo,Shisong Chen,Yanghua Xiao,Hongwei Feng,Jiaqing Liang,Minggui HE,Shimin Tao,Hongxia Ma

Main category: cs.CL

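TL;DR: CultureScope is a comprehensive framework for evaluating cultural understanding in large language models (LLMs). It uses a layered dimensional schema based on the cultural iceberg theory to automatically build culture-specific knowledge bases and evaluation datasets. Experiments show that existing LLMs lack comprehensive cultural competence and that multilingual data does not directly improve cultural understanding.

Details Motivation: As LLMs are increasingly deployed in diverse cultural environments, evaluating their cultural understanding is essential for trustworthy and culturally aligned applications. Existing benchmarks, however, lack comprehensiveness and are hard to scale and adapt across cultures.

Contribution: Proposes CultureScope, described as the most comprehensive cultural understanding evaluation framework to date, with a classification schema of 3 layers and 140 dimensions based on the cultural iceberg theory that supports automatic construction of culture-specific knowledge bases and evaluation datasets.

Method: A layered dimensional schema (3 layers, 140 dimensions) guided by the cultural iceberg theory drives the construction of knowledge bases and evaluation datasets; the procedure is automated and scalable.

Result: Experiments show that the method evaluates cultural understanding effectively, and they reveal that existing LLMs fall short in cultural competence and that multilingual data offers limited help.

Insight: Evaluating cultural understanding needs a theory-guided, layered framework; simply adding multilingual data is not an effective route to cultural competence.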

Abstract: As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are challenging to scale and adapt across different cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotations. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given languages and cultures. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at https://github.com/HoganZinger/Culture

[25] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Jane Luo,Xin Zhang,Steven Liu,Jie Wu,Yiming Huang,Yangyu Huang,Chengyu Yin,Ying Xin,Jianfeng Liu,Yuefeng Zhan,Hao Sun,Qi Chen,Scarlett Li,Mao Yang

Main category: cs.CL

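TL;DR: The paper proposes the Repository Planning Graph (RPG) and the ZeroRepo framework for generating code repositories from scratch, addressing the planning problem that large language models face in complete codebase generation.

Details Motivation: The ambiguity and verbosity of natural language make it ill-suited to representing complex software structures, and existing methods fall short in planning and scalability for complete codebase generation.

Contribution: 1. Proposes the Repository Planning Graph (RPG) as a unified planning representation; 2. Develops ZeroRepo, a framework that generates repositories in three stages; 3. Builds the RepoCraft benchmark and uses it to validate the method.

Method: 1. RPG encodes capabilities, file structures, data flows, and functions; 2. ZeroRepo generates a repository through proposal-level planning, implementation-level refinement, and graph-guided code generation; 3. Test validation checks the functional completeness of the generated code.

Result: On RepoCraft, ZeroRepo produces repositories averaging about 36K lines of code, with 81.5% functional coverage and a 69.7% pass rate, significantly better than the baselines.

Insight: RPG models complex dependencies, improves LLM understanding of repositories, and enables progressively more sophisticated planning through near-linear scaling.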

Abstract: Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9$\times$ the strongest baseline (Claude Code) and about 64$\times$ other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.
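The comparison figures in the abstract imply the strongest baseline's absolute scores, which can be recovered with simple arithmetic. The sketch below only rearranges the numbers quoted above; the LOC estimate is approximate because the abstract itself gives "nearly 36K" and "roughly 3.9x".

```python
# Figures quoted in the abstract (percentages and percentage-point gaps).
zerorepo_coverage, zerorepo_pass_rate = 81.5, 69.7
gap_coverage, gap_pass_rate = 27.3, 35.8

# Implied Claude Code scores: ZeroRepo's score minus the reported gap.
claude_coverage = round(zerorepo_coverage - gap_coverage, 1)     # 54.2
claude_pass_rate = round(zerorepo_pass_rate - gap_pass_rate, 1)  # 33.9

# Rough implied repository size for the strongest baseline (approximate inputs).
approx_claude_loc = 36_000 / 3.9  # about 9.2K lines
```

So the strongest baseline would sit at roughly 54.2% functional coverage and a 33.9% pass rate, with repositories of around 9K lines.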

cs.CV [Back]

[26] Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays

Hanbin Ko,Gihun Cho,Inhyeok Baek,Donguk Kim,Joonbeom Koo,Changi Kim,Dongheon Lee,Chang Min Park

Main category: cs.CV

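TL;DR: The paper explores the capabilities of LLM encoders for image-text retrieval in chest X-rays and proposes two models, LLM2VEC4CXR and LLM2CLIP4CXR, to address the multimodal learning bottleneck caused by heterogeneous radiology reports.

Details Motivation: Radiology reports are heterogeneous, with abbreviations and varied styles, which limits conventional vision-language pretraining. The paper asks whether LLM encoders can provide more robust representations of clinical text.

Contribution: 1. Proposes LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports; 2. Designs LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone to improve retrieval accuracy; 3. Demonstrates robustness to heterogeneous, noisy reports and cross-dataset generalization.

Method: 1. LLM2VEC4CXR adapts an LLM encoder to radiology reports through fine-tuning; 2. LLM2CLIP4CXR combines LLM2VEC4CXR with a vision network for image-text alignment; 3. Training uses 1.6M chest X-ray studies.

Result: LLM2VEC4CXR outperforms BERT baselines on clinical text understanding; LLM2CLIP4CXR improves retrieval accuracy and clinical alignment scores and generalizes across datasets better than prior medical CLIP variants.

Insight: The key to effective multimodal learning is robustness, not data scale alone. The models perform well on heterogeneous and noisy data, which suggests a new direction for medical image-text representation learning.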

Abstract: Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness – not scale alone – is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.

[27] ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Main category: cs.CV

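TL;DR: ViSpec introduces a vision-aware speculative decoding framework for vision-language models (VLMs) that uses a lightweight vision adaptor and global feature augmentation to substantially speed up VLM inference.

Details Motivation: Existing speculative decoding methods have limited effect on multimodal models (VLMs) and fail to exploit redundancy in image information, while the growing importance of multimodal models makes faster VLM inference increasingly urgent.

Contribution: Proposes the ViSpec framework, which compresses image representations with a vision adaptor while preserving positional information and uses a global feature to improve text coherence, reporting the first substantial speedup for VLMs.

Method: A lightweight vision adaptor compresses image tokens, a global feature vector augments the text tokens, and a curated training dataset with a modified training strategy avoids shortcut learning.

Result: ViSpec achieves a substantial inference speedup for VLMs, the first to go beyond the limit of existing methods (<1.5x).

Insight: Large VLMs can filter redundant image information layer by layer without harming text comprehension, whereas small draft models need extra design (such as a vision adaptor) to achieve a similar effect.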

Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model’s attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model’s hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding.
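For context, the basic draft-then-verify loop of speculative decoding (in its simple greedy form, without ViSpec's vision adaptor) can be sketched as follows. `draft_next` and `target_next` are hypothetical toy callables that map a token list to the next token; real systems verify all drafted positions in one parallel forward pass of the target model, whereas this sketch calls the target sequentially for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step; returns the extended token list."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each proposed token in order.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return prefix + accepted
        accepted.append(tok)

    # 3. All drafts accepted: the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted
```

The output always matches what greedy decoding with the target alone would produce; the speedup comes from accepting several draft tokens per target pass, which is why a draft model that handles image tokens poorly limits the gain.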

[28] M-PACE: Mother Child Framework for Multimodal Compliance

Shreyash Verma,Amit Kesari,Vinayak Trivedi,Anupam Purwar,Ratnesh Jamidar

Main category: cs.CV

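TL;DR: M-PACE is a multimodal compliance-checking framework that uses a mother-child MLLM architecture to process visual and textual content jointly, substantially reducing reliance on human review and lowering inference cost.

Details Motivation: Traditional compliance frameworks are built from fragmented modules, which makes them inefficient and hard to adapt to changing requirements. M-PACE uses Multimodal Large Language Models (MLLMs) to unify the workflow and improve efficiency.

Contribution: Proposes the M-PACE framework, which assesses the compliance of multimodal content jointly through a mother-child MLLM architecture and lowers cost.

Method: A mother-child MLLM architecture in which the mother model evaluates the outputs of the child models, combined with a benchmark dataset for structured evaluation.

Result: Inference cost falls by a factor of 31 and model efficiency improves (US$0.0005 per image), balancing quality and cost in real time on advertising data.

Insight: A mother-child architecture enables efficient multimodal compliance checking, reducing dependence on human reviewers while lowering cost.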

Abstract: Ensuring that multi-modal content adheres to brand, legal, or platform-specific compliance standards is an increasingly complex challenge across domains. Traditional compliance frameworks typically rely on disjointed, multi-stage pipelines that integrate separate modules for image classification, text extraction, audio transcription, hand-crafted checks, and rule-based merges. This architectural fragmentation increases operational overhead, hampers scalability, and hinders the ability to adapt to dynamic guidelines efficiently. With the emergence of Multimodal Large Language Models (MLLMs), there is growing potential to unify these workflows under a single, general-purpose framework capable of jointly processing visual and textual content. In light of this, we propose Multimodal Parameter Agnostic Compliance Engine (M-PACE), a framework designed for assessing attributes across vision-language inputs in a single pass. As a representative use case, we apply M-PACE to advertisement compliance, demonstrating its ability to evaluate over 15 compliance-related attributes. To support structured evaluation, we introduce a human-annotated benchmark enriched with augmented samples that simulate challenging real-world conditions, including visual obstructions and profanity injection. M-PACE employs a mother-child MLLM setup, demonstrating that a stronger parent MLLM evaluating the outputs of smaller child models can significantly reduce dependence on human reviewers, thereby automating quality control. Our analysis reveals that inference costs fall by a factor of more than 31, with the most efficient models (Gemini 2.0 Flash as child MLLM selected by mother MLLM) operating at 0.0005 per image, compared to 0.0159 for Gemini 2.5 Pro with comparable accuracy, highlighting the trade-off between cost and output quality that M-PACE achieves in real time in real-life deployment on advertising data.
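The "over 31 times" figure follows directly from the two per-image costs quoted in the abstract, as this short check shows. The abstract gives no currency unit for these costs, so none is assumed here.

```python
cost_gemini_25_pro = 0.0159    # per image, as quoted in the abstract
cost_gemini_20_flash = 0.0005  # per image, as quoted in the abstract

# Ratio of the two per-image costs.
cost_ratio = cost_gemini_25_pro / cost_gemini_20_flash  # about 31.8
```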

[29] Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models

Muhammad Imran,Yugyung Lee

Main category: cs.CV

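TL;DR: The paper proposes the Multi-Modal Explainable Learning (MMEL) framework, which improves the interpretability and reliability of vision-language models through multi-scale feature processing, adaptive attention weighting, and cross-modal alignment.

Details Motivation: In safety-critical applications, vision-language models are challenged by complex object relationships and the demand for transparency, so better interpretability is needed.

Contribution: Introduces a Hierarchical Semantic Relationship Module that uses multi-scale features and adaptive weighting to improve a gradient-based explanation method (Grad-eclip), producing more context-aware visualizations.

Method: The MMEL framework combines multi-scale feature processing, adaptive attention weighting, and cross-modal alignment to better capture semantic relationships.

Result: Experiments show that MMEL produces more precise visual explanations that better reflect how the model processes complex scenes, while maintaining high performance.

Insight: Cross-modal alignment and multi-scale feature processing are key to improving the interpretability and reliability of vision-language models.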

Abstract: Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between objects, subtle visual cues, and the heightened demand for transparency and reliability. This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models while maintaining high performance. Building upon prior work in gradient-based explanations for transformer architectures (Grad-eclip), MMEL introduces a novel Hierarchical Semantic Relationship Module that enhances model interpretability through multi-scale feature processing, adaptive attention weighting, and cross-modal alignment. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities, applying learnable layer-specific weights to balance contributions across the model’s depth. This results in more comprehensive visual explanations that highlight both primary objects and their contextual relationships with improved precision. Through extensive experiments on standard datasets, we demonstrate that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations that better reflect how vision-language models process complex scenes. The MMEL framework generalizes across various domains, offering valuable insights into model decisions for applications requiring high interpretability and reliability.

[30] Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning

Wenda Qin,Andrea Burns,Bryan A. Plummer,Margrit Betke

Main category: cs.CV

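TL;DR: The paper proposes Navigation-Aware Pruning (NAP), a lightweight method for Vision-and-Language Navigation (VLN) that pre-filters tokens into foreground and background to improve efficiency, substantially reducing compute while preserving performance.

Details Motivation: Large models perform well on VLN tasks but are computationally expensive. Conventional pruning works poorly in VLN because information loss can reduce navigation efficiency.

Contribution: 1. Proposes Navigation-Aware Pruning (NAP), which uses navigation-specific traits to pre-filter tokens; 2. Uses a large language model to extract navigation-relevant instructions; 3. Avoids longer navigation paths by removing low-value nodes.

Method: 1. Split image tokens into foreground and background; 2. Use an LLM to filter the navigation instructions; 3. Prune background tokens and remove low-importance navigation nodes.

Result: On standard VLN benchmarks, NAP saves more than 50% of FLOPS while preserving higher success rates.

Insight: Navigation tasks need a tailored pruning strategy that avoids the negative effect of information loss on path efficiency.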

Abstract: Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPS.
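The idea of restricting pruning to background tokens can be sketched as below. This is an illustrative simplification, not the paper's algorithm: the function name, the boolean foreground mask, the per-token importance scores, and the `keep_ratio` parameter are all assumptions.

```python
import numpy as np

def prune_background(tokens, is_foreground, importance, keep_ratio=0.5):
    """Keep every foreground token and only the most important background tokens.

    tokens: (n, d) array; is_foreground: (n,) bool array; importance: (n,) scores.
    Returns the kept tokens and their original indices (in original order).
    """
    fg_idx = np.flatnonzero(is_foreground)
    bg_idx = np.flatnonzero(~is_foreground)
    n_keep = int(round(keep_ratio * len(bg_idx)))
    # Rank background tokens by descending importance and keep the top n_keep.
    top_bg = bg_idx[np.argsort(-importance[bg_idx])[:n_keep]]
    keep = np.sort(np.concatenate([fg_idx, top_bg]))
    return tokens[keep], keep
```

Because foreground tokens (for example, views the agent can actually navigate toward) are never dropped, the information loss from pruning is confined to tokens that are less likely to change the chosen path.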

[31] RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation

Silpa Vadakkeeveetil Sreelatha,Sauradip Nag,Muhammad Awais,Serge Belongie,Anjan Dutta

Main category: cs.CV

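TL;DR: RespoDiff proposes a dual-module bottleneck transformation framework that ensures fairness, safety, and semantic fidelity together in text-to-image generation, clearly outperforming existing methods.

Details Motivation: Diffusion models for text-to-image generation face fairness and safety challenges, and existing methods often improve these at the expense of semantic fidelity or image quality. The paper aims to improve the fairness and safety of generated content without degrading generation quality.

Contribution: 1. Proposes the RespoDiff framework, which balances responsibility (fairness and safety) and semantic fidelity through a dual-module bottleneck transformation;
2. Designs a new score-matching objective that coordinates learning across the two modules;
3. Improves responsible and semantically coherent generation by 20% across diverse, unseen prompts.

Method: 1. Apply a dual-module transformation to the intermediate bottleneck representations of the diffusion model: one module focuses on responsibility (such as fairness and safety), the other on semantic alignment;
2. Introduce a score-matching objective to coordinate the two modules;
3. The method applies to large-scale models (such as SDXL) without sacrificing image quality.

Result: Experiments show that RespoDiff outperforms existing methods in responsible generation and semantic alignment, with a 20% improvement, and integrates seamlessly into large-scale models.

Insight: Explicitly separating the learning objectives for responsibility and semantic alignment in a dual-module design can substantially improve a generative model's overall performance while keeping output quality high.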

Abstract: The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.

[32] Autoguided Online Data Curation for Diffusion Model Training

Valeria Pais,Luis Oala,Daniele Faccio,Marco Aversa

Main category: cs.CV

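TL;DR: The paper investigates whether autoguidance and online data selection can improve the efficiency and sample quality of diffusion model training. Autoguidance clearly improves sample quality and diversity, while early data selection offers some efficiency gains at the cost of added complexity.

Details Motivation: The high cost of training generative models has pushed researchers toward efficient data curation. The paper evaluates whether autoguidance and online data selection can make diffusion model training more efficient.

Contribution: Integrates joint example selection (JEST) and autoguidance into a unified framework for fast ablation and benchmarking, and clarifies the strengths and weaknesses of the two methods.

Method: Compares autoguidance and online data selection on a 2-D synthetic data task and a (3x64x64)-D image generation task, accounting for time overhead and number of samples.

Result: Autoguidance clearly improves sample quality and diversity; early data selection has some efficiency advantage but higher time overhead and complexity.

Insight: Targeted data selection may improve efficiency in early training, but robust improvements in sample quality come mainly from autoguidance.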

Abstract: The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

[33] PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images

Emanuele Ricco,Elia Onofri,Lorenzo Cima,Stefano Cresci,Roberto Di Pietro

Main category: cs.CV

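TL;DR: The paper proposes PRISM, a framework for fingerprinting AI-generated images that attributes images to models using amplitude and phase information from the Fourier transform, and it reaches high accuracy on several datasets.

Details Motivation: As generative AI spreads, a reliable way to identify the source of AI-generated content is needed, especially in commercial settings where users need assurance about the authenticity and ownership of content.

Contribution: 1. Proposes the PRISM framework, which captures model-specific signatures using a radial reduction of the discrete Fourier transform together with phase information; 2. Builds the PRISM-36K dataset of 36,000 images; 3. Achieves high accuracy on several benchmarks, highlighting the effectiveness of frequency-domain fingerprinting.

Method: 1. Apply a radial reduction of the discrete Fourier transform; 2. Combine amplitude and phase information; 3. Cluster the output with linear discriminant analysis (LDA) for model attribution.

Result: PRISM reaches 92.04% attribution accuracy on PRISM-36K, an average accuracy of 81.60% on four benchmarks from the literature, and an average accuracy of 88.41% on the binary task of detecting real versus synthetic images.

Insight: Frequency-domain fingerprinting is an effective approach to cross-architecture and cross-dataset model attribution and offers a viable solution for accountability and trust in generative AI systems.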

Abstract: A critical need has emerged for generative AI: attribution methods. That is, solutions that can identify the model originating AI-generated content. This feature, generally relevant in multimodal applications, is especially sensitive in commercial settings where users subscribe to paid proprietary services and expect guarantees about the source of the content they receive. To address these issues, we introduce PRISM, a scalable Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images. PRISM is based on a radial reduction of the discrete Fourier transform that leverages amplitude and phase information to capture model-specific signatures. The output of the above process is subsequently clustered via linear discriminant analysis to achieve reliable model attribution in diverse settings, even if the model’s internal details are inaccessible. To support our work, we construct PRISM-36K, a novel dataset of 36,000 images generated by six text-to-image GAN- and diffusion-based models. On this dataset, PRISM achieves an attribution accuracy of 92.04%. We additionally evaluate our method on four benchmarks from the literature, reaching an average accuracy of 81.60%. Finally, we evaluate our methodology also in the binary task of detecting real vs fake images, achieving an average accuracy of 88.41%. We obtain our best result on GenImage with an accuracy of 95.06%, whereas the original benchmark achieved 82.20%. Our results demonstrate the effectiveness of frequency-domain fingerprinting for cross-architecture and cross-dataset model attribution, offering a viable solution for enforcing accountability and trust in generative AI systems.
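To give a feel for what a radial reduction of the discrete Fourier transform looks like, the sketch below averages amplitude and phase over concentric rings of the centred 2-D spectrum. It is an assumption-laden illustration, not PRISM's actual feature extractor: the function name, the number of bins, and the plain arithmetic averaging of phase angles are all choices made here for brevity.

```python
import numpy as np

def radial_signature(image, n_bins=16):
    """Ring-averaged amplitude and phase of the centred 2-D DFT of a grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    # Assign each frequency to one of n_bins concentric rings.
    bins = np.minimum((radius / radius.max() * n_bins).astype(int), n_bins - 1).ravel()
    counts = np.maximum(np.bincount(bins, minlength=n_bins), 1)
    amp = np.bincount(bins, weights=np.abs(spectrum).ravel(), minlength=n_bins) / counts
    phase = np.bincount(bins, weights=np.angle(spectrum).ravel(), minlength=n_bins) / counts
    return np.concatenate([amp, phase])
```

The resulting fixed-length vector could then be passed to a classifier; the abstract mentions linear discriminant analysis for that step, which is omitted here.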

[34] Large Vision Models Can Solve Mental Rotation Problems

Sebastian Ray Mason,Anders Gjølbye,Phillip Chavarria Højbjerg,Lenka Tětková,Lars Kai Hansen

Main category: cs.CV

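TL;DR: The paper systematically evaluates several vision transformer models (ViT, CLIP, DINOv2, and DINOv3) on mental rotation tasks and finds that self-supervised ViTs capture geometric structure better than supervised ViTs, that intermediate layers outperform final layers, and that task difficulty rises with rotation complexity and occlusion.

Details Motivation: Mental rotation is a key test of human spatial reasoning, yet it is unclear how well modern vision transformers handle such tasks. The study explores how far these models develop human-like spatial reasoning abilities.

Contribution: The main contribution is a systematic evaluation of ViT, CLIP, and related models on a range of mental rotation tasks, revealing the advantage of self-supervised ViTs, the importance of intermediate layers, and the similarity between task difficulty and human behavior.

Method: Model representations are probed layer by layer to evaluate ViT, CLIP, DINOv2, and DINOv3 on mental rotation tasks ranging from simple to complex, covering block structures, text, and real-world objects.

Result: 1) Self-supervised ViTs capture geometric structure better than supervised ViTs; 2) intermediate layers perform better than final layers; 3) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times.

Insight: The results show the potential of vision transformers on spatial reasoning tasks and suggest that their representation spaces are subject to constraints similar to those of human cognition, pointing to directions for further study.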

Abstract: Mental rotation is a key test of spatial reasoning in humans and has been central to understanding how perception supports cognition. Despite the success of modern vision transformers, it is still unclear how well these models develop similar abilities. In this work, we present a systematic evaluation of ViT, CLIP, DINOv2, and DINOv3 across a range of mental-rotation tasks, from simple block structures similar to those used by Shepard and Metzler to study human cognition, to more complex block figures, three types of text, and photo-realistic objects. By probing model representations layer by layer, we examine where and how these networks succeed. We find that i) self-supervised ViTs capture geometric structure better than supervised ViTs; ii) intermediate layers perform better than final layers; iii) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

[35] Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

Yannis Kaltampanidis,Alexandros Doumanoglou,Dimitrios Zarpalas

Main category: cs.CV

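TL;DR: This paper analyzes the representational power of self-supervised Vision Transformers (ViTs) across downstream tasks, focusing on the intrinsic performance of unmodified features, and offers guidance on choosing token types and decision rules.

Details Motivation: In existing work, self-supervised ViT features usually pass through additional processing layers or distillation to reach their best performance, and a systematic analysis of the intrinsic representational capability of the unaltered features is missing. This paper aims to fill that gap.

Contribution: To the authors' knowledge, this is the first systematic evaluation of unmodified ViT features on image classification and segmentation. It characterizes their inherent representational capability and gives guidance on choosing the token type and decision rule for a given task and pre-training objective.

Method: Hyperplane-based (logistic regression) and cosine-similarity-based classification and segmentation rules are used to analyze how different token types (keys, queries, values, and others) perform across tasks and pre-trained models.

Result: The paper reports detailed experimental results on two widely used datasets, showing how unmodified features perform and recommending token types and decision rules.

Insight: Self-supervised ViT features carry substantial representational power across tasks and can perform well without additional processing, while the choice of token type has a significant effect on task performance.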

Abstract: Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in standard and few-shot downstream contexts. Two pre-training objectives dominate the landscape of SSL techniques: Contrastive Learning and Masked Image Modeling. Features (or tokens) extracted from the final transformer attention block – specifically, the keys, queries, and values – as well as features obtained after the final block’s feed-forward layer, have become a common foundation for addressing downstream tasks. However, in many existing approaches, these pre-trained ViT features are further processed through additional transformation layers, often involving lightweight heads or combined with distillation, to achieve superior task performance. Although such methods can improve task outcomes, to the best of our knowledge, a comprehensive analysis of the intrinsic representation capabilities of unaltered ViT features has yet to be conducted. This study aims to bridge this gap by systematically evaluating the use of these unmodified features across image classification and segmentation tasks, in both standard and few-shot contexts. The classification and segmentation rules that we use are either hyperplane based (as in logistic regression) or cosine-similarity based, both of which rely on the presence of interpretable directions in the ViT’s latent space. Based on the previous rules and without the use of additional feature transformations, we conduct an analysis across token types, tasks, and pre-trained ViT models. This study provides insights into the optimal choice for token type and decision rule based on the task, context, and the pre-training objective, while reporting detailed findings on two widely-used datasets.
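
As an illustration of the cosine-similarity decision rule mentioned in the abstract, the following sketch classifies a feature vector by its nearest class prototype. The use of class-mean prototypes and the toy two-dimensional features are assumptions made for this example; the paper's exact protocol is not given in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def class_prototypes(features, labels):
    """Mean feature vector per class (hypothetical prototype choice)."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(feature, prototypes):
    """Assign the class whose prototype direction is closest in angle."""
    return max(prototypes, key=lambda y: cosine(feature, prototypes[y]))

# Toy stand-ins for frozen ViT token features (e.g. keys or values).
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]
protos = class_prototypes(train_features, train_labels)
```

Because only the direction of the feature matters, this rule needs no trained head, which is what makes it a useful probe of the unmodified latent space. A hyperplane rule such as logistic regression would instead learn a weight vector and bias per class.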

[36] How Good are Foundation Models in Step-by-Step Embodied Reasoning?

Dinura Dissanayake,Ahmed Heakl,Omkar Thawakar,Noor Ahsan,Ritesh Thawkar,Ketan More,Jean Lahoud,Rao Anwer,Hisham Cholakkal,Ivan Laptev,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

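TL;DR: This paper proposes the FoMER benchmark for evaluating the step-by-step reasoning of foundation models in embodied environments, and analyzes the performance and limitations of large multimodal models (LMMs) on complex decision-making tasks.

Details Motivation: Current LMMs perform well at visual understanding and language generation, but their ability to carry out structured reasoning in embodied tasks remains underexplored. The study aims to fill this gap.

Contribution: 1) The FoMER benchmark, covering a diverse set of embodied reasoning tasks; 2) a new evaluation framework that separates perceptual grounding from action reasoning; 3) an empirical analysis of several LMMs.

Method: A large-scale task suite (over 1.1k samples, 10 tasks, 8 embodiments) and a new evaluation framework are used to analyze the reasoning ability of LMMs quantitatively.

Result: The study shows both the potential and the current limitations of LMMs in embodied reasoning, pointing to challenges and opportunities for future research in robot intelligence.

Insight: LMMs in embodied settings still need to coordinate perception and action reasoning better in order to be robust in real applications.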

Abstract: Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.

[37] CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

Min Zhang,Bo Jiang,Jie Zhou,Yimeng Liu,Xin Lin

Main category: cs.CV

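TL;DR: The paper proposes Conditional Domain prompt Learning (CoDoL), which uses readily available domain information to form prompts and improves vision-language embedding alignment, thereby improving out-of-distribution (OOD) generalization. It also proposes a lightweight Domain Meta Network (DMN) that generates input-conditional tokens for the images in each domain.

Details Motivation: Existing prompt-based CLIP methods have two main problems in OOD generalization: 1) inaccurate text descriptions degrade accuracy and robustness; 2) limited vision-language embedding alignment hurts generalization performance.

Contribution: Proposes CoDoL, which improves vision-language embedding alignment through conditional domain prompt learning and the DMN, and thereby improves OOD generalization.

Method: 1) Form prompts from readily available domain information; 2) use the DMN to generate input-conditional tokens that capture both instance-specific and domain-specific information.

Result: Experiments on four OOD benchmarks (PACS, VLCS, OfficeHome, and DigitDG) validate CoDoL, showing improved vision-language embedding alignment and OOD generalization.

Insight: Jointly designing conditional domain prompt learning and the DMN can effectively improve the performance of CLIP methods on out-of-distribution tasks.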

Abstract: Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive performance, prompt-based CLIP methods still suffer from: i) inaccurate text descriptions, which degrade accuracy and robustness and pose a challenge for zero-shot CLIP methods; ii) limited vision-language embedding alignment, which significantly affects generalization performance. To tackle these issues, this paper proposes a novel Conditional Domain prompt Learning (CoDoL) method, which utilizes readily available domain information to form prompts and improves the vision-language embedding alignment, thereby improving OOD generalization. To capture both instance-specific and domain-specific information, we further propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain. Extensive experiments on four OOD benchmarks (PACS, VLCS, OfficeHome and DigitDG) validate the effectiveness of our proposed CoDoL in terms of improving the vision-language embedding alignment as well as the out-of-distribution generalization performance.

[38] Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception

Yulin Wang,Yang Yue,Yang Yue,Huanqian Wang,Haojun Jiang,Yizeng Han,Zanlin Ni,Yifan Pu,Minglei Shi,Rui Lu,Qisen Yang,Andrew Zhao,Zhuofan Xia,Shiji Song,Gao Huang

Main category: cs.CV

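TL;DR: The paper proposes AdaptiveNN, a framework that imitates the active, adaptive nature of human vision to achieve efficient and flexible machine visual perception, substantially reducing computational cost and improving interpretability.

Details Motivation: Human vision is efficient and adaptive, dynamically fixating on task-relevant regions, whereas existing machine vision models usually process the entire scene passively, which consumes large resources and lacks flexibility.

Contribution: Proposes the AdaptiveNN framework, which models visual perception as a coarse-to-fine sequential decision-making process and integrates representation learning with self-rewarding reinforcement learning.

Method: The model attends to task-relevant regions step by step through sequential decisions, incrementally combines information across fixations, and is trained end to end with self-rewarding reinforcement learning, without extra supervision on fixation locations.

Result: Across 17 benchmarks spanning 9 tasks, AdaptiveNN reduces inference cost by up to 28x without sacrificing accuracy and exhibits human-like behavior patterns.

Insight: Beyond improving computational efficiency, AdaptiveNN provides a tool for studying human visual cognition and shows the potential of machine vision that imitates human adaptive perception.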

Abstract: Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from ‘passive’ to ‘active, adaptive’ vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at https://github.com/LeapLabTHU/AdaptiveNN.

[39] LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition

Jiuyi Xu,Qing Jin,Meida Chen,Andrew Feng,Yang Sui,Yangming Shi

Main category: cs.CV

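TL;DR: LowDiff is an efficient diffusion sampling framework that uses a multi-resolution strategy to reduce the number of high-resolution sampling steps, substantially improving generation efficiency while matching or exceeding the performance of existing models.

Details Motivation: Diffusion models sample slowly. Existing methods mostly compress the model or reduce the number of denoising steps and do not fully exploit multiple input resolutions.

Contribution: Proposes the LowDiff framework, which generates images in a multi-resolution cascade and uses a single unified model to refine them progressively, improving efficiency without losing quality.

Method: A cascaded approach generates images from low to high resolution and applies to diffusion models in both pixel space and latent space.

Result: On CIFAR-10, FFHQ, and ImageNet, throughput improves by more than 50%, with FID and IS comparable to or better than existing methods.

Insight: A multi-resolution strategy can substantially improve the efficiency of diffusion models while preserving generation quality, which is relevant for practical deployment.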

Abstract: Diffusion models have achieved remarkable success in image generation, but their practical application is often hindered by slow sampling speed. Prior efforts to improve efficiency primarily focus on compressing models or reducing the total number of denoising steps, largely neglecting the possibility of leveraging multiple input resolutions in the generation process. In this work, we propose LowDiff, a novel and efficient diffusion framework based on a cascaded approach that generates outputs of increasingly higher resolution. LowDiff employs a unified model to progressively refine images from low resolution to the desired resolution. With the proposed architecture design and generation techniques, we achieve comparable or even superior performance with far fewer high-resolution sampling steps. LowDiff is applicable to diffusion models in both pixel space and latent space. Extensive experiments on both conditional and unconditional generation tasks across CIFAR-10, FFHQ and ImageNet demonstrate the effectiveness and generality of our method. Results show over 50% throughput improvement across all datasets and settings while maintaining comparable or better quality. On unconditional CIFAR-10, LowDiff achieves an FID of 2.11 and IS of 9.87, while on conditional CIFAR-10, an FID of 1.94 and IS of 10.03. On FFHQ 64x64, LowDiff achieves an FID of 2.43, and on ImageNet 256x256, LowDiff built on LightningDiT-B/1 produces high-quality samples with an FID of 4.00 and an IS of 195.06, together with substantial efficiency gains.

[40] MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

Yu Chang,Jiahao Chen,Anzhe Cheng,Paul Bogdan

Main category: cs.CV

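TL;DR: MaskAttn-SDXL introduces a region-level gating mechanism that addresses compositional failures on multi-object prompts, improving spatial compliance and attribute binding while preserving image quality and diversity.

Details Motivation: Text-to-image diffusion models often fail compositionally on multi-object prompts: entities entangle, attributes mix across objects, and spatial relations are violated. MaskAttn-SDXL is proposed to address these failures.

Contribution: Proposes MaskAttn-SDXL, a region-level gating mechanism that sparsifies token-to-latent interactions to improve compositional control, without positional encodings, auxiliary tokens, or external region masks.

Method: The method learns a binary mask per layer on the cross-attention logits of SDXL's UNet and injects it into each cross-attention logit map before softmax, so that only semantically relevant connections remain active.

Result: Experiments show improved spatial compliance and attribute binding on multi-object prompts, with image quality and diversity preserved.

Insight: Logit-level masked cross-attention is an efficient primitive for compositional control and a practical extension for spatial control in text-to-image generation.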

Abstract: Text-to-image diffusion models achieve impressive realism but often suffer from compositional failures on prompts with multiple objects, attributes, and spatial relations, resulting in cross-token interference where entities entangle, attributes mix across objects, and spatial cues are violated. To address these failures, we propose MaskAttn-SDXL, a region-level gating mechanism applied to the cross-attention logits of Stable Diffusion XL (SDXL)’s UNet. MaskAttn-SDXL learns a binary mask per layer, injecting it into each cross-attention logit map before softmax to sparsify token-to-latent interactions so that only semantically relevant connections remain active. The method requires no positional encodings, auxiliary tokens, or external region masks, and preserves the original inference path with negligible overhead. In practice, our model improves spatial compliance and attribute binding in multi-object prompts while preserving overall image quality and diversity. These findings demonstrate that logit-level masked cross-attention is a data-efficient primitive for enforcing compositional control, and our method thus serves as a practical extension for spatial control in text-to-image generation.
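
The idea of injecting a binary mask into the cross-attention logits before softmax can be sketched for a single latent position as follows. This is only an illustration: in the paper the mask is learned per layer, whereas here it is supplied by hand, and the function name `masked_softmax` is hypothetical.

```python
import math

def masked_softmax(logits, mask):
    """Softmax over text tokens after gating the logits with a binary mask.

    logits: attention logits from one latent position to each text token.
    mask:   1 keeps a token-to-latent connection, 0 removes it.
    """
    neg_inf = float("-inf")
    gated = [l if m else neg_inf for l, m in zip(logits, mask)]
    peak = max(gated)
    if peak == neg_inf:
        # Every connection is masked; return all-zero attention.
        return [0.0] * len(logits)
    exps = [math.exp(g - peak) for g in gated]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# One latent position attending to three prompt tokens; the second is gated off.
weights = masked_softmax([2.0, 1.0, 0.5], [1, 0, 1])
```

Masking at the logit level, rather than zeroing the weights after softmax, keeps the remaining attention weights normalized to one, so the rest of the inference path is unchanged.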

[41] RaceGAN: A Framework for Preserving Individuality while Converting Racial Information for Image-to-Image Translation

Mst Tasnim Pervin,George Bebis,Fang Jiang,Alireza Tavakkoli

Main category: cs.CV

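TL;DR: RaceGAN is a multi-domain image-to-image translation framework that focuses on translating racial attributes while preserving individuality and high-level semantics, without relying on a reference image.

Details Motivation: Existing style translation models (such as CycleGAN, StarGANv2, and StyleGAN) are limited in multi-domain image translation. In particular, when translating racial attributes they either fail to preserve individuality or require an additional reference image.

Contribution: Proposes the RaceGAN framework, which maps style codes across multiple domains during racial attribute translation, preserves individuality and high-level semantics, and needs no reference image.

Method: RaceGAN is built on generative adversarial networks (GANs). It translates racial attributes through multi-domain style mapping, and the authors examine how the latent space clusters faces by ethnic group.

Result: On the Chicago Face Dataset, RaceGAN outperforms other models in translating racial features (Asian, White, and Black), and classification-based quantitative results support its effectiveness.

Insight: RaceGAN shows how the latent space can be partitioned into distinct clusters of faces per ethnic group, and offers a reference-free solution for multi-domain image translation.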

Abstract: Generative adversarial networks (GANs) have demonstrated significant progress in unpaired image-to-image translation in recent years for several applications. CycleGAN was the first to lead the way, although it was restricted to a pair of domains. StarGAN overcame this constraint by tackling image-to-image translation across various domains, although it was not able to map in-depth low-level style changes for these domains. Style mapping via reference-guided image synthesis has been made possible by the innovations of StarGANv2 and StyleGAN. However, these models do not maintain individuality and need an extra reference image in addition to the input. Our study aims to translate racial traits by means of multi-domain image-to-image translation. We present RaceGAN, a novel framework capable of mapping style codes over several domains during racial attribute translation while maintaining individuality and high-level semantics without relying on a reference image. RaceGAN outperforms other models in translating racial features (i.e., Asian, White, and Black) when tested on the Chicago Face Dataset. We also give quantitative findings utilizing InceptionResNetV2-based classification to demonstrate the effectiveness of our racial translation. Moreover, we investigate how well the model partitions the latent space into distinct clusters of faces for each ethnic group.

[42] Generating Part-Based Global Explanations Via Correspondence

Kunal Rathore,Prasad Tadepalli

Main category: cs.CV

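TL;DR: Proposes a method based on user-defined part labels that transfers a limited set of annotations to a large dataset and aggregates part-based local explanations into global symbolic explanations.

Details Motivation: Deep learning models lack interpretability. Existing methods are mostly limited to local explanations of individual images, while global concept-based explanations require costly annotation. This paper aims to produce understandable global explanations by extending a small set of part labels.

Contribution: Proposes a new method that takes a small number of user-defined part labels and, through label transfer, generates global symbolic explanations for a large dataset.

Method: User-defined part labels from a limited set of images are transferred efficiently to a larger dataset, and the resulting part-based local explanations are aggregated into global symbolic explanations.

Result: The method efficiently produces human-understandable global explanations while reducing labeling cost.

Insight: Aggregating local explanations can yield meaningful global explanations, giving broader transparency into model decisions.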

Abstract: Deep learning models are notoriously opaque. Existing explanation methods often focus on localized visual explanations for individual images. Concept-based explanations, while offering global insights, require extensive annotations, incurring significant labeling cost. We propose an approach that leverages user-defined part labels from a limited set of images and efficiently transfers them to a larger dataset. This enables the generation of global symbolic explanations by aggregating part-based local explanations, ultimately providing human-understandable explanations for model decisions on a large scale.

[43] Causal Fingerprints of AI Generative Models

Hui Xu,Chi Liu,Congcong Zhu,Minghao Wang,Youyang Qu,Longxiang Gao

Main category: cs.CV

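TL;DR: This paper proposes the concept of a causal fingerprint, which is disentangled from image content and style in generative models, improving the accuracy and generalization of model attribution.

Details Motivation: Existing methods rely on model-specific cues or synthesis artifacts, so their fingerprints generalize poorly. The paper proposes causal fingerprints to reflect more completely the causal relationship between image provenance and model traces.

Contribution: 1. Introduces the concept of a causal fingerprint; 2. designs a causality-decoupling framework that separates the fingerprint from content and style in a semantic-invariant latent space; 3. uses diverse feature representations to enhance fingerprint granularity.

Method: 1. Build a semantic-invariant latent space from pre-trained diffusion reconstruction residuals; 2. separate the fingerprint through causal decoupling; 3. combine diverse features to strengthen the fingerprint representation.

Result: Experiments show that the method outperforms existing approaches in attributing GANs and diffusion models, supporting applications such as forgery detection, copyright tracing, and identity protection.

Insight: 1. Causality is key to model attribution; 2. separating content and style improves fingerprint robustness; 3. diverse features are essential for fingerprint granularity.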

Abstract: AI generative models leave implicit traces in their generated images, which are commonly referred to as model fingerprints and are exploited for source attribution. Prior methods rely on model-specific cues or synthesis artifacts, yielding limited fingerprints that may generalize poorly across different generative models. We argue that a complete model fingerprint should reflect the causality between image provenance and model traces, a direction largely unexplored. To this end, we conceptualize the causal fingerprint of generative models, and propose a causality-decoupling framework that disentangles it from image-specific content and style in a semantic-invariant latent space derived from pre-trained diffusion reconstruction residual. We further enhance fingerprint granularity with diverse feature representations. We validate causality by assessing attribution performance across representative GANs and diffusion models and by achieving source anonymization using counterfactual examples generated from causal fingerprints. Experiments show our approach outperforms existing methods in model attribution, indicating strong potential for forgery detection, model copyright tracing, and identity protection.

[44] NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training

Moinak Bhattacharya,Angelica P. Kurtz,Fabio M. Iwamoto,Prateek Prasanna,Gagandeep Singh

Main category: cs.CV

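TL;DR: This paper proposes NeuroRAD-FM, a foundation model designed for neuro-oncology that uses distributionally robust optimization (DRO) to improve cross-institution generalization and the prediction of uncommon molecular markers.

Details Motivation: Data heterogeneity and tumor complexity in neuro-oncology limit the generalization of existing foundation models, which perform especially poorly at predicting uncommon molecular markers.

Contribution: Develops NeuroRAD-FM, which combines self-supervised pretraining with DRO and improves molecular marker classification and survival prediction, with the largest gains on uncommon markers.

Method: 1) Pretrain on MRI data with self-supervised methods (BYOL, DINO, MAE, MoCo); 2) apply DRO to mitigate site and class imbalance; 3) validate on several downstream tasks (molecular marker classification and survival prediction).

Result: Performance improves on multi-institutional data: AUC for an uncommon marker rises by up to 0.19 (CDKN2A/2B, 0.73 to 0.92), and the survival c-index improves at all sites. Grad-CAM highlights tumor and peri-tumoral regions.

Insight: Coupling foundation models with DRO can improve cross-institution generalization, especially for underrepresented endpoints. Prospective validation and integration of longitudinal and interventional signals are still needed to advance precision neuro-oncology.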

Abstract: Neuro-oncology poses unique challenges for machine learning due to heterogeneous data and tumor complexity, limiting the ability of foundation models (FMs) to generalize across cohorts. Existing FMs also perform poorly in predicting uncommon molecular markers, which are essential for treatment response and risk stratification. To address these gaps, we developed a neuro-oncology specific FM with a distributionally robust loss function, enabling accurate estimation of tumor phenotypes while maintaining cross-institution generalization. We pretrained self-supervised backbones (BYOL, DINO, MAE, MoCo) on multi-institutional brain tumor MRI and applied distributionally robust optimization (DRO) to mitigate site and class imbalance. Downstream tasks included molecular classification of common markers (MGMT, IDH1, 1p/19q, EGFR), uncommon alterations (ATRX, TP53, CDKN2A/2B, TERT), continuous markers (Ki-67, TP53), and overall survival prediction in IDH1 wild-type glioblastoma at UCSF, UPenn, and CUIMC. Our method improved molecular prediction and reduced site-specific embedding differences. At CUIMC, mean balanced accuracy rose from 0.744 to 0.785 and AUC from 0.656 to 0.676, with the largest gains for underrepresented endpoints (CDKN2A/2B accuracy 0.86 to 0.92, AUC 0.73 to 0.92; ATRX AUC 0.69 to 0.82; Ki-67 accuracy 0.60 to 0.69). For survival, c-index improved at all sites: CUIMC 0.592 to 0.597, UPenn 0.647 to 0.672, UCSF 0.600 to 0.627. Grad-CAM highlighted tumor and peri-tumoral regions, confirming interpretability. Overall, coupling FMs with DRO yields more site-invariant representations, improves prediction of common and uncommon markers, and enhances survival discrimination, underscoring the need for prospective validation and integration of longitudinal and interventional signals to advance precision neuro-oncology.
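
The abstract does not specify the DRO formulation. As one common instance, the sketch below shows a group-DRO-style reweighting step in which groups (for example, acquisition sites) with higher loss receive more weight. The exponentiated-gradient update, the step size `eta`, and the site names are assumptions for illustration, not the authors' method.

```python
import math

def group_dro_step(losses, groups, weights, eta=0.1):
    """One group-DRO reweighting step.

    losses:  per-sample losses.
    groups:  group id (e.g. site) of each sample.
    weights: dict mapping every group id to its current weight (sums to 1).
    Returns (robust_loss, updated_weights).
    """
    # Mean loss per group present in the batch.
    sums, counts = {}, {}
    for loss, g in zip(losses, groups):
        sums[g] = sums.get(g, 0.0) + loss
        counts[g] = counts.get(g, 0) + 1
    group_loss = {g: sums[g] / counts[g] for g in sums}

    # Exponentiated-gradient ascent: up-weight groups that are doing worse.
    raw = {g: w * math.exp(eta * group_loss.get(g, 0.0)) for g, w in weights.items()}
    norm = sum(raw.values())
    new_weights = {g: v / norm for g, v in raw.items()}

    # Robust objective: weighted sum of per-group losses.
    robust_loss = sum(new_weights[g] * group_loss.get(g, 0.0) for g in new_weights)
    return robust_loss, new_weights

# Hypothetical batch: site_b is currently harder than site_a.
robust, w = group_dro_step(
    losses=[0.2, 0.2, 1.0, 1.0],
    groups=["site_a", "site_a", "site_b", "site_b"],
    weights={"site_a": 0.5, "site_b": 0.5},
    eta=1.0,
)
```

Minimizing the robust loss instead of the plain average pushes the model toward representations that do not sacrifice the worst-performing site or class, which is consistent with the site-invariance goal described in the abstract.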

[45] ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models

Chung-En Johnny Yu,Hsuan-Chih Chen,Brian Jalaian,Nathaniel D. Bastian

Main category: cs.CV

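TL;DR: ORCA is an agentic reasoning framework that uses structured test-time reasoning to improve the factual accuracy and adversarial robustness of vision-language models, mitigating hallucinations and adversarial attacks.

Details Motivation: Large vision-language models perform well on multimodal tasks but remain vulnerable to hallucinations and adversarial attacks, which limits their reliability in real applications.

Contribution: ORCA provides an agentic reasoning framework built on an Observe–Reason–Critique–Act loop and a suite of small vision models, improving performance without retraining.

Method: ORCA runs an Observe–Reason–Critique–Act loop, queries multiple visual tools, validates cross-model inconsistencies, refines predictions iteratively, and stores intermediate reasoning traces to support auditing.

Result: On the POPE benchmark, ORCA improves LVLM performance by +3.64% to +40.67%; under adversarial perturbations the average accuracy gain is +20.11%; combined with defense techniques on adversarially perturbed AMBER images, gains range from +1.20% to +48.00%.

Insight: Although designed to mitigate hallucinations, ORCA also exhibits emergent adversarial robustness, pointing to a way of improving reliability without adversarial training.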

Abstract: Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through test-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe–Reason–Critique–Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across evaluation metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

[46] Region-Aware Deformable Convolutions

Abolfazl Saheban Maleki,Maryam Imani

Main category: cs.CV

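TL;DR: RAD-Conv is a new convolutional operator that dynamically adjusts the shape and size of the receptive field, improving a network's ability to adapt to complex image structures.

Details Motivation: Traditional deformable convolutions are limited to fixed quadrilateral sampling areas, which makes it hard to capture both local details and long-range dependencies precisely. RAD-Conv is designed to address this.

Contribution: Proposes Region-Aware Deformable Convolution (RAD-Conv), which uses four boundary offsets to adjust the shape and size of the receptive field dynamically, combining the adaptability of attention mechanisms with the efficiency of standard convolutions.

Method: RAD-Conv introduces four boundary offsets per kernel element, forming flexible rectangular regions that give precise control over the width and height of the receptive field.

Result: RAD-Conv can capture both local details and long-range dependencies, even with small 1x1 kernels.

Insight: Decoupling the shape of the receptive field from the structure of the kernel offers a route to more efficient and expressive vision models.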

Abstract: We introduce Region-Aware Deformable Convolution (RAD-Conv), a new convolutional operator that enhances neural networks’ ability to adapt to complex image structures. Unlike traditional deformable convolutions, which are limited to fixed quadrilateral sampling areas, RAD-Conv uses four boundary offsets per kernel element to create flexible, rectangular regions that dynamically adjust their size and shape to match image content. This approach allows precise control over the receptive field’s width and height, enabling the capture of both local details and long-range dependencies, even with small 1x1 kernels. By decoupling the receptive field’s shape from the kernel’s structure, RAD-Conv combines the adaptability of attention mechanisms with the efficiency of standard convolutions. This innovative design offers a practical solution for building more expressive and efficient vision models, bridging the gap between rigid convolutional architectures and computationally costly attention-based methods.
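
The following toy sketch illustrates how four boundary offsets per kernel element could define a rectangular sampling region for a 1x1 kernel. Several details are assumptions not stated in the abstract: the offsets are given as non-negative integers rather than predicted from the image, the region is aggregated by a plain mean, and the function name `rad_conv_1x1` is hypothetical.

```python
def rad_conv_1x1(image, offsets, weight=1.0):
    """Toy region-aware 1x1 convolution on a single-channel image.

    image:   2-D list of floats.
    offsets: offsets[y][x] = (top, bottom, left, right), non-negative ints
             giving how far the region extends from pixel (y, x).
    weight:  the single 1x1 kernel weight.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            top, bottom, left, right = offsets[y][x]
            # Clamp the rectangle to the image borders.
            y0, y1 = max(0, y - top), min(h - 1, y + bottom)
            x0, x1 = max(0, x - left), min(w - 1, x + right)
            region = [image[j][i]
                      for j in range(y0, y1 + 1)
                      for i in range(x0, x1 + 1)]
            # Assumed aggregation: mean over the rectangle, then the kernel weight.
            out[y][x] = weight * sum(region) / len(region)
    return out

image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
zero = [[(0, 0, 0, 0)] * 3 for _ in range(3)]
grown = [row[:] for row in zero]
grown[1][1] = (1, 0, 1, 0)  # centre pixel looks one step up and one step left
```

With all offsets at zero the operator reduces to an ordinary 1x1 convolution; enlarging the offsets widens the receptive field without changing the kernel size, which is the decoupling the abstract describes.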

[47] Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture

Thomas Z. Li,Aravind R. Krishnan,Lianrui Zuo,John M. Still,Kim L. Sandler,Fabien Maldonado,Thomas A. Lasko,Bennett A. Landman

Main category: cs.CV

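TL;DR: This paper applies a joint-embedding predictive architecture (JEPA) that uses self-supervised learning to extract features from multimodal medical data for pulmonary nodule diagnosis. It outperforms baseline models on an internal cohort but underperforms on an external cohort, and the authors analyze the context in which its performance is limited.

Details Motivation: Multimodal models for pulmonary nodule diagnosis are held back by the scarcity of labeled data and a tendency to overfit the training distribution, which motivates self-supervised learning from unlabeled multimodal data.

Contribution: 1. Proposes a self-supervised JEPA approach that uses unlabeled longitudinal, multimodal medical data to improve model performance.
2. Validates JEPA on pulmonary nodule diagnosis and documents its limitations on an external dataset.

Method: 1. Curate an unlabeled set of patients with CT scans and linked electronic health records.
2. Pretrain with JEPA, learning imaging and clinical features through joint embedding.
3. Evaluate model performance after supervised finetuning.

Result: In the internal cohort, the approach outperforms an unregularized multimodal model (AUC 0.91 vs. 0.88) and an imaging-only model (AUC 0.91 vs. 0.73). In the external cohort it underperforms the imaging-only model (AUC 0.72 vs. 0.75).

Insight: Self-supervised learning improves performance on internal data, but generalization is limited by distribution differences in external data. The authors use a synthetic environment to characterize the context in which JEPA may underperform.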

Abstract: The development of multimodal models for pulmonary nodule diagnosis is limited by the scarcity of labeled data and the tendency for these models to overfit on the training distribution. In this work, we leverage self-supervised learning from longitudinal and multimodal archives to address these challenges. We curate an unlabeled set of patients with CT scans and linked electronic health records from our home institution to power joint embedding predictive architecture (JEPA) pretraining. After supervised finetuning, we show that our approach outperforms an unregularized multimodal model and imaging-only model in an internal cohort (ours: 0.91, multimodal: 0.88, imaging-only: 0.73 AUC), but underperforms in an external cohort (ours: 0.72, imaging-only: 0.75 AUC). We develop a synthetic environment that characterizes the context in which JEPA may underperform. This work introduces an approach that leverages unlabeled multimodal medical archives to improve predictive models and demonstrates its advantages and limitations in pulmonary nodule diagnosis.

[48] Efficient Multimodal Dataset Distillation via Generative Models

Zhenghao Zhao,Haoxuan Wang,Junyi Wu,Yuzhang Shang,Gaowen Liu,Yan Yan

Main category: cs.CV

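TL;DR: This paper proposes EDGE, an efficient generative method for multimodal dataset distillation. It addresses the high compute and long runtime of existing methods, optimizes the generative model with a bi-directional contrastive loss and a diversity loss, and adds a caption synthesis strategy to improve performance.

Details Motivation: As multimodal data grows, existing dataset distillation methods cannot meet practical needs because of their compute requirements and runtime (for example, those based on the Matching Training Trajectories algorithm). A more efficient multimodal distillation method is needed.

Contribution: 1) Proposes EDGE, which uses generative models for efficient multimodal dataset distillation. 2) Introduces a bi-directional contrastive loss and a diversity loss to address weak image-caption correlation and limited sample diversity. 3) Proposes a caption synthesis strategy that improves text-to-image retrieval performance.

Method: 1) A generative model training workflow that combines a bi-directional contrastive loss and a diversity loss to improve the quality of generated samples. 2) A caption synthesis strategy that adds text information to improve correlation.

Result: The method performs strongly on Flickr30K, COCO, and CC3M and runs 18x faster than the state-of-the-art method.

Insight: Generative models are promising for multimodal dataset distillation, provided that sample correlation and diversity are addressed. The bi-directional contrastive loss and the diversity loss are effective remedies.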

Abstract: Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.
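
A bi-directional contrastive loss between image and caption embeddings is commonly written as a symmetric InfoNCE objective. The sketch below shows that generic form; it assumes unit-normalized embeddings, a temperature of 0.07, and matching pairs at the same index, none of which is confirmed by the abstract, and it does not include the diversity loss.

```python
import math

def _mean_nll_of_diagonal(sim, temperature):
    """Mean negative log-probability of the matching (diagonal) pair per row."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [s / temperature for s in sim[i]]
        peak = max(row)
        # Numerically stable log-sum-exp over the row.
        log_z = peak + math.log(sum(math.exp(r - peak) for r in row))
        total += log_z - row[i]
    return total / n

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses."""
    # Similarity matrix: rows are images, columns are captions.
    sim = [[sum(a * b for a, b in zip(i, t)) for t in txt_emb] for i in img_emb]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    return 0.5 * (_mean_nll_of_diagonal(sim, temperature)
                  + _mean_nll_of_diagonal(sim_t, temperature))

# Toy unit vectors: captions aligned with their images vs. swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = bidirectional_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each generated image is most similar to its own caption and large when pairs are mismatched, which is how such a term can encourage correlation between generated images and captions.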

[49] OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data

Björn Möller,Zhengyang Li,Malte Stelzer,Thomas Graave,Fabian Bettels,Muaaz Ataya,Tim Fingscheidt

Main category: cs.CV

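TL;DR: OpenViGA is an open video generation system for automotive driving scenes that fine-tunes open source models on public data (BDD100K) for efficient, reproducible video generation.

Details Motivation: Existing video generation systems usually rely on large models with high training costs, opaque design choices, and no public code or data. OpenViGA addresses these issues with a modular, open source solution.

Contribution: 1. An in-depth analysis of the system's three components (image tokenizer, world model, video decoder); 2. fine-tuning based on open source models and public data; 3. a coherent video generation system; 4. full reproducibility; 5. public code and models.

Method: Pre-trained open source models are fine-tuned on BDD100K data, and the interfaces between components are streamlined to build the video generation system.

Result: At 256x256 resolution and 4 fps, the system predicts realistic driving scene videos frame by frame with only one frame of algorithmic latency.

Insight: Combining a modular design with open source models and public data can reduce the complexity and cost of video generation while retaining performance.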

Abstract: Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows. First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we build purely upon powerful pre-trained open source models from various domains, which we fine-tune on publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining interfaces of our components. Fourth, due to public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on GitHub. For an image size of 256x256 at 4 fps we are able to predict realistic driving scene videos frame-by-frame with only one frame of algorithmic latency.

[50] Comparing Computational Pathology Foundation Models using Representational Similarity Analysis

Vaibhav Mishra,William Lotter

Main category: cs.CV

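TL;DR: This paper uses representational similarity analysis (RSA) to compare the learned representational structure of six computational pathology foundation models, examining how training paradigms relate to the representation space and offering insights for model robustness and ensembling strategies.

Details Motivation: Computational pathology (CPath) foundation models show promise on downstream tasks, but the structure and variability of their learned representations are not well understood. The study systematically analyzes these representation spaces to guide model development and deployment.

Contribution: 1. A systematic comparison of the representational structure of six CPath foundation models; 2. an analysis of how training paradigm (vision-language contrastive learning vs. self-distillation) relates to representational similarity; 3. evidence that stain normalization reduces slide dependence; 4. the finding that vision-language models have more compact representations, while vision-only models have more distributed ones.

Method: Using H&E image patches from TCGA as input, RSA is applied to compare the representational structure of six models (CONCH, PLIP, KEEP, UNI2, Virchow2, Prov-GigaPath) and to analyze their dependence on slide and disease.

Result: 1. UNI2 and Virchow2 have the most distinct representational structures; 2. Prov-GigaPath has the highest average similarity across models; 3. stain normalization decreases slide dependence by 5.5% (CONCH) to 20.5% (PLIP); 4. vision-language models have more compact representations.

Insight: 1. Sharing a training paradigm does not guarantee higher representational similarity; 2. slide dependence is high while disease dependence is relatively low; 3. stain normalization reduces slide dependence, pointing to an opportunity to improve robustness to slide-specific features; 4. the compact representations of vision-language models may inform model selection and ensembling.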

Abstract: Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI (v2), Virchow (v2), Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-GigaPath has the highest average similarity across models. Having the same training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed a high slide-dependence, but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models by a range of 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can help ensure effective development and deployment.

[51] SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

Abdarahmane Traore,Éric Hervet,Andy Couturier

Main category: cs.CV

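TL;DR: The paper proposes SmolRGPT, an efficient vision-language model with only 600M parameters that achieves strong spatial reasoning in warehouse environments and performs well in resource-constrained settings.

Details Motivation: Existing large vision-language models (VLMs) are powerful, but their compute and memory requirements are too high for deployment in resource-constrained environments such as warehouses, robotics, and industrial applications.

Contribution: Proposes SmolRGPT, a compact vision-language architecture that combines RGB and depth information and uses a three-stage curriculum to progressively align visual and language features for efficient spatial reasoning.

Method: SmolRGPT uses a three-stage curriculum: 1) aligning visual and language features; 2) learning spatial relationships; 3) adapting to task-specific datasets.

Result: With only 600M parameters, SmolRGPT achieves competitive results on warehouse spatial reasoning benchmarks, matching or exceeding much larger models.

Insight: Carefully designed small models can deliver efficient multimodal reasoning in resource-constrained environments without sacrificing core spatial reasoning capability.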

Abstract: Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code for the experiments will be available at: https://github.com/abtraore/SmolRGPT

[52] Lynx: Towards High-Fidelity Personalized Video Generation

Shen Sang,Tiancheng Zhi,Tianpei Gu,Jing Liu,Linjie Luo

Main category: cs.CV

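TL;DR: Lynx is a high-fidelity personalized video generation model built on a Diffusion Transformer (DiT); two lightweight adapters (ID-adapter and Ref-adapter) provide identity fidelity and detail injection.

Details Motivation: Personalized video generation struggles to maintain identity consistency and visual realism; Lynx aims to address this with lightweight modules.

Contribution: Proposes two adapters: the ID-adapter for identity fidelity and the Ref-adapter for injecting fine-grained details, which together improve video generation quality.

Method: A Perceiver Resampler converts ArcFace embeddings into identity tokens, and fine-grained details are injected into the transformer layers through cross-attention.

Result: On a benchmark of 40 subjects and 20 prompts, Lynx performs strongly on face resemblance, prompt following, and video quality.

Insight: Lightweight adapters can markedly improve identity consistency and detail in generated video without adding much compute cost.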

Abstract: We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation.

[53] MEC-Quant: Maximum Entropy Coding for Extremely Low Bit Quantization-Aware Training

Junbiao Pang,Tianyang Cai,Baochang Zhang

Main category: cs.CV

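TL;DR: MEC-Quant proposes a quantization-aware training method based on maximum entropy coding. It targets the bias introduced by extremely low-bit quantization and improves generalization by optimizing the structure of the representation.

Details Motivation: Current quantization-aware training (QAT) introduces representation bias in extremely low-bit settings, leaving performance below that of the full-precision model. The paper proposes MEC-Quant to address this.

Contribution: Proposes MEC-Quant, a maximum-entropy-coding objective that uses the minimal coding length as a surrogate for entropy to optimize representation structure, reduce bias, and improve generalization. It also derives a scalable reformulation of the objective based on Mixture of Experts (MOE) that supports fast computation and handles long-tailed distributions.

Method: 1) Minimal coding length as a tractable surrogate for entropy; 2) an MOE-based reformulation of the objective for efficient computation and long-tailed distributions; 3) end-to-end training.

Result: Across computer vision tasks, MEC-Quant pushes the limit of QAT to x-bit activation for the first time, with accuracy comparable to or even surpassing the full-precision counterpart, setting a new state of the art for QAT.

Insight: By optimizing representation structure, MEC-Quant shows the importance of reducing bias in extremely low-bit quantization and offers new theoretical and methodological support for quantized training.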

Abstract: Quantization-Aware Training (QAT) has drawn much attention as a way to produce efficient neural networks. Current QAT still performs worse than its Full Precision (FP) counterpart. In this work, we argue that quantization inevitably introduces biases into the learned representation, especially under the extremely low-bit setting. To cope with this issue, we propose Maximum Entropy Coding Quantization (MEC-Quant), a more principled objective that explicitly optimizes the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen in-distribution samples. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective based on Mixture Of Experts (MOE) that not only allows fast computation but also handles the long-tailed distribution for weights or activation values. Extensive experiments on various computer vision tasks demonstrate its superiority. With MEC-Quant, the limit of QAT is pushed to the x-bit activation for the first time, and the accuracy of MEC-Quant is comparable to or even surpasses the FP counterpart. Without bells and whistles, MEC-Quant establishes a new state of the art for QAT.

[54] GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents

Xianhang Ye,Yiqing Li,Wei Dai,Miancan Liu,Ziyuan Chen,Zhangye Han,Hongbo Min,Jinkui Ren,Xiantao Zhang,Wen Yang,Zhi Jin

Main category: cs.CV

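TL;DR: Proposes GUI-ARP, a framework that improves fine-grained grounding for GUI agents through adaptive region perception and multi-stage inference.

Details Motivation: Existing GUI grounding methods perform poorly at fine-grained localization in high-resolution screenshots.

Contribution: Proposes Adaptive Region Perception (ARP) and Adaptive Stage Controlling (ASC), which dynamically adjust visual attention and the inference strategy.

Method: A two-phase training pipeline that combines supervised fine-tuning with GRPO-based reinforcement fine-tuning.

Result: Achieves state-of-the-art performance on the ScreenSpot-Pro and UI-Vision benchmarks, with the 7B model outperforming some 72B models.

Insight: Dynamic multi-stage inference and cropping of task-relevant regions substantially improve GUI grounding.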

Abstract: Existing GUI grounding methods often struggle with fine-grained localization in high-resolution screenshots. To address this, we propose GUI-ARP, a novel framework that enables adaptive multi-stage inference. Equipped with the proposed Adaptive Region Perception (ARP) and Adaptive Stage Controlling (ASC), GUI-ARP dynamically exploits visual attention for cropping task-relevant regions and adapts its inference strategy, performing a single-stage inference for simple cases and a multi-stage analysis for more complex scenarios. This is achieved through a two-phase training pipeline that integrates supervised fine-tuning with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). Extensive experiments demonstrate that the proposed GUI-ARP achieves state-of-the-art performance on challenging GUI grounding benchmarks, with a 7B model reaching 60.8% accuracy on ScreenSpot-Pro and 30.9% on UI-Vision benchmark. Notably, GUI-ARP-7B demonstrates strong competitiveness against open-source 72B models (UI-TARS-72B at 38.1%) and proprietary models.

[55] SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

Sen Wang,Jingyi Tian,Le Wang,Zhimin Liao,Jiayi Li,Huaiyi Dong,Kun Xia,Sanping Zhou,Wei Tang,Hua Gang

Main category: cs.CV

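TL;DR: SAMPO is a hybrid framework that combines visual autoregressive modeling with causal modeling to improve temporal consistency and generation efficiency in world models, and adds a motion prompt module to strengthen dynamic scene understanding.

Details Motivation: Existing autoregressive world models suffer from disrupted spatial structure, inefficient decoding, and inadequate motion modeling in multi-frame prediction. SAMPO aims to address these issues.

Contribution: 1. A hybrid framework combining temporal causal decoding with bidirectional spatial attention. 2. An asymmetric multi-scale tokenizer that optimizes memory use and performance. 3. A trajectory-aware motion prompt module that focuses attention on dynamic regions and improves physical realism.

Method: 1. A hybrid of visual autoregressive modeling and causal modeling. 2. Bidirectional spatial attention to preserve spatial locality. 3. A motion prompt module that injects spatiotemporal cues.

Result: SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference; it also generalizes zero-shot to unseen tasks and benefits from larger model sizes.

Insight: Hybrid modeling and motion prompts are key to improving world-model performance, and the asymmetric tokenizer design handles multi-scale information more efficiently.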

Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO’s zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

[56] Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues

Wei Chen,Tongguan Wang,Feiyue Xue,Junkai Li,Hui Liu,Ying Sha

Main category: cs.CV

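TL;DR: The paper proposes a symmetrical bidirectional multimodal learning framework for desire, emotion, and sentiment recognition. Mutual guidance between text and image modalities, combined with local and global feature representations, yields better results than existing methods.

Details Motivation: Existing sentiment analysis focuses mostly on verbal cues and overlooks images as complementary non-verbal cues, while multimodal approaches to desire understanding remain underexplored.

Contribution: 1. A symmetrical bidirectional multimodal learning framework for desire, emotion, and sentiment recognition; 2. mutually guiding text and image decoders that strengthen cross-modal interaction; 3. a mixed-scale image strategy that balances compute cost against perceptual gain.

Method: 1. Low-resolution images provide global visual representations; 2. high-resolution images are partitioned into sub-images and trained with masked image modeling to capture fine-grained local features; 3. a text-guided image decoder and an image-guided text decoder provide deep cross-modal interaction.

Result: On the MSED dataset, the method outperforms existing approaches in desire understanding (+1.1% F1), emotion recognition (+0.6%), and sentiment analysis (+0.9%).

Insight: Combining local and global visual features with mutual guidance between modalities can markedly improve performance on multimodal tasks.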

Abstract: Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. Multimodal learning has advanced sentiment and emotion recognition, but multimodal approaches specifically targeting human desire understanding remain underexplored. In addition, existing methods in sentiment analysis predominantly emphasize verbal cues and overlook images as complementary non-verbal cues. To address these gaps, we propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition, which enforces mutual guidance between text and image modalities to effectively capture intention-related representations in the image. Specifically, low-resolution images are used to obtain global visual representations for cross-modal alignment, while high-resolution images are partitioned into sub-images and modeled with masked image modeling to enhance the ability to capture fine-grained local features. A text-guided image decoder and an image-guided text decoder are introduced to facilitate deep cross-modal interaction at both local and global representations of image information. Additionally, to balance perceptual gains with computation cost, a mixed-scale image strategy is adopted, where high-resolution images are cropped into sub-images for masked modeling. The proposed approach is evaluated on MSED, a multimodal dataset that includes a desire understanding benchmark, as well as emotion and sentiment recognition. Experimental results indicate consistent improvements over other state-of-the-art methods, validating the effectiveness of our proposed method. Specifically, our method outperforms existing approaches, achieving F1-score improvements of 1.1% in desire understanding, 0.6% in emotion recognition, and 0.9% in sentiment analysis. Our code is available at: https://github.com/especiallyW/SyDES.

[57] Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track

Ran Hong,Feng Lu,Leilei Cao,An Yan,Youhai Jiang,Fengjie Zhu

Main category: cs.CV

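TL;DR: The paper proposes a training-free framework that substantially improves Sa2VA on Referential Video Object Segmentation (RVOS) by introducing a Video-Language Checker and a Key-Frame Sampler, achieving strong results.

Details Motivation: RVOS requires joint vision and language understanding. Existing methods such as Sa2VA combine LLMs with SAM 2 but still suffer from false positives and lack an efficient key-frame selection mechanism.

Contribution: Proposes two core components: a Video-Language Checker to reduce false positives and an adaptive Key-Frame Sampler to better capture object appearance and temporal context.

Method: The Video-Language Checker verifies whether the content of the query appears in the video, and the Key-Frame Sampler selects informative frames.

Result: Achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd in the RVOS track of the LSVOS Challenge at ICCV 2025.

Insight: Verifying that the queried content is actually present and selecting key frames are both crucial to RVOS performance; existing methods can be improved substantially without additional training.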

Abstract: Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM 2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training-free framework that substantially improves Sa2VA’s performance on the RVOS task. Our method introduces two key components: (1) a Video-Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key-Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long-range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

[58] MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild

Deming Li,Kaiwen Jiang,Yutao Tang,Ravi Ramamoorthi,Rama Chellappa,Cheng Peng

Main category: cs.CV

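TL;DR: MS-GS is a 3D Gaussian Splatting framework for sparse-view, multi-appearance scenes. It uses geometric priors from monocular depth estimation and a Structure-from-Motion (SfM) points-anchored algorithm to address challenges in scene reconstruction and novel view synthesis.

Details Motivation: In-the-wild photo collections often contain few images and show multiple appearances (for example, taken at different times of day or seasons), which makes scene reconstruction and novel view synthesis harder. Existing methods such as NeRF and 3DGS tend to oversmooth or overfit.

Contribution: 1) The MS-GS framework, which supports 3D Gaussian Splatting in sparse-view, multi-appearance scenes; 2) use of monocular depth priors and an SfM points-anchored algorithm to extract local semantic regions; 3) a geometry-guided multi-view supervision strategy to reduce overfitting.

Method: 1) Extract geometric priors from monocular depth estimation; 2) extract local semantic regions with the SfM points-anchored algorithm; 3) apply geometry-guided supervision at virtual views to encourage 3D consistency and reduce overfitting.

Result: MS-GS achieves photorealistic renderings under sparse-view and multi-appearance conditions and significantly outperforms existing approaches across several datasets.

Insight: Geometric priors and local semantic regions can substantially improve sparse-view 3D reconstruction, and multi-view constraints help reduce overfitting.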

Abstract: In-the-wild photo collections often contain limited volumes of imagery and exhibit multiple appearances, e.g., taken at different times of day or seasons, posing significant challenges to scene reconstruction and novel view synthesis. Although recent adaptations of Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have improved in these areas, they tend to oversmooth and are prone to overfitting. In this paper, we present MS-GS, a novel framework designed with Multi-appearance capabilities in Sparse-view scenarios using 3DGS. To address the lack of support due to sparse initializations, our approach is built on the geometric priors elicited from monocular depth estimations. The key lies in extracting and utilizing local semantic regions with a Structure-from-Motion (SfM) points anchored algorithm for reliable alignment and geometry cues. Then, to introduce multi-view constraints, we propose a series of geometry-guided supervision at virtual views in a fine-grained and coarse scheme to encourage 3D consistency and reduce overfitting. We also introduce a dataset and an in-the-wild experiment setting to set up more realistic benchmarks. We demonstrate that MS-GS achieves photorealistic renderings under various challenging sparse-view and multi-appearance conditions and outperforms existing approaches significantly across different datasets.

[59] Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification

Tian Lan,Yiming Zheng,Jianxin Yin

Main category: cs.CV

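TL;DR: The paper proposes Diff-Feat, a framework that extracts intermediate image and text features from pre-trained diffusion-Transformer models and fuses them to achieve strong multi-label classification performance.

Details Motivation: Multi-label classification needs powerful representations that capture relationships between labels. Current methods rely on CNNs, graph models, or Transformers but lack effective fusion of cross-modal features. Diff-Feat aims to extract better representations with diffusion models.

Contribution: 1. Proposes the Diff-Feat framework, which uses intermediate diffusion features for cross-modal image and text feature extraction. 2. Finds that the best features appear in different places for image and language tasks: at the middle diffusion step and middle block for images, and at the noise-free step and deepest block for language. 3. Designs a local-search algorithm that efficiently locates the best "image-text" × "block-timestep" pair. 4. Achieves state-of-the-art performance on multiple datasets.

Method: 1. Extract intermediate image and text features from pre-trained diffusion-Transformer models. 2. Observe how feature quality varies across the diffusion process and Transformer blocks. 3. Use a heuristic local-search algorithm to choose the extraction position. 4. Fuse the selected features simply (linear projection followed by addition) to improve downstream performance.

Result: 1. Reaches 98.6% mAP on MS-COCO-enhanced and 45.7% mAP on Visual Genome 500, well above existing methods. 2. t-SNE and clustering analysis show that Diff-Feat forms tighter semantic clusters.

Insight: 1. Where the best diffusion features lie depends on the task: image tasks favor middle blocks and timesteps, language tasks favor the deepest block and the noise-free step. 2. Layer 12 is consistently the best feature layer for image tasks (under DiT-XL/2-256×256). 3. Simple fusion of cross-modal features already yields large gains.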

Abstract: Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce Diff-Feat, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block of the Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious “Layer 12” consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256×256). We devise a heuristic local-search algorithm that pinpoints the locally optimal “image-text” × “block-timestep” pair among a few candidates, avoiding an exhaustive grid search. A simple fusion (linear projection followed by addition) of the selected representations yields state-of-the-art performance: 98.6% mAP on MS-COCO-enhanced and 45.7% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that Diff-Feat forms tighter semantic clusters than unimodal counterparts. The code is available at https://github.com/lt-0123/Diff-Feat.
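To make the idea of a local search over (block, timestep) positions concrete, here is a minimal hill-climbing sketch. It is not the authors' algorithm, whose details the abstract does not give: the 4-neighbour move set, the grid size, the starting point, and the toy `toy_score` function (standing in for downstream validation accuracy of features taken at that position) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

Position = Tuple[int, int]  # (block index, timestep index)


def local_search(
    score: Callable[[int, int], float],
    start: Position,
    n_blocks: int,
    n_steps: int,
) -> Tuple[Position, float, int]:
    """Greedy hill climbing on a (block, timestep) grid.

    Returns the locally best position, its score, and the number of
    distinct positions that had to be evaluated.
    """
    cache: Dict[Position, float] = {}

    def evaluate(pos: Position) -> float:
        if pos not in cache:  # each position is scored at most once
            cache[pos] = score(*pos)
        return cache[pos]

    best = start
    while True:
        b, t = best
        neighbours = [
            (b + db, t + dt)
            for db, dt in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= b + db < n_blocks and 0 <= t + dt < n_steps
        ]
        candidate = max(neighbours, key=evaluate)
        if evaluate(candidate) <= evaluate(best):
            return best, evaluate(best), len(cache)
        best = candidate


def toy_score(block: int, step: int) -> float:
    """Hypothetical smooth score that peaks at block 12, timestep 5."""
    return -float((block - 12) ** 2 + (step - 5) ** 2)


best_pos, best_val, n_evaluated = local_search(toy_score, (0, 0), n_blocks=28, n_steps=10)
print(best_pos, best_val, n_evaluated)
```

On this toy surface the search reaches the peak after scoring only a fraction of the 280 grid cells, which is the practical appeal over an exhaustive grid search. On a real, non-smooth validation surface the result would only be a local optimum.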

[60] From Development to Deployment of AI-assisted Telehealth and Screening for Vision- and Hearing-threatening diseases in resource-constrained settings: Field Observations, Challenges and Way Forward

Mahesh Shakya,Bijay Adhikari,Nirsara Shrestha,Bipin Koirala,Arun Adhikari,Prasanta Poudyal,Luna Mathema,Sarbagya Buddhacharya,Bijay Khatri,Bishesh Khanal

Main category: cs.CV

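TL;DR: The paper discusses the challenges of developing and deploying AI-assisted screening for vision- and hearing-threatening diseases in resource-constrained settings, along with ways forward, and emphasizes early prototyping, interdisciplinary collaboration, and automated image quality checks.

Details Motivation: In resource-constrained settings, with few specialists and limited screening equipment, vision- and hearing-threatening diseases often cause preventable disability. AI-assisted mass screening and telehealth could expand early detection, but practical deployment faces many challenges, such as the move away from paper-based workflows and the lack of documented field experience.

Contribution: Offers practical recommendations for developing and deploying AI-assisted screening in resource-constrained settings, including iterative interdisciplinary collaboration and automated image quality checks, and helps fill the gap in practical field knowledge for such settings.

Method: An iterative approach built on early prototyping, shadow deployment, and continuous feedback, drawing on public datasets and AI models and treated as an end-to-end co-design process.

Result: Public datasets and AI models remain highly useful despite poor performance under domain shift, and automated image quality checks are needed to capture gradable images in high-volume screening camps.

Insight: AI development and workflow digitization should be treated as an end-to-end, iterative co-design process; interdisciplinary collaboration and early user feedback are key to success.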

Abstract: Vision- and hearing-threatening diseases cause preventable disability, especially in resource-constrained settings (RCS) with few specialists and limited screening setup. Large-scale AI-assisted screening and telehealth have the potential to expand early detection, but practical deployment is challenging in paper-based workflows, and limited documented field experience exists to build upon. We provide insights on challenges and ways forward in the development and adoption of scalable AI-assisted telehealth and screening in such settings. Specifically, we find that iterative, interdisciplinary collaboration through early prototyping, shadow deployment and continuous feedback is important to build shared understanding as well as reduce usability hurdles when transitioning from paper-based to AI-ready workflows. We find public datasets and AI models highly useful despite poor performance due to domain shift. In addition, we find the need for an automated AI-based image quality check to capture gradable images for robust screening in high-volume camps. Our field learnings stress the importance of treating AI development and workflow digitization as an end-to-end, iterative co-design process. By documenting these practical challenges and lessons learned, we aim to address the gap in contextual, actionable field knowledge for building real-world AI-assisted telehealth and mass-screening programs in RCS.

Shaojie Zhang,Ruoceng Zhang,Pei Fu,Shaokang Wang,Jiahui Yang,Xin Du,Shiqi Cui,Bin Qin,Ying Huang,Zhenbo Luo,Jian Luan

Main category: cs.CV

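TL;DR: The paper proposes BTL (Blink-Think-Link), a GUI interaction framework inspired by human cognition. It mimics how people interact with graphical interfaces by decomposing tasks into three phases, Blink (rapid detection), Think (higher-level reasoning), and Link (generation of executable commands), and introduces two technical innovations: Blink Data Generation and BTL Reward.

Details Motivation: Existing GUI automation methods based on multimodal large language models differ markedly from natural human interaction patterns; the authors aim to close this gap by mimicking the human cognitive process.

Contribution: 1. The BTL framework, which models the three-phase cognitive process of human-GUI interaction; 2. an automated annotation pipeline for generating Blink data; 3. the first rule-based BTL Reward mechanism, which enables reinforcement learning driven by both process and outcome.

Method: BTL divides interaction into Blink (rapid detection of and attention to relevant screen areas), Think (higher-level reasoning and decision-making), and Link (generation of executable commands); Blink Data Generation and BTL Reward are used to improve model training.

Result: BTL-UI performs strongly on both static GUI understanding and dynamic interaction tasks, supporting the framework's effectiveness for building advanced GUI agents.

Insight: Mimicking human cognition can markedly improve the natural interaction ability of GUI agents, and the BTL framework suggests a new direction for human-computer interaction research.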

Abstract: In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose “Blink-Think-Link” (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward – the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework’s efficacy in developing advanced GUI Agents.

[62] Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach

Shilong Bao,Qianqian Xu,Feiran Li,Boyu Han,Zhiyong Yang,Xiaochun Cao,Qingming Huang

Main category: cs.CV

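TL;DR: The paper examines a fundamental but underexplored issue in salient object detection (SOD): existing evaluation protocols are sensitive to object size when multiple objects of very different sizes appear in one image. It proposes a generic Size-Invariant Evaluation framework (SIEva) and an optimization method (SIOpt).

Details Motivation: Existing SOD metrics are sensitive to object size, so large regions dominate the evaluation while small but semantically important objects are overlooked, which biases assessment and degrades practical performance.

Contribution: 1. Exposes the size sensitivity of existing SOD metrics; 2. proposes the generic size-invariant evaluation framework SIEva and the optimization method SIOpt; 3. provides theoretical analysis and experimental validation.

Method: A theoretical derivation decomposes the evaluation outcome into separable terms; SIEva evaluates each separable component individually and aggregates the results; SIOpt is a model-agnostic optimization framework built on the same principle.

Result: Experiments show that the approach improves the detection of salient objects across a broad range of sizes, validating its effectiveness.

Insight: Size sensitivity is an overlooked problem in SOD evaluation; evaluating each object region independently and then aggregating improves both fairness and performance.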

Abstract: This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
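The size sensitivity described in the abstract can be seen in a small numerical sketch. It compares a conventional pixel-averaged MAE with a per-region average in the spirit of "evaluate each separable component individually and then aggregate". This is not the paper's SIEva definition: the region map (one label per object plus background), the choice of MAE, and the unweighted mean over regions are assumptions for illustration.

```python
import numpy as np


def global_mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Conventional MAE: every pixel counts equally, so large regions dominate."""
    return float(np.abs(pred - gt).mean())


def size_invariant_mae(pred: np.ndarray, gt: np.ndarray, regions: np.ndarray) -> float:
    """Mean of per-region MAEs: every region counts equally regardless of size.

    `regions` is an integer label map that partitions the image
    (hypothetical input: 0 = background, 1..K = individual objects).
    """
    errors = np.abs(pred - gt)
    per_region = [errors[regions == label].mean() for label in np.unique(regions)]
    return float(np.mean(per_region))


# 10x10 toy image with one large object (36 px) and one small object (4 px).
gt = np.zeros((10, 10))
regions = np.zeros((10, 10), dtype=int)
gt[0:6, 0:6] = 1.0
regions[0:6, 0:6] = 1
gt[8:10, 8:10] = 1.0
regions[8:10, 8:10] = 2

# The prediction is perfect except that it misses the small object entirely.
pred = gt.copy()
pred[8:10, 8:10] = 0.0

mae_global = global_mae(pred, gt)
mae_size_invariant = size_invariant_mae(pred, gt, regions)
print(mae_global, mae_size_invariant)
```

The pixel-averaged score barely registers the missed object because it covers only 4 of 100 pixels, while the per-region score charges the full error of one of three regions.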

[63] Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion

Shanghong Li,Chiam Wen Qi Ruth,Hong Xu,Fang Liu

Main category: cs.CV

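TL;DR: The paper proposes HFN, a multimodal framework for detecting fake news in short videos. HFN fuses video, audio, and text data, dynamically adjusts modality weights, and stays robust with incomplete data. Experiments show that it outperforms prior methods on the FakeTT and VESV datasets.

Details Motivation: The rapid spread of short video platforms has intensified the spread of misinformation. Existing methods struggle with the dynamic, multimodal nature of short video content, so an effective multimodal detection method is needed.

Contribution: 1. Proposes the HFN framework, which dynamically fuses multimodal data; 2. introduces a Decision Network that adjusts modality weights and a Weighted Multi-Modal Feature Fusion module; 3. contributes VESV, a dataset built specifically for fake news detection in short videos.

Method: HFN combines video, audio, and text modalities, uses the Decision Network to adjust modality weights dynamically, and uses the Weighted Multi-Modal Feature Fusion module to handle incomplete data.

Result: Improves Macro F1 over state-of-the-art methods by 2.71% on FakeTT and 4.14% on VESV.

Insight: Dynamic modality weighting and multimodal fusion are key to handling fake news in short videos, and the VESV dataset is a useful resource for related research.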

Abstract: The rapid proliferation of short video platforms has necessitated advanced methods for detecting fake news. This need arises from the widespread influence and ease of sharing misinformation, which can lead to significant societal harm. Current methods often struggle with the dynamic and multimodal nature of short video content. This paper presents HFN, Heterogeneous Fusion Net, a novel multimodal framework that integrates video, audio, and text data to evaluate the authenticity of short video content. HFN introduces a Decision Network that dynamically adjusts modality weights during inference and a Weighted Multi-Modal Feature Fusion module to ensure robust performance even with incomplete data. Additionally, we contribute a comprehensive dataset VESV (VEracity on Short Videos) specifically designed for short video fake news detection. Experiments conducted on the FakeTT and newly collected VESV datasets demonstrate improvements of 2.71% and 4.14% in Macro F1 over state-of-the-art methods. This work establishes a robust solution capable of effectively identifying fake news in the complex landscape of short video platforms, paving the way for more reliable and comprehensive approaches in combating misinformation.
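One way to picture dynamic modality weighting that tolerates a missing modality is a masked softmax over per-modality scores followed by a weighted sum of features. The sketch below is only that picture, not HFN itself: the abstract does not describe the Decision Network or the fusion module, so the `scores` input (standing in for the Decision Network's output), the softmax, and the masking rule are all assumptions.

```python
import numpy as np


def fuse_modalities(features: np.ndarray, scores: np.ndarray, present: np.ndarray):
    """Weighted fusion of modality features with missing modalities masked out.

    features: (n_modalities, dim) feature vectors, e.g. video, audio, text.
    scores:   (n_modalities,) unnormalized weights (hypothetical network output).
    present:  (n_modalities,) booleans; at least one modality must be present.
    Returns the fused (dim,) vector and the normalized weights.
    """
    masked = np.where(present, scores, -np.inf)          # absent -> weight 0
    weights = np.exp(masked - masked[present].max())     # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ features, weights


features = np.array([
    [1.0, 0.0],  # video
    [0.0, 1.0],  # audio (missing in this example)
    [1.0, 1.0],  # text
])
scores = np.array([0.0, 5.0, 0.0])
present = np.array([True, False, True])

fused, weights = fuse_modalities(features, scores, present)
print(fused, weights)
```

Even though the missing audio modality carries the largest raw score, masking forces its weight to zero and the remaining weights renormalize over video and text.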

[64] EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery

Gui Wang,Yang Wennuo,Xusen Ma,Zehao Zhong,Zhuoru Wu,Ende Wu,Rong Qu,Wooi Ping Cheah,Jianfeng Ren,Linlin Shen

Main category: cs.CV

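TL;DR: The paper proposes EyePCR, a multimodal benchmark for ophthalmic surgery that evaluates models on fine-grained perception, knowledge comprehension, and clinical reasoning; a domain-adapted MLLM outperforms open-source models.

Details Motivation: The performance of existing MLLMs in high-stakes domains such as surgical settings is still under-explored, and a benchmark grounded in structured clinical knowledge is needed to evaluate models' cognitive abilities.

Contribution: Proposes the EyePCR benchmark, with more than 210k VQAs and multi-view fine-grained annotations, a medical knowledge graph of more than 25k triplets, and four clinical reasoning tasks, providing a comprehensive evaluation framework for surgical video understanding.

Method: Builds an annotated dataset grounded in domain knowledge and adapts Qwen2.5-VL-7B (EyePCR-MLLM); models are evaluated on fine-grained perception, knowledge comprehension, and clinical reasoning tasks.

Result: EyePCR-MLLM achieves the highest accuracy on perception tasks among compared models, outperforms open-source models in comprehension and reasoning, and rivals commercial models such as GPT-4.1; the benchmark also reveals the limitations of existing MLLMs in surgical cognition.

Insight: Structured clinical knowledge annotations can significantly improve models' cognitive ability and open a new direction for assessing model reliability in surgical settings.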

Abstract: MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios such as surgical settings remains largely under-explored. To address this gap, we develop EyePCR, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across Perception, Comprehension and Reasoning. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models’ cognitive ability. In particular, EyePCR-MLLM, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for Perception among compared models and outperforms open-source models in Comprehension and Reasoning, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.

[65] TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Zhongyuan Bao,Lejun Zhang

Main category: cs.CV

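TL;DR: TennisTV is the first comprehensive benchmark for tennis video understanding and reveals the shortcomings of multimodal large language models (MLLMs) on fast, high-frequency sports video.

Details Motivation: Current MLLMs excel at general video understanding but struggle with fast, high-frequency sports video such as tennis, so a systematic evaluation is needed.

Contribution: TennisTV provides a benchmark of 2,500 human-verified questions covering 8 tasks and systematically evaluates 16 MLLMs.

Method: Each rally is modeled as a temporally ordered sequence of consecutive stroke events, and automated pipelines are used for filtering and question generation.

Result: MLLMs show substantial shortcomings in tennis video understanding; the results indicate that frame-sampling density should be tailored to the task and that better temporal grounding is essential.

Insight: Task-tailored frame sampling and stronger temporal grounding are key to improving MLLMs on fast sports video.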

Abstract: Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks at rally and stroke levels and includes 2,500 human-verified questions. Evaluating 16 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results reveal substantial shortcomings and yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.

[66] Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation

Zheng Wang,Hong Liu,Zheng Wang,Danyi Li,Min Cen,Baptiste Magnier,Li Liang,Liansheng Wang

Main category: cs.CV

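TL;DR: The paper proposes Rasa, a report-auxiliary self-distillation approach to WSI-based survival analysis. It uses large language models (LLMs) to extract key information from pathology reports and improves model performance through self-distillation and data augmentation.

Details Motivation: WSI-based survival analysis faces noisy features and limited data accessibility, while pathology reports remain underused as auxiliary information. The study aims to use the rich information in reports to improve WSI-based prognosis prediction.

Contribution: 1. Proposes the Rasa framework, which uses LLMs to extract WSI-relevant textual descriptions from pathology reports; 2. designs a self-distillation pipeline in which the teacher model's textual knowledge guides the filtering of irrelevant or redundant features; 3. introduces a risk-aware mix-up strategy to increase the quantity and diversity of training data.

Method: 1. LLMs extract WSI-relevant text from pathology reports; 2. a teacher-student self-distillation pipeline filters noisy features; 3. a risk-aware mix-up strategy expands the training data.

Result: Experiments on the CRC and TCGA-BRCA datasets show that Rasa outperforms state-of-the-art methods.

Insight: Pathology reports used as auxiliary information can substantially improve WSI-based survival analysis, and self-distillation and data augmentation are effective against feature noise and data scarcity.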

Abstract: Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as they offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering their ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model’s textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa against state-of-the-art methods. Our code is available at https://github.com/zhengwang9/Rasa.

[67] PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Zhuoyao Liu,Yang Liu,Wentao Feng,Shudong Huang

Main category: cs.CV

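TL;DR: PCSR is a new framework that addresses noisy correspondences in cross-modal retrieval through pseudo-label consistency-guided sample refinement, improving retrieval performance.

Details Motivation: Real-world data contains many noisy correspondences. Previous methods only divide samples coarsely into clean and noisy, overlook the diversity within noisy samples, and apply a single training strategy to all samples, which limits performance.

Contribution: 1. Proposes the PCSR framework, which refines samples using pseudo-label consistency; 2. Designs the Pseudo-label Consistency Score (PCS) to quantify prediction stability and separate ambiguous samples from refinable ones; 3. Proposes Adaptive Pair Optimization (APO), which optimizes each subset according to its characteristics.

Method: 1. Confidence-based division into clean and noisy pairs; 2. Refinement of noisy pairs via pseudo-label consistency; 3. PCS-based separation of ambiguous and refinable samples; 4. APO, which applies robust loss functions to ambiguous samples and text replacement to refinable ones.

Result: Experiments on CC152K, MS-COCO, and Flickr30K show that PCSR improves retrieval robustness under noisy supervision.

Insight: The diversity within noisy samples calls for finer-grained handling, and sample-specific optimization makes better use of the available data.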

Abstract: Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs misguide similarity learning and degrade retrieval performance. Previous methods often rely on coarse-grained categorizations that simply divide data into clean and noisy samples, overlooking the intrinsic diversity within noisy instances. Moreover, they typically apply uniform training strategies regardless of sample characteristics, resulting in suboptimal sample utilization for model optimization. To address the above challenges, we introduce a novel framework, called Pseudo-label Consistency-Guided Sample Refinement (PCSR), which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Specifically, we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency to uncover structurally distinct subsets. We further propose a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training. Extensive experiments on CC152K, MS-COCO and Flickr30K validate the effectiveness of our method in improving retrieval robustness under noisy supervision.

[68] UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao,Shuo Wang,Jilin Mei,Chen Min,Shun Lu,Fuyang Liu,Yu Hu

Main category: cs.CV

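TL;DR: UNIV is a unified foundation model for infrared and visible modalities. Its biologically inspired design addresses the challenges of multimodal perception, improving infrared task performance without degrading the baseline performance on visible-light tasks.

Details Motivation: Pre-trained models for RGB-visible and infrared data underperform in joint multimodal perception, particularly when robustness under diverse weather conditions is required.

Contribution: 1. Proposes Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework for cross-modal feature alignment; 2. Designs a dual-knowledge preservation mechanism that combines LoRA adapters with synchronous distillation; 3. Releases the MVIP dataset to support cross-modal learning.

Method: 1. PCCL mimics the lateral inhibition of retinal horizontal cells; 2. The dual-knowledge preservation mechanism emulates bipolar cell signal routing to prevent catastrophic forgetting.

Result: UNIV improves infrared tasks (+1.7 mIoU in semantic segmentation, +0.7 mAP in object detection) while retaining more than 99% of the baseline performance on visible-light tasks.

Insight: Biologically inspired design offers a new direction for multimodal models; combining lightweight adapters such as LoRA with distillation can balance cross-modal learning against single-modality performance.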

Abstract: The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells’ lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina’s bipolar cell signal routing - combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina’s photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV’s superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.

[69] GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading

Donghyun Lee,Dawoon Jeong,Jae W. Lee,Hongil Yoon

Main category: cs.CV

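TL;DR: GS-Scale is a memory-efficient training system that stores all Gaussians in host memory and transfers only the needed subset to the GPU on demand, sharply reducing GPU memory requirements while preserving training speed through system-level optimizations.

Details Motivation: Existing 3D Gaussian Splatting training runs out of GPU memory on large-scale scenes, which limits achievable rendering quality. GS-Scale aims to remove this bottleneck.

Contribution: Through host-memory offloading, GS-Scale reduces GPU memory demand by 3.3-5.6x while achieving training speed comparable to GPU-only training. It also enables training larger Gaussian Splatting scenes on consumer-grade GPUs.

Method: GS-Scale uses three system-level optimizations: (1) selective offloading of geometric parameters to speed up frustum culling; (2) parameter forwarding, which pipelines CPU optimizer updates with GPU computation; (3) deferred optimizer updates, which avoid unnecessary memory accesses for Gaussians with zero gradients.

Result: On large-scale datasets, GS-Scale scales the number of Gaussians from 4 million to 18 million on an RTX 4070 Mobile GPU, improving LPIPS by 23-35%.

Insight: With host-memory offloading and system-level optimization, high-quality large-scale 3D Gaussian Splatting training becomes feasible on limited GPU resources, which should benefit high-performance rendering applications on consumer devices.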

Abstract: The advent of 3D Gaussian Splatting has revolutionized graphics rendering by delivering high visual quality and fast rendering speeds. However, training large-scale scenes at high quality remains challenging due to the substantial memory demands required to store parameters, gradients, and optimizer states, which can quickly overwhelm GPU memory. To address these limitations, we propose GS-Scale, a fast and memory-efficient training system for 3D Gaussian Splatting. GS-Scale stores all Gaussians in host memory, transferring only a subset to the GPU on demand for each forward and backward pass. While this dramatically reduces GPU memory usage, it requires frustum culling and optimizer updates to be executed on the CPU, introducing slowdowns due to CPU’s limited compute and memory bandwidth. To mitigate this, GS-Scale employs three system-level optimizations: (1) selective offloading of geometric parameters for fast frustum culling, (2) parameter forwarding to pipeline CPU optimizer updates with GPU computation, and (3) deferred optimizer update to minimize unnecessary memory accesses for Gaussians with zero gradients. Our extensive evaluations on large-scale datasets demonstrate that GS-Scale significantly lowers GPU memory demands by 3.3-5.6x, while achieving training speeds comparable to GPU without host offloading. This enables large-scale 3D Gaussian Splatting training on consumer-grade GPUs; for instance, GS-Scale can scale the number of Gaussians from 4 million to 18 million on an RTX 4070 Mobile GPU, leading to 23-35% LPIPS (learned perceptual image patch similarity) improvement.

[70] Saccadic Vision for Fine-Grained Visual Classification

Johann Schmidt,Sebastian Stober,Joachim Denzler,Paul Bodesheim

Main category: cs.CV

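TL;DR: The paper proposes a two-stage method inspired by human saccadic vision. It extracts peripheral features, encodes fixation patches in parallel, and weights them with contextualized selective attention, improving fine-grained visual classification.

Details Motivation: Existing part-based fine-grained classification methods rely on complex localization networks, which produces redundant features and makes it hard to determine the optimal number of parts. The paper models human saccadic vision to improve feature extraction and classification.

Contribution: Proposes a two-stage framework that combines peripheral features with fixation-patch encoding and fuses them using contextualized selective attention, addressing the feature redundancy problem.

Method: Peripheral features are extracted first to generate a sample map. Fixation patches are then sampled from the map and encoded in parallel with a weight-shared encoder, with non-maximum suppression removing redundant fixations. Finally, selective attention weights each patch before the peripheral and focus representations are fused.

Result: On several fine-grained visual classification benchmarks, the method performs comparably to state-of-the-art approaches and consistently outperforms its baseline encoder.

Insight: Borrowing mechanisms from human vision, such as saccades, together with non-maximum suppression during sampling, can improve feature extraction for fine-grained classification, reducing redundancy and raising performance.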

Abstract: Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features - a task that remains challenging due to high intra-class variability and limited inter-class differences. Existing part-based methods often rely on complex localization networks that learn mappings from pixel to sample space, requiring a deep understanding of image content while limiting feature utility for downstream tasks. In addition, sampled points frequently suffer from high spatial redundancy, making it difficult to quantify the optimal number of required parts. Inspired by human saccadic vision, we propose a two-stage process that first extracts peripheral features (coarse view) and generates a sample map, from which fixation patches are sampled and encoded in parallel using a weight-shared encoder. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations. To prevent spatial collapse - a common issue in part-based methods - we utilize non-maximum suppression during fixation sampling to eliminate redundancy. Comprehensive evaluation on standard FGVC benchmarks (CUB-200-2011, NABirds, Food-101 and Stanford-Dogs) and challenging insect datasets (EU-Moths, Ecuador-Moths and AMI-Moths) demonstrates that our method achieves comparable performance to state-of-the-art approaches while consistently outperforming our baseline encoder.
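
The redundancy-removal step in the abstract can be pictured with a small greedy non-maximum suppression routine over a sample map. This is a minimal sketch and not the authors' implementation: the function name `sample_fixations`, the square (Chebyshev-distance) suppression window, and the `radius` parameter are assumptions introduced here for illustration.

```python
def sample_fixations(sample_map, num_fixations, radius):
    """Greedily pick fixation centres from a 2D score map.

    Hypothetical helper: after a location is chosen, every location within
    `radius` cells (Chebyshev distance) is suppressed, so the chosen
    fixations cannot collapse onto the same spot.
    """
    # Rank all grid cells by score, highest first.
    candidates = sorted(
        ((score, r, c)
         for r, row in enumerate(sample_map)
         for c, score in enumerate(row)),
        key=lambda item: -item[0],
    )
    chosen = []
    for _, r, c in candidates:
        # Keep a candidate only if it lies outside every suppression window.
        if all(max(abs(r - pr), abs(c - pc)) > radius for pr, pc in chosen):
            chosen.append((r, c))
            if len(chosen) == num_fixations:
                break
    return chosen


scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7],
]
fixations = sample_fixations(scores, num_fixations=2, radius=1)
```

Here the second-highest cell sits next to the highest one and is suppressed, so the second fixation lands on the distant peak instead.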

[71] SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Cristian Sbrolli,Matteo Matteucci

Main category: cs.CV

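TL;DR: SceneForge strengthens the alignment between 3D point clouds and text using structured multi-object scene compositions, improving contrastive learning performance.

Details Motivation: Large-scale 3D-text datasets are scarce. The work aims to enrich data complexity and diversity in order to improve 3D-text alignment learning.

Contribution: Proposes the SceneForge framework, which builds multi-object scenes paired with matching text descriptions to improve 3D-text contrastive learning.

Method: Individual 3D shapes are composed into multi-object scenes with explicit spatial relations and paired with multi-object descriptions refined by a large language model; these compositional samples augment contrastive training.

Result: SceneForge delivers gains across several tasks, including zero-shot classification, few-shot part segmentation, and 3D visual question answering, and the augmentation is model-agnostic.

Insight: Structured scene composition improves 3D-text alignment, which is especially useful when data is scarce.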

Abstract: The whole is greater than the sum of its parts-even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge’s compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.

[72] ORIC: Benchmarking Object Recognition in Incongruous Context for Large Vision-Language Models

Zhaoyang Li,Zhan Ling,Yuchen Zhou,Hao Su

Main category: cs.CV

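TL;DR: The paper introduces ORIC (Object Recognition in Incongruous Context), a benchmark for evaluating how well large vision-language models (LVLMs) recognize objects in incongruous contexts. It exposes problems with object misidentification and hallucination.

Details Motivation: LVLMs perform well on tasks such as image captioning and visual question answering, but in incongruous contexts they are prone to object misidentification and hallucination. This calls for systematic evaluation and improvement.

Contribution: Proposes the ORIC benchmark, which uses LLM-guided and CLIP-guided sampling strategies to evaluate LVLM object recognition specifically in incongruous contexts.

Method: 1. LLM-guided sampling identifies objects that are present but contextually incongruous; 2. CLIP-guided sampling detects plausible but nonexistent objects that are likely to be hallucinated.

Result: An evaluation of 18 LVLMs and two open-vocabulary detection models reveals significant recognition gaps in incongruous contexts.

Insight: ORIC exposes the limitations of LVLMs and points to context-aware object recognition as an important research direction.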

Abstract: Large Vision-Language Models (LVLMs) have made significant strides in image captioning, visual question answering, and robotics by integrating visual and textual information. However, they remain prone to errors in incongruous contexts, where objects appear unexpectedly or are absent when contextually expected. This leads to two key recognition failures: object misidentification and hallucination. To systematically examine this issue, we introduce the Object Recognition in Incongruous Context Benchmark (ORIC), a novel benchmark that evaluates LVLMs in scenarios where object-context relationships deviate from expectations. ORIC employs two key strategies: (1) LLM-guided sampling, which identifies objects that are present but contextually incongruous, and (2) CLIP-guided sampling, which detects plausible yet nonexistent objects that are likely to be hallucinated, thereby creating an incongruous context. Evaluating 18 LVLMs and two open-vocabulary detection models, our results reveal significant recognition gaps, underscoring the challenges posed by contextual incongruity. This work provides critical insights into LVLMs’ limitations and encourages further research on context-aware object recognition.

[73] Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Yuxuan Liang,Xu Li,Xiaolei Chen,Yi Zheng,Haotian Chen,Bin Li,Xiangyang Xue

Main category: cs.CV

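TL;DR: The paper proposes Pyramid Token Pruning (PTP), a training-free method that combines region-level and token-level visual saliency with instruction-guided importance to substantially reduce the computational overhead of large vision-language models (LVLMs) on high-resolution images.

Details Motivation: When processing high-resolution images, LVLMs generate a large number of visual tokens, which the abstract describes as causing exponential computational overhead and which hurts inference efficiency. The authors propose PTP to improve efficiency without a significant loss in performance.

Contribution: 1. Proposes a training-free Pyramid Token Pruning (PTP) strategy that selects important tokens using visual saliency together with instruction guidance. 2. Draws on human visual attention mechanisms, reducing computation through multi-level filtering. 3. Validates efficiency and performance on 13 diverse benchmarks.

Method: PTP proceeds as follows: 1. Bottom-up filtering based on visual saliency at the region and token levels; 2. Top-down, instruction-guided importance estimation that selects the tokens most relevant to the task; 3. More tokens are retained from visually salient regions, cutting redundant computation.

Result: Experiments show that PTP substantially reduces computational overhead and inference latency with minimal performance loss across diverse tasks.

Insight: 1. Combining visual saliency with task instructions is key to effective token pruning. 2. A training-free method can improve model efficiency while preserving performance. 3. Borrowing from human visual mechanisms offers a useful direction for multimodal tasks.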

Abstract: Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle with efficiently processing high-resolution images. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and causing exponential computational overhead during inference. To address these limitations, we propose a training-free token pruning strategy, Pyramid Token Pruning (PTP), that integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP selectively retains more tokens from visually salient regions and further leverages textual instructions to pinpoint tokens most relevant to specific multimodal tasks. Extensive experiments across 13 diverse benchmarks demonstrate that our method substantially reduces computational overhead and inference latency with minimal performance loss.
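
To make the interaction between the three signals concrete, the sketch below allocates a token budget across regions in proportion to region saliency, then ranks tokens inside each region by a blend of token saliency and instruction relevance. It is an illustration under stated assumptions, not the paper's algorithm: the function name `pyramid_prune`, the proportional budget rule, the linear blend, and the `instruction_weight` parameter are all hypothetical.

```python
def pyramid_prune(regions, total_budget, instruction_weight=0.5):
    """Return the sorted ids of the visual tokens to keep.

    Hypothetical input layout: each region is a dict with
      'saliency': float, region-level visual saliency
      'tokens':   list of (token_id, token_saliency, instruction_score)
    """
    total_saliency = sum(region["saliency"] for region in regions)
    kept = []
    for region in regions:
        # More salient regions receive a larger share of the budget.
        # Rounding means the shares may not sum exactly to total_budget.
        budget = round(total_budget * region["saliency"] / total_saliency)
        budget = min(budget, len(region["tokens"]))
        # Blend bottom-up saliency with top-down instruction relevance.
        ranked = sorted(
            region["tokens"],
            key=lambda tok: (1 - instruction_weight) * tok[1]
            + instruction_weight * tok[2],
            reverse=True,
        )
        kept.extend(token_id for token_id, _, _ in ranked[:budget])
    return sorted(kept)


regions = [
    {"saliency": 3.0,
     "tokens": [(0, 0.9, 0.1), (1, 0.2, 0.2), (2, 0.5, 0.9), (3, 0.4, 0.4)]},
    {"saliency": 1.0,
     "tokens": [(4, 0.1, 0.9), (5, 0.8, 0.0)]},
]
kept_tokens = pyramid_prune(regions, total_budget=4)
```

In this toy input the salient region keeps three of its four tokens, while the less salient region keeps only the token that the instruction score favours.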

[74] SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark

Chi Yang,Fu Wang,Xiaofei Yang,Hao Huang,Weijia Cao,Xiaowen Chu

Main category: cs.CV

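TL;DR: The paper presents a benchmark dataset and the SGMAGNet model for reconstructing 3D cloud phase structure from multimodal satellite observations, improving the accuracy of cloud phase profile retrieval.

Details Motivation: Cloud phase profiles are critical for numerical weather prediction because they directly affect radiative transfer and precipitation processes. Existing methods struggle to reconstruct complex 3D cloud phase structures accurately, so a new approach is needed.

Contribution: Presents a new benchmark dataset and the SGMAGNet model for reconstructing 3D cloud phase structure from multimodal satellite observations, laying groundwork for future integration with numerical weather prediction systems.

Method: SGMAGNet is adopted as the main model and compared against baseline architectures, including UNet variants and SegNet, all designed to capture multi-scale spatial patterns.

Result: SGMAGNet performs best at cloud phase reconstruction, particularly in complex multi-layer and boundary transition regions, with a Precision of 0.922, Recall of 0.858, F1-score of 0.763, and IoU of 0.617.

Insight: Combining multimodal data with a purpose-built model design is important for accurate cloud phase reconstruction, especially in complex scenes.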

Abstract: Cloud phase profiles are critical for numerical weather prediction (NWP), as they directly affect radiative transfer and precipitation processes. In this study, we present a benchmark dataset and a baseline framework for transforming multimodal satellite observations into detailed 3D cloud phase structures, aiming toward operational cloud phase profile retrieval and future integration with NWP systems to improve cloud microphysics parameterization. The multimodal observations consist of (1) high-spatiotemporal-resolution, multi-band visible (VIS) and thermal infrared (TIR) imagery from geostationary satellites, and (2) accurate vertical cloud phase profiles from spaceborne lidar (CALIOP/CALIPSO) and radar (CPR/CloudSat). The dataset consists of synchronized image-profile pairs across diverse cloud regimes, defining a supervised learning task: given VIS/TIR patches, predict the corresponding 3D cloud phase structure. We adopt SGMAGNet as the main model and compare it with several baseline architectures, including UNet variants and SegNet, all designed to capture multi-scale spatial patterns. Model performance is evaluated using standard classification metrics, including Precision, Recall, F1-score, and IoU. The results demonstrate that SGMAGNet achieves superior performance in cloud phase reconstruction, particularly in complex multi-layer and boundary transition regions. Quantitatively, SGMAGNet attains a Precision of 0.922, Recall of 0.858, F1-score of 0.763, and an IoU of 0.617, significantly outperforming all baselines across these key metrics.
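
For readers less familiar with the four metrics, the following sketch computes them for a single class from flat label lists, using the standard definitions. The function name `class_metrics` and the list-based input are assumptions made for illustration. The abstract does not state how its reported figures are aggregated; figures averaged over several classes need not satisfy the single-class relation between precision, recall, and F1 shown here.

```python
def class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1 and IoU for one class (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # IoU counts the overlap against the union of predicted and true pixels.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


metrics = class_metrics(y_true=[1, 1, 0, 0, 1], y_pred=[1, 0, 0, 1, 1], cls=1)
```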

[75] Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method

Shuaibo Li,Zhaohu Xing,Hongqiu Wang,Pengfei Hao,Xingyu Li,Zekai Liu,Lei Zhu

Main category: cs.CV

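TL;DR: The paper presents MedForensics, a comprehensive dataset for medical deepfake detection, and DSKI, a new detection method, to counter the risks posed by forged medical images. Experiments demonstrate DSKI's advantage over existing methods.

Details Motivation: The rapid progress of generative AI in medical imaging brings major opportunities but also a serious risk of forged medical images. The field lacks both a comprehensive dataset and effective detection methods.

Contribution: 1) Presents the MedForensics dataset, covering six medical modalities and twelve generative models; 2) Proposes the DSKI detector, which improves detection accuracy through dual-stage knowledge infusion.

Method: DSKI comprises a cross-domain fine-trace adapter (CDFA), which extracts forgery clues from both the spatial and noise domains during training, and a medical forensic retrieval module (MFRM), which improves detection accuracy through few-shot retrieval at test time.

Result: Experiments show that DSKI significantly outperforms both existing methods and human experts across multiple medical modalities.

Insight: Detecting forged medical images requires methods designed for their distinctive characteristics; cross-domain feature extraction and knowledge retrieval can substantially improve detection performance.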

Abstract: The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensic methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce MedForensics, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose DSKI, a novel Dual-Stage Knowledge Infusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.

[76] Hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model for visual receptive fields

Tony Lindeberg

Main category: cs.CV

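TL;DR: This paper develops hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model of visual receptive fields, with the aim of handling the effect of natural image transformations on receptive field responses.

Details Motivation: Natural image transformations, such as changes in viewing conditions, affect the receptive field responses in the earliest layers of the visual hierarchy. Covariant receptive field families are needed to handle this variability.

Contribution: Derives both infinitesimal relationships (combining notions from semi-groups and Lie groups) and macroscopic cascade smoothing properties that connect receptive field responses obtained for different values of the shape parameters.

Method: Combines Lie semi-group and cascade structures within the generalized Gaussian derivative model, supporting the computation of responses over multi-parameter families of receptive fields.

Result: The results give a theoretical basis for designing more efficient schemes for computing multi-parameter receptive field responses and for formulating idealized theoretical models of biological vision.

Insight: The framework of Lie groups and semi-groups allows receptive field responses to be modelled more efficiently and offers a new perspective on the computations performed in biological vision.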

Abstract: Because of the variabilities of real-world image structures under the natural image transformations that arise when observing similar objects or spatio-temporal events under different viewing conditions, the receptive field responses computed in the earliest layers of the visual hierarchy may be strongly influenced by such geometric image transformations. One way of handling this variability is by basing the vision system on covariant receptive field families, which expand the receptive field shapes over the degrees of freedom in the image transformations. This paper addresses the problem of deriving relationships between spatial and spatio-temporal receptive field responses obtained for different values of the shape parameters in the resulting multi-parameter families of receptive fields. For this purpose, we derive both (i) infinitesimal relationships, roughly corresponding to a combination of notions from semi-groups and Lie groups, as well as (ii) macroscopic cascade smoothing properties, which describe how receptive field responses at coarser spatial and temporal scales can be computed by applying smaller support incremental filters to the output from corresponding receptive fields at finer spatial and temporal scales, structurally related to the notion of Lie algebras, although with directional preferences. The presented results provide (i) a deeper understanding of the relationships between spatial and spatio-temporal receptive field responses for different values of the filter parameters, which can be used for both (ii) designing more efficient schemes for computing receptive field responses over populations of multi-parameter families of receptive fields, as well as (iii) formulating idealized theoretical models of the computations of simple cells in biological vision.
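
The cascade smoothing idea, computing a coarser-scale response by applying a small incremental filter to a finer-scale response, can be checked numerically in a simple discrete setting. The sketch below uses the binomial kernel [1/4, 1/2, 1/4] as a stand-in for a Gaussian; this is a swapped-in simplification chosen here for illustration and not the paper's kernels, which cover far more general spatial and spatio-temporal families. The helper name `convolve` is likewise an assumption.

```python
def convolve(signal, kernel):
    """Full discrete 1D convolution of two lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Binomial kernel with variance 0.5, used as the small incremental filter.
fine_kernel = [0.25, 0.5, 0.25]

# Two increments compose into one coarser kernel (variance 1.0).
coarse_kernel = convolve(fine_kernel, fine_kernel)

signal = [0.0, 0.0, 4.0, 0.0, 8.0, 0.0]

# Route 1: smooth directly at the coarser scale.
direct = convolve(signal, coarse_kernel)

# Route 2: smooth at the finer scale, then apply the incremental filter
# to that output instead of returning to the raw signal.
cascaded = convolve(convolve(signal, fine_kernel), fine_kernel)
```

Both routes give the same response, which is the property that lets coarser-scale outputs be built from finer-scale ones with small-support filters.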

[77] Simulated Cortical Magnification Supports Self-Supervised Object Learning

Zhengyang Yu,Arthur Aubret,Chen Yu,Jochen Triesch

Main category: cs.CV

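TL;DR: The paper studies how simulating the resolution difference between the centre and periphery of human vision (foveated vision) affects the object representations learned by self-supervised models, and finds that this biologically inspired approach improves learning.

Details Motivation: Existing self-supervised learning models ignore the foveated nature of human vision, with high resolution in the centre and low resolution in the periphery, even though it is a defining feature of human visual experience. The paper investigates how this varying resolution affects object representation learning.

Contribution: 1) Proposes modifying the visual input with models of human foveation and cortical magnification; 2) Shows experimentally that this improves the quality of object representations learned by self-supervised models.

Method: 1) Uses two datasets of egocentric videos capturing human visual experience during interactions with objects; 2) Applies models of human foveation and cortical magnification to the inputs; 3) Trains two bio-inspired self-supervised models with a time-based learning objective.

Result: Modelling aspects of foveated vision improves the quality of the learned object representations. The analysis attributes this to objects appearing bigger and to a better trade-off between central and peripheral visual information.

Insight: Biologically inspired visual simulation, such as foveated vision, can provide more realistic and more effective training signals for self-supervised learning.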

Abstract: Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision with high/low resolution in the center/periphery of the visual field. Here, we investigate the role of this varying resolution in the development of object representations. We leverage two datasets of egocentric videos that capture the visual experience of humans during interactions with objects. We apply models of human foveation and cortical magnification to modify these inputs, such that the visual content becomes less distinct towards the periphery. The resulting sequences are used to train two bio-inspired self-supervised learning models that implement a time-based learning objective. Our results show that modeling aspects of foveated vision improves the quality of the learned object representations in this setting. Our analysis suggests that this improvement comes from making objects appear bigger and inducing a better trade-off between central and peripheral visual information. Overall, this work takes a step towards making models of humans’ learning of visual representations more realistic and performant.

[78] MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection

Yang Li,Tingfa Xu,Shuyan Bai,Peifu Liu,Jianan Li

Main category: cs.CV

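TL;DR: The paper introduces MCOD, the first benchmark dataset for multispectral camouflaged object detection. It fills a gap left by RGB-only datasets and shows that multispectral information substantially helps detection.

Details Motivation: Existing camouflaged object detection (COD) benchmarks are RGB-only and offer no support for multispectral approaches, which has held back progress. MCOD aims to fill this gap and encourage research on multispectral methods.

Contribution: 1) Introduces MCOD, the first benchmark dataset for multispectral camouflaged object detection; 2) Covers diverse real-world scenarios and challenge attributes; 3) Demonstrates the value of multispectral modalities for detection robustness.

Method: Builds the MCOD dataset of multispectral images with pixel-level annotations and challenge attribute labels, then benchmarks existing COD methods and the effect of integrating multispectral modalities.

Result: Eleven representative COD methods show a consistent performance drop on MCOD owing to the increased task difficulty; integrating multispectral modalities substantially alleviates this degradation.

Insight: Multispectral information is valuable for camouflaged object detection, especially under the challenging conditions the benchmark targets, such as small objects and extreme lighting.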

Abstract: Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.

[79] Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images

Herve Goeau,Vincent Espitalier,Pierre Bonnet,Alexis Joly

Main category: cs.CV

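TL;DR: PlantCLEF 2024 is a challenge on multi-species plant identification that aims to use AI to improve the efficiency and coverage of ecological studies. It provides a large-scale dataset and state-of-the-art vision transformer models, and the task is framed as multi-label classification.

Details Motivation: Identifying species in vegetation plot images is an important but time-consuming task in ecological research. AI could significantly improve efficiency and extend the scope of such studies.

Contribution: Provides a new test set (thousands of expert-annotated multi-label images covering over 800 species) and a training set (1.7 million single-label plant images), together with state-of-the-art vision transformer models pre-trained on this data.

Method: The task is a weakly labeled multi-label classification problem: predict all species present in a high-resolution plot image using single-label training data.

Result: The challenge results provide a technical benchmark for multi-species identification in ecological research.

Insight: Combining large-scale datasets with state-of-the-art models is key to advancing plant identification for ecological studies.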

Abstract: Plot images are essential for ecological studies, enabling standardized sampling, biodiversity assessment, long-term monitoring and remote, large-scale surveys. Plot images are typically fifty centimetres or one square meter in size, and botanists meticulously identify all the species found there. The integration of AI could significantly improve the efficiency of specialists, helping them to extend the scope and coverage of ecological studies. To evaluate advances in this regard, the PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. In addition, it provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The task is evaluated as a (weakly-labeled) multi-label classification task where the aim is to predict all the plant species present on a high-resolution plot image (using the single-label training data). In this paper, we provide a detailed description of the data, the evaluation methodology, the methods and models employed by the participants and the results achieved.

[80] Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

Weimin Bai,Yubo Li,Weijian Luo,Wenzheng Chen,He Sun

Main category: cs.CV

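TL;DR: The paper proposes VLM3D, a framework that integrates vision-language models (VLMs) into the Score Distillation Sampling (SDS) pipeline to address the limitations of existing text-to-3D generation methods in semantic alignment and spatial consistency.

Details Motivation: Existing SDS-based text-to-3D methods rely on CLIP-style text encoders, which gives coarse semantic alignment. In addition, 2D diffusion priors lack explicit 3D spatial constraints, which leads to geometric inconsistencies and inaccurate spatial relationships in multi-object scenes.

Contribution: Proposes the VLM3D framework, in which VLMs supply fine-grained semantic alignment and spatial priors, improving the semantic fidelity, geometric coherence, and spatial correctness of text-to-3D generation.

Method: VLM3D integrates a vision-language model (instantiated with Qwen2.5-VL) into the SDS pipeline as a differentiable semantic and spatial prior, drawing on its language-grounded supervision and spatial understanding.

Result: On the GPTeval3D benchmark, VLM3D significantly outperforms prior SDS-based methods, particularly in semantic fidelity and spatial consistency.

Insight: VLM3D shows that vision-language models can improve the semantic and spatial consistency of text-to-3D generation, opening a direction for future work.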

Abstract: Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision-language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D based on the open-source Qwen2.5-VL model and evaluate it on the GPTeval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

[81] Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

Chang Soo Lim,Joonyoung Moon,Donghyeon Cho

Main category: cs.CV

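TL;DR: The paper proposes SCOPE, a framework that combines Cutie's query-based segmentation with SAM2's pretrained ViT encoder and adds a motion prediction module for temporal stability. The method took 3rd place in the MOSEv2 track of the 7th LSVOS Challenge.

Details Motivation: Cutie and SAM2 each have limitations in feature capacity and temporal modeling. The paper integrates their complementary strengths to improve video object segmentation.

Contribution: 1. Proposes SCOPE, a framework combining Cutie and SAM2; 2. Introduces a motion prediction module to improve temporal stability; 3. Achieves 3rd place in the LSVOS Challenge using an ensemble strategy.

Method: 1. Replaces Cutie's encoder with SAM2's pretrained ViT encoder; 2. Adds a motion prediction module; 3. Ensembles Cutie, SAM2, and the proposed variant.

Result: The method ranked 3rd in the MOSEv2 track, supporting the effectiveness of enriched feature representation and motion prediction.

Insight: Combining the complementary strengths of different models and adding motion modeling for temporal consistency can improve video object segmentation.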

Abstract: Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.

[82] Ideal Registration? Segmentation is All You Need

Xiang Chen,Fengting Zhang,Qinghao Liu,Min Liu,Kun Wu,Yaonan Wang,Hang Zhang

Main category: cs.CV

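TL;DR: The paper proposes SegReg, a segmentation-driven registration framework that achieves anatomically adaptive image registration and improves registration accuracy.

Details Motivation: Existing image registration methods typically apply globally uniform smoothness constraints, which cannot accommodate the complex, regionally varying deformations characteristic of anatomical motion.

Contribution: Proposes the SegReg framework, which uses segmentation to achieve region-adaptive registration, and validates it experimentally.

Method: SegReg first segments the moving and fixed images into anatomically coherent subregions. The same registration backbone then computes a partial deformation field for each subregion, and the partial fields are integrated into a global deformation field.

Result: With ground-truth segmentation, SegReg reaches 98.23% Dice on critical anatomies. Even with automatic segmentation, it outperforms existing methods by 2-12% across cardiac, abdominal, and lung registration tasks.

Insight: Registration accuracy depends almost linearly on segmentation quality, which turns the registration challenge into a segmentation problem.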

Abstract: Deep learning has revolutionized image registration by its ability to handle diverse tasks while achieving significant speed advantages over conventional approaches. Current approaches, however, often employ globally uniform smoothness constraints that fail to accommodate the complex, regionally varying deformations characteristic of anatomical motion. To address this limitation, we propose SegReg, a Segmentation-driven Registration framework that implements anatomically adaptive regularization by exploiting region-specific deformation patterns. Our SegReg first decomposes input moving and fixed images into anatomically coherent subregions through segmentation. These localized domains are then processed by the same registration backbone to compute optimized partial deformation fields, which are subsequently integrated into a global deformation field. SegReg achieves near-perfect structural alignment (98.23% Dice on critical anatomies) using ground-truth segmentation, and outperforms existing methods by 2-12% across three clinical registration scenarios (cardiac, abdominal, and lung images) even with automatic segmentation. Our SegReg demonstrates a near-linear dependence of registration accuracy on segmentation quality, transforming the registration challenge into a segmentation problem. The source code will be released upon manuscript acceptance.
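
One simple way to picture the final integration step is to let each pixel take its displacement from the partial field of the subregion it belongs to. The abstract does not say how the partial fields are integrated, so the hard mask-based selection below is purely an assumption for illustration, as are the function name `merge_partial_fields` and the `(dy, dx)` tuple layout; a practical system would likely also need to handle continuity at region boundaries.

```python
def merge_partial_fields(label_map, partial_fields):
    """Assemble a global displacement field from per-region fields.

    Hypothetical layout:
      label_map:      2D list of region labels, one per pixel
      partial_fields: dict mapping label -> 2D list of (dy, dx) displacements
                      with the same shape as label_map
    """
    merged = []
    for r, row in enumerate(label_map):
        # Each pixel copies the displacement computed for its own region.
        merged.append([partial_fields[label][r][c]
                       for c, label in enumerate(row)])
    return merged


label_map = [
    [0, 1],
    [1, 1],
]
partial_fields = {
    0: [[(1, 0), (1, 0)], [(1, 0), (1, 0)]],  # region 0 shifts down by one
    1: [[(0, 2), (0, 2)], [(0, 2), (0, 2)]],  # region 1 shifts right by two
}
global_field = merge_partial_fields(label_map, partial_fields)
```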

[83] Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Het Patel,Muzammil Allie,Qian Zhang,Jia Chen,Evangelos E. Papalexakis

Main category: cs.CV

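TL;DR: The paper proposes a lightweight tensor-decomposition defense against adversarial attacks on vision-language models (VLMs). It requires no retraining, filters adversarial noise while preserving semantic information, and markedly improves robustness.

Details Motivation: Existing defenses usually require costly retraining or substantial architecture changes. To address this, the authors propose a lightweight tensor decomposition method that applies to any pre-trained VLM with no additional training overhead.

Contribution: 1. Applies tensor decomposition to adversarial defense of VLMs for the first time; 2. Proposes a lightweight, retraining-free defense applicable to any pre-trained VLM; 3. Experiments on CLIP confirm the method's effectiveness and improved robustness.

Method: Decomposes and reconstructs the vision encoder's representations with tensor decomposition (e.g., Tensor Train decomposition) to filter adversarial noise. The authors find that low rank (8-32) and low residual strength (α = 0.1-0.2) give the best configuration.

Result: On Flickr30K, the method restores 12.3% of the performance lost to attacks, raising Recall@1 from 7.5% to 19.8%; on COCO, it recovers 8.1%, raising accuracy from 3.8% to 11.9%.

Insight: 1. Tensor decomposition is an efficient way to filter adversarial noise; 2. The choice of low rank and residual strength is critical to performance and robustness; 3. The method is a plug-and-play solution suited to practical deployment.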

Abstract: Vision language models (VLMs) excel in multimodal understanding but are prone to adversarial attacks. Existing defenses often demand costly retraining or significant architecture changes. We introduce a lightweight defense using tensor decomposition suitable for any pre-trained VLM, requiring no retraining. By decomposing and reconstructing vision encoder representations, it filters adversarial noise while preserving meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, it restores 12.3% performance lost to attacks, raising Recall@1 accuracy from 7.5% to 19.8%. On COCO, it recovers 8.1% performance, improving accuracy from 3.8% to 11.9%. Analysis shows Tensor Train decomposition with low rank (8-32) and low residual strength ($\alpha=0.1-0.2$) is optimal. This method is a practical, plug-and-play solution with minimal overhead for existing VLMs.
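
The decompose-and-reconstruct idea can be sketched on a small feature matrix. This example does not implement Tensor Train decomposition: it swaps in a rank-1 matrix approximation computed by power iteration, purely to show how a low-rank reconstruction suppresses a perturbation. How the residual strength α enters is also an assumption here (a convex blend of the reconstruction and the original input); the paper's exact formulation is not given in the abstract.

```python
import math


def rank1_reconstruct(X, iters=200):
    """Approximate the best rank-1 fit of X (list of rows) by power iteration."""
    rows, cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(cols)] * cols  # initial guess for the top right singular vector
    for _ in range(iters):
        u = [sum(X[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # X v
        w = [sum(X[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return [[0.0] * cols for _ in range(rows)]
        v = [c / norm for c in w]
    u = [sum(X[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def filter_features(X, alpha=0.1):
    """Blend the low-rank reconstruction with a share alpha of the input (assumed form)."""
    recon = rank1_reconstruct(X)
    return [[(1.0 - alpha) * r + alpha * x for r, x in zip(rr, xr)]
            for rr, xr in zip(recon, X)]


def frobenius_distance(A, B):
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


clean = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # exactly rank 1
noisy = [[1.1, 0.9], [2.0, 2.0], [3.0, 3.0]]   # small perturbation on one row
filtered = filter_features(noisy, alpha=0.1)
```

In this toy case the filtered matrix lies closer to the clean one than the perturbed input does, because the perturbation falls mostly outside the dominant low-rank direction.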

[84] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Yanghao Li,Rui Qian,Bowen Pan,Haotian Zhang,Haoshuo Huang,Bowen Zhang,Jialing Tong,Haoxuan You,Xianzhi Du,Zhe Gan,Hyunjik Kim,Chao Jia,Zhenbang Wang,Yinfei Yang,Mingfei Gao,Zi-Yi Dou,Wenze Hu,Chang Gao,Dongxu Li,Philipp Dufter,Zirui Wang,Guoli Yin,Zhengdong Zhang,Chen Chen,Yang Zhao,Ruoming Pang,Zhifeng Chen

Main category: cs.CV

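TL;DR: MANZANO is a simple and scalable unified multimodal model that, through a hybrid vision tokenizer and a carefully designed training recipe, substantially reduces the performance trade-off between understanding and generating visual content.

Details Motivation: Existing open-source unified multimodal Large Language Models (LLMs) suffer from a performance trade-off between understanding and generating visual content, which limits their potential. The authors aim to resolve this with a more efficient framework.

Contribution: MANZANO's main contribution is a hybrid image tokenizer combined with a unified training recipe, which enables joint learning of understanding and generation and delivers strong performance and scalability.

Method: A shared vision encoder feeds two lightweight adapters (producing continuous embeddings and discrete tokens, respectively); a unified LLM predicts text and image tokens; an auxiliary diffusion decoder translates image tokens into pixels.

Result: MANZANO achieves state-of-the-art performance among unified models, is competitive with specialist models particularly on text-rich evaluations, and improves consistently as model size scales.

Insight: The hybrid tokenizer design minimizes task conflicts and yields consistent gains when scaling model size, validating the design.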

Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

[85] TASAM: Terrain-and-Aware Segment Anything Model for Temporal-Scale Remote Sensing Segmentation

Tianyang Wang,Xi Xiao,Gaofei Chen,Hanzhang Chi,Qi Zhang,Guo Cheng,Yingrui Ji

Main category: cs.CV

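TL;DR: TASAM extends SAM for the distinctive challenges of remote sensing image segmentation (complex terrain, multi-scale objects, and temporal dynamics), improving performance through three lightweight modules without retraining the SAM backbone.

Details Motivation: SAM performs well on natural images but generalizes poorly to remote sensing data, especially under complex terrain, multi-scale objects, and temporal change. TASAM aims to address these issues and make SAM more applicable to remote sensing.

Contribution: 1. A terrain-aware adapter that injects elevation priors; 2. A temporal prompt generator that captures land-cover changes; 3. A multi-scale fusion strategy that improves fine-grained object segmentation.

Method: TASAM integrates three modules on top of SAM: a terrain-aware adapter, a temporal prompt generator, and a multi-scale fusion strategy, without retraining the backbone.

Result: On three remote sensing benchmarks (LoveDA, iSAID, and WHU-CD), TASAM substantially outperforms zero-shot SAM and task-specific models with minimal computational overhead.

Insight: Domain-adaptive augmentation of foundation models can markedly improve performance on specific tasks such as remote sensing segmentation, without large-scale retraining.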

Abstract: Segment Anything Model (SAM) has demonstrated impressive zero-shot segmentation capabilities across natural image domains, but it struggles to generalize to the unique challenges of remote sensing data, such as complex terrain, multi-scale objects, and temporal dynamics. In this paper, we introduce TASAM, a terrain- and temporally-aware extension of SAM designed specifically for high-resolution remote sensing image segmentation. TASAM integrates three lightweight yet effective modules: a terrain-aware adapter that injects elevation priors, a temporal prompt generator that captures land-cover changes over time, and a multi-scale fusion strategy that enhances fine-grained object delineation. Without retraining the SAM backbone, our approach achieves substantial performance gains across three remote sensing benchmarks (LoveDA, iSAID, and WHU-CD), outperforming both zero-shot SAM and task-specific models with minimal computational overhead. Our results highlight the value of domain-adaptive augmentation for foundation models and offer a scalable path toward more robust geospatial segmentation.

[86] ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding

Kehua Chen

Main category: cs.CV

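TL;DR: ChronoForge-RL is a new video understanding framework that combines Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO) to address inefficient frame processing in dense video and the difficulty of identifying semantically significant keyframes, with clear performance gains.

Details Motivation: Existing video understanding methods are computationally inefficient on dense video and struggle to identify semantically significant frames with uniform sampling.

Contribution: 1. A differentiable keyframe selection mechanism; 2. The TAD and KF-GRPO modules, which improve frame selection and temporal reasoning; 3. Performance that clearly surpasses baseline methods.

Method: 1. TAD selects keyframes through variation scoring, inflection detection, and prioritized distillation; 2. KF-GRPO uses contrastive learning with a saliency-enhanced reward mechanism to exploit both frame content and temporal relationships.

Result: Achieves 69.1% on VideoMME and 52.7% on LVBench, outperforming baseline methods; the 7B-parameter model performs comparably to 72B-parameter models.

Insight: Combining differentiable keyframe selection with reinforcement learning captures semantic information in video efficiently while improving computational efficiency.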

Abstract: Current state-of-the-art video understanding methods typically struggle with two critical challenges: (1) the computational infeasibility of processing every frame in dense video content and (2) the difficulty in identifying semantically significant frames through naive uniform sampling strategies. In this paper, we propose a novel video understanding framework, called ChronoForge-RL, which combines Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO) to tackle these issues. Concretely, we introduce a differentiable keyframe selection mechanism that systematically identifies semantic inflection points through a three-stage process to enhance computational efficiency while preserving temporal information. Then, two particular modules are proposed to enable effective temporal reasoning: Firstly, TAD leverages variation scoring, inflection detection, and prioritized distillation to select the most informative frames. Secondly, we introduce KF-GRPO which implements a contrastive learning paradigm with a saliency-enhanced reward mechanism that explicitly incentivizes models to leverage both frame content and temporal relationships. Finally, our proposed ChronoForge-RL achieves 69.1% on VideoMME and 52.7% on LVBench compared to baseline methods, clearly surpassing previous approaches while enabling our 7B parameter model to achieve performance comparable to 72B parameter alternatives.
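
To make the first two TAD stages more concrete, here is a simplified, non-differentiable sketch: each frame is scored by how much its feature vector changes from the previous frame, and local peaks of that score are kept as keyframes. The function names, the L1 change measure, and the peak rule are all assumptions for illustration; the paper describes a differentiable mechanism whose details are not given in the abstract.

```python
def variation_scores(features):
    """Score each frame by its L1 change from the previous frame (first frame scores 0)."""
    scores = [0.0]
    for prev, cur in zip(features, features[1:]):
        scores.append(float(sum(abs(a - b) for a, b in zip(prev, cur))))
    return scores


def select_keyframes(features, k):
    """Return indices of up to k frames at local peaks of the variation score."""
    scores = variation_scores(features)
    peaks = [
        i for i in range(1, len(scores) - 1)
        if scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
    ]
    # Prioritize the strongest peaks, then restore temporal order.
    peaks.sort(key=lambda i: scores[i], reverse=True)
    return sorted(peaks[:k])


# Hypothetical 1-D frame features with two abrupt content changes.
frames = [[0], [0], [5], [5], [5], [9], [9]]
top_one = select_keyframes(frames, 1)
top_two = select_keyframes(frames, 2)
```

Compared with uniform sampling, a peak-based rule concentrates the frame budget on moments where content changes, which is the motivation the abstract gives for keyframe selection.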

[87] CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models

Fangjian Shen,Zifeng Liang,Chao Wang,Wushao Wen

Main category: cs.CV

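TL;DR: CIDER is a new framework that mitigates brand bias in text-to-image models through prompt refinement, without retraining, and significantly reduces bias while maintaining image quality.

Details Motivation: Existing T2I models exhibit brand bias, a tendency to generate content featuring dominant commercial brands, which may raise ethical and legal issues.

Contribution: Proposes the CIDER framework, which uses a lightweight detector and a vision-language model to generate alternative content, and introduces the Brand Neutrality Score (BNS) to quantify bias.

Method: A detector identifies branded content, a VLM generates stylistically divergent alternatives, and BNS is used to assess bias.

Result: Experiments show that CIDER significantly reduces both explicit and implicit brand bias without sacrificing image quality.

Insight: Prompt refinement is an efficient, low-cost way to mitigate bias in generative models.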

Abstract: Text-to-image (T2I) models exhibit a significant yet under-explored “brand bias”, a tendency to generate content featuring dominant commercial brands from generic prompts, posing ethical and legal risks. We propose CIDER, a novel, model-agnostic framework that mitigates this bias at inference time through prompt refinement, avoiding costly retraining. CIDER uses a lightweight detector to identify branded content and a Vision-Language Model (VLM) to generate stylistically divergent alternatives. We introduce the Brand Neutrality Score (BNS) to quantify this issue and perform extensive experiments on leading T2I models. Results show CIDER significantly reduces both explicit and implicit biases while maintaining image quality and aesthetic appeal. Our work offers a practical solution for more original and equitable content, contributing to the development of trustworthy generative AI.

[88] Boosting Active Learning with Knowledge Transfer

Tianyang Wang,Xi Xiao,Gaofei Chen,Xiaoying Liao,Guo Cheng,Yingrui Ji

Main category: cs.CV

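TL;DR: The paper proposes a new method that uses knowledge transfer to improve uncertainty estimation in active learning. A teacher-student framework removes the need for complex auxiliary models and applies to a variety of tasks.

Details Motivation: Traditional active learning methods rely on complex auxiliary models and special training schemes (e.g., adversarial training) for uncertainty estimation, which are especially hard to apply to domain tasks such as Cryo-Electron Tomography classification in computational biology.

Contribution: 1. A knowledge-transfer-based teacher-student framework that simplifies uncertainty estimation; 2. A demonstration that data uncertainty is closely related to the upper bound of the task loss rather than its concrete value; 3. A task-agnostic method applicable to multiple domains.

Method: In the teacher-student setup, the teacher is the task model in active learning and the student is an auxiliary model. The two models are trained simultaneously in each active learning cycle, and a distance between their outputs measures uncertainty for unlabeled data.

Result: The method's efficacy and efficiency are validated on classical computer vision tasks and cryo-ET challenges.

Insight: 1. Uncertainty estimation does not require a complex training framework; 2. The upper bound of the task loss is the key indicator of data uncertainty.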

Abstract: Uncertainty estimation is at the core of Active Learning (AL). Most existing methods resort to complex auxiliary models and advanced training schemes to estimate uncertainty for unlabeled data. These models need special design and hence are difficult to train, especially for domain tasks such as Cryo-Electron Tomography (cryo-ET) classification in computational biology. To address this challenge, we propose a novel method using knowledge transfer to boost uncertainty estimation in AL. Specifically, we exploit a teacher-student setup where the teacher is the task model in AL and the student is an auxiliary model that learns from the teacher. We train the two models simultaneously in each AL cycle and adopt a distance between the model outputs to measure uncertainty for unlabeled data. The student model is task-agnostic and does not rely on special training schemes (e.g., adversarial training), making our method suitable for various tasks. More importantly, we demonstrate that data uncertainty is not tied to the concrete value of the task loss but is closely related to the upper bound of the task loss. We conduct extensive experiments to validate the proposed method on classical computer vision tasks and cryo-ET challenges. The results demonstrate its efficacy and efficiency.
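
The selection step can be sketched as follows: score each unlabeled sample by the distance between the teacher's and the student's outputs, then send the highest-scoring samples for labeling. The abstract does not name the distance, so the squared Euclidean distance below is an assumption, as are the function names and the toy probability vectors.

```python
def output_distance(p, q):
    """Squared Euclidean distance between two output vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def select_for_labeling(teacher_out, student_out, budget):
    """Return indices of the `budget` samples where teacher and student disagree most."""
    scores = [output_distance(t, s) for t, s in zip(teacher_out, student_out)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical class-probability outputs for three unlabeled samples.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.9, 0.1], [0.1, 0.9], [0.3, 0.7]]
chosen = select_for_labeling(teacher, student, budget=2)
```

Because the score only needs two forward passes per sample, this kind of criterion avoids the adversarial or otherwise specialized training that the abstract identifies as a burden.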

[89] LC-SLab – An Object-based Deep Learning Framework for Large-scale Land Cover Classification from Satellite Imagery and Sparse In-situ Labels

Johannes Leonhardt,Juergen Gall,Ribana Roscher

Main category: cs.CV

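TL;DR: LC-SLab is an object-based deep learning framework for large-scale land cover classification from satellite imagery and sparse in-situ labels; input-level and output-level aggregation improve the accuracy and coherence of the classification.

Details Motivation: Existing deep learning-based land cover classification methods often produce fragmented and noisy predictions when trained with sparse in-situ labels. Object-based classification, which assigns labels to semantically coherent regions, can address this, but such methods remain underexplored in deep learning.

Contribution: Proposes LC-SLab, the first object-based deep learning framework for large-scale land cover classification under sparse supervision, supporting input-level aggregation via graph neural networks and output-level aggregation by postprocessing semantic segmentation results.

Method: Combines input-level (graph neural network) and output-level (postprocessing) aggregation, and uses features from a large pre-trained network to improve performance on small datasets.

Result: Experiments show that object-based methods match or exceed common pixel-wise models in accuracy while producing substantially more coherent maps; input-level aggregation is more robust on small datasets, whereas output-level aggregation performs best with more data.

Insight: Object-based classification effectively reduces noise and fragmentation; input-level and output-level aggregation each have advantages at different data scales, offering a new direction for large-scale land cover classification.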

Abstract: Large-scale land cover maps generated using deep learning play a critical role across a wide range of Earth science applications. Open in-situ datasets from principled land cover surveys offer a scalable alternative to manual annotation for training such models. However, their sparse spatial coverage often leads to fragmented and noisy predictions when used with existing deep learning-based land cover mapping approaches. A promising direction to address this issue is object-based classification, which assigns labels to semantically coherent image regions rather than individual pixels, thereby imposing a minimum mapping unit. Despite this potential, object-based methods remain underexplored in deep learning-based land cover mapping pipelines, especially in the context of medium-resolution imagery and sparse supervision. To address this gap, we propose LC-SLab, the first deep learning framework for systematically exploring object-based deep learning methods for large-scale land cover classification under sparse supervision. LC-SLab supports both input-level aggregation via graph neural networks, and output-level aggregation by postprocessing results from established semantic segmentation models. Additionally, we incorporate features from a large pre-trained network to improve performance on small datasets. We evaluate the framework on annual Sentinel-2 composites with sparse LUCAS labels, focusing on the tradeoff between accuracy and fragmentation, as well as sensitivity to dataset size. Our results show that object-based methods can match or exceed the accuracy of common pixel-wise models while producing substantially more coherent maps. Input-level aggregation proves more robust on smaller datasets, whereas output-level aggregation performs best with more data. Several configurations of LC-SLab also outperform existing land cover products, highlighting the framework’s practical utility.
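
One plausible form of output-level aggregation is a majority vote: every pixel in an image segment receives the class predicted most often within that segment. The sketch below assumes that rule, along with the function name and the toy class labels; the abstract only says that results from semantic segmentation models are postprocessed, so the actual procedure may differ.

```python
from collections import Counter


def aggregate_by_object(pixel_classes, segment_ids):
    """Relabel each pixel with the majority class of its segment."""
    votes = {}
    for class_row, segment_row in zip(pixel_classes, segment_ids):
        for cls, seg in zip(class_row, segment_row):
            votes.setdefault(seg, Counter())[cls] += 1
    # Majority class per segment (ties fall back to first-seen order).
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [[winner[seg] for seg in row] for row in segment_ids]


# Hypothetical pixel-wise predictions with one stray "water" pixel in segment 0.
pixel_classes = [["forest", "forest", "crop"],
                 ["forest", "water", "crop"]]
segment_ids = [[0, 0, 1],
               [0, 0, 1]]
coherent_map = aggregate_by_object(pixel_classes, segment_ids)
```

The stray pixel is absorbed into its surrounding object, which illustrates how object-level labels impose a minimum mapping unit and reduce fragmentation.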

[90] Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

Liwei Liao,Xufeng Li,Xiaoyun Zheng,Boning Liu,Feng Gao,Ronggang Wang

Main category: cs.CV

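TL;DR: The paper proposes GVR, a zero-shot 3D visual grounding framework that uses view retrieval to turn the 3D problem into a 2D task, avoiding costly 3D annotation and per-scene training.

Details Motivation: Existing 3D visual grounding methods struggle with the implicit representation of spatial textures in 3D Gaussian Splatting and depend on large amounts of labeled data. GVR aims to overcome these limitations and achieve zero-shot grounding.

Contribution: 1. The GVR framework, which recasts 3D visual grounding as a multi-view 2D retrieval task; 2. No per-scene training or 3D annotation; 3. State-of-the-art performance in the zero-shot setting.

Method: Object-level view retrieval collects grounding clues from multiple views, solving the 3D problem with 2D retrieval techniques without training on scene data.

Result: Experiments show that GVR achieves state-of-the-art visual grounding performance in the zero-shot setting.

Insight: Decomposing a 3D problem into 2D tasks is an effective zero-shot solution that reduces data dependence and computational overhead.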

Abstract: 3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that recasts 3DVG as a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.

[91] ENSAM: an efficient foundation model for interactive segmentation of 3D medical images

Elias Stenhede,Agnar Martin Bjørnstad,Arian Ranjbar

Main category: cs.CV

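TL;DR: ENSAM is a lightweight, promptable foundation model for 3D medical image segmentation that combines a SegResNet encoder, a prompt encoder, and a mask decoder, and performs well under limited data and compute.

Details Motivation: To address limited compute and scarce data in 3D medical image segmentation by providing a general and efficient interactive segmentation solution.

Contribution: 1. The ENSAM model, which combines several optimization techniques (such as relative positional encoding and the Muon optimizer); 2. Efficient training on little data and a single GPU; 3. Performance exceeding several baseline models in the CVPR 2025 challenge.

Method: A U-Net-style architecture combining a SegResNet encoder, a prompt encoder, and a mask decoder, with latent cross-attention, relative positional encoding, and normalized attention, trained with the Muon optimizer to speed up training.

Result: In the CVPR 2025 challenge, ENSAM obtains a DSC AUC of 2.404, NSD AUC of 2.266, final DSC of 0.627, and final NSD of 0.597, outperforming some baseline models.

Insight: Relative positional encoding and the Muon optimizer substantially speed up convergence and improve segmentation quality, making the approach suitable for resource-constrained 3D medical image segmentation.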

Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal 3D medical image segmentation. ENSAM combines a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style architecture, using latent cross-attention, relative positional encoding, normalized attention, and the Muon optimizer for training. ENSAM is designed to achieve good performance under limited data and computational budgets, and is trained from scratch on under 5,000 volumes from multiple modalities (CT, MRI, PET, ultrasound, microscopy) on a single 32 GB GPU in 6 hours. As part of the CVPR 2025 Foundation Models for Interactive 3D Biomedical Image Segmentation Challenge, ENSAM was evaluated on a hidden test set of multimodal 3D medical images, obtaining a DSC AUC of 2.404, NSD AUC of 2.266, final DSC of 0.627, and final NSD of 0.597. It outperforms two previously published baseline models (VISTA3D, SAM-Med3D) and matches the third (SegVol), surpassing it in final DSC but trailing in the other three metrics. In the coreset track of the challenge, ENSAM ranks 5th of 10 overall and best among the approaches not utilizing pretrained weights. Ablation studies confirm that our use of relative positional encodings and the Muon optimizer each substantially speed up convergence and improve segmentation quality.

[92] RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

Xiaosheng Long,Hanyu Wang,Zhentao Song,Kun Luo,Hongde Liu

Main category: cs.CV

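TL;DR: RACap is a lightweight retrieval-augmented image captioning model that uses relation-aware prompting to address coarse-grained semantic prompts and insufficient relation modeling.

Details Motivation: Existing retrieval-augmented image captioning methods are limited in relation modeling: semantic prompts are too coarse-grained, and the methods lack explicit modeling of image objects and their semantic relationships.

Contribution: Proposes RACap, a relation-aware retrieval-augmented model that mines structured relation semantics from retrieved captions and identifies heterogeneous objects in the image.

Method: Through relation-aware prompting and heterogeneous object identification, the model retrieves and integrates structured relation features to strengthen semantic consistency and relational expressiveness.

Result: With only 10.8M trainable parameters, RACap outperforms other lightweight captioning models.

Insight: Fine-grained relation modeling and explicit identification of heterogeneous objects are key to improving retrieval-augmented image captioning.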

Abstract: Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.

[93] RangeSAM: Leveraging Visual Foundation Models for Range-View represented LiDAR segmentation

Paul Julius Kühn,Duc Anh Nguyen,Arjan Kuijper,Holger Graf,Dieter Fellner,Saptarshi Neil Sinha

Main category: cs.CV

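TL;DR: The paper proposes RangeSAM, the first framework to apply the visual foundation model SAM2 to range-view segmentation of LiDAR point clouds; by modifying the SAM2 encoder, it obtains an efficient 3D segmentation method with competitive performance.

Details Motivation: Point cloud segmentation is central to autonomous driving and 3D scene understanding. Voxel- and point-based methods dominate research but incur high computational cost and limited real-time efficiency. Range-view methods, by contrast, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions, yet remain relatively underexplored. The authors explore whether Visual Foundation Models (VFMs) can be applied to LiDAR point cloud segmentation.

Contribution: 1. RangeSAM, the first range-view LiDAR segmentation framework based on the visual foundation model SAM2. 2. Modifications to the SAM2 encoder tailored to LiDAR range views, covering horizontal spatial dependencies, the geometry of spherical projections, and the distinctive spatial patterns of range images. 3. Competitive performance on SemanticKITTI while retaining the speed and deployment advantages of 2D pipelines.

Method: 1. Adapts SAM2 to the range-view representation, coupling 2D feature extraction with standard projection/back-projection. 2. Introduces a new encoder module that emphasizes horizontal spatial dependencies in LiDAR range images. 3. Customizes the encoder configuration for the geometric properties of spherical projections. 4. Designs a dedicated mechanism to capture spatial patterns and discontinuities in range-view pseudo-images.

Result: RangeSAM achieves competitive performance on SemanticKITTI while offering speed, scalability, and simple deployment.

Insight: 1. Visual Foundation Models (VFMs) can serve as general-purpose backbones for 3D perception. 2. Range-view methods are an efficient alternative for point cloud segmentation. 3. The successful application of SAM2 opens a new path for LiDAR segmentation.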

Abstract: Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present RangeSAM, to our knowledge the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation. The results suggest that range-view segmentation methods using VFMs are a promising direction.
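
The "standard projection" mentioned in the abstract is commonly a spherical projection that maps each LiDAR point to a pixel of a range image. The sketch below shows one widely used convention; the function name, the image size, the field-of-view values, and the exact pixel convention are assumptions for illustration and are not taken from the paper.

```python
import math


def project_to_range_image(points, width, height, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points to a height x width range image (None = empty pixel)."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no defined direction
        yaw = math.atan2(y, x)      # horizontal angle
        pitch = math.asin(z / r)    # vertical angle
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        # Keep the nearest return when several points share a pixel.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image


points = [(1, 0, 0), (2, 0, 0), (0, 1, 0)]
range_image = project_to_range_image(points, width=8, height=4,
                                     fov_up_deg=10.0, fov_down_deg=-10.0)
```

Once points are arranged this way, a 2D backbone can process the image, and per-pixel predictions can be mapped back to the points that produced them (the back-projection step).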

[94] Deep Feedback Models

David Calhas,Arlindo L. Oliveira

Main category: cs.CV

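TL;DR: Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom-up input with high-level representations; the resulting dynamic feedback mechanism improves performance under noise and limited data.

Details Motivation: Traditional feedforward neural networks lack dynamics and robustness; DFMs introduce feedback to mimic biological decision making and improve stability and generalization.

Contribution: Proposes Deep Feedback Models (DFMs), neural networks that iteratively refine their internal state through feedback, improving noise robustness and generalization from limited data.

Method: Models the feedback process as a differential equation solved by a recurrent neural network with exponential decay to ensure convergence.

Result: On object recognition and segmentation tasks, DFMs outperform feedforward networks in low-data and high-noise regimes, and also transfer to medical imaging.

Insight: Feedback is key to achieving stable, robust, and generalizable learning.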

Abstract: Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.
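
A toy scalar example may help picture feedback dynamics with decay. The equation below, dh/dt = -h + tanh(w_in * x + w_fb * h), is an assumed form chosen for illustration, as are the weights, step size, and function name; the abstract does not state the actual differential equation or how the recurrent network discretizes it.

```python
import math


def run_feedback(x, w_in=0.5, w_fb=0.5, dt=0.1, steps=500):
    """Euler-integrate dh/dt = -h + tanh(w_in * x + w_fb * h), starting from h = 0."""
    h = 0.0
    for _ in range(steps):
        drive = math.tanh(w_in * x + w_fb * h)  # bottom-up input plus fed-back state
        h += dt * (-h + drive)                  # the -h term decays the state
    return h


h_final = run_feedback(1.0)
```

With a feedback weight below 1 in magnitude, the decay term dominates and the state settles at a fixed point where h equals tanh(w_in * x + w_fb * h), which is the kind of convergence the decay is meant to guarantee.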

[95] Sparse Multiview Open-Vocabulary 3D Detection

Olivier Moliner,Viktor Larsson,Kalle Åström

Main category: cs.CV

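TL;DR: This paper proposes a training-free method for sparse multiview open-vocabulary 3D detection that relies on pre-trained 2D foundation models; it lifts 2D detections and optimizes 3D proposals to detect 3D objects in multi-view scenes.

Details Motivation: Traditional 3D object detection methods are trained for a fixed set of categories, which limits their use. The authors address open-vocabulary 3D detection in the sparse-view setting (only a few RGB images as input), drawing on the rich pre-trained knowledge in 2D models.

Contribution: 1. A training-free method that uses pre-trained 2D foundation models directly for open-vocabulary 3D detection; 2. Lifting of 2D detections and optimization of 3D proposals for feature consistency across views.

Method: Building on pre-trained 2D foundation models, the method lifts 2D detections into 3D proposals and optimizes them for featuremetric consistency across views, avoiding expensive 3D feature fusion and 3D-specific learning.

Result: On standard benchmarks, the method significantly outperforms prior work in the sparse-view setting and is competitive in densely sampled scenarios.

Insight: Exploiting the pre-trained knowledge in 2D models is an effective route to 3D detection, especially when data is sparse.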

Abstract: The ability to interpret and comprehend a 3D scene is essential for many vision and robotics systems. In numerous applications, this involves 3D object detection, i.e., identifying the location and dimensions of objects belonging to a specific category, typically represented as bounding boxes. This has traditionally been solved by training to detect a fixed set of categories, which limits its use. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting, where only a limited number of posed RGB images are available as input. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion or requiring 3D-specific learning. By lifting 2D detections and directly optimizing 3D proposals for featuremetric consistency across views, we fully leverage the extensive training data available in 2D compared to 3D. Through standard benchmarks, we demonstrate that this simple pipeline establishes a powerful baseline, performing competitively with state-of-the-art techniques in densely sampled scenarios while significantly outperforming them in the sparse-view setting.

[96] A multi-temporal multi-spectral attention-augmented deep convolution neural network with contrastive learning for crop yield prediction

Shalini Dangi,Surya Karthikeya Mullapudi,Chandravardhan Singh Raghaw,Shahid Shafi Dar,Mohammad Zia Ur Rehman,Nagendra Kumar

Main category: cs.CV

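TL;DR: This paper proposes MTMS-YieldNet, a multi-temporal multi-spectral attention-augmented deep convolutional neural network with contrastive learning for crop yield prediction, which outperforms existing methods.

Details Motivation: Climate change makes crop yield prediction harder, and traditional methods struggle with multi-spectral data. The paper aims to address this and improve prediction accuracy.

Contribution: Proposes MTMS-YieldNet, which integrates multi-spectral and temporal data and uses contrastive learning to capture spatial-spectral patterns and spatio-temporal dependencies, improving yield prediction.

Method: An attention-augmented deep convolutional neural network, pre-trained with contrastive learning, integrates multi-temporal multi-spectral data for feature extraction.

Result: MTMS-YieldNet achieves MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and 0.331 on Sentinel-2, outperforming seven existing methods.

Insight: By capturing correlations between multi-spectral and temporal data, the model gives more reliable predictions for agricultural decision making, which may help improve crop yields.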

Abstract: Precise yield prediction is essential for agricultural sustainability and food security. However, climate change complicates accurate yield prediction by affecting major factors such as weather conditions, soil fertility, and farm management systems. Advances in technology have played an essential role in overcoming these challenges by leveraging satellite monitoring and data analysis for precise yield estimation. Current methods rely on spatio-temporal data for predicting crop yield, but they often struggle with multi-spectral data, which is crucial for evaluating crop health and growth patterns. To resolve this challenge, we propose a novel Multi-Temporal Multi-Spectral Yield Prediction Network, MTMS-YieldNet, that integrates spectral data with spatio-temporal information to effectively capture the correlations and dependencies between them. While existing methods rely on pre-trained models trained on general visual data, MTMS-YieldNet utilizes contrastive learning for feature discrimination during pre-training, focusing on capturing spatial-spectral patterns and spatio-temporal dependencies from remote sensing data. Both quantitative and qualitative assessments highlight the excellence of the proposed MTMS-YieldNet over seven existing state-of-the-art methods. MTMS-YieldNet achieves MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and an outstanding 0.331 on Sentinel-2, demonstrating effective yield prediction performance across diverse climatic and seasonal conditions. The outstanding performance of MTMS-YieldNet improves yield predictions and provides valuable insights that can assist farmers in making better decisions, potentially improving crop yields.
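
For readers unfamiliar with the metric, mean absolute percentage error (MAPE) averages the absolute prediction error relative to the true value. The sketch below returns it as a fraction; whether the paper's reported scores (such as 0.331) are fractions or percentages is not stated in the abstract, and the yield values used here are made up for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.25 means 25%)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Undefined when a true value is zero; such samples would raise ZeroDivisionError.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted yields.
score = mape([100.0, 200.0], [110.0, 180.0])
```

Lower values are better, and because each error is divided by the true value, the metric is comparable across regions with very different yield levels.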

[97] CoPAD: Multi-source Trajectory Fusion and Cooperative Trajectory Prediction with Anchor-oriented Decoder in V2X Scenarios

Kangyu Wu,Jiaqi Qiao,Ya Zhang

Main category: cs.CV

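TL;DR: CoPAD is a lightweight cooperative trajectory prediction framework. With a fusion module based on the Hungarian algorithm and Kalman filtering, a Past Time Attention (PTA) module, a mode attention module, and an anchor-oriented decoder (AoD), it performs early fusion of multi-source trajectory data and accurate prediction.

Details Motivation: The instability of single-vehicle perception limits trajectory prediction, while cooperation across multiple data sources can improve the completeness and accuracy of predictions.

Contribution: Proposes CoPAD, a novel cooperative trajectory prediction framework comprising a fusion module, a PTA module, a mode attention module, and an AoD decoder, which achieves efficient multi-source fusion and diverse predictions.

Method: 1) A fusion module based on the Hungarian algorithm and Kalman filtering; 2) a Past Time Attention (PTA) module; 3) a mode attention module; 4) a decoder based on sparse anchors (AoD).

Result: Achieves state-of-the-art performance on the DAIR-V2X-Seq dataset, validating the model in V2X scenarios.

Insight: Combining multi-source fusion with attention mechanisms improves the accuracy and diversity of trajectory prediction, and the sparse-anchor design helps generate complete trajectories.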

Abstract: Recently, data-driven trajectory prediction methods have achieved remarkable results, significantly advancing the development of autonomous driving. However, the instability of single-vehicle perception introduces certain limitations to trajectory prediction. In this paper, a novel lightweight framework for cooperative trajectory prediction, CoPAD, is proposed. This framework incorporates a fusion module based on the Hungarian algorithm and Kalman filtering, along with the Past Time Attention (PTA) module, mode attention module, and anchor-oriented decoder (AoD). It effectively performs early fusion on multi-source trajectory data from vehicles and road infrastructure, yielding trajectories with high completeness and accuracy. The PTA module can efficiently capture potential interaction information among historical trajectories, and the mode attention module is proposed to enrich the diversity of predictions. Additionally, the decoder based on sparse anchors is designed to generate the final complete trajectories. Extensive experiments show that CoPAD achieves state-of-the-art performance on the DAIR-V2X-Seq dataset, validating the effectiveness of the model in cooperative trajectory prediction in V2X scenarios.
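
The fusion module pairs two standard tools: an optimal assignment to match tracks seen by the vehicle with tracks seen by the infrastructure, and a Kalman update to merge matched observations. The sketch below illustrates both on toy data. For brevity it finds the optimal assignment by brute force over permutations instead of implementing the Hungarian algorithm (which computes the same minimum-cost matching efficiently, for example via `scipy.optimize.linear_sum_assignment`). All names, positions, and variances are hypothetical, and the paper's actual cost function and filter state are not described in the abstract.

```python
from itertools import permutations


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def match_tracks(vehicle_pos, infra_pos):
    """Minimum-cost one-to-one matching (brute force; needs len(vehicle) <= len(infra))."""
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(infra_pos)), len(vehicle_pos)):
        cost = sum(squared_distance(vehicle_pos[i], infra_pos[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))


def kalman_update(mean, var, measurement, measurement_var):
    """Scalar Kalman measurement update: returns the fused mean and variance."""
    gain = var / (var + measurement_var)
    return mean + gain * (measurement - mean), (1.0 - gain) * var


vehicle = [(0.0, 0.0), (10.0, 0.0)]
infra = [(10.5, 0.0), (0.2, 0.0)]
pairs = match_tracks(vehicle, infra)
fused_mean, fused_var = kalman_update(0.0, 1.0, 2.0, 1.0)
```

Matching first and filtering second means each fused track draws on both viewpoints, which is how cooperative perception can fill gaps left by a single vehicle's sensors.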

[98] Towards Sharper Object Boundaries in Self-Supervised Depth Estimation

Aurélien Cecille,Stefan Duffner,Franck Davoine,Rémi Agier,Thibault Neveu

Main category: cs.CV

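TL;DR: The paper proposes a self-supervised monocular depth estimation method that models each pixel's depth as a mixture distribution, markedly sharpening object boundaries without fine-grained supervision.

Details Motivation: Existing monocular depth estimation methods tend to blur depth at object boundaries, introducing spurious 3D points, and usually need fine-grained supervision to fix this. The paper aims for sharp depth discontinuities using self-supervision alone.

Contribution: A mixture-distribution formulation that shifts depth uncertainty to the mixture weights, improving boundary sharpness without fine-grained supervision.

Method: Each pixel's depth is modeled as a mixture distribution, which integrates into existing pipelines via variance-aware loss functions and uncertainty propagation.

Result: On KITTI and VKITTIv2, the method achieves up to 35% higher boundary sharpness and improves point cloud quality.

Insight: Shifting uncertainty to the mixture weights through probabilistic modeling handles boundary blur in depth estimation more naturally while keeping the efficiency of self-supervision.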

Abstract: Accurate monocular depth estimation is crucial for 3D scene understanding, but existing methods often blur depth at object boundaries, introducing spurious intermediate 3D points. While achieving sharp edges usually requires very fine-grained supervision, our method produces crisp depth discontinuities using only self-supervision. Specifically, we model per-pixel depth as a mixture distribution, capturing multiple plausible depths and shifting uncertainty from direct regression to the mixture weights. This formulation integrates seamlessly into existing pipelines via variance-aware loss functions and uncertainty propagation. Extensive evaluations on KITTI and VKITTIv2 show that our method achieves up to 35% higher boundary sharpness and improves point cloud quality compared to state-of-the-art baselines.
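
A two-component toy example shows why a mixture helps at object boundaries. For a pixel that straddles a near object and a far background, averaging the two hypotheses yields an intermediate depth that belongs to neither surface, whereas picking the component with the largest weight keeps the edge crisp. The function names and numbers are illustrative assumptions; the paper's readout rule and loss functions are not specified in the abstract.

```python
def mixture_mean(weights, depths):
    """Expected depth under the mixture; blends surfaces at boundaries."""
    return sum(w * d for w, d in zip(weights, depths))


def dominant_depth(weights, depths):
    """Depth of the highest-weight component; stays on one surface."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return depths[best]


# Boundary pixel: 60% likely on a near object (2 m), 40% on the background (10 m).
weights = [0.6, 0.4]
depths = [2.0, 10.0]
blurred = mixture_mean(weights, depths)
sharp = dominant_depth(weights, depths)
```

The blended value places a 3D point in empty space between the object and the background, which is exactly the kind of spurious intermediate point the abstract describes.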

[99] DAFTED: Decoupled Asymmetric Fusion of Tabular and Echocardiographic Data for Cardiac Hypertension Diagnosis

Jérémie Stym-Popper,Nathan Painchaud,Clément Rambour,Pierre-Yves Courand,Nicolas Thome,Olivier Bernard

Main category: cs.CV

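TL;DR: The paper proposes DAFTED, an asymmetric multimodal data fusion method that improves the diagnosis of cardiac hypertension.

Details Motivation: Effective fusion of multimodal data is a key problem in medical diagnosis and requires separating shared information from modality-specific information.

Contribution: An asymmetric fusion strategy that starts from a primary modality, disentangles shared and modality-specific information, and integrates secondary modalities.

Method: A decoupling approach separates shared and modality-specific information; it is validated on a cardiac hypertension diagnosis task.

Result: On echocardiographic time series and tabular data from 239 patients, the model achieves an AUC above 90% and outperforms existing methods.

Insight: Asymmetric fusion is promising for multimodal medical data and can markedly improve diagnostic performance.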

Abstract: Multimodal data fusion is a key approach for enhancing diagnosis in medical applications. We propose an asymmetric fusion strategy starting from a primary modality and integrating secondary modalities by disentangling shared and modality-specific information. Validated on a dataset of 239 patients with echocardiographic time series and tabular records, our model outperforms existing methods, achieving an AUC over 90%. This improvement marks a crucial benchmark for clinical use.

[100] Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Xiwei Liu,Yulong Li,Yichen Li,Xinlin Zhuang,Haolin Yang,Huifa Li,Imran Razzak

Main category: cs.CV

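TL;DR: The paper proposes MuproCL, a framework that uses multi-prototype supervision to address semantic ambiguity and intra-class visual diversity in visual continual learning, improving performance.

Details Motivation: Conventional language-guided supervision relies on a single semantic target, which suffers from semantic ambiguity and cannot capture intra-class visual diversity, limiting its support for visual continual learning.

Contribution: Proposes MuproCL, which introduces multi-prototype supervision and LLM-driven, context-aware prototype generation to address semantic ambiguity and visual diversity.

Method: A lightweight LLM agent generates multiple context-aware prototypes, and a LogSumExp mechanism adaptively aligns the vision model with the most relevant prototype.

Result: Across various continual learning baselines, MuproCL consistently improves performance and robustness.

Insight: Multi-prototype supervision captures the semantic and appearance diversity of visual categories more fully, offering a more effective path for language-guided continual learning.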

Abstract: Language-guided supervision, which utilizes a frozen semantic target from a Pretrained Language Model (PLM), has emerged as a promising paradigm for visual Continual Learning (CL). However, relying on a single target introduces two critical limitations: 1) semantic ambiguity, where a polysemous category name results in conflicting visual representations, and 2) intra-class visual diversity, where a single prototype fails to capture the rich variety of visual appearances within a class. To this end, we propose MuproCL, a novel framework that replaces the single target with multiple, context-aware prototypes. Specifically, we employ a lightweight LLM agent to perform category disambiguation and visual-modal expansion to generate a robust set of semantic prototypes. A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image. Extensive experiments across various CL baselines demonstrate that MuproCL consistently enhances performance and robustness, establishing a more effective path for language-guided continual learning.
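The LogSumExp aggregation mentioned above can be sketched as a soft maximum over image-to-prototype similarities. The cosine similarity, the temperature value, and the function names below are assumptions made for illustration; the abstract does not specify them.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_score(image_feat, prototypes, temperature=0.1):
    """Soft maximum of the similarities between an image and a class's prototypes."""
    scaled = [cosine(image_feat, p) / temperature for p in prototypes]
    return temperature * logsumexp(scaled)


# One class with two prototypes (hypothetical 2-D embeddings).
image = [1.0, 0.0]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
score = class_score(image, prototypes)
```

With a small temperature the score is dominated by the best-matching prototype, so the image is effectively aligned with the most relevant one while the others still receive a small gradient.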

[101] DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Meng Yang,Fan Fan,Zizhuo Li,Songchu Deng,Yong Ma,Jiayi Ma

Main category: cs.CV

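TL;DR: DistillMatch is a multimodal image matching method that distills high-level semantic features from a Vision Foundation Model (VFM) and combines them with modality category information and data augmentation to improve matching performance.

Details Motivation: Large appearance differences between modalities and the scarcity of high-quality annotated data limit the performance and generalization of existing deep learning methods for multimodal image matching. VFMs, trained on large-scale data, provide generalizable feature representations that adapt to many modalities.

Contribution: 1) Proposes DistillMatch, which builds a lightweight student model via knowledge distillation from VFMs (such as DINOv2 and DINOv3); 2) injects modality category information to retain modality-specific features; 3) designs V2I-GAN to generate pseudo-infrared images for data augmentation and better generalization.

Method: 1) Uses knowledge distillation to extract high-level semantic features from the VFM to assist matching; 2) extracts modality category information and injects it across modalities; 3) uses V2I-GAN to translate visible images into pseudo-infrared images to augment the data.

Result: DistillMatch outperforms existing methods on public datasets.

Insight: Knowledge distillation from VFMs can improve generalization on multimodal tasks; injecting modality category information strengthens the understanding of cross-modal correlations; and data augmentation (for example with a GAN) can further improve performance.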

Abstract: Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality’s features, which enhances the model’s understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model’s generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.

[102] GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition

Tianyue Wang,Shuang Yang,Shiguang Shan,Xilin Chen

Main category: cs.CV

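TL;DR: GLip is a global-local integrated progressive framework that addresses real-world visual challenges in visual speech recognition (VSR), such as illumination changes and occlusion. Through two-stage learning that combines global and local features, it improves robustness and performs strongly on multiple benchmark datasets.

Details Motivation: Existing VSR methods pay limited attention to real-world visual challenges (illumination variations, occlusions, blurring, and pose changes), which leads to poor robustness. GLip aims to improve performance under such conditions by combining global and local features.

Contribution: 1. Proposes the GLip framework, which integrates global and local features into progressive learning to improve VSR robustness.
2. Designs a Contextual Enhancement Module (CEM) that dynamically integrates local features with global context to refine the visual-to-speech mapping.
3. Validates the framework on multiple benchmarks (LRS2, LRS3, and a newly introduced Mandarin dataset).

Method: 1. A dual-path feature extraction architecture captures global and local visual features simultaneously.
2. Two-stage progressive learning: the first stage learns a coarse alignment, and the second stage refines the features with the CEM.
3. Training uses easily accessible audio-visual data to improve robustness.

Result: Outperforms existing methods on the LRS2 and LRS3 benchmarks, with its robustness further validated on a newly introduced Mandarin dataset.

Insight: 1. Under adverse conditions, local regions (such as non-occluded areas) can be more discriminative than global features.
2. Progressive learning helps the model move from coarse alignment to precise mapping, improving robustness.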

Abstract: Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial coarse alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of precise visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.

[103] Graph-based Point Cloud Surface Reconstruction using B-Splines

Stuti Pathak,Rhys G. Evans,Gunther Steenackers,Rudi Penne

Main category: cs.CV

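TL;DR: The paper proposes a surface reconstruction method based on a Dictionary-Guided Graph Convolutional Network (DG-GCN) that generates smooth surfaces from noisy point cloud data without relying on point normals.

Details Motivation: Real-world point clouds are usually noisy, and existing methods depend on normals or on a fixed number of control points, which limits their adaptability to complex surfaces. The paper proposes a more robust reconstruction method to address this.

Contribution: 1) Simultaneously predicts the location and number of control points, adapting to surfaces of varying complexity; 2) requires no point normals; 3) uses a dictionary-guided graph convolutional network to predict the control points.

Method: Two steps: 1) a graph convolutional network extracts point cloud features; 2) a dictionary-guided mechanism dynamically predicts the location and number of control points, and B-splines generate the smooth surface.

Result: Experiments show that the method outperforms existing baselines both quantitatively and qualitatively on noisy point cloud data.

Insight: Dynamically adjusting the number and location of control points adapts better to complex surfaces without normal information, offering a new approach to surface reconstruction from noisy point clouds.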

Abstract: Generating continuous surfaces from discrete point cloud data is a fundamental task in several 3D vision applications. Real-world point clouds are inherently noisy due to various technical and environmental factors. Existing data-driven surface reconstruction algorithms rely heavily on ground truth normals or compute approximate normals as an intermediate step. This dependency makes them extremely unreliable for noisy point cloud datasets, even if the availability of ground truth training data is ensured, which is not always the case. B-spline reconstruction techniques provide compact surface representations of point clouds and are especially known for their smoothening properties. However, the complexity of the surfaces approximated using B-splines is directly influenced by the number and location of the spline control points. Existing spline-based modeling methods predict the locations of a fixed number of control points for a given point cloud, which makes it very difficult to match the complexity of its underlying surface. In this work, we develop a Dictionary-Guided Graph Convolutional Network-based surface reconstruction strategy where we simultaneously predict both the location and the number of control points for noisy point cloud data to generate smooth surfaces without the use of any point normals. We compare our reconstruction method with several well-known as well as recent baselines by employing widely-used evaluation metrics, and demonstrate that our method outperforms all of them both qualitatively and quantitatively.

[104] Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

Jihua Peng,Qianxiong Xu,Yichen Liu,Chenxi Liu,Cheng Long,Rui Zhao,Ziyue Li

Main category: cs.CV

TL;DR: LIR-GAD is a novel framework for group activity detection (GAD) that leverages a Multimodal Large Language Model (MLLM) enriched with activity-level and group-specific tokens, achieving superior performance through language-instructed reasoning and multimodal fusion.

Details Motivation: Existing GAD methods rely on implicit visual pattern recognition, lacking contextual reasoning and explainability. The paper aims to integrate language instructions and MLLM's commonsense knowledge to improve these aspects.

Contribution: 1) Introduces activity-level and cluster-specific tokens to expand MLLM’s vocabulary. 2) Proposes a Multi-label Classification Loss for better semantic representation. 3) Designs a Multimodal Dual-Alignment Fusion (MDAF) module to integrate visual and linguistic features.

Method: 1) Expands the MLLM vocabulary with an activity-level token and cluster-specific tokens to capture semantics. 2) Trains MLLM with video frames, tokens, and language instructions. 3) Uses MDAF to fuse MLLM embeddings with visual features.

Result: Quantitative and qualitative experiments show LIR-GAD outperforms existing methods in GAD tasks.

Insight: Language instructions and MLLM’s commonsense knowledge can significantly enhance GAD by improving contextual reasoning and explainability.

Abstract: Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expands the original vocabulary of the MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside these two specially designed token types and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the activity-level token and the cluster-specific tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the activity-level token’s ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates MLLM’s hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD tasks.

[105] See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Pengteng Li,Pinhao Song,Wuyang Li,Weiyu Guo,Huizai Yao,Yijie Xu,Dugang Liu,Hui Xiong

Main category: cs.CV

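TL;DR: SEE&TREK is a training-free prompting framework that enhances the spatial understanding of Multimodal Large Language Models (MLLMs) by increasing visual diversity and reconstructing motion.

Details Motivation: Current approaches to spatial understanding in MLLMs rely mainly on additional modalities (such as depth or point clouds), while purely visual spatial understanding remains underexplored. SEE&TREK aims to fill this gap.

Contribution: Proposes SEE&TREK, the first training-free prompting framework focused on visual diversity and motion reconstruction, requiring no additional training or GPU resources.

Method: 1. Extracts keyframes via Maximum Semantic Richness Sampling; 2. simulates visual trajectories and encodes relative spatial positions to preserve spatio-temporal coherence.

Result: Across multiple spatial reasoning tasks, SEE&TREK improves MLLM performance by up to 3.5%.

Insight: A purely visual prompting framework can effectively enhance the spatial understanding of MLLMs without additional training resources.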

Abstract: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual-spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction. For visual diversity, we conduct Maximum Semantic Richness Sampling, which employs an off-the-shelf perception model to extract semantically rich keyframes that capture scene structure. For motion reconstruction, we simulate visual trajectories and encode relative spatial positions into keyframes to preserve both spatial relations and temporal coherence. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs. Extensive experiments on VSI-Bench and STI-Bench show that SEE&TREK consistently boosts the performance of various MLLMs across diverse spatial reasoning tasks, with improvements of up to +3.5%, offering a promising path toward stronger spatial intelligence.

[106] AdaSports-Traj: Role- and Domain-Aware Adaptation for Multi-Agent Trajectory Modeling in Sports

Yi Xu,Yun Fu

Main category: cs.CV

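TL;DR: AdaSports-Traj is a multi-agent trajectory prediction framework that adapts to role and domain heterogeneity in sports. It combines a Role- and Domain-Aware Adapter with a Hierarchical Contrastive Learning objective to improve prediction across roles and domains.

Details Motivation: Trajectory prediction in sports involves structural heterogeneity across roles (such as players vs. ball) and dynamic distribution gaps across domains (such as basketball vs. soccer). Existing unified frameworks fail to capture these differences, which limits generalization.

Contribution: 1. Proposes a Role- and Domain-Aware Adapter that explicitly adjusts latent representations; 2. designs a Hierarchical Contrastive Learning objective that separately supervises role-sensitive and domain-aware representations; 3. validates the method on multiple sports datasets.

Method: 1. Role- and Domain-Aware Adapter: conditionally adjusts representations based on agent identity and domain context; 2. Hierarchical Contrastive Learning: optimizes role-related and domain-related features separately to avoid optimization conflict.

Result: On the Basketball-U, Football-U, and Soccer-U datasets, AdaSports-Traj achieves strong performance in both unified and cross-domain prediction settings.

Insight: Explicitly modeling role and domain differences improves the generalization of multi-agent trajectory prediction; hierarchical contrastive learning disentangles heterogeneous features without optimization conflict.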

Abstract: Trajectory prediction in multi-agent sports scenarios is inherently challenging due to the structural heterogeneity across agent roles (e.g., players vs. ball) and dynamic distribution gaps across different sports domains. Existing unified frameworks often fail to capture these structured distributional shifts, resulting in suboptimal generalization across roles and domains. We propose AdaSports-Traj, an adaptive trajectory modeling framework that explicitly addresses both intra-domain and inter-domain distribution discrepancies in sports. At its core, AdaSports-Traj incorporates a Role- and Domain-Aware Adapter to conditionally adjust latent representations based on agent identity and domain context. Additionally, we introduce a Hierarchical Contrastive Learning objective, which separately supervises role-sensitive and domain-aware representations to encourage disentangled latent structures without introducing optimization conflict. Experiments on three diverse sports datasets, Basketball-U, Football-U, and Soccer-U, demonstrate the effectiveness of our adaptive design, achieving strong performance in both unified and cross-domain trajectory prediction settings.

[107] RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars

Weiyi Xiong,Bing Zhu,Tao Huang,Zewei Zheng

Main category: cs.CV

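TL;DR: The paper proposes RadarGaussianDet3D, an efficient Gaussian-based 3D detector for 4D automotive radar data. It uses Gaussian primitives and distributions to improve feature extraction and box optimization, raising both detection accuracy and inference speed.

Details Motivation: Existing 4D radar-based 3D detectors rely on sparse BEV feature extraction and optimize bounding box attributes independently, which limits accuracy, and their inference speed may not meet real-time requirements on vehicle-mounted embedded devices.

Contribution: 1. Proposes the Point Gaussian Encoder (PGE), which uses Gaussian primitives and 3D Gaussian Splatting to produce dense feature maps; 2. proposes the Box Gaussian Loss (BGL), which optimizes bounding boxes through Gaussian distributions; 3. improves detection accuracy and inference speed.

Method: 1. PGE converts radar points into Gaussian primitives and rasterizes them into BEV feature maps with 3DGS; 2. BGL converts bounding boxes into Gaussian distributions for more consistent optimization; 3. the point feature aggregation algorithm is optimized for fast inference.

Result: On TJ4DRadSet and View-of-Delft, RadarGaussianDet3D achieves state-of-the-art detection accuracy with substantially faster inference.

Insight: Gaussians serve as an efficient intermediate representation for both radar points and bounding boxes, balancing feature density and computational cost, which suits real-time in-vehicle use.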

Abstract: 4D automotive radars have gained increasing attention for autonomous driving due to their low cost, robustness, and inherent velocity measurement capability. However, existing 4D radar-based 3D detectors rely heavily on pillar encoders for BEV feature extraction, where each point contributes to only a single BEV grid, resulting in sparse feature maps and degraded representation quality. They also optimize bounding box attributes independently, leading to sub-optimal detection accuracy. Moreover, their inference speed, while sufficient for high-end GPUs, may fail to meet the real-time requirement on vehicle-mounted embedded devices. To overcome these limitations, an efficient and effective Gaussian-based 3D detector, namely RadarGaussianDet3D, is introduced, leveraging Gaussian primitives and distributions as intermediate representations for radar points and bounding boxes. In RadarGaussianDet3D, a novel Point Gaussian Encoder (PGE) is designed to transform each point into a Gaussian primitive after feature aggregation and to employ the 3D Gaussian Splatting (3DGS) technique for BEV rasterization, yielding denser feature maps. PGE exhibits exceptionally low latency, owing to the optimized algorithm for point feature aggregation and fast rendering of 3DGS. In addition, a new Box Gaussian Loss (BGL) is proposed, which converts bounding boxes into 3D Gaussian distributions and measures their distance to enable more comprehensive and consistent optimization. Extensive experiments on TJ4DRadSet and View-of-Delft demonstrate that RadarGaussianDet3D achieves state-of-the-art detection accuracy while delivering substantially faster inference, highlighting its potential for real-time deployment in autonomous driving.
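To illustrate the idea of comparing boxes as Gaussians rather than attribute by attribute, here is a rough sketch. The abstract does not say which distance BGL uses or how yaw is handled, so the choices below are assumptions: boxes are axis-aligned (yaw ignored), the per-axis standard deviation is half the box extent, and the distance is the squared 2-Wasserstein distance, which has a closed form for diagonal covariances.

```python
def box_to_gaussian(center, size):
    """Map an axis-aligned box to (mean, per-axis standard deviation).

    Assumption: the standard deviation is half the box extent on each axis.
    """
    return list(center), [s / 2.0 for s in size]


def wasserstein2_sq(g1, g2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    (mean1, std1), (mean2, std2) = g1, g2
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    cov_term = sum((a - b) ** 2 for a, b in zip(std1, std2))
    return mean_term + cov_term


# Hypothetical predicted and ground-truth boxes: (x, y, z) center and (l, w, h) size.
pred = box_to_gaussian(center=(1.0, 2.0, 0.5), size=(4.0, 2.0, 1.5))
gt = box_to_gaussian(center=(1.5, 2.0, 0.5), size=(4.0, 1.8, 1.5))
loss = wasserstein2_sq(pred, gt)
```

A single distance of this kind couples center and size errors in one term, which is the sense in which a distribution-level loss can be more consistent than independent per-attribute regression.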

[108] BaseReward: A Strong Baseline for Multimodal Reward Model

Yi-Fan Zhang,Haihua Yang,Huanyu Zhang,Yang Shi,Zezhou Chen,Haochen Tian,Chaoyou Fu,Haotian Wang,Kai Wu,Bo Cui,Xu Wang,Jianfei Pan,Haotian Wang,Zhang Zhang,Liang Wang

Main category: cs.CV

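TL;DR: The paper presents BaseReward, a strong baseline for multimodal reward modeling. It systematically studies the key components of reward modeling, including modeling paradigms, architecture design, and data selection, and builds an efficient multimodal reward model from the experimental findings.

Details Motivation: With the rapid progress of Multimodal Large Language Models (MLLMs), aligning them with human preferences has become a key challenge. Reward models are a core technology for this goal, but a systematic guide for building state-of-the-art ones is lacking.

Contribution: 1. Proposes BaseReward, an efficient baseline for multimodal reward modeling; 2. systematically analyzes the key components of reward modeling; 3. achieves SOTA on several benchmarks and validates practical utility in a real-world reinforcement learning pipeline.

Method: Built on a Qwen2.5-VL backbone with a two-layer reward head, trained on a mixture of high-quality multimodal and text-only preference data.

Result: BaseReward sets a new SOTA on major benchmarks such as MM-RLHF-Reward Bench and improves an MLLM's performance when used in a real-world reinforcement learning pipeline.

Insight: The study provides a clear guide for developing the next generation of multimodal reward models and shows that reward head architecture and data selection strongly affect performance.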

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM’s performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
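For readers unfamiliar with how a scalar reward head is usually trained on preference pairs, the sketch below shows the Bradley-Terry pairwise loss, a common objective for this kind of reward model. The abstract does not state BaseReward's training objective, so this is an assumed, generic example; the function name and the scores are hypothetical.

```python
import math


def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Scalar scores a reward head might assign to the preferred and rejected responses.
good_ranking = pairwise_preference_loss(r_chosen=2.0, r_rejected=0.0)
bad_ranking = pairwise_preference_loss(r_chosen=0.0, r_rejected=2.0)
tie = pairwise_preference_loss(r_chosen=0.0, r_rejected=0.0)
```

The loss shrinks as the preferred response is scored further above the rejected one, and equals log 2 when the two scores are tied.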

[109] AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Vatsal Malaviya,Agneet Chatterjee,Maitreya Patel,Yezhou Yang,Chitta Baral

Main category: cs.CV

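TL;DR: The authors introduce AcT2I, a benchmark for evaluating Text-to-Image (T2I) models on action-centric image generation, and find that existing models perform poorly. They enhance prompts with a knowledge distillation technique, which improves generation accuracy.

Details Motivation: Existing T2I models struggle with action-heavy scenes because they fail to capture implicit attributes and contextual dependencies.

Contribution: 1. Introduces the AcT2I benchmark; 2. proposes an LLM-based knowledge distillation technique that improves generation accuracy by enhancing prompts.

Method: A training-free framework uses Large Language Models to expand prompts with dense information along three dimensions, including temporal details.

Result: The best model improves image generation accuracy by 72%.

Insight: Current T2I models are limited in scenes that require complex reasoning; systematically integrating linguistic knowledge can notably improve generation.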

Abstract: Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.

[110] Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models

Renjie Pi,Kehao Miao,Li Peihang,Runtao Liu,Jiahui Gao,Jipeng Zhang,Xiaofang Zhou

Main category: cs.CV

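TL;DR: The paper studies the pronounced sycophantic behavior that Multimodal Large Language Models (MLLMs) show when processing image inputs and proposes Sycophantic Reflective Tuning (SRT) to mitigate it.

Details Motivation: MLLMs exhibit pronounced sycophantic behavior in image-based conversations. Similar behavior has been observed in text-only models, but it is more prominent with image inputs. The paper aims to understand this phenomenon and propose a remedy.

Contribution: Proposes SRT, which significantly reduces sycophancy toward misleading instructions while avoiding excessive stubbornness.

Method: The authors first try naive supervised fine-tuning but find that it makes the model overly stubborn. They then propose SRT, which has the MLLM reason reflectively about whether a user's instruction is misleading or corrective before drawing a conclusion.

Result: Experiments show that SRT significantly reduces sycophantic behavior toward misleading instructions without causing excessive stubbornness toward corrective instructions.

Insight: Sycophancy is more pronounced for image inputs than for text inputs, suggesting that differences between modalities may amplify the problem.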

Abstract: Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the “sycophantic modality gap.” To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user’s instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.

[111] UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation

Xiaoqi Zhao,Youwei Pang,Chenyang Yu,Lihe Zhang,Huchuan Lu,Shijian Lu,Georges El Fakhri,Xiaofeng Liu

Main category: cs.CV

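TL;DR: UniMRSeg is a unified multi-modal segmentation network that addresses missing modalities through hierarchical self-supervised compensation (HSSC), significantly improving segmentation performance.

Details Motivation: Real-world multi-modal image segmentation often faces missing or corrupted modalities, and existing methods train a specialized model for each modality combination, which makes deployment costly.

Contribution: Proposes UniMRSeg, which uses HSSC to bridge the gap between complete and incomplete modalities at the input, feature, and output levels, so that a single model handles all modality combinations.

Method: Combines modality reconstruction, modality-invariant contrastive learning, and a lightweight reverse attention adapter, followed by fine-tuning under a hybrid consistency constraint.

Result: Significantly outperforms existing methods on tasks such as MRI-based brain tumor segmentation and RGB-D semantic segmentation.

Insight: Hierarchical compensation improves robustness to missing modalities while keeping the model lightweight and efficient.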

Abstract: Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. First, we adopt modality reconstruction with the hybrid shuffled-masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. Next, modality-invariant contrastive learning implicitly compensates the feature space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state-of-the-art methods under diverse missing modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, and RGB-D/T salient object segmentation. The code will be released at https://github.com/Xiaoqi-Zhao-DLUT/UniMRSeg.

cs.LG [Back]

[112] Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Chi Liu,Derek Li,Yan Shu,Robin Chen,Derek Duan,Teng Fang,Bryan Dai

Main category: cs.LG

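TL;DR: Fleming-R1 uses reinforcement learning to approach expert-level medical reasoning. Its innovations are a data strategy, a Chain-of-Thought cold start, and a reinforcement learning framework with verifiable rewards, which together improve model performance.

Details Motivation: The challenge for large language models in medical applications is delivering both accurate answers and transparent reasoning, which calls for a method that achieves both.

Contribution: 1) A Reasoning-Oriented Data Strategy (RODS); 2) a Chain-of-Thought (CoT) cold start; 3) a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework.

Method: 1) RODS combines curated medical QA datasets with knowledge-graph-guided synthesis; 2) the CoT cold start distills high-quality reasoning trajectories from teacher models; 3) the RLVR framework, using Group Relative Policy Optimization, refines the model's reasoning.

Result: Fleming-R1 performs strongly across medical benchmarks: the 7B variant surpasses much larger baselines, and the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives.

Insight: Structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization.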

Abstract: While large language models show promise in medical applications, achieving expert-level clinical reasoning remains challenging due to the need for both accurate answers and transparent reasoning processes. To address this challenge, we introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis to improve coverage of underrepresented diseases, drugs, and multi-hop reasoning chains. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models, establishing robust inference priors. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework using Group Relative Policy Optimization, which consolidates core reasoning skills while targeting persistent failure modes through adaptive hard-sample mining. Across diverse medical benchmarks, Fleming-R1 delivers substantial parameter-efficient improvements: the 7B variant surpasses much larger baselines, while the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives. These results demonstrate that structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization. We release Fleming-R1 publicly to promote transparent, reproducible, and auditable progress in medical AI, enabling safer deployment in high-stakes clinical environments.
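The core of Group Relative Policy Optimization is that each sampled answer is scored relative to the other answers drawn for the same question. The sketch below shows that normalization step with a toy verifiable reward. The exact-match reward, the population standard deviation, and all names are illustrative assumptions, not details taken from Fleming-R1.

```python
import statistics


def verifiable_reward(predicted, reference):
    """Toy verifiable reward (assumption): 1.0 on exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose reference answer is "B".
samples = ["B", "C", "b", "A"]
rewards = [verifiable_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages, without a learned value model. If every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason hard-sample mining is useful.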

[113] Small LLMs with Expert Blocks Are Good Enough for Hyperparamter Tuning

Om Naphade,Saksham Bansal,Parikshit Pareek

Main category: cs.LG

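TL;DR: The paper proposes an Expert Block Framework built on small LLMs for efficient and transparent hyper-parameter tuning (HPT). Its core is the Trajectory Context Summarizer (TCS), a deterministic block that converts raw training trajectories into structured context so that small LLMs perform comparably to large models.

Details Motivation: HPT is necessary in machine learning pipelines but becomes computationally expensive and opaque as models grow. LLMs have been explored for HPT, yet most work relies on models exceeding 100 billion parameters. The paper explores whether small LLMs can do the job at lower cost and with more transparency.

Contribution: The Expert Block Framework, and in particular the TCS, which converts training trajectories into structured context so that small LLMs (phi4 and qwen2.5-coder) approach GPT-4 on HPT tasks.

Method: The TCS block deterministically transforms training trajectories into structured context that small LLMs use to analyze optimization progress. Experiments use two locally run LLMs, phi4 (14B) and qwen2.5-coder (32B), with a 10-trial budget.

Result: Across six diverse tasks, the TCS-enabled HPT pipeline achieves average performance within about 0.9 percentage points of GPT-4.

Insight: Small LLMs combined with expert blocks such as the TCS can reduce compute cost while approaching the performance of large LLMs, offering an efficient and transparent HPT option for resource-constrained settings.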

Abstract: Hyper-parameter Tuning (HPT) is a necessary step in machine learning (ML) pipelines but becomes computationally expensive and opaque with larger models. Recently, Large Language Models (LLMs) have been explored for HPT, yet most rely on models exceeding 100 billion parameters. We propose an Expert Block Framework for HPT using Small LLMs. At its core is the Trajectory Context Summarizer (TCS), a deterministic block that transforms raw training trajectories into structured context, enabling small LLMs to analyze optimization progress with reliability comparable to larger models. Using two locally-run LLMs (phi4:reasoning14B and qwen2.5-coder:32B) and a 10-trial budget, our TCS-enabled HPT pipeline achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks.
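As a rough illustration of what a deterministic trajectory summarizer could look like, the sketch below reduces per-epoch loss curves to a few structured fields that a small LLM could read instead of raw logs. The function name, the chosen fields, and the sample numbers are hypothetical; the abstract does not describe the actual TCS output format.

```python
def summarize_trajectory(train_loss, val_loss):
    """Turn raw per-epoch losses into a small, structured summary (hypothetical fields)."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "epochs": len(val_loss),
        "best_epoch": best_epoch,
        "best_val_loss": val_loss[best_epoch],
        # Gap between validation and training loss at the end of the run.
        "final_gap": val_loss[-1] - train_loss[-1],
        # True if validation loss ended above its best value (possible overfitting).
        "val_rising_at_end": val_loss[-1] > val_loss[best_epoch],
    }


summary = summarize_trajectory(
    train_loss=[1.0, 0.6, 0.3, 0.1],
    val_loss=[1.1, 0.8, 0.7, 0.9],
)
```

Because the summary is computed by fixed rules rather than by the model, the same trajectory always yields the same context, which helps make the tuning loop reproducible and easier to audit.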

[114] KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

Vaibhav Singh,Soumya Suvra Ghosal,Kapu Nirmal Joshua,Soumyabrata Pal,Sayak Ray Chowdhury

Main category: cs.LG

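TL;DR: Proposes KITE, an exemplar selection method based on information theory and kernel methods, to improve the performance of large language models in in-context learning.

Details Motivation: Choosing the right examples is critical to in-context learning performance, but existing nearest-neighbor methods (such as KATE) generalize poorly and lack diversity in high-dimensional embedding spaces.

Contribution: 1. Frames example selection as a query-specific optimization problem; 2. derives an approximately submodular surrogate objective that supports a greedy algorithm; 3. incorporates the kernel trick and an optimal design-based regularizer to improve diversity.

Method: 1. Models the LLM as a linear function over input embeddings; 2. designs an information theory-driven optimization objective; 3. adds the kernel trick and a diversity regularizer.

Result: Significantly outperforms standard retrieval methods on classification tasks, supporting the value of structure-aware, diverse example selection.

Insight: Query-specific optimization and diversity are key to in-context learning performance, and kernel methods help with high-dimensional feature spaces.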

Abstract: In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using only a few carefully selected task-specific examples presented in the prompt. However, given the limited context size of LLMs, a fundamental question arises: Which examples should be selected to maximize performance on a given user query? While nearest-neighbor-based methods like KATE have been widely adopted for this purpose, they suffer from well-known drawbacks in high-dimensional embedding spaces, including poor generalization and a lack of diversity. In this work, we study this problem of example selection in ICL from a principled, information theory-driven perspective. We first model an LLM as a linear function over input embeddings and frame the example selection task as a query-specific optimization problem: selecting a subset of exemplars from a larger example bank that minimizes the prediction error on a specific query. This formulation departs from traditional generalization-focused learning theoretic approaches by targeting accurate prediction for a specific query instance. We derive a principled surrogate objective that is approximately submodular, enabling the use of a greedy algorithm with an approximation guarantee. We further enhance our method by (i) incorporating the kernel trick to operate in high-dimensional feature spaces without explicit mappings, and (ii) introducing an optimal design-based regularizer to encourage diversity in the selected examples. Empirically, we demonstrate significant improvements over standard retrieval methods across a suite of classification tasks, highlighting the benefits of structure-aware, diverse example selection for ICL in real-world, label-scarce scenarios.
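The greedy, query-specific selection described above can be sketched as follows. The abstract does not give the surrogate objective, so this example substitutes a plausible stand-in: the predictive variance at the query under kernel ridge regression with an RBF kernel. The diversity regularizer is omitted, and every name and constant is an assumption for illustration only.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def query_variance(query, chosen, lam=0.1):
    """Predictive variance at the query given the chosen exemplars (assumed surrogate)."""
    if not chosen:
        return rbf(query, query)
    K = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(chosen)]
         for i, a in enumerate(chosen)]
    k_q = [rbf(query, a) for a in chosen]
    alpha = solve(K, k_q)
    return rbf(query, query) - sum(a * k for a, k in zip(alpha, k_q))


def greedy_select(query, bank, budget):
    """Repeatedly add the exemplar that most reduces the variance at the query."""
    chosen = []
    for _ in range(budget):
        candidates = [i for i in range(len(bank)) if i not in chosen]
        best = min(candidates, key=lambda i: query_variance(
            query, [bank[j] for j in chosen] + [bank[i]]))
        chosen.append(best)
    return chosen


# Hypothetical 2-D embeddings for the example bank and the user query.
bank = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
query = [0.05, 0.05]
picked = greedy_select(query, bank, budget=2)
```

Unlike plain nearest-neighbor retrieval, each greedy step scores a candidate jointly with the exemplars already chosen, so the selection depends on the whole set and not only on individual distances to the query.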

[115] Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Andrew Kyle Lampinen,Martin Engelcke,Yuxuan Li,Arslan Chaudhry,James L. McClelland

Main category: cs.LG

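TL;DR: The paper examines why machine learning systems generalize poorly and, drawing on cognitive science, argues that latent learning and episodic memory mechanisms can improve generalization.

Details Motivation: Current machine learning systems often fail to learn information that is irrelevant to the task at hand but useful for future tasks (latent learning), which limits generalization. The authors propose addressing this with episodic memory mechanisms from cognitive science.

Contribution: 1. Identifies the latent learning problem in machine learning and its effect on generalization; 2. Shows that episodic memory mechanisms such as retrieval allow learning experiences to be used more flexibly; 3. Identifies key components for using retrieval effectively, such as in-context learning ability.

Method: Studies a system augmented with episodic memory that uses a retrieval mechanism (an oracle retrieval mechanism) to flexibly reuse learning experiences, and relies on in-context learning to use information across retrieved examples.

Result: Experiments show that a system with an oracle retrieval mechanism generalizes better across many of these challenges, supporting the potential of episodic memory.

Insight: 1. The failure to exhibit latent learning is one cause of weak generalization in current machine learning systems; 2. Episodic memory can complement parametric learning and improve data efficiency; 3. In-context learning ability is key to using retrieval effectively.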

Abstract: When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of machine learning systems is their failure to exhibit latent learning – learning information that is not relevant to the task at hand, but that might be useful in a future task. We show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges. We also identify some of the essential components for effectively using retrieval, including the importance of within-example in-context learning for acquiring the ability to use information across retrieved examples. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods can complement parametric learning to improve generalization.

[116] Global Pre-fixing, Local Adjusting: A Simple yet Effective Contrastive Strategy for Continual Learning

Jia Tang,Xinrui Wang,Songcan Chen

Main category: cs.LG

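TL;DR: The paper proposes GPLASC, a simple yet effective contrastive strategy that uses global pre-fixing and local adjusting to address inter-task and intra-task feature confusion in continual learning, improving feature discriminability.

Details Motivation: Contrastive methods in continual learning (CL) mitigate catastrophic forgetting but remain limited by confusion among inter-task and intra-task features, leaving room for improvement.

Contribution: Proposes the GPLASC strategy, which divides the representation space into non-overlapping regions whose centers form a pre-fixed inter-task ETF (Equiangular Tight Frame), while forming adjustable intra-task ETFs within each region, ensuring discriminative features both between and within tasks.

Method: A globally pre-fixed ETF handles inter-task feature confusion, and locally adjustable ETFs regulate the intra-task feature structure. The method integrates seamlessly into existing contrastive continual learning frameworks.

Result: Extensive experiments validate the effectiveness of GPLASC in continual learning.

Insight: Building a representation space that is globally pre-fixed and locally adjustable addresses inter-task and intra-task feature confusion at the same time.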

Abstract: Continual learning (CL) involves acquiring and accumulating knowledge from evolving tasks while alleviating catastrophic forgetting. Recently, leveraging contrastive loss to construct more transferable and less forgetful representations has been a promising direction in CL. Despite advancements, their performance is still limited due to confusion arising from both inter-task and intra-task features. To address the problem, we propose a simple yet effective contrastive strategy named Global Pre-fixing, Local Adjusting for Supervised Contrastive learning (GPLASC). Specifically, to avoid task-level confusion, we divide the entire unit hypersphere of representations into non-overlapping regions, with the centers of the regions forming an inter-task pre-fixed Equiangular Tight Frame (ETF). Meanwhile, for individual tasks, our method helps regulate the feature structure and form intra-task adjustable ETFs within their respective allocated regions. As a result, our method simultaneously ensures discriminative feature structures both between tasks and within tasks and can be seamlessly integrated into any existing contrastive continual learning framework. Extensive experiments validate its effectiveness.
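For readers unfamiliar with the term, the following sketch builds a standard simplex Equiangular Tight Frame and checks its defining property: k unit vectors whose pairwise cosine similarity is exactly -1/(k-1). This is the textbook construction, not necessarily how GPLASC places its region centers; the function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex ETF (k >= 2)."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


centers = simplex_etf(4)
# Every distinct pair has the same cosine, -1/(k-1), so no two centers
# are closer to each other than any other pair.
for i in range(4):
    for j in range(i + 1, 4):
        print(i, j, round(cosine(centers[i], centers[j]), 4))
```

Because all pairs are equally and maximally separated, fixing such centers in advance gives each task a region that later tasks cannot crowd.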

[117] Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Zinan Lin,Enshu Liu,Xuefei Ning,Junyi Zhu,Wenyu Wang,Sergey Yekhanin

Main category: cs.LG

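TL;DR: Latent Zoning Network (LZN) is a unified latent-space framework that handles generative modeling, representation learning, and classification together, and outperforms existing methods on several tasks.

Details Motivation: Current solutions for generative modeling, representation learning, and classification are largely disjoint and lack a unified framework. LZN aims to unify the three tasks through a shared latent space, simplifying pipelines and improving synergy.

Contribution: LZN's core contribution is a shared Gaussian latent space, with multiple tasks handled uniformly by composing type-specific encoders and decoders. Experiments show that LZN performs well on generative modeling, representation learning, and classification.

Method: LZN builds a shared Gaussian latent space in which each data type (e.g., images, text, labels) has its own encoder and decoder. Tasks are carried out by composing these modules: conditional generation uses a label encoder and an image decoder, and classification uses an image encoder and a label decoder.

Result: LZN performs strongly across tasks: combined with Rectified Flow it improves image-generation FID on CIFAR10 from 2.76 to 2.59; in unsupervised representation learning it outperforms MoCo and SimCLR; in joint generation and classification it reaches SoTA classification accuracy on CIFAR10.

Insight: A shared latent space offers a new route to multi-task unification. LZN shows the potential of sharing information across generation, representation, and classification, and provides a reference point for future multi-task learning frameworks.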

Abstract: Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.

[118] Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data

Nakul Sharma

Main category: cs.LG

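TL;DR: The paper generates synthetic data in the semantic latent space of a vision foundation model and trains a linear classifier on a mixture of real and synthetic data to tackle long-tail classification, sharply reducing the number of trainable parameters and improving computational efficiency.

Details Motivation: Existing long-tail classification methods have improved performance but still do not close the gap with models trained on balanced datasets, and they consume substantial computational resources. The paper aims for a simple and computationally efficient alternative.

Contribution: 1. Proposes a novel framework that uses the latent space of a vision foundation model to generate synthetic data; 2. Trains a linear classifier on combined real and synthetic data, greatly reducing the number of trainable parameters; 3. Sets a new state of the art on the CIFAR-100-LT benchmark and shows strong performance on Places-LT.

Method: 1. Generate synthetic data in the latent space of a vision foundation model; 2. Train a linear classifier on a mixture of real and synthetic data; 3. Limit the trainable parameters to those of the linear model to improve computational efficiency.

Result: Sets a new state of the art on CIFAR-100-LT and performs strongly on Places-LT, with a large gain in computational efficiency.

Insight: Generating synthetic data in a foundation model's latent space is an effective and efficient way to improve long-tail classification while reducing resource consumption.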

Abstract: Imbalanced classification datasets pose significant challenges in machine learning, often leading to biased models that perform poorly on underrepresented classes. With the rise of foundation models, recent research has focused on the full, partial, and parameter-efficient fine-tuning of these models to deal with long-tail classification. Despite the impressive performance of these works on the benchmark datasets, they still fail to close the gap with the networks trained using the balanced datasets and still require substantial computational resources, even for relatively smaller datasets. Underscoring the importance of computational efficiency and simplicity, in this work we propose a novel framework that leverages the rich semantic latent space of Vision Foundation Models to generate synthetic data and train a simple linear classifier using a mixture of real and synthetic data for long-tail classification. The computational efficiency gain arises from the number of trainable parameters that are reduced to just the number of parameters in the linear model. Our method sets a new state-of-the-art for the CIFAR-100-LT benchmark and demonstrates strong performance on the Places-LT benchmark, highlighting the effectiveness and adaptability of our simple and effective approach.
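The abstract does not say how the synthetic latent samples are drawn, so the following is only a sketch under a loud assumption: each class is modeled as a diagonal Gaussian fitted to its real feature vectors, and tail classes are topped up with samples from that Gaussian until every class matches the largest one. The function name `balance_with_synthetic` is hypothetical, and the features stand in for frozen foundation-model embeddings; a linear classifier would then be trained on the returned set.

```python
import random
from collections import defaultdict


def balance_with_synthetic(features, labels, seed=0):
    """Top up minority classes with Gaussian samples in feature space.

    Assumption: per-class diagonal Gaussians are an illustrative choice,
    not the sampling scheme used in the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for f, y in zip(features, labels):
        by_class[y].append(f)
    target = max(len(v) for v in by_class.values())

    out_f, out_y = list(features), list(labels)
    for y, feats in by_class.items():
        n, d = len(feats), len(feats[0])
        mean = [sum(f[j] for f in feats) / n for j in range(d)]
        std = [
            (sum((f[j] - mean[j]) ** 2 for f in feats) / n) ** 0.5
            for j in range(d)
        ]
        # Draw only as many synthetic vectors as the class is missing.
        for _ in range(target - n):
            out_f.append([rng.gauss(mean[j], std[j]) for j in range(d)])
            out_y.append(y)
    return out_f, out_y


feats = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.1], [5.0, 5.0]]
labs = ["head", "head", "head", "tail"]
bal_f, bal_y = balance_with_synthetic(feats, labs)
print(bal_y.count("head"), bal_y.count("tail"))
```

Only the linear classifier on top of these features would be trainable, which is where the efficiency gain described in the abstract comes from.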

[119] DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng,Huayu Chen,Haotian Ye,Haoxiang Wang,Qinsheng Zhang,Kai Jiang,Hang Su,Stefano Ermon,Jun Zhu,Ming-Yu Liu

Main category: cs.LG

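TL;DR: DiffusionNFT is a new online reinforcement learning paradigm that optimizes diffusion models directly on the forward process via flow matching, avoiding the complexity and restrictions of earlier approaches.

Details Motivation: Conventional online reinforcement learning is hard to apply to diffusion models because of intractable likelihoods, and existing methods suffer from solver restrictions, forward-reverse inconsistency, and related issues.

Contribution: Proposes DiffusionNFT, which optimizes the diffusion model directly on the forward process and defines the policy improvement direction by contrasting positive and negative generations, with no need for likelihood estimation or restricted solvers.

Method: Flow-matching optimization on the forward process that contrasts positive and negative generations, folding reinforcement signals naturally into the supervised learning objective and supporting arbitrary black-box solvers.

Result: Raises the GenEval score from 0.24 to 0.98 within 1k steps and is up to 25× more efficient than FlowGRPO; with multiple reward models it significantly improves SD3.5-Medium.

Insight: Optimizing diffusion models directly on the forward process is efficient and flexible, avoiding the complexity and restrictions of reverse-process optimization.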

Abstract: Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

[120] Dynamic Classifier-Free Diffusion Guidance via Online Feedback

Pinelopi Papalampidi,Olivia Wiles,Ira Ktena,Aleksandar Shtedritski,Emanuele Bugliarello,Ivana Kajic,Isabela Albuquerque,Aida Nematzadeh

Main category: cs.LG

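TL;DR: The paper proposes dynamically adjusting Classifier-Free Guidance (CFG): using online feedback and greedy search, it builds a CFG schedule tailored to each timestep and prompt, significantly improving text-to-image generation quality.

Details Motivation: Conventional CFG uses a static guidance scale that cannot adapt to the diverse requirements of different prompts, while existing solutions (such as gradient-based correction or fixed heuristic schedules) add complexity and generalize poorly.

Contribution: 1. Proposes a dynamic CFG scheduling framework; 2. Uses online feedback to assess generation quality; 3. Selects the best CFG scale for each timestep via greedy search.

Method: Combines evaluators such as CLIP, a discriminator, and a human preference reward model to adjust the CFG scale dynamically at each reverse diffusion step.

Result: On models including Imagen 3, text alignment, visual quality, and text rendering all improve significantly, with a human preference win rate of up to 53.8% against the default Imagen 3 baseline.

Insight: The optimal guidance schedule is dynamic and prompt-dependent, and the framework offers an efficient and generalizable way to obtain it.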

Abstract: Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This “one-size-fits-all” approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challenge this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluations, such as CLIP for alignment, a discriminator for fidelity and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to 53.8% human preference win-rate for overall preference, a figure that increases up to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.
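The per-timestep greedy search can be sketched as follows. The guidance formula is the standard CFG combination, eps = eps_uncond + w * (eps_cond - eps_uncond); everything else is a simplification. The callables `predict` and `score` are hypothetical placeholders for the denoiser and for the latent-space evaluators (CLIP, discriminator, preference model), and the sketch omits the sample update between steps that a real sampler performs.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def greedy_schedule(steps, candidate_scales, predict, score):
    """Pick, at each timestep, the guidance scale with the best feedback.

    predict(t) -> (eps_uncond, eps_cond); score(t, eps) -> float.
    Both are assumed interfaces, not the paper's API.
    """
    schedule = []
    for t in steps:
        eps_uncond, eps_cond = predict(t)
        best = max(
            candidate_scales,
            key=lambda w: score(t, cfg_combine(eps_uncond, eps_cond, w)),
        )
        schedule.append(best)
    return schedule


# Toy stand-ins: the "ideal" guided prediction equals the timestep index.
toy_predict = lambda t: ([0.0], [1.0])
toy_score = lambda t, eps: -abs(eps[0] - t)
print(greedy_schedule([3, 1], [1.0, 3.0, 5.0], toy_predict, toy_score))
```

The resulting schedule differs per timestep, which is the behavior the abstract contrasts with a single static scale.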

eess.IV [Back]

[121] Analysis Plug-and-Play Methods for Imaging Inverse Problems

Edward P. Chandler,Shirin Shoushtari,Brendt Wohlberg,Ulugbek S. Kamilov

Main category: eess.IV

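TL;DR: The paper proposes an analysis formulation of Plug-and-Play (PnP) in which the prior is imposed on a transformed representation of the image (such as its gradient) rather than on the image itself, extending classical TV regularization, and develops two algorithms based on HQS and ADMM, APnP-HQS and APnP-ADMM, which are evaluated on image deblurring and super-resolution.

Details Motivation: Standard PnP methods apply the prior directly in the image domain, which limits expressiveness; moving the prior to a transformed representation such as the gradient domain allows image structure to be captured more flexibly.

Contribution: 1. Proposes a gradient-domain analysis PnP framework that extends TV regularization; 2. Develops two new algorithms, APnP-HQS and APnP-ADMM; 3. Validates the approach on image reconstruction tasks.

Method: 1. Train a Gaussian denoiser in the gradient domain; 2. Design the analysis PnP algorithms APnP-HQS and APnP-ADMM based on HQS and ADMM.

Result: Experiments show that analysis PnP achieves performance comparable to image-domain PnP on image deblurring and super-resolution.

Insight: Imposing the prior in a transform domain (such as the gradient domain) can capture image structure effectively and offers a new direction for image reconstruction.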

Abstract: Plug-and-Play Priors (PnP) is a popular framework for solving imaging inverse problems by integrating learned priors in the form of denoisers trained to remove Gaussian noise from images. In standard PnP methods, the denoiser is applied directly in the image domain, serving as an implicit prior on natural images. This paper considers an alternative analysis formulation of PnP, in which the prior is imposed on a transformed representation of the image, such as its gradient. Specifically, we train a Gaussian denoiser to operate in the gradient domain, rather than on the image itself. Conceptually, this is an extension of total variation (TV) regularization to learned TV regularization. To incorporate this gradient-domain prior in image reconstruction algorithms, we develop two analysis PnP algorithms based on half-quadratic splitting (APnP-HQS) and the alternating direction method of multipliers (APnP-ADMM). We evaluate our approach on image deblurring and super-resolution, demonstrating that the analysis formulation achieves performance comparable to image-domain PnP algorithms.
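A one-dimensional sketch may help show where the gradient-domain denoiser enters a half-quadratic splitting loop. It makes several assumptions that are not from the paper: the forward operator is the identity (pure denoising), D is a forward-difference operator, and a soft-threshold stands in for the learned Gaussian denoiser (which recovers a TV-like prior). The loop alternates z = denoiser(Dx) with the data step (I + mu * D^T D) x = y + mu * D^T z; all names and parameter values are hypothetical.

```python
def soft_threshold(v, tau):
    """Stand-in for a learned gradient-domain denoiser (assumption)."""
    return [max(abs(a) - tau, 0.0) * (1.0 if a > 0 else -1.0) for a in v]


def solve_tridiagonal(off, diag, rhs):
    """Thomas algorithm; `off` is the constant off-diagonal entry."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = off / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def analysis_hqs_denoise(y, mu=1.0, tau=0.1, iters=20, denoiser=soft_threshold):
    """HQS with the prior applied to D x instead of x (needs len(y) >= 2)."""
    n = len(y)
    # Diagonal of I + mu * D^T D for forward differences.
    diag = [1.0 + mu * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    x = list(y)
    for _ in range(iters):
        grad = [x[i + 1] - x[i] for i in range(n - 1)]  # D x
        z = denoiser(grad, tau)                         # prior step
        dt_z = [
            (z[i - 1] if i > 0 else 0.0) - (z[i] if i < n - 1 else 0.0)
            for i in range(n)
        ]                                               # D^T z
        rhs = [y[i] + mu * dt_z[i] for i in range(n)]
        x = solve_tridiagonal(-mu, diag, rhs)           # data step
    return x


print(analysis_hqs_denoise([0.0, 0.1, -0.1, 1.0, 0.9, 1.1]))
```

Swapping `soft_threshold` for a trained denoiser operating on gradients is the step that, per the abstract, turns TV regularization into learned TV regularization.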

[122] Uncertainty-Gated Deformable Network for Breast Tumor Segmentation in MR Images

Yue Zhang,Jiahua Dong,Chengtao Peng,Qiuli Wang,Dan Song,Guiduo Duan

Main category: eess.IV

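TL;DR: The paper proposes an uncertainty-gated deformable network that combines complementary information from CNNs and Transformers for accurate breast tumor segmentation in MRI, and shows superior performance on two clinical datasets.

Details Motivation: Existing methods struggle to capture irregular tumor shapes and to integrate local and global features effectively, which affects the accuracy of breast cancer diagnosis. The paper aims to address these limitations.

Contribution: 1) Incorporates deformable feature modeling into both convolution and attention modules; 2) Designs an Uncertainty-Gated Enhancing Module (U-GEM) that selectively fuses local and global features; 3) Introduces a Boundary-sensitive Deep Supervision Loss to refine contours.

Method: 1) Embed deformable feature modeling in the network; 2) U-GEM dynamically exchanges CNN and Transformer features based on pixel-wise uncertainty; 3) A boundary-sensitive loss improves contour accuracy.

Result: Outperforms existing methods on two clinical datasets, indicating the model's clinical potential.

Insight: 1) Combining deformable modeling with uncertainty gating handles irregular shapes effectively; 2) Dynamic feature fusion outperforms a fixed structure.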

Abstract: Accurate segmentation of breast tumors in magnetic resonance images (MRI) is essential for breast cancer diagnosis, yet existing methods face challenges in capturing irregular tumor shapes and effectively integrating local and global features. To address these limitations, we propose an uncertainty-gated deformable network to leverage the complementary information from CNN and Transformers. Specifically, we incorporate deformable feature modeling into both convolution and attention modules, enabling adaptive receptive fields for irregular tumor contours. We also design an Uncertainty-Gated Enhancing Module (U-GEM) to selectively exchange complementary features between CNN and Transformer based on pixel-wise uncertainty, enhancing both local and global representations. Additionally, a Boundary-sensitive Deep Supervision Loss is introduced to further improve tumor boundary delineation. Comprehensive experiments on two clinical breast MRI datasets demonstrate that our method achieves superior segmentation performance compared with state-of-the-art methods, highlighting its clinical potential for accurate breast tumor delineation.

[123] QWD-GAN: Quality-aware Wavelet-driven GAN for Unsupervised Medical Microscopy Images Denoising

Qijun Yang,Yating Huang,Lintao Xiang,Hujun Yin

Main category: eess.IV

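TL;DR: QWD-GAN is an unsupervised denoising method for medical microscopy images based on a generative adversarial network (GAN). It pairs a wavelet-based multi-scale adaptive generator with a dual-branch discriminator to improve denoising and the preservation of high-frequency information.

Details Motivation: Denoising medical microscopy images is challenged by limited acquisition conditions, complex noise types, poor algorithm adaptability, and diverse clinical demands, so an efficient method that preserves detail is needed.

Contribution: Proposes QWD-GAN, an unsupervised denoising method with a wavelet-based multi-scale adaptive generator and a dual-branch discriminator, which markedly improves denoising performance and high-frequency preservation.

Method: Designs a multi-scale adaptive generator based on the wavelet transform and introduces a dual-branch discriminator that integrates difference perception feature maps with the original features.

Result: Achieves state-of-the-art denoising performance on multiple biomedical microscopy image datasets, particularly in preserving high-frequency information.

Insight: Multi-scale wavelet analysis and the dual-branch discriminator design offer a new direction for medical image denoising, and the discriminator is readily compatible with other GAN frameworks.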

Abstract: Image denoising plays a critical role in biomedical and microscopy imaging, especially when acquiring wide-field fluorescence-stained images. This task faces challenges on multiple fronts, including limitations in image acquisition conditions, complex noise types, algorithm adaptability, and clinical application demands. Although many deep learning-based denoising techniques have demonstrated promising results, further improvements are needed in preserving image details, enhancing algorithmic efficiency, and increasing clinical interpretability. We propose an unsupervised image denoising method based on a Generative Adversarial Network (GAN) architecture. The approach introduces a multi-scale adaptive generator based on the Wavelet Transform and a dual-branch discriminator that integrates difference perception feature maps with original features. Experimental results on multiple biomedical microscopy image datasets show that the proposed model achieves state-of-the-art denoising performance, particularly excelling in the preservation of high-frequency information. Furthermore, the dual-branch discriminator is seamlessly compatible with various GAN frameworks. The proposed quality-aware, wavelet-driven GAN denoising model is termed QWD-GAN.

[124] The Missing Piece: A Case for Pre-Training in 3D Medical Object Detection

Katharina Eckstein,Constantin Ulrich,Michael Baumgartner,Jessica Kächele,Dimitrios Bounias,Tassilo Wald,Ralf Floca,Klaus H. Maier-Hein

Main category: eess.IV

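TL;DR: The paper is the first systematic study of how pre-training methods can be integrated into 3D medical object detection. Pre-training consistently improves detection performance, and reconstruction-based self-supervised pre-training outperforms supervised pre-training.

Details Motivation: 3D medical object detection is crucial for computer-aided diagnosis, but pre-training has mostly been studied for segmentation. Pre-training for 3D object detection remains immature, and existing approaches do not fully exploit 3D volumetric information.

Contribution: Presents the first systematic study of pre-training for 3D medical object detection, covering both CNN and Transformer architectures, and shows consistent improvements across tasks and datasets.

Method: Compares supervised pre-training, reconstruction-based self-supervised pre-training, and contrastive pre-training, each integrated into existing detection architectures.

Result: Pre-training consistently improves detection performance. Reconstruction-based self-supervised pre-training performs best, while contrastive pre-training provides no clear benefit for 3D medical object detection.

Insight: Reconstruction-based self-supervised pre-training may suit 3D medical data better because it captures volumetric information more fully, whereas contrastive learning is of limited use for 3D detection.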

Abstract: Large-scale pre-training holds the promise to advance 3D medical object detection, a crucial component of accurate computer-aided diagnosis. Yet, it remains underexplored compared to segmentation, where pre-training has already demonstrated significant benefits. Existing pre-training approaches for 3D object detection rely on 2D medical data or natural image pre-training, failing to fully leverage 3D volumetric information. In this work, we present the first systematic study of how existing pre-training methods can be integrated into state-of-the-art detection architectures, covering both CNNs and Transformers. Our results show that pre-training consistently improves detection performance across various tasks and datasets. Notably, reconstruction-based self-supervised pre-training outperforms supervised pre-training, while contrastive pre-training provides no clear benefit for 3D medical object detection. Our code is publicly available at: https://github.com/MIC-DKFZ/nnDetection-finetuning.

[125] FMD-TransUNet: Abdominal Multi-Organ Segmentation Based on Frequency Domain Multi-Axis Representation Learning and Dual Attention Mechanisms

Fang Lu,Jingyu Xu,Qinxiu Sun,Qiong Lou

Main category: eess.IV

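TL;DR: The paper proposes FMD-TransUNet, a framework that combines multi-axis frequency-domain representation learning with dual attention mechanisms to improve abdominal multi-organ segmentation accuracy.

Details Motivation: Abdominal multi-organ segmentation is critical for clinical applications. Existing deep learning methods struggle with small, irregular, or anatomically complex organs, and they rely mainly on spatial-domain analysis, overlooking the potential synergy of frequency-domain representations.

Contribution: 1. Proposes the FMD-TransUNet framework, integrating multi-axis frequency-domain feature extraction (MEWB) and an improved dual attention module (DA+);
2. MEWB uses frequency-domain analysis to supplement spatial-domain representations with global and local detail;
3. DA+ uses depthwise separable convolutions and spatial and channel attention to improve feature fusion.

Method: 1. Adds MEWB to TransUNet to extract multi-axis frequency-domain features;
2. The DA+ module combines depthwise separable convolutions with dual attention;
3. Validates the model on the Synapse dataset.

Result: Across eight abdominal organs, FMD-TransUNet reaches an average DSC of 81.32% and an HD of 16.35 mm; relative to the baseline, DSC rises by 3.84% and HD falls by 15.34 mm.

Insight: Frequency-domain representations effectively compensate for the limits of spatial-domain features, and combining multi-axis analysis with attention can markedly improve segmentation of complex organs.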

Abstract: Accurate abdominal multi-organ segmentation is critical for clinical applications. Although numerous deep learning-based automatic segmentation methods have been developed, they still struggle to segment small, irregular, or anatomically complex organs. Moreover, most current methods focus on spatial-domain analysis, often overlooking the synergistic potential of frequency-domain representations. To address these limitations, we propose a novel framework named FMD-TransUNet for precise abdominal multi-organ segmentation. It innovatively integrates the Multi-axis External Weight Block (MEWB) and the improved dual attention module (DA+) into the TransUNet framework. The MEWB extracts multi-axis frequency-domain features to capture both global anatomical structures and local boundary details, providing complementary information to spatial-domain representations. The DA+ block utilizes depthwise separable convolutions and incorporates spatial and channel attention mechanisms to enhance feature fusion, reduce redundant information, and narrow the semantic gap between the encoder and decoder. Experimental validation on the Synapse dataset shows that FMD-TransUNet outperforms other recent state-of-the-art methods, achieving an average DSC of 81.32% and a HD of 16.35 mm across eight abdominal organs. Compared to the baseline model, the average DSC increased by 3.84%, and the average HD decreased by 15.34 mm. These results demonstrate the effectiveness of FMD-TransUNet in improving the accuracy of abdominal multi-organ segmentation.
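The headline metric here, the Dice similarity coefficient (DSC), is 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks; the function name is hypothetical and the evaluation code used by the authors is not shown in the abstract. If the reported gains are read as absolute differences (an assumption), they imply a baseline of about 81.32 - 3.84 = 77.48% DSC and 16.35 + 15.34 = 31.69 mm HD.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
print(round(dice_coefficient(pred_mask, true_mask), 4))
```

A per-organ DSC computed this way and averaged over the eight organs gives the kind of aggregate figure quoted in the abstract.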

[126] PRISM: Probabilistic and Robust Inverse Solver with Measurement-Conditioned Diffusion Prior for Blind Inverse Problems

Yuanyun Hu,Evan Bell,Guijin Wang,Yu Sun

Main category: eess.IV

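TL;DR: PRISM is a probabilistic and robust inverse solver with a measurement-conditioned diffusion prior for blind inverse problems, and it outperforms existing methods.

Details Motivation: Diffusion models are widely used for inverse problems in computational imaging, but most methods require complete knowledge of the forward operator and cannot handle blind inverse problems. PRISM aims to fill this gap.

Contribution: Proposes PRISM, which incorporates a measurement-conditioned diffusion model into a theoretically principled posterior sampling scheme to address blind inverse problems.

Method: Uses a measurement-conditioned diffusion model as the prior within a theoretically principled posterior sampling scheme.

Result: On blind image deblurring, PRISM outperforms state-of-the-art baselines in both image and blur kernel recovery.

Insight: PRISM offers a probabilistic and robust approach to blind inverse problems and illustrates the potential of diffusion models when the forward operator is unknown.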

Abstract: Diffusion models are now commonly used to solve inverse problems in computational imaging. However, most diffusion-based inverse solvers require complete knowledge of the forward operator to be used. In this work, we introduce a novel probabilistic and robust inverse solver with measurement-conditioned diffusion prior (PRISM) to effectively address blind inverse problems. PRISM offers a technical advancement over current methods by incorporating a powerful measurement-conditioned diffusion model into a theoretically principled posterior sampling scheme. Experiments on blind image deblurring validate the effectiveness of the proposed method, demonstrating the superior performance of PRISM over state-of-the-art baselines in both image and blur kernel recovery.

cs.RO [Back]

[127] CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine

Shiyu Fang,Yiming Cui,Haoyang Liang,Chen Lv,Peng Hang,Jian Sun

Main category: cs.RO

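TL;DR: CoReVLA is a dual-stage end-to-end autonomous driving framework that improves performance in long-tail scenarios through data collection and behavior refinement, clearly outperforming existing methods.

Details Motivation: Autonomous driving systems perform poorly in long-tail, safety-critical scenarios, and existing vision-language-action models are limited by data quality and learning efficiency.

Contribution: 1) Proposes the dual-stage CoReVLA framework, which combines open datasets with data collected on a simulation platform; 2) Uses Direct Preference Optimization (DPO) to learn from human preferences, avoiding reward hacking caused by manually designed rewards.

Method: 1) Jointly fine-tune on open-source driving QA datasets; 2) Collect takeover data on the CAVE simulation platform; 3) Refine behavior with DPO.

Result: On the Bench2Drive benchmark, CoReVLA achieves a Driving Score of 72.18 and a Success Rate of 50%, exceeding existing methods by 7.96 DS and 15% SR.

Insight: Collecting takeover data in simulation and combining it with preference optimization improves robustness in long-tail driving scenarios and supports continual improvement.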

Abstract: Autonomous Driving (AD) systems have made notable progress, but their performance in long-tail, safety-critical scenarios remains limited. These rare cases contribute a disproportionate number of accidents. Vision-Language Action (VLA) models have strong reasoning abilities and offer a potential solution, but their effectiveness is limited by the lack of high-quality data and inefficient learning in such conditions. To address these challenges, we propose CoReVLA, a continual learning end-to-end autonomous driving framework that improves the performance in long-tail scenarios through a dual-stage process of data Collection and behavior Refinement. First, the model is jointly fine-tuned on a mixture of open-source driving QA datasets, allowing it to acquire a foundational understanding of driving scenarios. Next, CoReVLA is deployed within the Cave Automatic Virtual Environment (CAVE) simulation platform, where driver takeover data is collected from real-time interactions. Each takeover indicates a long-tail scenario that CoReVLA fails to handle reliably. Finally, the model is refined via Direct Preference Optimization (DPO), allowing it to learn directly from human preferences and thereby avoid reward hacking caused by manually designed rewards. Extensive open-loop and closed-loop experiments demonstrate that the proposed CoReVLA model can accurately perceive driving scenarios and make appropriate decisions. On the Bench2Drive benchmark, CoReVLA achieves a Driving Score (DS) of 72.18 and a Success Rate (SR) of 50%, outperforming state-of-the-art methods by 7.96 DS and 15% SR under long-tail, safety-critical scenarios. Furthermore, case studies demonstrate the model’s ability to continually improve its performance in similar failure-prone scenarios by leveraging past takeover experiences. All code and preprocessed datasets are available at: https://github.com/FanGShiYuu/CoReVLA
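The refinement stage relies on Direct Preference Optimization. As a reminder of what that objective looks like, here is a minimal sketch of the standard per-pair DPO loss, -log sigmoid(beta * [(log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected))]). Treating the human takeover behavior as the chosen response and the model's original behavior as the rejected one is an assumption about how the pairs are formed; the function name and the value of `beta` are likewise hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sequence log-probs)."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the loss is log(2).
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 4))
# Raising the chosen log-prob relative to the rejected one lowers it.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0) < math.log(2))
```

No explicit reward function appears anywhere in this objective, which is the property the abstract points to when it mentions avoiding reward hacking from hand-designed rewards.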

cs.MM [Back]

[128] Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents

Xueqiao Zhang,Chao Zhang,Jingtao Xu,Yifan Zhu,Xin Shi,Yi Yang,Yawei Luo

Main category: cs.MM

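TL;DR: The paper presents Video2Roleplay, which brings the video modality into role-playing agents (RPAs) through the multimodal dataset Role-playing-Video60k and a framework built on dynamic role profiles, and proposes an evaluation covering eight metrics that demonstrates the importance of dynamic role profiles.

Details Motivation: Existing role-playing agents (RPAs) rely mainly on static role profiles and lack the dynamic perceptual abilities of humans. The paper introduces the video modality to give RPAs this dynamic capability.

Contribution: 1. Introduces the concept of dynamic role profiles; 2. Builds the large-scale multimodal dataset Role-playing-Video60k; 3. Develops an RPA framework that combines dynamic and static role profiles; 4. Proposes an evaluation method covering eight metrics.

Method: 1. Adaptively sample video frames over time and feed them to the LLM in temporal order; 2. The static role profile consists of character dialogues from training videos and a summary of the input video at inference time; 3. Dynamic and static profiles are integrated jointly.

Result: Experiments show that the framework improves the response quality of RPAs, confirming the value of dynamic role profiles.

Insight: Adding the video modality and multimodal data gives RPAs richer dynamic behavior, and the approach could extend to more complex human-computer interaction scenarios.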

Abstract: Role-playing agents (RPAs) have attracted growing interest for their ability to simulate immersive and interactive characters. However, existing approaches primarily focus on static role profiles, overlooking the dynamic perceptual abilities inherent to humans. To bridge this gap, we introduce the concept of dynamic role profiles by incorporating video modality into RPAs. To support this, we construct Role-playing-Video60k, a large-scale, high-quality dataset comprising 60k videos and 700k corresponding dialogues. Based on this dataset, we develop a comprehensive RPA framework that combines adaptive temporal sampling with both dynamic and static role profile representations. Specifically, the dynamic profile is created by adaptively sampling video frames and feeding them to the LLM in temporal order, while the static profile consists of (1) character dialogues from training videos during fine-tuning, and (2) a summary context from the input video during inference. This joint integration enables RPAs to generate greater responses. Furthermore, we propose a robust evaluation method covering eight metrics. Experimental results demonstrate the effectiveness of our framework, highlighting the importance of dynamic role profiles in developing RPAs.

cs.CY [Back]

[129] Learning Analytics from Spoken Discussion Dialogs in Flipped Classroom

Hang Su,Borislav Dzodzo,Changlun Li,Danyang Zhao,Hao Geng,Yunxiang Li,Sidharth Jaggi,Helen Meng

Main category: cs.CY

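TL;DR: The study transcribes and analyzes group discussion dialogs in a flipped classroom, extracts features, and uses machine learning to predict learning outcomes, reaching a best accuracy of 78.9%.

Details Motivation: Group discussions in the flipped classroom carry rich information about learning, but automated analysis tools are lacking. The study uses dialog analysis to understand learning processes and outcomes.

Contribution: Proposes extracting features from discussion dialogs to predict learning outcomes, and demonstrates that automatic prediction is feasible.

Method: After recording and transcription, multiple tools extract dialog features; statistical analyses identify indicators related to learning outcomes, and machine learning is used for classification.

Result: The best prediction accuracy reaches 78.9%, showing that predicting learning outcomes from discussion dialogs is feasible.

Insight: Group discussion dialogs provide useful data for learning analytics, and machine learning can support educational assessment.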

Abstract: The flipped classroom is a new pedagogical strategy that has been gaining increasing importance recently. Spoken discussion dialog commonly occurs in the flipped classroom and embeds rich information indicating the processes and progression of students’ learning. This study focuses on learning analytics from spoken discussion dialog in the flipped classroom, aiming to collect and analyze the discussion dialogs in order to understand group learning processes and outcomes. We have recently transformed a course using the flipped classroom strategy, where students watched video-recorded lectures at home prior to group-based problem-solving discussions in class. The in-class group discussions were recorded throughout the semester and then transcribed manually. After extracting features from the dialogs with multiple tools and customized processing techniques, we performed statistical analyses to explore the indicators that are related to the group learning outcomes from face-to-face discussion dialogs in the flipped classroom. Then, machine learning algorithms were applied to the indicators in order to predict the group learning outcome as High, Mid or Low. The best prediction accuracy reaches 78.9%, which demonstrates the feasibility of achieving automatic learning outcome prediction from group discussion dialog in the flipped classroom.

cs.GR [Back]

[130] MoAngelo: Motion-Aware Neural Surface Reconstruction for Dynamic Scenes

Mohamed Ebbed,Zorah Lähner

Main category: cs.GR

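TL;DR: The paper proposes MoAngelo, a framework for highly detailed surface reconstruction of dynamic scenes. It combines the static 3D reconstruction method NeuralAngelo with deformation field optimization to address geometric fidelity in dynamic reconstruction.

Details Motivation: Multi-view reconstruction of dynamic scenes is a core challenge. Existing methods fall short in geometric fidelity and detail, producing results that are either noisy or overly smooth.

Contribution: Proposes the MoAngelo framework, which extends the static method NeuralAngelo to dynamic scenes and improves reconstruction accuracy by jointly optimizing deformation fields and a flexible template.

Method: Combines a high-quality template reconstruction from the initial frame with deformation field optimization, and allows geometry updates to handle occlusions and topology changes.

Result: Shows better reconstruction accuracy than existing methods on the ActorsHQ dataset.

Insight: Jointly optimizing a flexible template and deformation fields is key to high-accuracy reconstruction of dynamic scenes.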

Abstract: Dynamic scene reconstruction from multi-view videos remains a fundamental challenge in computer vision. While recent neural surface reconstruction methods have achieved remarkable results in static 3D reconstruction, extending these approaches with comparable quality for dynamic scenes introduces significant computational and representational challenges. Existing dynamic methods focus on novel-view synthesis; therefore, their extracted meshes tend to be noisy. Even approaches aiming for geometric fidelity often produce overly smooth meshes due to the ill-posedness of the problem. We present a novel framework for highly detailed dynamic reconstruction that extends the static 3D reconstruction method NeuralAngelo to work in dynamic settings. To that end, we start with a high-quality template scene reconstruction from the initial frame using NeuralAngelo, and then jointly optimize deformation fields that track the template and refine it based on the temporal sequence. This flexible template allows updating the geometry to include changes that cannot be modeled with the deformation field, for instance occluded parts or changes in topology. We show superior reconstruction accuracy in comparison to previous state-of-the-art methods on the ActorsHQ dataset.

cs.SD [Back]

[131] SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

Qiaolin Wang,Xilin Jiang,Linyang He,Junkai Wu,Nima Mesgarani

Main category: cs.SD

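TL;DR: SightSound-R1 is a cross-modal distillation framework that transfers the reasoning ability of a large vision-language model (LVLM) to a large audio-language model (LALM), compensating for the lack of large-scale chain-of-thought data in the audio domain.

Details Motivation: The reasoning ability of audio-language models in complex scenes lags behind that of vision-language models, mainly because large-scale chain-of-thought audio data for training is lacking. SightSound-R1 aims to close this gap through cross-modal distillation.

Contribution: Proposes the SightSound-R1 framework, which transfers the reasoning ability of vision models to audio models via cross-modal distillation and significantly improves the performance of audio-language models.

Method: The framework has three steps: (i) test-time scaling to generate audio-focused chain-of-thought data, (ii) audio-grounded validation to filter out incorrect information, and (iii) a distillation pipeline that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO).

Result: Experiments show that SightSound-R1 outperforms baseline models both on the audio-visual question answering (AVQA) dataset and in unseen auditory scenes.

Insight: The reasoning ability of vision models can be effectively transferred to audio models, and performance can be improved further by scaling up audio-visual data.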

Abstract: While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind that of large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale chain-of-thought audio data to teach LALMs stepwise reasoning. To circumvent this data and modality gap, we present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student on the same audio-visual question answering (AVQA) dataset. SightSound-R1 consists of three core steps: (i) test-time scaling to generate audio-focused chains of thought (CoT) from an LVLM teacher, (ii) audio-grounded validation to filter hallucinations, and (iii) a distillation pipeline with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) for the LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions, outperforming both pretrained and label-only distilled baselines. Thus, we conclude that vision reasoning can be effectively transferred to audio models and scaled with abundant audio-visual data.
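
For readers unfamiliar with GRPO, mentioned in step (iii) above, its core ingredient is a group-relative advantage: several responses are sampled for the same question, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that normalization. The reward values and the function name `group_relative_advantages` are hypothetical, and nothing here is taken from the SightSound-R1 implementation; the abstract does not specify the reward design.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline.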

cs.AI [Back]

[132] EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol

Kanato Masayoshi,Masahiro Hashimoto,Ryoichi Yokoyama,Naoki Toda,Yoshifumi Uwamino,Shogo Fukuda,Ho Namkoong,Masahiro Jinzaki

Main category: cs.AI

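TL;DR: The EHR-MCP paper evaluates the integration of a large language model (LLM) with an electronic health record (EHR) system via the Model Context Protocol (MCP). It shows that an LLM can retrieve clinical information effectively in a hospital setting, although complex tasks remain challenging.

Details Motivation: Large language models show promise in medicine, but their use is limited because they cannot directly access hospital EHR systems. MCP makes it possible to integrate LLMs with external tools, and this study evaluates how well that works in a real hospital setting.

Contribution: Proposes the EHR-MCP framework, which integrates an LLM with an EHR database via MCP, demonstrates the LLM's effectiveness in clinical information retrieval, and identifies the challenges posed by complex tasks.

Method: Developed the EHR-MCP framework, used GPT-4.1 with a LangGraph ReAct agent to interact with the EHR, designed six tasks based on infection control team use cases, and evaluated the results against physician-generated gold standards.

Result: The LLM performed near-perfectly on simple tasks but worse on the complex task requiring time-dependent calculations. Errors arose mainly from incorrect arguments or misinterpretation of tool results.

Insight: EHR-MCP provides an infrastructure for secure, consistent data access for hospital AI agents. Future work needs to extend to reasoning, generation, and clinical impact assessment in order to support the effective use of generative AI in clinical practice.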

Abstract: Background: Large language models (LLMs) show promise in medicine, but their deployment in hospitals is limited by restricted access to electronic health record (EHR) systems. The Model Context Protocol (MCP) enables integration between LLMs and external tools. Objective: To evaluate whether an LLM connected to an EHR database via MCP can autonomously retrieve clinically relevant information in a real hospital setting. Methods: We developed EHR-MCP, a framework of custom MCP tools integrated with the hospital EHR database, and used GPT-4.1 through a LangGraph ReAct agent to interact with it. Six tasks were tested, derived from use cases of the infection control team (ICT). Eight patients discussed at ICT conferences were retrospectively analyzed. Agreement with physician-generated gold standards was measured. Results: The LLM consistently selected and executed the correct MCP tools. Except for two tasks, all tasks achieved near-perfect accuracy. Performance was lower in the complex task requiring time-dependent calculations. Most errors arose from incorrect arguments or misinterpretation of tool results. Responses from EHR-MCP were reliable, though long and repetitive data risked exceeding the context window. Conclusions: LLMs can retrieve clinical data from an EHR via MCP tools in a real hospital setting, achieving near-perfect performance in simple tasks while highlighting challenges in complex ones. EHR-MCP provides an infrastructure for secure, consistent data access and may serve as a foundation for hospital AI agents. Future work should extend beyond retrieval to reasoning, generation, and clinical impact assessment, paving the way for effective integration of generative AI into clinical practice.

[133] MICA: Multi-Agent Industrial Coordination Assistant

Di Wen,Kunyu Peng,Junwei Zheng,Yufan Chen,Yitain Shi,Jiale Wei,Ruiping Liu,Kailun Yang,Rainer Stiefelhagen

Main category: cs.AI

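TL;DR: MICA is a multi-agent coordination assistant for industrial environments that combines perception with speech interaction and provides real-time guidance through role-specialized language agents. Its core contributions include the Adaptive Step Fusion (ASF) method and a multi-agent coordination benchmark.

Details Motivation: Industrial workflows need adaptive and trustworthy assistance, but face limited computing power, unstable connectivity, and privacy constraints. MICA aims to address these challenges and provide real-time, accurate industrial assistance.

Contribution: 1. Proposes the MICA system, which combines perception and speech interaction and supports tasks such as assembly and troubleshooting. 2. Introduces Adaptive Step Fusion (ASF), which dynamically combines expert reasoning with speech feedback. 3. Establishes a multi-agent coordination benchmark and proposes evaluation metrics tailored to industrial assistance.

Method: MICA coordinates five role-specialized language agents, audited by a safety checker to ensure compliance. ASF dynamically fuses expert reasoning with natural-language feedback to make step understanding more robust.

Result: Experiments show that MICA outperforms baseline methods in task success, reliability, and responsiveness, and that it can be deployed on offline hardware.

Insight: MICA offers a feasible approach to privacy-preserving multi-agent assistants in dynamic factory environments, and its coordination mechanism and evaluation methods are instructive for intelligent industrial assistance.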

Abstract: Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing resources, limited connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.