Table of Contents
- cs.CL [Total: 36]
- cs.CV [Total: 59]
- cs.SE [Total: 1]
- cs.LG [Total: 4]
- q-bio.QM [Total: 2]
- cs.MM [Total: 1]
- cs.IR [Total: 1]
- eess.IV [Total: 3]
- cs.RO [Total: 1]
- cs.AI [Total: 6]
cs.CL [Back]
[1] Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models cs.CL
Haorui Yu, Ramon Ruiz-Dolz, Xuehang Wen, Fengrui Zhang, Qiufeng Yi
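TL;DR: This paper proposes a three-tier evaluation framework for assessing the cultural understanding of vision-language models (VLMs) in cross-cultural art critique. The framework computes offline metrics, applies rubric-based scoring, and calibrates to human ratings with isotonic regression, yielding a calibrated cultural-understanding score for model selection and cultural-gap diagnosis. The study evaluates 15 VLMs on 294 expert anchors spanning six cultural traditions, finds that automated metrics are unreliable and that Western samples score higher, and stresses the importance of a single primary judge with calibration.
Details
Motivation: VLMs excel at visual perception, but their ability to interpret cultural meaning in art has not been adequately validated, so a systematic method is needed to evaluate their cross-cultural art-critique ability.
Result: On a 152-sample held-out set, isotonic-regression calibration reduced mean absolute error (MAE) by 5.2%. Across 15 VLMs evaluated on 294 expert anchors, Western samples scored higher than non-Western samples under the paper's sampling and rubric.
Insight: Contributions include the three-tier evaluation framework (offline metrics, rubric scoring, and calibration) and the emphasis that automated metrics cannot substitute for assessing cultural depth. A single primary judge with calibration is used to handle cross-judge scale mismatch, giving a practical tool for model selection and cultural-gap diagnosis.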
Abstract: Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
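The Tier III step described in the abstract, calibrating an aggregate judge score to human ratings with isotonic regression, can be illustrated with a minimal sketch. This is not the authors' code: the pool-adjacent-violators routine below is a generic least-squares isotonic fit, and the names `fit_isotonic`, `predict`, `mae`, and the toy `judge`/`human` scores are hypothetical. The error here is measured on the fitting data, whereas the paper reports MAE on a held-out set.

```python
def fit_isotonic(xs, ys):
    """Pool-adjacent-violators: least-squares non-decreasing fit of ys ordered by xs."""
    blocks = []  # each block: [sum of y, count, largest x in block]
    for x, y in sorted(zip(xs, ys)):
        blocks.append([y, 1, x])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = hi
    # Model: list of (upper x bound, fitted value) steps.
    return [(hi, s / n) for s, n, hi in blocks]


def predict(model, x):
    """Step-function lookup: value of the first block whose upper bound covers x."""
    for hi, value in model:
        if x <= hi:
            return value
    return model[-1][1]


def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical toy data: judge aggregate scores and matching human ratings.
judge = [1.0, 2.0, 3.0, 4.0, 5.0]
human = [1.5, 1.0, 3.5, 3.0, 4.5]

model = fit_isotonic(judge, human)
calibrated = [predict(model, x) for x in judge]
print("MAE before calibration:", mae(judge, human))
print("MAE after calibration:", mae(calibrated, human))
```

Isotonic regression only assumes that human ratings are non-decreasing in the judge score, which is why it suits a scale mismatch between judge and humans; in practice a library implementation would normally be used instead of this hand-written fit.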
[2] Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset cs.CL
Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler
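TL;DR: This paper proposes a multilingual, multimodal data collection and processing pipeline for building fact-checking datasets with structured annotations and visual content, addressing the limited scope, missing multimodal evidence, and weak structured linking of existing datasets.
Details
Motivation: Misinformation is rampant on online platforms, yet existing datasets are limited in scope, multimodal evidence, and structured annotation; more comprehensive, explainable, and multilingual fact-checking resources are needed.
Result: State-of-the-art LLMs and multimodal LLMs are used for evidence extraction and justification generation. Evaluation with G-Eval and human assessment validates the pipeline, which supports fine-grained comparison of fact-checking practices across organizations or media markets.
Insight: The novelty lies in aggregating ClaimReview feeds, scraping full debunking articles, normalizing verdicts, and adding structured metadata and aligned visual content, laying the groundwork for research on multilingual, multimodal misinformation verification.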
Abstract: The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
[3] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding cs.CL | cs.CV
Haorui Yu, Ramon Ruiz-Dolz, Diji Yang, Hang He, Fengrui Zhang
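TL;DR: VULCA-Bench is a multicultural art-critique benchmark for evaluating the cultural understanding of vision-language models (VLMs) beyond surface-level visual perception. It contains 7,410 image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage.
Details
Motivation: Existing VLM benchmarks mainly evaluate L1-L2 capabilities (such as object recognition, scene description, and factual question answering) and lack evaluation of higher-order cultural interpretation, so a new benchmark is needed to fill the gap.
Result: Pilot results indicate that higher-layer reasoning (L3-L5, such as cultural interpretation and philosophical aesthetics) is more challenging than visual and technical analysis (L1-L2). The abstract does not report specific model performance or comparisons with the state of the art.
Insight: The novelty is a five-layer framework (from visual perception to philosophical aesthetics) that operationalizes cultural understanding, instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques, giving a structured way to evaluate the cultural intelligence of VLMs.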
Abstract: We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models’ (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.
[4] LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback cs.CL | cs.AI | cs.MA
Weiyue Li, Mingxiao Song, Zhenda Shen, Dachuan Zhao, Yunfan Long
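TL;DR: This paper proposes LLM Review, a framework that uses a blind peer-review mechanism to encourage diverse generation by large language models in creative writing, avoiding the homogenization caused by multi-agent interaction, and builds the SciFi-100 dataset for comprehensive evaluation.
Details
Motivation: To address the limitations of LLMs in creative generation, and the way multi-agent frameworks can suppress creativity because interaction homogenizes content.
Result: On SciFi-100, LLM Review consistently outperforms multi-agent baselines, and smaller models using the framework can surpass larger single-agent models, suggesting that interaction structure may substitute for model scale.
Insight: Contributions include a blind peer-review mechanism that preserves divergent creative trajectories and a unified evaluation framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Viewed objectively, the framework offers a scalable, structured way to improve model creativity.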
Abstract: Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.
[5] Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models cs.CL | cs.AI
Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, Aidong Zhang
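TL;DR: Using sparse autoencoders to analyze the internal representations of large language models, this study identifies a set of latent features causally associated with reasoning behavior. It shows that steering a single reasoning-related latent feature, without explicit chain-of-thought (CoT) prompting, can substantially improve reasoning accuracy; for large models this matches standard CoT prompting while producing more efficient outputs.
Details
Motivation: To investigate why CoT prompting works and whether it is the only mechanism that triggers reasoning in LLMs, with the aim of understanding the internal computational mode of reasoning.
Result: Across multiple model families and reasoning benchmarks, latent-feature steering substantially improves accuracy; for large models, performance is comparable to standard CoT prompting with more efficient outputs.
Insight: The paper shows that multi-step reasoning in LLMs is supported by latent internal activations that can be triggered externally, and that CoT prompting is an effective but not unique way to activate this mechanism. Its novelty is in using analysis of, and intervention on, internal representations to uncover a steerable latent reasoning mode that is independent of explicit prompting.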
Abstract: Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.
[6] Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought cs.CL | cs.AI
Bowen Li, Ziqi Xu, Jing Ren, Renqiang Luo, Xikun Zhang
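TL;DR: This paper proposes Adaptive Causal Prompting with Sketch-of-Thought (ACPS), a framework that targets the excessive token usage and limited cross-task generalization of existing prompting methods such as chain-of-thought (CoT). It uses structural causal models to infer the causal effect of a query on its answer and adaptively selects an appropriate intervention (standard front-door or conditional front-door adjustment), enabling generalizable causal reasoning across heterogeneous tasks without task-specific retraining.
Details
Motivation: Existing prompting methods such as CoT use tokens inefficiently and generalize poorly across different reasoning tasks, so a more efficient and general reasoning framework is needed.
Result: Extensive experiments on multiple reasoning benchmarks and LLMs show that ACPS consistently outperforms existing prompting baselines in accuracy, robustness, and computational efficiency.
Insight: The novelty is in combining structural causal models with adaptive intervention selection and replacing verbose CoT with a concise Sketch-of-Thought for efficient causal reasoning, which suggests a new route to lower inference cost and better cross-task generalization.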
Abstract: Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
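For readers unfamiliar with the term, the standard front-door adjustment mentioned in the abstract is the textbook formula from causal inference shown below. Reading $X$ as the query, $Y$ as the answer, and $M$ as an intermediate mediator is an assumption made here for illustration; the abstract does not state how ACPS instantiates the mediator, nor the exact form of its conditional front-door variant.

```latex
% Standard front-door adjustment (textbook form).
% X: treatment (assumed here: the query), Y: outcome (the answer),
% M: a mediator that intercepts all directed paths from X to Y.
P\bigl(Y = y \mid do(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum averages over alternative treatments $x'$ to remove confounding between $X$ and $Y$, while the outer sum propagates the effect of $x$ through the mediator.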
[7] How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains cs.CL
Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley
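TL;DR: Addressing confidence estimation for large reasoning models (LRMs) in high-stakes domains, this paper introduces the Reasoning Model Confidence estimation Benchmark (RMCB), which contains 347,496 reasoning traces from six LRMs of different architectures and covers datasets from clinical, financial, legal, and mathematical reasoning as well as complex general reasoning. A large-scale evaluation of more than ten representation-based methods finds a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), structure-aware models achieve the best ECE (0.148), and no single method dominates both. In addition, greater architectural complexity does not reliably beat simple sequential baselines, suggesting a performance ceiling for methods that rely only on chunk-level hidden states.
Details
Motivation: Miscalibration of LRMs undermines their reliability in high-stakes domains, so the confidence of their long-form, multi-step outputs must be estimated accurately, yet a systematic benchmark and method evaluation have been lacking.
Result: More than ten representation-based methods were evaluated on RMCB. Text-based encoders achieved the best AUROC (0.672) and structure-aware models the best ECE (0.148), with no method dominating both discrimination and calibration; increased architectural complexity did not reliably outperform simple sequential baselines.
Insight: The paper's contributions are RMCB, which the abstract describes as the most comprehensive benchmark for this task to date, and a systematic demonstration of the discrimination-calibration trade-off in representation-based methods and of the limited payoff from architectural complexity, providing rigorous baselines and insight into the limits of the paradigm.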
Abstract: The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
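The two metrics behind the reported trade-off, AUROC for discrimination and ECE for calibration, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's evaluation code: the function names, the equal-width binning with `n_bins=10`, and the toy `scores`/`labels` are all hypothetical.

```python
def auroc(scores, labels):
    """Chance that a random correct sample is scored above a random incorrect one.

    Ties between a correct and an incorrect sample count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ece(scores, labels, n_bins=10):
    """Expected calibration error using equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    total = len(scores)
    error = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        error += len(b) / total * abs(accuracy - confidence)
    return error


# Hypothetical confidence scores and correctness labels (1 = correct answer).
scores = [0.95, 0.85, 0.75, 0.65, 0.35, 0.15]
labels = [1, 1, 0, 1, 0, 0]

print("AUROC:", auroc(scores, labels))
print("ECE:", ece(scores, labels))
```

AUROC depends only on the ranking of scores, while ECE depends on their absolute values, which is why a method can do well on one and poorly on the other.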
[8] Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training cs.CL | cs.AI | cs.LG
Muhammad Taimoor Hassan, Jawad Ahmed, Muhammad Awais
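TL;DR: This paper introduces Qalb, a large language model for Urdu. Built on LLaMA 3.1 8B with a two-stage approach (continued pre-training followed by supervised fine-tuning), it aims to address the poor Urdu performance of existing multilingual models. The model is continually pre-trained on a 1.97-billion-token corpus (mostly diverse Urdu text, plus English Wikipedia data) and then fine-tuned on the Alif Urdu-instruct dataset. Evaluation shows substantial gains on Urdu-specific benchmarks, with a weighted average score of 90.34 that surpasses both the previous state-of-the-art model and the base model.
Details
Motivation: Despite major progress in large language models, Urdu, spoken by over 230 million people, is severely underrepresented in modern NLP systems. Existing multilingual models perform poorly on Urdu-specific tasks and struggle with its complex morphology, right-to-left Nastaliq script, and rich literary traditions.
Result: On Urdu-specific benchmarks, Qalb achieves a weighted average score of 90.34, which is 3.24 points above the previous state-of-the-art model Alif-1.0-Instruct (87.1) and 44.64 points above the base LLaMA-3.1 8B-Instruct model. It reaches state-of-the-art performance across seven diverse tasks, including classification, sentiment analysis, and reasoning.
Insight: The claimed novelty is a systematic two-stage approach (continued pre-training plus instruction fine-tuning) for adapting a foundation model to a low-resource language. Viewed objectively, the core contribution is assembling a large, high-quality, diverse Urdu corpus and demonstrating that this adaptation strategy works, offering a path that other low-resource language models can follow.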
Abstract: Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language’s complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text (spanning news archives, classical and contemporary literature, government documents, and social media) combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
[9] WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents cs.CL
Yuqing Zhou, Zhuoer Wang, Jie Yuan, Hong Wang, Samson Koelle
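TL;DR: This paper proposes WISE-Flow, a framework aimed at LLM-based service agents that are error-prone in deployment, repeat failures, and behave inconsistently. It extracts workflow-centric structured experience from historical service interactions and, at execution time, aligns the agent to retrieved workflows and reasons about feasibility, enabling the agent to self-evolve.
Details
Motivation: LLM-based agents are widely deployed in user-facing services but are error-prone on new tasks, tend to repeat the same failure patterns, and vary substantially from run to run. Fixing failures through environment-specific training or manual patching is costly and hard to scale.
Result: Experiments on ToolSandbox and $τ^2$-bench show consistent improvements across base models.
Insight: The novelty is a workflow-centric framework for structuring experience: it induces workflows from prerequisite-augmented action blocks and performs state-grounded feasibility reasoning at deployment so the agent can self-evolve, suggesting a route to more robust, self-improving conversational service agents.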
Abstract: Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent’s execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and $τ^2$-bench show consistent improvement across base models.
[10] Relational Knowledge Distillation Using Fine-tuned Function Vectors cs.CL | cs.LG
Andrea Kang, Yingnian Wu, Hongjing Lu
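TL;DR: This paper proposes fine-tuning function vectors to strengthen the representation of relational knowledge in large language models. Fine-tuning function vectors on a small number of examples improves relation-based word-completion performance for models of different sizes. The paper further introduces composite function vectors, weighted combinations of fine-tuned vectors, to extract relational knowledge and support analogical reasoning, markedly improving results on cognitive-science and SAT benchmarks.
Details
Motivation: To determine how LLMs can represent and use relations between concepts more effectively, improving their reasoning and interpretability.
Result: On relation-based word-completion tasks, fine-tuned function vectors outperform the original vectors derived from causal mediation analysis; composite function vectors markedly improve performance on analogy problems from cognitive-science and SAT benchmarks.
Insight: Fine-tuned and composite function vectors provide a controllable activation-patching mechanism for encoding and manipulating relational knowledge, improving model interpretability and reasoning.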
Abstract: Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning, captured in a compact representation known as the function vector. We show that fine-tuning function vectors with only a small set of examples (about 20 word pairs) yields better performance on relation-based word-completion tasks than using the original vectors derived from causal mediation analysis. These improvements hold for both small and large language models. Moreover, the fine-tuned function vectors yield improved decoding performance for relation words and show stronger alignment with human similarity judgments of semantic relations. Next, we introduce the composite function vector - a weighted combination of fine-tuned function vectors - to extract relational knowledge and support analogical reasoning. At inference time, inserting this composite vector into LLM activations markedly enhances performance on challenging analogy problems drawn from cognitive science and SAT benchmarks. Our results highlight the potential of activation patching as a controllable mechanism for encoding and manipulating relational knowledge, advancing both the interpretability and reasoning capabilities of large language models.
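The composite function vector described in the abstract is a weighted combination of fine-tuned function vectors that is then inserted into the model's activations. A minimal sketch of that arithmetic follows; it is an illustration under assumptions, not the paper's implementation. The names `composite_vector` and `inject`, the additive insertion with a `scale` factor, and the toy three-dimensional vectors are hypothetical, and the abstract does not say which layer is patched or how the weights are chosen.

```python
def composite_vector(vectors, weights):
    """Weighted sum of equally sized vectors: sum over k of w_k * v_k."""
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors))
        for i in range(dim)
    ]


def inject(hidden, vector, scale=1.0):
    """Add the scaled vector to a hidden activation element-wise.

    Additive insertion is an assumption made for this sketch.
    """
    return [h + scale * c for h, c in zip(hidden, vector)]


# Hypothetical fine-tuned function vectors for two relations.
relation_vectors = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, -2.0],
]
weights = [0.5, 0.5]

composite = composite_vector(relation_vectors, weights)
patched = inject([0.5, 0.5, 1.0], composite)
print("composite:", composite)
print("patched activation:", patched)
```

In a real model the vectors would have the dimensionality of the residual stream and the patch would be applied at a chosen layer during the forward pass.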
[11] Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering cs.CL | cs.AI
Lavanya Prahallad, Sai Utkarsh Choudarypally, Pragna Prahallad, Pranathi Prahallad
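TL;DR: This paper studies prompt-based automatic evaluation of response clarity in political question answering. Using the CLARITY dataset from the SemEval 2026 shared task, it compares a GPT-3.5 baseline with GPT-5.2 under three prompting strategies (simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples) on clarity, evasion, and topic detection.
Details
Motivation: Automatic evaluation of LLM responses must assess clarity as well as factual correctness, especially in political question answering, and the effect of prompt design on automatic clarity evaluation is underexplored.
Result: On clarity prediction, GPT-5.2 consistently outperforms the GPT-3.5 baseline, with accuracy rising from 56% to 63% under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34%, though improvements are less stable across fine-grained evasion categories. For topic identification, reasoning-based prompting raises accuracy from 60% to 74%.
Insight: The novelty is a systematic comparison of prompting strategies for clarity evaluation in political QA, showing that prompt design reliably improves high-level clarity evaluation while fine-grained evasion and topic detection remain challenging even with structured reasoning prompts.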
Abstract: Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
[12] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models cs.CL
Rongji Li, Jian Xu, Xueqing Chen, Yisheng Yang, Jiayi Wang
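TL;DR: This paper proposes Generation-Augmented Generation (GAG), a plug-and-play framework for injecting private domain knowledge into large language models (LLMs). It treats private expertise as an additional expert modality aligned to a frozen base model through a compact, representation-level interface, which avoids prompt-time evidence serialization and supports plug-and-play specialization, scalable multi-domain composition, and reliable selective activation.
Details
Motivation: Deploying LLMs in high-stakes domains such as biomedicine, materials, and finance requires injecting private, domain-specific, fast-evolving knowledge that is under-represented in public pretraining. Fine-tuning is expensive to iterate and risks catastrophic forgetting, while retrieval-augmented generation (RAG) suffers in specialized private corpora from chunk-induced evidence fragmentation, retrieval drift, and long-context pressure.
Result: On two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and in mixed-domain evaluations, GAG improves over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and achieving near-oracle selective activation for scalable multi-domain deployment.
Insight: The core novelty is treating private knowledge as a new modality and injecting it through representation-level alignment, which avoids the prompt-engineering and retrieval complexity of RAG. The approach enables plug-and-play specialization and reliable multi-domain composition and activation, offering an efficient, flexible option for applying LLMs to private, dynamic knowledge domains.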
Abstract: In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.
[13] User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale cs.CL
Jungho Cho, Minbyul Jeong, Sungrae Park
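TL;DR: This paper proposes a user-oriented framework for multi-turn dialogue generation. It decouples task generation from a user simulator, dynamically generates domain-specific tools, and mimics human behavioral rules (such as incremental requests and turn-by-turn feedback) to produce more realistic, high-turn-count, high-density tool-use dialogue data.
Details
Motivation: Existing tool-use datasets and generation methods are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration, and purely task-oriented designs yield short dialogues with little interaction that do not reflect the iterative, multi-turn problem solving of real scenarios.
Result: The abstract reports no specific benchmarks or quantitative results, but emphasizes that the generated dataset is highly scalable, high-density (multiple tasks completed within a single trajectory), and more realistic in its dialogue characteristics.
Insight: The novelty is the shift from a task-oriented to a user-oriented simulation paradigm: a simulator based on a large reasoning model (LRM) generates tools dynamically, and a user simulator that mimics human behavioral rules elicits more authentic, extended multi-turn dialogues. The generation pipeline works as a plug-and-play module that can start from any state, which supports highly scalable data generation.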
Abstract: The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in “solely task-solving” trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.
[14] Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning cs.CL
Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang
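TL;DR: This paper proposes Med-CoReasoner, a framework that aims to reduce language disparities in medical reasoning through language-informed co-reasoning. It elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval, improving medical reasoning in non-English languages.
Details
Motivation: Reasoning-enhanced large language models perform strongly on English medical tasks but show a substantial multilingual gap in local languages, which limits equitable global medical deployment.
Result: Experiments on three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by 5% on average, with especially large gains in low-resource languages. Model distillation and expert evaluation further confirm that its reasoning traces are clinically sound and culturally grounded.
Insight: The novelty is a language-informed co-reasoning framework that uses concept-level alignment to combine the structural robustness of English reasoning with the practice-grounded expertise of local languages, together with MultiMed-X, a multilingual benchmark of long-form question answering and natural language inference for evaluating medical reasoning beyond multiple-choice settings.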
Abstract: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
[15] Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees cs.CL
Kun Li, Zenan Xu, Junan Li, Zengrui Jin, Jinghao Deng
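TL;DR: This paper proposes DART, a framework that uses reinforcement learning to automatically discover and reinforce tool use within long chain-of-thought (CoT) reasoning, without human annotation, effectively combining tool use with long-chain reasoning.
Details
Motivation: To address the scarcity of training data for integrating tools into long CoT reasoning and the challenge of doing so without harming the model's intrinsic reasoning ability.
Result: DART significantly outperforms existing methods on benchmarks such as AIME and GPQA-Diamond, harmonizing tool execution with long CoT reasoning.
Insight: DART explores tool-integrated trajectories with dynamic rollout trees and reinforces beneficial behaviors using tree-based advantage estimation, applying reinforcement learning to automate training for tool-integrated reasoning.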
Abstract: Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model’s intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
[16] D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning cs.CL
Kangcheng Luo, Tinglang Wu, Yansong Feng
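TL;DR: This paper proposes D$^2$Plan, a dual-agent dynamic global planning paradigm for complex retrieval-augmented reasoning. It targets search-augmented LLMs trained with reinforcement learning (RL), which in multi-hop reasoning tasks suffer from ineffective search chain construction and from reasoning hijacked by irrelevant evidence as the context becomes overloaded. Through the collaboration of a Reasoner and a Purifier, combined with a two-stage training framework, the method achieves more coherent multi-step reasoning and stronger resistance to distraction on several challenging QA benchmarks.
Details
Motivation: In multi-hop reasoning tasks, existing RL-trained search-augmented LLMs accumulate context in which crucial evidence is mixed with irrelevant information, leading to two key failure modes: (1) ineffective search chain construction (incorrect queries or omitted retrieval of critical information) and (2) reasoning hijacked by peripheral evidence (the model mistakes distractors for valid evidence).
Result: Extensive experiments on challenging QA benchmarks show that D$^2$Plan achieves more coherent multi-step reasoning and stronger resilience to irrelevant information, and therefore superior performance.
Insight: The core novelty is a dual-agent collaboration between a Reasoner, which builds and dynamically adapts an explicit global plan, and a Purifier, which assesses retrieval relevance and condenses key information. A two-stage training framework, consisting of a supervised fine-tuning cold start on synthesized trajectories followed by RL with plan-oriented rewards, teaches LLMs the paradigm. The approach decouples and coordinates planning, retrieval, and reasoning more explicitly, improving robustness and interpretability.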
Abstract: Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose D$^2$Plan, a Dual-agent Dynamic global Planning paradigm for complex retrieval-augmented reasoning. D$^2$Plan operates through the collaboration of a Reasoner and a Purifier: the Reasoner constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the Purifier assesses retrieval relevance and condenses key information for the Reasoner. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the D$^2$Plan paradigm. Extensive experiments demonstrate that D$^2$Plan enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.
[17] AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture cs.CL
Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu
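TL;DR: AgriAgent is a two-level agent framework for real-world agricultural scenarios that handles diverse tasks with multimodal inputs through a hierarchical execution strategy based on task complexity. Simple tasks are handled by direct reasoning with modality-specific agents; complex tasks trigger a contract-driven planning mechanism that decomposes the task into capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step, verifiable execution with failure recovery.
Details
Motivation: Existing agent systems in real agricultural settings usually rely on a unified execution paradigm, which struggles with large variations in task complexity and incomplete tool availability. AgriAgent addresses this with a hierarchical strategy and contract-driven planning to improve adaptability and robustness.
Result: Experiments show that AgriAgent achieves higher execution success rates and robustness on complex tasks than existing tool-centric agent baselines that rely on a unified execution paradigm.
Insight: The novelty is a hierarchical execution strategy based on task complexity, together with contract-driven planning and capability-aware tool orchestration, which offers a reusable design for handling task diversity and incomplete tooling in real-world agriculture.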
Abstract: Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.
[18] Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue cs.CLPDF
Run Chen, Wen Liang, Ziwei Gong, Lin Ai, Julia Hirschberg
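TL;DR: This paper presents the first study of mental manipulation detection in spoken dialogue. It builds SPEECHMENTALMANIP, a synthetic multi-speaker benchmark that augments a text dataset with high-quality text-to-speech audio, and uses few-shot large audio-language models and human annotation to evaluate how modality affects detection.
Details
Motivation: Mental manipulation is an emerging task in computational social reasoning. Prior work focused only on textual conversations and overlooked how manipulative tactics manifest in speech, so detection in the speech modality needs study.
Result: Models show high specificity but markedly lower recall on speech than on text, suggesting sensitivity to acoustic or prosodic cues missing from training. Human annotators show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech.
Insight: The novelty is the first benchmark for detecting mental manipulation in speech, which shows that multimodal dialogue systems need modality-aware evaluation and safety alignment. Viewed objectively, the study highlights the challenge of cross-modal generalization and the importance of acoustic features in social reasoning.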
Abstract: Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.
[19] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering cs.CL | cs.LGPDF
Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao
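TL;DR: This paper proposes Latent-GRPO, a reinforcement learning framework that removes GRPO's dependence on expensive external verifiers or hand-written rules. Analyzing latent-space geometry, the authors find that terminal-token representations of correct reasoning trajectories form dense clusters, whereas incorrect trajectories are scattered as outliers. Based on this, they design the IRCE algorithm to generate dense, continuous intrinsic rewards, markedly improving training efficiency while maintaining model performance.
Details
Motivation: Group Relative Policy Optimization (GRPO) improves LLM reasoning performance but relies heavily on expensive external verifiers or human rules. This dependence causes high computational cost, training latency, and sparse rewards that hinder optimization efficiency.
Result: Experiments on multiple datasets show that the method maintains model performance while achieving a training speedup of over 2x relative to baselines, with strong generalization and robustness.
Insight: The novelty is the finding that correct and incorrect reasoning trajectories cluster differently in latent space, and the IRCE algorithm built on that finding: spherical projection mitigates magnitude fluctuations and iterative aggregation estimates a robust "truth centroid", so dense intrinsic rewards are derived directly from latent-space geometry, reducing reliance on external verification.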
Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
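The two ingredients the abstract names for IRCE, spherical projection and iterative centroid aggregation, can be illustrated with a small sketch. This is not the authors' implementation: the function name `robust_centroid_rewards`, the softmax re-weighting, and the `iterations` and `temperature` parameters are assumptions chosen only to show how a cluster-seeking centroid can yield dense, continuous rewards.

```python
import math


def normalize(vec):
    """Project a vector onto the unit sphere (removes magnitude effects)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def robust_centroid_rewards(hidden_states, iterations=5, temperature=0.1):
    """Return one continuous reward per trajectory.

    hidden_states: list of terminal-token vectors, one per sampled trajectory.
    The re-weighting rule below is a hypothetical choice, not the paper's.
    """
    if not hidden_states:
        raise ValueError("need at least one trajectory")
    points = [normalize(h) for h in hidden_states]  # spherical projection
    dim = len(points[0])
    weights = [1.0 / len(points)] * len(points)
    centroid = points[0]
    for _ in range(iterations):
        # Weighted mean of the projected points, pushed back onto the sphere.
        mean = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
        centroid = normalize(mean)
        # Points close to the centroid gain weight; outliers lose influence.
        sims = [dot(p, centroid) for p in points]
        top = max(sims)
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
    # Cosine similarity to the final centroid serves as the dense reward.
    return [dot(p, centroid) for p in points]
```

With three vectors pointing in roughly the same direction and one pointing elsewhere, the three clustered trajectories receive higher rewards than the outlier, regardless of the vectors' original magnitudes.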
[20] Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management cs.CLPDF
Weitao Ma, Xiaocheng Feng, Lei Huang, Xiachong Feng, Zhanyu Ma
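TL;DR: This paper proposes Fine-Mem, a framework that uses fine-grained feedback alignment to address memory management for LLM agents in long-horizon tasks. It introduces a chunk-level step reward and an evidence-anchored reward attribution mechanism to mitigate reward sparsity and difficult credit assignment in reinforcement learning, thereby optimizing the memory-operation policy.
Details
Motivation: Existing RL-based memory managers rely mainly on final task performance as the reward, which leads to sparse rewards and ineffective credit assignment and gives little guidance for individual memory operations.
Result: Experiments on the Memalpha and MemoryAgentBench benchmarks show that Fine-Mem consistently outperforms strong baselines, achieves higher success rates across sub-tasks, and adapts and generalizes well across model configurations and backbones.
Insight: The novelty is a unified fine-grained feedback alignment framework: auxiliary QA tasks produce immediate step rewards, and global rewards are anchored and redistributed based on the specific memory items used as evidence in reasoning, aligning local memory operations with the long-term utility of memory.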
Abstract: Effective memory management is essential for large language model agents to navigate long-horizon tasks. Recent research has explored using Reinforcement Learning to develop specialized memory manager agents. However, existing approaches rely on final task performance as the primary reward, which results in severe reward sparsity and ineffective credit assignment, providing insufficient guidance for individual memory operations. To this end, we propose Fine-Mem, a unified framework designed for fine-grained feedback alignment. First, we introduce a Chunk-level Step Reward to provide immediate step-level supervision via auxiliary chunk-specific question answering tasks. Second, we devise Evidence-Anchored Reward Attribution to redistribute global rewards by anchoring credit to key memory operations, based on the specific memory items utilized as evidence in reasoning. Together, these components enable stable policy optimization and align local memory operations with the long-term utility of memory. Experiments on Memalpha and MemoryAgentBench demonstrate that Fine-Mem consistently outperforms strong baselines, achieving superior success rates across various sub-tasks. Further analysis reveals its adaptability and strong generalization capabilities across diverse model configurations and backbones.
[21] JudgeRLVR: Judge First, Generate Second for Efficient Reasoning cs.CL | cs.AI | cs.LGPDF
Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li
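TL;DR: This paper proposes JudgeRLVR, a two-stage "judge first, generate second" reinforcement learning paradigm for improving the reasoning efficiency of LLMs under reinforcement learning with verifiable rewards. The model first learns to judge whether solutions are valid and then generates on that basis, achieving a better trade-off between accuracy and generation efficiency on math tasks.
Details
Motivation: Conventional RLVR optimizes only final-answer correctness, which tends to drive models into verbose, aimless exploration that relies on trial and error rather than structured planning. Heuristic constraints such as length penalties reduce verbosity but may truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. The paper argues that discriminative capability is a prerequisite for efficient generation.
Result: With the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off than Vanilla RLVR on Qwen3-30B-A3B: on in-domain math tasks, average accuracy improves by about 3.7 points while average generation length falls by 42%; on out-of-domain benchmarks, average accuracy improves by about 4.5 points, indicating stronger generalization.
Insight: The core novelty is treating discriminative capability as a prerequisite for efficient generation and proposing the two-stage judge-then-generate paradigm. The model internalizes a guidance signal that prunes the search space, so it plans reasoning paths in a more structured way rather than by blind trial and error, balancing accuracy and efficiency.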
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
[22] Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots cs.CL | cs.HC | cs.SEPDF
Francesco Dettori, Matteo Forasassi, Lorenzo Veronese, Livia Lestingi, Vincenzo Scotti
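TL;DR: This paper proposes a framework combining natural language processing and formal verification for developing therapy chatbots with verifiable empathy. A Transformer model extracts dialogue features, which are translated into a Stochastic Hybrid Automaton model; empathy properties are verified through Statistical Model Checking, and strategy synthesis guides chatbot behavior. Preliminary results indicate that the formal model captures therapy dialogue dynamics well and that specific strategies raise the probability of satisfying empathy requirements.
Details
Motivation: Current therapy chatbot development lacks a systematic way to specify and verify empathy, a key non-functional requirement in therapeutic settings.
Result: Preliminary results show that the Stochastic Hybrid Automaton model captures therapy dialogue dynamics with good fidelity and that strategy synthesis raises the probability of satisfying empathy properties.
Insight: The novelty is combining NLP (Transformer feature extraction) with formal methods (Stochastic Hybrid Automata, Statistical Model Checking), giving a verifiable engineering framework for non-functional chatbot requirements such as empathy.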
Abstract: Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
[23] STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays cs.CL | cs.AIPDF
Qiuyu Tian, Yiding Li, Fengyi Chen, Zequn Liu, Youyong Kong
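TL;DR: STAGE is a comprehensive benchmark over movie screenplays that evaluates whether models can build a coherent story world and stay consistent across multiple reasoning and generation tasks. It comprises four tasks (knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script role-playing) and covers cleaned scripts, knowledge graphs, and event and character annotations for 150 English and Chinese films.
Details
Motivation: Existing benchmarks mostly target single subtasks such as question answering or dialogue generation and do not evaluate whether a model can build a coherent story world and use it consistently across multiple forms of reasoning. STAGE aims to fill this gap.
Result: The abstract reports no specific quantitative results or benchmark comparisons. It provides a dataset of cleaned scripts, knowledge graphs, and annotations for 150 English and Chinese films, supporting holistic evaluation of world building, event abstraction and verification, long-narrative reasoning, and character-consistent generation.
Insight: The novelty is unifying knowledge graph construction, event summarization, question answering, and role-playing under a shared narrative world representation, enabling cross-task consistency evaluation. Viewed objectively, its multi-task integration and long-narrative grounding offer a more comprehensive framework for evaluating narrative understanding.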
Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models’ abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
[24] STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio cs.CL | cs.CR | cs.LGPDF
Seong-Gyu Park, Sohee Park, Jisu Lee, Hyunsik Na, Daeseon Choi
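TL;DR: This paper proposes STAR, a framework that detects inference-time backdoor attacks in LLM reasoning by analyzing output probability shifts. Such attacks inject malicious reasoning paths without modifying model parameters; STAR detects them efficiently using a state-transition amplification ratio and the CUSUM algorithm.
Details
Motivation: As LLMs increasingly integrate reasoning mechanisms such as chain-of-thought, explicit reasoning exposes a new attack surface for inference-time backdoors. These attacks produce linguistically coherent paths that evade conventional detection, so new detection methods are needed.
Result: Experiments across models (8B-70B) and five benchmark datasets show that STAR generalizes robustly, consistently achieves near-perfect performance (AUROC ≈ 1.0), is about 42x more efficient than existing baselines, and is robust to adaptive attacks that attempt to bypass detection.
Insight: The novelty is exploiting a statistical discrepancy: a path induced by malicious input has low prior probability under the model but high posterior probability. STAR quantifies this state-transition amplification and combines it with the CUSUM algorithm to detect persistent anomalies, yielding an efficient and robust detector for inference-time backdoors.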
Abstract: Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model’s general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
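The detection idea can be sketched as a one-sided CUSUM over per-token log-probability ratios. The abstract does not specify how STAR computes its prior and posterior probabilities, so the inputs here (`log_post`, the token log-probabilities given the full input, and `log_prior`, the same tokens scored without it) and the `drift` and `threshold` values are assumptions for illustration only.

```python
def cusum_alarm(log_post, log_prior, drift=0.5, threshold=4.0):
    """Return the index of the first token at which the alarm fires, or None.

    log_post / log_prior: per-token log-probabilities of the same reasoning
    path under the conditioned and unconditioned model (hypothetical inputs).
    """
    cumulative = 0.0
    for step, (post, prior) in enumerate(zip(log_post, log_prior)):
        # Amplification in log space: how much more likely the token became.
        amplification = post - prior
        # Standard one-sided CUSUM update: accumulate excess over the
        # allowed drift and reset to zero whenever evidence runs out.
        cumulative = max(0.0, cumulative + amplification - drift)
        if cumulative > threshold:
            return step
    return None


# A benign path: small, non-persistent amplification never trips the alarm.
benign = cusum_alarm([-1.0] * 10, [-1.2] * 10)

# A suspicious path: tokens the model finds unlikely a priori become
# near-certain given the input, and the statistic grows every step.
suspicious = cusum_alarm([-0.1] * 10, [-3.0] * 10)
```

Because CUSUM accumulates evidence, a single surprising token is tolerated while a sustained run of amplified tokens triggers detection early in the path.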
[25] Ministral 3 cs.CLPDF
Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi
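TL;DR: The paper introduces the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, in three sizes: 3B, 8B, and 14B parameters. Each size comes in three variants: a pretrained base model, an instruction-finetuned model, and a reasoning model for complex problem solving. The models are derived through Cascade Distillation (iterative pruning and continued training with distillation), have image understanding capabilities, and are all released under the Apache 2.0 license.
Details
Motivation: To meet the need for efficient, versatile language models that can be deployed in applications with limited compute and memory.
Result: The abstract reports no specific benchmark results or performance levels.
Insight: The novelties are: 1) a parameter-efficient dense model family designed for resource-constrained settings; 2) derivation through Cascade Distillation, which combines iterative pruning with continued training with distillation; 3) image understanding in every model variant, giving multimodal support; 4) open release of all models, easing practical deployment.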
Abstract: We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, a technique of iterative pruning and continued training with distillation. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
[26] GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning cs.CLPDF
Jiajin Liu, Yuanfu Sun, Dongzhe Fan, Qiaoyu Tan
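TL;DR: GraphSearch is the first framework to extend search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. A Graph-aware Query Planner separates the search space from semantic queries, and a Graph-aware Retriever builds candidate sets from topology and ranks them with a hybrid scoring function. The framework offers two traversal modes: GraphSearch-R, which recursively expands neighborhoods, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods.
Details
Motivation: Existing search-augmented large reasoning models perform well on text, but their use on graph-structured data (such as e-commerce, social networks, and scientific citation networks) remains underexplored. Graphs carry rich topological signals that can serve as valuable retrieval priors, but exploiting this structure raises two challenges: generating graph-expressive queries and retrieving in a way that balances structural and semantic relevance.
Result: Extensive experiments on multiple benchmarks show that GraphSearch reaches the state of the art in zero-shot node classification and link prediction, with performance comparable or even superior to supervised graph learning methods.
Insight: The novelty is combining search-augmented reasoning with graph learning: decoupling the search space from semantic queries and designing a hybrid retrieval scoring mechanism makes effective use of graph topology. The GraphSearch-R and GraphSearch-F modes provide flexible structural traversal strategies and a general paradigm for agentic reasoning over graphs.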
Abstract: Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.
[27] A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding cs.CLPDF
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit
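TL;DR: This paper introduces XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions (PIEs) covering 34 languages and over ten thousand items, designed to evaluate how well NLP systems understand idioms across languages and across modalities (text and image).
Details
Motivation: NLP systems struggle to understand and process potentially idiomatic expressions, which are closely tied to the everyday experience and culture of a particular language community; a benchmark is needed to assess their linguistic and cultural capabilities.
Result: A high-quality dataset that can be used to evaluate how idiomatic understanding transfers across languages and between the text and image modalities.
Insight: The novelty is a large-scale, parallel, multimodal idiom benchmark that supports cross-lingual and cross-modal comparative analysis, providing a new tool for studying shared cultural aspects and the transfer of understanding between modalities.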
Abstract: Potentially idiomatic expressions (PIEs) construe meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset makes it possible to evaluate model performance for a given PIE in different languages and to test whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to or implies understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
[28] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation cs.CL | cs.AI | cs.LGPDF
Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei
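TL;DR: This paper proposes RULERS, a framework that addresses the difficulty of aligning LLM judges with human standards caused by generation stochasticity. It compiles natural-language rubrics into executable specifications and, through versioned locking, structured decoding with evidence verification, and Wasserstein-based post-hoc calibration, markedly improves the stability and verifiability of evaluation and its agreement with human scores.
Details
Motivation: LLM judges exhibit three common failure modes: rubric instability caused by prompt sensitivity, reasoning that lacks auditable evidence, and scoring scales that do not match human grading boundaries.
Result: Extensive experiments on essay and summarization benchmarks show that RULERS significantly outperforms representative baselines in human agreement, remains stable under adversarial rubric perturbations, and enables smaller models to rival large proprietary judges.
Insight: The novelty is reframing judge alignment as a criteria transfer problem and proposing a compiler-executor framework that requires no model parameter updates. It stresses that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales, not just prompt wording.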
Abstract: The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
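One way to picture Wasserstein-based post-hoc calibration is empirical quantile mapping: for one-dimensional score distributions, the transport map that minimizes Wasserstein distance sends each quantile of the source distribution to the same quantile of the target. The sketch below applies that idea to judge scores and human scores. It is an assumption about what such a calibration could look like, not the procedure RULERS uses, and `fit_quantile_map` is a hypothetical name.

```python
import bisect


def fit_quantile_map(judge_scores, human_scores):
    """Build a monotone map from the judge's scale to the human scale.

    Both arguments are calibration samples; they need not be paired.
    """
    if not judge_scores or not human_scores:
        raise ValueError("both calibration samples must be non-empty")
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)

    def calibrate(score):
        # Number of calibration judge scores at or below this score.
        count = bisect.bisect_right(judge_sorted, score)
        # Matching position in the human sample, using integer ceiling
        # division to avoid floating-point rounding at bin boundaries.
        index = -(-count * len(human_sorted) // len(judge_sorted)) - 1
        index = min(len(human_sorted) - 1, max(0, index))
        return human_sorted[index]

    return calibrate


# A judge that compresses everything into 6-9 is stretched onto a 1-5 scale.
calibrate = fit_quantile_map([6, 7, 7, 8, 9], [1, 2, 3, 4, 5])
```

The map is monotone, so it never reorders two responses; it only moves the judge's grade boundaries toward those observed in human ratings, and no model parameters are touched.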
[29] QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models cs.CLPDF
Zhaolu Kang, Junhao Gong, Wenqing Hu, Shuo Yin, Kehan Jiang
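TL;DR: This paper proposes QuantEval, a benchmark for comprehensively evaluating LLMs on financial quantitative tasks across three dimensions (knowledge-based QA, mathematical reasoning, and strategy coding), and introduces a CTA-style backtesting framework to evaluate model-generated trading strategies.
Details
Motivation: Existing LLM evaluation on financial quantitative tasks is fragmented and mostly limited to knowledge QA, with no systematic assessment of quantitative reasoning and strategy coding.
Result: Evaluating current state-of-the-art open-source and proprietary LLMs reveals substantial gaps to human experts in reasoning and strategy coding; supervised fine-tuning and reinforcement learning experiments on domain-aligned data yield consistent improvements.
Insight: The work decomposes financial quantitative tasks into three dimensions for joint evaluation and introduces an executable backtesting framework to assess strategy coding realistically. It releases the full deterministic backtesting configuration to ensure strict reproducibility, which should help advance practical adoption of LLMs in real trading workflows.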
Abstract: Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs’ quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
[30] RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis cs.CLPDF
Zhengwei Tao, Bo Li, Jialong Wu, Guochen Yan, Huanyao Zhang
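TL;DR: This paper proposes RAGShaper, a data generation framework that automatically synthesizes RAG tasks and robust agent trajectories to address the scarcity of high-quality training data. It builds dense information trees containing adversarial distractors and uses a constrained navigation strategy that forces a teacher agent to deal with them, producing trajectories that demonstrate error correction and noise rejection. Experiments show that models trained on the synthesized data significantly outperform existing baselines on noise-intensive and complex retrieval tasks.
Details
Motivation: Building robust agentic RAG systems is hampered by the scarcity of high-quality training data; manual annotation does not scale and fails to capture the dynamic reasoning strategies needed to handle retrieval failures.
Result: Comprehensive experiments on noise-intensive and complex retrieval tasks show that models trained on RAGShaper's synthesized data significantly outperform existing baselines and are more robust.
Insight: The novelty is an automated framework (InfoCurator builds dense information trees and injects adversarial distractors, combined with a constrained navigation strategy) that synthesizes agent trajectories explicitly demonstrating error correction and noise rejection, providing a scalable way to generate high-quality data for training robust RAG agents.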
Abstract: Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
[31] PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation cs.CLPDF
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan
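TL;DR: PrivGemo is a privacy-preserving retrieval-augmented framework that addresses the privacy leakage risk of grounding LLM reasoning in knowledge graphs (KGs). Its dual-tower design keeps raw KG knowledge local while allowing remote reasoning over an anonymized view that goes beyond simple name masking to limit semantic and structural exposure. The framework supports multi-hop, multi-entity reasoning, and a hierarchical controller and privacy-aware experience memory reduce unnecessary exploration and remote interaction.
Details
Motivation: When private knowledge graphs are used to enhance LLM reasoning, existing privacy treatments (such as masking only entity names) face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency.
Result: Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art (SOTA) results, outperforming the strongest baseline by up to 17.1%. It also enables smaller models (e.g., Qwen3-4B) to reach reasoning performance comparable to GPT-4-Turbo.
Insight: The novelty is an anonymized view that goes beyond name masking, a dual-tower architecture that separates local knowledge retention from remote reasoning, and a hierarchical controller with privacy-aware experience memory that optimizes exploration and interaction. Viewed objectively, the core contribution is integrating privacy protection deeply into the graph retrieval and reasoning pipeline: structural anonymization and local verification protect the private KG while preserving complex reasoning ability and markedly improving small-model performance.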
Abstract: Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
[32] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding cs.CLPDF
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
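TL;DR: This paper proposes FRTR, a multimodal retrieval-augmented generation framework for understanding and reasoning over large-scale enterprise spreadsheets. It decomposes Excel workbooks into row, column, and block embeddings, combines hybrid retrieval with a fusion strategy, and integrates multimodal embeddings to handle numerical and visual information. The authors also build FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, and validate FRTR on multiple LLMs.
Details
Motivation: Existing LLMs struggle to reason over enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior methods typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex multimodal workbooks.
Result: On FRTR-Bench, FRTR reaches 74% answer accuracy with Claude Sonnet 4.5, well above prior SOTA approaches, which reach only 24%. On the SpreadsheetLLM benchmark, FRTR reaches 87% accuracy with GPT-5 while using roughly 50% fewer tokens than context-compression methods.
Insight: The contributions are FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, and the FRTR framework, which handles numerical and visual information in complex spreadsheets through fine-grained decomposition, hybrid retrieval (lexical and dense retrieval fused with RRF), and multimodal embeddings. Viewed objectively, the framework has advantages in scalability and in modeling realistic interaction, and offers a new approach to spreadsheet understanding.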
Abstract: Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
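Reciprocal Rank Fusion, which the abstract names as the way lexical and dense results are combined, scores each item by summing 1 / (k + rank) over the rankings in which it appears. The sketch below shows the standard formula only; the chunk identifiers are made up, and k = 60 is the commonly used default, not a value the abstract states.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical chunk ids returned by a lexical and a dense retriever.
lexical_hits = ["row_12", "col_B", "block_3"]
dense_hits = ["block_3", "row_12", "img_1"]
fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
```

Because RRF uses only rank positions, it needs no score normalization between retrievers whose raw scores live on different scales, which is what makes it convenient for mixing lexical and embedding-based search.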
[33] Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents cs.CLPDF
Xin Quan, Jiafeng Xiong, Marco Valentino, André Freitas
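TL;DR: This paper proposes Attributional NLI (Att-NLI), a framework for evaluating the ability of large language models (LLMs) to infer latent intentions in multi-agent environments. It combines abductive reasoning (generating hypotheses about latent intentions) with deductive reasoning (drawing logical conclusions) and is instantiated through a textual game, Undercover-V. Experiments compare three LLM agents with different reasoning capabilities; the neuro-symbolic Att-NLI agent, which uses external theorem provers, performs best.
Details
Motivation: Traditional natural language inference (NLI) cannot capture the nuanced, intention-driven reasoning of complex interactive systems. Attributional inference, the ability to infer the latent intentions behind actions, is critical for LLMs operating in multi-agent environments but remains underexplored.
Result: Extensive experiments on the Undercover-V textual game show a clear hierarchy of attributional inference capabilities: the neuro-symbolic Att-NLI agent consistently outperforms the others, with an average win rate of 17.08%.
Insight: The main novelty is bringing principles from social psychology (abductive and deductive reasoning) into the NLI framework to build Att-NLI, which specifically evaluates intention inference in LLMs. Viewed objectively, applying neuro-symbolic AI (LLMs combined with external theorem provers) to rational reasoning in multi-agent environments is a promising direction.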
Abstract: Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent’s capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.
[34] To Retrieve or To Think? An Agentic Approach for Context Evolution cs.CL | cs.AIPDF
Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li
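TL;DR: This paper proposes Agentic Context Evolution (ACE), a framework that addresses rigid retrieval steps, high computational cost, and injected noise in current retrieval-augmented generation methods for knowledge-intensive reasoning. A central orchestrator agent uses majority voting to decide dynamically when to retrieve externally and when to reason internally, improving performance while keeping the context concise.
Details
Motivation: Existing retrieval-augmented generation methods typically retrieve at every step. This indiscriminate strategy incurs unnecessary computational overhead and degrades performance by introducing irrelevant noise. The paper aims to design a smarter, dynamic context-evolution framework that overcomes these limitations.
Result: Extensive experiments on challenging multi-hop QA benchmarks show that ACE significantly outperforms competitive baselines in accuracy while consuming tokens efficiently.
Insight: Inspired by human metacognition, the paper introduces a dynamic decision mechanism (majority voting by an orchestrator agent) that switches strategically between retrieval and reasoning to optimize context evolution. Viewed objectively, modeling the retrieval decision explicitly and delegating it to an agent points to a new direction for building more efficient and robust context-aware generation systems.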
Abstract: Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
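The orchestrator's decision step can be pictured as a plain majority vote over several sampled judgments. The sketch below is a guess at the shape of that step, not ACE's implementation: the action labels `"retrieve"` and `"think"`, the function name `majority_decision`, and the rule that ties fall back to reasoning are all assumptions.

```python
from collections import Counter


def majority_decision(votes, tie_breaker="think"):
    """Pick the action most voters chose.

    votes: iterable of action labels, e.g. sampled from an orchestrator.
    tie_breaker: action used when there is no vote or no strict majority
    leader (a hypothetical policy that avoids an unnecessary retrieval).
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return tie_breaker
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_breaker
    return ranked[0][0]


# Two of three sampled judgments ask for new evidence, so retrieval runs.
action = majority_decision(["retrieve", "think", "retrieve"])
```

Voting over several samples smooths out the variance of any single judgment, and a conservative tie-breaker keeps the context from growing when the orchestrator is undecided.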
[35] Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables cs.CLPDF
Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia
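TL;DR: This paper proposes an attention-based model that combines high-resolution aerial imagery with geolocated text (from Wikipedia) to predict environmental variables. The method integrates multimodal data within a spatial neighbourhood and dynamically selects the nearby observations that are useful for the prediction task, improving predictive performance.
Details
Motivation: Address the challenges of using text data in ecological applications: the contribution of text is poorly defined, the data are sparse and irregular, and it is unclear how to integrate text effectively with geospatial data (such as aerial imagery) to reveal local environmental conditions.
Result: Evaluated on the EcoWikiRS dataset by predicting 103 environmental variables from the SWECO25 data cube. Across several variable groups (climate, soil, population, land use/land cover), the method significantly outperforms baselines that use a single location or a single modality (image-only or text-only).
Insight: The novelty is the introduction of spatial context, with an attention mechanism that dynamically integrates text and image information from neighbouring locations. Viewed objectively, the method offers a scalable framework for multimodal geospatial data analysis and highlights the key role of spatial neighbourhood information in fusing text with remote sensing data.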
Abstract: Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can be used in complementarity with geospatial data sources, thus providing insights at the local scale into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular; its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, with an attention-based module that dynamically selects spatial neighbours that are useful for predictive tasks. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal, i.e. image-only or text-only, baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
[36] Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge cs.CL | cs.AI | cs.LGPDF
Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei
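TL;DR: This paper proposes Multiplex Thinking, a stochastic soft reasoning mechanism that improves large language model performance on complex reasoning tasks. At each thinking step it samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token, compactly representing multiple plausible reasoning paths while preserving the prior of standard discrete generation, and it can be optimized directly with on-policy reinforcement learning.
Details
Motivation: Inspired by human soft reasoning (maintaining a probability distribution over several plausible next steps), the work aims to overcome the long, low-bandwidth sequences of conventional chain-of-thought methods and achieve more efficient, adaptive reasoning.
Result: On several challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete chain-of-thought and reinforcement learning baselines from Pass@1 through Pass@1024 while generating shorter sequences, reaching SOTA level.
Insight: The novelty is a continuous multiplex token that represents a probability distribution. This makes reasoning self-adaptive (close to discrete generation when the model is confident, a compact representation of several possibilities when it is uncertain) and directly optimizable with on-policy reinforcement learning, offering a new approach to efficient reasoning.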
Abstract: Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
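The branch-and-merge step can be illustrated with a small sketch. It assumes sampling with replacement and a plain average of the sampled embeddings; the abstract does not specify the aggregation rule, so both choices, along with the function name `multiplex_token`, are assumptions rather than the released implementation.

```python
import random
from typing import List, Sequence, Tuple


def multiplex_token(
    probs: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int,
    rng: random.Random,
) -> Tuple[List[int], List[float]]:
    """Sample k token ids from `probs` and average their embedding vectors."""
    # Sampling with replacement: a confident distribution yields k copies of the
    # same token, so the merged vector collapses to that token's embedding.
    sampled = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [0.0] * dim
    for token_id in sampled:
        for d in range(dim):
            merged[d] += embeddings[token_id][d] / k
    return sampled, merged
```

With a one-hot distribution the merged vector equals a single token embedding, which mirrors the "nearly discrete when confident" behaviour described in the abstract; a flatter distribution produces a blend of several embeddings without lengthening the sequence.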
cs.CV [Back]
[37] An Empirical Study on Knowledge Transfer under Domain and Label Shifts in 3D LiDAR Point Clouds cs.CV | cs.AIPDF
Subeen Lee, Siyeong Lee, Namil Kim, Jaesik Choi
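TL;DR: This paper proposes the ROAD benchmark for evaluating knowledge transfer in 3D LiDAR point clouds under simultaneous domain shift and label shift. Using large-scale autonomous driving datasets, the study evaluates zero-shot transfer, linear probing, and continual learning, and analyzes the impact of backbone architectures, training objectives, and continual learning methods.
Details
Motivation: Continual and transfer learning for 3D point cloud perception remain under-studied, particularly in realistic scenarios where domain and labels shift at the same time, so a comprehensive evaluation benchmark is needed.
Result: Evaluation on large-scale datasets including Waymo, NuScenes, and Argoverse2 reveals the limitations of existing methods under realistic shifts and establishes strong baselines for future research on robust 3D perception.
Insight: The novelty is ROAD, a comprehensive benchmark specific to 3D LiDAR point clouds that accounts for both domain shift and several forms of label evolution, and that systematically evaluates the robustness of different transfer strategies and architectures.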
Abstract: For 3D perception systems to be practical in real-world applications – from autonomous driving to embodied AI – models must adapt to continuously evolving object definitions and sensor domains. Yet, research on continual and transfer learning in 3D point cloud perception remains underexplored compared to 2D vision – particularly under simultaneous domain and label shifts. To address this gap, we propose the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, a comprehensive evaluation suite for LiDAR-based object classification that explicitly accounts for domain shifts as well as three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), we evaluate zero-shot transfer, linear probe, and CL, and analyze the impact of backbone architectures, training objectives, and CL methods. Our findings reveal limitations of existing approaches under realistic shifts and establish strong baselines for future research in robust 3D perception.
[38] Sesame Plant Segmentation Dataset: A YOLO Formatted Annotated Dataset cs.CVPDF
Sunusi Ibrahim Muhammad, Ismail Ismail Tijjani, Saadatu Yusuf Jumare, Fatima Isah Jibrin
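TL;DR: This paper introduces the Sesame Plant Segmentation Dataset, an open-source image dataset annotated in YOLO-compatible segmentation format and designed to support the development of AI models for agricultural applications, specifically for sesame plants. The dataset contains 206 training, 43 validation, and 43 test images of sesame plants at early growth stages under varying environmental conditions, captured at farms in Jirdede, Daura Local Government Area, Katsina State, Nigeria. Images were collected with a high-resolution mobile camera and annotated with the Segment Anything Model version 2 under farmer supervision. Unlike conventional bounding-box datasets, it uses pixel-level segmentation for more precise detection and analysis of sesame plants. A model evaluated with the Ultralytics YOLOv8 framework performs well on both detection and segmentation.
Details
Motivation: Address the lack of a high-quality, pixel-level segmentation dataset for sesame plants in agricultural applications, in order to support more precise plant monitoring, yield estimation, and agricultural research, particularly in regions such as Nigeria.
Result: Evaluated with the Ultralytics YOLOv8 framework, bounding-box detection reaches 79% recall, 79% precision, 84% mAP at IoU 0.50, and 58% mAP over IoU 0.50 to 0.95; segmentation reaches 82% recall, 77% precision, 84% mAP at IoU 0.50, and 52% mAP over IoU 0.50 to 0.95.
Insight: The novelty is an open-source, pixel-level segmentation dataset dedicated to sesame plants, provided in YOLO format and annotated with the Segment Anything Model version 2 under farmer supervision, which improves its practicality and accuracy and fills a gap in agricultural vision datasets for Nigeria.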
Abstract: This paper presents the Sesame Plant Segmentation Dataset, an open source annotated image dataset designed to support the development of artificial intelligence models for agricultural applications, with a specific focus on sesame plants. The dataset comprises 206 training images, 43 validation images, and 43 test images in YOLO compatible segmentation format, capturing sesame plants at early growth stages under varying environmental conditions. Data were collected using a high resolution mobile camera from farms in Jirdede, Daura Local Government Area, Katsina State, Nigeria, and annotated using the Segment Anything Model version 2 with farmer supervision. Unlike conventional bounding box datasets, this dataset employs pixel level segmentation to enable more precise detection and analysis of sesame plants in real world farm settings. Model evaluation using the Ultralytics YOLOv8 framework demonstrated strong performance for both detection and segmentation tasks. For bounding box detection, the model achieved a recall of 79 percent, precision of 79 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 58 percent. For segmentation, it achieved a recall of 82 percent, precision of 77 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 52 percent. The dataset represents a novel contribution to sesame focused agricultural vision datasets in Nigeria and supports applications such as plant monitoring, yield estimation, and agricultural research.
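For readers unfamiliar with YOLO-compatible segmentation labels, the sketch below parses one label line into pixel coordinates. It assumes the common Ultralytics convention of a class index followed by normalized polygon vertices (`class x1 y1 x2 y2 ...`); the abstract does not describe the dataset's files in detail, and `parse_yolo_seg_line` is a hypothetical helper name.

```python
from typing import List, Tuple


def parse_yolo_seg_line(
    line: str, img_w: int, img_h: int
) -> Tuple[int, List[Tuple[float, float]]]:
    """Convert one 'class x1 y1 x2 y2 ...' label line into (class_id, pixel polygon)."""
    parts = line.split()
    class_id = int(parts[0])
    coords = [float(v) for v in parts[1:]]
    # A polygon needs at least three (x, y) vertices, and coordinates come in pairs.
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError("expected at least three normalized (x, y) pairs")
    # Coordinates are normalized to [0, 1], so scale by the image size.
    polygon = [(x * img_w, y * img_h) for x, y in zip(coords[0::2], coords[1::2])]
    return class_id, polygon
```

For example, `parse_yolo_seg_line("0 0.5 0.5 0.75 0.5 0.75 0.75", 100, 200)` yields class 0 with a triangle whose vertices are given in pixels.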
[39] An Efficient Additive Kolmogorov-Arnold Transformer for Point-Level Maize Localization in Unmanned Aerial Vehicle Imagery cs.CVPDF
Fei Li, Lang Qiao, Jiahao Fan, Yijia Xu, Shawn M. Kaeppler
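TL;DR: This paper proposes an efficient Additive Kolmogorov-Arnold Transformer (AKT) for point-level maize localization in high-resolution UAV imagery. By introducing Pade Kolmogorov-Arnold Network (PKAN) modules and a PKAN Additive Attention (PAA) mechanism, the method strengthens small-object feature extraction and reduces computational complexity. The authors also build PML, a real-field dataset with about 501,000 point annotations. Experiments show that AKT surpasses existing state-of-the-art methods on several metrics.
Details
Motivation: Address three challenges of point-level maize localization in high-resolution UAV imagery: extremely low target pixel ratios (<0.1%), the prohibitive cost of quadratic attention on ultra-high-resolution images, and the difficulty general-purpose vision models have with the sparse object distributions and environmental variability specific to agricultural scenes.
Result: On the proposed PML dataset, AKT reaches an average F1-score of 62.8%, 4.2% higher than the state of the art (SOTA), while reducing FLOPs by 12.6% and improving inference throughput by 20.7%. On downstream tasks, the mean absolute error for stand counting is 7.1 and the root mean square error for interplant spacing estimation is 1.95-1.97 cm.
Insight: The main novelty is combining Kolmogorov-Arnold representation theory with efficient attention: PKAN modules replace conventional MLPs to increase functional expressivity, and the PAA mechanism models multiscale spatial dependencies at lower computational complexity. Together these provide an effective framework for high-resolution agricultural remote sensing.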
Abstract: High-resolution UAV photogrammetry has become a key technology for precision agriculture, enabling centimeter-level crop monitoring and point-level plant localization. However, point-level maize localization in UAV imagery remains challenging due to (1) extremely small object-to-pixel ratios, typically less than 0.1%, (2) prohibitive computational costs of quadratic attention on ultra-high-resolution images larger than 3000 x 4000 pixels, and (3) agricultural scene-specific complexities such as sparse object distribution and environmental variability that are poorly handled by general-purpose vision models. To address these challenges, we propose the Additive Kolmogorov-Arnold Transformer (AKT), which replaces conventional multilayer perceptrons with Pade Kolmogorov-Arnold Network (PKAN) modules to enhance functional expressivity for small-object feature extraction, and introduces PKAN Additive Attention (PAA) to model multiscale spatial dependencies with reduced computational complexity. In addition, we present the Point-based Maize Localization (PML) dataset, consisting of 1,928 high-resolution UAV images with approximately 501,000 point annotations collected under real field conditions. Extensive experiments show that AKT achieves an average F1-score of 62.8%, outperforming state-of-the-art methods by 4.2%, while reducing FLOPs by 12.6% and improving inference throughput by 20.7%. For downstream tasks, AKT attains a mean absolute error of 7.1 in stand counting and a root mean square error of 1.95-1.97 cm in interplant spacing estimation. These results demonstrate that integrating Kolmogorov-Arnold representation theory with efficient attention mechanisms offers an effective framework for high-resolution agricultural remote sensing.
[40] Likelihood ratio for a binary Bayesian classifier under a noise-exclusion model cs.CV | math.ST | stat.COPDF
Howard C. Gifford
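TL;DR: This paper proposes a new statistical ideal observer model that performs holistic visual search (or gist) processing by placing thresholds on the minimum extractable image features. The model reduces the number of free parameters and thereby simplifies the system. Applications include medical image perception (for optimizing imaging systems and algorithms), computer vision, performance benchmarking, and feature selection/evaluation, as well as target detection and recognition in defense/security and the evaluation of sensors and detectors.
Details
Motivation: Develop a statistical ideal observer model that performs holistic visual search processing by thresholding the minimum extractable features, with the aim of reducing parameters and simplifying the system for use in fields such as medical imaging and computer vision.
Result: The abstract reports no specific quantitative experiments or benchmarks, but suggests the model can be used for performance benchmarking and feature evaluation, possibly as a theoretical framework or tool.
Insight: The novelty is a likelihood-ratio framework for a binary Bayesian classifier under a noise-exclusion model, which achieves holistic processing through thresholded feature extraction, reduces parameters, and provides a tool for optimization and evaluation in medical imaging, computer vision, and related fields.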
Abstract: We develop a new statistical ideal observer model that performs holistic visual search (or gist) processing in part by placing thresholds on minimum extractable image features. In this model, the ideal observer reduces the number of free parameters, thereby shrinking down the system. Applications of this novel framework are in medical image perception (for optimizing imaging systems and algorithms), computer vision, benchmarking performance and enabling feature selection/evaluations. Other applications are in target detection and recognition in defense/security as well as evaluating sensors and detectors.
[41] Predicting Region of Interest in Human Visual Search Based on Statistical Texture and Gabor Features cs.CV | eess.IV | eess.SP | physics.med-phPDF
Hongwei Lin, Diego Andrade, Mini Das, Howard C. Gifford
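TL;DR: This study examines the relationship between Gabor features and gray-level co-occurrence matrix (GLCM) texture features in modeling early-stage visual search behavior, proposes two feature-combination pipelines to predict the regions where human fixations are likely, and evaluates them on simulated digital breast tomosynthesis images.
Details
Motivation: Address how to model observers' allocation of attention in location-unknown search tasks, in order to understand human visual search behavior.
Result: The proposed pipelines agree qualitatively with a threshold-based model observer in predicting fixation candidate regions, GLCM mean correlates strongly with Gabor feature responses, and the predicted regions are consistent with human observers' eye-tracking data for early-stage gaze behavior.
Insight: The novelty is combining structural (Gabor) and texture (GLCM) features to model visual search, which supports the development of perceptually informed observer models and highlights that features with different formulations encode related image information in a complementary way.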
Abstract: Understanding human visual search behavior is a fundamental problem in vision science and computer vision, with direct implications for modeling how observers allocate attention in location-unknown search tasks. In this study, we investigate the relationship between Gabor-based features and gray-level co-occurrence matrix (GLCM) based texture features in modeling early-stage visual search behavior. Two feature-combination pipelines are proposed to integrate Gabor and GLCM features for narrowing the region of possible human fixations. The pipelines are evaluated using simulated digital breast tomosynthesis images. Results show qualitative agreement among fixation candidates predicted by the proposed pipelines and a threshold-based model observer. A strong correlation is observed between GLCM mean and Gabor feature responses, indicating that these features encode related image information despite their different formulations. Eye-tracking data from human observers further suggest consistency between predicted fixation regions and early-stage gaze behavior. These findings highlight the value of combining structural and texture-based features for modeling visual search and support the development of perceptually informed observer models.
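As background for the GLCM mean feature mentioned above, here is a small sketch of a normalized, non-symmetric co-occurrence matrix and its mean, using the usual definition (the sum over i and j of i times P(i, j)). The offset, the number of gray levels, and the function names are illustrative assumptions; the abstract does not state the settings used in the study.

```python
from typing import List, Tuple


def glcm(
    image: List[List[int]], levels: int, offset: Tuple[int, int] = (0, 1)
) -> List[List[float]]:
    """Normalized gray-level co-occurrence matrix for one (row, col) pixel offset."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # Count how often level image[r][c] is followed by image[r2][c2].
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[n / total for n in row] for row in counts]


def glcm_mean(p: List[List[float]]) -> float:
    """GLCM mean: each reference gray level i weighted by its co-occurrence probability."""
    return sum(i * pij for i, row in enumerate(p) for pij in row)
```

In practice an image would first be quantized to a small number of gray levels, and a library routine would normally replace this loop.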
[42] CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation cs.CVPDF
Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli
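TL;DR: This paper proposes two methods, CASHEW and CASHEW-RL, to stabilize multi-step reasoning in multimodal models. CASHEW is an inference-time framework that produces higher-quality reasoning paths by iteratively aggregating multiple candidate reasoning trajectories and using visual verification to filter hallucinated steps; CASHEW-RL is a variant trained with reinforcement learning that internalizes the aggregation behavior in a single model.
Details
Motivation: Address the instability of current vision-language models in multi-step reasoning, where repeated sampling on the same input produces divergent reasoning trajectories and inconsistent final predictions.
Result: Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance gains, including gains of up to 23.6 percentage points on ScienceQA and 8.1 percentage points on EgoSchema.
Insight: The novelty is an inference-time stabilization framework inspired by test-time scaling that improves reasoning robustness through trajectory aggregation and visual verification. Its reinforcement learning variant, CASHEW-RL, uses a new training objective (GSPO with a composite reward) to enable adaptive reasoning and self-aggregation at inference.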
Abstract: Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
[43] Representations of Text and Images Align From Layer One cs.CV | cs.AIPDF
Evžen Wybitul, Javier Rando, Florian Tramèr, Stanislav Fort
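TL;DR: This paper proposes a new synthesis-based method for studying cross-modal alignment of image and text representations in adapter-based vision-language models. It finds that in Gemma 3 the image and text representations of many concepts are already meaningfully aligned from the first layer, challenging the conventional view that alignment appears only in deep layers.
Details
Motivation: Investigate at which layers image-text alignment arises in vision-language models, and challenge the prevailing view that such alignment usually appears only in late layers.
Result: Tests on hundreds of concepts across seven layers of Gemma 3 show that even at layer 1, more than 50% of synthesized images depict recognizable visual features of the target textual concepts (such as animals, activities, and seasons), providing direct evidence of alignment concept by concept and layer by layer.
Insight: The novelty is an optimization-based synthesis method inspired by DeepDream that visualizes alignment in a model's representation space directly and efficiently, without auxiliary models or datasets, opening a new path for model interpretability.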
Abstract: We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as “Jupiter”, we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50 % of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model’s representation space by backtracing through its image processing components.
[44] Training Free Zero-Shot Visual Anomaly Localization via Diffusion Inversion cs.CVPDF
Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada
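TL;DR: This paper proposes DIVAD, a training-free, vision-only framework for zero-shot anomaly detection and localization. It reconstructs the input image through the inversion process of a pretrained Denoising Diffusion Implicit Model (DDIM) and localizes anomalies from the discrepancy between the input and the reconstruction, without fine-grained prompts or auxiliary modalities.
Details
Motivation: Address the problem in zero-shot image anomaly detection (ZSAD) that vision-only methods lack spatial localization precision, while other methods depend on fine-grained prompts or additional modalities (such as language); the goal is precise anomaly localization without training samples.
Result: Achieves state-of-the-art (SOTA) performance on the VISA dataset and demonstrates strong anomaly localization without auxiliary modalities.
Insight: The novelty is using the inversion process of a pretrained diffusion model for training-free anomaly localization: denoising is started from a fixed intermediate timestep to reconstruct a normal image, which avoids prompt dependence and offers a new approach to vision-only zero-shot anomaly detection.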
Abstract: Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., “an image of an [object class]”), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at https://github.com/giddyyupp/DIVAD.
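The final localisation step, comparing the input with its reconstruction, can be sketched as below. The DDIM inversion and denoising are not shown; `reconstruction` is assumed to be given, and the per-pixel absolute difference with a fixed threshold is an illustrative choice, since the abstract does not say which discrepancy measure the authors use.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1]


def anomaly_map(image: Image, reconstruction: Image) -> Image:
    """Per-pixel absolute difference between the input and its reconstruction."""
    return [
        [abs(a - b) for a, b in zip(row_in, row_rec)]
        for row_in, row_rec in zip(image, reconstruction)
    ]


def localise(amap: Image, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) positions whose discrepancy exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(amap)
        for c, value in enumerate(row)
        if value > threshold
    ]
```

Because the reconstruction is expected to look normal, large differences point to regions that the model could not reproduce, which is where anomalies are likely to be.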
[45] A Highly Efficient Diversity-based Input Selection for DNN Improvement Using VLMs cs.CV | cs.SEPDF
Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand
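TL;DR: This paper proposes an efficient image input selection method based on Concept-Based Diversity (CBD). It uses a vision-language model (VLM) to measure the diversity of input images and combines this with a simple uncertainty metric (Margin) to form a hybrid selection strategy, aiming to select an informative subset for DNN fine-tuning at low computational cost and thereby improve model performance.
Details
Motivation: Address the high computational cost and poor scalability to large input sets of existing diversity-based input selection methods, in order to reduce the cost and time of data labeling in the continual improvement of DNNs.
Result: A comprehensive evaluation across multiple DNN models, input sets, selection budgets, and five state-of-the-art baselines shows that CBD-based selection consistently outperforms all baselines at guiding input selection to improve DNN models, while remaining nearly as efficient as simple uncertainty methods (such as Margin) on large datasets like ImageNet.
Insight: The novelty is using semantic concepts extracted by a VLM to compute input diversity efficiently (CBD) and combining it with an uncertainty metric, which substantially improves computational efficiency and scalability while preserving selection effectiveness, making the approach viable for large-scale practical use.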
Abstract: Maintaining or improving the performance of Deep Neural Networks (DNNs) through fine-tuning requires labeling newly collected inputs, a process that is often costly and time-consuming. To alleviate this problem, input selection approaches have been developed in recent years to identify small, yet highly informative subsets for labeling. Diversity-based selection is one of the most effective approaches for this purpose. However, they are often computationally intensive and lack scalability for large input sets, limiting their practical applicability. To address this challenge, we introduce Concept-Based Diversity (CBD), a highly efficient metric for image inputs that leverages Vision-Language Models (VLM). Our results show that CBD exhibits a strong correlation with Geometric Diversity (GD), an established diversity metric, while requiring only a fraction of its computation time. Building on this finding, we propose a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric. We conduct a comprehensive evaluation across a diverse set of DNN models, input sets, selection budgets, and five most effective state-of-the-art selection baselines. The results demonstrate that the CBD-based selection consistently outperforms all baselines at guiding input selection to improve the DNN model. Furthermore, the CBD-based selection approach remains highly efficient, requiring selection times close to those of simple uncertainty-based methods such as Margin, even on larger input sets like ImageNet. These results confirm not only the effectiveness and computational advantage of the CBD-based approach, particularly compared to hybrid baselines, but also its scalability in repetitive and extensive input selection scenarios.
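The Margin metric mentioned here is usually defined as the gap between the two highest predicted class probabilities, with a small gap indicating an uncertain prediction. The sketch below ranks inputs by that gap only; how the paper combines Margin with CBD is not described in the abstract and is not modelled, and the function names are assumptions.

```python
from typing import List, Sequence


def margin(probs: Sequence[float]) -> float:
    """Difference between the two highest class probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def select_most_uncertain(batch: List[Sequence[float]], budget: int) -> List[int]:
    """Indices of the `budget` inputs with the smallest margin."""
    ranked = sorted(range(len(batch)), key=lambda i: margin(batch[i]))
    return ranked[:budget]
```

A hybrid strategy would add a diversity term so that the selected inputs are not all near-duplicates of the same hard case.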
[46] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures cs.CV | cs.AI | cs.CLPDF
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao
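TL;DR: FigEx2 is a visual-conditioned framework that detects panels in scientific compound figures and generates panel-level captions. It stabilizes the detection query space with a noise-aware gated fusion module and adopts a staged optimization strategy combining supervised and reinforcement learning, using CLIP-based alignment and BERTScore-based semantic rewards to enforce multimodal consistency. The method achieves superior panel detection and captioning performance on BioSci-Fig-Cap and cross-disciplinary test sets and shows strong zero-shot transfer.
Details
Motivation: Address the difficulty of panel-level understanding of scientific compound figures: captions in real pipelines are often missing or provide only figure-level summaries, which makes panel-level analysis hard to support.
Result: On the BioSci-Fig-Cap benchmark, detection reaches 0.726 mAP@0.5:0.95, and captioning significantly outperforms Qwen3-VL-8B on METEOR and BERTScore (by 0.51 and 0.24, respectively); tests in the physics and chemistry domains show zero-shot transfer.
Insight: The novelties are a noise-aware gated fusion module that copes with diverse phrasing in open-ended captions, and a staged optimization strategy that combines supervised and reinforcement learning with multimodal rewards to enforce consistency. Viewed objectively, its cross-disciplinary zero-shot transfer and its construction of a high-quality benchmark are worth drawing on.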
Abstract: Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
[47] Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling cs.CVPDF
Soumyaroop Nandi, Prem Natarajan
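TL;DR: This paper proposes the first vision-language guided framework for both generating and detecting biomedical image forgeries. It achieves semantically controlled manipulations by combining diffusion-based synthesis with vision-language prompting, and it introduces the large-scale benchmark dataset Rescind and Integscan, a detection framework based on structured state space modeling that reaches SOTA on detection and localization.
Details
Motivation: Address the threat that scientific image manipulation in biomedical publications poses to research integrity and reproducibility, and the detection challenges created by the domain-specific artifacts, complex textures, and unstructured layouts of biomedical figures.
Result: Extensive experiments on the Rescind benchmark and existing benchmarks show that Integscan achieves state-of-the-art (SOTA) performance in both detection and localization.
Insight: The novelties include: a unified vision-language guided framework for generation and detection that uses diffusion models and prompts to produce semantically controlled forgeries; Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment; a vision-language model verification loop that ensures semantic fidelity; and Rescind, a large-scale benchmark with fine-grained annotations and modality-specific splits.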
Abstract: Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language model based verification loop that filters generated forgeries based on consistency with intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state of the art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.
[48] From Prompts to Deployment: Auto-Curated Domain-Specific Dataset Generation via Diffusion Models cs.CVPDF
Dongsik Yoon, Jongeun Kim
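TL;DR: This paper proposes an automated diffusion-based pipeline for generating domain-specific synthetic datasets, addressing the distribution shift between pretrained models and real deployment environments. The framework has three stages: synthesizing target objects within domain-specific backgrounds through controlled inpainting, validating the generated outputs with a multimodal assessment (object detection, aesthetic scoring, and vision-language alignment), and capturing subjective selection criteria with a user-preference classifier.
Details
Motivation: Address the distribution shift that pretrained models face in real-world deployment and reduce reliance on extensive real-world data collection by automatically generating high-quality, deployable domain-specific datasets.
Result: The abstract reports no specific quantitative results or benchmarks, but claims that the pipeline enables efficient construction of high-quality datasets.
Insight: The novelty is an automated pipeline that combines diffusion models, multimodal assessment, and a user-preference classifier for controllable domain-specific dataset generation; the idea can be reused for data augmentation and synthetic data generation tasks.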
Abstract: In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.
[49] Subspace Alignment for Vision-Language Model Test-time Adaptation cs.CV | cs.AIPDF
Zhichen Zeng, Wenxuan Bao, Xiao Lin, Ruizhong Qiu, Tianxin Wei
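TL;DR: This paper proposes SubTTA, a new method for test-time adaptation of vision-language models under distribution shift. It aligns the semantic subspaces of the visual and textual modalities to make zero-shot predictions more reliable, so that they guide the adaptation process better.
Details
Motivation: Existing TTA methods rely heavily on zero-shot predictions as pseudo-labels for self-training, but these predictions can be unreliable under distribution shift, mainly because of two fundamental problems: the modality gap and visual nuisance.
Result: Extensive experiments on multiple benchmarks and VLM architectures show that SubTTA is effective, with an average improvement of 2.24% over state-of-the-art TTA methods.
Insight: The core novelty is bridging the modality gap by extracting the principal subspaces of both modalities and aligning them by minimizing their chordal distance, and filtering visual nuisance by projecting visual features onto the task-specific textual subspace, after which standard TTA is run on the purified space to refine the decision boundaries.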
Abstract: Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
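Two of the geometric operations named in this abstract can be sketched with plain Python lists. The code assumes each subspace is already given as the same number k of orthonormal basis vectors, and it uses the standard chordal distance sqrt(k - ||UᵀV||_F²); whether SubTTA uses exactly this form, and how it extracts the principal subspaces, is not stated in the abstract, so treat the names and formula as illustrative.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def chordal_distance(basis_u: List[Vector], basis_v: List[Vector]) -> float:
    """Chordal distance between two k-dimensional subspaces with orthonormal bases."""
    k = len(basis_u)
    # Squared Frobenius norm of U^T V, accumulated entry by entry.
    overlap = sum(dot(u, v) ** 2 for u in basis_u for v in basis_v)
    return math.sqrt(max(0.0, k - overlap))


def project(x: Vector, basis: List[Vector]) -> Vector:
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for b in basis:
        coef = dot(x, b)
        for i, bi in enumerate(b):
            out[i] += coef * bi
    return out
```

Identical subspaces have distance 0 and mutually orthogonal ones have distance sqrt(k). Projecting a visual feature onto the textual basis discards every component outside the textual span, which is the filtering effect the abstract describes.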
[50] Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention cs.CV | cs.MMPDF
Shezheng Song, Shasha Li, Jie Yu
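TL;DR: Through layer-wise masking analysis, this paper reveals how visual-text fusion evolves in multimodal large language models (MLLMs): fusion occurs at specific layers rather than uniformly, and there is a late-stage “review” phenomenon in which visual signals are reactivated. Building on this, the authors propose a training-free contrastive attention framework that improves multimodal reasoning by modeling the attention shift between early fusion layers and the final layer, and they validate it on multiple benchmarks.
Details
Motivation: Although MLLMs have made remarkable progress in vision-language understanding, how they internally integrate visual and textual information remains unclear; this paper aims to fill that gap through systematic analysis.
Result: Extensive experiments across multiple MLLM architectures and benchmarks show that the proposed contrastive attention framework improves multimodal reasoning performance and validate the analysis.
Insight: The novelty is using layer-wise masking analysis to reveal the non-uniform distribution of visual fusion and the “review” phenomenon in MLLMs, and designing from these findings a training-free contrastive attention method that models attention shifts to strengthen the model's focus on relevant regions and thereby improve performance.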
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage “review” phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
[51] Instance-Aligned Captions for Explainable Video Anomaly Detection cs.CVPDF
Inpyo Song, Minjun Joo, Joonhyung Kwon, Eunji Jeon, Jangwon Lee
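TL;DR: This paper proposes instance-aligned captions to make explainable video anomaly detection (VAD) more verifiable. By linking textual descriptions to specific object instances in the video and their appearance and motion attributes, the method explicitly answers who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded. The authors also build a large-scale testbed comprising eight VAD benchmark datasets and the extended VIEW360+ dataset.
Details
Motivation: Existing explainable VAD methods lack spatial grounding in multi-entity interaction scenarios, so the explanations they generate are incomplete or visually misaligned, which reduces their trustworthiness.
Result: Experiments show that the proposed instance-level, spatially grounded captions expose significant limitations in current LLM- and VLM-based methods and provide a robust benchmark for future research on trustworthy, interpretable anomaly detection.
Insight: The core novelty is the instance-aligned caption framework, which aligns explanation text at a fine-grained level with specific object instances in the video and their attributes (appearance, motion), producing verifiable and actionable reasoning. The work also builds VIEW360+, a large-scale, multi-scene, comprehensive testbed for explainable VAD.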
Abstract: Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness. To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes. Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD. Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection.
[52] Representation Learning with Semantic-aware Instance and Sparse Token Alignments cs.CVPDF
Phuoc-Nguyen Bui, Toan Duc Nguyen, Junghyun Bum, Duc-Tai Le, Hyunseung Choo
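TL;DR: This paper proposes SISTA, a multi-level alignment framework for medical vision-language pre-training. It exploits semantic correspondence at two levels, image-report and patch-word, improving conventional contrastive learning by eliminating false negatives and strengthening the alignment between image patches and relevant words, which improves downstream performance.
Details
Motivation: Conventional medical vision-language pre-training uses contrastive learning that treats paired image-report samples as positives and unpaired ones as negatives. Images or reports from different patients can be highly similar, however, so treating every unpaired sample as a negative disrupts the semantic structure and degrades representation quality. A finer-grained alignment strategy is needed.
Result: Experiments show improved cross-dataset transfer performance on three downstream tasks (image classification, image segmentation, and object detection), with significant gains on fine-grained tasks when labeled data is limited.
Insight: The contributions are the use of inter-report similarity to eliminate false negatives and a method for aligning image patches with relevant words. Both help capture finer-grained semantics and improve representation quality, and could carry over to other vision-language tasks that require complex semantic alignment.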
Abstract: Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., image-report and patch-word levels. Specifically, we improve the conventional contrastive learning by incorporating inter-report similarity to eliminate the false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
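The false-negative issue described in this abstract can be illustrated with a small sketch. It is not the SISTA implementation: it assumes one plausible reading, in which unpaired reports that are very similar to the paired report are dropped from the contrastive denominator. The function name `masked_info_nce`, the temperature `tau`, and the similarity `threshold` are hypothetical choices made for illustration.

```python
import math

def masked_info_nce(img_txt_sim, report_sim, tau=0.1, threshold=0.9):
    """Image-to-report InfoNCE that skips likely false negatives.

    img_txt_sim[i][j]: similarity between image i and report j.
    report_sim[i][j]:  similarity between report i and report j.
    Unpaired reports with report_sim >= threshold are treated as likely
    false negatives and removed from the denominator (an assumption of
    this sketch, not a detail taken from the paper).
    """
    n = len(img_txt_sim)
    total = 0.0
    for i in range(n):
        # Keep the true pair plus every unpaired report that is clearly different.
        kept = [j for j in range(n) if j == i or report_sim[i][j] < threshold]
        logits = [img_txt_sim[i][j] / tau for j in kept]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - img_txt_sim[i][i] / tau
    return total / n

# Toy batch: reports 0 and 1 are near-duplicates from different patients.
sim = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rep = [[1.0, 0.95, 0.0], [0.95, 1.0, 0.0], [0.0, 0.0, 1.0]]
masked = masked_info_nce(sim, rep)                # near-duplicates ignored
plain = masked_info_nce(sim, rep, threshold=2.0)  # nothing masked: standard InfoNCE
```

With the mask, the loss no longer penalizes image 0 for also matching the near-duplicate report 1, so `masked` comes out lower than `plain` on this toy batch.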
[53] CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval cs.CVPDF
Feiran Wang, Junyi Wu, Dawen Cai, Yuan Hong, Yan Yan
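TL;DR: CogniMap3D is a bio-inspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. It maintains a persistent memory bank of static scenes for efficient spatial knowledge storage and rapid retrieval. It integrates three core capabilities: multi-stage motion cues for identifying dynamic objects, a cognitive mapping system for storing and updating static scenes, and factor graph optimization of camera poses.
Details
Motivation: To store, retrieve, and update static scene knowledge efficiently in dynamic 3D scene understanding, and to support continuous scene understanding across multiple visits.
Result: Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping show state-of-the-art (SOTA) performance, with effective support for continuous scene understanding over long sequences and multiple visits.
Insight: The novelty lies in bringing human cognitive processes such as memory storage and recall into 3D mapping. A persistent memory bank and a cognitive mapping system manage and retrieve static scenes efficiently; combined with motion cues and factor graph optimization, this improves robustness and efficiency in dynamic environments.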
Abstract: We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping tasks demonstrate its state-of-the-art performance, while effectively supporting continuous scene understanding across extended sequences and multiple visits.
[54] Instruction-Driven 3D Facial Expression Generation and Transition cs.CV | cs.AI | cs.GR | cs.LG | cs.MMPDF
Anh H. Vo, Tae-Seok Kim, Hulin Jin, Soo-Mi Choi, Yong-Guk Kim
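TL;DR: This paper proposes an instruction-driven framework for 3D facial expression generation and transition. It generates a 3D face from a text instruction and produces smooth transitions between any two designated expressions. The framework comprises the IFED module, the I2FET method, and a Facial Expression Transition model, and outperforms existing methods on the CK+ and CelebV-HQ datasets.
Details
Motivation: 3D avatars usually have only a few basic expressions and struggle to simulate realistic emotional variation. The paper aims to use text instructions to generate arbitrary facial expressions and smooth transitions between them, greatly expanding the variety of expressions and transitions.
Result: Extensive evaluation on the CK+ and CelebV-HQ datasets shows that the proposed model outperforms state-of-the-art (SOTA) methods and can generate facial expression trajectories from text instructions.
Insight: The contributions are the Instruction-driven Facial Expression Decomposer (IFED) module, which facilitates multimodal data learning and captures the correlation between textual descriptions and facial expression features, and the Instruction to Facial Expression Transition (I2FET) method, which uses a vertex reconstruction loss to refine the semantic comprehension of latent vectors and generate an expression sequence from a given instruction. This offers a new way to control 3D facial animation flexibly through natural language.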
Abstract: A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at https://vohoanganh.github.io/tg3dfet/
[55] Second-order Gaussian directional derivative representations for image high-resolution corner detection cs.CVPDF
Dongbo Xie, Junjie Qiu, Changming Sun, Weichuan Zhang
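TL;DR: This paper proposes a high-resolution image corner detection method based on second-order Gaussian directional derivatives (SOGDD). SOGDD filters are used to smooth two typical high-resolution corner models (END-type and L-type), their SOGDD representations are derived, and several characteristics of high-resolution corners are identified, enabling accurate detection of adjacent corners. Experiments confirm that the method outperforms existing state-of-the-art methods in localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
Details
Motivation: Zhang et al.'s use of a simple corner model to obtain corner characteristics has a theoretical flaw: the grayscale information of adjacent corners affects each other. The paper addresses the accurate detection of adjacent corners in high-resolution images.
Result: Experiments show that the proposed method outperforms current state-of-the-art (SOTA) methods in localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
Insight: The novelty is the first use of SOGDD filters to model high-resolution corners and the derivation of SOGDD representations for END-type and L-type corners. Choosing an appropriate Gaussian filtering scale then makes it possible to separate and describe the intensity variation of adjacent corners accurately, giving a new theoretical framework and practical method for high-precision corner detection.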
Abstract: Corner detection is widely used in various computer vision tasks, such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.’s use of a simple corner model to obtain a series of corner characteristics, as the grayscale information of two adjacent corners can affect each other. To address these issues, a second-order Gaussian directional derivative (SOGDD) filter is used in this work to smooth two typical high-resolution corner models (i.e., END-type and L-type models). The SOGDD representations of these two corner models are then derived separately, and many characteristics of high-resolution corners are discovered, which enables us to demonstrate how to select Gaussian filtering scales to obtain intensity variation information from images, accurately depicting adjacent corners. In addition, a new high-resolution corner detection method for images is proposed for the first time, which can accurately detect adjacent corner points. The experimental results verify that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
[56] GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards cs.CV | cs.AIPDF
Yan Zhu, Te Luo, Pei-Yao Fu, Zhen Zhang, Zi-Long Wang
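TL;DR: This paper presents GI-Bench, a benchmark for systematically evaluating multimodal large language models (MLLMs) across a panoramic gastrointestinal endoscopy workflow. Twelve advanced MLLMs are evaluated across five clinical stages and compared with human endoscopists. Top models match junior endoscopists in diagnostic reasoning but show significant bottlenecks in lesion localization and factual accuracy.
Details
Motivation: To systematically evaluate the clinical utility of MLLMs in gastroenterology, validating their performance against human physicians and a comprehensive clinical workflow in order to reveal the limitations and potential of current models.
Result: On GI-Bench, Gemini-3-Pro achieves SOTA performance. In diagnostic reasoning, the Macro-F1 of top models (0.641) exceeds that of trainees (0.492) and is comparable to junior endoscopists (0.727; p>0.05). In lesion localization, the best model's mIoU (0.345) is significantly below that of humans (>0.506; p<0.05), and reports combine high fluency with low factual accuracy.
Insight: The contribution is GI-Bench, a clinical benchmark covering the panoramic workflow and fine-grained lesion categories, which reveals a "spatial grounding bottleneck" and a "fluency-accuracy paradox" in MLLMs: models surpass humans in language generation but make factual errors through "over-interpretation" and hallucination. This has important implications for the reliable deployment of medical AI.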
Abstract: Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We aim to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical “spatial grounding bottleneck” persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a “fluency-accuracy paradox”: models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to “over-interpretation” and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
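For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how Macro-F1 and mean IoU are commonly computed. It assumes axis-aligned bounding boxes given as `(x1, y1, x2, y2)`; the abstract does not say whether GI-Bench localizes lesions with boxes or masks, and the function names here are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare lesion classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, true_boxes):
    """Average IoU over paired predictions and ground-truth boxes."""
    return sum(box_iou(p, t) for p, t in zip(pred_boxes, true_boxes)) / len(true_boxes)
```

Because Macro-F1 averages over classes rather than samples, a model that ignores an uncommon lesion category is penalized as heavily as one that misses a common category.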
[57] Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models cs.CVPDF
Wei Xu
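TL;DR: This paper proposes Global-to-Parallel Multi-scale Encoding (GPM), inspired by human vision, and builds the lightweight network H-GPE on it. A Global Insight Generator (GIG) extracts holistic cues, and parallel branches (LSAE and IRB) handle mid-/large-scale semantic relations and fine-grained texture respectively, jointly representing global and local features. Experiments on image classification, object detection, and semantic segmentation show that H-GPE keeps a balanced footprint in FLOPs and parameters while delivering a better accuracy-efficiency trade-off than current state-of-the-art lightweight models.
Details
Motivation: Existing lightweight vision networks struggle to balance parameter count, computational overhead, and task performance, and some human-vision-inspired models oversimplify the visual process. The paper aims to design a lightweight encoding method closer to real perception by mimicking the cooperative mechanism of the human visual system: perceiving the whole before the details, and maintaining broad contextual awareness during local attention.
Result: On image classification, object detection, and semantic segmentation benchmarks, H-GPE achieves strong performance with a balanced footprint in FLOPs and parameters, offering a better accuracy-efficiency trade-off than recent state-of-the-art (SOTA) lightweight models.
Insight: The novelty is the human-vision-inspired GPM paradigm, in which a Global Insight Generator works together with parallel branches (one for semantic relations, one for texture detail) to give a coherent representation of holistic and local features. This offers a new direction for building lightweight models that are more efficient and closer to the principles of biological vision.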
Abstract: Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision: perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.
[58] Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging cs.CVPDF
Md. Faiyaz Abdullah Sayeedi, Rashedur Rahman, Siam Tahsin Bhuiyan, Sefatul Wasi, Ashraful Islam
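TL;DR: This paper proposes R^4, an agentic framework for medical image analysis. It decomposes the workflow into four coordinated agents: a Router that configures task-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. On chest X-ray analysis, R^4 consistently improves report generation and weakly supervised detection without gradient-based fine-tuning.
Details
Motivation: Medical image analysis relies mainly on large vision-language models, but most systems are single-pass black boxes with limited control over reasoning, safety, and spatial grounding. The paper addresses this with a decomposable, controllable agentic framework that improves the reliability and interpretability of clinical image interpretation.
Result: On chest X-ray analysis with multiple modern VLM backbones, R^4 outperforms strong single-VLM baselines on both report generation and weakly supervised detection. It raises LLM-as-a-Judge scores by roughly +1.7 to +2.5 points and mAP50 by +2.5 to +3.5 absolute points.
Insight: The contribution is decomposing end-to-end medical image analysis into a collaborative framework of four agents (routing, retrieval, reflection, and repair), giving explicit control over reasoning and iterative improvement. Viewed objectively, the core idea is combining the reflect-repair loop with an exemplar memory to correct specific clinical error modes, turning strong but brittle VLMs into more reliable, better-grounded tools without gradient fine-tuning. This offers a new architectural paradigm for safer, more controllable medical AI systems.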
Abstract: Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent
[59] MobiDiary: Autoregressive Action Captioning with Wearable Devices and Wireless Signals cs.CVPDF
Fei Deng, Yinghui He, Chuntong Chu, Ge Wang, Han Ding
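TL;DR: This paper presents MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (IMU and Wi-Fi), addressing the privacy and environmental limitations of conventional vision-based activity recognition systems.
Details
Motivation: Human activity recognition in smart homes is critical for health monitoring and assistive living, but vision-based systems face privacy concerns and environmental limitations such as occlusion. A method is needed that produces readable descriptions directly from physical signals.
Result: Experiments on multiple public benchmarks (XRF V2, UWash, and WiFiTAD) show that MobiDiary achieves state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforms specialized baselines in continuous action understanding.
Insight: The contributions include a unified sensor encoder that uses a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embeddings to unify spatial context across sensors, bridging the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions.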
Abstract: Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to pre-defined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals–where both inertial and wireless data reflect underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embedding to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word-by-word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary effectively generalizes across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
[60] Knowledge-based learning in Text-RAG and Image-RAG cs.CV | cs.AIPDF
Alexander Shim, Khalil Saieh, Samuel Clarke
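TL;DR: This study analyzes and compares multimodal approaches that pair a Vision Transformer (EVA-ViT) image encoder with a LlaMA or ChatGPT LLM, aiming to reduce hallucination and detect diseases in chest X-ray images. Models are trained on the NIH Chest X-ray dataset and compared across image-based RAG, text-based RAG, and a baseline. Text-based RAG effectively reduces hallucination by using external knowledge, while image-based RAG improves prediction confidence and calibration through a KNN method. The GPT LLM also shows better performance, a lower hallucination rate, and better Expected Calibration Error (ECE) than the LlaMA-based model. The study highlights the challenges of data imbalance and a complex multi-stage structure, and suggests a large-scale experimental environment and balanced use of examples.
Details
Motivation: To address the hallucination that multimodal methods (combining vision and language models) can produce in chest X-ray disease detection, and to improve prediction accuracy and reliability.
Result: On the NIH Chest X-ray dataset, text-based RAG effectively reduces hallucination and image-based RAG improves prediction confidence and calibration. The GPT LLM outperforms the LlaMA model in performance, hallucination rate, and ECE; the paper does not state whether SOTA is reached.
Insight: The contributions include combining an EVA-ViT image encoder with an LLM for multimodal disease detection, text RAG with external knowledge to reduce hallucination, and image RAG with a KNN method to improve calibration. Viewed objectively, integrating RAG strategies across modalities offers a new direction for reliable AI diagnosis in medical image analysis.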
Abstract: This research analyzed and compared the multi-modal approach in the Vision Transformer (EVA-ViT) based image encoder with the LlaMA or ChatGPT LLM to reduce the hallucination problem and detect diseases in chest x-ray images. In this research, we utilized the NIH Chest X-ray image to train the model and compared it in image-based RAG, text-based RAG, and baseline. [3] [5] As a result, the text-based RAG [2] effectively reduces the hallucination problem by using external knowledge information, and the image-based RAG improved the prediction confidence and calibration by using the KNN methods. [4] Moreover, the GPT LLM showed better performance, a low hallucination rate, and better Expected Calibration Error (ECE) than the Llama-based model. This research shows the challenges of data imbalance and a complex multi-stage structure, but suggests a large-scale experimental environment and a balanced use of examples.
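Expected Calibration Error (ECE), the calibration metric named in this abstract, can be sketched as follows. This is the common equal-width binning formulation and not necessarily the exact variant used in the paper; the bin count `n_bins=10` is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct:     booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

A lower value means the stated confidence tracks the observed accuracy more closely, which is the sense in which the abstract reports better calibration for image-based RAG and for the GPT model.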
[61] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence cs.CV | cs.DCPDF
Michele Fiori, Gabriele Civitarese, Marco Colussi, Claudio Bettini
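TL;DR: This paper proposes an improved zero-shot method for recognizing Activities of Daily Living (ADLs), using event-based segmentation and a new confidence estimate to improve the performance of large language models (LLMs) on smart-home sensor data.
Details
Motivation: Existing LLM-based zero-shot ADL recognition relies on time-based segmentation, which is poorly matched to the contextual reasoning capabilities of LLMs, and lacks a way to estimate prediction confidence.
Result: On complex, realistic datasets, event-based segmentation consistently outperforms time-based LLM approaches and even surpasses supervised data-driven methods, and the proposed confidence measure effectively distinguishes correct from incorrect predictions.
Insight: The novelty is replacing time-based segmentation with event-based segmentation to make better use of LLMs' contextual reasoning, and introducing confidence estimation to make zero-shot recognition more reliable.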
Abstract: Unobtrusive sensor-based recognition of Activities of Daily Living (ADLs) in smart homes by processing data collected from IoT sensing devices supports applications such as healthcare, safety, and energy management. Recent zero-shot methods based on Large Language Models (LLMs) have the advantage of removing the reliance on labeled ADL sensor data. However, existing approaches rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs. Moreover, existing approaches lack methods for estimating prediction confidence. This paper proposes to improve zero-shot ADL recognition with event-based segmentation and a novel method for estimating prediction confidence. Our experimental evaluation shows that event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and surpasses supervised data-driven methods, even with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.
[62] HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding cs.CV | cs.AIPDF
Qitan Lv, Tianyu Liu, Wen Wu, Xuenan Xu, Bowen Zhou
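TL;DR: This paper proposes HIPPO, a holistic-aware parallel speculative decoding framework for accelerating video large language model inference. With a semantic-aware token preservation method and a video parallel speculative decoding algorithm, it addresses the speedup bottleneck that existing methods face because of lost visual semantics and draft-model inference cost.
Details
Motivation: Existing speculative decoding methods for video LLMs speed up inference mainly by pruning redundant visual tokens, but have two problems: pruning strategies fail to preserve visual semantic tokens adequately, which lowers draft quality and acceptance rates; and even with aggressive pruning, the draft model's remaining compute cost limits the overall speedup. HIPPO targets both limitations.
Result: Experiments on four video LLMs across six benchmarks show that HIPPO is effective, achieving up to 3.51x speedup over standard autoregressive decoding.
Insight: The main contributions are: 1) a semantic-aware token preservation method that fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; 2) a video parallel speculative decoding algorithm that decouples and overlaps the draft generation and target verification phases, increasing parallelism. Viewed objectively, the core idea is moving speculative decoding optimization beyond reducing token count toward preserving token semantic quality and restructuring the compute pipeline.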
Abstract: Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, existing methods do not achieve inference acceleration comparable to text-only LLMs. We observe from extensive experiments that this phenomenon mainly stems from two limitations: (i) their pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% visual tokens removed), the draft model’s remaining inference cost limits overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO proposes (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; (ii) a video parallel SD algorithm that decouples and overlaps draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO’s effectiveness, yielding up to 3.51x speedup compared to vanilla auto-regressive decoding.
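As background for the draft-and-verify idea this abstract builds on, here is a minimal sketch of generic greedy speculative decoding. It is not HIPPO: it has no visual token pruning and runs the draft and verify phases sequentially rather than overlapped. The models are stand-in callables that map a token list to the next token, and the names `speculative_greedy_decode`, `toy_target`, and `toy_draft` are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens

def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: List[int], max_new_tokens: int,
                              k: int = 4) -> List[int]:
    """Draft k tokens cheaply, keep the prefix the target agrees with."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: the small model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # Verify phase: a real system scores all positions in one target forward
        # pass; this sketch calls the target once per position for clarity.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # Keep the accepted prefix, then the target's own correction.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens

def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after tokens 2, 5, and 8.
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10
```

Under greedy verification the output is identical to decoding with the target alone, so the speedup comes only from how many drafted tokens are accepted per target pass. That is why the abstract treats draft quality (acceptance rate) and draft cost as the two limiting factors.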
[63] One-Shot Identification with Different Neural Network Approaches cs.CV | cs.LGPDF
Janis Mohr, Jörg Frochte
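TL;DR: This paper explores different neural network approaches to one-shot identification in data-scarce settings, focusing on an industrial application and face recognition. Using a stacked-image technique and siamese capsule networks, the study finds that the capsule architecture achieves strong results on several datasets, outperforms other techniques, and is easy to use and optimize.
Details
Motivation: Convolutional neural networks struggle to learn good features when data is scarce (one-shot learning); the paper explores special techniques suited to one-shot identification.
Result: On a wide range of datasets from industrial applications to face recognition, the capsule-based approach achieves strong results and outperforms other techniques.
Insight: The novelty is combining the stacked-image technique with siamese capsule networks for one-shot identification. Viewed objectively, capsule networks show structural advantages and practicality in one-shot learning.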
Abstract: Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. But learning good features often is computationally expensive in machine learning settings and is especially difficult when there is a lack of data. One-shot learning is one such area where only limited data is available. In one-shot learning, predictions have to be made after seeing only one example from one class, which requires special techniques. In this paper we explore different approaches to one-shot identification tasks in different domains including an industrial application and face recognition. We use a special technique with stacked images and use siamese capsule networks. It is encouraging to see that the approach using capsule architecture achieves strong results and exceeds other techniques on a wide range of datasets from industrial application to face recognition benchmarks while being easy to use and optimise.
[64] KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? cs.CVPDF
Xianfeng Wang, Kaiwei Zhang, Qi Jia, Zijian Chen, Guangtao Zhai
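TL;DR: The paper introduces KidVis, a benchmark for evaluating whether multimodal large language models (MLLMs) have the basic visual perceptual capabilities of 6-7-year-old children. Although MLLMs excel at high-level reasoning, they perform far below human children on low-semantic-dependent visual tasks, and scaling up model size does not bring linear gains.
Details
Motivation: To investigate whether MLLMs possess the fundamental, intuitive visual perception of humans, filling a gap in the evaluation of low-level visual capabilities.
Result: On KidVis, human children score 95.32 on average, while the state-of-the-art GPT-5 scores only 67.33, a marked gap. The study also observes a "Scaling Law Paradox" between increasing model parameters and gains in foundational visual capability.
Insight: The novelty is a benchmark grounded in the theory of human visual development that evaluates atomic visual capabilities and exposes MLLMs' deficits in low-level perception. The takeaway is that evaluations of visual intelligence should span the full hierarchy from low-level perception to high-level reasoning.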
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a “Scaling Law Paradox”: simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
[65] M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction cs.CVPDF
Yuze Zhang, Lingjie Li, Qiuzhen Lin, Zhong Ming, Fei Yu
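TL;DR: This paper proposes M3SR, a multi-scale, multi-perceptual Mamba architecture for spectral reconstruction that addresses the single spatial perception and single-scale feature extraction of existing Mamba methods. A multi-perceptual fusion block integrated into a U-Net structure lets M3SR extract and fuse global, intermediate, and local features, enabling accurate multi-scale reconstruction of hyperspectral images.
Details
Motivation: Existing Mamba-based spectral reconstruction methods face two main challenges: single spatial perception limits comprehensive understanding of hyperspectral images, and single-scale feature extraction struggles to capture their complex structures and fine details.
Result: Extensive quantitative and qualitative experiments show that M3SR outperforms existing state-of-the-art methods at lower computational cost.
Insight: The core contribution is the multi-perceptual fusion block and its combination with a U-Net structure, giving multi-scale, multi-perceptual feature extraction and fusion that models hyperspectral image information more fully and improves performance while staying computationally efficient.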
Abstract: The Mamba architecture has been widely applied to various low-level vision tasks due to its exceptional adaptability and strong performance. Although the Mamba architecture has been adopted for spectral reconstruction, it still faces the following two challenges: (1) Single spatial perception limits the ability to fully understand and analyze hyperspectral images; (2) Single-scale feature extraction struggles to capture the complex structures and fine details present in hyperspectral images. To address these issues, we propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features. By integrating the multi-perceptual fusion block into a U-Net structure, M3SR can effectively extract and fuse global, intermediate, and local features, thereby enabling accurate reconstruction of hyperspectral images at multiple scales. Extensive quantitative and qualitative experiments demonstrate that the proposed M3SR outperforms existing state-of-the-art methods while incurring a lower computational cost.
[66] Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation cs.CV | cs.AIPDF
Kang Fu, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Jun Zhao
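TL;DR: This paper proposes IQARAG, a training-free framework that uses Retrieval-Augmented Generation (RAG) to improve the zero-shot ability of large multimodal models (LMMs) in Image Quality Assessment (IQA). It retrieves semantically similar but quality-variant reference images together with their Mean Opinion Scores (MOS).
Details
Motivation: Reaching SOTA IQA performance with LMMs usually requires computationally expensive fine-tuning; the paper aims to provide a resource-efficient alternative.
Result: Extensive experiments on several IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, show that IQARAG effectively improves the IQA performance of LMMs.
Insight: The novelty is bringing RAG to IQA: retrieval supplies a visual perception anchor and improves the model without training. Viewed objectively, it is a novel way to integrate external knowledge (reference images and MOS) into LMM inference efficiently through prompting.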
Abstract: Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs’ IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for the input image. The retrieved images and the input image are integrated into a specific prompt. The retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
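The retrieval and prompt-integration phases listed in this abstract might look roughly like the sketch below. It assumes feature vectors have already been extracted by some image encoder and uses plain cosine similarity for retrieval; the reference-bank layout, the function names, and the prompt wording are all hypothetical and are not taken from IQARAG.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_references(query_feat, bank, k=3):
    """Return the k bank entries most similar to the query.

    bank: list of (image_name, feature_vector, mos) tuples.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_feat, item[1]), reverse=True)
    return ranked[:k]

def build_prompt(references):
    """Interleave retrieved references and their MOS into a text prompt."""
    lines = [
        f"Reference image {i}: <{name}>, quality score {mos:.1f}"
        for i, (name, _, mos) in enumerate(references, start=1)
    ]
    lines.append("Rate the quality of the final image on the same scale.")
    return "\n".join(lines)

# Hypothetical reference bank with 2-D features for readability.
bank = [("a.png", [1.0, 0.1], 4.2), ("b.png", [0.0, 1.0], 1.5), ("c.png", [0.9, 0.2], 2.8)]
refs = retrieve_references([1.0, 0.0], bank, k=2)
prompt = build_prompt(refs)
```

In a real pipeline the `<name>` placeholders would be replaced by the actual image inputs passed to the LMM; the scored references act as the anchor the abstract describes.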
[67] YOLOBirDrone: Dataset for Bird vs Drone Detection and Classification and a YOLO based enhanced learning architecture cs.CVPDF
Dapinder Kaur, Neeraj Battish, Arnav Bhavsar, Shashi Poddar
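TL;DR: This paper proposes YOLOBirDrone, a new object detection architecture for improving the detection and classification accuracy of drones and birds. It also introduces BirDrone, a large-scale dataset with challenging small objects for robust aerial object identification. Experiments show that the architecture outperforms other state-of-the-art algorithms on performance metrics, with detection accuracy of about 85% across various scenarios.
Details
Motivation: The widespread use of commercial and military drones raises safety challenges, particularly their potential use in targeted attacks. Existing vision-based drone detection systems are limited in accuracy, especially when distinguishing between small drones and birds.
Result: Experiments on the BirDrone dataset show that YOLOBirDrone improves on other state-of-the-art (SOTA) algorithms in detection accuracy and other performance metrics, reaching about 85% detection accuracy across various scenarios.
Insight: The contributions are: 1) YOLOBirDrone, a new YOLO-variant architecture whose core components are adaptive and extended layer aggregation (AELAN), a multi-scale progressive dual attention module (MPDA), and its reverse version (RMPDA), designed to preserve shape information and enrich features with local and global spatial and channel information; 2) BirDrone, a large-scale public dataset dedicated to bird-versus-drone detection that includes challenging small objects, a valuable resource for research in this area.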
Abstract: The use of aerial drones for commercial and defense applications has benefited in many ways and is therefore utilized in several different application domains. However, they are also increasingly used for targeted attacks, posing a significant safety challenge and necessitating the development of drone detection systems. Vision-based drone detection systems currently have an accuracy limitation and struggle to distinguish between drones and birds, particularly when the birds are small in size. This research work proposes a novel YOLOBirDrone architecture that improves the detection and classification accuracy of birds and drones. YOLOBirDrone has different components, including an adaptive and extended layer aggregation (AELAN), a multi-scale progressive dual attention module (MPDA), and a reverse MPDA (RMPDA) to preserve shape information and enrich features with local and global spatial and channel information. A large-scale dataset, BirDrone, is also introduced in this article, which includes small and challenging objects for robust aerial object identification. Experimental results demonstrate an improvement in performance metrics through the proposed YOLOBirDrone architecture compared to other state-of-the-art algorithms, with detection accuracy reaching approximately 85% across various scenarios.
[68] UM-Text: A Unified Multimodal Model for Image Understanding cs.CVPDF
Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu
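TL;DR: This paper proposes UM-Text, a unified multimodal model for image understanding and visual text editing through natural language instructions. A vision-language model (VLM) processes the instruction and reference image to design the text content and layout, a UM-Encoder fuses multiple kinds of condition information, and a regional consistency loss with a three-stage training strategy improves performance. The model achieves state-of-the-art results on multiple public benchmarks.
Details
Motivation: Existing visual text editing methods involve complex steps, require text attributes to be specified manually, and do not adequately consider stylistic consistency between the generated text and the reference image.
Result: Extensive qualitative and quantitative experiments on multiple public benchmarks show that the method achieves state-of-the-art performance.
Insight: The contributions are: 1) a unified framework for multimodal understanding and editing driven by natural language instructions; 2) the UM-Encoder, in which a VLM automatically configures how condition information is fused; 3) a regional consistency loss and a tailored three-stage training strategy; 4) UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes.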
Abstract: With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
[69] Tissue Classification and Whole-Slide Images Analysis via Modeling of the Tumor Microenvironment and Biological Pathways cs.CV
Junzhuo Liu, Xuemei Du, Daniel Reisenbuchler, Ye Chen, Markus Eckstein
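TL;DR: This paper proposes BioMorphNet, a multimodal network that automatically integrates tissue morphological features from whole-slide images with spatial gene expression data to support tissue classification and differential gene analysis. By modeling the tumor microenvironment and biological pathways, the method improves classification performance on several cancer datasets and aids tumor localization and biomarker discovery.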
Details
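Motivation: Most existing studies focus on individual gene sequences and slide-level classification, with limited attention to spatial transcriptomics and patch-level applications. This paper addresses that limitation by integrating tissue morphology with spatial gene expression to better characterize the tumor microenvironment.
Result: Compared with the latest morphology-gene multimodal methods, BioMorphNet improves average classification metrics by 2.67%, 5.48%, and 6.29% on the prostate, colorectal, and breast cancer datasets, respectively, reaching state-of-the-art performance.
Insight: The contributions are: 1) a graph that models the relationships between a target patch and its neighbors and adjusts response strength according to morphological and molecular similarity; 2) clinical pathway features extracted from spatial transcriptomic data, which bridge tissue morphology and gene expression; 3) a learnable pathway module that automatically simulates biological pathway formation and complements existing clinical pathways. Together these improve the biological interpretability of multimodal integration.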
Abstract: Automatic integration of whole slide images (WSIs) and gene expression profiles has demonstrated substantial potential in precision clinical diagnosis and cancer progression studies. However, most existing studies focus on individual gene sequences and slide level classification tasks, with limited attention to spatial transcriptomics and patch level applications. To address this limitation, we propose a multimodal network, BioMorphNet, which automatically integrates tissue morphological features and spatial gene expression to support tissue classification and differential gene analysis. For considering morphological features, BioMorphNet constructs a graph to model the relationships between target patches and their neighbors, and adjusts the response strength based on morphological and molecular level similarity, to better characterize the tumor microenvironment. In terms of multimodal interactions, BioMorphNet derives clinical pathway features from spatial transcriptomic data based on a predefined pathway database, serving as a bridge between tissue morphology and gene expression. In addition, a novel learnable pathway module is designed to automatically simulate the biological pathway formation process, providing a complementary representation to existing clinical pathways. Compared with the latest morphology gene multimodal methods, BioMorphNet’s average classification metrics improve by 2.67%, 5.48%, and 6.29% for prostate cancer, colorectal cancer, and breast cancer datasets, respectively. BioMorphNet not only classifies tissue categories within WSIs accurately to support tumor localization, but also analyzes differential gene expression between tissue categories based on prediction confidence, contributing to the discovery of potential tumor biomarkers.
[70] From Local Windows to Adaptive Candidates via Individualized Exploratory: Rethinking Attention for Image Super-Resolution cs.CV
Chunyu Meng, Wei Long, Shuhang Gu
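TL;DR: This paper proposes the Individualized Exploratory Transformer (IET) for single image super-resolution. Its Individualized Exploratory Attention (IEA) mechanism lets each token adaptively select its own content-aware, independent attention candidates, enabling more precise information aggregation while maintaining computational efficiency.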
Details
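Motivation: Existing Transformer-based super-resolution methods typically partition the image into fixed groups and compute attention within each group. This overlooks the inherent asymmetry of token similarities and prevents flexible, token-adaptive attention. The paper targets this trade-off between efficiency and flexibility.
Result: Extensive experiments on standard super-resolution benchmarks show that IET achieves state-of-the-art performance at comparable computational complexity.
Insight: The key idea is a token-adaptive, asymmetric attention mechanism (IEA) in which each token independently explores and selects its attention candidates. This removes the constraint of fixed group-wise attention and allows more flexible and efficient modeling of long-range dependencies.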
Abstract: Single Image Super-Resolution (SISR) is a fundamental computer vision task that aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input. Transformer-based methods have achieved remarkable performance by modeling long-range dependencies in degraded images. However, their feature-intensive attention computation incurs high computational cost. To improve efficiency, most existing approaches partition images into fixed groups and restrict attention within each group. Such group-wise attention overlooks the inherent asymmetry in token similarities, thereby failing to enable flexible and token-adaptive attention computation. To address this limitation, we propose the Individualized Exploratory Transformer (IET), which introduces a novel Individualized Exploratory Attention (IEA) mechanism that allows each token to adaptively select its own content-aware and independent attention candidates. This token-adaptive and asymmetric design enables more precise information aggregation while maintaining computational efficiency. Extensive experiments on standard SR benchmarks demonstrate that IET achieves state-of-the-art performance under comparable computational complexity.
[71] Semantic Misalignment in Vision-Language Models under Perceptual Degradation cs.CV
Guo Cheng
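TL;DR: This paper systematically studies semantic misalignment in vision-language models (VLMs) under controlled degradation of upstream visual perception. Using semantic segmentation on the Cityscapes dataset as a representative perception module, it introduces perception-realistic corruptions and finds that even moderate drops in conventional segmentation metrics cause severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments.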
Details
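Motivation: VLMs are deployed in safety-critical systems such as autonomous driving and embodied AI, where reliable perception is essential for safe semantic reasoning and decision-making, yet their robustness to realistic perception degradation remains poorly understood.
Result: The results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability. The proposed language-level misalignment metrics (capturing hallucination, critical omission, and safety misinterpretation) quantify these effects, and their relationship with segmentation quality is analyzed across multiple contrastive and generative VLMs.
Insight: The paper systematically studies how perception degradation affects VLM semantic alignment and proposes dedicated language-level misalignment metrics. It highlights the limitations of current VLM-based systems in safety-critical applications and the need for evaluation frameworks that explicitly account for perception uncertainty.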
Abstract: Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
[72] Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2 cs.CV | cs.RO
Yizhan Feng, Hichem Snoussi, Jing Teng, Jian Liu, Yuyang Wang
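TL;DR: This paper proposes a lightweight multimodal task platform based on BLIP-2 that addresses the conflict between the high computational cost of large vision-language models and the limited resources of UAV edge devices. The platform extends BLIP-2's multi-task capabilities by integrating YOLO-World and YOLOv8-Seg, and adds a K-Means-based key frame sampling mechanism and a unified prompt optimization scheme to handle UAV video understanding and interaction efficiently.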
Details
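Motivation: UAVs need real-time visual understanding and interaction in complex scenarios, but large vision-language models are computationally expensive and edge devices have limited resources.
Result: The method requires no task-specific fine-tuning on drone data. By integrating YOLO models with the proposed key frame sampling and prompt optimization mechanisms, the lightweight BLIP-2 architecture can handle video-level interactive tasks. The abstract does not report specific quantitative results or benchmark comparisons.
Insight: The contributions are: 1) deep integration of BLIP-2 with YOLO models, using YOLO's precise perception results to strengthen visual-attention understanding and reasoning; 2) a content-aware key frame sampling mechanism based on K-Means clustering with temporal feature concatenation, which lets a lightweight model handle video tasks; 3) a unified prompt optimization scheme that injects structured event logs from YOLO as context into BLIP-2 and combines them with output constraints to produce accurate, relevant multi-task outputs. Viewed objectively, this is an efficient multimodal learning framework for edge devices that balances performance against resource limits through model integration and mechanism design.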
Abstract: The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles. However, a significant challenge arises from the contradiction between the high computational cost of large Vision language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture with the capability to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2’s input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
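The content-aware key frame sampling described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each frame has already been reduced to a feature vector, and the function names `kmeans` and `select_key_frames`, the evenly spaced initialization, and the choice of "frame nearest to each centroid" are all hypothetical.

```python
import numpy as np

def kmeans(features, k, iters=20):
    """Plain K-Means with a deterministic init (k evenly spaced frames)."""
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].astype(float)
    for _ in range(iters):
        # Assign every frame to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

def select_key_frames(features, k):
    """Return sorted indices of the frame closest to each cluster centroid."""
    labels, centroids = kmeans(features, k)
    keys = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        d = np.linalg.norm(features[members] - centroids[j], axis=1)
        keys.append(int(members[d.argmin()]))
    return sorted(keys)  # sorted indices keep the temporal order
```

Because the returned indices are sorted, concatenating the features of the selected frames (for example `features[keys].reshape(-1)`) preserves their order in the video, which is one simple way to read "temporal feature concatenation".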
[73] MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP cs.CV
Aditya Chaudhary, Sneha Barman, Mainak Singha, Ankit Jha, Girish Mishra
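TL;DR: This paper proposes MMLGNet, a multimodal framework that uses vision-language models such as CLIP to align heterogeneous remote sensing modalities, such as hyperspectral imaging (HSI) and LiDAR, with natural language semantics, fusing spectral, spatial, and geometric information and enabling semantic-level understanding.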
Details
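Motivation: As multimodal Earth observation data becomes more abundant, methods are needed that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding, bridging the gap between high-dimensional remote sensing data and language-guided interpretation.
Result: On two benchmark datasets, MMLGNet achieves strong performance with simple CNN-based encoders and outperforms several established multimodal visual-only methods, demonstrating the benefit of language supervision.
Insight: Following CLIP's training paradigm, the method uses modality-specific encoders and bi-directional contrastive learning to align visual features with handcrafted textual embeddings in a shared latent space, achieving cross-modal alignment between remote sensing data and language semantics.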
Abstract: In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP’s training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
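As a rough illustration of the bi-directional contrastive alignment mentioned above, the sketch below computes a symmetric CLIP-style loss between a batch of visual embeddings and text embeddings. It assumes row `i` of each matrix forms the matched pair; the function names and the temperature of 0.07 are illustrative assumptions, not values taken from the paper or its repository.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(visual, text, temperature=0.07):
    """Average of the visual-to-text and text-to-visual cross-entropy terms."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature           # cosine similarities, scaled
    diag = np.arange(len(v))                 # matched pairs sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2v = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

In a real batch several samples may share a class and therefore a text embedding; handling that case (for example by indexing class prompts with labels) is left out of this sketch.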
[74] Deep Learning Based Facial Retargeting Using Local Patches cs.CV | cs.GR
Yeonsoo Choi, Inyup Lee, Sihun Cha, Seonghyeon Kim, Sunjin Jung
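TL;DR: This paper proposes a deep learning facial retargeting method based on local patches that transfers facial animation from a source performance video to stylized or exaggerated 3D characters. The method has three modules, for automatic patch extraction, reenactment, and weight estimation, and targets the challenge of large differences in facial structure between source and target.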
Details
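Motivation: Existing facial motion retargeting methods work well between models of similar shape but struggle with stylized or exaggerated 3D characters whose facial structure deviates significantly from human faces. Preserving the semantics of the original motion requires considering both the target character's facial structure and its range of motion.
Result: Extensive experiments show that the method successfully transfers the semantic meaning of source facial expressions to stylized characters with considerable variation in facial feature proportions.
Insight: Processing local patches rather than the whole face adapts better to structural differences between source and target. The modular design (extraction, reenactment, weight estimation) gives a clear pipeline and may improve semantic preservation for stylized characters.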
Abstract: In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character's facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.
[75] Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis cs.CV
Yi Qin, Lehan Wang, Chenxu Zhao, Alex P. W. Lee, Xiaomeng Li
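TL;DR: This paper proposes CardiacMind, an approach that strengthens the reasoning of multimodal large language models (MLLMs) in echocardiographic diagnosis by introducing a cardiologist-like mindset. It combines a Cardiac Reasoning Template (CRT) with a new reinforcement learning scheme whose three rewards (PQtR, PQlR, and ESR) encourage the model to follow a structured diagnostic procedure, improving both accuracy and interpretability.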
Details
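Motivation: Existing echocardiography foundation models fail to capture the relationships between quantitative measurements and clinical manifestations, while medical reasoning MLLMs require costly manual construction of detailed reasoning paths and cannot easily incorporate echocardiographic priors. The paper addresses these limitations by incentivizing MLLMs to reason interpretably, in the manner of a cardiologist.
Result: The method improves multiview echocardiographic diagnosis of 15 complex cardiac diseases by 48% and improves results on the CardiacNet-PAH benchmark by 5%. In a user study, 93.33% of clinicians agreed with the logic of its reasoning outputs.
Insight: The paper introduces a structured Cardiac Reasoning Template (CRT) to standardize the diagnostic procedure and designs reinforcement learning rewards along three dimensions: procedural quantity, procedural quality, and echocardiographic semantics. These rewards steer the model toward detailed, multimodal, visually grounded stepwise reasoning, integrating domain priors into the MLLM's reasoning process.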
Abstract: Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising Cardiac Reasoning Template (CRT) and CardiacMind to enhance MLLM’s echocardiographic reasoning by introducing cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning path construction without the need for costly case-by-case verification. To incentivize reasoning MLLM under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning; PQlR promotes integration of evidence across views and modalities, while ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. The user study on our method’s reasoning outputs shows 93.33% clinician agreement with cardiologist-like reasoning logic. Our code will be available.
[76] Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification cs.CV | cs.LG
Tom Burgert, Julia Henkel, Begüm Demir
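TL;DR: This paper proposes NAR, a noise-adaptive regularization method for label noise in multi-label remote sensing image classification. Within a semi-supervised learning framework, it uses a confidence-based mechanism to distinguish and handle additive and subtractive noise, combined with early-learning regularization to stabilize training and reduce overfitting to noisy labels.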
Details
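Motivation: In remote sensing multi-label classification, thematic products and crowdsourced annotation are used to reduce cost, but they often introduce partially incorrect labels (additive, subtractive, or mixed noise). Existing methods largely ignore the distinction between noise types and lack mechanisms that adapt learning behavior to each type.
Result: Experiments under additive, subtractive, and mixed noise show that NAR consistently improves robustness over existing methods, with the largest gains under subtractive and mixed noise.
Insight: The method explicitly distinguishes additive from subtractive noise and handles each adaptively. A confidence-based label handling mechanism (retain high-confidence entries, temporarily deactivate moderate-confidence entries, correct low-confidence entries by flipping) is combined with early-learning regularization, giving an effective strategy for noise-robust learning.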
Abstract: The development of reliable methods for multi-label classification (MLC) has become a prominent research direction in remote sensing (RS). As the scale of RS data continues to expand, annotation procedures increasingly rely on thematic products or crowdsourced procedures to reduce the cost of manual annotation. While cost-effective, these strategies often introduce multi-label noise in the form of partially incorrect annotations. In MLC, label noise arises as additive noise, subtractive noise, or a combination of both in the form of mixed noise. Previous work has largely overlooked this distinction and commonly treats noisy annotations as supervised signals, lacking mechanisms that explicitly adapt learning behavior to different noise types. To address this limitation, we propose NAR, a noise-adaptive regularization method that explicitly distinguishes between additive and subtractive noise within a semi-supervised learning framework. NAR employs a confidence-based label handling mechanism that dynamically retains label entries with high confidence, temporarily deactivates entries with moderate confidence, and corrects low confidence entries via flipping. This selective attenuation of supervision is integrated with early-learning regularization (ELR) to stabilize training and mitigate overfitting to corrupted labels. Experiments across additive, subtractive, and mixed noise scenarios demonstrate that NAR consistently improves robustness compared with existing methods. Performance improvements are most pronounced under subtractive and mixed noise, indicating that adaptive suppression and selective correction of noisy supervision provide an effective strategy for noise robust learning in RS MLC.
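The confidence-based label handling described in this abstract (retain, deactivate, flip) can be sketched as below. This is only an illustration under stated assumptions: "confidence" is taken to be the model's predicted probability that the given label entry is correct, and the thresholds `high=0.7` and `low=0.3` and the function name `handle_labels` are hypothetical, not values from the paper.

```python
def handle_labels(noisy_labels, probs, high=0.7, low=0.3):
    """Return (targets, mask) for one sample's multi-label vector.

    noisy_labels: list of 0/1 annotations, possibly corrupted.
    probs: model-predicted probability that each class is present.
    mask[i] == 0 means entry i is temporarily excluded from the loss.
    """
    targets, mask = [], []
    for y, p in zip(noisy_labels, probs):
        # Confidence that the annotated value y is correct.
        conf = p if y == 1 else 1.0 - p
        if conf >= high:          # high confidence: retain the label
            targets.append(y)
            mask.append(1)
        elif conf <= low:         # low confidence: correct by flipping
            targets.append(1 - y)
            mask.append(1)
        else:                     # moderate confidence: deactivate for now
            targets.append(y)
            mask.append(0)
    return targets, mask
```

Under this reading, flipping a 1 to 0 corrects additive noise (a spurious label), while flipping a 0 to 1 corrects subtractive noise (a missing label). The masked targets would then feed a per-entry loss alongside the early-learning regularization term, which is not shown here.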
[77] CoMa: Contextual Massing Generation with Vision-Language Models cs.CV | cs.AI
Evgenii Maslov, Valentin Khrulkov, Anastasia Volkova, Anton Gusarov, Andrey Kuznetsov
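TL;DR: This paper proposes CoMa, a framework that uses vision-language models (VLMs) to generate building massing automatically from functional requirements and site context. It also introduces CoMa-20K, a dataset of 20,000 samples containing geometric, economic, programmatic, and site visual information. Evaluations of fine-tuned and zero-shot models show the potential of VLMs to generate context-sensitive massing options.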
Details
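Motivation: The conceptual design phase in architecture and urban planning, especially building massing, relies on designer intuition and manual effort and lacks automated, data-driven methods. The absence of suitable datasets has been a further obstacle.
Result: VLMs are benchmarked on CoMa-20K. The experiments show that the task is inherently complex, but that VLMs can produce context-sensitive massing options, establishing a foundational benchmark for data-driven architectural design.
Insight: The paper frames building massing generation as a conditional generation task for vision-language models and introduces CoMa-20K, the first large-scale dataset with multimodal information (geometry, economics, program, visual context), advancing research on data-driven architectural design.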
Abstract: The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.
[78] Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling cs.CV | cs.LG | eess.IV
Takamichi Miyata, Sumiko Miyata, Andrew Morris
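TL;DR: This paper proposes a zero-shot distracted driver detection method based on vision-language models (VLMs). A double decoupling framework addresses the poor performance of existing methods caused by entanglement between driver appearance (such as clothing, age, and gender) and behavior cues. The method first extracts a driver appearance embedding and removes it from the image embedding to emphasize distraction-related evidence, then orthogonalizes the text embeddings via metric projection onto the Stiefel manifold to improve separability.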
Details
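Motivation: Distracted driving is a major cause of traffic collisions and calls for robust, scalable detection. Existing VLM-based zero-shot detectors underperform in real-world conditions, mainly because the model entangles driver-specific appearance variation with behavior cues, so decisions are driven by who the driver is rather than what the driver is doing.
Result: Experiments show consistent gains over existing baselines on distracted driver detection, indicating promise for practical road-safety applications.
Insight: The contributions are: 1) a subject decoupling framework that separates the driver appearance embedding from the image embedding and removes it, so the model attends to the distracting behavior itself; 2) orthogonalization of text embeddings via metric projection onto the Stiefel manifold, which improves the separability of class embeddings while staying close to the original semantics. This offers a new approach to performance loss in fine-grained VLM tasks caused by subject appearance.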
Abstract: Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
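The two decoupling steps in this abstract can be illustrated with a small linear-algebra sketch. Both parts rest on assumptions: the abstract does not say how the appearance embedding is extracted or removed, so the example simply projects it out of the image embedding, and it uses the SVD-based polar factor, which is the nearest matrix with orthonormal rows in Frobenius norm, as one standard way to realize a projection onto the Stiefel manifold. The function names are hypothetical.

```python
import numpy as np

def remove_appearance(image_emb, appearance_emb):
    """Project the appearance direction out of the image embedding (assumed form)."""
    a = appearance_emb / np.linalg.norm(appearance_emb)
    residual = image_emb - (image_emb @ a) * a
    return residual / np.linalg.norm(residual)

def nearest_orthonormal(text_embs):
    """Closest matrix with orthonormal rows to text_embs (classes x dim).

    Assumes full row rank and num_classes <= dim. With text_embs = U S V^T,
    the polar factor U V^T minimizes the Frobenius distance.
    """
    u, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return u @ vt

def zero_shot_predict(image_emb, appearance_emb, text_embs):
    """Classify by cosine similarity after both decoupling steps."""
    img = remove_appearance(image_emb, appearance_emb)
    txt = nearest_orthonormal(text_embs)
    return int(np.argmax(txt @ img))
```

After `nearest_orthonormal`, distinct class embeddings have zero cosine similarity with each other, which is the separability effect the abstract refers to.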
[79] Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs cs.CV
Takara Taniguchi, Kuniaki Saito, Atsushi Hashimoto
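TL;DR: This paper introduces HazardForge, a scalable pipeline for generating diverse hazardous scenarios with spatio-temporal dynamics, and builds the MovSafeBench benchmark to evaluate the safety decision-making of vision-language models (VLMs) in complex environments. Experiments show that VLM performance drops notably in scenarios with anomalous objects, especially those requiring nuanced motion understanding.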
Details
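Motivation: Existing benchmarks do not adequately cover diverse hazardous situations, particularly anomalous scenarios with spatio-temporal dynamics, which limits evaluation of VLM safety decision-making in mobile systems such as autonomous vehicles.
Result: On MovSafeBench (7,254 images and QA pairs), VLM performance degrades notably under conditions with anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
Insight: The HazardForge pipeline combines image editing models with layout decision algorithms to scalably generate real-world hazardous scenarios that include moving, intrusive, and distant objects, yielding a more comprehensive evaluation benchmark.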
Abstract: Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios together with layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
[80] Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models cs.CV | cs.MM
Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He
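TL;DR: This paper proposes CoEvo, a training- and annotation-free test-time framework that improves out-of-distribution (OOD) detection for vision-language models in zero-shot open-world settings. Through bidirectional, sample-conditioned adaptation of textual and visual proxies, it dynamically mines contextual textual negatives and iteratively refines visual proxies to realign cross-modal similarities and enlarge local OOD margins, then dynamically re-weights the contributions of the dual-modal proxies to obtain a robust, calibrated OOD score.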
Details
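Motivation: Existing negative-label methods for zero-shot OOD detection rely on a fixed set of textual proxies. These sample the semantic space beyond the in-distribution classes only sparsely, and they stay static while only the visual features drift, which leads to cross-modal misalignment and unstable predictions.
Result: Extensive experiments on standard benchmarks show that CoEvo achieves state-of-the-art performance. On ImageNet-1K it improves AUROC by 1.33% and reduces FPR95 by 45.98% compared with strong negative-label baselines.
Insight: The paper proposes a proxy-aligned co-evolution mechanism that maintains two evolving proxy caches and adapts textual and visual proxies bidirectionally and per sample at test time. This progressively realigns cross-modal similarities and enlarges local OOD margins, and dynamic re-weighting yields a robust, calibrated score. The method needs no training or extra annotation, making it an efficient test-time adaptation framework.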
Abstract: Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
[81] EfficientFSL: Enhancing Few-Shot Classification via Query-Only Tuning in Vision Transformers cs.CV | cs.AI
Wenwen Liao, Hang Ruan
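TL;DR: This paper proposes EfficientFSL, a query-only fine-tuning framework designed for few-shot classification with Vision Transformers (ViTs). By introducing lightweight trainable modules and tuning only a very small number of parameters, it achieves performance competitive with existing methods while significantly reducing computational overhead.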
Details
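Motivation: Fine-tuning large models such as ViTs for few-shot classification demands substantial compute and GPU memory, which makes them impractical in many real-world low-resource scenarios.
Result: The method achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in practical applications.
Insight: The contributions are: 1) query-only fine-tuning, which greatly reduces the number of trainable parameters; 2) a lightweight Forward Block that synthesizes task-specific queries to extract informative features from the intermediate representations of the pre-trained model; 3) a Combine Block that fuses multi-layer outputs to improve the depth and robustness of the features; 4) a Support-Query Attention Block that mitigates distribution shift by adjusting prototypes to align with the query set distribution.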
Abstract: Large models such as Vision Transformers (ViTs) have demonstrated remarkable superiority over smaller architectures like ResNet in few-shot classification, owing to their powerful representational capacity. However, fine-tuning such large models demands extensive GPU memory and prolonged training time, making them impractical for many real-world low-resource scenarios. To bridge this gap, we propose EfficientFSL, a query-only fine-tuning framework tailored specifically for few-shot classification with ViT, which achieves competitive performance while significantly reducing computational overhead. EfficientFSL fully leverages the knowledge embedded in the pre-trained model and its strong comprehension ability, achieving high classification accuracy with an extremely small number of tunable parameters. Specifically, we introduce a lightweight trainable Forward Block to synthesize task-specific queries that extract informative features from the intermediate representations of the pre-trained model in a query-only manner. We further propose a Combine Block to fuse multi-layer outputs, enhancing the depth and robustness of feature representations. Finally, a Support-Query Attention Block mitigates distribution shift by adjusting prototypes to align with the query set distribution. With minimal trainable parameters, EfficientFSL achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in real-world applications.
[82] Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models cs.CV
Tolgay Atinc Uzun, Dmitry Ignatov, Radu Timofte
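TL;DR: This paper proposes a framework in which a large language model (LLM) drives neural architecture search (NAS) to optimize channel configurations, such as layer widths, in deep neural networks. The search is formulated as a sequence of conditional code generation tasks. To address training data scarcity, Abstract Syntax Tree (AST) mutations generate a large corpus of shape-consistent architectures of varying quality, allowing the LLM to learn the latent relationship between channel configuration and model performance. Experiments on CIFAR-100 validate the approach, with statistically significant accuracy improvements.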
Details
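Motivation: Channel configuration search (for example, optimizing layer widths) is a complex combinatorial optimization problem constrained by tensor shape compatibility and computational budgets. The authors argue that LLMs can reason about the structure of architectural code in ways traditional heuristics cannot, offering a transformative approach to NAS.
Result: On the CIFAR-100 benchmark, the method yields statistically significant accuracy improvements. The analysis confirms that the LLM acquires domain-specific architectural priors, which distinguishes the method from random search.
Insight: The paper applies an LLM-driven NAS framework specifically to channel configuration search and uses AST mutations to generate a large, shape-consistent corpus of architectures, addressing data scarcity and letting the LLM internalize complex design patterns. This highlights the potential of language-driven design in deep learning and suggests a new direction for automated architecture optimization.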
Abstract: Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
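To make the idea of shape-consistent AST mutation concrete, here is a minimal sketch using Python's standard `ast` module. It is not the authors' pipeline: it assumes, hypothetically, that an architecture file declares its widths in a single `channels` list from which consecutive layers are built, so that any mutation of that list keeps input and output shapes compatible by construction. The scaling factors and the rounding to multiples of 8 are likewise illustrative.

```python
import ast
import random

TEMPLATE = "channels = [32, 64, 128]\n"  # hypothetical architecture spec

class WidthMutator(ast.NodeTransformer):
    """Rescale every integer in the `channels` list by a random factor."""

    def __init__(self, rng, factors=(0.5, 0.75, 1.25, 1.5), multiple=8):
        self.rng = rng
        self.factors = factors
        self.multiple = multiple

    def visit_Assign(self, node):
        is_channels = (
            len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == "channels"
            and isinstance(node.value, ast.List)
        )
        if is_channels:
            for elt in node.value.elts:
                if isinstance(elt, ast.Constant) and isinstance(elt.value, int):
                    scaled = elt.value * self.rng.choice(self.factors)
                    # Snap to a positive multiple of `multiple`.
                    snapped = round(scaled / self.multiple) * self.multiple
                    elt.value = max(self.multiple, snapped)
        return node

def mutate(source, seed):
    """Return mutated source code; requires Python 3.9+ for ast.unparse."""
    tree = WidthMutator(random.Random(seed)).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

def layer_shapes(source, in_channels=3):
    """Derive (in, out) channel pairs for consecutive layers from the spec."""
    scope = {}
    exec(source, scope)
    widths = [in_channels] + scope["channels"]
    return list(zip(widths[:-1], widths[1:]))
```

Calling `mutate(TEMPLATE, seed)` for many seeds produces a corpus of variants whose layer shapes chain correctly, which could then be trained and paired with measured accuracy to form the kind of structural data the abstract describes.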
[83] VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations cs.CV | cs.AI
Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
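TL;DR: This paper introduces VideoHEDGE, a framework for detecting hallucinations of video vision-language models in video question answering. It generates multiple high-temperature answers from both clean and spatiotemporally perturbed videos, clusters them semantically to compute entropy, and derives three reliability scores. Its effectiveness is validated on the SoccerChat benchmark.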
Details
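Motivation: Existing uncertainty metrics struggle to detect the high-confidence hallucinations produced by video vision-language models, so a hallucination detection framework designed for temporally structured video inputs is needed.
Result: On the SoccerChat benchmark with three 7B Video-VLMs, Vision-Amplified Semantic Entropy (VASE) achieves the highest ROC-AUC of all methods, especially at larger distortion budgets, while the other scores often operate near chance.
Insight: The paper extends entropy-based reliability estimation from images to video, combining semantic clustering with spatiotemporal perturbations to generate multiple hypotheses. It finds that embedding-based clustering matches NLI-based clustering in detection performance at lower computational cost, and it provides an open-source library for reproducible benchmarking.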
Abstract: Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
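The Semantic Entropy (SE) score named above can be illustrated with a small sketch: sampled answers are grouped into semantic clusters, and the entropy of the cluster distribution serves as an uncertainty signal. The example makes two simplifying assumptions that are not from the paper: cluster probability mass is estimated by the empirical fraction of samples in each cluster, and a toy normalized-string matcher stands in for the NLI-based or embedding-based clustering. RadFlag and VASE are not sketched here, and this is not the hedge-bench API.

```python
import math
from collections import Counter

def cluster_by_normalized_text(answers):
    """Toy stand-in for semantic clustering: group answers that match
    after lowercasing and whitespace normalization."""
    ids, seen = [], {}
    for a in answers:
        key = " ".join(a.lower().split())
        ids.append(seen.setdefault(key, len(seen)))
    return ids

def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

When every sampled answer lands in one cluster the entropy is zero, and it grows as the answers spread across more clusters; a detector would flag a response as a likely hallucination when the score exceeds a threshold tuned on labeled data.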
[84] End-to-End Video Character Replacement without Structural Guidance cs.CV
Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang
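TL;DR: This paper proposes MoCha, an end-to-end framework for video character replacement without structural guidance. It needs only a single-frame mask to generate high-quality, temporally consistent video, strengthens identity adaptation with a condition-aware RoPE and RL-based post-training, and builds three specialized datasets to address the scarcity of paired data.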
Details
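Motivation: Existing reconstruction-based methods depend on per-frame segmentation masks and explicit structural guidance (such as skeleton or depth). They generalize poorly to complex scenes involving occlusion, character-object interaction, unusual poses, or challenging illumination, and often produce visual artifacts and temporal inconsistencies. A more general solution without structural guidance is needed.
Result: In extensive experiments, MoCha substantially outperforms existing state-of-the-art methods, with higher visual quality and temporal consistency. The abstract does not name specific benchmarks but claims state-of-the-art performance.
Insight: The contributions are an end-to-end framework that needs only a single-frame mask and avoids structural dependencies; a condition-aware RoPE and RL-based post-training to improve adaptation to multi-modal conditions and facial identity; and a comprehensive data pipeline (UE5 rendering, expression-driven synthesis, and an augmented dataset) that addresses data scarcity. The data augmentation and conditioning design may transfer to other video generation and editing tasks.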
Abstract: Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha
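[85] WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation cs.CV | cs.AI
Zishan Shu, Juntong Wu, Wei Yan, Xudong Liu, Hongyu Zhang
TL;DR: WaveFormer proposes a frequency-time decoupled approach to vision modeling based on the wave equation. Feature maps are treated as spatial signals whose evolution over network depth (propagation time) is governed by an underdamped wave equation, which explicitly models spatial frequency (from low-frequency global layout to high-frequency edges and textures) and its interaction with propagation time. The method derives a closed-form solution and implements it as a lightweight Wave Propagation Operator (WPO) that models global interactions in O(N log N) time, replacing the attention or convolution modules in standard ViTs and CNNs. It reaches competitive accuracy on image classification, object detection, and semantic segmentation with higher throughput and lower compute cost.
Details
Motivation: Attention in existing Transformers captures visual dependencies but lacks a principled account of how semantic information propagates spatially. The paper revisits the problem from a wave perspective, aiming for a more efficient and interpretable approach that explicitly models the interaction between spatial frequency and propagation time.
Result: On image classification, object detection, and semantic segmentation benchmarks, WaveFormer models serve as drop-in replacements for ViTs and CNNs and reach competitive accuracy, with up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives.
Insight: The novelty lies in recasting vision modeling as signal propagation driven by a wave equation, explicitly decoupling the frequency and time dimensions, and achieving efficient global modeling through the WPO. This gives visual representation learning an interpretable physical analogy (wave propagation) and introduces a modeling bias complementary to heat-equation-based methods, capturing both global coherence and high-frequency detail.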
Abstract: Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Codes are available at: https://github.com/ZishanShu/WaveFormer.
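To make the frequency-domain idea concrete, here is a minimal sketch of how a damped wave equation can be solved in closed form with an FFT. It is an illustration under stated assumptions, not the paper's WPO: the equation form u_tt + 2*gamma*u_t = c^2 * laplacian(u), zero initial velocity, periodic boundaries, and the names `wave_propagate`, `t`, `c`, and `gamma` are all hypothetical choices made here.

```python
import numpy as np

def wave_propagate(x, t=1.0, c=1.0, gamma=0.1):
    """Evolve a 2D map x for time t under an assumed damped wave equation.

    Each Fourier mode k obeys u'' + 2*gamma*u' + c^2*|k|^2*u = 0. With zero
    initial velocity the solution is
        u_k(t) = exp(-gamma*t) * (cos(w*t) + gamma*sin(w*t)/w) * u_k(0),
    where w = sqrt(c^2*|k|^2 - gamma^2).
    """
    h, w_ = x.shape
    ky = 2 * np.pi * np.fft.fftfreq(h)
    kx = 2 * np.pi * np.fft.fftfreq(w_)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    # Complex sqrt also covers low-frequency modes where c^2*|k|^2 < gamma^2.
    # Not guarded: a mode with c^2*|k|^2 == gamma^2 exactly would divide by zero.
    omega = np.sqrt(c**2 * k2 - gamma**2 + 0j)
    transfer = np.exp(-gamma * t) * (
        np.cos(omega * t) + gamma * np.sin(omega * t) / omega
    )
    # Two FFTs dominate the cost: O(N log N) for N spatial positions.
    return np.fft.ifft2(np.fft.fft2(x) * transfer).real

feature_map = np.random.default_rng(0).standard_normal((16, 16))
out = wave_propagate(feature_map, t=1.0)
```

In this sketch the mean (zero-frequency) component is preserved while every other mode oscillates under an exp(-gamma*t) envelope, so `t` acts as the knob that the abstract describes as propagation time.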
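[86] SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning cs.CV
Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis
TL;DR: This paper proposes Semantic Orthogonal Calibration (SoC), a method for improving the uncertainty calibration of vision-language models (VLMs) under test-time prompt tuning (TPT). A Huber-based regularizer increases the separability of class prototypes while keeping semantically close classes near each other, addressing the overconfidence that existing orthogonality constraints can cause.
Details
Motivation: As VLMs are increasingly adopted in critical decision-making systems such as healthcare and autonomous driving, calibrating their uncertainty estimates becomes essential. Existing TPT research, however, focuses mainly on discriminative performance and neglects calibration. Existing methods enforce full orthogonality among text prompt embeddings to improve separability, but the authors show theoretically that this pushes semantically related classes apart and makes the model overconfident.
Result: In a comprehensive empirical validation, SoC consistently improves calibration performance across multiple benchmarks (e.g., lower expected calibration error, ECE) while maintaining competitive discriminative ability.
Insight: The novelty is a calibration regularizer (SoC) that balances separability with semantic proximity. The core insight is that the gradients of a full-orthogonality constraint disrupt semantic structure, whereas a smooth, Huber-based orthogonal calibration better serves the calibration objective. This offers a new direction for improving VLM reliability and safety.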
Abstract: With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
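The abstract does not give the regularizer's exact form, so the following is only a sketch of the general idea, under the assumption that the penalty is applied to pairwise cosine similarities between class prototypes. It contrasts a squared (full-orthogonality-style) penalty, whose gradient grows with similarity, with a Huber penalty, whose gradient is capped at a hypothetical threshold `delta`. All function names and the toy prototypes are illustrative, not taken from the paper.

```python
import numpy as np

def offdiag_cosines(prototypes):
    """Cosine similarity for every ordered pair of distinct prototypes."""
    z = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = z @ z.T
    return sims[~np.eye(len(sims), dtype=bool)]

def squared_penalty_grad(s):
    """Gradient of 0.5 * s^2: the push grows linearly with similarity."""
    return s

def huber_penalty(s, delta=0.3):
    """Quadratic for |s| <= delta, linear beyond (standard Huber loss)."""
    a = np.abs(s)
    return np.where(a <= delta, 0.5 * s**2, delta * (a - 0.5 * delta))

def huber_penalty_grad(s, delta=0.3):
    """Gradient of the Huber loss: the push is capped at +/- delta."""
    return np.clip(s, -delta, delta)

# Toy prototypes: the first two stand for related classes, the third is unrelated.
prototypes = np.array([[1.0, 0.2, 0.0],
                       [0.9, 0.4, 0.0],
                       [0.0, 0.1, 1.0]])
sims = offdiag_cosines(prototypes)
squared_push = squared_penalty_grad(sims)
huber_push = huber_penalty_grad(sims)
```

For the related pair, whose cosine similarity is close to 1, the squared penalty applies a push nearly equal to the similarity itself, whereas the Huber gradient saturates at `delta`. That bounded push is one plausible reading of "smooth prototype separation while preserving semantic proximity".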
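[87] CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion cs.CV
Yiming Sun, Yuan Ruan, Qinghua Hu, Pengfei Zhu
TL;DR: This paper proposes CtrlFuse, a controllable infrared and visible image fusion framework that enables interactive, dynamic fusion guided by mask prompts. The model comprises a multi-modal feature extractor, a reference prompt encoder, and a prompt-semantic fusion module, and jointly optimizes parallel segmentation and fusion branches to improve both task performance and fusion quality. Experiments show state-of-the-art fusion controllability and segmentation accuracy.
Details
Motivation: Existing infrared-visible fusion methods either focus on pixel-level fusion and overlook downstream task adaptability, or implicitly learn fixed semantics through cascaded detection/segmentation models, and so cannot interactively meet diverse needs for semantic target perception. The paper aims to enable controllable, task-oriented image fusion.
Result: The method achieves state-of-the-art results in fusion controllability and segmentation accuracy, and the adapted task branch even outperforms the original segmentation model.
Insight: The novelty is a mask-prompt-guided controllable fusion mechanism: the reference prompt encoder dynamically encodes task-specific semantics, and the prompt-semantic fusion module explicitly injects them into the fusion features, so that task performance and fusion quality reinforce each other.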
Abstract: Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
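[88] SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models cs.CV | cs.AI | cs.CR | cs.LG
Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam
TL;DR: SafeRedir is a lightweight inference-time framework that uses prompt embedding redirection to robustly unlearn harmful concepts (such as NSFW content and copyrighted artistic styles) without modifying the underlying image generation model. It uses a latent-aware multi-modal safety classifier to identify unsafe generation trajectories and a token-level delta generator for precise semantic redirection, balancing unlearning effectiveness against benign generation quality.
Details
Motivation: Image generation models memorize and reproduce unsafe or infringing content from their training data, and existing unlearning methods are costly, degrade benign generation quality, and are not robust to prompt paraphrasing and adversarial attacks.
Result: Empirical results on multiple representative unlearning tasks show that SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and stronger resistance to adversarial attacks. It generalizes across a variety of diffusion backbones and existing unlearned models.
Insight: The novelty is an inference-time embedding-space intervention framework that needs no model retraining and achieves fine-grained control through token-level semantic redirection. Its modular design (safety classifier plus delta generator) provides plug-and-play compatibility and broad applicability, offering a new approach to safe model deployment.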
Abstract: Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
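[89] ISLA: A U-Net for MRI-based acute ischemic stroke lesion segmentation with deep supervision, attention, domain adaptation, and ensemble learning cs.CV | cs.AI
Vincent Roca, Martin Bretzner, Hilde Henon, Laurent Puy, Grégory Kuchcinski
TL;DR: This paper proposes ISLA, a new deep learning model for segmenting acute ischemic stroke (AIS) lesions from diffusion MRI. Built on the U-Net framework, it systematically optimizes the loss function, convolutional architecture, deep supervision, and attention mechanisms, and combines unsupervised domain adaptation with ensemble learning. It is trained on multicenter databases with more than 1500 AIS participants and outperforms two state-of-the-art methods on an external test set.
Details
Motivation: Current U-Net-based methods for automatic AIS lesion segmentation differ in their choice of loss function, deep supervision, residual connections, and attention mechanisms; many implementations are not public, and the optimal configuration is unclear. A public, robust, systematically optimized segmentation framework is therefore needed.
Result: On an external test set, ISLA outperforms two state-of-the-art approaches for AIS lesion segmentation.
Insight: Contributions include systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, and the use of unsupervised domain adaptation to improve generalization to an external clinical dataset. The commitment to release code and trained models supports reproducibility, which has practical value.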
Abstract: Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Codes and trained models will be made publicly available to facilitate reuse and reproducibility.
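[90] UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images cs.CV | cs.AI
Siqi Li, Xinyu Cai, Jianbiao Mei, Nianchen Deng, Pinlong Cai
TL;DR: This paper presents UR-Bench, a benchmark for evaluating the multi-hop reasoning of multimodal large language models on ultra-high-resolution images. The benchmark covers two major categories, Humanistic Scenes and Natural Scenes, spanning four subsets of ultra-high-resolution images with distinct spatial structures and data sources; image resolutions range from hundreds of megapixels to gigapixels, and questions come in three difficulty levels. The paper also proposes an agent-based framework that processes images by invoking external visual tools and introduces Semantic Abstraction and Retrieval tools to improve efficiency.
Details
Motivation: Existing visual question answering benchmarks typically rely on medium-resolution data with limited visual complexity, and the reasoning ability of current multimodal large language models on ultra-high-resolution images remains largely unexplored. A dedicated benchmark is needed to evaluate reasoning under extreme visual information.
Result: The paper evaluates state-of-the-art end-to-end multimodal large language models and the proposed agent-based framework; the results indicate that the framework is effective.
Insight: The main contributions are UR-Bench, the first benchmark focused on multi-hop reasoning over ultra-high-resolution images, and an agent framework that combines semantic abstraction and retrieval tools to process very large pixel counts more efficiently. This points to a new direction for evaluating and improving fine-grained reasoning in complex visual scenes.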
Abstract: Recent multimodal large language models (MLLMs) show strong capabilities in visual-language reasoning, yet their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering (VQA) benchmarks typically rely on medium-resolution data, offering limited visual complexity. To bridge this gap, we introduce Ultra-high-resolution Reasoning Benchmark (UR-Bench), a benchmark designed to evaluate the reasoning capabilities of MLLMs under extreme visual information. UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources. Each subset contains images ranging from hundreds of megapixels to gigapixels, accompanied by questions organized into three levels, enabling evaluation of models’ reasoning capabilities in ultra-high-resolution scenarios. We further propose an agent-based framework in which a language model performs reasoning by invoking external visual tools. In addition, we introduce Semantic Abstraction and Retrieval tools that enable more efficient processing of ultra-high-resolution images. We evaluate state-of-the-art models using both end-to-end MLLMs and our agent-based framework, demonstrating the effectiveness of our framework.
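[91] Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching cs.CV | q-bio.QM
Maayan Yesharim, R. G. Bina Perl, Uri Roll, Sarig Gafny, Eli Geffen
TL;DR: This study evaluates computer-vision methods for photographic re-identification of individual Hula painted frogs, a critically endangered species, comparing zero-shot deep local-feature matching with deep global-feature embedding models. The local-feature approach reaches 98% top-1 accuracy in closed-set identification, outperforming all global-feature models. To balance accuracy and scalability, a two-stage workflow is proposed: a fine-tuned global-feature model retrieves a candidate list, which local-feature matching then re-ranks, cutting end-to-end runtime from hours to about 38 minutes while keeping about 96% top-1 accuracy. The method has been deployed as a web application supporting conservation monitoring.
Details
Motivation: Critically endangered amphibians need non-invasive individual identification; traditional marking methods are unsuitable, so photographs must be used for accurate, efficient monitoring.
Result: On a dataset of 1,233 ventral images, zero-shot deep local-feature matching reaches 98% top-1 closed-set identification accuracy, outperforming global-feature models (best after fine-tuning: 60% top-1). The two-stage workflow keeps about 96% top-1 accuracy while reducing runtime from 6.5-7.8 hours to about 38 minutes.
Insight: Key points: zero-shot deep local-feature matching substantially outperforms global-feature embedding on this specific biometric identification task; a two-stage workflow combining global-feature retrieval with local-feature re-ranking balances accuracy and computational cost; and thresholding match scores enables open-set identification, so novel individuals can be handled. The method offers a scalable solution for non-invasive wildlife monitoring.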
Abstract: Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
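The two-stage workflow described above can be sketched as a generic retrieve-then-re-rank routine. This is a hedged illustration, not the authors' pipeline: `two_stage_identify`, the `local_score` callable (standing in for an expensive local-feature matcher), the shortlist size `k`, and the toy embeddings and match counts are all hypothetical.

```python
import numpy as np

def two_stage_identify(query_emb, gallery_embs, local_score, k=10):
    """Stage 1: shortlist by global-embedding cosine similarity.
    Stage 2: re-rank only the shortlist with a slower local-feature score."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    shortlist = np.argsort(-(g @ q))[:k]  # indices of the k most similar items
    scored = [(int(i), float(local_score(int(i)))) for i in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: four gallery embeddings and made-up local match counts per gallery index.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
toy_matches = {0: 12, 1: 87, 2: 3, 3: 1}
ranked = two_stage_identify(query, gallery, toy_matches.get, k=2)
```

The cost saving comes from calling the local matcher only `k` times per query instead of once per gallery image. For open-set use, a caller could treat the query as a novel individual when the top re-ranked score falls below a chosen threshold; the threshold value would have to be set from the data.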
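[92] S3-CLIP: Video Super Resolution for Person-ReID cs.CV | cs.AI
Tamas Endrei, Gyorgy Cserey
TL;DR: This paper proposes S3-CLIP, a CLIP-ReID framework based on video super-resolution that aims to improve video person re-identification (ReID) by raising tracklet quality, especially in challenging cross-view settings (e.g., aerial to ground). It combines super-resolution networks with a task-driven super-resolution pipeline and presents the first systematic study of video super-resolution for improving ReID tracklet quality.
Details
Motivation: Most existing person ReID methods focus on improving model architecture and neglect tracklet quality, a key limiting factor, which makes deploying ReID systems in difficult real-world scenarios (such as cross-view) challenging. The paper addresses this overlooked problem.
Result: On the WACV 2026 VReID-XFD challenge benchmark, S3-CLIP performs competitively with the baseline: 37.52% mAP in the aerial-to-ground setting and 29.16% mAP in the ground-to-aerial setting. In the ground-to-aerial setting, Rank-1, Rank-5, and Rank-10 accuracy improve by 11.24%, 13.48%, and 17.98%, respectively.
Insight: The paper's main claimed novelty is the first systematic use of video super-resolution to improve tracklet quality for person ReID, together with a task-driven super-resolution pipeline. Viewed objectively, the core contribution is the tight coupling of image/video enhancement (super-resolution) with the ReID task to address the underlying data problem of low-quality inputs in cross-view and similar settings, rather than only improving the recognition model, which offers a new approach to difficult ReID scenarios.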
Abstract: Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
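[93] Reasoning Matters for 3D Visual Grounding cs.CV | cs.AI
Hsiang-Wei Huang, Kuang-Ming Chen, Wenhao Chai, Cheng-Yen Yang, Jen-Hao Cheng
TL;DR: This paper proposes a data pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning processes, and uses the data to fine-tune an LLM, yielding a 3D visual grounding LLM named Reason3DVG-8B. The model outperforms the previous LLM-based method 3D-GRAND while using only 1.6% of its training data, demonstrating the effectiveness of the data and the importance of reasoning in 3D visual grounding.
Details
Motivation: Current 3D visual grounding models are held back by limited reasoning ability. Existing methods either require supervised training on large amounts of manually annotated data or rely on synthetic data that yields limited gains at high cost.
Result: Reason3DVG-8B outperforms the previous LLM-based method 3D-GRAND on 3D visual grounding while using only 1.6% of its training data.
Insight: The core novelty is a data pipeline that automatically generates 3D visual grounding data with reasoning processes, which improves the model's understanding and reasoning in complex 3D scenes. It suggests that high-quality synthetic data rich in reasoning improves performance more efficiently than simply scaling up data volume.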
Abstract: The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most current methods incorporate a text encoder and a visual feature encoder to generate fused cross-modal features and predict the referred object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and not proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning processes. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.
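[94] Motion Attribution for Video Generation cs.CV | cs.AI | cs.LG | cs.MM | cs.RO
Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé
TL;DR: This paper proposes Motive, a gradient-based motion attribution framework for analyzing how data influences motion in video generation models. It separates temporal dynamics from static appearance using motion-weighted loss masks, efficiently computes motion-specific influence, and guides data curation to improve the temporal consistency and physical plausibility of generated videos.
Details
Motivation: Despite rapid progress in video generation models, how data influences motion remains poorly understood. The paper aims to develop a scalable motion attribution framework to identify which fine-tuning clips improve or degrade temporal dynamics.
Result: On the VBench benchmark, fine-tuning with high-influence data selected by Motive improves both motion smoothness and dynamic degree, and achieves a 74.1% human preference win rate over the pretrained base model.
Insight: The novelty is being the first to attribute motion rather than visual appearance in video generation models and to use the attribution to guide data curation. Viewed objectively, the motion-weighted masks make motion-specific influence computation efficient and scalable, offering a new approach to data-driven optimization of video generation.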
Abstract: Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
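[95] 3AM: Segment Anything with Geometric Consistency in Videos cs.CV
Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen
TL;DR: This paper proposes 3AM, which enhances SAM2 at training time with 3D-aware features from MUSt3R, using geometric consistency to address object segmentation failures caused by large viewpoint changes in video. It requires only RGB input, with no camera poses or depth-map preprocessing, and substantially outperforms existing video object segmentation methods on challenging wide-baseline datasets such as ScanNet++ and Replica.
Details
Motivation: Existing memory-based video object segmentation methods (such as SAM2) rely on appearance features and are limited under large viewpoint changes, while traditional 3D instance segmentation methods need expensive preprocessing such as camera poses and depth maps. The paper aims to combine the strengths of both, achieving geometry-consistent video segmentation from RGB input alone.
Result: On the wide-baseline ScanNet++ and Replica datasets, 3AM substantially outperforms SAM2 and its extensions. On the Selected Subset of ScanNet++, it reaches 90.6% IoU and 71.7% Positive IoU, improving over state-of-the-art video object segmentation methods by 15.9 and 30.4 percentage points, respectively.
Insight: Contributions include: 1) a lightweight Feature Merger that fuses multi-level, 3D-aware MUSt3R features encoding implicit geometric correspondence with SAM2's appearance features, enabling geometry-consistent recognition based on both spatial position and visual similarity; 2) a field-of-view-aware sampling strategy that ensures frames observe spatially consistent object regions, so reliable 3D correspondences can be learned; 3) RGB-only inference with no extra preprocessing, which improves practicality.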
Abstract: Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2’s appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++’s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/
cs.SE
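[96] APEX-SWE cs.SE | cs.AI | cs.CL
Abhi Kottamasu, Akul Datta, Aakash Barthwal, Chirag Mahapatra, Ajay Arun
TL;DR: This paper introduces the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can carry out economically valuable software engineering work. It contains two novel task types, Integration tasks and Observability tasks, that simulate real-world software engineering scenarios. Eight frontier models were evaluated; Gemini 3 Pro performed best with a Pass@1 score of 25%. The analysis shows that strong performance is driven mainly by epistemic reasoning and the agency to resolve uncertainty.
Details
Motivation: Existing AI evaluations focus on narrow, well-defined tasks and do not assess work that reflects the economic value of real-world software engineering, so a benchmark closer to actual work is needed.
Result: Eight frontier models were evaluated on APEX-SWE. Gemini 3 Pro (Thinking = High) performed best, with a Pass@1 score of 25%, the current best (SOTA) result.
Insight: The novelty is the design of two new evaluation task types, Integration and Observability, that are closer to real software engineering work. The analysis indicates that epistemic reasoning (distinguishing assumptions from verified facts) and the agency to resolve uncertainty proactively are the key drivers of model performance.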
Abstract: We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
cs.LG
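[97] KVzap: Fast, Adaptive, and Faithful KV Cache Pruning cs.LG | cs.AI | cs.CL
Simon Jegou, Maximilian Jeblick
TL;DR: KVzap is a fast, adaptive pruning method targeting the key-value (KV) cache bottleneck in Transformer language model inference. It approximates the KVzip algorithm, works in both prefilling and decoding, and achieves 2-4x KV cache compression with negligible accuracy loss on long-context and reasoning tasks for models such as Qwen3 and Llama, reaching state-of-the-art performance on the KVpress leaderboard.
Details
Motivation: As context lengths in Transformer language models grow, the KV cache has become a critical inference bottleneck. Existing KV cache pruning methods have not been adopted by major inference engines because of speed-accuracy trade-offs.
Result: On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves 2-4x KV cache compression with negligible accuracy loss and state-of-the-art (SOTA) performance on the KVpress leaderboard.
Insight: The novelty is a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding, substantially improving compression efficiency while preserving accuracy and addressing the speed-accuracy trade-off of existing methods.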
Abstract: Growing context lengths in transformer-based language models have made the key-value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed–accuracy trade-offs. We introduce KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves $2$–$4\times$ KV cache compression with negligible accuracy loss and achieves state-of-the-art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.
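[98] Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation cs.LG | cs.AI | cs.CL
Yuxin Yang, Aoxiong Zeng, Xiangquan Yang
TL;DR: This paper proposes Med-MoE-LoRA, a framework that combines Mixture-of-Experts with Low-Rank Adaptation to address the stability-plasticity dilemma and multi-task interference that large language models face when adapting to a specific domain such as medicine. It uses an asymmetric expert distribution and a knowledge-preservation plugin, achieving superior performance on medical benchmarks while retaining the model's general cognitive capabilities.
Details
Motivation: Adapting LLMs to specialized domains such as medicine poses two challenges: the stability-plasticity dilemma, in which the model must acquire complex domain knowledge without catastrophically forgetting general knowledge; and task interference, in which different sub-tasks compete for limited low-rank parameter space.
Result: Experiments show that the method consistently outperforms standard LoRA and conventional MoE architectures across multiple clinical NLP tasks and achieves superior performance on medical benchmarks while retaining the model's general capabilities.
Insight: Main contributions: 1) combining MoE with LoRA for efficient multi-task domain adaptation; 2) an asymmetric expert distribution that places more LoRA experts in deeper layers to capture complex semantic abstractions; 3) a "Knowledge-Preservation Plugin", inspired by LoRA MoE, that isolates and protects general-purpose reasoning; 4) soft merging with adaptive routing and rank-wise decoupling to reduce interference.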
Abstract: The rapid evolution of Large Language Models (LLMs) has shifted focus from general-purpose capabilities to domain-specific expertise. However, adapting LLMs to specialized fields such as medicine presents two challenges: (1) the “Stability-Plasticity Dilemma”, where the model must acquire complex clinical knowledge without suffering from catastrophic forgetting of general world knowledge; and (2) “Task Interference”, where disparate sub-tasks, such as medical diagnosis, report summarization, and drug-drug interaction prediction, compete for limited low-rank parameter space. In this paper, we propose Med-MoE-LoRA, a novel framework that integrates Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to enable efficient multi-task domain adaptation, especially for medical scenarios. Drawing inspiration from recent advances, our framework employs an asymmetric expert distribution where deeper layers are equipped with a higher density of LoRA experts to capture complex semantic abstractions. We further introduce a “Knowledge-Preservation Plugin”, inspired by LoRA MoE, to isolate and protect general-purpose reasoning. By utilizing soft merging with adaptive routing and rank-wise decoupling, Med-MoE-LoRA achieves superior performance in medical benchmarks while reducing interference. Experimental results demonstrate that our approach consistently outperforms standard LoRA and conventional MoE architectures across multiple clinical NLP tasks while retaining the model’s general cognitive capabilities.
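[99] Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs cs.LG | cs.CL
Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao
TL;DR: This paper proposes Uniqueness-Aware Reinforcement Learning, a method that addresses the exploration collapse commonly seen when conventional reinforcement learning (RL) is used to post-train large language models (LLMs) on complex reasoning tasks. An LLM-based judge clusters multiple rollouts for the same problem by their high-level solution strategy, and policy advantages are reweighted inversely with cluster size, explicitly rewarding correct but rare solutions to increase solution diversity.
Details
Motivation: When training LLMs for complex reasoning, conventional RL tends to concentrate prematurely on a few dominant reasoning patterns. This exploration collapse improves pass@1 but limits rollout diversity and gains in pass@k. The root cause is that regularization targets local token behavior rather than diversity over the set of solutions.
Result: On mathematics, physics, and medical reasoning benchmarks, the method consistently improves pass@k across large sampling budgets without sacrificing pass@1, increases the area under the pass@k curve (AUC@K), sustains exploration, and uncovers more diverse solution strategies.
Insight: The core novelty is lifting the reward mechanism from the local token level to the level of whole rollouts, and using clustering by high-level strategy to quantify and reward solution uniqueness. This gives RL training of LLMs a new optimization objective aimed at creative problem solving and diverse exploration rather than a single optimal path.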
Abstract: Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
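The reweighting step can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's objective: the mean-baseline advantage, the plain 1/cluster-size weight, and the function name are hypothetical, and the cluster labels that an LLM judge would produce are simply hard-coded here.

```python
from collections import Counter

def uniqueness_weighted_advantages(rewards, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster)."""
    baseline = sum(rewards) / len(rewards)       # group-mean baseline (assumption)
    advantages = [r - baseline for r in rewards]
    sizes = Counter(cluster_ids)                 # rollouts per strategy cluster
    return [a / sizes[c] for a, c in zip(advantages, cluster_ids)]

# Five rollouts for one problem: four correct (reward 1.0), one wrong (reward 0.0).
rewards = [1.0, 1.0, 1.0, 1.0, 0.0]
# Hard-coded stand-ins for the judge's high-level strategy labels.
clusters = ["algebra", "algebra", "algebra", "geometry", "guess"]
weighted = uniqueness_weighted_advantages(rewards, clusters)
```

In this toy group, the single correct "geometry" rollout keeps its full advantage, while each of the three correct "algebra" rollouts receives one third of it, so the rare strategy gets the larger learning signal. How incorrect rollouts and weight normalization are handled is a design choice that the abstract does not specify.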
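[100] HOSC: A Periodic Activation with Saturation Control for High-Fidelity Implicit Neural Representations cs.LG | cs.CV | cs.GR
Michal Jan Wlodarczyk, Danzel Serrano, Przemyslaw Musialski
TL;DR: This paper proposes HOSC (Hyperbolic Oscillator with Saturation Control), a new periodic activation function for implicit neural representations (INRs). It has the form tanh(β sin(ω₀ x)) and introduces an explicit parameter β that controls the activation's Lipschitz bound, so gradient magnitudes can be tuned while the periodic carrier is retained. It targets the gradient instability and limited control over multi-scale behavior of conventional sine activations.
Details
Motivation: Conventional periodic activations such as sine preserve high-frequency information in INRs through their oscillatory structure but often suffer from gradient instability and limited control over multi-scale behavior. The goal is a new activation that offers a direct mechanism for controlling gradient magnitude.
Result: A comprehensive empirical study across images, audio, video, neural radiance fields (NeRFs), and signed distance functions (SDFs), using standardized training protocols, shows that compared with SIREN, FINER, and related methods, HOSC provides substantial benefits in some settings and competitive parity in others.
Insight: The main novelty is the HOSC activation, whose explicit saturation-control parameter β, together with the frequency ω₀, determines the activation's Lipschitz bound and thus provides a direct, tunable handle on gradient stability and training behavior. Viewed objectively, explicitly combining and parameterizing a saturating nonlinearity (tanh) with a periodic oscillation (sin) is a simple and effective answer to gradient problems in INR training, and the domain-specific guidance on hyperparameter selection has practical value.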
Abstract: Periodic activations such as sine preserve high-frequency information in implicit neural representations (INRs) through their oscillatory structure, but often suffer from gradient instability and limited control over multi-scale behavior. We introduce the Hyperbolic Oscillator with Saturation Control (HOSC) activation, $\text{HOSC}(x) = \tanh\bigl(\beta\sin(\omega_0 x)\bigr)$, which exposes an explicit parameter $\beta$ that controls the Lipschitz bound of the activation by $\beta\omega_0$. This provides a direct mechanism to tune gradient magnitudes while retaining a periodic carrier. We provide a mathematical analysis and conduct a comprehensive empirical study across images, audio, video, NeRFs, and SDFs using standardized training protocols. Comparative analysis against SIREN, FINER, and related methods shows where HOSC provides substantial benefits and where it achieves competitive parity. Results establish HOSC as a practical periodic activation for INR applications, with domain-specific guidance on hyperparameter selection. For code visit the project page https://hosc-nn.github.io/.
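The stated Lipschitz bound can be checked numerically from the formula alone. The sketch below implements the activation as written in the abstract and its analytic derivative; the default values `beta=2.0` and `omega0=30.0` are illustrative assumptions, not hyperparameters recommended by the paper.

```python
import numpy as np

def hosc(x, beta=2.0, omega0=30.0):
    """HOSC(x) = tanh(beta * sin(omega0 * x)), as given in the abstract."""
    return np.tanh(beta * np.sin(omega0 * x))

def hosc_grad(x, beta=2.0, omega0=30.0):
    """Chain rule: beta * omega0 * cos(omega0 x) * (1 - tanh^2(beta sin(omega0 x)))."""
    s = np.sin(omega0 * x)
    return beta * omega0 * np.cos(omega0 * x) * (1.0 - np.tanh(beta * s) ** 2)

# Sample the slope on a dense grid that includes x = 0, where the bound is attained.
xs = np.linspace(-1.0, 1.0, 200001)
max_slope = np.max(np.abs(hosc_grad(xs)))
lipschitz_bound = 2.0 * 30.0  # beta * omega0 for the defaults above
```

Because both |cos| and 1 - tanh^2 are at most 1, the slope can never exceed beta * omega0, and it reaches that value where sin(omega0 x) = 0 and |cos(omega0 x)| = 1. Lowering `beta` therefore shrinks the largest gradient without changing the carrier frequency `omega0`.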
q-bio.QM
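[101] Imaging-anchored Multiomics in Cardiovascular Disease: Integrating Cardiac Imaging, Bulk, Single-cell, and Spatial Transcriptomics q-bio.QM | cs.AI | cs.CV | cs.LG
Minh H. N. Le, Tuan Vinh, Thanh-Huy Nguyen, Tao Li, Bao Quang Gia Le
TL;DR: This review examines methods for integrating imaging with multiomics data in cardiovascular disease and proposes an imaging-anchored multimodal fusion framework that links cardiac imaging phenotypes to transcriptomic and spatial molecular states.
Details
Motivation: Cardiovascular disease involves interactions among inherited risk, molecular programmes, and tissue-scale remodelling, observed clinically through imaging. Yet cardiac MRI, CT, and echocardiography are still analysed separately from bulk, single-cell, and spatial transcriptomics, and integrative methods are lacking.
Result: As a review, the paper reports no specific quantitative results. It discusses integrative pipelines for radiogenomics, spatial molecular alignment, and image-based prediction of gene expression, and notes that spatial multiomics, single-cell and spatial foundation models, and multimodal medical foundation models are moving the field toward large-scale cardiovascular translation.
Insight: The novelty is the imaging-anchored view of integration, in which imaging defines a spatial phenotype and multiomics supplies cell-type- and location-specific molecular context. The review also surveys multimodal fusion methods that handle missing data, limited sample size, and batch effects, providing a scalable analytical framework for cardiovascular research.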
Abstract: Cardiovascular disease arises from interactions between inherited risk, molecular programmes, and tissue-scale remodelling that are observed clinically through imaging. Health systems now routinely generate large volumes of cardiac MRI, CT and echocardiography together with bulk, single-cell and spatial transcriptomics, yet these data are still analysed in separate pipelines. This review examines joint representations that link cardiac imaging phenotypes to transcriptomic and spatially resolved molecular states. An imaging-anchored perspective is adopted in which echocardiography, cardiac MRI and CT define a spatial phenotype of the heart, and bulk, single-cell and spatial transcriptomics provide cell-type- and location-specific molecular context. The biological and technical characteristics of these modalities are first summarised, and representation-learning strategies for each are outlined. Multimodal fusion approaches are reviewed, with emphasis on handling missing data, limited sample size, and batch effects. Finally, integrative pipelines for radiogenomics, spatial molecular alignment, and image-based prediction of gene expression are discussed, together with common failure modes, practical considerations, and open challenges. Spatial multiomics of human myocardium and atherosclerotic plaque, single-cell and spatial foundation models, and multimodal medical foundation models are collectively bringing imaging-anchored multiomics closer to large-scale cardiovascular translation.
[102] Automated Lesion Segmentation of Stroke MRI Using nnU-Net: A Comprehensive External Validation Across Acute and Chronic Lesions q-bio.QM | cs.CV
Tammar Truzman, Matthew A. Lambon Ralph, Ajay D. Halai
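TL;DR: This study systematically evaluates how well automated stroke MRI lesion segmentation based on the nnU-Net framework generalises across multiple public datasets spanning acute and chronic stroke. Models generalised robustly across stroke stages, with segmentation accuracy approaching inter-rater agreement. The study identifies imaging modality, training set size, lesion volume, and image quality as key factors affecting generalisation.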
Details
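Motivation: Existing deep learning models for stroke lesion segmentation are typically optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Overcoming this limitation is needed to advance clinical research, prognostic modelling, and personalised interventions.
Result: Models were trained and independently tested on multiple public acute and chronic stroke MRI datasets (DWI, FLAIR, and T1-weighted images). Segmentation accuracy approached reported inter-rater reliability. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, larger training sets improved performance, with diminishing returns beyond several hundred cases.
Insight: The contribution is a comprehensive, systematic cross-dataset external validation of nnU-Net for stroke lesion segmentation, together with an in-depth analysis of the factors that govern generalisation (lesion volume, image quality, and training data characteristics). Viewed objectively, the study offers empirical guidance for building robust lesion segmentation tools and emphasises data quality and diversity over model complexity alone.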
Abstract: Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
cs.MM
[103] MLLM-VADStory: Domain Knowledge-Driven Multimodal LLMs for Video Ad Storyline Insights cs.MM | cs.CV
Jasmine Yang, Poppy Zhang, Shawndra Hill
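TL;DR: This paper proposes MLLM-VADStory, a domain knowledge-guided multimodal large language model framework for systematically quantifying and generating insights into video ad storylines at scale. The framework segments ads into functional units, classifies each unit with a novel advertising-specific functional role taxonomy, and aggregates functional sequences to recover data-driven storyline structures.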
Details
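Motivation: To understand video ad storylines at scale by using domain knowledge to guide MLLMs toward scalable insights into the functional structure of ad narratives and its effect on performance.
Result: Applied to 50k social media video ads across four industry subverticals, the framework finds that story-based creatives improve video retention, and the authors recommend top-performing story arcs to guide creative design.
Insight: The novelty lies in combining domain knowledge (an advertising functional role taxonomy) with MLLMs to analyse the narrative structure of video ads systematically. This yields quantitative insight from individual functional units up to whole storylines and provides a versatile tool for understanding video creatives.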
Abstract: We propose MLLM-VADStory, a novel domain knowledge-guided multimodal large language model (MLLM) framework to systematically quantify and generate insights for video ad storyline understanding at scale. The framework is centered on the core idea that ad narratives are structured by functional intent, with each scene unit performing a distinct communicative function, delivering product and brand-oriented information within seconds. MLLM-VADStory segments ads into functional units, classifies each unit’s functionality using a novel advertising-specific functional role taxonomy, and then aggregates functional sequences across ads to recover data-driven storyline structures. Applying the framework to 50k social media video ads across four industry subverticals, we find that story-based creatives improve video retention, and we recommend top-performing story arcs to guide advertisers in creative design. Our framework demonstrates the value of using domain knowledge to guide MLLMs in generating scalable insights for video ad storylines, making it a versatile tool for understanding video creatives in general.
cs.IR
[104] VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking cs.IR | cs.AI | cs.CV | cs.MM
Mark Rothermel, Marcus Kornmann, Marcus Rohrbach, Anna Rohrbach
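TL;DR: This paper introduces VeriTaS, the first dynamic benchmark for multimodal automated fact-checking. It targets the problem that static benchmarks cannot reliably measure real model ability because of data leakage. VeriTaS comprises 24,000 multilingual, multimodal real-world claims from 108 professional fact-checking organizations and is updated regularly through a fully automated seven-stage pipeline to resist leakage from large-scale pretraining.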
Details
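Motivation: Existing automated fact-checking benchmarks are limited in task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Most are also static, so their claims leak into the pretraining corpora of large language models, and benchmark performance no longer reliably reflects a model's actual ability to verify claims.
Result: Human evaluation shows that the automated annotations closely match human judgments, which supports the reliability of the benchmark.
Insight: The novelty lies in the first dynamic, leakage-resistant benchmark for multimodal fact-checking, built with a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps expert verdicts to a novel, standardized, and disentangled scoring scheme. The authors commit to regular future updates so that evaluation stays meaningful as foundation models evolve rapidly.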
Abstract: The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to update VeriTaS in the future, establishing a leakage-resistant benchmark, supporting meaningful AFC evaluation in the era of rapidly evolving foundation models. We will make the code and data publicly available.
eess.IV
[105] Application of Ideal Observer for Thresholded Data in Search Task eess.IV | cs.CV | eess.SP | physics.med-ph
Hongwei Lin, Howard C. Gifford
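TL;DR: This study develops an anthropomorphic visual-search model observer based on thresholded data for task-based image quality assessment. Inspired by the human visual system, the model uses a two-stage framework (candidate selection and decision-making) and selectively processes high-salience features to improve discrimination performance, diagnostic accuracy, and computational efficiency.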
Details
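Motivation: To develop a model observer that mimics human visual-search behavior, addressing feature redundancy and computational inefficiency in image quality assessment and diagnostic tasks under noise.
Result: Simulations show that thresholding improves observer performance by excluding low-salience features, particularly in noisy environments. Intermediate thresholds often outperform no thresholding, and the model stays aligned with human performance even when trained on fewer images.
Insight: The novelty lies in combining thresholded data with an ideal observer model in a two-stage framework that processes features selectively. The approach may extend to computer vision, machine learning, and security image analysis, and it suggests a route to training model observers with limited resources.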
Abstract: This study advances task-based image quality assessment by developing an anthropomorphic thresholded visual-search model observer. The model is an ideal observer for thresholded data inspired by the human visual system, allowing selective processing of high-salience features to improve discrimination performance. By filtering out irrelevant variability, the model enhances diagnostic accuracy and computational efficiency. The observer employs a two-stage framework: candidate selection and decision-making. Using thresholded data during candidate selection refines regions of interest, while stage-specific feature processing optimizes performance. Simulations were conducted to evaluate the effects of thresholding on feature maps, candidate localization, and multi-feature scenarios. Results demonstrate that thresholding improves observer performance by excluding low-salience features, particularly in noisy environments. Intermediate thresholds often outperform no thresholding, indicating that retaining only relevant features is more effective than keeping all features. Additionally, the model demonstrates effective training with fewer images while maintaining alignment with human performance. These findings suggest that the proposed novel framework can predict human visual search performance in clinically realistic tasks and provide solutions for model observer training with limited resources. Our novel approach has applications in other areas where human visual search and detection tasks are modeled such as in computer vision, machine learning, defense and security image analysis.
[106] Temporal-Enhanced Interpretable Multi-Modal Prognosis and Risk Stratification Framework for Diabetic Retinopathy (TIMM-ProRS) eess.IV | cs.CV
Susmita Kar, A S M Ahsanul Sarkar Akib, Abdul Hasib, Samin Yaser, Anas Bin Azim
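TL;DR: This study proposes TIMM-ProRS, a multi-modal fusion deep learning framework that combines a Vision Transformer, a convolutional neural network, and a graph neural network for prognosis and risk stratification in diabetic retinopathy. The framework integrates retinal images with temporal biomarkers (such as HbA1c and retinal thickness) to capture multi-modal and temporal dynamics. Evaluated on multiple datasets, the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance.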
Details
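Motivation: Diabetic retinopathy (DR) affects millions of people worldwide and is difficult to diagnose: its visual symptoms overlap with those of other eye diseases, and misdiagnosis rates are high in underserved regions. Early, precise, and interpretable diagnosis is therefore needed to support scalable telemedical management.
Result: Evaluated on multiple datasets, including APTOS 2019 (training) and Messidor-2, RFMiD, EyePACS, and Messidor-1 (validation), TIMM-ProRS achieves 97.8% accuracy and an F1-score of 0.96, outperforming existing methods such as RSG-Net and DeepDR and reaching state-of-the-art performance.
Insight: The novelty lies in integrating multi-modal data (retinal images and temporal biomarkers) and fusing ViT, CNN, and GNN components to capture spatio-temporal dynamics, improving diagnostic accuracy and interpretability. Viewed objectively, the temporal information strengthens the model's understanding of disease progression, which supports early intervention and personalised care.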
Abstract: Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating Vision Transformer (ViT), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Evaluated comprehensively across diverse datasets including APTOS 2019 (trained), Messidor-2, RFMiD, EyePACS, and Messidor-1 (validated), the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods like RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.
[107] M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding eess.IV | cs.CV
Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu
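TL;DR: This paper presents M3CoTBench, a new benchmark designed to evaluate the chain-of-thought reasoning of multimodal large language models in medical image understanding. The benchmark includes a dataset covering multiple examination types and task difficulties, a set of evaluation metrics tailored to clinical reasoning, and a performance analysis of several MLLMs.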
Details
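Motivation: Current benchmarks for medical image understanding generally focus on the final answer and ignore the reasoning process. An opaque process offers no reliable basis for judgment and is of little help to doctors making a diagnosis, so a new benchmark is needed to systematically evaluate the correctness, efficiency, impact, and consistency of CoT reasoning.
Result: The paper analyses the performance of multiple MLLMs on M3CoTBench and reveals the limitations of current models in generating reliable and clinically interpretable reasoning.
Insight: The novelty lies in a comprehensive benchmark dedicated to chain-of-thought reasoning in medical image understanding. Its distinguishing features are a multi-level difficulty dataset, diverse tasks, and evaluation metrics closely tied to clinical needs, which should help drive the development of transparent, trustworthy, and diagnostically accurate medical AI systems.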
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.
cs.RO
[108] VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory cs.RO | cs.CV
Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li
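TL;DR: VLingNav is a vision-language-action model for embodied navigation grounded in linguistic-driven cognition. An adaptive chain-of-thought mechanism lets it switch flexibly between fast, intuitive execution and slow, deliberate planning, and a visual-assisted linguistic memory module builds a cross-modal semantic memory to handle long-horizon spatial dependencies. The model is trained on Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations, combined with online expert-guided reinforcement learning. It achieves state-of-the-art performance on multiple benchmarks and transfers zero-shot to real-world robotic platforms.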
Details
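Motivation: Existing VLA models for embodied navigation rely on reactive mappings from observations to actions and lack the explicit reasoning and persistent memory required for complex, long-horizon tasks. VLingNav is proposed to address these limitations.
Result: VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks and transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and showing strong cross-domain and cross-task generalization.
Insight: The contributions include an adaptive chain-of-thought mechanism, inspired by the dual-process theory of human cognition, that dynamically triggers explicit reasoning; a visual-assisted linguistic memory module that builds a persistent cross-modal semantic memory; and a training recipe that combines a large-scale dataset with adaptive CoT annotations and online expert-guided reinforcement learning.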
Abstract: VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
cs.AI
[109] MemoBrain: Executive Memory as an Agentic Brain for Reasoning cs.AI | cs.CL | cs.IR
Hongjin Qian, Zhao Cao, Zheng Liu
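TL;DR: MemoBrain is an executive memory model for tool-augmented agents that addresses context overload in long-horizon reasoning. It builds a dependency-aware memory structure that captures salient intermediate states across reasoning steps and their logical relations, maintaining a compact, high-salience reasoning backbone under a fixed context budget.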
Details
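Motivation: In tool-augmented agent frameworks, complex long-horizon reasoning causes reasoning traces and transient tool artifacts to accumulate beyond the bounded working context of large language models, disrupting logical continuity and task alignment. Memory is therefore a core component for sustaining coherent, goal-directed reasoning over long horizons, not an auxiliary efficiency concern.
Result: On challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, MemoBrain shows consistent improvements over strong baselines.
Insight: The novelty lies in turning memory from passive context accumulation into an active tool for cognitive control. Dependency-aware memory construction, pruning of invalid steps, and folding of completed sub-trajectories let the model manage the reasoning process dynamically under a limited context budget while preserving logical coherence. This offers a new architectural direction for long-horizon reasoning agents.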
Abstract: Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.
[110] An Under-Explored Application for Explainable Multimodal Misogyny Detection in code-mixed Hindi-English cs.AI | cs.CL
Sargam Yadav, Abhishek Kaushik, Kevin Mc Daid
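TL;DR: This paper presents a multimodal, explainable web application for detecting misogynistic content in code-mixed Hindi-English text and memes. It uses models such as XLM-RoBERTa and mBERT to process text and image data, provides explanations through SHAP and LIME, and aims to help researchers and content moderators combat online gender-based violence.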
Details
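Motivation: To address the shortage of online misogyny detection for low-resource, code-mixed languages and to improve the transparency and interpretability of deep learning decisions in a sensitive domain.
Result: The system is trained on a text dataset of approximately 4,193 comments and a multimodal dataset of approximately 4,218 memes, using XLM-R, mBERT, and mBERT combined with EfficientNet or ResNet. Human evaluators assessed overall usability with the CUQ and UEQ questionnaires.
Insight: The contributions include applying explainable AI (XAI) techniques to multimodal misogyny detection in a low-resource, code-mixed setting, and building a working application that integrates transformer models with SHAP and LIME explanations to improve model transparency and practical usefulness.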
Abstract: Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but are underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet, and mBERT + ResNet trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated using human evaluators who provided their responses on the Chatbot Usability Questionnaire (CUQ) and the User Experience Questionnaire (UEQ) to determine overall usability.
[111] What If TSF: A Benchmark for Reframing Forecasting as Scenario-Guided Multimodal Forecasting cs.AI | cs.CL
Jinkwan Jang, Hyunbin Jin, Hyungjin Park, Kyubyung Chae, Taesup Kim
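TL;DR: This paper introduces What If TSF (WIT), a multimodal time series forecasting benchmark that evaluates whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert-crafted plausible or counterfactual scenarios, the benchmark offers a rigorous testbed for scenario-guided multimodal forecasting.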
Details
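Motivation: Most existing time series forecasting approaches are unimodal and extrapolate historical patterns. The potential of large language models (LLMs) for multimodal forecasting has not been properly tested, because existing benchmarks tend to provide retrospective or misaligned textual context. Inspired by how human experts combine historical evidence with what-if scenarios, the work aims to build a benchmark that rigorously tests whether models actually use textual inputs, especially future scenarios, when forecasting.
Result: The paper introduces the WIT benchmark itself; the abstract does not report model performance or quantitative results on it (for example, whether any model reaches state of the art).
Insight: The core idea is to reframe forecasting as scenario-guided multimodal forecasting and to build a benchmark dedicated to that setting. The key design choice is to supply expert-crafted, future-oriented scenario text, which forces models to understand and integrate that information and so gives a more realistic measure of multimodal forecasting ability than extrapolation from historical data alone.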
Abstract: Time series forecasting is critical to real-world decision making, yet most existing approaches remain unimodal and rely on extrapolating historical patterns. While recent progress in large language models (LLMs) highlights the potential for multimodal forecasting, existing benchmarks largely provide retrospective or misaligned raw context, making it unclear whether such models meaningfully leverage textual inputs. In practice, human experts incorporate what-if scenarios with historical evidence, often producing distinct forecasts from the same observations under different scenarios. Inspired by this, we introduce What If TSF (WIT), a multimodal forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert-crafted plausible or counterfactual scenarios, WIT offers a rigorous testbed for scenario-guided multimodal forecasting. The benchmark is available at https://github.com/jinkwan1115/WhatIfTSF.
[112] Parallel Context-of-Experts Decoding for Retrieval Augmented Generation cs.AI | cs.CL
Giulio Corallo, Paolo Papotti
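TL;DR: This paper proposes Parallel Context-of-Experts Decoding (Pced), a training-free framework that addresses the trade-off between multi-document reasoning and computational efficiency in retrieval-augmented generation. The method treats retrieved documents as isolated "experts" and synchronizes their predictions through a novel retrieval-aware contrastive decoding rule, recovering cross-document reasoning without constructing shared attention across documents.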
Details
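Motivation: Retrieval-augmented generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates a prefill bottleneck, while encoding each document's KV cache separately is fast but breaks cross-document interaction.
Result: The abstract reports no specific benchmarks or quantitative results; it states that the method recovers cross-document reasoning capabilities and requires no training.
Insight: The novelty lies in shifting evidence aggregation from the attention mechanism to the decoding step. A contrastive decoding rule weighs expert logits against the model prior, enabling efficient multi-document reasoning without cross-document attention.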
Abstract: Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding. Pced treats retrieved documents as isolated “experts”, synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing a shared attention across documents.
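The abstract does not give the decoding rule itself, so the following is only an illustrative sketch of the general idea of weighing per-document expert predictions against a no-document prior. Everything here is an assumption rather than the paper's method: the function names, the contrast form `(1 + alpha) * expert - alpha * prior`, the additive log retrieval score, and the max-over-experts selection are all hypothetical choices made for the example.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def contrastive_next_token(
    expert_logits: List[List[float]],  # one logit vector per retrieved document
    prior_logits: List[float],         # logits of the model with no document
    retrieval_scores: List[float],     # positive relevance score per document
    alpha: float = 1.0,                # contrast strength (hypothetical knob)
) -> Tuple[int, int]:
    """Return (token_id, expert_id) with the highest contrastive score.

    Assumed score: (1 + alpha) * log p_expert - alpha * log p_prior
                   + log(retrieval_score).
    This is an illustration, not the rule used in the paper.
    """
    prior_lp = log_softmax(prior_logits)
    best_score = -math.inf
    best_token, best_expert = -1, -1
    for k, (logits, r) in enumerate(zip(expert_logits, retrieval_scores)):
        expert_lp = log_softmax(logits)
        for tok, (e, p) in enumerate(zip(expert_lp, prior_lp)):
            score = (1.0 + alpha) * e - alpha * p + math.log(r)
            if score > best_score:
                best_score, best_token, best_expert = score, tok, k
    return best_token, best_expert


if __name__ == "__main__":
    prior = [0.0, 0.0, 0.0]                    # uninformative prior
    experts = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # only expert 1 has evidence
    print(contrastive_next_token(experts, prior, [0.5, 0.5]))
```

In this toy setup each expert would be computed from its own document cache in parallel, and only the per-step logit vectors are combined, which is what lets evidence aggregation happen without attention across documents.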
[113] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios cs.AI | cs.CV
António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough
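TL;DR: This paper introduces ViDoRe v3, a comprehensive multimodal retrieval-augmented generation (RAG) benchmark for evaluating RAG systems in complex real-world scenarios. It covers 10 datasets across professional domains, with about 26,000 visually rich document pages and 3,099 human-verified queries in 6 languages, and provides high-quality annotations for retrieval relevance, bounding box localization, and reference answers. The evaluation finds that visual retrievers outperform textual ones, that late-interaction models and textual reranking substantially improve performance, and that hybrid or purely visual contexts improve answer quality. Current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding.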
Details
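Motivation: Existing RAG benchmarks fail to capture the challenges of complex real-world scenarios, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. They often focus on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. A new comprehensive benchmark is therefore needed to advance multimodal RAG.
Result: An evaluation of state-of-the-art RAG pipelines on ViDoRe v3 shows that visual retrievers outperform textual retrievers, that late-interaction models and textual reranking substantially improve performance, and that hybrid or purely visual contexts improve answer generation quality. Current models nevertheless perform poorly on non-textual elements, open-ended queries, and fine-grained visual grounding.
Insight: The contribution is a large-scale, multilingual, multi-domain RAG benchmark for visual documents, with high-quality evaluation data produced through extensive human annotation. Viewed objectively, the work underlines the importance of multimodal retrieval in complex RAG scenarios, shows how strongly visual context and late interaction affect performance, and points to clear directions for improvement.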
Abstract: Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at https://hf.co/vidore.
[114] MEMEWEAVER: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection cs.AI | cs.CV
Paolo Italiani, David Gimeno-Gomez, Luca Ragazzi, Gianluca Moro, Paolo Rosso
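TL;DR: This paper proposes MemeWeaver, a framework that detects online sexism and misogyny by building an inter-meme graph and reasoning over it. It outperforms existing methods on the MAMI and EXIST benchmarks and converges faster during training.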
Details
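Motivation: Existing multimodal content moderation methods overlook the social dynamics behind online harassment, in which perpetrators reinforce prejudices and group identity within like-minded communities. Existing graph-based methods are limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning.
Result: On the MAMI and EXIST benchmarks, the method consistently outperforms state-of-the-art baselines and achieves faster training convergence.
Insight: The novelty lies in an end-to-end trainable multimodal framework with a new inter-meme graph reasoning mechanism. The learned graph structure captures semantically meaningful patterns, revealing the relational nature of online hate.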
Abstract: Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual–textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate.