
cs.CL

[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration cs.CL | cs.AI

Longxuan Wei, Yubo Zhang, Zijiao Zhang, Zhihu Wang, Shiwan Zhao

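TL;DR: This paper proposes Entropy-Tree, a tree-based decoding method that uses entropy as the signal for branching decisions and expands the search tree only at positions where the model shows genuine uncertainty, thereby unifying efficient structured exploration with reliable confidence estimation. On reasoning tasks it shows better accuracy and calibration than conventional multi-chain sampling.

Details

Motivation: Existing decoding strategies explore either blindly (random sampling) or redundantly (independent multi-sampling), and fail to use the model's uncertainty to guide reasoning efficiently.

Result: On reasoning tasks across multiple models and datasets, Entropy-Tree outperforms the Multi-chain approach on pass@k, and its predictive entropy achieves better AUROC than several traditional confidence measures.

Insight: The core novelty is using entropy as the guiding signal for dynamically building the decoding tree, which unifies exploration efficiency and uncertainty estimation. Viewed objectively, it is a mechanism that explicitly feeds the model's internal confidence into structured search decisions.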

Abstract: Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
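
The branching idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual branching rule, so the threshold, the branch count, and the function names below are hypothetical; the sketch only shows the general pattern of branching when next-token entropy is high and following a single path otherwise.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_children(probs, threshold=1.0, max_branches=3):
    """Return the token indices to expand at one decoding position.

    Hypothetical rule (not taken from the paper): branch into the top
    `max_branches` tokens when entropy exceeds `threshold`; otherwise
    follow only the single most likely token.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if shannon_entropy(probs) > threshold:
        return ranked[:max_branches]
    return ranked[:1]

# Toy distributions, made up for illustration.
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: no branching
uncertain = [0.30, 0.28, 0.22, 0.20]   # high entropy: branch
print(choose_children(confident))
print(choose_children(uncertain))
```

In this toy setting the tree grows only at the uncertain position, so the sampling budget is spent where the model is undecided rather than spread over independent chains.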


[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports cs.CL

Edward Ajayi

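TL;DR: This paper introduces AfriEconQA, a benchmark dataset dedicated to African economic analysis and grounded in World Bank reports. It contains 8,937 high-quality QA instances and is designed to evaluate models on complex economic queries that require high-precision numerical reasoning and temporal disambiguation.

Details

Motivation: Current large language models lack specialized knowledge of African economics; the dataset provides a challenging domain-specific benchmark for information retrieval and retrieval-augmented generation systems.

Result: Experiments show that the zero-shot model (GPT-5 Mini) fails to answer over 90% of queries, and even state-of-the-art RAG pipelines struggle to reach high precision, confirming how challenging the dataset is.

Insight: The novelty lies in building the first benchmark dataset focused on African economic analysis, with structured instances comprising evidence, answer, and source metadata, offering a new standard for evaluating domain-specific IR and RAG systems.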

Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.


[3] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation cs.CL | cs.AI

Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing

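TL;DR: This paper proposes ICPO, a new training framework for the "lost-in-conversation" problem, in which large language models in multi-turn conversations struggle to recover from early incorrect assumptions. ICPO makes the model more sensitive to instruction ambiguity so that it expresses uncertainty or asks for clarification when a request is ambiguous.

Details

Motivation: Address the "lost-in-conversation" phenomenon caused by ambiguous initial user instructions in multi-turn conversations, and the overconfidence that standard post-training techniques such as RLVR can exacerbate.

Result: Experiments show that ICPO yields an average improvement of 75% in multi-turn conversation while preserving robust performance on single-turn benchmarks.

Insight: The novelty is conditioning the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or requesting clarification in ambiguous situations, which fosters appropriate humility and makes conversations more robust and collaborative.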

Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.


[4] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs cs.CL

Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi

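TL;DR: This study evaluates the role of domain knowledge graphs (KGs) in retrieval-augmented generation (RAG) for healthcare. It builds three PubMed-derived disease knowledge graphs (T2DM, Alzheimer's disease, and their comorbidity), designs two probes, and examines how retrieval source and decoding temperature affect seven instruction-tuned LLMs. Scope alignment between probe and KG proves critical: precisely matched retrieval gives the most consistent gains, whereas indiscriminately merging graphs introduces distractors that reduce accuracy.

Details

Motivation: LLMs generate fluent answers but struggle with trustworthy reasoning in specific domains such as healthcare; the study asks whether domain knowledge graphs can effectively improve RAG there.

Result: Seven LLMs were tested on the two probes (Probe 1 and Probe 2). Gains are most consistent when the probe and KG scope match precisely (notably with graph G2), while simple graph unions (e.g., G1+G2) often reduce accuracy by introducing distractors. On Probe 1, larger models relying on parametric knowledge alone (the No-RAG baseline) frequently match or exceed KG-RAG, indicating strong domain priors; small and mid-sized models benefit more from well-scoped retrieval. Decoding temperature plays a secondary role, and higher temperatures rarely help.

Insight: The paper systematically evaluates how decisively KG scope alignment affects RAG performance, and argues that a precision-first, scope-matched KG-RAG strategy beats breadth-first graph merging. Viewed objectively, the study offers experiment-based practical guidance on choosing graphs, model sizes, and retrieval/reranking strategies, and stresses that quality matters more than quantity when integrating domain-specific knowledge.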

Abstract: Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer's disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD+T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$. Seven instruction-tuned LLMs are tested across retrieval sources {No-RAG, $\mathbb{G}_1$, $\mathbb{G}_2$, $\mathbb{G}_1$+$\mathbb{G}_2$, $\mathbb{G}_3$, $\mathbb{G}_1$+$\mathbb{G}_2$+$\mathbb{G}_3$} and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and data are available at https://github.com/sydneyanuyah/RAGComparison


[5] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering cs.CL | cs.AI | cs.IR

Anuj Maharjan, Umesh Yadav

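TL;DR: Through an empirical evaluation, this paper compares a Vanilla LLM, Basic RAG, and an Advanced RAG architecture with cross-encoder re-ranking on question answering over public health policy documents. Advanced RAG is markedly more faithful than the basic version and the baseline, but document segmentation remains the main bottleneck for multi-step reasoning tasks.

Details

Motivation: LLMs hallucinate when applied to high-stakes domains such as public health policy; RAG grounds generated content in authoritative document context to improve information integrity.

Result: On a CDC policy document dataset, Basic RAG's faithfulness score (0.621) improves substantially over the Vanilla baseline (0.347), and the Advanced RAG configuration reaches an average faithfulness of 0.797.

Insight: The novelty is a systematic comparison of how different RAG architectures (in particular a two-stage retrieval mechanism with cross-encoder re-ranking) and two document chunking strategies (recursive character-based and token-based semantic splitting) affect QA accuracy, underscoring the importance of fine-grained retrieval design for domain-specific tasks.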

Abstract: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
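
The two-stage retrieval pattern described here (a cheap first pass to collect candidates, then a more precise re-ranker) can be sketched without any model dependencies. Both scoring functions below are toy stand-ins chosen for illustration: word overlap replaces the embedding similarity and Jaccard similarity replaces the cross-encoder, so this shows only the control flow, not the paper's pipeline. The chunk strings and parameter names are made up.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (toy stand-in for a real tokenizer)."""
    return text.lower().split()

def first_stage_score(query, chunk):
    """Cheap, recall-oriented score: number of shared words.
    Stand-in for bi-encoder embedding similarity."""
    return len(set(tokenize(query)) & set(tokenize(chunk)))

def rerank_score(query, chunk):
    """More selective score: Jaccard similarity of the word sets.
    Stand-in for a cross-encoder that reads query and chunk jointly."""
    q, c = set(tokenize(query)), set(tokenize(chunk))
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def retrieve(query, chunks, k_candidates=3, k_final=1):
    """Stage 1 keeps k_candidates chunks; stage 2 re-orders them and keeps k_final."""
    candidates = sorted(
        chunks, key=lambda ch: first_stage_score(query, ch), reverse=True
    )[:k_candidates]
    return sorted(
        candidates, key=lambda ch: rerank_score(query, ch), reverse=True
    )[:k_final]

# Made-up toy chunks, not taken from any real corpus.
chunks = [
    "general guidance for isolation in schools workplaces homes and many other community settings",
    "isolation guidance for schools",
    "storage and handling checklist",
    "travel notices overview",
]
print(retrieve("isolation guidance for schools", chunks))
```

The first two chunks tie in the first stage, and only the re-ranker moves the focused chunk ahead of the broad one, which is the kind of precision gain a second stage is meant to provide.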


[6] Multi-Persona Thinking for Bias Mitigation in Large Language Models cs.CL | cs.AI

Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou

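TL;DR: This paper proposes Multi-Persona Thinking (MPT), a novel inference-time framework for reducing the significant social biases present in large language models (LLMs). It guides the model to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and has these personas reason dialectically over several iterations to expose and correct biases.

Details

Motivation: LLMs exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes, so effective mitigation methods are needed.

Result: Evaluated on two widely used bias benchmarks across open-source and closed-source models of varying scales, MPT improves substantially over existing prompting-based strategies, achieving the lowest bias while maintaining core reasoning ability.

Insight: The novelty is turning the potential weakness of persona assignment into a strength for bias mitigation: a multi-perspective dialectical reasoning process exposes and corrects the model's biases dynamically at inference time, an intervention that requires no retraining.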

Abstract: Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.


[7] ViT Registers and Fractal ViT cs.CL | cs.LG

Jason Chuan-Chih Chou, Abhinav Kumar, Shivank Garg

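TL;DR: Inspired by the good performance of transformers without positional encoding (NoPE) in language models and by the finding that registers (additional throwaway tokens not tied to the input) may improve large vision transformers (ViTs), this paper proposes and tests a variant called fractal ViT. It breaks permutation invariance among tokens by applying an attention mask between regular tokens and register-like "summary tokens", and can be used alone or combined with various positional encodings. Experiments show that these models do not outperform ViT with registers, suggesting that such findings may be scale-, domain-, or application-specific.

Details

Motivation: Inspired by the good performance of NoPE transformers in language models and by the possibility that registers improve large ViTs, the paper explores whether breaking token permutation invariance with an attention mask can improve ViT models.

Result: Fractal ViT models do not outperform ViT with registers; the abstract mentions no specific benchmarks or quantitative results.

Insight: The novelty is the fractal ViT structure, which combines an attention mask with summary tokens to break permutation invariance. Viewed objectively, the method brings no significant improvement, suggesting that such techniques may depend heavily on scale, domain, or application.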

Abstract: Drawing inspiration from recent findings including surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale, domain, or application-specific.


[8] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models cs.CL

Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong

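TL;DR: This paper presents YuFeng-XGuard, a family of reasoning-centric, interpretable, and flexible guardrail models for large language models. It performs multi-dimensional risk perception over LLM interactions, generates structured risk predictions with natural language explanations, and balances efficiency and interpretability through tiered inference and a dynamic policy mechanism.

Details

Motivation: Existing LLM safety guardrails mostly rely on coarse-grained filtering or post-hoc rules and suffer from limited transparency, inflexible policies, or high inference costs; fine-grained, interpretable, and adaptable risk assessment is needed.

Result: Extensive experiments on multiple public safety benchmarks show that YuFeng-XGuard achieves state-of-the-art performance while maintaining a good efficiency-efficacy trade-off.

Insight: The novelties are: 1) structured risk predictions (explicit risk categories and configurable confidence scores) with natural language explanations, making safety decisions both actionable and interpretable; 2) a tiered inference paradigm that makes the initial risk decision from the first decoded token and keeps explanatory reasoning available on demand; 3) a dynamic policy mechanism that decouples risk perception from policy enforcement, so safety policies can be adjusted without retraining the model.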

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.


[9] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow cs.CL | cs.AI | cs.LG

Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding

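TL;DR: This paper systematically evaluates how masked diffusion language models (MDLMs) actually perform at parallel generation and arbitrary-order decoding. Current MDLMs remain weaker than autoregressive models at dependency modeling but show task-adaptive decoding behavior; the paper proposes a Generate-then-Edit paradigm to balance efficiency and dependencies.

Details

Motivation: Examine whether MDLMs truly deliver the promised parallel generation and arbitrary-order decoding, and analyze their performance bottlenecks and potential advantages.

Result: On 58 benchmarks spanning knowledge, reasoning, and programming, eight mainstream MDLMs (up to 100B parameters) still lag behind comparably sized autoregressive models. Their parallelism and generation order vary significantly with task domain, reasoning stage, and output correctness, and they show advantages on tasks that require "backward information" (e.g., Sudoku).

Insight: Parallel probabilistic modeling weakens inter-token dependencies in MDLMs, but their adaptive decoding behavior reveals task-dependent generation strategies. The proposed Generate-then-Edit paradigm mitigates dependency loss while retaining the efficiency of parallel decoding, pointing to a direction for future model design.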

Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions – parallelism strength and generation order – using Average Finalization Parallelism (AFP) and Kendall’s tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require “backward information” (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
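
As an illustration of how Kendall's tau can quantify generation order, the sketch below compares each token's position with the decoding step at which it was finalized. How the paper applies the statistic is not specified in the abstract, so the input representation (`finalize_step`) is an assumption, and the tau-a variant used here treats tied pairs (tokens finalized in the same step) as neither concordant nor discordant.

```python
from itertools import combinations

def kendall_tau(finalize_step):
    """Kendall's tau-a between token positions 0..n-1 and finalization steps.

    finalize_step[i] is the (hypothetical) decoding step at which the token
    at position i was finalized. +1 means strictly left-to-right order,
    -1 means strictly right-to-left.
    """
    n = len(finalize_step)
    if n < 2:
        raise ValueError("need at least two positions")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):  # i < j by construction
        if finalize_step[i] < finalize_step[j]:
            concordant += 1
        elif finalize_step[i] > finalize_step[j]:
            discordant += 1
        # equal steps (parallel finalization) count as ties and add nothing
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([0, 1, 2, 3]))  # autoregressive-like order
print(kendall_tau([3, 2, 1, 0]))  # fully reversed order
print(kendall_tau([0, 2, 1, 3]))  # mostly left-to-right with one swap
```

For data with many ties, a tie-corrected variant such as tau-b may be more appropriate than the tau-a shown here.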


[10] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms cs.CL | cs.SI

Baktash Ansari, Shiza Ali, Elias Martin, Maryna Sivachenko, Afra Mashhadi

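TL;DR: This paper presents ToxiTwitch, a hybrid toxicity detection model for the Twitch live-streaming platform. It combines text and emote embeddings generated by large language models (such as DeepSeek-R1-Distill and Llama-3-8B-Instruct) with traditional machine learning classifiers (such as Random Forest and SVM) to improve detection of toxic behavior in live chat. The study shows that incorporating emote information improves detection.

Details

Motivation: In the fast-paced, high-volume, context-rich chat of live-streaming platforms such as Twitch, traditional human moderation and keyword-based filtering are hard to scale, and moderators themselves face harassment.

Result: Under channel-specific training, the proposed hybrid approach reaches up to 80% accuracy (a 13% improvement over BERT) with an F1-score of 76%. It is an exploratory study intended to surface the challenges and limits of emote-aware toxicity detection on Twitch.

Insight: The novelty is a hybrid model that combines LLM-generated multimodal (text plus emote) embeddings with traditional classifiers, emphasizing the importance of emotes as non-textual context for toxicity detection in live chat. Viewed objectively, it offers a promising direction for multimodal, context-sensitive online content moderation.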

Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.


[11] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation cs.CL

Zhiyao Ren, Yibing Zhan, Siyuan Liang, Guozheng Ma, Baosheng Yu

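TL;DR: This paper proposes the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations, and develops the MedConf framework to improve LLM confidence estimation in medical diagnosis. The benchmark unifies three types of medical data and introduces an information sufficiency gradient to characterize how confidence and correctness evolve as evidence accumulates.

Details

Motivation: Existing studies evaluate confidence mainly in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making.

Result: Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient, and remains stable under information insufficiency and multimorbidity.

Insight: The novelty is the first confidence benchmark for multi-turn medical consultation, together with MedConf, an evidence-grounded linguistic self-assessment framework that builds symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and combines them by weighted integration into an interpretable confidence estimate. The key insight is that information adequacy is a key determinant of credible medical confidence modeling.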

Abstract: Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
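
AUROC, one of the two metrics named above, has a simple pairwise reading when used to score confidence estimates: it is the probability that a randomly chosen correct answer receives higher confidence than a randomly chosen incorrect one. The sketch below computes it directly from that definition; the function name and the toy inputs are illustrative and are not taken from the paper's evaluation code.

```python
def auroc(confidences, correct):
    """AUROC of confidence scores against binary correctness labels.

    Computed as the fraction of (correct, incorrect) pairs in which the
    correct answer has the higher confidence; ties count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data: confidence separates correct from incorrect answers perfectly...
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))
# ...and here it carries no ranking information.
print(auroc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; rank-based implementations are preferable for large evaluation sets.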


[12] Persona Switch: Mixing Distinct Perspectives in Decoding Time cs.CL

Junseok Kim, Nakyeong Yang, Kyomin Jung

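TL;DR: This paper proposes Persona Switch, a decoding method that dynamically combines the strengths of zero-shot prompting and role-play prompting by selecting, step by step during decoding, the output with higher confidence, thereby improving the zero-shot reasoning of language models.

Details

Motivation: Zero-shot and role-play prompting perform inconsistently across tasks and instances, suggesting that the two offer complementary strengths rather than one being universally superior.

Result: Experiments with widely used LLMs show that Persona Switch consistently outperforms competitive baselines, with accuracy improvements of up to 5.13%.

Insight: The novelty is a dynamic decoding strategy based on output confidence (measured by the logit gap) that fuses the strengths of different prompting strategies. Viewed objectively, it is a simple and effective mechanism for using the model's internal confidence signal to make decisions, and could extend to other settings that combine different generation strategies.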

Abstract: Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
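
The selection rule described in the abstract can be sketched as follows, under stated assumptions: the logit gap is taken to be the difference between the two largest logits, a "step" is treated as a single token choice, and ties go to the zero-shot branch. The abstract confirms none of these details, and the function names are hypothetical.

```python
def logit_gap(logits):
    """Confidence proxy: difference between the largest and second-largest logit."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second

def switch_step(zero_shot_logits, role_play_logits):
    """Pick the branch with the larger logit gap and return (source, token_index).

    Assumption: ties are resolved in favour of the zero-shot branch.
    """
    if logit_gap(zero_shot_logits) >= logit_gap(role_play_logits):
        source, logits = "zero-shot", zero_shot_logits
    else:
        source, logits = "role-play", role_play_logits
    token = max(range(len(logits)), key=logits.__getitem__)
    return source, token

# Toy logits over a three-token vocabulary, made up for illustration.
print(switch_step([2.0, 1.9, 0.1], [0.5, 3.0, 0.2]))  # role-play branch is more decisive
print(switch_step([4.0, 1.0, 0.0], [2.0, 1.5, 0.0]))  # zero-shot branch is more decisive
```

In a full decoder the chosen output would be appended to both contexts before the next step so the two branches stay in sync; that bookkeeping is omitted here.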


[13] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind cs.CL | cs.AI

Zhitao He, Zongwei Lyu, Yi R Fung

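TL;DR: This paper presents RebuttalAgent, the first academic rebuttal framework grounded in Theory of Mind (ToM). A ToM-Strategy-Response (TSR) pipeline models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. To train the agent, the authors build the large-scale dataset RebuttalBench and use a two-stage procedure of supervised fine-tuning followed by reinforcement learning. They also develop a specialized automated evaluator, Rebuttal-RM. Experiments show that the agent significantly outperforms the base model and advanced proprietary models in both automated and human evaluations.

Details

Motivation: Academic rebuttal is a complex process of strategic communication under severe information asymmetry, not a simple technical debate. Existing approaches mostly imitate surface-level language and lack the perspective-taking that effective persuasion requires, so they struggle with the task.

Result: RebuttalAgent outperforms the base model by an average of 18.3% on automated metrics, and outperforms advanced proprietary models in both automated and human evaluations. The purpose-built evaluator Rebuttal-RM surpasses GPT-4.1 in scoring consistency with human preferences.

Insight: The novelty is systematically bringing ToM into academic rebuttal, with the TSR pipeline linking mental-state modeling, strategy formulation, and response generation. Methodologically, a novel critique-and-refine approach builds the large-scale dataset, and two-stage training combines supervised fine-tuning with reinforcement learning based on a self-reward mechanism, enabling scalable self-improvement. For evaluation, a specialized evaluator trained on multi-source rebuttal data improves the reliability and efficiency of automated evaluation.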

Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author’s own critical analysis and response.


[14] Hallucination Mitigating for Medical Report Generation cs.CL

Ruoqing Zhao, Runze Xia, Piji Li

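TL;DR: This paper proposes KERM (Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation), a framework addressing hallucination (plausible but inaccurate statements) by large vision language models in medical report generation. It retrieves knowledge with MedCLIP, adds a purification module to keep the knowledge relevant, and uses fine-grained rewards to guide the model toward high-quality reports.

Details

Motivation: Large vision language models are prone to hallucination in medical report generation, and in a critical field like medicine inaccurate statements can have serious consequences, so a method is needed to improve report accuracy and clinical relevance.

Result: Experiments on the IU-Xray and MIMIC-CXR datasets show the method is effective at mitigating hallucinations and improving report quality; the abstract does not state whether it reaches SOTA or matches particular models.

Insight: The novelties are combining knowledge retrieval (MedCLIP) with a purification module to make the input information more accurate, and using fine-grained reinforced rewards to align model outputs with desired behaviors. Viewed objectively, it is a novel framework for medical report generation that combines external knowledge with reinforcement learning, and could inform other vision-language tasks that demand high reliability.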

Abstract: In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model's outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.


[15] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection cs.CL

Guoxuan Ding, Yuqing Li, Ziyan Zhou, Zheng Lin, Daren Zha

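TL;DR: This paper proposes ExDR, an explanation-driven dynamic retrieval-augmented generation framework for multimodal fake news detection. It uses model-generated explanations to improve retrieval triggering and evidence retrieval: it assesses triggering confidence along three complementary dimensions, builds entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and strengthen the final prediction.

Details

Motivation: The rapid spread of multimodal fake news, with its evolving nature and reliance on timely factual details, challenges existing detection methods. Dynamic RAG offers a solution but still suffers from redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content.

Result: Experiments on two benchmark datasets, AMG and MR2, show that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.

Insight: The novelty is systematically integrating model-generated explanations into the retrieval triggering and evidence retrieval modules; multi-dimensional confidence assessment, entity-aware index construction, and deception-specific contrastive evidence retrieval improve detection accuracy and efficiency. Viewed objectively, the explanation-driven mechanism improves dynamic retrieval, addresses redundant and irrelevant evidence, and makes detection more robust.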

Abstract: The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.


[16] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model cs.CL

Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song

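TL;DR: This paper presents Stable-DiffCoder, a block diffusion code LLM. By adding a continual pretraining stage with a tailored warmup and a block-wise clipped noise schedule, the model overall outperforms its autoregressive counterpart on a broad suite of code benchmarks under the same data and architecture, and beats a range of roughly 8B-parameter AR and DLLM models.

Details

Motivation: Existing diffusion-based code language models still lag behind strong autoregressive baselines under comparable budgets; the paper examines whether diffusion training can improve code modeling quality beyond autoregressive training alone.

Result: Under the same data and architecture, Stable-DiffCoder overall outperforms its autoregressive counterpart on a broad suite of code benchmarks. Using only the continual pretraining and supervised fine-tuning stages, it outperforms a range of roughly 8B-parameter AR and DLLM models.

Insight: The novelties are: 1) a block diffusion continual pretraining stage with a tailored warmup and block-wise clipped noise schedule for efficient knowledge learning and stable training; 2) evidence that diffusion training by itself can improve code modeling quality; 3) any-order modeling in diffusion models improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.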

Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Moreover, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.


[17] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech cs.CL | cs.AI | cs.CV

Soufiane Jhilal, Stéphanie Martin, Anne-Lise Giraud

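TL;DR: This paper proposes an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations so that ImageNet-pretrained vision models can be used to decode imagined speech. A learnable sensor-space convolution projects the MEG data into three spatial scalogram mixtures that serve as image inputs to the pretrained vision models, which outperform classical and non-pretrained models on imagery vs. silence, imagery vs. silent reading, and vowel decoding.

Details

Motivation: Non-invasive decoding of imagined speech is difficult because the signals are weak and distributed and labeled data is limited.

Result: On MEG data from 21 participants, the approach reaches balanced accuracies of up to 90.4% for imagery vs. silence and 81.0% for imagery vs. silent reading, and 60.6% accuracy for vowel decoding, outperforming classical and non-pretrained models. Cross-subject evaluation confirms that the pretrained models capture shared neural representations.

Insight: The novelty is converting MEG signals into image form so that the feature extractors of vision models pretrained on large image datasets (such as ImageNet) can capture the structure of imagined-speech signals, offering a new cross-modal transfer learning paradigm for brain-computer interfaces and neural decoding.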

Abstract: Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.


Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi

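TL;DR: This paper presents the Mecellem models, a framework for building language models specialized for the Turkish legal domain. It comprises encoder models pre-trained from scratch and decoder models adapted through continual pre-training (CPT). The encoders are based on the ModernBERT architecture, pre-trained on a Turkish-dominant corpus of 112.7 billion tokens, and use a checkpoint selection strategy driven by downstream retrieval performance, which yields efficient retrieval with less compute. The decoders adapt general-purpose models (Qwen3-1.7B and Qwen3-4B) to the Turkish legal domain through a four-phase CPT with controlled curriculum learning, substantially reducing perplexity on in-domain text.

Details

Motivation: Build efficient, specialized language models for the Turkish legal domain, avoid the high cost of the multi-stage, compute-intensive training pipelines used by existing SOTA models, and offer a more cost-effective route to domain adaptation.

Result: The encoder models reach top-3 rankings on the Turkish retrieval leaderboard. The small 155M-parameter models perform comparably to larger reference models (307M-567M parameters), and the approach attains 92.36% production efficiency relative to SOTA models, ranking fourth overall. The decoder models achieve a 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.

Insight: Contributions include: 1) a checkpoint selection strategy based on downstream retrieval performance during pre-training, which shows that the best checkpoints occur before pre-training loss reaches its minimum; 2) single-stage pre-training followed by efficient post-training as a cost-effective alternative to multi-stage SOTA pipelines; 3) a four-phase CPT with controlled curriculum learning that moves gradually from general language knowledge to specialized legal terminology and long-context reasoning.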

Abstract: This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1)Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2)Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
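
The checkpoint selection idea in this abstract can be illustrated with a small sketch: pick the checkpoint with the best downstream retrieval score rather than the one with the lowest pre-training loss. The `Checkpoint` record, the step numbers, and all metric values below are made-up placeholders for illustration and are not taken from the paper; the helper `perplexity_reduction` only shows how a relative reduction such as the reported 36.2% would be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int
    pretrain_loss: float    # lower is better
    retrieval_score: float  # higher is better (hypothetical downstream metric)


def select_by_retrieval(checkpoints):
    """Return the checkpoint with the highest downstream retrieval score."""
    return max(checkpoints, key=lambda c: c.retrieval_score)


def select_by_loss(checkpoints):
    """Return the checkpoint with the lowest pre-training loss."""
    return min(checkpoints, key=lambda c: c.pretrain_loss)


def perplexity_reduction(before, after):
    """Relative perplexity reduction, e.g. 0.362 means 36.2%."""
    return (before - after) / before


# Hypothetical training history: retrieval peaks before the loss bottoms out.
history = [
    Checkpoint(step=10_000, pretrain_loss=2.10, retrieval_score=0.41),
    Checkpoint(step=20_000, pretrain_loss=1.85, retrieval_score=0.47),
    Checkpoint(step=30_000, pretrain_loss=1.72, retrieval_score=0.45),
    Checkpoint(step=40_000, pretrain_loss=1.69, retrieval_score=0.43),
]

print(select_by_retrieval(history).step, select_by_loss(history).step)
```

In this toy history the two criteria disagree, which is the situation the abstract describes: selecting on loss alone would keep a later checkpoint with a weaker retrieval score.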


[19] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction cs.CLPDF

Tony Cristofano

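TL;DR: This paper proposes Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from a donor model to a target model with a different architecture (e.g., dense to mixture-of-experts) without any target-side refusal supervision. It aligns layers via concept fingerprints and reconstructs refusal directions from a shared recipe of concept atoms, mapping the donor's ablation trajectory into the target's semantic space.

Details

Motivation: The paper challenges the view that refusal behavior in aligned LLMs is model-specific and hypothesizes that it stems from a universal, low-dimensional semantic circuit shared across models. It sets out to test this hypothesis by transferring refusal interventions across models.

Result: Evaluation across 8 model pairs, including GPT-OSS-20B and GLM-4, confirms that the transferred recipes consistently attenuate refusal while maintaining performance, which the authors present as strong evidence for the semantic universality of safety alignment.

Insight: The contribution is a general framework for cross-model transfer of refusal interventions, built on concept-basis reconstruction and trajectory replay. A weight-SVD stability guard projects interventions away from high-variance weight subspaces, limiting collateral damage to model capabilities and offering a new lens for understanding and manipulating LLM safety mechanisms.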

Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared “recipe” of concept atoms, we map the donor’s ablation trajectory into the target’s semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.


[20] synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier cs.CL | cs.CVPDF

Haq Nawaz Malik, Kh Mohmad Shafi, Tanveer Ahmad Reshi

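TL;DR: This paper presents SynthOCR-Gen, an open-source synthetic OCR dataset generator designed for low-resource languages. The tool turns digital Unicode text corpora into ready-to-use training datasets, addressing a key data bottleneck in OCR development. Using Kashmiri as a case study, the authors generate and publicly release an OCR dataset of 600,000 word samples.

Details

Motivation: Low-resource languages such as Kashmiri, which uses a complex Perso-Arabic script, lack the large-scale annotated training data needed for support in major OCR systems (Tesseract, TrOCR, PaddleOCR). Creating such datasets by hand is expensive, slow, and error-prone.

Result: The authors generated a 600,000-sample word-segmented Kashmiri OCR dataset, synthesized from a digital text corpus with the proposed tool, and released it publicly on HuggingFace as a basis for subsequent model training.

Insight: The contribution is an end-to-end open-source pipeline for synthetic data generation that combines text segmentation, Unicode normalization, multi-font rendering, and 25+ augmentation techniques simulating real-world document degradations (rotation, blur, noise). It gives OCR development for low-resource languages a reproducible, scalable, and practical solution.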

Abstract: Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word by word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
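
To make the "Unicode normalization with script purity enforcement" and word-level segmentation steps more concrete, here is a minimal sketch using only Python's standard library. It is not SynthOCR-Gen's code: the function names, the choice of NFC, and the code point ranges treated as Perso-Arabic are assumptions made for illustration.

```python
import unicodedata

# Assumed code point ranges for Arabic-script text (Arabic, Arabic Supplement,
# Arabic Extended-A). A real tool may use a different or larger set.
ARABIC_RANGES = ((0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF))


def is_arabic_script(ch):
    """True if the character falls inside one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def clean_words(text):
    """NFC-normalize, split on whitespace, and keep only script-pure words."""
    normalized = unicodedata.normalize("NFC", text)
    return [w for w in normalized.split() if all(is_arabic_script(c) for c in w)]


# Two Arabic-script words mixed with Latin tokens; escapes keep the file ASCII.
sample = "\u06a9\u0672\u0634\u064f\u0631 abc \u06a9\u062a\u0627\u0628 x1"
print(len(clean_words(sample)))
```

Each surviving word would then be rendered with one or more fonts and passed through augmentations; that rendering stage is outside this sketch.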


[21] LLM-in-Sandbox Elicits General Agentic Intelligence cs.CL | cs.AIPDF

Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen

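TL;DR: The paper introduces LLM-in-Sandbox, which lets large language models (LLMs) explore within a code sandbox (i.e., a virtual computer) to elicit general intelligence in non-code domains. Strong LLMs, without additional training, can use the sandbox for non-code tasks such as acquiring external knowledge, handling long contexts, and executing scripts. LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL) further enhances these capabilities while training only on non-agentic data. The method generalizes robustly across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following, and is open-sourced as a Python package.

Details

Motivation: LLMs face limitations when tackling tasks in non-code domains (mathematics, science, and others) directly. Giving them a code sandbox to explore lets them draw on computational resources and elicits broader intelligent behavior without task-specific training.

Result: Experiments show that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following.

Insight: The contribution is to pair LLMs with a code sandbox so that the model can carry out complex tasks by exploring a virtual computer environment, widening the range of LLM applications. The RL stage trains sandbox exploration using only non-agentic data, which reduces data requirements, and the open-source implementation eases real-world deployment, suggesting a route toward more general agents.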

Abstract: We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox’s efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.


cs.CV [Back]

[22] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition cs.CVPDF

Hatef Otroshi Shahreza, Anjith George, Sébastien Marcel

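TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on heterogeneous face recognition (HFR), where enrollment and probe images come from different sensing modalities: visible (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal. Several open-source MLLMs are benchmarked across cross-modality scenarios (VIS-NIR, VIS-SWIR, VIS-THERMAL) using biometric protocols and metrics including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR).

Details

Motivation: MLLMs perform strongly on many vision-language tasks, raising interest in their use for biometric applications, but they had not been systematically evaluated on the challenging cross-modality task of HFR.

Result: Despite recent advances, MLLMs show substantial performance gaps relative to classical face recognition systems, particularly under challenging cross-spectral conditions, highlighting their current limitations for HFR.

Insight: The paper provides a systematic benchmark of MLLMs on HFR and stresses the importance of rigorous biometric evaluation before deploying MLLMs in face recognition systems, pointing to where cross-modality recognition needs to improve.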

Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
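
For readers unfamiliar with the biometric metrics named in this abstract, the sketch below shows one common way to approximate the Equal Error Rate (EER) and a True Accept Rate (TAR) from lists of genuine and impostor comparison scores. It is a generic illustration, not the paper's evaluation code; the function names and the toy scores are assumptions, and the threshold scan is a simple approximation rather than an interpolated EER.

```python
def far_frr(genuine, impostor, threshold):
    """False accept and false reject rates when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Scan observed scores as thresholds; return (EER estimate, threshold)
    at the point where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def tar_at_threshold(genuine, threshold):
    """Fraction of genuine comparisons accepted at the given threshold."""
    return sum(s >= threshold for s in genuine) / len(genuine)


# Toy similarity scores (higher means more likely the same identity).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
eer, thr = equal_error_rate(genuine, impostor)
print(eer, thr, tar_at_threshold(genuine, thr))
```

In practice TAR is usually reported at a threshold chosen to hit a fixed false accept rate; here it is evaluated at the EER threshold only to keep the example short.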


[23] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation cs.CV | cs.AI | cs.CL | cs.LGPDF

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa

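TL;DR: This paper proposes CURE, a curriculum-learning-based multi-task training framework that improves the visual grounding and factual consistency of medical imaging report generation. By dynamically adjusting its sampling strategy, it strengthens performance on phrase grounding, grounded report generation, and anatomy-grounded report generation, improving report quality without additional data.

Details

Motivation: Existing medical vision-language models ground radiology findings inaccurately and are factually inconsistent, producing unreliable or weakly grounded predictions. Better alignment between textual findings and visual evidence is needed.

Result: On public datasets, CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%.

Insight: The contribution is an error-aware curriculum learning framework that dynamically emphasizes harder samples to improve spatial and textual alignment. It is data-efficient, enhancing both grounding accuracy and report reliability without any additional data.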

Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure


[24] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection cs.CVPDF

Morteza Poudineh, Marc Lalonde

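TL;DR: This paper proposes DevPrompt, a deviation-based prompt learning framework for few-normal shot anomaly detection (FNSAD). It combines the semantic power of vision-language models such as CLIP with deviation-based statistical scoring: learnable context vectors and anomaly-specific suffix tokens sharpen the separation between normal and abnormal prompts, and a deviation loss with Top-K Multiple Instance Learning (MIL) models patch-level features as Gaussian deviations, improving anomaly localization.

Details

Motivation: Existing prompt-learning methods for FNSAD discriminate weakly between normal and abnormal prompts and lack a principled scoring mechanism for patch-level anomalies.

Result: On the MVTecAD and VISA benchmarks, the method achieves better pixel-level detection performance than PromptAD and other baselines. Ablation studies validate the learnable prompts, deviation-based scoring, and the Top-K MIL strategy.

Insight: The contribution is to combine the semantic alignment of vision-language models with statistically grounded deviation scoring. Prompts are built from learnable context vectors shared across normal and abnormal prompts plus class-aware suffix tokens, and Top-K MIL over patch-level deviations makes anomaly localization more discriminative and interpretable.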

Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
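
The abstract's "deviation loss with Top-K MIL" can be pictured with a small numeric sketch. The version below follows the general deviation-network recipe (z-score against a reference distribution, then a margin loss) and is an assumption about the overall shape, not the paper's exact formulation; the function names, the margin of 5.0, and the toy scores are all hypothetical.

```python
import statistics


def patch_deviations(patch_scores, ref_scores):
    """Z-score each patch score against a reference (normal) score distribution."""
    mu = statistics.fmean(ref_scores)
    sigma = statistics.pstdev(ref_scores)
    if sigma == 0:
        raise ValueError("reference scores must not be constant")
    return [(s - mu) / sigma for s in patch_scores]


def topk_mean(values, k):
    """Top-K MIL pooling: average of the k largest patch deviations."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def deviation_loss(dev, label, margin=5.0):
    """label 0 (normal): pull the deviation toward 0.
    label 1 (anomalous): push the deviation above the margin."""
    if label == 0:
        return abs(dev)
    return max(0.0, margin - dev)


# Toy example: four patch scores, reference distribution with mean 1 and std 1.
devs = patch_deviations([1.0, 3.0, 7.0, 2.0], ref_scores=[0.0, 2.0])
image_dev = topk_mean(devs, k=2)
print(devs, image_dev, deviation_loss(image_dev, label=1))
```

Pooling only the top-K patches lets a few strongly deviating patches drive the image-level score, which is why this kind of loss tends to help localization.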


[25] Hybrid Vision Transformer_GAN Attribute Neutralizer for Mitigating Bias in Chest X_Ray Diagnosis cs.CVPDF

Jobeal Solomon, Ali Mohammed Mansoor Alsahag, Seyed Sahand Mohammadi Ziabari

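TL;DR: The paper proposes a hybrid Vision Transformer-GAN attribute neutralizer for mitigating bias in chest X-ray diagnosis. It replaces the U-Net convolutional encoder in the Attribute-Neutral Framework with a Vision Transformer backbone, aiming to reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient DeiT-S neutralizer is trained on the ChestX-ray14 dataset and evaluated across multiple edit-intensity levels.

Details

Motivation: Bias in chest X-ray classifiers often stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Existing pixel-space attribute neutralizers built on convolutional encoders lessen but do not fully remove attribute leakage at clinically usable edit strengths, so the study asks whether a Vision Transformer backbone can reduce leakage more effectively.

Result: At a moderate edit level (alpha = 0.5), the Vision Transformer neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Macro ROC AUC across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70.

Insight: The contribution is to bring a Vision Transformer into the attribute neutralization framework, using global self-attention to further suppress attribute leakage without sacrificing clinical utility. This suggests a practical route toward fairer chest X-ray AI.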

Abstract: Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework’s convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.


[26] VIOLA: Towards Video In-Context Learning with Minimal Annotations cs.CV | cs.AIPDF

Ryo Fujii, Hideo Saito, Ryo Hachiuma

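TL;DR: This paper introduces VIOLA, a framework for video in-context learning with minimal annotations, addressing the scarcity of labeled data when multimodal large language models (MLLMs) are generalized to novel video domains. It combines density-uncertainty-weighted sampling, which selects samples for annotation efficiently, with confidence-aware retrieval and prompting, which integrate abundant unlabeled data, to achieve robust adaptation in low-resource settings.

Details

Motivation: Deploying MLLMs in specialized video domains such as industrial or surgical settings is difficult because labeled data are scarce and expert annotation is costly. The goal is effective in-context learning with minimal annotation.

Result: Extensive experiments across nine diverse benchmarks using four MLLMs show that the framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.

Insight: Contributions include density-uncertainty-weighted sampling, which selects samples that are diverse, representative, and informative, and confidence-aware retrieval and prompting, which explicitly model label reliability to distinguish verified ground truths from noisy pseudo-labels. Together they offer a label-efficient in-context learning approach for low-resource video understanding.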

Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts’ annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
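
One way to read the two mechanisms in this abstract is sketched below: annotation candidates are ranked by the product of a density estimate and an uncertainty value, so that isolated outliers score low even when the model is very unsure about them, and retrieval candidates are ranked by a blend of similarity and label confidence. The Gaussian kernel density, the product form, the linear blend, and the parameter `alpha` are all assumptions for illustration; the paper's actual scoring functions may differ.

```python
import math


def kernel_density(points, bandwidth=1.0):
    """Average Gaussian-kernel similarity of each point to all the others."""
    densities = []
    for i, p in enumerate(points):
        total = 0.0
        for j, q in enumerate(points):
            if i != j:
                total += math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2))
        densities.append(total / (len(points) - 1))
    return densities


def select_for_annotation(points, uncertainty, budget):
    """Rank samples by density * uncertainty and return the top `budget` indices."""
    dens = kernel_density(points)
    scores = [d * u for d, u in zip(dens, uncertainty)]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


def retrieval_score(similarity, confidence, alpha=0.5):
    """Composite demonstration score: blend of similarity and label confidence."""
    return alpha * similarity + (1 - alpha) * confidence


# Three clustered embeddings and one far-away outlier with high uncertainty.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0)]
uncertainty = [0.5, 0.5, 0.5, 0.9]
picked = select_for_annotation(points, uncertainty, budget=2)
print(picked)
```

The outlier is skipped despite having the highest uncertainty, which mirrors the abstract's concern that pure uncertainty strategies risk selecting visual outliers.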


[27] Explainable Deepfake Detection with RL Enhanced Self-Blended Images cs.CVPDF

Ning Jiang, Dingheng Zeng, Yanhong Liu, Haiyang Yi, Shijie Yu

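TL;DR: This paper proposes an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, together with a reinforcement learning (RL)-enhanced deepfake detection framework. The aim is to address the lack of explainable outputs in existing deepfake detectors and the scarcity of high-quality annotated data. The method performs competitively with state-of-the-art approaches across multiple cross-dataset benchmarks.

Details

Motivation: Existing deepfake detection methods lack explainable outputs, and applying multimodal large language models (MLLMs) to interpretable detection is held back by the scarcity of high-quality datasets with detailed textual forgery attribution annotations. Meanwhile, RL has shown potential to improve performance in visual tasks, especially cross-domain generalization.

Result: The method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks.

Insight: The contributions are an automated CoT data generation framework based on Self-Blended Images, which produces annotated data at reduced cost, and an RL-enhanced detection framework that combines a tailored reward mechanism with feedback-driven synthetic data generation to improve explainability and cross-domain generalization.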

Abstract: Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.


[28] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception cs.CVPDF

Bo Yuan, Danpei Zhao, Wentao Li, Tian Li, Zhiguo Jiang

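TL;DR: This paper proposes a continual panoptic perception (CPP) model that integrates multimodal and multi-task continual learning, enhancing image perception through joint pixel-level, instance-level, and image-level interpretation. The model uses a collaborative cross-modal encoder (CCE) for multimodal embedding and a malleable knowledge inheritance module, based on contrastive feature distillation and instance distillation, to mitigate catastrophic forgetting. A cross-modal consistency constraint (CPP+) keeps semantics aligned under multi-task incremental scenarios, and asymmetric pseudo-labeling lets the model evolve without exemplar replay.

Details

Motivation: Existing continual learning research focuses mainly on single-task scenarios, which limits its potential in multi-task and multimodal applications. Multi-task continual learning faces not only catastrophic forgetting but also semantic confusion in cross-modal alignment, causing severe model degradation across incremental training steps.

Result: Extensive experiments on multimodal datasets and diverse continual learning tasks demonstrate the superiority of the proposed model, particularly in fine-grained continual learning tasks.

Insight: Contributions include extending continual learning to a multimodal, multi-task continual panoptic perception framework, the collaborative cross-modal encoder and malleable knowledge inheritance module that mitigate forgetting in a task-interactive way, and the cross-modal consistency constraint and asymmetric pseudo-labeling that achieve semantic alignment and model evolution without replay.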

Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, the pioneer research has predominantly focus on single-task CL, which restricts the potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, the multi-task CL also brings semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting from task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates an asymmetric pseudo-labeling manner, enabling model evolving without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.


[29] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams cs.CV | cs.AIPDF

Zhenghui Guo, Yuanbin Man, Junyuan Sheng, Bowen Lin, Ahmed Ahmed

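TL;DR: This paper introduces Event-VStream, an event-aware framework for real-time understanding of long video streams. It detects semantically coherent events in the video (such as state transitions) to trigger language generation and stores event embeddings in a persistent memory bank, addressing the redundant frame processing and context forgetting of existing methods.

Details

Motivation: Multimodal large language models (VLMs) processing long video streams suffer from redundant frame processing and rapid forgetting of past context, while existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information.

Result: On OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, comes close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around a 70% GPT-5 win rate on 2-hour Ego4D streams.

Insight: The contribution is to represent continuous video as a sequence of discrete, semantically coherent events, detecting meaningful state transitions by integrating motion, semantic, and predictive cues and triggering generation only at event boundaries. This enables long-horizon reasoning at low latency, and the event-driven representation and memory mechanism offer a fresh approach to real-time video understanding.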

Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime, and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
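
The event-boundary idea described in this abstract can be sketched as a simple thresholded cue fusion: each frame carries motion, semantic, and predictive cue values, a weighted sum gives a boundary score, and a new event (and hence a generation trigger) starts only when that score crosses a threshold. The weights, the threshold, the cue values, and the function name are hypothetical; the paper's detector is likely more elaborate.

```python
def segment_events(frame_cues, threshold=0.5, weights=(0.4, 0.4, 0.2)):
    """Split a stream into events given per-frame (motion, semantic, predictive) cues.

    Returns a list of half-open (start, end) frame index ranges. A boundary is
    placed at frame i when the weighted cue score reaches the threshold.
    """
    events, start = [], 0
    for i, cues in enumerate(frame_cues):
        score = sum(w * c for w, c in zip(weights, cues))
        if i > start and score >= threshold:
            events.append((start, i))  # close the current event at the boundary
            start = i
    if frame_cues:
        events.append((start, len(frame_cues)))
    return events


# Six frames: quiet, quiet, transition, quiet, quiet, transition.
cues = [
    (0.1, 0.1, 0.1),
    (0.1, 0.2, 0.1),
    (0.9, 0.8, 0.7),
    (0.1, 0.1, 0.0),
    (0.2, 0.1, 0.1),
    (0.8, 0.9, 0.9),
]
events = segment_events(cues)
print(events)
```

In a streaming system, language generation would run once per detected boundary rather than at fixed intervals, and one embedding per event would be appended to the memory bank.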


[30] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling cs.CV | cs.AIPDF

Hongyang Wei, Hongbo Liu, Zidong Wang, Yi Peng, Baixin Xu

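TL;DR: This paper presents Skywork UniPic 3.0, a unified multimodal framework for single-image editing and multi-image composition. By treating multi-image composition as a sequence-modeling problem and using a dedicated data pipeline and training paradigm, the model handles an arbitrary number (1~6) and resolution of input images with only 700K high-quality training samples, and generates high-fidelity outputs in just 8 inference steps.

Details

Motivation: The community shows strong interest in multi-image composition (as seen with Nano-Banana and Seedream 4.0), but existing models have not disclosed methodological details for achieving high-quality fusion, and multi-image composition poses greater consistency and quality challenges than single-image editing. The paper systematically analyzes and implements a multi-image composition solution focused on Human-Object Interaction (HOI), the most sought-after category.

Result: Skywork UniPic 3.0 achieves state-of-the-art (SOTA) performance on a single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on a multi-image composition benchmark. With trajectory mapping and distribution matching integrated into post-training, the model achieves a 12.5x speedup and produces high-fidelity samples in 8 steps.

Insight: Main contributions: 1) reformulating multi-image composition as a sequence-modeling problem, turning conditional generation into unified sequence synthesis; 2) a comprehensive data collection, filtering, and synthesis pipeline that achieves strong performance with a small amount of high-quality data; 3) trajectory mapping and distribution matching in the post-training stage, which give a large inference speedup while keeping fidelity high. Together they offer a unified framework for single-image and multi-image generation.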

Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community’s strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary (1~6) number and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.


[31] Performance-guided Reinforced Active Learning for Object Detection cs.CV | cs.LGPDF

Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo

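TL;DR: This paper proposes MGRAL, a performance-guided reinforced active learning method for object detection. A reinforcement learning sampling agent, rewarded by mAP improvement, selects the most informative unlabeled batches, and unsupervised fast look-up tables reduce the computational overhead. MGRAL achieves the highest active learning curves on the PASCAL VOC and COCO benchmarks.

Details

Motivation: Current active learning methods assess data informativeness mainly through data distribution or intrinsic information content, without tying it directly to downstream task performance (such as mAP in object detection). A strategy guided directly by performance is needed.

Result: On detection tasks over the PASCAL VOC and COCO benchmarks, MGRAL demonstrates the highest active learning curve, supported by visualizations, and the authors present it as a new paradigm for reinforcement learning-driven active object detection.

Insight: The contribution is to use expected model output change as the informativeness measure and to optimize batch selection with policy gradient, using mAP improvement as the reward. Unsupervised fast look-up tables cut the cost of mAP estimation on unlabeled samples, keeping deployment feasible.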

Abstract: Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data’s distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e. mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL utilizes an unsupervised way with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL’s active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.


[32] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs cs.CV | cs.AIPDF

Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin

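TL;DR: This paper proposes Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework for probing the visual safety boundaries of multimodal large language models (MLLMs). It uses a “reconstruction-then-generation” strategy that decouples malicious intent from the raw inputs through neutralized visual splicing and inductive recomposition, thereby inducing MLLMs to generate harmful images. Experiments report a 98.21% jailbreak success rate against GPT-5 (12 January 2026 release), exposing critical vulnerabilities in the visual safety alignment of current MLLMs.

Details

Motivation: Existing work has explored the security vulnerabilities of MLLMs, but their visual safety boundaries remain insufficiently investigated. The paper addresses this gap by examining safety challenges at the intersection of text and vision.

Result: In experiments against GPT-5 (12 January 2026 release), BVS achieves a jailbreak success rate of 98.21%, a quantitative result that exposes the fragility of visual safety alignment in current MLLMs.

Insight: The contribution is the “reconstruction-then-generation” strategy and the notion of semantic-agnostic inputs, which decouple malicious intent via neutralized visual splicing and inductive recomposition. The work reveals a combined vision-text attack surface that MLLM safety mechanisms may overlook.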

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a “reconstruction-then-generation” strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby leading MLLMs to be induced into generating harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.


[33] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework cs.CVPDF

Shubham Shukla, Kunal Sonalkar

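TL;DR: This paper proposes a three-tier evaluation framework for systematically assessing vision-language models (VLMs) on zero-shot multi-attribute labeling of fashion products. The framework decomposes the task into overall performance, attribute applicability detection, and fine-grained classification, and benchmarks nine VLMs of different sizes on the DeepFashion-MultiModal dataset.

Details

Motivation: Fashion retail applications need fine-grained attribute prediction, yet zero-shot VLMs have not been systematically evaluated on multi-attribute tasks with conditional attributes (for example, “outer fabric” is undefined when no outer garment is visible). In such cases a model must detect whether an attribute applies before classifying it.

Result: On DeepFashion-MultiModal (18 attributes, 5,000 images), zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings. VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1). Efficient models achieve over 90% of flagship performance at lower cost.

Insight: The contribution is a diagnostic three-tier evaluation framework that pinpoints whether errors stem from attribute visibility detection or from classification itself, guiding targeted improvements for production systems. It also shows that applicability detection is the key bottleneck in zero-shot attribute labeling and that efficient models are a practical, cost-effective deployment option.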

Abstract: Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, “outer fabric” is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn’t exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
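
The three tiers described in this abstract can be made concrete for a single attribute with a short sketch. Tier 1 is macro-F1 over all classes including NA, Tier 2 is the F1 of the NA class, and Tier 3 is macro-F1 restricted to samples whose ground-truth label is not NA. Restricting Tier 3 by the gold label is an assumption about the protocol, and the function names and toy labels are hypothetical.

```python
def f1_for_class(gold, pred, cls):
    """Binary F1 for one class, treating every other label as negative."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 scores."""
    if not classes:
        return 0.0
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def three_tier(gold, pred, na="NA"):
    """Return (tier1, tier2, tier3) scores for one attribute."""
    tier1 = macro_f1(gold, pred, sorted(set(gold)))
    tier2 = f1_for_class(gold, pred, na)
    kept = [(g, p) for g, p in zip(gold, pred) if g != na]
    gold3 = [g for g, _ in kept]
    pred3 = [p for _, p in kept]
    tier3 = macro_f1(gold3, pred3, sorted(set(gold3)))
    return tier1, tier2, tier3


# Toy labels for an "outer fabric" style attribute.
gold = ["cotton", "denim", "NA", "NA", "cotton", "denim"]
pred = ["cotton", "denim", "cotton", "NA", "cotton", "NA"]
tier1, tier2, tier3 = three_tier(gold, pred)
print(round(tier1, 3), round(tier2, 3), round(tier3, 3))
```

Reporting the tiers separately shows whether a low overall score comes from misjudging applicability (Tier 2) or from confusing the actual attribute values (Tier 3).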


[34] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV | cs.AIPDF

Chenglin Li, Qianglong Chen, Feng Han, Yikun Wang, Xingxi Yin

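TL;DR: VideoThinker is an agentic Video Large Language Model built with LLM-guided tool reasoning. Trained on synthetic tool-interaction trajectories, it addresses the information loss caused by static reasoning in long-video understanding and enables dynamic reasoning and adaptive temporal exploration.

Details

Motivation: Existing Video Large Language Models sample frames uniformly, which weakens temporal localization and loses information in long videos. Agentic tools such as temporal retrieval and spatial zoom allow adaptive exploration of key moments, but building agentic video-understanding data runs into a circular dependency, because it requires models that already have strong long-form video comprehension.

Result: On long-video benchmarks, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines, demonstrating the effectiveness of tool-augmented synthetic data and of adaptive retrieval-and-zoom reasoning.

Insight: The key idea is to convert videos into rich captions, use a powerful agentic language model to generate multi-step tool-use sequences in caption space, and then ground the trajectories back to video by replacing captions with the corresponding frames. This yields a large-scale interleaved video and tool-reasoning dataset without requiring long-form video understanding from the underlying model.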

Abstract: Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.


[35] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation cs.CVPDF

Shadi Alijani, Fereshteh Aghaee Meibodi, Homayoun Najjaran

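TL;DR: This paper proposes a new framework for adapting foundation models to multi-modal medical imaging. Its two core innovations are sub-region-aware modality attention and adaptive prompt engineering. The framework targets poor multi-modal information fusion and the heterogeneity of pathological tissue, and is validated on the BraTS 2020 brain tumor segmentation dataset.

Details

Motivation: Foundation models adapt poorly to multi-modal medical imaging; in particular, existing models fall short at fusing information from multiple sources and at handling the heterogeneity of pathological tissue.

Result: On the BraTS 2020 brain tumor segmentation dataset, the method significantly outperforms baseline methods, especially on the challenging necrotic core sub-region.

Insight: The novelty lies in sub-region-aware modality attention (learning the optimal combination of modalities for each tumor sub-region) and an adaptive prompting strategy (leveraging the inherent capabilities of foundation models to refine segmentation accuracy), which together provide a principled and effective approach to multi-modal fusion and prompting for foundation-model-based medical imaging.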

Abstract: The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.


[36] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps cs.CVPDF

Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi

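TL;DR: LL-GaussianMap is a zero-shot low-light image enhancement method and the first to bring 2D Gaussian Splatting (2DGS), an explicit scene representation, into low-level vision. It enhances images through a 2DGS-guided gain map generation process, needs no paired data, and preserves edges while suppressing artifacts.

Details

Motivation: Most existing low-light enhancement methods operate in the pixel domain or rely on implicit feature representations, neglecting the intrinsic geometric structural priors of images. 2DGS has superior structural fitting capability but remains unexplored in low-level vision tasks; this paper aims to fill that gap.

Result: Experiments show that LL-GaussianMap achieves superior low-light enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.

Insight: The novelty lies in incorporating the structural perception of 2DGS into gain map generation and in using unsupervised learning to avoid reliance on paired data, offering a new way to bring explicit geometric representations into low-level vision tasks.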

Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.


[37] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video cs.CVPDF

Pascal Benschop, Justin Dauwels, Jan van Gemert

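TL;DR: This paper introduces a synthetic video benchmark for evaluating the situational awareness (distinguishing harmful from benign interactions) and spatial awareness (tracking roles and reasoning about relative positions and motion) of vision language models (VLMs). Using minimal video pairs, it tests three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment.

Details

Motivation: Current VLMs remain fragile when semantic understanding hinges on subtle temporal or geometric cues, so their situational and spatial reasoning needs systematic evaluation.

Result: Recent VLMs evaluated in a training-free setting perform only slightly above chance across tasks. Simple stable color cues partly reduce assailant role confusion but do not resolve the underlying weakness.

Insight: The contribution is a reproducible, diagnostic synthetic benchmark that points to lightweight spatial priors as a complement to large-scale pretraining. It provides a quantitative framework for assessing the fragility of spatial reasoning in these models.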

Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.


[38] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks cs.CV | cs.AIPDF

Mustafa Yurdakul, Enes Ayan, Fahrettin Horasan, Sakir Tasdemir

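TL;DR: This study develops a mobile application based on convolutional neural networks (CNNs) that recognizes different types of flowers, giving non-specialists convenient access to flower identification. It compares three CNN models (MobileNet, DenseNet121, and Xception) trained with seven optimization algorithms and finds that DenseNet-121 with stochastic gradient descent (SGD) performs best on the flower classification task.

Details

Motivation: Flowers have many everyday uses, but identifying flower types usually requires expert knowledge, and access to experts at any time and place is not always feasible. The study therefore aims to build a mobile application that uses deep learning to give non-specialists a quick and convenient flower identification tool.

Result: DenseNet-121 with the SGD optimizer achieved the best performance on flower classification: 95.84% accuracy and 96.00% precision, recall, and F1-score. This indicates that CNN models can deliver high accuracy for flower classification in mobile applications.

Insight: The contribution is a systematic evaluation of several CNN models (MobileNet, DenseNet121, and Xception) combined with different optimization algorithms for mobile flower recognition, confirming the effectiveness of DenseNet-121 with SGD. The study offers a practical reference for deploying lightweight deep learning models on mobile devices and underlines the importance of the model and optimizer combination.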

Abstract: A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems, where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers to provide non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet-121 architecture, which uses the stochastic gradient descent (SGD) optimization algorithm, was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.


[39] PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis cs.CVPDF

Yifan Chen, Fei Yin, Hao Chen, Jia Wu, Chao Li

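TL;DR: This paper presents PMPBench, the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. It includes complete dynamic contrast-enhanced (DCE) MR sequences and paired non-contrast and contrast-enhanced CT (CTC) data. On top of the dataset, the paper establishes a comprehensive benchmark of contemporary image-to-image translation methods, aiming to advance safe and effective contrast synthesis and improve multi-organ oncology imaging workflows.

Details

Motivation: Medical image synthesis research is constrained by data limitations: public datasets focus mostly on brain-related paired MR modalities; other collections suffer from incomplete pairing, missing modalities or timestamps, imperfect spatial alignment, and unclear labels; and substantial resources remain private. This holds back AI-based image translation (for example, synthesizing contrast-enhanced images from non-contrast scans), which could reduce contrast-agent side effects and streamline clinical workflows.

Result: The paper establishes a benchmark on PMPBench and reports results for representative contemporary image-to-image translation baselines, providing a rigorous evaluation framework for 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs).

Insight: The main contribution is the first public, fully paired, multi-organ pan-cancer benchmark dataset for medical image synthesis, which addresses the limitations of existing data in pairing completeness, modality coverage, and spatial alignment. It is a key resource for systematically evaluating and advancing medical image synthesis, especially contrast-enhanced image generation, with direct relevance to clinical multi-organ oncology imaging.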

Abstract: Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient’s health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.


[40] Understanding the Transfer Limits of Vision Foundation Models cs.CV | cs.AIPDF

Shiqi Huang, Yipei Wang, Natasha Thorley, Alexander Ng, Shaheer Saeed

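TL;DR: This paper studies why vision foundation models (VFMs) improve unevenly across downstream tasks and attributes it to a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Evaluating two VFMs, the MAE-based reconstruction model ProFound and the contrastive-learning-based model ProViCNet, on five prostate multiparametric MR imaging tasks, it finds that better alignment between pretraining and the downstream task (measured by simple divergence metrics such as MMD between the same features before and after fine-tuning) goes with larger performance gains and faster convergence.

Details

Motivation: Despite substantial computational investment, VFMs perform unevenly on downstream tasks. The authors hypothesize that this stems from a mismatch between pretraining objectives (such as masked image reconstruction or contrastive learning) and the specific requirements of downstream tasks (such as segmentation, classification, or image synthesis).

Result: Across five prostate multiparametric MR imaging tasks, better alignment between pretraining and the downstream task (measured by MMD between features before and after fine-tuning) goes with larger performance improvements and faster convergence, which underscores that pretraining objectives should be designed with downstream applicability in mind.

Insight: The contribution is empirical evidence that the transfer limits of VFMs stem from misalignment between pretraining and downstream tasks, together with the proposal to quantify alignment with simple divergence metrics such as MMD, giving practical guidance for designing and analyzing pretraining objectives.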

Abstract: Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
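As a rough illustration of the divergence metric named in the abstract, the sketch below computes a biased estimate of squared MMD with a Gaussian (RBF) kernel between two sets of feature vectors, such as features extracted from the same inputs before and after fine-tuning. The kernel choice, the bandwidth, and the function names are assumptions; the abstract does not state which kernel or estimator the authors use.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(features_a, features_b, bandwidth=1.0):
    """Biased estimate of squared MMD between two lists of feature vectors."""

    def mean_kernel(left, right):
        total = sum(rbf_kernel(u, v, bandwidth) for u in left for v in right)
        return total / (len(left) * len(right))

    return (
        mean_kernel(features_a, features_a)
        + mean_kernel(features_b, features_b)
        - 2.0 * mean_kernel(features_a, features_b)
    )


# Toy features: "before" and "after" fine-tuning (made-up numbers).
before = [[0.0, 0.0], [1.0, 0.0]]
after_small_shift = [[0.1, 0.0], [1.1, 0.0]]
after_large_shift = [[5.0, 5.0], [6.0, 5.0]]

small = mmd_squared(before, after_small_shift)
large = mmd_squared(before, after_large_shift)
```

Under this reading, a small value means fine-tuning barely moved the features, which the abstract associates with better pretraining and downstream alignment.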


[41] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture cs.CVPDF

Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav

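TL;DR: This paper presents RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns visual representations from unlabeled chest X-ray images without language supervision. It is pre-trained by predicting latent representations of masked image regions and evaluated on disease classification, semantic segmentation, and report generation.

Details

Motivation: Existing medical vision language models rely on supervision from paired image-text data, which is limited in availability. The paper asks whether robust radiology encoders can be learned without language supervision.

Result: Across benchmarks, RadJEPA exceeds state-of-the-art approaches, including Rad-DINO, on disease classification, semantic segmentation, and report generation.

Insight: The novelty lies in using a Joint Embedding Predictive Architecture that learns by predicting latent representations of masked regions rather than aligning global representations across views or modalities, which offers a new direction for medical image representation learning without language supervision.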

Abstract: Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.


[42] ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling cs.CVPDF

Zhaoqi Su, Shihai Chen, Xinyan Lin, Liqin Huang, Zhipeng Su

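TL;DR: This paper proposes ThermoSplat, a cross-modal 3D Gaussian Splatting framework that fuses RGB and thermal infrared data for robust multi-spectral scene reconstruction. Through cross-modal feature modulation and adaptive geometry decoupling, it addresses the difficulty existing methods have in exploiting complementary multi-modal information and handling the complex discrepancies between spectra.

Details

Motivation: Integrating RGB and thermal infrared data enables robust environmental perception across lighting and weather conditions, but extending 3D Gaussian Splatting to multi-spectral scenes is difficult. Existing methods either fail to fully exploit the complementary information in multi-modal data or cannot adaptively handle the complex structural correlations and physical discrepancies between spectra.

Result: Extensive experiments on the RGBT-Scenes dataset show that ThermoSplat achieves state-of-the-art rendering quality in both the visible and thermal spectra.

Insight: The innovations are: 1) a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors; 2) a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and runs an independent rasterization pass for the thermal branch; 3) a hybrid rendering pipeline that combines explicit Spherical Harmonics with implicit neural decoding to ensure semantic consistency and preserve high-frequency detail.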

Abstract: Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation. Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.


[43] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models cs.CVPDF

Zhen Zhang, Runhao Zeng, Sicheng Zhao, Xiping Hu

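TL;DR: This paper presents a systematic mechanistic study of affective modeling in multimodal foundation models and finds that affective adaptation localizes mainly to the feed-forward gating projection (gate_proj) rather than the attention module. Module transfer, single-module adaptation, and ablation experiments show that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Tuning only about 24.5% of the parameters tuned by AffectGPT reaches 96.6% of its average performance, a substantial gain in parameter efficiency.

Details

Motivation: Although existing affective models perform well empirically, where and how emotions are represented in large-scale foundation models remains unclear. In multimodal affective settings in particular, the internal architectural mechanisms that support affective understanding and generation are poorly understood.

Result: Across multiple architectures, training strategies, and affective tasks, the approach tunes only about 24.5% of the parameters tuned by AffectGPT yet achieves 96.6% of its average performance across eight affective tasks, highlighting its parameter efficiency.

Insight: The contribution is showing that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identifying gate_proj as a central architectural locus of affective modeling. This gives a new view of internal affective representations and may guide more efficient affective model design.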

Abstract: Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (gate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify gate_proj as a central architectural locus of affective modeling.
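Single-module adaptation of this kind amounts to marking only the gate_proj weights as trainable. The sketch below shows the selection step on a plain mapping from parameter names to parameter counts. It is an illustration under stated assumptions, not the authors' code: the parameter names follow a LLaMA-style naming pattern chosen for the example, the sizes are made up, and `select_trainable` is a hypothetical helper.

```python
def select_trainable(param_sizes, keyword="gate_proj"):
    """Return the names to tune and the fraction of all parameters they cover.

    param_sizes maps a parameter name to its number of elements.
    """
    trainable = {name for name in param_sizes if keyword in name}
    total = sum(param_sizes.values())
    tuned = sum(param_sizes[name] for name in trainable)
    fraction = tuned / total if total else 0.0
    return trainable, fraction


# Hypothetical single-layer model; names and sizes are illustrative only.
toy_model = {
    "layers.0.self_attn.q_proj.weight": 16,
    "layers.0.mlp.gate_proj.weight": 32,
    "layers.0.mlp.up_proj.weight": 32,
    "layers.0.mlp.down_proj.weight": 32,
}

trainable_names, tuned_fraction = select_trainable(toy_model)
```

With a PyTorch model, the same name test could be applied while iterating over `model.named_parameters()` and setting each parameter's `requires_grad` flag accordingly. Note that the 24.5% figure in the abstract is relative to the parameters tuned by AffectGPT, not to the full model, so the fraction computed here is a different quantity.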


[44] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars cs.CV | cs.HCPDF

Yarin Benyamin

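TL;DR: Motivated by real-time emotion recognition in Virtual Reality (VR), particularly for social skills training for individuals with Autism Spectrum Disorder (ASD), this paper benchmarks off-the-shelf deep learning models. It focuses on zero-shot Facial Expression Recognition (FER) on virtual characters, evaluating Medium and Nano variants of YOLO (v8, v11, and v12) for face detection alongside general-purpose Vision Transformers (CLIP, SigLIP, and ViT-FER). With CPU-only inference, face detection on stylized avatars is robust (100% accuracy), but a “Latency Wall” appears in the classification stage: general-purpose Transformers cannot meet the real-time requirement (latency below 140 ms) and the accuracy requirement at the same time.

Details

Motivation: In VR and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for helping individuals with ASD improve social skills, but the task requires a strict latency-accuracy trade-off (motion-to-photon latency below 140 ms to maintain contingency). Most off-the-shelf deep learning models prioritize accuracy and overlook the strict timing constraints of commodity hardware, so the feasibility of existing models for real-time VR therapy needs to be assessed.

Result: On the UIBVFED dataset, face detection on stylized avatars reaches 100% accuracy, with YOLOv11n offering the best balance in the detection stage (about 54 ms latency). General-purpose Transformers such as CLIP and SigLIP, however, fail to achieve viable accuracy (<23%) or speed (>150 ms) in the classification stage and cannot meet real-time loop requirements.

Insight: The novelty lies in systematically benchmarking, for the first time, off-the-shelf models against the real-time requirements of emotion recognition on virtual characters, exposing the “Latency Wall” in the classification stage. The study underlines that domains such as therapy need lightweight, domain-specific architectures rather than general-purpose models to make real-time AI accessible.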

Abstract: In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a “Latency Wall” exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
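The latency arithmetic behind the “Latency Wall” can be made explicit with a simple budget check. This is a minimal sketch under two assumptions that the abstract does not state: the detection and classification stages run sequentially, and the rest of the rendering pipeline costs nothing. The function names and the 60 ms classifier figure are hypothetical.

```python
MTP_BUDGET_MS = 140.0  # motion-to-photon bound quoted in the abstract


def pipeline_latency_ms(stage_latencies_ms):
    """Total latency of stages that run one after another."""
    return sum(stage_latencies_ms.values())


def fits_budget(stage_latencies_ms, budget_ms=MTP_BUDGET_MS):
    """True if the whole pipeline stays strictly below the latency budget."""
    return pipeline_latency_ms(stage_latencies_ms) < budget_ms


# Figures from the abstract: ~54 ms detection, >150 ms classification.
reported = {"detect_yolov11n": 54.0, "classify_transformer": 150.0}

# Hypothetical lightweight classifier, to show what would fit the budget.
hypothetical = {"detect_yolov11n": 54.0, "classify_lightweight": 60.0}

reported_ok = fits_budget(reported)
hypothetical_ok = fits_budget(hypothetical)
```

Even before detection is counted, a classifier slower than 150 ms exceeds the 140 ms budget on its own, which is why the abstract argues for lightweight, domain-specific models.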


[45] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery cs.CVPDF

Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman

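TL;DR: This paper proposes a robust multi-view pipeline for 3D hand pose estimation in surgical settings and releases a new surgical benchmark dataset with over 68,000 frames and 3,000 manually annotated 2D hand poses. The method requires no domain-specific fine-tuning and uses only off-the-shelf pretrained models, integrating person detection, whole-body pose estimation, 2D hand keypoint prediction, and constrained 3D optimization.

Details

Motivation: Accurate 3D hand pose estimation in surgical environments is challenging because of intense localized lighting, frequent occlusions by instruments or staff, uniform hand appearance due to gloves, and a scarcity of reliable annotated training datasets.

Result: Quantitative experiments show that the method outperforms baselines, reducing 2D mean joint error by 31% and 3D mean per-joint position error by 76%.

Insight: The contribution is a robust multi-view pipeline built entirely on off-the-shelf models with no domain fine-tuning, plus a comprehensive annotated dataset recorded in a replica operating room under varying levels of scene complexity, which together establish a strong baseline for surgical computer vision research.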

Abstract: Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
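For readers unfamiliar with the reported metric, the sketch below computes mean per-joint position error (MPJPE) as the average Euclidean distance between predicted and ground-truth joints, together with the relative reduction used to state results such as "76% lower". The function names and toy coordinates are assumptions for illustration; the paper's exact evaluation protocol (units, alignment, joint set) is not given in the abstract.

```python
import math


def mpjpe(predicted_joints, ground_truth_joints):
    """Mean Euclidean distance between corresponding joints."""
    if not predicted_joints or len(predicted_joints) != len(ground_truth_joints):
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [
        math.dist(p, g) for p, g in zip(predicted_joints, ground_truth_joints)
    ]
    return sum(distances) / len(distances)


def relative_reduction(baseline_error, method_error):
    """Fraction by which the method lowers the baseline error."""
    return (baseline_error - method_error) / baseline_error


# Toy example with two joints (made-up coordinates).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted, ground_truth)  # distances 3 and 4, mean 3.5
```

The same function applies to 2D keypoints, since `math.dist` accepts points of any matching dimension.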


[46] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models cs.CV | cs.AIPDF

Chak-Wing Mak, Guanyu Zhu, Boyi Zhang, Hongji Li, Xiaowei Chi

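TL;DR: This paper proposes PhysicsMind, a benchmark for evaluating the physical reasoning and prediction abilities of Multimodal Large Language Models and video world models. It covers both real and simulated environments, focuses on three physical principles (Center of Mass, Lever Equilibrium, and Newton’s First Law), and uses visual question answering and video generation tasks to test whether models obey physical laws.

Details

Motivation: Existing benchmarks fall short in assessing how well models understand physical laws: they mostly rely on synthetic data or visual question answering templates, or measure perceptual video quality that is unrelated to physical laws. Closing this gap requires a unified benchmark that systematically evaluates law-consistent reasoning and generation.

Result: A range of recent models and video generation models evaluated on PhysicsMind tend to rely on appearance heuristics and often violate basic mechanics, indicating that current scaling and training are still insufficient for robust physical understanding.

Insight: The contribution is a unified physical reasoning benchmark that combines real and simulated environments, concentrates on core mechanics principles, and evaluates models through both VQA and video generation tasks, providing a focused testbed for physics-aware multimodal models.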

Abstract: Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton’s First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.


[47] Keyframe-Based Feed-Forward Visual Odometry cs.CV | cs.ROPDF

Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, Wanzeng Kong

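TL;DR: This paper proposes a keyframe-based feed-forward visual odometry method that uses reinforcement learning to select keyframes adaptively, addressing the redundant computation and degraded performance that arise when existing methods based on visual foundation models process every frame.

Details

Motivation: Current visual odometry methods based on visual foundation models (such as VGGT-Long) process raw image sequences indiscriminately, which causes computational redundancy and degraded performance due to low inter-frame parallax, while traditional geometric heuristics are hard to integrate into these models.

Result: Trained on the TartanAir dataset and evaluated on several real-world datasets, the method achieves consistent and substantial improvements over state-of-the-art feed-forward visual odometry methods.

Insight: The novelty lies in using reinforcement learning to learn an adaptive keyframe selection policy in a data-driven way instead of relying on hand-crafted rules, aligning selection with the intrinsic characteristics of the foundation model and improving efficiency and accuracy while keeping the benefits of a single feed-forward network.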

Abstract: The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.


[48] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models cs.CV | cs.ROPDF

Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan

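TL;DR: This paper proposes Distracting Token Pruning (DTP), a simple yet effective plug-and-play framework for Vision-Language Action (VLA) models. It dynamically detects and prunes ‘distracting tokens’ in task-irrelevant image regions to correct the model’s visual attention patterns, improving task success rates and probing the model’s performance upper bound without altering its architecture or adding inputs.

Details

Motivation: VLA models show strong capability in robotic manipulation, but by default they may attend too much to tokens in task-irrelevant image regions (‘distracting tokens’), which interferes with generating the desired action tokens at each step and affects task success rates.

Result: Experiments on the SIMPLER benchmark show that the method consistently improves task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between task success rate and the amount of attention in task-irrelevant regions for all models tested.

Insight: The contribution is a lightweight framework that dynamically prunes distracting tokens without modifying the model architecture, correcting attention bias in VLA models. The study also shows that attention dispersion is common in VLA models and negatively correlated with performance, which could guide future research.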

Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as ‘distracting tokens’. This behavior can distract the model from generating the desired action tokens at each step, affecting the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model’s visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.


[49] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models cs.CV | cs.DCPDF

Hanwen Zhang, Qiaojin Shen, Yuxi Liu, Yuesheng Zhu, Guibo Luo

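TL;DR: DSFedMed is a dual-scale federated medical image segmentation framework that uses mutual knowledge distillation between a central foundation model and lightweight client models to improve segmentation performance while reducing computation and communication overhead.

Details

Motivation: Deploying foundation models in federated learning settings is hindered by high computational demands, substantial communication overhead, and high inference costs. The paper aims to address these challenges to enable efficient, scalable medical image segmentation.

Result: On five medical image segmentation datasets, DSFedMed improves the Dice score by 2% on average over existing federated foundation model baselines while cutting communication costs and inference time by nearly 90%.

Insight: The innovations include a dual-scale mutual distillation mechanism, generated high-quality medical images that replace real public datasets, and a learnability-guided sample selection strategy, which together balance model performance and resource efficiency.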

Abstract: Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
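Mutual distillation can be pictured as two losses computed on the same input, one in each direction. The sketch below uses temperature-softened softmax outputs and KL divergence, which is a common choice for knowledge distillation and is swapped in here as an assumption: the abstract does not specify the loss DSFedMed uses, and the function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_distillation_losses(foundation_logits, client_logits, temperature=2.0):
    """Return (client_loss, foundation_loss) for one prediction.

    client_loss pulls the lightweight client toward the foundation model;
    foundation_loss pulls the foundation model toward the client.
    """
    p_foundation = softmax(foundation_logits, temperature)
    p_client = softmax(client_logits, temperature)
    client_loss = kl_divergence(p_foundation, p_client)
    foundation_loss = kl_divergence(p_client, p_foundation)
    return client_loss, foundation_loss


# Per-pixel class logits for one pixel (made-up numbers).
client_loss, foundation_loss = mutual_distillation_losses(
    [2.0, 0.5, -1.0], [1.0, 1.0, 0.0]
)
```

In a segmentation setting these losses would be averaged over pixels and, per the abstract, computed on generated images rather than a real public dataset.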


[50] Masked Modeling for Human Motion Recovery Under Occlusions cs.CVPDF

Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang

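TL;DR: This paper proposes MoRo, a generative framework based on masked modeling that recovers occluded human motion from monocular RGB video. It formulates motion reconstruction as a video-conditioned task and fuses a trajectory-aware motion prior, an image-conditioned pose prior, and a video-conditioned masked transformer to recover motion efficiently and robustly under occlusion, consistently in a global coordinate system.

Details

Motivation: Existing regression-based methods are fragile to missing observations such as occlusions, while optimization- and diffusion-based methods are more robust but slow at inference and require heavy preprocessing. This work aims to overcome these limitations with a motion recovery method that handles occlusion and supports efficient end-to-end inference.

Result: Extensive experiments on the EgoBody and RICH benchmarks show that MoRo substantially outperforms existing SOTA methods in accuracy and motion realism under occlusion, performs on par in non-occluded scenarios, and reaches real-time inference at 70 FPS on a single H200 GPU.

Insight: The core innovation is bringing the generative masked modeling paradigm to motion recovery, which handles occlusion naturally. A cross-modality learning scheme learns and fuses multi-modal priors (motion, pose, visual dynamics) from heterogeneous datasets (MoCap, image-pose, video-motion), overcoming the scarcity of paired video-motion data and enabling robust, efficient end-to-end inference.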

Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.


[51] SAMTok: Representing Any Mask with Two Words cs.CVPDF

Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian

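TL;DR: This paper presents SAMTok, a mask tokenizer that encodes any region mask as two discrete special tokens, so that multimodal large language models (MLLMs) can learn and generate pixel-level masks through standard next-token prediction and simple reinforcement learning, without architectural changes or specialized loss design. Built on SAM2, it is trained on 209M diverse masks. A QwenVL model fine-tuned on 5M SAMTok-formatted samples reaches SOTA or comparable performance on several pixel-level tasks, including region captioning, region VQA, referring segmentation, scene graph parsing, and interactive segmentation.

Details

Motivation: Pixel-wise MLLMs are hard to scale because they typically rely on complex region-level encoders, specialized segmentation decoders, and incompatible training objectives.

Result: The model attains state-of-the-art (SOTA) or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. Reinforcement learning delivers substantial gains on the GRES and GCG benchmarks.

Insight: The core idea is to treat masks as new language tokens (two discrete tokens), which unifies pixel-level tasks under the standard autoregressive generation framework of MLLMs and yields a scalable way to add pixel-level capabilities without architectural changes. A mask encoder and a residual vector quantizer produce compact, information-rich discrete representations, and a textual answer-matching reward makes optimizing mask generation efficient.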

Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
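
How a residual vector quantizer turns one embedding into two tokens can be sketched as follows. This is generic two-stage residual quantization with tiny made-up codebooks, not SAMTok's trained tokenizer; the codebook values and function names are assumptions for illustration.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

tokens = rvq_encode([1.2, 0.8], [coarse, fine])
recon = rvq_decode(tokens, [coarse, fine])
print(tokens, recon)  # two token ids and the reconstructed vector
```

The second token refines what the first one missed, which is why two tokens can carry more detail than a single codebook of the same size.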


[52] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing cs.CV | cs.CL | cs.IRPDF

Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long

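TL;DR: This paper proposes EDIR, a new fine-grained benchmark for composed image retrieval (CIR). It uses image editing to control modification types and content precisely, producing 5,000 high-quality queries across 5 main categories and 15 subcategories. A comprehensive evaluation of 13 multimodal embedding models reveals significant capability gaps across subcategories for existing SOTA models and exposes limitations of existing benchmarks, such as modality bias and insufficient category coverage.

Details

Motivation: Current CIR benchmarks cover a limited set of query categories and fail to reflect the diverse requirements of real-world scenarios, leaving an evaluation gap.

Result: Thirteen models were evaluated on EDIR, including SOTA models such as RzenEmbed and GME. They do not perform consistently across subcategories, and significant capability gaps remain. An in-domain training experiment further separates task categories that can be solved with targeted data from those that expose intrinsic limitations of current model architectures.

Insight: The novelty lies in using image editing to build a fine-grained, controllable CIR benchmark that offers a more comprehensive evaluation framework. The analysis exposes inherent flaws in existing benchmarks (such as modality bias) and the real limits of model capability, pointing to directions for future research.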

Abstract: Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.


[53] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion cs.CVPDF

Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier

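TL;DR: This paper proposes ActionMesh, a generative model based on temporal 3D diffusion that produces production-ready animated 3D meshes in a feed-forward manner. It modifies existing 3D diffusion models to include a temporal axis: it first generates a sequence of synchronized latents representing time-varying, independent 3D shapes, then uses a temporal 3D autoencoder to convert these independent shapes into deformations of a predefined reference shape, which yields the animation. The model generates animations from several kinds of input, including a monocular video, a text description, or a 3D mesh with a text prompt describing its animation, and it is fast, rig-free, and topology-consistent.

Details

Motivation: Existing methods for generating animated 3D objects typically suffer from restricted setups, long runtimes, or limited quality, which makes them hard to use in practice. A method is needed that quickly generates high-quality, production-ready animated 3D meshes.

Result: Evaluated on standard video-to-4D benchmarks (Consistent4D, Objaverse), the model reaches state-of-the-art (SOTA) performance in both geometric accuracy and temporal consistency, showing that it can generate animated 3D meshes with unprecedented speed and quality.

Insight: The key innovation is introducing the temporal dimension into 3D diffusion models, in a framework called "temporal 3D diffusion". Generating synchronized latent sequences and converting the sequence of independent shapes into deformations of a reference shape with a temporal 3D autoencoder gives efficient, topology-consistent animation that supports downstream uses such as texturing and retargeting.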

Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes “in action” in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed “temporal 3D diffusion”. Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.


[54] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval cs.CV | cs.IRPDF

Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai

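TL;DR: This paper proposes HVD, a text-video retrieval model inspired by human visual cognition. It mimics the coarse-to-fine process of human visual perception with a frame feature selection module and a patch feature compression module, addressing the "blind" feature interaction that existing methods suffer from because textual queries are sparse, and it achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing CLIP-based text-video retrieval methods suffer from "blind" feature interaction: the model struggles to separate key visual information from background noise, which stems from the sparsity of textual queries.

Result: Extensive experiments on five benchmarks show that HVD captures human-like visual focus and achieves state-of-the-art performance.

Insight: Inspired by human cognitive behavior, the method builds a coarse-to-fine alignment mechanism: frame selection mimics macro-perception to remove temporal redundancy, and patch feature compression mimics micro-perception to enable precise entity-level matching.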

Abstract: The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from “blind” feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
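
The coarse, frame-level step can be pictured with a small sketch. The abstract does not specify how FFSM scores frames, so ranking frames by cosine similarity to the text embedding and keeping the top k is an assumption made here for illustration. The function names and toy features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_key_frames(frame_feats, text_feat, k):
    """Keep the k frames most similar to the text query, in temporal order."""
    scores = [cosine(f, text_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy 2-D features standing in for CLIP-style embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
text = [1.0, 0.0]
print(select_key_frames(frames, text, k=2))  # frames 0 and 2 match the query best
```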


[55] 360Anything: Geometry-Free Lifting of Images and Videos to 360° cs.CVPDF

Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker

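TL;DR: This paper proposes 360Anything, a geometry-free framework for lifting perspective images and videos to 360° panoramas. Built on pre-trained diffusion transformers, it treats the perspective input and the panorama target as token sequences and learns the perspective-to-equirectangular mapping in a purely data-driven way, without camera information.

Details

Motivation: Existing methods rely on explicit geometric alignment between the perspective and equirectangular projections, which requires known camera metadata and limits their use on in-the-wild data where calibration is absent or noisy. This work removes the dependence on camera information to enable more general panorama generation.

Result: 360Anything achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior work that uses ground-truth camera information. It also shows competitive results on zero-shot camera field-of-view and orientation estimation benchmarks.

Insight: The contributions are: 1) a geometry-free framework that learns the mapping from data; 2) tracing the seam artifacts at equirectangular boundaries to zero-padding in the VAE encoder, and introducing Circular Latent Encoding for seamless generation; 3) evidence of the model's deep geometric understanding and broader utility in computer vision tasks.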

Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything’s deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
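
Why zero-padding produces a seam can be seen on a single row of an equirectangular image, whose left and right edges are neighbors on the sphere. The sketch below contrasts zero padding with wraparound padding. It illustrates the general idea only and is not the paper's Circular Latent Encoding, whose details the abstract does not give.

```python
def pad_row(row, pad, mode):
    """Pad a 1-D row on both sides with zeros or with wrapped-around values."""
    if not 0 < pad <= len(row):
        raise ValueError("pad must be between 1 and len(row)")
    if mode == "zero":
        return [0] * pad + list(row) + [0] * pad
    if mode == "circular":
        # The left border sees the right end of the row, and vice versa.
        return list(row[-pad:]) + list(row) + list(row[:pad])
    raise ValueError("unknown mode: " + mode)


row = [1, 2, 3, 4]
print(pad_row(row, 1, "zero"))      # [0, 1, 2, 3, 4, 0]
print(pad_row(row, 1, "circular"))  # [4, 1, 2, 3, 4, 1]
```

With zero padding, a convolution at the border mixes real content with artificial zeros, so the two edges are encoded differently even though they are adjacent on the panorama. Wraparound padding gives both edges their true neighbors.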


[56] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders cs.CVPDF

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma

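TL;DR: This paper studies whether Representation Autoencoders (RAEs) scale to large-scale, freeform text-to-image (T2I) generation. After scaling the RAE decoder on diverse data and simplifying its design, the authors find that RAEs consistently outperform the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters, with faster convergence, better generation quality, and greater stability during finetuning.

Details

Motivation: The goal is to determine whether the RAE framework, which has shown advantages on ImageNet, extends to large-scale, freeform text-to-image generation, and to test its advantage over variational autoencoders (VAEs).

Result: In controlled comparisons across diffusion transformer scales from 0.5B to 9.8B parameters, RAEs consistently outperform VAEs during pretraining. During finetuning on high-quality datasets, VAE-based models overfit catastrophically after 64 epochs, while RAE-based models remain stable through 256 epochs and achieve consistently better performance. RAE models show faster convergence and better generation quality in all experiments.

Insight: The claimed contribution is scaling and simplifying the RAE framework for large-scale T2I generation and demonstrating that it is simpler and stronger than VAEs. The main takeaways are: 1) scaling simplifies the RAE framework: only dimension-dependent noise scheduling remains critical, while complex architectural designs bring little benefit; 2) RAEs are markedly better than VAEs in training stability and resistance to overfitting; 3) visual understanding and generation can operate in a shared representation space, opening new possibilities for unified models.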

Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.


[57] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation cs.CV | cs.AIPDF

Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang

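TL;DR: PyraTok is a language-aligned pyramidal tokenizer for video understanding and generation. It learns semantically structured discrete latents across multiple spatiotemporal resolutions and quantizes encoder features with a shared large binary codebook, producing compact yet expressive video token sequences. Across ten benchmarks it achieves state-of-the-art video reconstruction, improves text-to-video quality, and sets new zero-shot SOTA results on video segmentation, temporal action localization, and video understanding.

Details

Motivation: Existing discrete video VAE tokenizers typically learn visual codebooks at a single scale, with limited vocabularies and shallow language supervision, which leads to poor cross-modal alignment and zero-shot transfer. PyraTok addresses this with multi-resolution, semantically structured discrete latents and tight language alignment.

Result: Across ten benchmarks, PyraTok achieves state-of-the-art video reconstruction, improves text-to-video generation quality, and sets new zero-shot SOTA results on video segmentation, temporal action localization, and video understanding, scaling up to 4K/8K resolutions.

Insight: The contributions include the language-aligned pyramidal quantization module, which discretizes at multiple scales with a shared large binary codebook, and the joint optimization of multi-scale text-guided quantization with a global autoregressive objective over the token hierarchy, which strengthens the coupling between visual tokens and language. Together they provide a more effective cross-modal representation learning framework for video generation and understanding.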

Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.


[58] Why Can’t I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition cs.CV | cs.AIPDF

Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee

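TL;DR: This paper studies failure modes in zero-shot compositional action recognition (ZS-CAR) and finds that existing models fail to generalize to unseen verb-object combinations because of object-driven verb shortcuts. The authors propose RCORE, a framework that enforces temporally grounded verb learning through composition-aware augmentation and a temporal order regularization loss.

Details

Motivation: Existing ZS-CAR models overfit to co-occurrence statistics because of severe data sparsity, skewed distributions, and the asymmetric learning difficulty between verbs and objects, so they gain no real benefit from compositional recognition.

Result: On two benchmarks, Sth-com and the newly constructed EK100-com, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps.

Insight: The paper identifies object-driven verb shortcuts as a key limiting factor and proposes a simple, effective framework that enforces temporally grounded verb learning through data augmentation that does not corrupt motion cues and explicit modeling of temporal structure.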

Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.


[59] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback cs.CVPDF

Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu

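TL;DR: This paper proposes CamPilot, which improves camera control in video diffusion models through efficient camera reward feedback learning. Its core is a camera-aware 3D decoder that decodes the video latent and camera pose into 3D Gaussians and quantifies the reward as pixel-level consistency between rendered novel views and ground-truth views. This addresses the limited assessment ability of existing reward models, the high computational cost, and the neglect of 3D geometric information.

Details

Motivation: Camera controllability in existing camera-controlled video diffusion models remains limited, and directly applying existing reward feedback learning methods faces three challenges: reward models that cannot assess video-camera alignment, the computational overhead of decoding RGB video, and the neglect of 3D geometric information.

Result: Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of the method.

Insight: The novelty is an efficient camera-aware 3D decoder in which the camera pose serves as both an input and a projection parameter. Because a mismatch between the video latent and the camera pose distorts the geometry of the 3D structure, the reward is defined by optimizing the consistency of rendered views, and a visibility term selectively supervises only deterministic regions, giving a more precise reward for camera control.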

Abstract: Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/
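
The visibility-masked reward can be sketched in a few lines. The exact loss and the way visibility is derived are not given in the abstract, so the negative mean squared error, the flattened-pixel representation, and the name `masked_consistency_reward` are assumptions for illustration.

```python
def masked_consistency_reward(rendered, target, visible):
    """Negative mean squared error over pixels flagged as visible.

    rendered, target: flattened pixel intensities (hypothetical inputs).
    visible: True where the region is deterministic and should be supervised.
    """
    errors = [(r - t) ** 2 for r, t, v in zip(rendered, target, visible) if v]
    if not errors:
        return 0.0  # nothing to supervise
    return -sum(errors) / len(errors)


rendered = [0.5, 0.2, 0.9]
target = [0.5, 0.4, 0.1]
visible = [True, True, False]  # the last pixel is excluded from the reward
print(masked_consistency_reward(rendered, target, visible))
```

The third pixel differs a lot from the target but does not affect the reward, which mirrors the idea of not penalizing regions whose content is inherently stochastic.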


cs.GR [Back]

[60] SplatBus: A Gaussian Splatting Viewer Framework via GPU Interprocess Communication cs.GR | cs.CVPDF

Yinghan Xu, Théo Morales, John Dingliana

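TL;DR: SplatBus is a viewer framework for 3D Gaussian Splatting (3DGS) based on GPU interprocess communication (IPC). It addresses the difficulty of integrating 3DGS into traditional mesh-based rendering pipelines (such as Unity, Blender, and Unreal Engine) by using NVIDIA's IPC APIs to integrate with external clients, supporting real-time rendering and interactive applications.

Details

Motivation: Current 3DGS methods achieve high-fidelity real-time rendering but are hard to integrate into traditional mesh-based rendering pipelines, which limits their use in interactive applications and artistic exploration.

Result: The software is implemented with NVIDIA's IPC APIs and lets rendering results be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The abstract reports no quantitative benchmarks or SOTA comparisons.

Insight: The novelty is using GPU interprocess communication to decouple 3DGS rendering from existing rendering engines, providing a flexible, easy-to-integrate software framework that can promote practical use of 3DGS in areas such as autonomous driving, robotics, and VR/XR.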

Abstract: Radiance field-based rendering methods have attracted significant interest from the computer vision and computer graphics communities. They enable high-fidelity rendering with complex real-world lighting effects, but at the cost of high rendering time. 3D Gaussian Splatting solves this issue with a rasterisation-based approach for real-time rendering, enabling applications such as autonomous driving, robotics, virtual reality, and extended reality. However, current 3DGS implementations are difficult to integrate into traditional mesh-based rendering pipelines, which is a common use case for interactive applications and artistic exploration. To address this limitation, this software solution uses Nvidia’s interprocess communication (IPC) APIs to easily integrate into implementations and allow the results to be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The code is available at https://github.com/RockyXu66/splatbus.


cs.RO [Back]

[61] DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning cs.RO | cs.CVPDF

Junha Lee, Eunha Park, Minsu Cho

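TL;DR: DextER is a language-driven dexterous grasp generation method that introduces contact-based embodied reasoning to understand task semantics, 3D geometry, and complex hand-object interactions. It first generates embodied contact tokens that specify where finger links touch the object surface, then grasp tokens that encode the hand configuration, reaching a 67.14% success rate on the DexGYS benchmark, 3.83 percentage points above the previous best method.

Details

Motivation: Existing language-driven dexterous grasp generation methods map observations directly to grasp parameters without intermediate reasoning about physical interactions, so an embodiment-aware intermediate representation is needed to bridge task semantics and physical constraints.

Result: On the DexGYS benchmark, DextER achieves a 67.14% success rate, 3.83 percentage points above the current state of the art (SOTA), with a 96.4% improvement in intention alignment.

Insight: The novelty is contact-based embodied reasoning as an intermediate representation: predicting where hand links contact the object surface connects semantics with physical constraints and enables steerable generation and fine-grained control.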

Abstract: Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.


eess.IV [Back]

[62] A Machine Vision Approach to Preliminary Skin Lesion Assessments eess.IV | cs.AI | cs.CV | cs.LGPDF

Ali Khreis, Ro’Yah Radaideh, Quinn McGill

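TL;DR: This study evaluates a preliminary skin lesion assessment system that combines the dermoscopic ABCD rule with machine learning. On a 1,000-image subset of the HAM10000 dataset, an automated rule-based pipeline computes the Total Dermoscopy Score (TDS), which is compared against traditional classifiers and deep learning models. The rule-based system is clinically interpretable but limited in performance, whereas a custom three-layer CNN trained from scratch reaches 78.5% accuracy and 86.5% recall under a specific preprocessing step, clearly outperforming traditional methods and pretrained models.

Details

Motivation: The study addresses the performance bottleneck that traditional handcrafted-feature methods (such as the ABCD rule) may face in early detection of malignant skin lesions, and explores how to apply machine learning effectively on small medical datasets.

Result: On the HAM10000 subset, the custom three-layer CNN reaches 78.5% accuracy and 86.5% recall on median-filtered images, a 19-point accuracy improvement over traditional methods (such as logistic regression, random forest, and SVM). Transfer learning with EfficientNet-B0 performs poorly because of the domain shift between natural and medical images.

Insight: The novelty is a comparative evaluation of a clinically interpretable rule system against machine learning, showing that for small, domain-specific medical datasets a lightweight custom CNN trained from scratch learns diagnostic patterns from pixels more effectively than handcrafted features or large pretrained models.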

Abstract: Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. This study evaluates a comprehensive system for preliminary skin lesion assessment that combines the clinically established ABCD rule of dermoscopy (analyzing Asymmetry, Borders, Color, and Dermoscopic Structures) with machine learning classification. Using a 1,000-image subset of the HAM10000 dataset, the system implements an automated, rule-based pipeline to compute a Total Dermoscopy Score (TDS) for each lesion. This handcrafted approach is compared against various machine learning solutions, including traditional classifiers (Logistic Regression, Random Forest, and SVM) and deep learning models. While the rule-based system provides high clinical interpretability, results indicate a performance bottleneck when reducing complex morphology to five numerical features. Experimental findings show that transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. In contrast, a custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods. The results demonstrate that direct pixel-level learning captures diagnostic patterns beyond handcrafted features and that purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets.
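
As background, the classical ABCD rule of dermoscopy combines its four criteria into the Total Dermoscopy Score as TDS = 1.3·A + 0.1·B + 0.5·C + 0.5·D, with the commonly cited cut-offs of 4.75 and 5.45. The sketch below applies that textbook formula. How the paper's automated pipeline derives the criterion scores from images is not described in the abstract, so the function names and example inputs are illustrative assumptions.

```python
def total_dermoscopy_score(asymmetry, border, color, structures):
    """Classical ABCD-rule TDS.

    asymmetry:  0-2 axes of asymmetry
    border:     0-8 segments with an abrupt border cut-off
    color:      1-6 colors present
    structures: 1-5 dermoscopic structures present
    """
    return 1.3 * asymmetry + 0.1 * border + 0.5 * color + 0.5 * structures


def interpret(tds):
    """Map a TDS to the commonly cited ABCD-rule categories."""
    if tds < 4.75:
        return "benign"
    if tds <= 5.45:
        return "suspicious"
    return "highly suspicious for melanoma"


score = total_dermoscopy_score(asymmetry=2, border=4, color=3, structures=3)
print(round(score, 2), interpret(score))
```

The score compresses each lesion into a handful of numbers, which is the interpretability benefit and also the bottleneck the abstract describes.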


[63] Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation eess.IV | cs.CVPDF

Shams Nafisa Ali, Taufiq Hasan

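TL;DR: This paper proposes Phi-SegNet, a medical image segmentation network that integrates phase-aware information at both the architectural and optimization levels. Bi-Feature Mask Former modules fuse features, Reverse Fourier Attention blocks refine decoder outputs with phase-regularized features, and a phase-aware loss emphasizes boundary precision. The method achieves state-of-the-art performance on five public datasets spanning multiple imaging modalities and is robust and superior in cross-dataset generalization.

Details

Motivation: Existing segmentation architectures (CNNs, Transformers, and their hybrids) mainly encode spatial information and overlook frequency-domain representations that capture rich structural and textural cues, which limits robust generalization across imaging modalities and anatomical structures. Recent work has begun to explore spectral information at the feature level, but integrating frequency cues at the supervision level, which is crucial for fine-grained object localization, remains largely unexplored.

Result: On five public datasets (spanning X-ray, ultrasound, histopathology, MRI, and colonoscopy), Phi-SegNet consistently achieves state-of-the-art performance, with average relative improvements over the next-best model of 1.54 ± 1.26% in IoU and 0.98 ± 0.71% in F1-score. It also shows robust, superior performance in cross-dataset generalization to unseen datasets from known domains.

Insight: The core contribution is integrating spectral priors, phase information in particular, into both feature representation and supervision, forming a closed feedback loop that emphasizes boundary precision. Specifically: 1) at the architectural level, BFMF modules fuse features and RFA blocks apply phase regularization; 2) at the optimization level, a dedicated phase-aware loss is used. This offers a new direction for general segmentation frameworks that excel at fine-grained object localization.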

Abstract: Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54 ± 1.26% in IoU and 0.98 ± 0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
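
To make "phase-aware" concrete: the phase of a signal's Fourier transform encodes where its edges sit, while the magnitude encodes how strong they are. The sketch below defines a simple phase-discrepancy measure on 1-D signals using a hand-written DFT. It is an illustration of the general idea under that assumption, not the loss used in Phi-SegNet, and the function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def phase_loss(pred, target, eps=1e-9):
    """Mean of 1 - cos(phase difference) over bins with non-negligible energy."""
    terms = []
    for p, t in zip(dft(pred), dft(target)):
        if abs(p) < eps or abs(t) < eps:
            continue  # phase is undefined when the magnitude is ~0
        terms.append(1.0 - math.cos(cmath.phase(p) - cmath.phase(t)))
    return sum(terms) / len(terms) if terms else 0.0


target = [0, 0, 1, 1, 1, 0, 0, 0]   # a 1-D "mask" with edges at positions 2 and 5
shifted = [0, 0, 0, 1, 1, 1, 0, 0]  # same shape, boundary moved by one pixel
scaled = [0, 0, 2, 2, 2, 0, 0, 0]   # same boundary, different intensity

print(phase_loss(shifted, target))  # positive: the boundary moved
print(phase_loss(scaled, target))   # ~0: only the magnitude changed
```

A one-pixel shift changes the phase of every non-zero frequency bin, while rescaling leaves the phase untouched, which is why a phase term is sensitive to boundary placement in particular.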


cs.AI [Back]

[64] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning cs.AI | cs.CLPDF

Yuval Kansal, Niraj K. Jha

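TL;DR: This paper proposes a knowledge-graph-based reinforcement learning post-training method that uses knowledge graph paths as an implicit reward model, giving language models verifiable, scalable supervision that improves compositional multi-hop reasoning in specialized scientific domains such as medicine.

Details

Motivation: Large language models approach expert level in structured reasoning domains such as mathematics and programming, but their compositional multi-hop reasoning in specialized scientific fields remains limited. A bottom-up learning paradigm grounded in axiomatic domain facts is needed to solve complex, unseen tasks.

Result: In medical-domain experiments, a 14B-parameter model trained on short-hop reasoning paths (1-3 hops) and evaluated zero-shot on complex multi-hop queries (4-5 hops) significantly outperforms much larger frontier systems such as GPT-5.2 and Gemini 3 Pro on the hardest reasoning tasks, and is robust to option-shuffling stress tests.

Insight: The novelty is turning knowledge graph paths into implicit reward signals that give reinforcement learning verifiable, scalable supervision and encourage the model to compose intermediate axioms rather than optimize only the final answer, building a "compositional bridge" that improves compositional reasoning grounded in structured knowledge.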

Abstract: Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a “compositional bridge”, enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
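
One way to picture a path-derived reward is to score a response both on its final answer and on how many triples of the gold knowledge graph path its reasoning states. The abstract does not define the actual reward, so the weighting, the exact-match comparison of triples, and the name `path_reward` are assumptions for illustration only.

```python
def path_reward(reasoning_steps, kg_path, answer, gold_answer, answer_weight=0.5):
    """Blend answer correctness with coverage of the gold knowledge graph path.

    reasoning_steps: triples the model stated (hypothetical, already parsed).
    kg_path:         triples along the gold path from question to answer.
    """
    stated = set(reasoning_steps)
    covered = sum(1 for triple in kg_path if triple in stated)
    path_score = covered / len(kg_path) if kg_path else 0.0
    answer_score = 1.0 if answer == gold_answer else 0.0
    return answer_weight * answer_score + (1.0 - answer_weight) * path_score


# Placeholder entities and relations; not real medical facts.
gold_path = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D")]
steps = [("A", "r1", "B"), ("B", "r2", "C")]

print(path_reward(steps, gold_path, answer="D", gold_answer="D"))
```

A response that guesses the right answer without stating any intermediate fact earns only the answer share of the reward, so composing the intermediate hops is what raises the score.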


[65] Logic Programming on Knowledge Graph Networks And its Application in Medical Domain cs.AI | cs.CL | cs.LGPDF

Chuanqing Wang, Zhenmin Zhao, Shanshan Du, Chaoqun Fei, Songmao Zhang

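TL;DR: This paper presents a systematic theory, technique, and application framework for "knowledge graph networks", with a focus on medicine and healthcare. It covers the definition, development, reasoning, computation, and application of knowledge graph networks under fuzzy, uncertain, multi-modal, vectorized, distributed, and federated conditions, and provides real-data examples and experimental results.

Details

Motivation: Knowledge graph research is developing rapidly, but its applications still fall short: advanced logic reasoning, artificial intelligence techniques, special-purpose programming languages, and modern probability and statistics are underused, and techniques for cooperation and competition among multiple knowledge graphs in particular have received too little attention.

Result: The paper provides real-data examples and experimental results under multiple conditions (such as fuzzy, uncertain, and multi-modal), illustrating the effectiveness of the proposed framework.

Insight: The novelty is a systematic definition of the "knowledge graph network" concept and an extension of its reasoning and application capabilities to complex conditions (such as distributed and federated settings), providing a theoretical and technical basis for cooperative processing across multiple knowledge graphs.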

Abstract: The rapid development of knowledge graph research has been a strong driving force for its application in many areas, including medicine and healthcare. However, we have found that the application of some major information processing techniques to knowledge graphs still lags behind. This shortfall includes the failure to make sufficient use of advanced logical reasoning, advanced artificial intelligence techniques, special-purpose programming languages, modern probability and statistics theories, etc. in knowledge graph development and application. In particular, techniques for cooperation and competition among multiple knowledge graphs have not received enough attention from researchers. This paper develops a systematic theory, technique, and application of the concept ‘knowledge graph network’, with a focus on the medical and healthcare domain. Our research covers its definition, development, reasoning, computing, and application under different conditions such as unsharp (fuzzy), uncertain, multi-modal, vectorized, distributed, and federated. In almost every case we provide real-data examples and experimental results. Finally, a summary of the innovations is provided.


[66] MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation cs.AI | cs.CL | cs.MA

Chandan Kumar Sahu, Premith Kumar Chilukuri, Matthew Hetrich

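TL;DR: This paper proposes MiRAGE, a multi-agent framework for evaluating retrieval-augmented generation (RAG) systems. It automatically generates domain-specific, multimodal, multi-hop question-answer datasets to address the shortcomings of existing benchmarks on complex specialized documents.

Details

Motivation: RAG is moving quickly toward multimodal, high-stakes enterprise applications, but evaluation benchmarks that capture the complexity of specialized domains are lacking. Existing datasets mostly rely on general-domain corpora or text-only retrieval and cannot handle settings where information is inextricably multimodal and reasoning requires synthesizing scattered evidence.

Result: Extensive empirical evaluation across four domains (regulations, finance, quantitative biology, and journalism) shows that MiRAGE generates datasets with significantly higher reasoning complexity (more than 2.3 hops on average) and factual faithfulness. Ablations indicate that MiRAGE can be driven by LLMs when textual descriptions of the images are available, while visual grounding remains an open challenge.

Insight: The contribution is a collaborative multi-agent framework (a recursive context optimization loop, an adversarial verifier, and an expert-persona recognition agent) that mimics expert cognitive workflows. It automates the creation of gold-standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, providing infrastructure for rigorously benchmarking next-generation information retrieval systems.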

Abstract: The rapid evolution of Retrieval-Augmented Generation (RAG) toward multimodal, high-stakes enterprise applications has outpaced the development of domain-specific evaluation benchmarks. Existing datasets often rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents where information is inextricably multimodal and reasoning requires synthesizing disjoint evidence. We address this gap by introducing MiRAGE, a Multiagent framework for RAG systems Evaluation, that leverages a collaborative swarm of specialized agents to generate verified, domain-specific, multimodal, and multi-hop Question-Answer datasets. MiRAGE orchestrates a swarm of specialized agents: a recursive context optimization loop to aggregate scattered evidence, an adversarial verifier agent to guarantee factual grounding, and an agent to recognize the expert persona and the relevant domain to mimic expert cognitive workflows. Extensive empirical evaluation across four distinct domains (regulations, finance, quantitative biology, and journalism) demonstrates that MiRAGE generates datasets with significantly higher reasoning complexity (>2.3 average hops) and factual faithfulness. Our ablation studies indicate that MiRAGE can be powered by LLMs if textual descriptions of the images are available. Visual grounding still remains a frontier. By automating the creation of gold-standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, MiRAGE provides the necessary infrastructure to rigorously benchmark next-generation information retrieval systems.


[67] Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge cs.AI | cs.CL

Yiyang Feng, Zeming Chen, Haotian Wu, Jiawei Zhou, Antoine Bosselut

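TL;DR: This paper introduces the TRACK benchmark for evaluating how large language models handle conflicts between new and existing knowledge in multi-step reasoning. It finds that providing updated facts can hurt model performance, and that the degradation grows as more conflicting knowledge is provided.

Details

Motivation: Existing approaches mitigate outdated or incorrect information in LLMs by supplying updated knowledge in context or through knowledge editing, but this can introduce knowledge conflicts. Current benchmarks focus mainly on single knowledge updates and fact recall and do not evaluate the effect on downstream reasoning.

Result: Experiments on TRACK, which spans three reasoning-intensive scenarios (WIKI, CODE, and MATH), show that reasoning with updated facts can perform worse than reasoning without them, and that the degradation worsens as more updated facts are provided.

Insight: The contribution is TRACK, a benchmark for how knowledge conflicts propagate through multi-step reasoning. It exposes two sources of failure: LLMs do not integrate updated facts faithfully, and their reasoning remains flawed even when the facts are integrated. This gives future work a rigorous evaluation tool.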

Abstract: A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model’s parametric knowledge, and these conflicts propagate to faulty reasoning. Current benchmarks for this problem largely focus only on single knowledge updates and fact recall, without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model’s initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts, and that this performance degradation worsens as more updated facts are provided. We show this failure stems both from an inability to faithfully integrate updated facts and from flawed reasoning even when knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.


[68] Agentic Uncertainty Quantification cs.AI | cs.CL

Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu

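TL;DR: This paper proposes Agentic Uncertainty Quantification (AUQ), a dual-process framework that targets the “Spiral of Hallucination” in long-horizon agent reasoning, where early epistemic errors propagate. The framework turns verbalized uncertainty into active, bi-directional control signals and uses two complementary mechanisms, Uncertainty-Aware Memory (UAM) and Uncertainty-Aware Reflection (UAR), to balance efficient execution against deep deliberation and improve agent reliability.

Details

Motivation: Existing methods face a dilemma: uncertainty quantification (UQ) methods are usually passive diagnostic tools that do not act on the risks they detect, while self-reflection mechanisms tend toward continuous or aimless corrections. The paper aims to close this gap with a unified framework that turns uncertainty into an active control signal, countering the reliability loss caused by early error propagation.

Result: Extensive experiments on closed-loop benchmarks and open-ended deep research tasks show that this training-free approach achieves superior performance and trajectory-level calibration.

Insight: The core innovation is a dual-process architecture (UAM and UAR) that actively converts verbalized uncertainty into control signals, unifying diagnosis with intervention and balancing execution efficiency against deliberation. Viewed objectively, explicitly integrating uncertainty into the agent’s decision loop as a bi-directional signal offers a principled design for building more reliable agents.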

Abstract: Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the “Spiral of Hallucination,” where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.
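
The two mechanisms described in the abstract can be pictured with a small control loop: every step carries a verbalized confidence and an explanation into memory (the System 1 role), and a slower reflection pass runs only when confidence falls below a threshold (the System 2 role). This is a minimal sketch, not the paper's implementation; the `Step` and `Agent` classes, the `propose`/`reflect` callables, and the 0.6 threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    action: str
    confidence: float   # verbalized confidence in [0, 1]
    explanation: str    # the agent's stated reason for its (un)certainty


@dataclass
class Agent:
    threshold: float = 0.6                      # assumed reflection trigger
    memory: List[Step] = field(default_factory=list)

    def run_step(self,
                 propose: Callable[[List[Step]], Step],
                 reflect: Callable[[Step, List[Step]], Step]) -> Step:
        # Fast path: the proposal sees earlier confidences and explanations.
        step = propose(self.memory)
        # Slow path: targeted reflection only for low-confidence steps.
        if step.confidence < self.threshold:
            step = reflect(step, self.memory)
        self.memory.append(step)
        return step
```

Gating reflection on confidence is what keeps the loop from either never correcting itself or correcting on every step, the two failure modes the abstract contrasts.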


[69] PhysProver: Advancing Automatic Theorem Proving for Physics cs.AI | cs.CL

Hanning Zhang, Ruida Wang, Rui Pan, Wenyuan Wang, Bingxu Meng

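TL;DR: This paper proposes PhysProver, which the authors describe as the first approach to enhancing automatic theorem proving in the physics domain. It builds a dedicated dataset, PhysLeanData, and trains on top of DeepSeek-Prover-V2-7B using reinforcement learning with verifiable rewards. With only about 5K training samples, the model achieves an overall 2.4% improvement across multiple physics sub-domains and a 1.3% gain on the math benchmark MiniF2F-Test, showing cross-domain generalization.

Details

Motivation: Research on theorem proving with verifiable languages and LLMs has concentrated on mathematics, while formal physics reasoning, which relies on similar problem-solving and theorem-proving frameworks, has received little attention. This paper aims to fill that gap by extending automatic theorem proving to physics.

Result: Using only about 5K training samples, PhysProver achieves an overall 2.4% improvement across multiple physics sub-domains. It also gains 1.3% on the formal mathematics benchmark MiniF2F-Test, indicating non-trivial generalization beyond physics.

Insight: The main contributions are: 1) applying an automatic theorem proving system specifically to physics, which the authors believe is a first; 2) building the dedicated dataset PhysLeanData, which combines theorems sampled from PhysLean with data from a conjecture-based formal data generation pipeline; 3) training with reinforcement learning with verifiable rewards, which uses small-scale data efficiently. Together these offer a workable paradigm for extending formal provers beyond mathematics.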

Abstract: The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. Recent advancements in the field provide foundation models and sophisticated agentic systems pushing the boundaries of formal mathematical reasoning to approach the natural language capability of LLMs. However, little attention has been given to the formal physics reasoning, which also heavily relies on similar problem-solving and theorem-proving frameworks. To solve this problem, this paper presents, to the best of our knowledge, the first approach to enhance formal theorem proving in the physics domain. We compose a dedicated dataset PhysLeanData for the task. It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline. In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver. Comprehensive experiments demonstrate that, using only $\sim$5K training samples, PhysProver achieves an overall 2.4% improvement in multiple sub-domains. Furthermore, after formal physics training, we observe 1.3% gains on the MiniF2F-Test benchmark, which indicates non-trivial generalization beyond physics domains and enhancement for formal math capability as well. The results highlight the effectiveness and efficiency of our approach, which provides a paradigm for extending formal provers outside mathematical domains. To foster further research, we will release both our dataset and model to the community.


[70] ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models cs.AI | cs.CL

Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen

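TL;DR: This paper proposes ErrorMap, a method for analyzing why large language models fail. Applying it to 35 datasets and 83 models yields ErrorAtlas, a taxonomy of errors that reveals recurring failure patterns such as missing details in the output and question misinterpretation.

Details

Motivation: Existing LLM benchmarks show when a model fails but not why, which leaves benchmarks incomplete and unable to guide model improvement effectively.

Result: Applying ErrorMap produces the ErrorAtlas error taxonomy, which highlights error types that are underexplored in current LLM research, such as omission of required details in the output and question misinterpretation.

Insight: The contribution is a shift in evaluation focus from where models succeed to why they fail, adding a deeper evaluation layer that applies across models and tasks and gives richer insight into model behavior and limitations. The authors present ErrorMap as the first method to chart the sources of LLM failure; it extracts a model’s “failure signature”.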

Abstract: Large Language Models (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model’s unique “failure signature”, clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation - one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.


[71] The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection cs.AI | cs.CV

Wei Ai, Yilong Tan, Yuntao Shou, Tao Meng, Haowen Chen

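TL;DR: This paper is a comprehensive survey of large vision-language models applied to multimodal fake news detection. It systematically reviews the field’s shift from traditional feature-engineering methods to end-to-end multimodal reasoning frameworks built on large models, covering model architectures, datasets, performance benchmarks, open technical challenges, and future research directions.

Details

Motivation: The rapid development of large vision-language models has driven a paradigm shift in multimodal fake news detection, but no systematic survey traces this shift or consolidates recent progress. This paper aims to fill that gap with a full review and analysis of the role these models play in the field.

Result: As a survey, the paper proposes no new method and reports no quantitative results or benchmark rankings. Its main output is a structured taxonomy and a systematic summary of the field’s evolution, existing methods, and challenges.

Insight: The core contribution is what the authors describe as the first systematic survey and paradigm analysis of multimodal fake news detection from the perspective of large vision-language models. Its structured taxonomy, its analysis of technical challenges (interpretability, temporal reasoning, and domain generalization), and its outlook on future directions give researchers a clear roadmap.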

Abstract: In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature-engineering approaches to unified, end-to-end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high-level semantic understanding and complex cross-modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model-driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. A summary of the existing methods mentioned is available on GitHub: https://github.com/Tan-YiLong/Overview-of-Fake-News-Detection.


[72] GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation cs.AI | cs.CV | cs.LG

Francesca Pia Panaccione, Carlo Sgaravatti, Pietro Pinoli

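TL;DR: This paper proposes GeMM-GAN, a generative adversarial network that takes histopathology slide images and clinical descriptions as conditioning inputs to synthesize realistic gene expression profiles. A Transformer encoder processes image patches, and cross-attention between patches and text tokens produces a conditioning vector that guides generation of biologically coherent gene expression data.

Details

Motivation: Gene expression data is hard to use widely in research because of strict privacy regulations and costly experiments, whereas medical images and clinical metadata are easier to obtain. This paper addresses that imbalance by using a generative model to synthesize gene expression profiles in support of biomedical research.

Result: Evaluation on the TCGA dataset shows that GeMM-GAN outperforms standard generative models and produces more realistic and functionally meaningful gene expression profiles. On downstream disease type prediction, its accuracy exceeds that of current state-of-the-art generative models by more than 11%.

Insight: The contribution is bringing multimodal conditioning (histopathology images and clinical text) into a generative adversarial network, fusing image and text information through a Transformer and cross-attention to generate biologically coherent gene expression data. This offers a new approach to cross-modal data generation, with particular potential in biomedicine.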

Abstract: Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving by more than 11% the accuracy on downstream disease type prediction compared to current state-of-the-art generative models. Code will be available at: https://github.com/francescapia/GeMM-GAN


cs.SD

[73] PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation cs.SD | cs.CV | cs.LG | cs.MM | eess.AS

Jaekwon Im, Natalia Polouliakh, Taketo Akama

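TL;DR: This paper proposes PF-D2M, a universal dance-to-music generation model based on diffusion. It extracts visual features from dance videos to generate music aligned with the dance, and uses a progressive training strategy to address data scarcity and generalization, making it applicable to real-world scenarios with multiple dancers and non-human dancers.

Details

Motivation: Existing methods usually depend on body motion features extracted from a single dancer and on limited datasets, which restricts their performance and applicability in real-world scenarios involving multiple dancers or non-human dancers. This paper aims to remove that limitation.

Result: Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.

Insight: The contribution is a universal model that does not depend on pose estimation (pose-free): it extracts visual features directly from dance videos and uses progressive training to improve generalization, so it can handle a wider range of dance inputs.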

Abstract: Dance-to-music generation aims to generate music that is aligned with dance movements. Existing approaches typically rely on body motion features extracted from a single human dancer and limited dance-to-music datasets, which restrict their performance and applicability to real-world scenarios involving multiple dancers and non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos. PF-D2M is trained with a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.


[74] Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks cs.SD | cs.CV

Abdul Hannan, Daniele Falavigna, Shah Nawaz, Mubashir Noman, Markus Schedl

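TL;DR: This paper proposes a distillation-based layer dropping (DLD) framework for building dynamic speech networks that balance performance and computational cost on resource-constrained edge devices. The framework combines knowledge distillation with layer dropping end to end and substantially improves the dynamic model’s performance across dropping rates.

Details

Motivation: Edge devices have limited and varying resources and need dynamic architectures that adapt to those limits. Existing layer dropping methods degrade sharply at both low and high dropping rates, which hurts the trade-off between performance and computational complexity.

Result: In experiments on three public benchmarks with well-known speech recognition methods, including Conformer and WavLM, DLD reduces word error rate by 9.32% in the high dropping case and by 2.25% in the no dropping case, cuts training time by 33.3%, and reaches state-of-the-art performance.

Insight: The contribution is the end-to-end combination of knowledge distillation and layer dropping to improve the performance-computation trade-off of dynamic models. Viewed objectively, distillation mitigates the performance loss caused by layer dropping and improves the adaptability and efficiency of dynamic speech networks.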

Abstract: Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model’s performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by 9.32% and 2.25% for the high and no dropping cases, respectively, with a 33.3% reduction in training time.
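
To make the combination of layer dropping and distillation concrete, the sketch below skips layers at random during a forward pass and measures how far the resulting "student" output drifts from the full-depth "teacher" output. It is only a toy under stated assumptions: layers are plain Python functions, the dropping rule is an independent coin flip per layer, and mean squared error stands in for whatever distillation loss the paper actually uses.

```python
import random
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def forward(layers: Sequence[Layer], x: List[float], drop_prob: float,
            rng: random.Random) -> List[float]:
    """Apply layers in order, skipping each one with probability drop_prob."""
    for layer in layers:
        if rng.random() >= drop_prob:   # keep the layer
            x = layer(x)
    return x


def distill_loss(student: List[float], teacher: List[float]) -> float:
    """Mean squared error between student and teacher outputs (assumed loss)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)
```

In a training loop of this shape, the teacher would run with `drop_prob=0.0` and the student with a sampled dropping rate, so that minimizing `distill_loss` pulls the shallower sub-networks toward the full model's behavior.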


cs.HC

[75] Elsewise: Authoring AI-Based Interactive Narrative with Possibility Space Visualization cs.HC | cs.AI | cs.CL

Yi Wang, John Joon Young Chung, Melissa Roemmele, Yuqian Sun, Tiffany Wang

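TL;DR: This paper introduces Elsewise, an authoring tool for AI-based interactive narrative (IN). It implements a novel “Bundled Storyline” concept that helps authors visualize and understand the narrative possibility space, narrowing the gap between what the author envisions and what players actually experience.

Details

Motivation: Generative AI in interactive narrative can widen the gap between author intent and player experience. The paper addresses this so that authors can better control and explore narrative possibilities.

Result: A user study (n=12) shows that the approach improves authors’ anticipation of the narrative players experience, leading to more effective control and exploration of the narrative possibility space.

Insight: The contribution is the Bundled Storyline concept together with visualization along user-configurable narrative dimensions, which lets authors compare similarities and differences between playthroughs and strengthens their control over AI-generated content.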

Abstract: Interactive narrative (IN) authors craft spaces of divergent narrative possibilities for players to explore, with the player’s input determining which narrative possibilities they actually experience. Generative AI can enable new forms of IN by improvisationally expanding on pre-authored content in response to open-ended player input. However, this extrapolation risks widening the gap between author-envisioned and player-experienced stories, potentially limiting the strength of plot progression and the communication of the author’s narrative intent. To bridge the gap, we introduce Elsewise: an authoring tool for AI-based INs that implements a novel Bundled Storyline concept to enhance author’s perception and understanding of the narrative possibility space, allowing authors to explore similarities and differences between possible playthroughs of their IN in terms of open-ended, user-configurable narrative dimensions. A user study (n=12) shows that our approach improves author anticipation of player-experienced narrative, leading to more effective control and exploration of the narrative possibility spaces.


[76] VegaChat: A Robust Framework for LLM-Based Chart Generation and Assessment cs.HC | cs.CL

Marko Hostnik, Rauf Kurbanov, Yaroslav Sokolov, Artem Trofimov

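TL;DR: This paper proposes VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. It introduces two complementary metrics: Spec Score, a deterministic metric of specification-level similarity, and Vision Score, a library-agnostic image similarity metric based on a multimodal LLM. Evaluation on the NLV Corpus and the annotated subset of ChartLLM shows that VegaChat reliably produces valid charts and that the new metrics correlate strongly with human judgments.

Details

Motivation: Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) face two linked challenges: the lack of standardized evaluation metrics makes it hard to measure progress and compare approaches, and natural language descriptions are inherently underspecified, so one query may have several valid visualizations.

Result: On the NLV Corpus and the annotated subset of ChartLLM, VegaChat achieves near-zero rates of invalid or empty visualizations. Spec Score and Vision Score correlate strongly with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the metrics support consistent comparison across chart libraries.

Insight: The contribution is a single framework that covers generation, validation, and assessment, with complementary quantitative metrics at the specification level and the visual level, addressing the lack of standardized evaluation and the ambiguity of natural language in NL2VIS. Viewed objectively, Vision Score, which uses a multimodal LLM to assess image similarity, is a promising evaluation method that is decoupled from any specific charting library.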

Abstract: Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at https://zenodo.org/records/17062309.
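
For readers who want to see what the reported agreement statistic measures, the following sketch computes a Pearson correlation between an automatic metric's scores and human ratings. The helper name `pearson` is ours, and any numbers passed to it here are illustrative only; none of the paper's data is reproduced.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations, both left unnormalized
    # because the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

A value near 1 means the metric ranks and spaces charts much as human raters do; the 0.65 and 0.71 reported above sit in the range usually read as a strong positive association.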


cs.LG

[77] When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards cs.LG | cs.CL

Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang

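TL;DR: This paper studies over-sharpening in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs), where the policy collapses onto a limited set of modes and suppresses valid alternatives. The authors find that finite-batch updates intrinsically bias learning toward sampled modes and that the collapse spreads globally through semantic coupling. To mitigate this they propose inverse-success advantage calibration and distribution-level calibration via a memory network; empirical evaluation indicates that these strategies improve generalization.

Details

Motivation: The paper asks whether RLVR elicits new capabilities in LLMs or merely sharpens the distribution over existing knowledge, and investigates the over-sharpening and policy collapse it may cause.

Result: Empirical evaluation validates that the proposed inverse-success advantage calibration and distribution-level calibration improve generalization.

Insight: The contribution is a formalization of over-sharpening in RLVR that exposes two mechanisms, sampling bias from finite-batch updates and global collapse driven by semantic coupling, along with targeted calibration methods that prioritize difficult queries and diversify sampling.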

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.
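
One way to picture "prioritizing difficult queries" is to scale each query's group-centered advantages by the inverse of its empirical success rate, so that rarely solved queries contribute larger updates. The sketch below does exactly that for binary rewards; the function name, the `1 / (p + eps)` weighting, and the default `eps` are assumptions for illustration, since the abstract does not give the calibration formula.

```python
from typing import List


def calibrated_advantages(rewards: List[float], eps: float = 0.1) -> List[float]:
    """Group-centered advantages for one query, scaled by inverse success rate.

    rewards: 0/1 verifier outcomes for the rollouts sampled on this query.
    eps:     assumed smoothing constant that bounds the weight when p is 0.
    """
    p = sum(rewards) / len(rewards)   # empirical success rate on this query
    weight = 1.0 / (p + eps)          # harder query (small p) -> larger weight
    return [weight * (r - p) for r in rewards]
```

With this weighting, a lone success on a query solved one time in four receives a much larger advantage than a success on a query solved three times in four, while the advantages within each group still sum to zero.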


[78] Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing cs.LG | cs.CV

Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

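TL;DR: This paper proposes Feature-space Smoothing (FS), a method that gives multimodal large language models (MLLMs) provable robustness guarantees against adversarial perturbations. FS converts any feature encoder into a smoothed variant with a certified lower bound on the cosine similarity between clean and adversarial feature representations under ℓ₂-bounded attacks. The authors also introduce a plug-and-play Purifier and Smoothness Mapper (PSM) module that raises certified robustness without retraining the MLLM. Experiments show that FS-PSM provides a theoretical robustness guarantee and also sharply reduces attack success rates across multiple MLLMs and downstream tasks.

Details

Motivation: MLLMs are widely used, but their feature representations are vulnerable to adversarial perturbations that lead to wrong predictions. Existing methods lack theoretical robustness guarantees, so a defense with provable robustness is needed.

Result: Extensive experiments across diverse MLLMs and downstream tasks show that FS-PSM reduces the attack success rate (ASR) of various white-box attacks from nearly 90% to about 1%, outperforms adversarial training empirically, and comes with a theoretical robustness guarantee.

Insight: The contribution is Feature-space Smoothing, which gives MLLM feature representations a provable robustness lower bound (the Feature Cosine Similarity Bound, FCSB), together with the plug-and-play PSM module that improves certified robustness without retraining, combining theoretical guarantees with practicality.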

Abstract: Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
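
The idea of a smoothed encoder can be sketched in the style of randomized smoothing: average the encoder's features over Gaussian perturbations of the input, then compare clean and perturbed features by cosine similarity. This is a generic Monte Carlo sketch, not the paper's construction; `smooth_encoder`, its parameters, and the plain averaging are assumptions, and the certified bound (FCSB) itself is not computed here.

```python
import math
import random
from typing import Callable, List

Vec = List[float]


def smooth_encoder(encoder: Callable[[Vec], Vec], x: Vec, sigma: float,
                   n_samples: int, rng: random.Random) -> Vec:
    """Monte Carlo estimate of the encoder's mean feature under Gaussian input noise."""
    acc: Vec = []
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        feat = encoder(noisy)
        # Accumulate a running sum of feature vectors.
        acc = list(feat) if not acc else [a + f for a, f in zip(acc, feat)]
    return [a / n_samples for a in acc]


def cosine(u: Vec, v: Vec) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Evaluating `cosine` between the smoothed features of a clean input and of an attacked input gives the empirical quantity that the paper's guarantee bounds from below.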