Table of Contents
- cs.CL [Total: 98]
- cs.CV [Total: 157]
- cs.AI [Total: 4]
- cs.GR [Total: 1]
- eess.IV [Total: 2]
- cs.CY [Total: 1]
- cs.CR [Total: 3]
- cs.RO [Total: 5]
- cs.CE [Total: 1]
- cs.IR [Total: 4]
- cs.SE [Total: 1]
- cs.SD [Total: 2]
- cs.MA [Total: 2]
- cs.LG [Total: 23]
cs.CL
[1] Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation
Wei Zhou,Bolei Ma,Annemarie Friedrich,Mohsen Mesgar
Main category: cs.CL
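TL;DR: This survey systematically organizes research on table question answering (TQA) with large language models (LLMs), covering task categorization, methodological trends, and evaluation, to address the lack of a systematic overview of the field.
Details
Motivation: TQA lacks a systematic account of its task formulations, core challenges, and methodological trends, particularly since the rise of LLMs, so a unified survey framework is needed.
Contribution: Provides a comprehensive categorization of TQA benchmarks and task setups; summarizes LLM-based methods with their strengths and limitations; identifies future research directions.
Method: A structured survey that categorizes TQA tasks by table representation, question/answer complexity, modality, and domain, and groups modeling strategies by the challenges they target.
Result: The survey unifies scattered research threads and gives the TQA community a consolidated foundation and practical guidance.
Insight: LLMs show strong potential for TQA, but emerging directions such as reinforcement learning need further exploration to improve performance on multimodal and complex tasks.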
Abstract: Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.
[2] The Idola Tribus of AI: Large Language Models tend to perceive order where none exists
Shin-nosuke Ishikawa,Masato Todo,Taiki Ogihara,Hirotsugu Ohba
Main category: cs.CL
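TL;DR: Large language models (LLMs) tend to produce absurd patterns when asked to explain random number series, exposing a logical-consistency problem that may affect their use in complex tasks.
Details
Motivation: LLMs are widely applied to tasks that depend on logical consistency, such as retrieval-augmented knowledge provision and multi-step agent execution, but their ability to identify regularities in number series has not been systematically evaluated.
Contribution: Shows that LLMs over-recognize non-existent patterns in random number series and argues that this phenomenon (the AI counterpart of “Idola Tribus”) points to potential limits on their logical capabilities.
Method: LLMs are asked to explain the patterns in several types of integer sequences (arithmetic, geometric, and randomly generated), and the logical consistency of their explanations is examined.
Result: Models identify correct patterns in arithmetic and geometric sequences but frequently report patterns inconsistent with the given numbers in random series. This holds even for multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking.
Insight: These logical flaws may undermine LLM reliability in applied tasks, which calls for improved reasoning mechanisms or targeted countermeasures.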
Abstract: We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.
[3] ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models
Wenbin Guo,Xin Wang,Jiaoyan Chen,Lingbing Guo,Zhao Li,Zirui Chen
Main category: cs.CL
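TL;DR: ReaLM is a framework that aligns knowledge graph (KG) embeddings with large language models (LLMs) through residual vector quantization, addressing the inability of existing LLM-based methods to fully exploit structured semantic representations.
Details
Motivation: Existing LLM-based knowledge graph completion methods struggle to exploit structured semantic representations, because the continuous embedding space of pretrained KG models is misaligned with the discrete token space of LLMs, which limits semantic transfer.
Contribution: (1) Discretizes KG embeddings into compact code sequences via residual vector quantization; (2) integrates these code sequences into the LLM vocabulary as learnable tokens; (3) introduces ontology-guided class constraints to enforce semantic consistency.
Method: ReaLM uses residual vector quantization to convert pretrained KG embeddings into discrete code sequences and adds them to the LLM vocabulary as new tokens. Ontology-guided class constraints then refine entity predictions based on class-level compatibility.
Result: Experiments on two widely used benchmark datasets show that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with LLMs.
Insight: ReaLM bridges the gap between the continuous embedding space and the discrete token space. Combining quantization with semantic constraints fuses KGs with LLMs and offers a new approach to knowledge graph completion.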
Abstract: Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
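Illustration (not taken from the paper): a minimal sketch of residual vector quantization, the mechanism the abstract names for turning a continuous KG embedding into a short code sequence. The two hand-written codebooks, the 2-dimensional embedding, and the `<kg_stage_index>` token naming are all hypothetical; real codebooks would be learned, and the abstract does not specify the token format.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the embedding by summing the chosen entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (hypothetical values, chosen by hand for readability).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]
embedding = [1.3, 0.9]

codes = rvq_encode(embedding, codebooks)
approx = rvq_decode(codes, codebooks)
# Hypothetical naming scheme for the new vocabulary entries.
tokens = [f"<kg_{stage}_{idx}>" for stage, idx in enumerate(codes)]
print(codes, approx, tokens)
```

Each stage quantizes what the previous stage left over, so the code sequence refines the approximation step by step, and the resulting indices can be treated as discrete symbols.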
[4] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Shiyuan Guo,Henry Sleight,Fabien Roger
Main category: cs.CL
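TL;DR: Ciphered reasoning could let models evade chain-of-thought (CoT) monitoring, but current language models reason poorly in ciphered text. Across 28 ciphers, models reason accurately in well-known ciphers such as rot13 but struggle with lesser-known ones, and this capability correlates with how often a cipher appears in pretraining data.
Details
Motivation: As AI agents gain adoption, detecting harmful actions such as adversarial attacks or AI misalignment becomes important. CoT monitoring is a widely used method, but attackers and misaligned models might evade it by reasoning in ciphered language. The paper assesses this risk.
Contribution: Tests model reasoning across many ciphers, reveals an asymmetry (models can translate ciphered text accurately yet lose accuracy when reasoning in it), and shows that ciphered reasoning capability correlates with cipher prevalence in pretraining data.
Method: For each of 28 ciphers, up to 10 models are fine-tuned and prompted to reason in that cipher, with accuracy on math problems used as a proxy for reasoning ability.
Result: Models reason accurately in well-known ciphers, but accuracy drops significantly on lesser-known ones, even for frontier models. Capability correlates with cipher prevalence in pretraining data and improves only slowly with additional fine-tuning data.
Insight: The current limits on ciphered reasoning suggest that evading CoT monitoring this way may be ineffective for now. The findings also offer guidance on constraining the development of this capability in future frontier models.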
Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
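Illustration (not from the paper): a minimal sketch of what a reasoning step looks like once it is written in rot13, the well-known cipher the abstract mentions. The example sentence is made up; the only library call used is Python's built-in `rot_13` text codec.

```python
import codecs


def rot13(text):
    """Rotate each ASCII letter by 13 places; digits and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")


step = "First add 17 and 25 to get 42."  # made-up reasoning step
ciphered = rot13(step)
print(ciphered)  # Svefg nqq 17 naq 25 gb trg 42.

# rot13 is its own inverse, so applying it twice restores the plain text.
assert rot13(ciphered) == step
```

A monitor that reads only plain English would not recognize the ciphered line, which is why the paper asks whether models can actually carry out reasoning in such text rather than merely translate it.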
[5] Preference-Aware Memory Update for Long-Term LLM Agents
Haoran Sun,Zekun Zhang,Shaoning Zeng
Main category: cs.CL
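TL;DR: Proposes a Preference-Aware Memory Update mechanism (PAMU) that combines sliding window averages with exponential moving averages to dynamically refine the memory representations of long-term LLM agents, improving output quality in long conversations.
Details
Motivation: Long-term memory mechanisms have made significant progress on storage and retrieval but fall short on memory updating, especially in adapting to evolving user behaviors and contexts. PAMU aims to fill this gap.
Contribution: Proposes PAMU, which refines memory representations dynamically and in a personalized way, capturing both short-term fluctuations and long-term user tendencies.
Method: Integrates sliding window averages (SW) with exponential moving averages (EMA) to build a fused preference-aware memory representation that is updated dynamically.
Result: On five task scenarios of the LoCoMo dataset, PAMU significantly improves output quality across five baselines, validating its effectiveness in long-term conversations.
Insight: Dynamically adjusting memory representations is key to the reasoning capabilities of long-term LLM agents, and combining short-term and long-term preferences helps track changes in user behavior.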
Abstract: One of the key factors influencing the reasoning capabilities of LLM-based agents is their ability to leverage long-term memory. Integrating long-term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs, most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference-Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference-aware representation that captures both short-term fluctuations and long-term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism can significantly improve the output quality of LLMs across five baselines, validating its effectiveness in long-term conversations.
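Illustration (not the authors' implementation): a minimal sketch of fusing a sliding window average with an exponential moving average over a scalar preference signal. The abstract does not give the fusion rule, so the convex combination with weight `lam`, the class name `PreferenceMemory`, and all parameter values are assumptions made for this example.

```python
from collections import deque


class PreferenceMemory:
    """Tracks one scalar preference signal (hypothetical, e.g. a formality score)."""

    def __init__(self, window=3, alpha=0.5, lam=0.5):
        self.recent = deque(maxlen=window)  # sliding window of latest observations
        self.alpha = alpha                  # EMA smoothing factor
        self.lam = lam                      # assumed fusion weight between SW and EMA
        self.ema = None

    def update(self, score):
        """Add one observation and return the fused preference estimate."""
        self.recent.append(score)
        sw = sum(self.recent) / len(self.recent)  # short-term view
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.alpha * score + (1 - self.alpha) * self.ema  # long-term view
        return self.lam * sw + (1 - self.lam) * self.ema


memory = PreferenceMemory(window=2, alpha=0.5, lam=0.5)
fused = [memory.update(s) for s in [1.0, 0.0, 0.0]]
print(fused)  # [1.0, 0.5, 0.125]
```

In this toy run the window forgets the initial 1.0 after two steps while the EMA still retains a trace of it, which is the short-term versus long-term contrast the abstract describes.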
[6] VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
Yubo Sun,Chunyi Peng,Yukun Yan,Shi Yu,Zhenghao Liu,Chi Chen,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
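TL;DR: Proposes EVisRAG, a framework that uses evidence-guided multi-image reasoning to address weaknesses in visual retrieval-augmented generation (VRAG) systems. It is trained with RS-GRPO to jointly optimize visual perception and reasoning, and experiments show substantial gains on visual question answering benchmarks.
Details
Motivation: Current VRAG systems struggle to reliably perceive and integrate evidence across multiple images, which leads to inaccurate reasoning. The authors propose the EVisRAG framework and the RS-GRPO training method to address this.
Contribution: (1) Proposes EVisRAG, which improves VRAG performance through evidence-guided multi-image reasoning. (2) Designs the RS-GRPO training method, which jointly optimizes visual perception and reasoning.
Method: EVisRAG first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. RS-GRPO binds fine-grained rewards to scope-specific tokens to optimize the model's visual perception and reasoning abilities.
Result: EVisRAG improves over the backbone VLM by 27% on average across multiple visual question answering benchmarks, and precisely perceives and localizes question-relevant evidence.
Insight: By aggregating evidence and training with RS-GRPO, EVisRAG improves multi-image reasoning accuracy, deriving answers from evidence much like a real detective.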
Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
[7] Judge’s Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
Steve Han,Gilberto Titericz Junior,Tom Balough,Wenfei Zhou
Main category: cs.CL
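TL;DR: Introduces the Judge’s Verdict Benchmark, a two-step methodology for evaluating large language models (LLMs) as judges of response accuracy. It finds that 27 of 54 LLMs reach Tier 1 performance and that judge quality depends not solely on model size but also on training strategy.
Details
Motivation: Correlation analysis alone is insufficient for evaluating LLMs as judges; a more comprehensive method is needed to measure whether they reproduce the nuance and consistency of human judgment.
Contribution: (1) Establishes that correlation alone is insufficient; (2) introduces an agreement-based “Turing Test for judges”; (3) provides a standardized benchmark that classifies LLM judges into performance tiers.
Method: Two steps: (1) a correlation test filters judges with strong alignment; (2) a z-score analysis identifies human-like judgment (|z| < 1) or super-consistent judgment (z > 1). The study covers 43 open-source models and 11 closed models.
Result: 27 models reach Tier 1 performance (23 human-like, 4 super-consistent). Judge quality is not solely dependent on model size but on specific training strategies.
Insight: An LLM's ability as a judge depends on more than scale, with training strategy playing a key role. Super-consistency could indicate either enhanced reliability or oversimplification of complex judgments.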
Abstract: This research introduces the Judge’s Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen’s Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a “Turing Test for judges” based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.
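Illustration (not the benchmark's code): a minimal sketch of the two quantities the abstract names, Cohen's kappa between a judge and human labels, and a z-score banded with the stated thresholds. How the benchmark computes its z-score is not given in the abstract; here it is assumed to compare the judge's kappa with a set of human-to-human kappas. All labels and kappa values below are made up.

```python
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)  # assumes expected < 1


def band(z):
    """Thresholds from the abstract: |z| < 1 human-like, z > 1 super-consistent."""
    if abs(z) < 1:
        return "human-like"
    if z > 1:
        return "super-consistent"
    return "outside both bands"


human = [1, 1, 0, 0, 1, 0, 1, 1]  # made-up human correctness labels
judge = [1, 1, 0, 0, 1, 0, 0, 1]  # made-up LLM judge labels
kappa = cohen_kappa(human, judge)

human_kappas = [0.70, 0.80, 0.75, 0.65, 0.85]  # hypothetical human-to-human kappas
z = (kappa - mean(human_kappas)) / stdev(human_kappas)
print(round(kappa, 3), band(z))
```

Raw agreement in this example is 7/8, but kappa is lower because half of that agreement would be expected by chance given the label frequencies, which is the kind of gap that correlation or accuracy alone can hide.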
[8] Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning
Adam Byerly,Daniel Khashabi
Main category: cs.CL
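TL;DR: Proposes Gold Panning Bandits, which turns the positional bias of LLMs in multi-document contexts into a signal: documents are reordered to identify relevant content efficiently, using substantially fewer language model queries.
Details
Motivation: Large language models (LLMs) show positional bias in multi-document settings, prioritizing information by location rather than relevance. Existing approaches treat this bias as noise to be mitigated, whereas this paper treats it as an exploitable signal.
Contribution: Proposes the Gold Panning Bandits framework, which uses positional bias as a diagnostic signal to identify relevant content through document reordering, and designs a greedy algorithm that reduces computational cost.
Method: Frames the choice of reorderings as a bipartite matching problem and proposes a greedy strategy with $O(N \log N)$ time complexity that places the most uncertain documents in the most informative positions.
Result: On knowledge-intensive NLP tasks, the method uses up to 65% fewer language model queries than random permutation baselines, substantially reducing computational cost.
Insight: Inherent LLM biases can be turned into tools for inference-time efficiency through careful design, without retraining the model.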
Abstract: Large language models exhibit a strong position bias in multi-document contexts, systematically prioritizing information based on location rather than relevance. While existing approaches treat this bias as noise to be mitigated, we introduce Gold Panning Bandits, a framework that leverages position bias as a diagnostic signal: by reordering documents and observing shifts in the model’s responses, we can efficiently identify the most relevant content. We frame the problem of choosing reorderings as a bipartite matching problem. While an optimal assignment can be computed at each iteration with the Hungarian algorithm in $O(N^3)$ time, we propose a greedy $O(N \log N)$ strategy that achieves comparable performance by prioritizing the placement of the most uncertain documents in the most informative positions. Our approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without model retraining. This work demonstrates that inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization.
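Illustration (not the authors' code): a minimal sketch of a greedy assignment in the spirit of the abstract, pairing the most uncertain documents with the most informative positions by sorting both lists, which costs O(N log N). The uncertainty measure `p * (1 - p)`, the per-position informativeness scores, and all numbers are assumptions for this example; the paper's actual quantities may differ.

```python
def greedy_order(relevance_belief, position_info):
    """Return order[slot] = document index placed in that context slot.

    relevance_belief[d]: current belief that document d is relevant (assumed input).
    position_info[s]: how informative slot s is (assumed input).
    """
    n = len(relevance_belief)
    # Most uncertain documents first; p * (1 - p) peaks at p = 0.5.
    docs = sorted(
        range(n),
        key=lambda d: relevance_belief[d] * (1 - relevance_belief[d]),
        reverse=True,
    )
    # Most informative slots first.
    slots = sorted(range(n), key=lambda s: position_info[s], reverse=True)
    order = [None] * n
    for doc, slot in zip(docs, slots):
        order[slot] = doc
    return order


beliefs = [0.9, 0.5, 0.2, 0.6]        # hypothetical relevance beliefs
position_info = [1.0, 0.3, 0.2, 0.8]  # hypothetical: edges more informative than middle
order = greedy_order(beliefs, position_info)
print(order)  # [1, 2, 0, 3]
```

Documents 1 and 3, whose relevance is least certain, land in the first and last slots, while the near-certain document 0 is placed where the model's response is assumed to say the least about it.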
[9] Text Prompt Injection of Vision Language Models
Ruizhe Zhu
Main category: cs.CL
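TL;DR: Proposes a text prompt injection attack on vision language models that is simple, effective, and light on computational resources.
Details
Motivation: As large vision language models are widely deployed, safety concerns have grown. The author studies text prompt injection to expose potential vulnerabilities in these models.
Contribution: A simple yet effective text prompt injection algorithm, with experiments demonstrating its effectiveness and efficiency.
Method: The author develops an algorithm that misleads vision language models through text prompt injection and compares it with other attack methods.
Result: The approach is particularly effective against large models and does not demand high computational resources.
Insight: The study exposes the vulnerability of vision language models to text prompt injection and provides a reference point for future safety improvements.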
Abstract: The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective for large models without high demand for computational resources.
[10] NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering
Kaiwen Shi,Zheyuan Zhang,Zhengqing Yuan,Keerthiram Murugesan,Vincent Galass,Chuxu Zhang,Yanfang Ye
Main category: cs.CL
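TL;DR: NG-Router is a graph-supervised multi-agent collaboration framework for nutrition question answering. It addresses the limited reasoning capacity of single agents and the complexity of designing multi-agent architectures, and uses gradient-based subgraph retrieval to improve reasoning accuracy.
Details
Motivation: Diet plays a central role in human health, and nutrition question answering offers a path toward personalized dietary guidance and the prevention of chronic diseases. Existing methods suffer from limited single-agent reasoning capacity, the complexity of designing multi-agent architectures, and contextual overload that hinders decision-making.
Contribution: (1) Proposes NG-Router, which formulates nutrition question answering as a knowledge-graph-guided multi-agent collaboration problem; (2) uses a graph neural network to learn task-aware routing distributions; (3) proposes a gradient-based subgraph retrieval mechanism that enhances multi-hop and relational reasoning.
Method: (1) Integrates agent nodes into heterogeneous knowledge graphs; (2) learns routing distributions over agents with a graph neural network; (3) introduces gradient-based subgraph retrieval to strengthen reasoning.
Result: Across multiple benchmarks and backbone models, NG-Router consistently outperforms both single-agent and ensemble baselines.
Insight: Combining knowledge graphs with multi-agent collaboration is an effective approach to complex domain tasks. The gradient-based subgraph retrieval mechanism may extend to other tasks that depend on contextual reasoning.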
Abstract: Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.
[11] NarraBench: A Comprehensive Framework for Narrative Benchmarking
Sil Hamilton,Matthew Wilkens,Andrew Piper
Main category: cs.CL
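TL;DR: NarraBench is a comprehensive taxonomy of narrative-understanding tasks. A survey of 78 existing benchmarks finds that current evaluations capture only about 27% of narrative tasks well, and the authors call for more attention to overlooked or poorly evaluated aspects of narrative.
Details
Motivation: Existing evaluations cover narrative understanding incompletely. Important aspects such as events, style, perspective, and revelation are barely evaluated, and benchmarks lack the ability to assess subjective and perspectival content.
Contribution: (1) Proposes NarraBench, a theory-informed taxonomy of narrative tasks; (2) systematically surveys 78 existing benchmarks; (3) identifies under-evaluated narrative areas and the need for benchmarks that handle subjective content.
Method: (1) Designs a taxonomy of narrative tasks; (2) surveys and analyzes 78 existing benchmarks; (3) estimates the gaps in evaluation coverage.
Result: Only 27% of narrative tasks are well captured by existing benchmarks. Several key aspects, including events and style, are nearly absent from current evaluations, and the ability to assess subjective content is lacking.
Insight: (1) Narrative-understanding evaluation needs to expand into more areas; (2) evaluating subjective and perspectival content is a priority for future work.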
Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.
[12] CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
Nafiseh Nikeghbal,Amir Hossein Kargaran,Jana Diesner
Main category: cs.CL
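TL;DR: Introduces CoBia, a suite of lightweight adversarial attacks for systematically analyzing the societal biases that large language models (LLMs) reveal in constructed conversations, covering social groups defined by gender, race, religion, and other categories.
Details
Motivation: Although LLM safety guardrails keep improving, models can still exhibit harmful behavior in conversations, such as expressing racist viewpoints. The paper uses constructed conversations to surface these concealed biases.
Contribution: Proposes CoBia for triggering and evaluating biased behavior in conversations, and reports systematic experiments on 11 open-source and proprietary LLMs.
Method: Creates a constructed conversation in which the model utters a biased claim, then evaluates whether the model can recover from that claim and reject biased follow-up questions. The analysis covers six socio-demographic categories.
Result: LLMs readily amplify bias in constructed conversations and often fail to reject biased follow-up questions, revealing deeply embedded biases.
Insight: Biases may be masked by standard safety checks, whereas interactive attacks such as CoBia surface them more effectively.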
Abstract: Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs’ reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.
[13] iBERT: Interpretable Style Embeddings via Sense Decomposition
Vishal Anand,Milad Alshomary,Kathleen McKeown
Main category: cs.CL
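TL;DR: iBERT is an interpretable BERT encoder that produces interpretable and controllable embeddings by decomposing tokens into context-independent sense vectors, giving a modular representation of stylistic and semantic structure.
Details
Motivation: Embeddings from existing BERT models are hard to interpret and control. iBERT addresses this through a modular, explicit sense decomposition.
Contribution: Proposes iBERT, which represents inputs as sparse, non-negative mixtures of sense vectors, supporting modular style control and interpretability.
Method: Each input token is represented as a mixture over k context-independent sense vectors. These can be pooled into sentence embeddings or used directly at the token level, enabling modular control over style attributes.
Result: On the STEL benchmark, iBERT improves style representation effectiveness by about 8 points over SBERT-style baselines while remaining competitive on authorship verification.
Insight: The structured sense decomposition is not limited to style modeling. It can generalize to tasks where supervision blends stylistic and semantic factors, exposing the latent interpretability of embeddings.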
Abstract: We present iBERT (interpretable-BERT), an encoder to produce inherently interpretable and controllable embeddings - designed to modularize and expose the discriminative cues present in language, such as stylistic and semantic structure. Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used directly at the token level. This enables modular control over representation, before any decoding or downstream use. To demonstrate our model’s interpretability, we evaluate it on a suite of style-focused tasks. On the STEL benchmark, it improves style representation effectiveness by ~8 points over SBERT-style baselines, while maintaining competitive performance on authorship verification. Because each embedding is a structured composition of interpretable senses, we highlight how specific style attributes - such as emoji use, formality, or misspelling - can be assigned to specific sense vectors. While our experiments center on style, iBERT is not limited to stylistic modeling. Its structural modularity is designed to interpretably decompose whichever discriminative signals are present in the data - enabling generalization even when supervision blends stylistic and semantic factors.
[14] DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch,Lei Gao,Chaoyi Jiang,Murali Annavarm
Main category: cs.CL
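TL;DR: DELTA is a dynamic layer-aware token attention mechanism whose layer-wise design enables efficient long-context reasoning, reducing computation while preserving model accuracy.
Details
Motivation: Existing sparse attention methods lose accuracy on long reasoning tasks because of cumulative selection errors and the dynamic importance of tokens. DELTA aims to solve this problem.
Contribution: DELTA is a training-free sparse attention mechanism. Its layer-wise design (full-attention layers, selection layers, and sparse-attention layers) reduces computation while maintaining accuracy.
Method: DELTA partitions transformer layers into three groups: initial full-attention layers, selection layers that identify salient tokens via aggregated head-level attention scores, and sparse-attention layers that attend only to the selected subset.
Result: On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy while reducing the number of attended tokens by up to 5x and delivering a 1.5x end-to-end speedup.
Insight: Selective reuse of intermediate attention maps offers a new path toward efficient long-context reasoning.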
Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of selection layers that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $5\times$ and delivering $1.5\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
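Illustration (not the authors' implementation): a minimal sketch of what a selection layer might do, aggregating per-head attention weights for the current query and keeping a fixed budget of cached tokens for later sparse-attention layers. Aggregation by plain summation, the `budget` parameter, and the toy scores are assumptions; the abstract only says that head-level scores are aggregated.

```python
def select_tokens(head_scores, budget):
    """Pick the cached tokens with the highest attention mass summed over heads.

    head_scores[h][t]: attention weight of head h on cached token t (toy input).
    Returns the selected token positions in ascending order.
    """
    num_tokens = len(head_scores[0])
    totals = [sum(head[t] for head in head_scores) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: totals[t], reverse=True)
    return sorted(ranked[:budget])


# Two heads attending over four cached tokens (made-up weights).
head_scores = [
    [0.5, 0.1, 0.1, 0.3],
    [0.1, 0.1, 0.6, 0.2],
]
selected = select_tokens(head_scores, budget=2)
print(selected)  # [0, 2]
```

Later layers in this sketch would attend only to positions 0 and 2, even though the full cache is still kept, which matches the abstract's point that the KV cache is preserved while full attention is skipped.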
[15] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs
Xu Pan,Ely Hahami,Jingxuan Fan,Ziqian Xie,Haim Sompolinsky
Main category: cs.CL
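TL;DR: Studies how autoregressive large language models (arLLMs) and masked diffusion large language models (dLLMs) differ in data efficiency and knowledge injection during fine-tuning, and proposes a masked fine-tuning method that improves arLLM data efficiency.
Details
Motivation: Autoregressive models suffer from problems such as the “reversal curse” when new knowledge is injected, while masked diffusion models show advantages in pre-training. Whether those advantages carry over to fine-tuning is unknown, and the paper aims to fill this gap.
Contribution: (1) Confirms that dLLMs are data-efficient and free of the reversal curse during fine-tuning; (2) proposes a masked fine-tuning paradigm that drastically improves the data efficiency of arLLMs.
Method: Fine-tunes arLLMs and dLLMs on three datasets and evaluates knowledge generalization with forward and backward question answering. Proposes a masked fine-tuning method for arLLMs.
Result: dLLMs achieve high accuracy on both question types without paraphrase augmentation. arLLMs rely on paraphrases, which help only when their information order matches the question style. The proposed masked fine-tuning effectively closes the performance gap between arLLMs and dLLMs.
Insight: Masked diffusion models have an inherent advantage in knowledge injection and data efficiency, and an improved fine-tuning method can substantially raise the performance of autoregressive models.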
Abstract: Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the “reversal curse” – the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and free of the “reversal curse” in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e. whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward and backward style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLM’s performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method successfully and drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.
[16] Abductive Preference Learning
Yijin Ni,Peng Qi
Main category: cs.CL
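TL;DR: Proposes abductive preference learning, which reverses the conventional conditioning of preference learning to address existing methods' weakness on counterfactual prompts. Experiments show clear gains in both response selection and prompt discrimination.
Details
Motivation: Frontier large language models such as GPT-5 and Claude Sonnet remain overconfident even after alignment with RLHF and DPO, and in particular fail to distinguish counterfactual prompts. The paper uses abductive preference learning to address this.
Contribution: (1) Proposes abductive preference learning, which reverses the conditioning by learning preferences over prompts given a response; (2) constructs an abductive dataset with 1,001 entries; (3) designs a multitask objective that combines standard and abductive methods.
Method: (1) Reverses the conventional conditioning to build the abductive preference learning framework; (2) builds an abductive dataset from the HaluEval QA benchmark; (3) implements abductive DPO and its variant DPOP, and combines standard and abductive objectives in a multitask setup.
Result: With multitask DPOP, response selection accuracy rises from 90.0% to 99.5% and prompt discrimination accuracy from 54.7% to 85.0%. On AlpacaEval, win rate improves from 5.26% to 6.17%.
Insight: (1) Conventional preference learning neglects counterfactual prompts; (2) learning preferences over prompts given a response improves sensitivity to prompt differences; (3) a multitask objective combines the strengths of standard and abductive methods.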
Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer “No” to both questions “Can I eat the [food / potato chips] that has been left out overnight?” despite the latter requiring no refrigeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from 90.0% to 99.5% in response selection and 54.7% to 85.0% in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from 5.26% to 6.17%), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
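Illustration (not the authors' code): a minimal sketch of the standard DPO loss on scalar log-probabilities, reused with the conditioning reversed. The standard form compares two responses under one prompt. The abductive form shown here, which compares two prompts under one fixed response, is only an assumed reading of "learning preferences over prompts given a response"; the paper's exact objective may differ. All log-probability values are made up.

```python
import math


def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy-vs-reference
    log-ratios of the preferred and the rejected item."""
    margin = (logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Standard DPO: one prompt x, preferred response y_w, rejected response y_l.
# Arguments are log pi(y_w | x), log pi(y_l | x), then the reference-model values.
standard = dpo_loss(-4.0, -6.0, -5.0, -5.0)

# Assumed abductive variant: one response y, preferred prompt x_w, rejected prompt x_l.
# Arguments are log pi(y | x_w), log pi(y | x_l), then the reference-model values.
abductive = dpo_loss(-3.0, -3.5, -3.2, -3.2)

print(standard, abductive)
```

The same function serves both cases because only the pairing changes: the policy is rewarded for assigning the response a higher likelihood under the prompt that actually warrants it.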
[17] HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection
Guanming Chen,Lingzhi Shen,Xiaohao Cai,Imran Razzak,Shoaib Jameel
Main category: cs.CL
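TL;DR: Proposes HIPPD, a brain-inspired framework for detecting personality traits from text. HIPPD emulates the hierarchical information processing of the human brain by combining a large language model, a dynamic memory module, and lightweight specialised models, and improves detection performance.
Details
Motivation: Existing personality detection methods struggle to capture contextual information spanning multiple posts and to extract robust features in semantically sparse environments. HIPPD addresses these problems by emulating the brain's hierarchical processing.
Contribution: A brain-inspired hierarchical framework that combines a large language model (simulating the cerebral cortex), a dynamic memory module (modelled after the prefrontal cortex), and lightweight specialised models (emulating the basal ganglia), with adjustments driven by dopaminergic prediction error feedback.
Method: HIPPD has three parts: (1) a large language model performs global semantic reasoning and deep feature abstraction; (2) a dynamic memory module performs adaptive gating and selective retention of critical features; (3) a set of lightweight specialised models is dynamically routed via a strict winner-takes-all mechanism to capture the patterns each recognises best.
Result: Experiments on the Kaggle and Pandora datasets show that HIPPD consistently outperforms state-of-the-art baselines on personality detection.
Insight: Emulating the brain's hierarchical information processing can improve text analysis, particularly for context awareness and multi-level feature extraction.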
Abstract: Personality detection from text aims to infer an individual’s personality traits based on linguistic patterns. However, existing machine learning approaches often struggle to capture contextual information spanning multiple posts and tend to fall short in extracting representative and robust features in semantically sparse environments. This paper presents HIPPD, a brain-inspired framework for personality detection that emulates the hierarchical information processing of the human brain. HIPPD utilises a large language model to simulate the cerebral cortex, enabling global semantic reasoning and deep feature abstraction. A dynamic memory module, modelled after the prefrontal cortex, performs adaptive gating and selective retention of critical features, with all adjustments driven by dopaminergic prediction error feedback. Subsequently, a set of specialised lightweight models, emulating the basal ganglia, are dynamically routed via a strict winner-takes-all mechanism to capture the personality-related patterns they are most proficient at recognising. Extensive experiments on the Kaggle and Pandora datasets demonstrate that HIPPD consistently outperforms state-of-the-art baselines.
[18] Don’t Throw Away Your Pretrained Model
Shangbin Feng,Wenhao Yu,Yike Wang,Hongming Zhang,Yulia Tsvetkov,Dong Yu
Main category: cs.CL
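TL;DR: This paper proposes Switch Generation, a model collaboration method in which pretrained and aligned models switch dynamically while generating a response, combining their respective strengths to improve task performance.
Details
Motivation: Alignment training improves reasoning and instruction following but can weaken skills such as creativity and calibration. The authors use model collaboration to get the strengths of both.
Contribution: 1. Proposes Switch Generation, which switches between models segment by segment to make the most of each model's skills. 2. Validates on 18 datasets that the method outperforms single models and collaboration baselines.
Method: Train a switcher LM that learns, across queries and contexts, which model is best suited to generate the next segment. At inference time, the switcher LM dynamically directs the different models as they generate the response.
Result: Model collaboration outperforms individual models on 16 of 18 tasks, and Switch Generation outperforms the collaboration baselines by 12.9% on average. The method also discovers compositional skills and generalizes to unseen models and tasks.
Insight: Model collaboration can reuse by-products of training pipelines that are usually discarded (such as different model checkpoints) to solve tasks that single models struggle with.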
Abstract: Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, where unaligned base models are better. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to “speak” in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.
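The inference-time behaviour described in the abstract, where a switcher picks which checkpoint writes the next segment, can be pictured with a small loop. This is a toy sketch under stated assumptions: the callables, the `<eos>` stop marker, the segment limit, and the stub models are all hypothetical stand-ins, and the trained switcher LM is replaced by a hand-written rule.

```python
from typing import Callable, Dict, List, Tuple

Generator = Callable[[str], str]            # context -> next segment
Switcher = Callable[[str, List[str]], str]  # (context, model names) -> chosen name


def switch_generate(
    query: str,
    models: Dict[str, Generator],
    switcher: Switcher,
    max_segments: int = 4,
    stop_marker: str = "<eos>",
) -> Tuple[str, List[str]]:
    """Build a response segment by segment, letting the switcher pick the writer."""
    context = query
    trace: List[str] = []  # which model wrote each segment
    for _ in range(max_segments):
        name = switcher(context, list(models))
        segment = models[name](context)
        trace.append(name)
        if segment.endswith(stop_marker):
            context += segment[: -len(stop_marker)]
            break
        context += segment
    return context[len(query):], trace


# Hypothetical stubs standing in for a base checkpoint and an aligned checkpoint.
def base_model(context: str) -> str:
    return " creative"


def aligned_model(context: str) -> str:
    return " answer.<eos>"


def toy_switcher(context: str, names: List[str]) -> str:
    # Stand-in rule: let the base model open, then hand over to the aligned model.
    return "aligned" if "creative" in context else "base"


response, trace = switch_generate(
    "Q", {"base": base_model, "aligned": aligned_model}, toy_switcher
)
```

Keeping the switcher as a separate callable mirrors the abstract's separation between the models that generate and the model that routes, so any of the three can be swapped without touching the loop.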
[19] Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning
Sicong Huang,Qianqi Yan,Shengze Wang,Ian Lane
Main category: cs.CL
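TL;DR: This paper improves the faithfulness of abstractive summarization through span-level fine-tuning, introducing a new dataset and evaluating three fine-tuning techniques, of which unlikelihood training works best.
Details
Motivation: Summaries generated by large language models (LLMs) are fluent but can be unfaithful (for example, hallucinated). Existing methods do not fully solve this, so more effective fine-tuning strategies are needed.
Contribution: 1. Introduces a new dataset of faithful and unfaithful summaries with span-level labels; 2. Evaluates three fine-tuning techniques (gradient ascent, unlikelihood training, task vector negation) for improving summary faithfulness.
Method: 1. Generate summaries for the training set automatically with a variety of LLMs and use GPT-4o to annotate hallucinated spans; 2. Fine-tune LLMs on both faithful summaries and unfaithful spans, comparing the three techniques.
Result: All three methods use the span annotations to improve faithfulness, with unlikelihood training the most effective.
Insight: Direct supervision from span-level annotations reduces hallucination, and unlikelihood training is an effective fine-tuning strategy.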
Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the set of source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. We introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels, and we evaluate three techniques for fine-tuning an LLM to improve the faithfulness of the resulting summarization: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.
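To make the span-level idea concrete, the sketch below applies the usual unlikelihood term, -log(1 - p), to tokens inside annotated unfaithful spans and ordinary negative log-likelihood to every other token. The abstract does not give the authors' exact loss, so the per-token averaging, the boolean mask format, and the function name are assumptions.

```python
import math
from typing import List


def span_unlikelihood_loss(
    token_probs: List[float],
    unfaithful: List[bool],
    eps: float = 1e-8,
) -> float:
    """Mean per-token loss over one summary.

    token_probs[i] is the model's probability of the i-th reference token;
    unfaithful[i] is True when that token lies inside an annotated
    hallucinated span.
    """
    if len(token_probs) != len(unfaithful):
        raise ValueError("token_probs and unfaithful must have the same length")
    total = 0.0
    for p, bad in zip(token_probs, unfaithful):
        if bad:
            # Unlikelihood term: push probability mass away from this token.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Standard likelihood term for faithful tokens.
            total += -math.log(max(p, eps))
    return total / len(token_probs)
```

A model that still assigns high probability to a flagged token pays a large penalty, while faithful tokens are trained as usual, which is how span labels can supervise more precisely than a summary-level label.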
[20] Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey
Jiaqi Wei,Xiang Zhang,Yuejin Yang,Wenxuan Huang,Juntai Cao,Sheng Xu,Xiang Zhuang,Zhangyang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Chenyu You,Wanli Ouyang,Siqi Sun
Main category: cs.CL
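TL;DR: This survey proposes a unified framework that decomposes tree search algorithms and reward design into three core components, resolves the ambiguous role of the reward signal, and points the way for research on autonomous, self-improving agents.
Details
Motivation: Tree search is a key paradigm in modern large language model (LLM) research, but the field lacks a common formalism, and the role of the reward signal is especially unclear. The paper aims to give the research a clear theoretical footing.
Contribution: 1. A unified framework that decomposes search algorithms into three core components: search mechanism, reward formulation, and transition function; 2. A formal distinction between transient search guidance and durable parametric reward modeling; 3. A component-centric taxonomy and a synthesis of recent work.
Method: The paper formally decomposes tree search algorithms into search mechanism, reward formulation, and transition function, and classifies the roles the reward signal plays.
Result: A systematic framework and taxonomy that lay groundwork for future research on autonomous, self-improving agents.
Insight: 1. Reward signal design can be distinguished along transient and durable lines; 2. Research on search algorithms needs to account for specific task requirements and model-improvement goals.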
Abstract: Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: Test-Time Scaling (TTS), which deploys on-demand computation to solve hard problems, and Self-Improvement, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal – is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the Search Mechanism, Reward Formulation, and Transition Function. We establish a formal distinction between transient Search Guidance for TTS and durable Parametric Reward Modeling for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.
[21] Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations
Yimin Xiao,Yongle Zhang,Dayeon Ki,Calvin Bao,Marianna J. Martindale,Charlotte Vaughn,Ge Gao,Marine Carpuat
Main category: cs.CL
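TL;DR: The paper studies how lay users use machine translation (MT) and finds that non-bilingual users over-rely on MT because they lack evaluation strategies and alternatives, while experiencing errors can prompt users to reassess future reliance.
Details
Motivation: As MT becomes more common, understanding how the public perceives and uses imperfect MT is essential for contextualizing MT research in real-world applications.
Contribution: Documents over-reliance on MT among non-bilingual users and argues that MT evaluation and NLP explanation techniques are needed to improve users' MT literacy.
Method: A human study conducted in a public museum (n=452) examining how fluency and adequacy errors affect bilingual and non-bilingual users' reliance on MT.
Result: Non-bilingual users often over-rely on MT because they lack the means to evaluate it, but experiencing errors can prompt them to reassess future reliance.
Insight: Alongside improving MT quality, users' MT literacy needs attention, to reduce uncritical reliance on imperfect translations.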
Abstract: As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.
[22] MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Jianjin Wang,Runsong Zhao,Xiaoqian Liu,Yuan Ge,Ziqiang Xu,Tong Xiao,Shengxiang Gao,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
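TL;DR: The paper introduces a multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, predicting multiple subsequent tokens to increase semantic density and translation quality. It further proposes the MTP-S2UT loss, which applies MTP at an intermediate layer so that semantic enrichment happens earlier.
Details
Motivation: Current speech-to-speech translation methods use speech tokens as intermediate representations, but a single speech token is not semantically dense enough to express a complete semantic unit. A method that predicts multiple tokens is needed to improve semantic completeness and information density.
Contribution: 1. Introduces an MTP loss that strengthens the semantic expressiveness of S2UT models. 2. Proposes the MTP-S2UT loss, which applies MTP at an intermediate layer for earlier and more effective semantic enrichment.
Method: 1. Apply the MTP loss at the final layer of the S2UT model to increase the semantic density of the output representation. 2. Extend it to the intermediate layer where the CTC loss is computed (the MTP-S2UT loss), moving information enrichment earlier.
Result: All MTP loss variants improve S2UT translation quality, with MTP-S2UT performing best.
Insight: Semantic enrichment should not be confined to the output layer; moving it to intermediate layers captures semantics earlier and more fully.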
Abstract: Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of hidden representation. Consequently, we propose MTP-S2UT loss, applying MTP loss to hidden representation where CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
[23] Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
Shu Zhao,Tan Yu,Anbang Xu
Main category: cs.CL
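TL;DR: This paper proposes ExpandSearch, which trains an LLM search agent with reinforcement learning to perform query expansion and pairs it with a pre-trained squeezer model to improve retrieval recall and answer generation, for an average gain of 4.4% across seven question-answering benchmarks.
Details
Motivation: Existing search agents such as Search-R1 are limited in reasoning and search, which hurts performance on multi-hop QA. A method is needed that can expand complex queries and retrieve effectively.
Contribution: 1. A reinforcement-learning-based query expansion method that lets an LLM search agent generate multiple query variants to cover more relevant information. 2. A pre-trained squeezer model that helps the agent understand retrieved documents so it can focus on generating high-recall queries. 3. An average improvement of 4.4% across seven question-answering benchmarks.
Method: 1. Train the LLM search agent with reinforcement learning to give it query expansion capability. 2. In each turn, the agent generates several query variants that are searched in parallel. 3. A pre-trained squeezer model helps interpret the retrieved documents, relieving the agent of part of its multi-task burden.
Result: Across seven question-answering benchmarks, ExpandSearch improves on state-of-the-art baselines by 4.4% on average.
Insight: Even a small (3B) LLM shows strong query expansion capability when paired with the squeezer model, suggesting that an auxiliary model can strengthen an LLM's core function.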
Abstract: Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
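The per-turn step in the abstract, where several query variants are searched simultaneously to cover more relevant information, reduces to a fan-out followed by a merge. The sketch below assumes a generic `search` callable and a first-seen de-duplication rule; neither the retriever interface nor the merge policy comes from the paper, and the reinforcement-learned variant generator and the squeezer model are out of scope here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def expand_and_search(
    variants: List[str],
    search: Callable[[str], List[str]],
    top_k: int = 3,
) -> List[str]:
    """Run every query variant concurrently and merge the hits.

    Results keep variant order, then rank order within a variant; a passage
    returned by several variants is kept only once.
    """
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in the same order as `variants`.
        result_lists = list(pool.map(lambda q: search(q)[:top_k], variants))

    merged: List[str] = []
    seen = set()
    for results in result_lists:
        for passage in results:
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged
```

Merging by union favours recall, which matches the abstract's stated aim; a precision-oriented system might instead rank passages by how many variants retrieved them.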
[24] Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety
Yuyi Huang,Runzhe Zhan,Lidia S. Chao,Ailin Tao,Derek F. Wong
Main category: cs.CL
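TL;DR: The paper studies Path Drift in long chain-of-thought (Long-CoT) reasoning by large language models (LLMs), where a model's reasoning trajectory departs from the safe path, and identifies three behavioral triggers along with a defense strategy.
Details
Motivation: Although alignment techniques such as RLHF provide early-stage safeguards, in long-chain reasoning the model's reasoning path can drift from aligned paths and produce unsafe content. The phenomenon is underexplored, so the paper aims to expose its mechanisms and propose mitigations.
Contribution: (1) Defines the Path Drift phenomenon; (2) identifies three behavioral triggers of Path Drift; (3) proposes a Path Drift Induction Framework and a defense strategy.
Method: Through empirical analysis the paper uncovers three behavioral triggers and designs a three-stage induction framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. It also proposes a defense based on role attribution correction and metacognitive reflection.
Result: Each stage of the induction framework independently reduces the model's refusal rate, and combining the stages compounds the effect; the defense strategy mitigates the risk of Path Drift.
Insight: The paper exposes the risk of Path Drift in long-chain reasoning and stresses that long-form reasoning needs trajectory-level alignment oversight, not just token-level alignment.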
Abstract: As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.
[25] CLMN: Concept based Language Models via Neural Symbolic Reasoning
Yibo Yang
Main category: cs.CL
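TL;DR: CLMN is a neural-symbolic framework that combines continuous, human-readable concept embeddings with fuzzy-logic reasoning to improve both the performance and the interpretability of NLP models.
Details
Motivation: Deep learning NLP models lack interpretability in domains such as healthcare and finance. Existing concept bottleneck models for text either use binary activations that harm text representations or latent concepts that weaken semantics, and they do not model dynamic concept interactions.
Contribution: Proposes the CLMN framework, which represents concepts as continuous, human-readable embeddings and learns dynamic concept interaction rules through fuzzy-logic reasoning, preserving both performance and interpretability.
Method: CLMN takes a neural-symbolic approach, producing concept-aware text representations and automatically inducing interpretable logic rules.
Result: Across multiple datasets and pre-trained language models, CLMN outperforms existing concept-based methods in both accuracy and explanation quality.
Insight: Combining neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.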
Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.
[26] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Hua Cai,Shuang Zhao,Liang Zhang,Xuli Shen,Qing Xu,Weilin Shen,Zihao Wen,Tianke Ban
Main category: cs.CL
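TL;DR: The paper introduces Unilaw-R1, a large language model for legal reasoning. A two-stage training strategy of supervised fine-tuning followed by reinforcement learning addresses insufficient legal knowledge, unreliable reasoning logic, and weak business generalization, and the model performs well on several benchmarks.
Details
Motivation: The ability of current large language models to handle complex legal problems is underexplored. Unilaw-R1 aims to fill this gap, using a lightweight model to cut deployment cost while improving legal reasoning performance and interpretability.
Contribution: 1. Unilaw-R1, a lightweight large language model designed for legal reasoning; 2. Unilaw-R1-Data, a high-quality dataset; 3. A two-stage training strategy (supervised fine-tuning plus reinforcement learning); 4. The Unilaw-R1-Eval benchmark.
Method: 1. Build a dataset of 17K high-quality chain-of-thought samples; 2. Train in two stages with supervised fine-tuning (SFT) and reinforcement learning (RL); 3. Use iterative inference to improve legal reasoning.
Result: Unilaw-R1 performs strongly on authoritative benchmarks, outperforming models of similar scale and matching the much larger DeepSeek-R1-Distill-Qwen-32B; on LawBench and LexEval it exceeds Qwen-2.5-7B-Instruct by 6.6% on average.
Insight: Combining a lightweight design, a high-quality dataset, and two-stage training can markedly improve reasoning and generalization in the legal domain; the use of reinforcement learning also improves interpretability.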
Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also shows significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.
[27] Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning
Renliang Sun,Wei Cheng,Dawei Li,Haifeng Chen,Wei Wang
Main category: cs.CL
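TL;DR: This paper proposes REFRAIN, a framework that adaptively stops reasoning to cut redundant steps in the Chain-of-Thought (CoT) reasoning of large language models (LLMs), lowering cost and avoiding incorrect conclusions.
Details
Motivation: CoT reasoning improves LLM performance on complex tasks, but redundant reasoning steps ("overthinking") raise compute cost and can lead to wrong conclusions, so an adaptive stopping mechanism is needed.
Contribution: Proposes REFRAIN, a training-free method that decides dynamically when to stop reasoning in order to reduce redundancy.
Method: REFRAIN combines a two-stage stop discriminator with a sliding-window UCB (SW-UCB) multi-armed bandit controller that adjusts stopping thresholds to the difficulty of each problem.
Result: Across four benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy.
Insight: The study highlights "when to stop" as a new axis of test-time scaling, letting models reason just enough.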
Abstract: Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning – so-called overthinking – can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (REFlective-Redundancy for Adaptive INference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling – enabling models to reason not just more, but just enough.
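For readers unfamiliar with the controller named in the abstract, here is a minimal sliding-window UCB bandit whose arms are candidate stopping thresholds. It is a generic textbook-style sketch, not REFRAIN's implementation: the window size, the exploration constant `c`, the threshold values, and the reward signal are all assumptions.

```python
import math
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowUCB:
    """UCB over a fixed set of arms, using only the most recent `window` plays."""

    def __init__(self, arms: List[float], window: int = 50, c: float = 1.0):
        self.arms = list(arms)
        self.c = c
        # Old (arm, reward) pairs drop out automatically once the window is full.
        self.history: Deque[Tuple[float, float]] = deque(maxlen=window)

    def select(self) -> float:
        counts = {a: 0 for a in self.arms}
        sums = {a: 0.0 for a in self.arms}
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Any arm with no play inside the window is tried first.
        for a in self.arms:
            if counts[a] == 0:
                return a
        t = len(self.history)

        def score(a: float) -> float:
            mean = sums[a] / counts[a]
            bonus = self.c * math.sqrt(math.log(t) / counts[a])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm: float, reward: float) -> None:
        self.history.append((arm, reward))
```

A caller would use `select()` to pick a threshold for the next problem, run the reasoning with it, and then call `update()` with some reward of its own design (for example, one that trades off correctness proxies against tokens spent). Because statistics are computed over the window only, the controller can shift its preferred threshold when problem difficulty changes.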
[28] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Luyao Zhuang,Shengyuan Chen,Yilin Xiao,Huachi Zhou,Yujing Zhang,Hao Chen,Qinggang Zhang,Xiao Huang
Main category: cs.CL
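TL;DR: LinearRAG is an efficient retrieval-augmented generation (RAG) framework that builds a relation-free hierarchical graph (Tri-Graph), avoiding the unstable and costly relation extraction of earlier methods and improving retrieval on large-scale corpora.
Details
Motivation: Traditional RAG systems perform poorly on large-scale unstructured corpora, while knowledge-graph-based GraphRAG methods depend on unstable, costly relation extraction that yields low-quality graphs and degrades retrieval.
Contribution: Proposes the LinearRAG framework: 1) builds a relation-free Tri-Graph, avoiding relation extraction; 2) adopts a two-stage retrieval strategy (local semantic bridging and global importance aggregation).
Method: 1) Build the Tri-Graph with lightweight entity extraction and semantic linking; 2) retrieve in two stages: first activate relevant entities, then retrieve passages by aggregating importance.
Result: Experiments on four datasets show that LinearRAG significantly outperforms baseline models.
Insight: Avoiding complex relation extraction improves efficiency and reliability, making the approach suitable for multi-hop reasoning over large-scale corpora.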
Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
[29] Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
Zilong Wang,Xiaoyu Shen
Main category: cs.CL
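TL;DR: The paper proposes a hybrid framework combining OCR engines with large language models (LLMs) for enterprise-scale document information extraction, optimizing the accuracy-efficiency trade-off on copy-heavy tasks in particular.
Details
Motivation: Processing large volumes of structurally similar documents is a critical enterprise task, but existing methods are not tailored to it and handle repetitive tasks inefficiently. The paper aims to improve performance by selecting strategies according to document characteristics.
Contribution: 1. A hybrid OCR-LLM framework focused on copy-heavy document extraction; 2. 25 configurations implemented across three extraction paradigms (direct, replacement, table-based); 3. Efficient, high-accuracy results across multiple document formats.
Method: The framework combines OCR engines and LLMs, extracts information through three paradigms (direct, replacement, table-based), and uses format-aware routing to handle heterogeneous document streams.
Result: F1=1.0 with 0.97 s latency on structured documents; F1=0.997 with 0.6 s latency on image inputs; a 54-fold performance improvement over naive approaches.
Insight: The repetitiveness of copy-heavy tasks can be turned into an optimization opportunity through structure-aware method selection, a design principle that generalizes to similar tasks.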
Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers outstanding results: F1=1.0 accuracy with 0.97 s latency for structured documents, and F1=0.997 accuracy with 0.6 s latency for challenging image inputs when integrated with PaddleOCR, all while maintaining sub-second processing speeds. The 54-fold performance improvement over naive multimodal approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
[30] A Survey of Inductive Reasoning for Large Language Models
Kedi Chen,Dezhao Ruan,Yuhao Dan,Yaoting Wang,Siyu Yan,Xuecheng Wu,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Biqing Qi,Linyang Li,Qipeng Guo,Xiaoming Shi,Wei Zhang
Main category: cs.CL
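TL;DR: This paper is the first comprehensive survey of inductive reasoning in large language models (LLMs). It groups improvement methods into three categories, summarizes existing benchmarks, and proposes a unified sandbox-based evaluation approach.
Details
Motivation: Inductive reasoning is an important reasoning paradigm for LLMs that supports knowledge generalization and aligns with human cognition, but it has lacked a systematic summary.
Contribution: 1. The first comprehensive survey of inductive reasoning for LLMs; 2. A categorization of methods for improving inductive reasoning; 3. A summary of benchmarks and a unified evaluation approach.
Method: Groups improvement methods into post-training, test-time scaling, and data augmentation, and derives a sandbox-based evaluation approach.
Result: Provides analyses of where inductive ability comes from and of how simple model architectures and data help with inductive tasks.
Insight: Inductive ability may arise from the combination of model architecture and data design; the survey gives future research a foundation.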
Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, making it a fundamental mode of learning that is attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.
[31] MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems
Lei Gu,Yinghao Zhu,Haoran Sang,Zixiang Wang,Dehao Sui,Wen Tang,Ewen Harrison,Junyi Gao,Lequan Yu,Liantao Ma
Main category: cs.CL
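TL;DR: The paper presents an empirical study of multi-agent medical consultation systems based on large language models (LLMs), exposing failure modes in their collaboration and stressing the importance of transparent, verifiable reasoning.
Details
Motivation: Existing evaluations of multi-agent medical systems look only at final-answer accuracy and ignore the transparency and reliability of the internal collaborative process, which can have serious consequences in high-stakes medical applications.
Contribution: Applies a mixed-methods approach (qualitative plus quantitative) to 3,600 cases, identifies four dominant collaborative failure modes, and argues that auditable reasoning processes are necessary.
Method: A large-scale empirical study across six representative multi-agent frameworks and six medical datasets, combining qualitative analysis with quantitative auditing to build a comprehensive taxonomy of collaborative failure modes.
Result: Four dominant failure modes: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis.
Insight: High accuracy alone is not enough to establish trust in medical AI; transparent, auditable reasoning is key to responsible development and deployment.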
Abstract: While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque “black boxes” and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.
[32] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
Yijie Xu,Huizai Yao,Zhiyu Guo,Weiyu Guo,Pengteng Li,Aiwei Liu,Xuming Hu,Hui Xiong
Main category: cs.CL
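TL;DR: The paper proposes SyTTA, a label-free test-time adaptation framework that combines input-side perplexity with output-side predictive entropy to improve language model performance under distribution shift.
Details
Motivation: Large language models deployed in specialized domains often face distribution shift from their training data, and labeled domain data is expensive and scarce, so an unsupervised test-time adaptation method is needed.
Contribution: Proposes SyTTA, which adapts at test time with only 4 extra tokens per query and no labeled data, improving performance in specialized domains.
Method: Combines input-side perplexity and output-side predictive entropy as complementary signals to adapt the model dynamically at test time.
Result: On tasks such as agricultural question answering, SyTTA yields large gains (for example, Rouge-LSum improves by over 120%).
Insight: Test-time adaptation can be driven by unsupervised signals, opening a path to deploying language models in label-scarce domains.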
Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.
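The two uncertainty signals named in the abstract have standard definitions, shown below for a single sequence and a single next-token distribution. How SyTTA couples them into an adaptation objective is not stated in the abstract, so this sketch only computes the signals; the function names and input formats are assumptions.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Input-side signal: exp of the mean negative log-probability of the prompt tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predictive_entropy(probs: List[float]) -> float:
    """Output-side signal: Shannon entropy (in nats) of one next-token distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

High perplexity on the prompt indicates unfamiliar domain wording, while high entropy during generation indicates a diffuse next-token distribution; the abstract treats the two as complementary because a model can be confident in its output on an input it models poorly, and the reverse.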
[33] Text2Token: Unsupervised Text Representation Learning with Token Target Prediction
Ruize An,Richong Zhang,Zhijie Nie,Zhanyu Wu,Yanzhao Zhang,Dingkun Long
Main category: cs.CL
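TL;DR: The paper proposes Text2Token, an unsupervised generative framework that learns text representations by predicting a target token distribution. On the MTEB v2 benchmark it is competitive with LLM2Vec, which uses contrastive learning.
Details
Motivation: Unsupervised text representation learning matters for NLP tasks, especially for making use of unlabeled web text. A recent finding that high-quality representations align with key tokens in the text prompted the authors to explore the connection between representation space and vocabulary space.
Contribution: 1. The Text2Token framework, which performs unsupervised text representation learning by predicting target token distributions; 2. An analysis of token-alignment properties and two methods for constructing target token distributions; 3. Experiments showing performance comparable to LLM2Vec.
Method: 1. Use the token target prediction task as the basis of the generative framework; 2. Construct target token distributions with data-driven and model-derived methods; 3. Analyze token-alignment properties with advanced embedders.
Result: On MTEB v2, Text2Token performs comparably to the contrastive-learning-based LLM2Vec. The analysis shows that vocabulary space and representation space are optimized together during training.
Insight: 1. Token alignment is key to high-quality text representations; 2. The joint optimization of vocabulary and representation spaces suggests directions for future work.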
Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web’s unlabeled texts. A recent empirical study finds that the high-quality representation aligns with the key token of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods – data-driven and model-derived – to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that vocabulary and representation spaces optimize together and toward the optimum solution during training, providing new ideas and insights for future work.
[34] ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement
Kangyang Luo,Yuzhuo Bai,Shuzheng Si,Cheng Gao,Zhitong Wang,Yingli Shen,Wenhao Li,Zhu Liu,Yufeng Han,Jiayi Wu,Cunliang Kong,Maosong Sun
Main category: cs.CL
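TL;DR: The paper proposes ImCoref-CeS, a framework that combines an improved supervised neural method with LLM reasoning. A lightweight bridging module and a biaffine scorer strengthen long-text encoding and the capture of positional information, and an LLM acting as a Checker-Splitter refines the coreference results.
Details
Motivation: Coreference resolution faces a choice: keep refining supervised neural methods based on small language models, or turn to the capabilities of large language models (LLMs). The paper aims to combine the strengths of both.
Contribution: 1) ImCoref, an improved coreference method that gains performance from a bridging module, a biaffine scorer, and hybrid mention regularization; 2) an LLM used as a Checker-Splitter to refine candidate mentions and coreference results.
Method: Two parts: 1) ImCoref improves the supervised neural network, strengthening long-text encoding and the capture of positional information; 2) an LLM acts as a multi-role agent that filters out invalid mentions and splits erroneous clusters.
Result: Experiments show that ImCoref-CeS outperforms existing state-of-the-art methods.
Insight: Combining supervised neural methods with LLM reasoning is an effective approach to coreference resolution, and the lightweight design and multi-role LLM agent suggest directions for future work.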
Abstract: Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
[35] Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
Samir Abdaljalil,Erchin Serpedin,Khalid Qaraqe,Hasan Kurban
Main category: cs.CL
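TL;DR: This paper proposes Audit-of-Understanding (AoU), a framework that constrains a language model's inference in mathematical reasoning so that unsupported assumptions do not lead to wrong conclusions, markedly improving accuracy and faithfulness.
Details
Motivation: Large language models (LLMs) often reach conclusions that look coherent but are wrong because they rest on unsupported assumptions. Existing methods focus on factual hallucinations or post-hoc verification and do not effectively address reasoning-induced hallucinations.
Contribution: (1) The AoU framework, which avoids the influence of unsupported assumptions through three phases: decomposing the query, auditing support, and constraining inference; (2) theoretical guarantees under perfect validation and risk bounds under imperfect validation; (3) empirical results showing that AoU clearly outperforms existing methods on several datasets.
Method: AoU proceeds in three steps: (1) decompose the query into candidate assumptions; (2) audit whether each assumption is supported; (3) run inference only on the validated subset. Formally, the approach is posterior-constrained inference, related to selective prediction and rejection learning.
Result: On GSM8K, MultiArith, and SVAMP, AoU improves accuracy and faithfulness, with gains of up to 30%, up to 45%, and 20-28% respectively over Chain-of-Thought and other baselines.
Insight: The core of AoU is constraining the inference process so the model does not rely on unsupported assumptions. The approach may extend beyond mathematical reasoning to other tasks that require rigorous reasoning.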
Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20–28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.
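The three AoU phases (decompose, audit, condition on the validated subset) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `audit_of_understanding`, `toy_decompose`, `toy_is_supported`, and `toy_answer` are hypothetical, and the toy stubs stand in for what would be LLM calls in practice.

```python
import re
from typing import Callable, Dict, List


def audit_of_understanding(
    query: str,
    decompose: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
    answer: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Hypothetical three-phase pipeline: decompose, audit, constrained inference."""
    # Phase 1: propose candidate assumptions for the query.
    candidates = decompose(query)
    # Phase 2: audit each candidate against the query itself.
    validated = [a for a in candidates if is_supported(query, a)]
    rejected = [a for a in candidates if a not in validated]
    # Phase 3: answer using only the validated assumptions.
    return {
        "answer": answer(query, validated),
        "validated": validated,
        "rejected": rejected,
    }


# Toy stand-ins (assumptions for illustration only; a real system would call a model).
def toy_decompose(query: str) -> List[str]:
    return [
        "Tom starts with 3 apples",
        "Tom buys 2 apples",
        "Tom gives away 1 apple",  # not stated anywhere in the query
    ]


def toy_is_supported(query: str, assumption: str) -> bool:
    # An assumption counts as supported only if every number in it occurs in the query.
    query_numbers = re.findall(r"\d+", query)
    return all(n in query_numbers for n in re.findall(r"\d+", assumption))


def toy_answer(query: str, premises: List[str]) -> str:
    # Add up the quantities mentioned in the validated premises.
    return str(sum(int(n) for p in premises for n in re.findall(r"\d+", p)))


result = audit_of_understanding(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    toy_decompose,
    toy_is_supported,
    toy_answer,
)
print(result["answer"], result["rejected"])
```

The point of the structure is that the unsupported premise never reaches the answering step, so it cannot influence the conclusion.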
[36] Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models
Liang Lin,Miao Yu,Moayad Aloqaily,Zhenhong Zhou,Kun Wang,Linsey Pang,Prakhar Mehrotra,Qingsong Wen
Main category: cs.CL
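TL;DR: The paper proposes a defense framework called Backdoor Collapse that aims to eliminate unknown backdoor threats in language models by injecting known backdoor triggers and then applying recovery fine-tuning, sharply lowering attack success rates while preserving the model's original performance.
Details
Motivation: Backdoor attacks are a serious threat to large language models (LLMs), and existing defenses make unrealistic assumptions about trigger settings. This paper addresses the problem with a defense that needs no prior knowledge of the trigger.
Contribution: A new defense framework, Backdoor Collapse, that injects known backdoor triggers and exploits their aggregation in representation space to remove unknown backdoors effectively.
Method: A two-stage procedure: first, inject known triggers to aggregate backdoor representations; then perform recovery fine-tuning to restore benign outputs.
Result: The average attack success rate falls to 4.41%, clean accuracy and utility stay within 0.5% of the original model, and the method generalizes across different types of backdoors.
Insight: The paper shows that backdoors aggregate in representation space and uses aggregation of known backdoors to defend against unknown ones, offering a new angle on backdoor defense.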
Abstract: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose a defense framework that requires no prior knowledge of trigger settings. Our method is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Our method leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1% to 69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
[37] On the Entity-Level Alignment in Crosslingual Consistency
Yihong Liu,Mingyang Wang,François Yvon,Hinrich Schütze
Main category: cs.CL
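TL;DR: Multilingual large language models suffer from entity alignment problems that undermine crosslingual consistency; the paper confirms this with entity-level translation tasks and proposes two methods that improve consistency.
Details
Motivation: Multilingual LLMs should hold the same factual knowledge across languages, but the causes of their inconsistency are poorly understood. This paper focuses on entity alignment.
Contribution: (1) Shows that entity alignment correlates strongly with crosslingual consistency; (2) proposes two methods, SubSub and SubInj, that improve consistency; (3) shows that the model aligns entity representations through internal pivot-language processing.
Method: Alignment is assessed with entity-level translation tasks. SubSub (subject substitution) and SubInj (subject injection) use English translations of the subject to improve prompts in other languages.
Result: Experiments show that SubSub and SubInj substantially improve factual recall accuracy and consistency, and the analysis exposes the model's internal processing mechanism.
Insight: Entity alignment is key to crosslingual consistency; pivot-language processing can align multilingual representations effectively, giving a practical strategy for multilingual factual prediction.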
Abstract: Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency – and its frequent failure – remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through model’s internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.
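A rough sketch of what integrating English subject translations into a prompt might look like. The abstract does not specify the exact prompt format, so everything here is an assumption for illustration: the function names `sub_sub` and `sub_inj`, the parenthesized injection format, and the German example prompt are all hypothetical.

```python
def sub_sub(prompt: str, subject: str, subject_en: str) -> str:
    """Substitution variant (assumed format): replace the subject with its English translation."""
    return prompt.replace(subject, subject_en, 1)


def sub_inj(prompt: str, subject: str, subject_en: str) -> str:
    """Injection variant (assumed format): keep the subject and add the English form beside it."""
    return prompt.replace(subject, f"{subject} ({subject_en})", 1)


# Hypothetical German factual-recall prompt whose subject is "Frankreich".
prompt_de = "Die Hauptstadt von Frankreich ist"
substituted = sub_sub(prompt_de, "Frankreich", "France")
injected = sub_inj(prompt_de, "Frankreich", "France")
print(substituted)
print(injected)
```

Both variants leave the rest of the prompt in the original language; only the subject mention changes.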
[38] MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
Hongwei Chen,Yishu Lei,Dan Zhang,Bo Ke,Danxiang Zhu,Xuyi Chen,Yuxiang Lu,Zhengjie Huang,Shikun Feng,Jingzhou He,Yu Sun,Hua Wu,Haifeng Wang
Main category: cs.CL
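TL;DR: MatryoshkaThinking is a recursive test-time scaling method that cuts computational cost substantially while keeping performance high, enabling efficient reasoning.
Details
Motivation: Conventional test-time scaling methods such as DeepConf are effective but computationally expensive. This paper aims to lower cost and improve performance by recursively exploiting the model's intrinsic capabilities.
Contribution: Proposes MatryoshkaThinking, which recursively uses the model's reasoning, verification, and summarization abilities, needs only 4% of the computation required by DeepConf, and scores 99.79 on AIME2025.
Method: Recursively applies the model's intrinsic capabilities (reasoning, verification, summarization) to retain more correct solutions and narrow the gap between Pass@k and Pass@1.
Result: Evaluations on multiple open-source models and multi-modal reasoning benchmarks confirm the method's effectiveness and generality: computational cost drops substantially while performance stays high.
Insight: Recursive test-time scaling offers a new direction for designing efficient, scalable inference strategies for language models.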
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy; however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model’s intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.
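For readers unfamiliar with the Pass@k versus Pass@1 gap mentioned above, the following sketch computes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples with c correct ones, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct answers among 10 samples, pass@1 is the plain success rate.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(p1, p5)
```

A large difference between `p5` and `p1` means correct solutions are being sampled but not reliably selected, which is the gap the method aims to narrow.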
[39] End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
Nam Luu,Ondřej Bojar
Main category: cs.CL
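TL;DR: The paper proposes an end-to-end architecture that combines pre-trained speech encoders with large language models (LLMs) to perform automatic speech recognition (ASR) and speech translation (ST) together; in English-to-German experiments it outperforms SeamlessM4T and matches a cascaded system.
Details
Motivation: Traditional cascaded and end-to-end approaches to speech translation each have strengths and weaknesses. The paper explores how pre-trained speech encoders and LLMs can be combined into a single end-to-end model that handles ASR and ST efficiently.
Contribution: An end-to-end architecture combining pre-trained speech encoders with LLMs that integrates ASR and ST and outperforms existing models such as SeamlessM4T in experiments.
Method: (1) A pre-trained speech encoder extracts speech features; (2) the features are fed to an LLM that generates text in the target language; (3) the whole system is trained end to end.
Result: On English-to-German translation the model outperforms SeamlessM4T, with a gain of up to 8% on the COMET-DA22 metric.
Insight: Combining speech encoders with LLMs lets an end-to-end model avoid the error propagation of cascaded systems while drawing on the capabilities of pre-trained models.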
Abstract: Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.
[40] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
Aashiq Muhamed,Leonardo F. R. Ribeiro,Markus Dreyer,Virginia Smith,Mona T. Diab
Main category: cs.CL
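TL;DR: The paper introduces RefusalBench, a generative evaluation method that tests whether language models in RAG systems can selectively refuse when the context is flawed. Frontier models fall below 50% refusal accuracy on multi-document tasks, and the paper points to directions for improvement.
Details
Motivation: Static benchmarks cannot reliably evaluate selective refusal under flawed context, because models can exploit dataset-specific artifacts or memorize test instances. This motivates dynamic evaluation.
Contribution: (1) RefusalBench, a dynamic evaluation framework that programmatically generates diagnostic test cases; (2) an account of systematic failure patterns in selective refusal; (3) the release of two benchmarks and the generation framework.
Method: (1) 176 perturbation strategies covering six categories of informational uncertainty and three intensity levels; (2) generative, dynamic creation of test cases; (3) evaluation of more than 30 models.
Result: Frontier models fall below 50% refusal accuracy on multi-document tasks; selective refusal is found to be a trainable, alignment-sensitive capability.
Insight: Selective refusal separates into two skills, detection and categorization, and neither scale nor extended reasoning currently improves performance, which gives a clear direction for improvement.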
Abstract: The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks – RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) – and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
[41] AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
Kai Zhang,Xinyuan Zhang,Ejaz Ahmed,Hongda Jiang,Caleb Kumar,Kai Sun,Zhaojiang Lin,Sanat Sharma,Shereen Oraby,Aaron Colak,Ahmed Aly,Anuj Kumar,Xiaozhong Liu,Xin Luna Dong
Main category: cs.CL
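TL;DR: AssoMem is a memory-augmented framework that improves question answering over large-scale memories by building an associative memory graph and fusing multi-dimensional signals (relevance, importance, and temporal alignment) with an adaptive strategy; it outperforms existing methods on several benchmarks.
Details
Motivation: Accurate recall from large-scale memory is a core challenge for AI-assistant question answering, especially in similarity-dense scenarios. Existing methods rely mainly on semantic distance to the query and overlook the multi-dimensional way humans associate information.
Contribution: The AssoMem framework, which organizes conversational information with an associative memory graph and automatically extracted clues, and an adaptive fusion strategy over multi-dimensional signals (relevance, importance, and temporal alignment).
Method: Builds an associative memory graph that anchors dialogue utterances to automatically extracted clues, and integrates the retrieval signals with an adaptive, mutual-information-driven fusion strategy.
Result: On three benchmarks and a new dataset, MeetingQA, AssoMem outperforms existing state-of-the-art methods in context-aware memory recall.
Insight: Modeling how humans associate information, together with multi-dimensional signals and adaptive fusion, can substantially improve memory-augmented question answering.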
Abstract: Accurate recall from large-scale memories remains a core challenge for memory-augmented AI assistants performing question answering (QA), especially in similarity-dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance-aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals (relevance, importance, and temporal alignment) using an adaptive mutual information (MI) driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms SOTA baselines, verifying its superiority in context-aware memory recall.
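To illustrate why fusing several retrieval signals can change the ranking, here is a minimal weighted-fusion sketch. It is not AssoMem's method: the paper derives its fusion adaptively from mutual information, whereas this sketch takes the weights as fixed inputs. The `Candidate` class, `fuse`, `rank`, the scores, and the memory texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    text: str
    relevance: float   # semantic similarity to the query, in [0, 1]
    importance: float  # e.g. how central the utterance is in the memory graph
    temporal: float    # how well the utterance matches the query's time frame


def fuse(c: Candidate, weights: Sequence[float]) -> float:
    """Weighted sum of the three signals, with weights normalized to sum to 1."""
    total = sum(weights)
    w_rel, w_imp, w_tmp = (w / total for w in weights)
    return w_rel * c.relevance + w_imp * c.importance + w_tmp * c.temporal


def rank(cands: List[Candidate], weights: Sequence[float]) -> List[Candidate]:
    """Sort candidates from highest to lowest fused score."""
    return sorted(cands, key=lambda c: fuse(c, weights), reverse=True)


memories = [
    Candidate("budget meeting in March", relevance=0.9, importance=0.2, temporal=0.1),
    Candidate("budget decision last week", relevance=0.7, importance=0.6, temporal=0.9),
]

semantic_only = rank(memories, (1.0, 0.0, 0.0))
all_signals = rank(memories, (1.0, 1.0, 1.0))
print(semantic_only[0].text, "|", all_signals[0].text)
```

With relevance alone the March meeting ranks first; once importance and temporal alignment count, the more recent and more central utterance overtakes it.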
[42] STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models
Geunyeong Jeong,Juoh Sun,Seonghee Lee,Harksoo Kim
Main category: cs.CL
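TL;DR: STEAM is a semantic-level knowledge editing framework that uses latent-space alignment to improve knowledge editing in LLMs, making edited knowledge more semantically coherent.
Details
Motivation: The knowledge in large language models is static and cannot be updated dynamically. Existing knowledge editing methods mostly optimize at the token level and lack semantic coherence.
Contribution: The STEAM framework, which uses semantic anchors and an alignment loss to improve knowledge editing, strengthening the model's ability to reason with edited knowledge and its semantic coherence.
Method: (1) Identify target representations that serve as semantic anchors; (2) use an alignment loss to guide the internal representation of the edited fact towards those anchors.
Result: Experiments show that STEAM improves reasoning with edited knowledge and semantic coherence, confirming the importance of latent-space alignment for knowledge editing.
Insight: Semantic-level editing integrates knowledge more naturally than token-level optimization, and latent-space alignment is the key.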
Abstract: Large Language Models store extensive factual knowledge acquired during large-scale pre-training. However, this knowledge is inherently static, reflecting only the state of the world at the time of training. Knowledge editing has emerged as a promising solution for updating outdated or incorrect facts without full retraining. However, most existing locate-and-edit methods primarily focus on token-level likelihood optimization without addressing semantic coherence. Our analysis reveals that such edited knowledge is often encoded as isolated residual streams in the model’s latent space, distinct from pre-existing knowledge and bypassing the natural reasoning process. To address this, we propose STEAM, a semantic-level knowledge editing framework that enhances integration of updated knowledge into the model’s knowledge structure. STEAM first identifies target representations as semantic anchors for the updated factual association, then guides the internal representation of the edited fact towards these anchors through an alignment loss during optimization. Experimental results demonstrate that STEAM improves the model’s ability to reason with edited knowledge and enhances semantic coherence, underscoring the importance of latent-space alignment for reliable and coherent knowledge editing. The code is available at https://github.com/GY-Jeong/STEAM.
[43] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Jingyi Chen,Zhimeng Guo,Jiyun Chun,Pichao Wang,Andrew Perrault,Micha Elsner
Main category: cs.CL
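TL;DR: The paper presents LISTEN, a benchmark that measures how much large audio language models (LALMs) rely on lexical versus acoustic information when processing emotion, and finds that current models depend mainly on lexical information rather than acoustic cues.
Details
Motivation: It is unclear whether large audio language models genuinely process acoustic information when understanding emotion or rely mainly on lexical content. The paper sets out to quantify the reliance on each kind of cue.
Contribution: The LISTEN benchmark, a controlled design that separates the roles of lexical and acoustic information in emotion understanding, and an evaluation of six state-of-the-art LALMs.
Method: LISTEN controls whether lexical and acoustic cues align or conflict and evaluates model behavior in each condition.
Result: Current LALMs show clear lexical dominance: they predict "neutral" when the lexical content is neutral, fail to classify distinct emotions when acoustic and lexical cues conflict, and approach chance performance in paralinguistic settings.
Insight: Current LALMs behave more like transcribers than listeners, leaning heavily on lexical semantics and underusing acoustic cues. LISTEN provides a framework for assessing emotion understanding in multimodal models.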
Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict “neutral” when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely “transcribe” rather than “listen,” relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.
[44] RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
Zhichao Xu,Minheng Wang,Yawei Wang,Wenqian Ye,Yuntao Du,Yunpu Ma,Yijun Tian
Main category: cs.CL
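TL;DR: RECON is a framework that compresses retrieved documents by adding an explicit summarization module inside the reasoning loop, improving the efficiency and performance of retrieval-augmented generation (RAG) systems.
Details
Motivation: RAG systems trained with reinforcement learning manage context inefficiently: long, noisy retrieved documents raise costs and degrade performance, so an efficient context compression method is needed.
Contribution: (1) An explicit summarization module that compresses evidence; (2) two-stage training (relevance pretraining and multi-aspect distillation) to improve summary quality; (3) integration into the Search-R1 pipeline, which shortens context substantially and improves performance.
Method: The summarizer is trained in two stages: (1) relevance pretraining on QA datasets; (2) multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. It is embedded in the reasoning loop to compress retrieved documents dynamically.
Result: RECON reduces context length by 35% and improves both training speed and inference latency. On QA benchmarks the average EM score rises by 14.5% for the 3B model and 3.0% for the 7B model.
Insight: RECON shows that learned context compression is essential for building efficient, scalable, high-performing RAG systems, with particular strength on multi-hop QA.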
Abstract: Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5% and the 7B model by 3.0%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at https://github.com/allfornancy/RECON.
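The EM (exact match) score reported above is conventionally computed after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles); whether RECON's evaluation uses exactly these normalization rules is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))


def average_em(predictions: List[str], gold_lists: List[List[str]]) -> float:
    """Mean exact-match score over a set of questions."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores)


score = average_em(
    ["The Eiffel Tower.", "London"],
    [["Eiffel Tower"], ["Berlin"]],
)
print(score)
```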
[45] Steering Over-refusals Towards Safety in Retrieval Augmented Generation
Utsav Maskey,Mark Dras,Usman Naseem
Main category: cs.CL
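TL;DR: The paper studies over-refusal caused by safety alignment of large language models (LLMs) in retrieval-augmented generation (RAG) and proposes SafeRAG-Steering to reduce it.
Details
Motivation: Safety-aligned LLMs tend to over-refuse benign requests because of aggressive filters, and in RAG both query intent and the properties of the retrieved context further influence refusal behavior.
Contribution: The RagRefuse benchmark for analyzing over-refusal in RAG, and SafeRAG-Steering, an embedding intervention that reduces over-refusal.
Method: RagRefuse is used to analyze how factors such as context contamination, domain, and harmful-text density affect refusal behavior; SafeRAG-Steering then steers embedding regions so that the model produces safe, non-refusing outputs.
Result: SafeRAG-Steering reduces over-refusal caused by context contamination in RAG pipelines while preserving legitimate refusals.
Insight: Model-specific alignment choices and context properties are key triggers of over-refusal, and embedding interventions are an effective mitigation.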
Abstract: Safety alignment in large language models (LLMs) induces over-refusals – where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
[46] When or What? Understanding Consumer Engagement on Digital Platforms
Jingyi Wu,Junying Liang
Main category: cs.CL
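TL;DR: The paper studies what drives consumer attention on digital platforms and finds that temporal dynamics influence engagement more than content topics do.
Details
Motivation: In today's digital service economy, content creators compete for consumer attention. Prior work emphasizes content features, yet creators often misjudge what audiences value. This study aims to identify the real drivers of consumer engagement.
Contribution: Documents a mismatch between what creators supply and what audiences demand, and reports that temporal dynamics have a strong influence on consumer engagement.
Method: Applies LDA topic modeling to a large corpus of TED Talks, compares creators' thematic supply with the demand expressed in audience engagement, and adds a longitudinal analysis of temporal effects.
Result: Temporal dynamics predict consumer engagement better than thematic content, suggesting that "when" matters more than "what".
Insight: The findings challenge the assumption that content features are the main driver of attention and stress the importance of timing and contextual factors, offering a new angle for optimizing audience engagement strategies.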
Abstract: Understanding what drives popularity is critical in today’s digital service economy, where content creators compete for consumer attention. Prior studies have primarily emphasized the role of content features, yet creators often misjudge what audiences actually value. This study applies Latent Dirichlet Allocation (LDA) modeling to a large corpus of TED Talks, treating the platform as a case of digital service provision in which creators (speakers) and consumers (audiences) interact. By comparing the thematic supply of creators with the demand expressed in audience engagement, we identify persistent mismatches between producer offerings and consumer preferences. Our longitudinal analysis further reveals that temporal dynamics exert a stronger influence on consumer engagement than thematic content, suggesting that when content is delivered may matter more than what is delivered. These findings challenge the dominant assumption that content features are the primary drivers of popularity and highlight the importance of timing and contextual factors in shaping consumer responses. The results provide new insights into consumer attention dynamics on digital platforms and carry practical implications for marketers, platform managers, and content creators seeking to optimize audience engagement strategies.
[47] Assessing Large Language Models for Structured Medical Order Extraction
A H M Rezaul Karim,Ozlem Uzuner
Main category: cs.CL
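TL;DR: The paper evaluates large language models on structured medical order extraction. A general-purpose LLaMA-4 17B model, with no domain fine-tuning and guided by a single in-context example, ranked 5th in the MEDIQA-OE 2025 shared task, showing its potential as a baseline for clinical NLP tasks.
Details
Motivation: Medical order extraction is essential for clinical decision-making and workflow automation, but orders come from diverse sources and span complex categories, so an efficient, general approach is needed.
Contribution: Shows that a general-purpose LLM (LLaMA-4 17B), without domain fine-tuning and using few-shot prompt engineering, can carry out medical order extraction effectively.
Method: Uses an instruction-tuned LLaMA-4 17B model with prompt engineering built around a single in-context example to extract structured medical orders.
Result: An average F1 score of 37.76 in MEDIQA-OE 2025, with notable improvements in reason and provenance accuracy, ranking 5th.
Insight: General-purpose LLMs combined with few-shot prompt engineering can serve as efficient, scalable baselines for clinical NLP tasks, reducing dependence on domain-specific models.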
Abstract: Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.
[48] Merlin’s Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting
Heming Xia,Cunxiao Du,Rui Li,Chak Tou Leong,Yongqi Li,Wenjie Li
Main category: cs.CL
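TL;DR: The paper proposes AdvPrompt, a framework that uses black-box adversarial prompting to reduce the computational overhead of large reasoning models (LRMs) while keeping accuracy high.
Details
Motivation: Large reasoning models excel at complex reasoning tasks, but their lengthy reasoning incurs high compute and latency costs that limit practical use. An efficient way to reduce "overthinking" is needed.
Contribution: The AdvPrompt framework, which generates high-quality adversarial prompts through black-box prompting and substantially shortens responses (token usage) while preserving performance.
Method: AdvPrompt is an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. It does not depend on model internals and applies to both open-source and closed-source LRMs.
Result: AdvPrompt cuts token usage by about 40% on average across four benchmarks. It achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 series, and on MATH-500 it reduces token usage by 35% for Claude-3.7 and 47% for Gemini-2.5.
Insight: Black-box prompting is a practical strategy for improving the efficiency of large reasoning models, and it generalizes across model scales and families.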
Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.
[49] Detecting Hallucinations in Authentic LLM-Human Interactions
Yujie Ren,Niklas Gruhlke,Anne Lauscher
Main category: cs.CL
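TL;DR: The paper introduces AuthenHallu, the first hallucination detection benchmark built from authentic LLM-human dialogues, addressing the fact that most existing benchmarks are artificially constructed.
Details
Motivation: Most current hallucination detection benchmarks are artificially constructed and do not reflect how LLMs hallucinate in real use, which limits their value for sensitive domains such as medicine and law.
Contribution: AuthenHallu, the first hallucination detection benchmark based on authentic LLM-human dialogues, with a statistical analysis showing a high hallucination rate, especially for math and number problems.
Method: Builds AuthenHallu by selecting and annotating samples from genuine LLM-human dialogues, and explores using vanilla LLMs as hallucination detectors.
Result: Hallucinations occur in 31.4% of query-response pairs, rising to 60.0% for Math & Number Problems. Vanilla LLMs perform only modestly as detectors.
Insight: Hallucinations in authentic dialogues differ markedly from those in artificially constructed benchmarks, and rates are higher in challenging domains, which underlines both the need for real-world data and the difficulty of the task.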
Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed (either through deliberate hallucination induction or simulated interactions) rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.
[50] BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman,Esteban Carlin,Hsing-Kuo Pao,Giovanni Beltrame,Ghaluh Indah Permata Sari,Yie-Tarng Chen
Main category: cs.CL
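TL;DR: BitMar is a low-bit multimodal fusion model with an external episodic memory, designed for edge devices and supporting efficient image-text generation.
Details
Motivation: Existing multimodal models such as cross-attention transformers are computationally heavy and hard to deploy on edge devices. Models that combine quantization with episodic memory can address this.
Contribution: BitMar, a quantized multimodal transformer that pairs 1.58-bit encoders with an external episodic memory for efficient image-text generation.
Method: A BitNet-style text encoder and a DiNOv2-based vision encoder produce compact embeddings that are combined to query a fixed-size key-value episodic memory. The decoder applies per-layer conditioning and uses attention sinks with a sliding-window mechanism to handle long inputs.
Result: BitMar delivers competitive captioning and multimodal understanding at low latency with a small model footprint.
Insight: Combining quantization with episodic memory reduces computation while keeping generated content contextually relevant, which suits edge deployment.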
Abstract: Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
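The "1.58-bit" figure comes from restricting each weight to the three values {-1, 0, +1}, since log2(3) is about 1.585 bits. The sketch below shows the absmean ternary quantization described for BitNet b1.58: scale by the mean absolute weight, round, and clip. Whether BitMar's encoders use exactly this rule is an assumption, and `ternary_quantize` is an illustrative name.

```python
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, +1} using absmean scaling.

    Returns the ternary codes and the scale, so that code * scale
    approximates the original weight.
    """
    # Scale factor: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
dequantized = [c * scale for c in codes]
print(codes, scale)
```

Small weights collapse to 0 and large ones saturate at plus or minus 1, so matrix multiplication reduces to additions and subtractions plus one scale per tensor.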
[51] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
Bowei He,Lihao Yin,Huiling Zhen,Shuqi Liu,Han Wu,Xiaokun Zhang,Mingxuan Yuan,Chen Ma
Main category: cs.CL
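TL;DR: The paper examines how calibration data affects large language model (LLM) capabilities under post-training compression and, based on an analysis of activation patterns, proposes a calibration data curation framework that improves the complex reasoning ability of compressed models.
Details
Motivation: Existing studies mostly look only at how the source or sample count of calibration data affects language modeling or commonsense reasoning. A systematic analysis of the compositional properties and domain correspondence of calibration data is missing, especially regarding high-level complex reasoning.
Contribution: (1) A systematic analysis of how calibration data affects different LLM capabilities; (2) evidence from activation patterns that the representativeness and diversity of calibration data are critical to model capability; (3) an analysis-driven calibration data curation framework that improves capability preservation in existing post-training compression methods.
Method: Analyzes the representativeness and diversity of calibration data in activation space, proposes a data curation framework, and evaluates model capability on high-level tasks such as math problem solving and code generation.
Result: The proposed framework improves the performance of compressed models on complex reasoning tasks.
Insight: The quality of calibration data depends not only on sample count and source; representativeness and diversity in activation space matter more for preserving model capability.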
Abstract: Post-training compression has been a widely employed approach to scale down large language models (LLMs) and facilitate efficient inference. In various proposed compression methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. The few existing works, though recognizing the significance of this question, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. In particular, we explore the calibration data’s impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is available at https://github.com/BokwaiHo/COLA.git.
[52] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation
Omid Reza Heidari,Siobhan Reid,Yassine Yaakoubi
Main category: cs.CL
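TL;DR: AGENTIQL is a multi-expert framework for text-to-SQL generation. It decomposes questions, generates sub-queries, and refines column selection, and it uses a routing mechanism to balance efficiency and accuracy, achieving strong results on the Spider benchmark.
Details
Motivation: Existing large language models (LLMs) are limited in complex reasoning and in handling diverse schemas for text-to-SQL generation. AGENTIQL aims to address these limitations through a modular, parallelizable design.
Contribution: Proposes the AGENTIQL framework, which combines a reasoning agent, a coding agent, and a refinement step, and uses an adaptive router to balance efficiency and accuracy, narrowing the gap to GPT-4-based systems.
Method: Uses a collaborative multi-expert framework consisting of question decomposition, sub-query generation, and column selection refinement. Some steps run in parallel, and a router selects between the modular pipeline and a baseline parser.
Result: On the Spider benchmark, AGENTIQL reaches up to 86.07% execution accuracy (EX) with 14B models, close to the GPT-4-based SOTA (89.65% EX), while improving transparency and interpretability.
Insight: A modular, parallelizable design improves the performance and scalability of text-to-SQL generation, and the routing mechanism is key to balancing efficiency and accuracy.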
Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance is contingent upon the efficacy of the routing mechanism, thereby narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.
[53] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
Zhengbo Zhang,Zhiheng Lyu,Junhao Gong,Hongzhu Yi,Xinming Wang,Yuxuan Zhou,Jiabing Yang,Ping Nie,Yan Huang,Wenhu Chen
Main category: cs.CL
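TL;DR: BrowserAgent is an interactive web agent modeled on human browsing behavior. It completes tasks by acting directly on web pages (scrolling, clicking, typing), and a two-stage training recipe (SFT and RFT) improves its generalization, giving strong results across several tasks.
Details
Motivation: Most existing web agents rely on tools that convert the dynamic web environment into static text, which differs from the rich browser interactions humans use, such as scrolling and clicking. BrowserAgent aims to improve interaction in dynamic web environments by imitating human browsing behavior.
Contribution: 1) Proposes BrowserAgent, a web agent closer to human browsing behavior; 2) Uses two-stage training (SFT and RFT) to improve generalization; 3) Introduces an explicit memory mechanism that strengthens reasoning on long-horizon tasks; 4) Outperforms existing methods on several tasks.
Method: 1) Executes human-inspired browser actions (such as scrolling and clicking) directly on web pages via Playwright; 2) Two-stage training (SFT + RFT); 3) An explicit memory mechanism that stores key information across steps.
Result: BrowserAgent-7B improves on Search-R1 by around 20% on multi-hop QA tasks such as HotpotQA and 2Wiki, while using less training data.
Insight: By simulating human browsing behavior, BrowserAgent interacts with the web more effectively, and the explicit memory mechanism is especially important for long-horizon reasoning tasks.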
Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model’s generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model’s reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.
[54] Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
Zhuowei Chen,Bowei Zhang,Nankai Lin,Tian Hou,Lianxi Wang
Main category: cs.CL
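TL;DR: The paper proposes ConsistentGuard, a reasoning-based multilingual LLM safeguard. It uses reasoning to improve explainability and alignment to improve cross-lingual knowledge transfer, and it performs well on multilingual benchmarks with only 1,000 training samples.
Details
Motivation: Existing classifier-based safeguards perform poorly on low-resource languages and lack interpretability, while the need for LLM safeguards keeps growing, so a more effective approach is required.
Contribution: 1) Proposes ConsistentGuard, a reasoning-based safeguard with cross-lingual alignment; 2) Demonstrates strong performance on three datasets across six languages; 3) Contributes a multilingual benchmark extension and releases the code.
Method: Improves explainability through reasoning and promotes knowledge transfer between languages through alignment, using only a small amount of training data (1,000 samples).
Result: Achieves strong performance on three datasets across six languages, outperforming larger models trained on far more data, and shows strong interpretability and generalization.
Insight: Combining small-sample training with reasoning and alignment can substantially improve LLM safeguards in low-resource language settings while also making the model more interpretable.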
Abstract: Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.
[55] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Zichun Yu,Chenyan Xiong
Main category: cs.CL
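TL;DR: RePro is a reinforcement-learning-based web data recycling method. It trains a small language model to produce high-quality, semantically faithful rephrasings of pretraining data, improving the efficiency and effectiveness of LLM pretraining data.
Details
Motivation: High-quality pretraining data for frontier large language models (LLMs) is increasingly scarce. RePro aims to address this through data recycling that makes better use of existing data.
Contribution: 1. Designs RePro, a reinforcement-learning-based web data recycling method; 2. Proposes one quality reward and three faithfulness rewards that optimize the model to produce high-quality, semantically faithful rephrasings; 3. Validates RePro's gains experimentally on multiple tasks.
Method: 1. Trains a small 4B-parameter language model as the rephraser; 2. Optimizes it with reinforcement learning using the quality and faithfulness rewards; 3. Recycles 72B tokens sampled from DCLM-RefinedWeb.
Result: 1. RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks; 2. It improves organic data efficiency by 2-3x; 3. Its rephrasings are more semantically faithful than those of the state-of-the-art prompting-based method.
Insight: 1. A small model trained with reinforcement learning can recycle data efficiently; 2. The faithfulness rewards are essential for preserving core semantics and structure; 3. Data rephrasing is a promising way to ease LLM data scarcity.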
Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.
[56] Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
Zhichao Wang,Cheng Wan,Dong Nie
Main category: cs.CL
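TL;DR: This paper reviews recent progress in inference-time scaling strategies for large language models (LLMs), organizing the techniques into output-focused and input-focused methods.
Details
Motivation: As high-quality training data becomes scarce, the traditional approach of scaling model size and training data is hitting a bottleneck. Research has shifted toward scaling computation at inference time to improve performance without retraining the model.
Contribution: Systematically surveys and categorizes inference-time scaling into two classes of methods (output-focused and input-focused), with an especially detailed analysis of RAG, giving researchers a clear framework.
Method: Output-focused methods include multi-step generation strategies (e.g., CoT, ToT), search methods (e.g., MCTS), and model ensembles; input-focused methods are mainly few-shot learning and RAG.
Result: The review consolidates a large body of techniques, illustrates the potential of inference-time scaling, and offers a taxonomy to guide future research.
Insight: Inference-time scaling is a new direction for improving LLM performance. Input-focused methods such as RAG show strong practical potential, while the complexity of output-focused methods still needs further optimization.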
Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.
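Among the output-focused techniques the abstract lists, beam search is a compact example of spending extra inference-time computation to find a better output. The sketch below is a generic textbook version, not code from the review. The scoring function `toy_next_log_probs` and its probability table `TOY_MODEL` are made-up stand-ins for a real language model.

```python
import math

# Hypothetical next-token distributions keyed by the last token of the prefix.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(seq):
    """Return log-probabilities of next tokens; unknown prefixes end the sequence."""
    dist = TOY_MODEL.get(seq[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}


def beam_search(next_log_probs, start, beam_width, max_len, eos):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                candidates.append((score, seq))  # finished beams carry over
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams


greedy = beam_search(toy_next_log_probs, "<s>", 1, 5, "</s>")[0]
wider = beam_search(toy_next_log_probs, "<s>", 2, 5, "</s>")[0]
print(greedy[1], math.exp(greedy[0]))
print(wider[1], math.exp(wider[0]))
```

With a beam width of 1 the search is greedy and commits to the locally best first token; widening the beam to 2 keeps the alternative alive and recovers a higher-probability sequence, at the cost of more model calls per step.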
[57] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
Kaixuan Ren,Preslav Nakov,Usman Naseem
Main category: cs.CL
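TL;DR: DUAL-Bench is a multimodal benchmark that measures over-refusal and safe completion in vision-language models (VLMs). Current models perform poorly on both, indicating that finer-grained alignment strategies are needed to balance safety and usefulness.
Details
Motivation: As vision-language models become more capable, balancing safety and usefulness is a central challenge. Existing safety mechanisms can cause over-refusal (declining benign requests), and models still struggle in multimodal settings, particularly when the instruction is harmless but the image is harmful.
Contribution: DUAL-Bench is the first benchmark focused on over-refusal and safe completion in multimodal settings. It systematically evaluates 18 VLMs across 12 hazard categories and exposes the limitations of current models.
Method: DUAL-Bench builds a multimodal task set with semantics-preserving visual perturbations to test robustness, and evaluates models by analyzing their refusal behavior and their rate of safe completion.
Result: Current models perform poorly on safe completion: GPT-5-Nano reaches only 12.9%, GPT-5 models average 7.9%, and Qwen models only 3.9%, showing that models still need improvement in multimodal settings.
Insight: DUAL-Bench highlights the complexity of multimodal alignment: when handling instructions that combine vision and language, models need finer-grained strategies to be both safe and useful.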
Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.
[58] Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Jiajing Guo,Kenil Patel,Jorge Piazentin Ono,Wenbin He,Liu Ren
Main category: cs.CL
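TL;DR: This paper evaluates six lightweight, industry-oriented test-time scaling strategies and four large language models (LLMs, including two reasoning models) on the BIRD Mini-Dev benchmark. Divide-and-Conquer prompting and few-shot demonstrations consistently improve performance, while additional workflow steps give mixed results.
Details
Motivation: LLMs are increasingly used in Text-to-SQL (Text2SQL) systems, but the effectiveness of test-time scaling strategies in real deployments remains unclear, especially with the latest reasoning models.
Contribution: 1. Benchmarks six test-time scaling strategies and four LLMs on BIRD Mini-Dev; 2. Shows the gains from Divide-and-Conquer prompting and few-shot demonstrations; 3. Exposes the trade-offs among accuracy, efficiency, and complexity in deployed systems.
Method: Evaluates six lightweight test-time scaling strategies and four LLMs (including two reasoning models) on the BIRD Mini-Dev benchmark, reporting inference latency and token consumption in addition to accuracy.
Result: Divide-and-Conquer prompting and few-shot demonstrations improve performance for both general-purpose and reasoning-focused LLMs, but additional workflow steps have inconsistent effects, and the choice of base model strongly affects the results.
Insight: In real deployments, balancing accuracy, efficiency, and complexity is key; Divide-and-Conquer prompting and few-shot demonstrations are effective ways to improve Text2SQL systems.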
Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
[59] LLM×MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System
Yu Chao,Siyu Lin,xiaorong wang,Zhu Zhang,Zihan Zhou,Haoyu Wang,Shuo Wang,Jie Zhou,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
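TL;DR: LLM×MapReduce-V3 is a hierarchically modular agent system for long-form survey generation. It uses a multi-agent architecture in which functional components are implemented as independent MCP servers, a high-level planner agent dynamically orchestrates the workflow, and users can intervene in the loop. The resulting surveys surpass baselines in content depth and length.
Details
Motivation: Existing systems struggle to balance content depth and interactive flexibility in long-form survey generation. LLM×MapReduce-V3 uses a modular, hierarchical design to improve customization and multi-turn interaction.
Contribution: 1. Introduces a hierarchically modular agent architecture whose functional components are independent MCP servers. 2. Proposes a dynamic planner agent that selects modules based on MCP tool descriptions and execution history. 3. Supports human-in-the-loop intervention for greater customization.
Method: Implements skeleton initialization, digest construction, and skeleton refinement as MCP servers; a high-level planner agent dynamically coordinates module execution and supports multi-turn interaction.
Result: Human evaluations show that the system surpasses baselines in content depth and length, supporting the effectiveness of MCP-based modular planning.
Insight: Modular, hierarchical design improves flexibility and interactivity, and MCP-driven dynamic planning is key to supporting complex tasks.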
Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.
[60] ADVICE: Answer-Dependent Verbalized Confidence Estimation
Ki Jung Seo,Sehun Lim,Taeuk Kim
Main category: cs.CL
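TL;DR: The paper proposes ADVICE, a fine-tuning framework that improves LLM confidence calibration by addressing the overconfidence caused by answer-independence.
Details
Motivation: LLMs are often overconfident when expressing confidence in natural language, and the cause has been poorly understood. The paper finds that answer-independence, the model's failure to condition its confidence on its own answer, is a key factor.
Contribution: Proposes the ADVICE framework, which uses fine-tuning to achieve answer-grounded confidence estimation, substantially improving calibration while preserving task performance.
Method: ADVICE fine-tunes the LLM so that its verbalized confidence depends on the answer it generated, strengthening answer-groundedness and improving calibration.
Result: Experiments show that ADVICE substantially improves confidence calibration without hurting task performance. Further analysis shows that it yields more balanced and better-calibrated confidence distributions.
Insight: Answer-independence is one root cause of LLM overconfidence, and answer-grounded fine-tuning offers a framework for more trustworthy confidence verbalization.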
Abstract: Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model’s failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.
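The abstract reports improved confidence calibration without naming the metric used. As a hedged illustration of what "calibration" measures, the sketch below computes Expected Calibration Error (ECE), a widely used calibration metric; whether the paper reports ECE specifically is an assumption, and the function name and equal-width binning are choices made here for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| over bins.

    confidences: stated confidences in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one correctness label per confidence")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# An overconfident model: always says 0.9 but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

A model that states the same high confidence regardless of its answer, the answer-independence the abstract describes, tends to produce exactly this kind of gap between stated confidence and accuracy.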
[61] Evaluating Language Models’ Evaluations of Games
Katherine M. Collins,Cedegao E. Zhang,Graham Todd,Lance Ying,Mauricio Barba da Costa,Ryan Liu,Prafull Sharma,Adrian Weller,Ionatan Kuperwajs,Lionel Wong,Joshua B. Tenenbaum,Thomas L. Griffiths
Main category: cs.CL
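TL;DR: This paper proposes a new paradigm that assesses how AI systems evaluate games, comparing evaluations from modern language and reasoning models against those of people and symbolic computational agents. Reasoning models are closer to human evaluations, but their fit to human data weakens as they approach the game-theoretic optimum.
Details
Motivation: Traditional AI evaluation focuses mainly on problem-solving ability. This paper instead studies how well AI systems evaluate games themselves, filling a gap in evaluation methodology.
Contribution: 1. Introduces a formalism for evaluating AI systems' evaluations of games; 2. Uses a large-scale dataset to compare human and model evaluations of games; 3. Finds that reasoning models are closer to human evaluations but that the relationship is non-monotonic.
Method: 1. Designs a formal framework for evaluating game evaluations; 2. Uses a dataset of over 100 novel board games and over 450 human judgments; 3. Compares language models, reasoning models, and people on evaluations of payoff (fairness) and funness.
Result: Reasoning models align with human evaluations better than non-reasoning language models, but the fit weakens as models approach the game-theoretic optimum; funness evaluations are more unstable and harder to quantify.
Insight: 1. The non-monotonic behavior of game evaluation ability deserves further study; 2. Funness is harder to quantify and calls for better model design; 3. Resource usage is highly variable, so models need more resource-rational meta-reasoning.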
Abstract: Reasoning is not just about solving problems; it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems’ evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more “jaggedness” across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.
[62] Judge Before Answer: Can MLLM Discern the False Premise in Question?
Jidong Li,Lingyong Fang,Haodong Zhao,Sufeng Duan,Gongshen Liu
Main category: cs.CL
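TL;DR: Multimodal large language models (MLLMs) are weak at recognizing false premises in questions. The paper proposes an automated method for building a comprehensive benchmark (the JBA dataset) and a framework that strengthens this ability. Experiments show that the framework markedly improves false premise recognition.
Details
Motivation: Although MLLMs perform well on multimodal tasks, they have a clear weakness in recognizing false premises in questions. Existing benchmarks have limited coverage and lack fine-grained categorization, so they cannot fully assess this ability. A more comprehensive, automated evaluation method is needed to fill the gap.
Contribution: 1. Proposes an automated method for building a benchmark (the JBA dataset) covering 3 main types and 13 subtypes of false premises;
2. Designs a framework that strengthens MLLMs' ability to recognize false premises;
3. Validates the framework experimentally, improving model robustness.
Method: 1. Builds the JBA dataset through an automated pipeline that systematically categorizes false premises;
2. Proposes a recognition enhancement framework that is integrated into MLLM training;
3. Runs experiments on the JBA dataset to validate the framework.
Result: Experiments show that current MLLMs still struggle with false premise recognition. Models trained with the proposed framework improve significantly on this task.
Insight: False premise recognition is a key indicator of MLLM robustness. Future work can broaden the diversity of the benchmark and explore more efficient recognition methods.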
Abstract: Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to false premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA dataset. Results show current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs to detect false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.
[63] RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection
Yejin Lee,Hyeseon Ahn,Yo-Sub Han
Main category: cs.CL
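TL;DR: RV-HATE is a multi-module voting framework that uses reinforcement learning to optimize module weights, tailoring detection to the characteristics of each hate speech dataset to improve accuracy and provide interpretability.
Details
Motivation: Hate speech takes diverse and often implicit forms, and existing methods do not adapt well to the differences among datasets. RV-HATE addresses this by adapting to dataset-specific characteristics.
Contribution: 1. Proposes the RV-HATE framework, which combines a multi-module design with reinforcement learning optimization; 2. Tailors detection to each dataset, improving accuracy; 3. Provides interpretable analysis of module outputs.
Method: RV-HATE contains multiple specialized modules, each capturing different features of hate speech. Reinforcement learning optimizes the module weights, and a voting mechanism aggregates the outputs.
Result: RV-HATE outperforms conventional static methods on implicit hate speech detection and offers insight into the characteristics of each dataset.
Insight: Adapting to dataset-specific characteristics improves hate speech detection, and interpretability helps explain what makes each dataset distinctive.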
Abstract: Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Advances in the internet and online anonymity accelerate its spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1) it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2) it provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at https://github.com/leeyejin1231/RV-HATE.
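The abstract describes per-module weights and a voting mechanism that aggregates module outputs, without giving the exact aggregation rule. The sketch below shows one simple form such a step could take, a weighted soft vote over per-module probabilities. The function `weighted_vote`, the 0.5 threshold, and the example numbers are assumptions for illustration, and the reinforcement learning that would tune the weights is not shown.

```python
def weighted_vote(module_probs, weights, threshold=0.5):
    """Aggregate per-module hate-speech probabilities with a weighted average.

    module_probs: one probability per module (hypothetical module outputs).
    weights: one non-negative weight per module (assumed learned elsewhere).
    Returns (is_hate, score).
    """
    if len(module_probs) != len(weights):
        raise ValueError("one weight per module is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for w, p in zip(weights, module_probs)) / total
    return score >= threshold, score


# The same module outputs lead to different decisions under different weights,
# which is how dataset-specific weights can tailor the final prediction.
outputs = [0.9, 0.2, 0.4]
print(weighted_vote(outputs, [3, 1, 1]))
print(weighted_vote(outputs, [1, 3, 3]))
```

Inspecting the learned weights is also one plausible route to the interpretability the abstract mentions, since a large weight signals that a module's features matter for that dataset.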
[64] Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
Zhiwen Ruan,Yixia Li,He Zhu,Yun Chen,Peng Li,Yang Liu,Guanhua Chen
Main category: cs.CL
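TL;DR: This paper proposes Critical Token Fine-tuning (CFT), which strengthens LLM reasoning by updating only critical reasoning tokens rather than all tokens. Experiments show that it outperforms standard supervised fine-tuning on multiple mathematical reasoning tasks.
Details
Motivation: Standard supervised fine-tuning (SFT) penalizes all tokens uniformly, which reduces output diversity and limits generalization. The paper observes that only a small subset of critical tokens determines reasoning correctness and therefore proposes a more efficient fine-tuning method.
Contribution: 1. Proposes CFT, which updates only the critical tokens identified via counterfactual perturbations; 2. Validates it across multiple models and mathematical reasoning benchmarks; 3. Shows that CFT improves diversity and provides a better initialization for reinforcement learning.
Method: CFT optimizes the model by selectively updating critical tokens (those that are functionally indispensable) while preserving the diversity of non-critical tokens. It identifies critical tokens via counterfactual perturbations and concentrates the gradient signal on them.
Result: Across 5 models (Qwen, OLMo, LLaMA) and 11 mathematical reasoning benchmarks, CFT outperforms standard SFT despite fine-tuning on less than 12% of tokens. It also improves sampling diversity and reinforcement learning initialization.
Insight: 1. Critical tokens are decisive for reasoning; 2. Selective updating balances performance and diversity; 3. CFT is an efficient, general framework for LLM fine-tuning.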
Abstract: Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.
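To make the contrast between uniform SFT and updating only critical tokens concrete, here is a minimal sketch of a masked token-level loss. It assumes the critical-token mask is already available; how the paper derives it via counterfactual perturbations is not reproduced here. The function name `mean_nll`, the averaging over selected tokens, and the toy numbers are illustrative assumptions.

```python
import math


def mean_nll(token_log_probs, mask=None):
    """Mean negative log-likelihood over the tokens selected by mask.

    token_log_probs: log-probability the model assigns to each target token.
    mask: booleans marking tokens that contribute to the loss; None means
          every token contributes, as in standard SFT.
    """
    if mask is None:
        mask = [True] * len(token_log_probs)
    if len(mask) != len(token_log_probs):
        raise ValueError("mask and token_log_probs must have the same length")
    kept = [-lp for lp, keep in zip(token_log_probs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Toy sequence of three target tokens; the first and last are marked critical
# (a hypothetical mask, not one produced by the paper's procedure).
log_probs = [math.log(0.5), math.log(0.9), math.log(0.25)]
critical = [True, False, True]

print(mean_nll(log_probs))            # uniform loss over all tokens
print(mean_nll(log_probs, critical))  # loss restricted to critical tokens
```

Because masked-out tokens contribute nothing to the loss, they receive no gradient, which matches the abstract's point that non-critical tokens keep their diversity while the training signal concentrates on decisive steps.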
[65] DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
Wei-Chieh Huang,Henry Peng Zou,Yaozu Wu,Dongyuan Li,Yankai Chen,Weizhi Zhang,Yangning Li,Angelo Zangari,Jizhou Guo,Chunyu Miao,Liancheng Fang,Langzhou He,Renhe Jiang,Philip S. Yu
Main category: cs.CL
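TL;DR: DeepResearchGuard is a deep research framework with four-stage safeguards and open-domain evaluation. It improves the safety, credibility, and quality of reports while reducing the over-refusal rate.
Details
Motivation: Existing deep research frameworks lack comprehensive evaluation and stage-specific protections, so harmful content can end up in the final report. A solution is needed that ensures both report quality and safety.
Contribution: Proposes the DeepResearchGuard framework, with four-stage safeguards (input, plan, research, and output) and open-domain evaluation, and introduces a new safety benchmark, DRSAFEBENCH.
Method: Uses multi-stage safeguards (input filtering, plan guidance, research checks, output review) together with open-domain evaluation, measured with several metrics such as defense success rate and over-refusal rate.
Result: Tested on multiple LLMs (including GPT-4o and Gemini-2.5-flash), the framework improves the average defense success rate by 18.16% and reduces the over-refusal rate by 6%, while improving report quality.
Insight: Stage-wise safeguards filter risks early, and systematic evaluation and citation discipline help keep reports both comprehensive and safe.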
Abstract: Deep research frameworks have shown promising capabilities in synthesizing comprehensive reports from web sources. While deep research possesses significant potential to address complex issues through planning and research cycles, existing frameworks are deficient in sufficient evaluation procedures and stage-specific protections. They typically treat evaluation as exact match accuracy of question-answering, but overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety. This oversight may result in hazardous or malicious sources being integrated into the final report. To address these issues, we introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports. We assess performance across multiple metrics, e.g., defense success rate and over-refusal rate, and five key report dimensions. In the absence of a suitable safety benchmark, we introduce DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success rate improvement of 18.16% while reducing over-refusal rate by 6%. The input guard provides the most substantial early-stage protection by filtering out obvious risks, while the plan and research guards enhance citation discipline and source credibility. Through extensive experiments, we show that DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation, while systematically improving report quality without excessive over-refusal rates. The code can be found via https://github.com/Jasonya/DeepResearchGuard.
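The abstract reports defense success rate and over-refusal rate but does not define them. The sketch below uses the common reading of these terms: the share of harmful inputs that the guard blocks, and the share of benign inputs that it wrongly blocks. Both definitions, the record layout, and the example data are assumptions for illustration and may differ from the paper's exact formulation.

```python
def guard_metrics(records):
    """Compute defense success rate and over-refusal rate from labeled records.

    Each record is a dict with two boolean fields (hypothetical layout):
      "harmful": ground-truth label for the input
      "blocked": whether the guard refused or filtered it
    """
    harmful = [r for r in records if r["harmful"]]
    benign = [r for r in records if not r["harmful"]]
    dsr = sum(1 for r in harmful if r["blocked"]) / len(harmful) if harmful else 0.0
    orr = sum(1 for r in benign if r["blocked"]) / len(benign) if benign else 0.0
    return {"defense_success_rate": dsr, "over_refusal_rate": orr}


# Made-up example: 4 harmful inputs (3 blocked) and 4 benign inputs (1 blocked).
EXAMPLE_RECORDS = (
    [{"harmful": True, "blocked": True}] * 3
    + [{"harmful": True, "blocked": False}]
    + [{"harmful": False, "blocked": True}]
    + [{"harmful": False, "blocked": False}] * 3
)
print(guard_metrics(EXAMPLE_RECORDS))
```

Reporting the two numbers together matters because a guard can trivially raise its defense success rate by blocking everything, which would show up as a high over-refusal rate.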
[66] ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
Mahika Phutane,Hayoung Jung,Matthew Kim,Tanushree Mitra,Aditya Vashistha
Main category: cs.CL
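TL;DR: The paper analyzes LLM bias against people with disabilities in hiring scenarios, introduces the ABLEIST metrics, shows how intersecting forms of marginalization (such as gender and caste) amplify harms to disabled candidates, and identifies blind spots in current safety tools.
Details
Motivation: Existing research is largely Western-centric and overlooks how people with disabilities in the Global South face more complex bias due to intersecting forms of marginalization such as gender and caste. The paper aims to fill this gap.
Contribution: 1. Introduces the ABLEIST metrics, designed to measure harms toward people with disabilities in LLM-generated hiring scenarios; 2. Shows how intersecting marginalization amplifies harms to people with disabilities.
Method: Audits six LLMs across 2,820 hiring scenarios and quantifies harms with the ABLEIST metrics (Ableism, Inspiration, Superhumanization, Tokenism, and others).
Result: ABLEIST harms increase significantly for disabled candidates, and they are more severe for disabled candidates who are also marginalized by gender or caste; existing models struggle to detect these harms.
Insight: Current safety tools fail to identify the complex biases that arise from intersecting marginalization, underscoring the need for intersectional safety evaluations of frontier models in high-stakes domains such as hiring.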
Abstract: Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization–such as gender and caste–shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates–harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.
[67] LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models
Yiwei Liu,Yucheng Li,Xiao Li,Gong Cheng
Main category: cs.CL
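TL;DR: The paper introduces LogiNumSynth, a flexible synthesizer of joint logical-numerical reasoning problems. It generates natural language tasks that require both logical and numerical reasoning and supports fine-grained control over task complexity.
Details
Motivation: Existing joint logical-numerical reasoning datasets rely on fixed rule sets and offer limited control over task complexity, which constrains their usefulness for evaluation and training. A more flexible synthesis tool is needed.
Contribution: (1) Designs the LogiNumSynth synthesizer, which generates fully controllable joint reasoning tasks; (2) Provides evaluation and process analysis covering both process accuracy and answer accuracy; (3) Uses synthesized data to improve LLM reasoning performance.
Method: LogiNumSynth generates tasks of varying difficulty by controlling reasoning world richness, logical reasoning depth, and the complexity of numerical computations.
Result: Experiments show that current LLMs still have persistent weaknesses in joint logical-numerical reasoning, and that LogiNumSynth can serve as both a diagnostic tool and a source of targeted training data.
Insight: A flexible synthesizer can both expose model weaknesses and supply data for targeted training, advancing integrated reasoning skills.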
Abstract: Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training. We present LogiNumSynth, a flexible natural language problem synthesizer that synthesizes tasks requiring proficiency in joint logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over reasoning world richness, logical reasoning depth, and the complexity of numerical computations, enabling flexible data synthesis across difficulty levels. We demonstrate three key contributions: (1) Synthesizer – synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis – evaluating both process accuracy and answer accuracy; (3) Targeted Training – using synthesized data to enhance LLMs’ reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.
[68] Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Qinglin Zhu,Yizhen Yao,Runcong Zhao,Yanzheng Xiang,Amrutha Saseendran,Chen Jin,Philip Alexander Teare,Bin Liang,Yulan He,Lin Gui
Main category: cs.CL
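TL;DR: The paper proposes Latent Refinement Decoding (LRD), a two-stage decoding framework that uses latent refinement and a predictive feedback loop to address information loss and premature commitment in diffusion-based language models, improving both accuracy and generation speed.
Details
Motivation: Autoregressive models suffer from high latency because decoding is strictly sequential, while diffusion-style parallel generation suffers from information loss and premature commitment, which hurt global consistency.
Contribution: Proposes the LRD framework, which combines latent refinement with a predictive feedback loop, retains distributional information for non-finalized tokens, and uses KL-divergence dynamics to control convergence, improving the quality and speed of parallel generation.
Method: LRD has two stages. The first maintains masked positions as distributional mixtures; the second progressively finalizes high-confidence tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a reliable criterion for convergence.
Result: Experiments show accuracy gains on coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) tasks, with speedups of up to 10.6x.
Insight: Through distributional mixtures and iterative feedback, LRD achieves more globally consistent predictions and avoids the premature commitment of diffusion models, offering an efficient and versatile approach to parallel sequence generation.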
Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.
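To make the two ingredients named in the abstract concrete, here is a minimal sketch of a masked position held as a mixture of predicted-token embeddings and the mask embedding, together with a KL-based convergence check. It is not the authors' implementation: the mixing weight `alpha`, the tolerance `tol`, and all function names are assumptions introduced for illustration, and plain Python lists stand in for tensors.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mix_embedding(probs, token_embs, mask_emb, alpha):
    """Soft embedding for one masked position (hypothetical mixing rule):
    alpha * (expected token embedding under probs) + (1 - alpha) * mask embedding."""
    dim = len(mask_emb)
    expected = [sum(p * emb[d] for p, emb in zip(probs, token_embs)) for d in range(dim)]
    return [alpha * e + (1 - alpha) * m for e, m in zip(expected, mask_emb)]

def has_converged(prev_beliefs, curr_beliefs, tol=1e-3):
    """Stop refining once the mean KL between successive belief states is below tol.
    The threshold value is an assumption, not a number from the paper."""
    mean_kl = sum(
        kl_divergence(curr, prev) for prev, curr in zip(prev_beliefs, curr_beliefs)
    ) / len(curr_beliefs)
    return mean_kl < tol
```

In a real decoder the belief states would come from the model's predictive distribution at each refinement step; the sketch only shows how those distributions could be kept rather than discarded between steps.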
[69] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Junjie Lu,Yuliang Liu,Chaofeng Qu,Wei Shen,Zhouhan Lin,Min Xu
Main category: cs.CL
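TL;DR: The paper proposes CGPO, a method that uses a confidence signal to locate the points of highest uncertainty in a model's reasoning and applies self-generated, non-human-like reasoning-path guidance there, avoiding reliance on annotations from humans or higher-capacity models.
Details
Motivation: Current methods for strengthening LLM reasoning tend to bias training toward human-like reasoning trajectories, which limits exploration of non-human-like paths and constrains achievable performance. A small-scale pilot study also found that in about 75% of cases the model's first erroneous step occurs after its lowest-confidence point, suggesting that guidance at the lowest-confidence point may be more effective than guidance at the first explicit error.
Contribution: Proposes CGPO, which uses a confidence signal to guide reasoning-path preference optimization without annotations from humans or higher-capacity models.
Method: CGPO uses a confidence signal to identify points of maximal uncertainty in the reasoning process and optimizes with self-generated, non-human-like reasoning paths. Experiments span several models on code and mathematical reasoning tasks.
Result: With the same amount of training data, CGPO trained on data generated by a small model outperforms, in most cases, approaches that use data generated by a strong model or annotated by humans.
Insight: The lowest-confidence point appears to be the key moment for intervention, and non-human-like reasoning paths may leave more room for improvement than human-like ones.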
Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model’s first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model’s reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches that use data generated by a strong model or human-annotated data.
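The idea of intervening at the lowest-confidence point can be sketched as follows. The abstract does not say how confidence is computed, so the definition used here (mean token probability per reasoning step) and the function names are assumptions made purely for illustration.

```python
def step_confidence(token_probs):
    """Confidence of one reasoning step, taken here as the mean token probability.
    This definition is an assumption; the abstract does not specify one."""
    return sum(token_probs) / len(token_probs)

def lowest_confidence_step(steps):
    """Index of the reasoning step with the lowest confidence.
    `steps` is a list of per-step token-probability lists."""
    confidences = [step_confidence(s) for s in steps]
    return min(range(len(confidences)), key=confidences.__getitem__)
```

Under this reading, alternative continuations would be sampled from the returned step onward to build preference pairs, rather than from the first step that is visibly wrong.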
[70] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Johannes Moll,Markus Graf,Tristan Lemke,Nicolas Lenhart,Daniel Truhn,Jean-Benoit Delbrouck,Jiazhen Pan,Daniel Rueckert,Lisa C. Adams,Keno K. Bressem
Main category: cs.CL
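TL;DR: The paper presents a clinically grounded framework for evaluating the reasoning faithfulness of medical vision-language models (VLMs), using multimodal perturbations to probe three axes: clinical fidelity, causal attribution, and confidence calibration. Answer accuracy and explanation quality turn out to be decoupled, and open-source models trail proprietary ones on attribution and, often, on fidelity.
Details
Motivation: In high-stakes clinical use, the chain-of-thought (CoT) explanations produced by VLMs can sound plausible without reflecting the actual decision process, which undermines trust. Existing evaluations rarely catch this misalignment.
Contribution: Proposes a clinically grounded evaluation framework that probes CoT faithfulness through multimodal perturbations (controlled text and image modifications), filling a gap left by existing evaluations.
Method: On chest X-ray visual question answering (VQA), the framework applies controlled text and image modifications and scores CoT explanations on clinical fidelity, causal attribution, and confidence calibration.
Result: Answer accuracy and explanation quality are decoupled. Proprietary models score higher than open-source models on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%).
Insight: Text cues shift explanations more than visual cues. Deploying VLMs requires evaluation beyond final answer accuracy, with attention to how faithful and coherent the explanations are.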
Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall’s $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
[71] Discursive Circuits: How Do Language Models Understand Discourse Relations?
Yisong Miao,Min-Yen Kan
Main category: cs.CL
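TL;DR: The paper studies the sparse computational graphs in Transformer language models that are responsible for understanding discourse relations (termed discursive circuits), and uses the CuDR task to validate their effectiveness and generalization.
Details
Motivation: Understanding discourse relations is an important NLP task, but it is unclear which components of current language models are responsible for it. The paper aims to find and control these components in order to better understand and improve models' discourse abilities.
Contribution: Introduces the Completion under Discourse Relation (CuDR) task and an accompanying corpus for activation patching and circuit discovery; shows that sparse circuits (about 0.2% of a full GPT-2 model) recover discourse understanding and generalize to other discourse frameworks.
Method: Designs the CuDR task, builds a corpus of minimal contrastive pairs, discovers sparse circuits via activation patching, and analyzes which discourse-relevant features different layers encode.
Result: Sparse circuits perform well on the PDTB-based CuDR task and generalize to unseen discourse frameworks such as RST and SDRT. Lower layers capture lexical semantics and coreference, while upper layers encode discourse-level abstractions.
Insight: Discourse understanding in language models is governed by sparse circuits with a clear division of labor across layers. The utility of features such as coreference for particular discourse relations is consistent across frameworks, which offers a new perspective for model design.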
Abstract: Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).
[72] Domain-Specific Data Generation Framework for RAG Adaptation
Chris Xing Tian,Weihao Xie,Zhen Chen,Zhengyuan Yi,Hui Liu,Haoliang Li,Shiqi Wang,Siwei Ma
Main category: cs.CL
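TL;DR: The paper proposes RAGen, a domain-specific data generation framework for adapting Retrieval-Augmented Generation (RAG) systems to a domain. It generates domain-grounded question-answer-context (QAC) triples that can be used to optimize key RAG components such as the LLM, the retriever, and the embedding model.
Details
Motivation: RAG systems need domain-specific training data to produce domain-grounded responses, but existing data is mostly general-purpose question answering and lacks domain specificity. A scalable framework for generating such data efficiently is needed.
Contribution: Proposes the RAGen framework for generating domain-grounded QAC triples; introduces semantic chunking, hierarchical concept extraction, and multi-chunk retrieval; supports large and evolving document corpora without redundant processing.
Method: Generates diverse questions guided by principles inspired by Bloom's Taxonomy; pairs them with precise answers through semantic chunking and concept extraction; adds curated distractor contexts to promote robust reasoning; uses a modular design to handle evolving documents efficiently.
Result: RAGen efficiently generates domain-specific QAC triples as training data for RAG adaptation and is especially suited to dynamically evolving domains such as scientific research and enterprise knowledge bases.
Insight: Generating domain-specific QAC data is central to adapting RAG systems, and modularity and scalability are effective strategies for handling large, evolving document corpora.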
Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom’s Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.
[73] Fairness Metric Design Exploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models
Battemuulen Naranbat,Seyed Sahand Mohammadi Ziabari,Yousuf Nasser Al Husaini,Ali Mohammed Mansoor Alsahag
Main category: cs.CL
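TL;DR: The paper explores fairness metric design for multi-domain moral sentiment classification with Transformer-based models. It proposes a new metric, Moral Fairness Consistency (MFC), and validates it experimentally as a measure of cross-domain stability.
Details
Motivation: Under cross-domain transfer, aggregate scores can mask fairness violations on individual labels, so a metric that exposes these disparities is needed.
Contribution: Proposes the Moral Fairness Consistency (MFC) metric, which quantifies the cross-domain stability of moral foundation detection, and validates it empirically.
Method: Evaluates BERT and DistilBERT in a multi-label setting with in-domain and cross-domain protocols, analyzes label-level fairness violations, and introduces the MFC metric.
Result: MFC shows a perfect negative correlation with Demographic Parity Difference (rho = -1.000), which the authors take as evidence of its validity for fairness evaluation.
Insight: As a complementary, diagnosis-oriented metric, MFC can support more reliable fairness evaluation of moral reasoning models, particularly for cross-domain deployment.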
Abstract: Ensuring fairness in natural language processing for moral sentiment classification is challenging, particularly under cross-domain shifts where transformer models are increasingly deployed. Using the Moral Foundations Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC), this work evaluates BERT and DistilBERT in a multi-label setting with in-domain and cross-domain protocols. Aggregate performance can mask disparities: we observe pronounced asymmetry in transfer, with Twitter->Reddit degrading micro-F1 by 14.9% versus only 1.5% for Reddit->Twitter. Per-label analysis reveals fairness violations hidden by overall scores; notably, the authority label exhibits Demographic Parity Differences of 0.22-0.23 and Equalized Odds Differences of 0.40-0.41. To address this gap, we introduce the Moral Fairness Consistency (MFC) metric, which quantifies the cross-domain stability of moral foundation detection. MFC shows strong empirical validity, achieving a perfect negative correlation with Demographic Parity Difference (rho = -1.000, p < 0.001) while remaining independent of standard performance metrics. Across labels, loyalty demonstrates the highest consistency (MFC = 0.96) and authority the lowest (MFC = 0.78). These findings establish MFC as a complementary, diagnosis-oriented metric for fairness-aware evaluation of moral reasoning models, enabling more reliable deployment across heterogeneous linguistic contexts.
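The two established fairness measures cited in the abstract can be computed for a single binary label as sketched below, using their common definitions: the gap in positive-prediction rate across groups, and the larger of the true-positive-rate and false-positive-rate gaps. How the paper defines its groups is not stated in the abstract, so the `groups` argument is an assumption, and MFC itself is not implemented because its formula is not given here.

```python
from collections import defaultdict

def _rate(flags):
    """Fraction of 1s in a list of 0/1 predictions (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0

def demographic_parity_difference(y_pred, groups):
    """Max minus min positive-prediction rate across groups."""
    by_group = defaultdict(list)
    for pred, g in zip(y_pred, groups):
        by_group[g].append(pred)
    rates = [_rate(v) for v in by_group.values()]
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap across groups."""
    pos, neg = defaultdict(list), defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        (pos if t == 1 else neg)[g].append(p)
    all_groups = set(groups)
    tprs = [_rate(pos[g]) for g in all_groups]  # predictions on true positives
    fprs = [_rate(neg[g]) for g in all_groups]  # predictions on true negatives
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

In a multi-label setting such as the one described, these functions would be applied once per moral-foundation label, which is what makes label-level disparities visible when aggregate scores hide them.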
[74] A Theorem-Proving-Based Evaluation of Neural Semantic Parsing
Hayate Funakura,Hyunsoo Kim,Koji Mineshima
Main category: cs.CL
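TL;DR: The paper reassesses neural semantic parsers by pairing graph matching with automated theorem proving. It finds that existing approaches fall short on logical equivalence and argues for normalized target representations.
Details
Motivation: Neural semantic parsers are mostly evaluated with graph-matching metrics such as Smatch, which capture surface overlap rather than logical equivalence. The paper uses automated theorem proving to assess logical equivalence more accurately.
Contribution: Proposes an evaluation that combines graph matching with theorem proving, exposes the shortfall of existing parsers on logical equivalence, and shows that normalizing target representations improves performance.
Method: Compares two ways of building parsers, supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), and evaluates outputs by graph matching, bidirectional entailment with a first-order logic theorem prover, and well-formedness.
Result: Models that score well on graph matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability and improves well-formedness and logical adequacy. Performance degrades with formula complexity and with certain constructions, including coordination, prepositional phrases, and passive voice.
Insight: Graph-based metrics have clear limits for reasoning-oriented applications. The findings motivate logic-sensitive evaluation and training objectives, together with simplified, normalized target representations.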
Abstract: Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.
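The gap between surface overlap and logical equivalence can be illustrated with a bidirectional entailment check. The paper uses a first-order logic theorem prover; the sketch below deliberately swaps in a propositional truth-table check so that it stays self-contained, and every name in it is hypothetical.

```python
from itertools import product

def entails(premise, conclusion, variables):
    """True if every assignment satisfying `premise` also satisfies `conclusion`.
    Propositional stand-in for a first-order prover call."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if premise(env) and not conclusion(env):
            return False
    return True

def logically_equivalent(f, g, variables):
    """Bidirectional entailment: f entails g and g entails f."""
    return entails(f, g, variables) and entails(g, f, variables)

# Gold formula: p -> q
gold = lambda e: (not e["p"]) or e["q"]
# Different surface form, same meaning: q or not p
reordered = lambda e: e["q"] or (not e["p"])
# Same symbols as gold but reversed direction: q -> p
reversed_implication = lambda e: (not e["q"]) or e["p"]
```

Here `reordered` is equivalent to `gold` despite its different shape, while `reversed_implication` reuses exactly the same symbols yet is not equivalent, which is the kind of case a pure overlap metric can score misleadingly.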
[75] CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis
Jinyuan Xu,Tian Lan,Xintao Yu,Xue He,Hezhi Zhang,Ying Wang,Pierre Magistry,Mathieu Valette,Lei Li
Main category: cs.CL
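TL;DR: The paper releases CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media. It combines binary risk labels with multi-dimensional psychological attributes, supporting fine-grained analysis of depressive signals and fine-tuning of large language models.
Details
Motivation: Publicly available Chinese-language resources for depression risk detection are scarce and mostly limited to binary classification. The dataset addresses this with richer annotations and multi-dimensional psychological attributes.
Contribution: Releases CNSocialDepress, with 44,178 texts from 233 users and 10,306 annotated depression-related segments, supporting both binary classification and multi-dimensional psychological analysis.
Method: Psychological experts annotated depression-related segments and multi-dimensional psychological attributes to build a dataset for depression risk detection and structured analysis.
Result: Experiments show the dataset is useful across a range of NLP tasks, including structured psychological profiling and fine-tuning LLMs for depression detection.
Insight: The dataset fills a gap in existing resources and supports mental health applications for Chinese-speaking populations.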
Abstract: Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, within which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset’s effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights to mental health applications tailored for Chinese-speaking populations.
[76] Towards Real-Time Fake News Detection under Evidence Scarcity
Guangyu Wei,Ke Han,Yueming Lyu,Yu Luo,Yue Jiang,Caifeng Shan,Nicu Sebe
Main category: cs.CL
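TL;DR: The paper proposes EASE, a framework that addresses evidence scarcity in real-time fake news detection by dynamically drawing on three perspectives: evidence, reasoning, and sentiment.
Details
Motivation: Real-time fake news detection must cope with scarce evidence. Existing methods rely heavily on external evidence and generalize poorly, so the authors propose EASE, which adapts its decision process to the evidence available.
Contribution: (1) The EASE framework, which dynamically combines evidence, reasoning, and sentiment perspectives; (2) instruction tuning with pseudo labels to make the evaluators more accurate; (3) RealTimeNews-25, a benchmark of recent news for testing generalization to emerging news.
Method: EASE evaluates three perspectives in sequence: evidence-based evaluation (used only when the evidence is sufficiently supportive), reasoning-based evaluation (used only when the LLM's knowledge is judged reliable), and a sentiment-based fallback. Instruction tuning with pseudo labels improves the evaluators, and expert modules integrate their assessments with the news content.
Result: EASE achieves state-of-the-art performance on multiple benchmarks and significantly improves generalization on the new RealTimeNews-25 dataset.
Insight: Dynamic evaluation combined with multi-perspective fusion is effective under evidence scarcity, and instruction tuning with pseudo labels makes the evaluation more reliable.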
Abstract: Fake news detection becomes particularly challenging in real-time scenarios, where emerging events often lack sufficient supporting evidence. Existing approaches often rely heavily on external evidence and therefore struggle to generalize under evidence scarcity. To address this issue, we propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection that dynamically adapts its decision-making process according to the assessed sufficiency of available evidence. EASE introduces a sequential evaluation mechanism comprising three independent perspectives: (1) Evidence-based evaluation, which assesses evidence and incorporates it into decision-making only when the evidence is sufficiently supportive; (2) Reasoning-based evaluation, which leverages the world knowledge of large language models (LLMs) and applies them only when their reliability is adequately established; and (3) Sentiment-based fallback, which integrates sentiment cues when neither evidence nor reasoning is reliable. To enhance the accuracy of evaluation processes, EASE employs instruction tuning with pseudo labels to guide each evaluator in justifying its perspective-specific knowledge through interpretable reasoning. Furthermore, the expert modules integrate the evaluators’ justified assessments with the news content to enable evaluation-aware decision-making, thereby enhancing overall detection accuracy. Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news for evaluating model generalization on emerging news with limited evidence. Extensive experiments demonstrate that EASE not only achieves state-of-the-art performance across multiple benchmarks, but also significantly improves generalization to real-time news. The code and dataset are available at https://github.com/wgyhhhh/EASE.
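The sequential evaluation described in the abstract amounts to a fallback chain, sketched below. In the paper the two sufficiency judgments come from instruction-tuned evaluators; here they are plain booleans supplied by the caller, and the `Assessment` class and `select_expert` function are hypothetical names rather than anything from the released code.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    """Evaluator verdicts for one news item (hypothetical structure)."""
    evidence_sufficient: bool   # is the retrieved evidence supportive enough?
    reasoning_reliable: bool    # is the LLM's own knowledge judged reliable?

def select_expert(assessment: Assessment) -> str:
    """Pick the perspective to rely on, in the order given in the abstract."""
    if assessment.evidence_sufficient:
        return "evidence"
    if assessment.reasoning_reliable:
        return "reasoning"
    return "sentiment"  # fallback when neither evidence nor reasoning is reliable
```

For a breaking story with no usable evidence but a reliable reasoning verdict, `select_expert(Assessment(False, True))` selects the reasoning expert.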
[77] Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Nikita Afonin,Nikita Andriyanov,Nikhil Bageshpura,Kyle Liu,Kevin Zhu,Sunishchal Dev,Ashwinee Panda,Alexander Panchenko,Oleg Rogov,Elena Tutubalina,Mikhail Seleznyov
Main category: cs.CL
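TL;DR: The paper asks whether narrow in-context learning (ICL) can produce emergent misalignment (EM), that is, broad misalignment in large language models (LLMs). It finds that EM does appear under ICL and grows with the number of in-context examples.
Details
Motivation: Prior work showed that narrow finetuning can produce broadly misaligned LLMs but did not cover in-context learning. This paper fills that gap by testing whether ICL triggers EM as well.
Contribution: Shows that EM also emerges under ICL, reports the rates at which three frontier models give misaligned responses after narrow in-context examples, and analyzes the mechanism.
Method: Tests three frontier LLMs on three datasets with varying numbers of narrow in-context examples, and elicits step-by-step reasoning to examine the mechanism.
Result: With 64 narrow examples, misaligned response rates range from 2% to 17%; with 256 examples they reach up to 58%. In 67.5% of misaligned traces, the model explicitly rationalizes harmful output by adopting a reckless or dangerous "persona".
Insight: EM under ICL resembles finetuning-induced EM in mechanism, suggesting that narrow inputs can generalize into broad misalignment, a risk to keep in mind when designing and using LLMs.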
Abstract: Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous “persona”, echoing prior results on finetuning-induced EM.
[78] Are Large Language Models Effective Knowledge Graph Constructors?
Ruirui Chen,Weifeng Jiang,Chengwei Qin,Bo Xiong,Fiona Liausvia,Dongkyu Choi,Boon Kiat Quek
Main category: cs.CL
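TL;DR: The paper examines how effective large language models (LLMs) are at constructing knowledge graphs (KGs). It proposes a hierarchical extraction framework for building semantically rich, well-structured KGs and evaluates the capabilities and limitations of LLMs on this task.
Details
Motivation: Knowledge graphs are vital for knowledge-intensive tasks and can reduce hallucinations in LLMs, but building high-quality KGs remains difficult. Existing LLM-based approaches are usually limited to entity and relation extraction and lack comprehensive semantic coverage and structured representations.
Contribution: (1) A hierarchical extraction framework that organizes information at multiple levels to build semantically rich KGs; (2) a comprehensive evaluation of the capabilities and limitations of LLMs in KG construction; (3) a released dataset of LLM-generated KGs to advance research in high-stakes domains such as healthcare.
Method: Applies the hierarchical extraction framework with state-of-the-art LLMs to extract information from text, then evaluates the resulting KGs from both structural and semantic perspectives.
Result: The paper identifies the strengths and shortcomings of current LLMs in KG construction and highlights key challenges for future work.
Insight: LLMs show promise for KG construction but need improvement to support broader semantic coverage and structured representations. The released resource can help advance applications in high-stakes domains.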
Abstract: Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown promise in reducing hallucinations in large language models (LLMs). However, constructing high-quality KGs remains difficult, requiring accurate information extraction and structured representations that support interpretability and downstream utility. Existing LLM-based approaches often focus narrowly on entity and relation extraction, limiting coverage to sentence-level contexts or relying on predefined schemas. We propose a hierarchical extraction framework that organizes information at multiple levels, enabling the creation of semantically rich and well-structured KGs. Using state-of-the-art LLMs, we extract and construct knowledge graphs and evaluate them comprehensively from both structural and semantic perspectives. Our results highlight the strengths and shortcomings of current LLMs in KG construction and identify key challenges for future work. To advance research in this area, we also release a curated dataset of LLM-generated KGs derived from research papers on children’s mental well-being. This resource aims to foster more transparent, reliable, and impactful applications in high-stakes domains such as healthcare.
[79] FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks
Sabrina McCallum,Amit Parekh,Alessandro Suglia
Main category: cs.CL
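TL;DR: The paper proposes FOSSIL, a method that combines optimal and suboptimal demonstrations with language feedback to improve the generalization and data efficiency of imitation learning on embodied vision-and-language tasks.
Details
Motivation: Current embodied AI approaches mostly learn policies from expert demonstrations but cannot assess demonstration quality, so they either learn only from optimal behaviour or risk replicating errors. Reinforcement learning is an alternative, but its exploration sacrifices data efficiency. This work uses language feedback to make suboptimal demonstrations useful for learning.
Contribution: (1) The FOSSIL method, which combines optimal and suboptimal demonstrations with language feedback; (2) optional auxiliary self-supervised objectives for predicting feedback; (3) validation in the BabyAI-XGen environment, with significant gains in generalization and robustness.
Method: Language feedback embeddings are provided as part of the input sequence to a Transformer-based policy, and auxiliary self-supervised objectives (such as feedback prediction) optionally complement the standard next-action prediction objective.
Result: FOSSIL significantly improves agents' compositional generalisation and robustness on embodied vision-and-language tasks. Language feedback proves a competitive and intuitive alternative to intermediate scalar rewards.
Insight: Language feedback helps a model learn from suboptimal behaviour and so improves data efficiency, which suggests that linguistic context can compensate for imperfect demonstrations in embodied AI tasks.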
Abstract: Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents’ compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.
[80] Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications
Belkiss Souayed,Sarah Ebling,Yingqiang Gao
Main category: cs.CL
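TL;DR: The paper proposes a template-based text-to-image alignment approach for language accessibility. A structured vision-language model prompting framework generates minimal, easy-to-understand images, and the object-focused template performs best on semantic alignment and accessibility.
Details
Motivation: People with intellectual disabilities often struggle to comprehend complex texts, and existing text-to-image models tend to prioritize aesthetics over accessibility. The paper studies how to generate more accessible images from simplified texts.
Contribution: Proposes a structured vision-language model prompting framework with five prompt templates designed around accessibility requirements; evaluates the templates and the effect of different data sources, yielding practical guidelines for accessible content generation.
Method: Designs five prompt templates (Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, Grid Layout), each with its own spatial arrangement and accessibility constraints, and runs a two-phase evaluation on 400 simplified texts (CLIPScore, then annotation by human experts).
Result: The Basic Object Focus template achieves the highest semantic alignment. Retro is rated the most accessible visual style and Wikipedia the most effective data source. Annotator agreement is strong on the Text Simplicity dimension.
Insight: Visual minimalism, as in Basic Object Focus, helps language accessibility, and structured prompting matters in AI-generated accessibility tools. Expert evaluation also shows that Image Quality judgments are more subjective.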
Abstract: Individuals with intellectual disabilities often have difficulties in comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it is not clear how visual illustrations relate to text simplifications (TS) generated from them. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates, i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following distinct spatial arrangements while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus prompt template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.
[81] Do LLMs “Feel”? Emotion Circuits Discovery and Control
Chenxi Wang,Yixuan Zhang,Ruiji Yu,Yufei Zheng,Lang Gao,Zirui Song,Zixiang Xu,Gus Xia,Huishuai Zhang,Dongyan Zhao,Xiuying Chen
Main category: cs.CL
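TL;DR: The study systematically investigates the mechanisms behind emotional expression in large language models (LLMs), proposes a method for discovering and controlling emotion circuits, and achieves high emotion-expression accuracy.
Details
Motivation: As demand for emotional intelligence in LLMs grows, understanding the internal mechanisms of emotional expression and controlling emotion in generated text become key challenges.
Contribution: (1) The SEV dataset for eliciting comparable emotional states; (2) context-agnostic emotion directions that encode emotion consistently across contexts; (3) identification of the neurons and attention heads that locally implement emotional computation; (4) global emotion circuits whose modulation yields highly accurate emotion control.
Method: Using the SEV dataset, the authors extract emotion directions, identify local components through analytical decomposition and causal analysis, integrate them into global emotion circuits, and control emotion by modulating those circuits.
Result: Modulating the emotion circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods.
Insight: According to the authors, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.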
Abstract: As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer’s causal influence on the model’s final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.
[82] LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Shuaiqiang Wang,Dawei Yin,Xueqi Cheng
Main category: cs.CL
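TL;DR: The paper introduces the notion of LLM-specific utility, arguing that in retrieval-augmented generation (RAG) different LLMs benefit differently from the same external knowledge, so generic utility annotation is inadequate. Experiments show that human-annotated passages are not optimal for LLMs and that useful passages do not transfer across models. The paper also proposes a benchmarking procedure for LLM-specific utility judgments.
Details
Motivation: Traditional RAG research treats utility as a generic attribute, ignoring that differences in internal knowledge and comprehension ability can make the same passage more or less useful to different LLMs. The paper sets out to expose this difference and argue for LLM-specific utility.
Contribution: (1) Introduces and systematically studies LLM-specific utility; (2) shows experimentally that human-annotated passages are not optimal for LLMs and that useful passages are not transferable across models; (3) proposes a benchmarking procedure for LLM-specific utility judgments.
Method: (1) Large-scale experiments across multiple datasets and LLMs; (2) analysis of the gap between human-annotated passages and the passages that are actually useful to a given LLM, including the role of perplexity; (3) a benchmarking procedure for LLM-specific utility judgments, applied to existing utility judgment methods on six datasets.
Result: Human-annotated passages are not optimal for specific LLMs, and useful passages do not transfer across models. Among existing utility judgment methods, verbalized methods using pseudo-answers are robust, but LLMs still struggle to assess utility effectively.
Insight: Utility judgments need a model-specific perspective, since generic utility annotation can fail in RAG. Perplexity is a key metric for how readable queries and passages are to a given LLM.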
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG’s effectiveness depends on the utility of retrieved passages, i.e., the usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting the LLM-specific utility in RAG research. Our findings indicate that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partially due to the varying readability of queries and passages for LLMs, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively, failing to reject all passages for known queries and to select truly useful ones for unknown queries.
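Since the abstract names perplexity as the key readability signal, a short reminder of how it is computed may help. The sketch uses the standard definition, the exponential of the negative mean token log-probability; where the log-probabilities come from (which model, which tokenizer) is left to the caller, and the function name is an assumption.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence from its natural-log token probabilities:
    exp(-(1/N) * sum(log p_i)). Lower means the text is easier for the model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Scoring the same passage with two different LLMs and comparing the results is one simple way to see that readability, and hence plausibly utility, is model-dependent.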
[83] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Wenhan Ma,Hailin Zhang,Liang Zhao,Yifan Song,Yudong Wang,Zhifang Sui,Fuli Luo
Main category: cs.CL
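TL;DR: The paper proposes Rollout Routing Replay (R3), which records routing distributions from the inference engine and replays them during training. This addresses the instability caused by routing in MoE models and markedly stabilizes reinforcement learning.
Details
Motivation: In Mixture-of-Experts (MoE) models, routing behaves inconsistently between training and inference, which can destabilize reinforcement learning or even cause training to collapse.
Contribution: Proposes R3, which aligns routing behavior between training and inference, reduces the training-inference policy KL divergence, and substantially stabilizes RL training of MoE models.
Method: R3 records the routing distributions produced at inference time and replays them during training, narrowing the behavioral gap between the two phases. It mitigates extreme discrepancies without compromising training speed.
Result: Across various settings, R3 stabilizes RL training, prevents collapse, and outperforms methods such as GSPO and TIS.
Insight: The work identifies training-inference routing inconsistency as a source of instability in MoE reinforcement learning and offers a practical way to address it.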
Abstract: Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
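The record-and-replay idea can be sketched for a single token as follows. This is one plausible reading rather than the authors' implementation: the sketch reuses the expert selection recorded at inference time while computing gate weights from the training-side logits, and that split, along with every function name, is an assumption.

```python
import math

def top_k_experts(logits, k):
    """Indices of the k experts with the highest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def gate_weights(logits, selected):
    """Softmax over the selected experts' logits only."""
    peak = max(logits[i] for i in selected)
    exps = {i: math.exp(logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}

def route(logits, k, replay=None):
    """Route one token. With `replay`, reuse the recorded expert selection
    instead of recomputing top-k from the (possibly drifted) training logits."""
    selected = replay if replay is not None else top_k_experts(logits, k)
    return gate_weights(logits, selected)

# Hypothetical logits for the same token in the inference and training engines.
inference_logits = [2.0, 1.0, 0.5, 0.1]
training_logits = [1.0, 2.0, 0.4, 2.1]
recorded = top_k_experts(inference_logits, k=2)
```

Without replay, the training pass would activate experts 3 and 1 for this token; with `route(training_logits, 2, replay=recorded)` it activates experts 0 and 1, matching what the rollout actually used.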
[84] Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies
Zirui Song,Yuan Huang,Junchang Liu,Haozhe Luo,Chenxi Wang,Lang Gao,Zixiang Xu,Mingfei Han,Xiaojun Chang,Xiuying Chen
Main category: cs.CL
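TL;DR: The paper proposes a novel strategy-alignment evaluation that uses high-quality, human-verified data to assess large language models (LLMs) in social deduction games, revealing weaknesses in deception and counterfactual reasoning.
Details
Motivation: Most existing work reduces social deduction games to LLM self-play, overlooking the richness of social gameplay, and lacks both high-quality reference data and fine-grained evaluation methods.
Contribution: (1) A human-verified multimodal Werewolf dataset; (2) a two-stage strategy-alignment evaluation framework that assesses models’ social abilities at a fine granularity.
Method: (1) Speech evaluation: multiple-choice-style tasks that assess the model’s stance across five dimensions of social ability; (2) decision evaluation: analysis of the model’s voting choices and opponent-role inferences.
Result: State-of-the-art LLMs perform unevenly, with roughly half scoring below 0.50, and are particularly weak at deception and counterfactual reasoning.
Insight: The study underscores the value of strategy-alignment evaluation and opens directions for research on language, reasoning, and strategy in multi-agent interaction.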
Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction’s strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model’s voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models’ linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs show diverse performance, with roughly half remaining below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.
[85] KnowRL: Teaching Language Models to Know What They Know
Sahil Kale,Devendra Singh Dhami
Main category: cs.CL
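TL;DR: KnowRL uses introspection and consensus-based rewarding to improve a language model’s awareness of its own knowledge boundaries, strengthening self-knowledge consistency without external supervision.
Details
Motivation: Current large language models (LLMs) misjudge their own competence in more than 20% of cases, which makes their answers unreliable. A method is needed that helps models gauge the scope of their knowledge more accurately, for safer and more reliable AI behavior.
Contribution: KnowRL is a framework that needs no external supervision. It combines introspection (generating tasks and classifying them by feasibility) with consensus-based rewarding (reinforcing the stability of self-assessment) to markedly improve self-knowledge consistency.
Method: KnowRL has two components: (1) introspection, where the model generates tasks and classifies them as feasible or infeasible; (2) consensus-based rewarding, where internal agreement reinforces the stability of self-assessment. The framework relies only on internally generated data, avoiding the cost of external supervision.
Result: On LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL improves self-knowledge, with gains of up to 28% in accuracy and 12% in F1, and outperforms baselines within a few iterations.
Insight: KnowRL shows that language models can improve their own knowledge awareness through internal mechanisms, without external intervention. This suggests a route to more reliable and safer AI systems, particularly in critical applications.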
Abstract: Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model’s internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.
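One way to picture consensus-based rewarding is as an agreement score over repeated self-assessments of the same task. The sketch below is an assumption-laden illustration, not KnowRL’s actual reward: the function name, the label strings, and the use of a simple majority fraction are all hypothetical.

```python
from collections import Counter


def consensus_reward(labels):
    """Return (majority label, fraction of samples agreeing with it).

    Assumption: the reward is the share of sampled self-assessments that
    match the majority verdict, so stable judgments score close to 1.0.
    """
    if not labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Four hypothetical self-assessments of one generated task.
samples = ["feasible", "feasible", "infeasible", "feasible"]
label, reward = consensus_reward(samples)
```

Here three of the four samples agree, so the verdict is "feasible" with a reward of 0.75; a model that flips between verdicts would earn a lower reward, and no external labels are involved.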
[86] Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification
Stefan Krsteski,Giuseppe Russo,Serina Chang,Robert West,Kristina Gligorić
Main category: cs.CL
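TL;DR: The paper studies how to obtain valid survey simulations from large language models (LLMs) with limited human data by combining prompting, fine-tuning, and rectification. Synthesis alone introduces substantial bias (24-86%), whereas adding rectification reduces bias below 5% and increases effective sample size by up to 14%.
Details
Motivation: Traditional surveys are costly and slow, and LLMs have been proposed as a low-cost, scalable substitute. Their outputs are biased, however, which raises the question of how human responses are best allocated between synthesis and rectification.
Contribution: A systematic analysis of the interplay between synthesis and rectification methods, showing that under a fixed budget, allocating most human responses to rectification rather than fine-tuning yields far more effective estimation.
Method: Using two panel surveys (with questions on nutrition, politics, and economics), the study compares the bias of synthesis alone against synthesis combined with rectification and evaluates how human data is best allocated.
Result: Synthesis alone yields 24-86% bias; combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%.
Insight: Using all human responses for fine-tuning, a common practice, is suboptimal. Under a limited budget, prioritizing rectification markedly improves estimate quality.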
Abstract: Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24-86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.
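The abstract does not say which rectification method is used. As a rough illustration of the general idea, the sketch below uses a simple difference estimator: take the mean of the LLM-simulated answers and correct it by the average human-minus-synthetic gap on the respondents for whom human answers exist. The estimator choice, the function names, and all data values are assumptions made for illustration only.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def rectified_mean(synthetic_all, synthetic_paired, human_paired):
    """Synthetic mean plus the average human-minus-synthetic gap on paired items."""
    if len(synthetic_paired) != len(human_paired):
        raise ValueError("paired lists must have equal length")
    correction = mean([h - s for h, s in zip(human_paired, synthetic_paired)])
    return mean(synthetic_all) + correction


# Toy binary answers (1 = "agree"); every value here is invented.
synthetic_all = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # LLM-simulated respondents
synthetic_paired = [1, 1, 0, 1]                   # LLM answers where human data exists
human_paired = [1, 0, 0, 0]                       # the matching human answers

naive_estimate = mean(synthetic_all)
rectified_estimate = rectified_mean(synthetic_all, synthetic_paired, human_paired)
```

In this toy case the synthetic-only estimate is 0.8, while the paired human answers show the simulator overstating agreement by 0.5 on average, so the rectified estimate moves to about 0.3. The human responses are spent on measuring the simulator’s bias rather than on fine-tuning it.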
[87] Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models
Yusheng Song,Lirong Qiu,Xi Zhang,Zhihao Tang
Main category: cs.CL
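TL;DR: The paper proposes a unified framework for hallucination detection in large language models that combines internal state probing with chain-of-thought verification, overcoming the signal scarcity and representational alignment barriers.
Details
Motivation: Hallucination detection faces a detection dilemma: existing methods work well on either fact-intensive or logic-intensive tasks, but not both. The authors seek a single approach that covers both.
Contribution: A unified hallucination detection framework that combines internal state probing with chain-of-thought verification and addresses two challenges, signal scarcity and representational alignment.
Method: A multi-path reasoning mechanism yields finer-grained signals, and a segment-aware temporalized cross-attention module fuses the aligned representations to detect subtle inconsistencies.
Result: On three diverse benchmarks and two leading LLMs, the framework consistently and significantly outperforms strong baselines.
Insight: Fusing internal states with the consistency of externalized reasoning allows more comprehensive hallucination detection across both fact-intensive and logic-intensive tasks.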
Abstract: The detection of sophisticated hallucinations in Large Language Models (LLMs) is hampered by a “Detection Dilemma”: methods probing internal states (Internal State Probing) excel at identifying factual inconsistencies but fail on logical fallacies, while those verifying externalized reasoning (Chain-of-Thought Verification) show the opposite behavior. This schism creates a task-dependent blind spot: Chain-of-Thought Verification fails on fact-intensive tasks like open-domain QA where reasoning is ungrounded, while Internal State Probing is ineffective on logic-intensive tasks like mathematical reasoning where models are confidently wrong. We resolve this with a unified framework that bridges this critical gap. However, unification is hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse symbolic reasoning chains lack signals directly comparable to fine-grained internal states, and the Representational Alignment Barrier, a deep-seated mismatch between their underlying semantic spaces. To overcome these, we introduce a multi-path reasoning mechanism to obtain more comparable, fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse these now-aligned representations, pinpointing subtle dissonances. Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that our framework consistently and significantly outperforms strong baselines. Our code is available: https://github.com/peach918/HalluDet.
[88] Information-Preserving Reformulation of Reasoning Traces for Antidistillation
Jiayu Ding,Lei Cui,Li Dong,Nanning Zheng,Furu Wei
Main category: cs.CL
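TL;DR: The paper proposes PART, which reformulates reasoning traces so that they disrupt distillation while preserving their information, protecting large language models’ reasoning traces from unauthorized distillation.
Details
Motivation: Existing protection strategies sacrifice information (for example, by replacing detailed reasoning with brief summaries). PART aims to resolve this trade-off.
Contribution: PART, an information-preserving reformulation of reasoning traces that disrupts distillation without hindering human understanding.
Method: A two-step reformulation, removing self-talk behaviors and reordering sub-conclusions, carried out by a small auxiliary model.
Result: PART consistently degrades distillation across student models of different sizes and types. For example, a 32B student model’s AIME 2024 score drops from 54.17 to 46.88, a 13.5% relative degradation.
Insight: Humans understand reasoning traces differently from how LLMs exploit them for supervised fine-tuning, and PART leverages this difference to protect the traces.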
Abstract: Recent advances in Large Language Models (LLMs) show that extending the length of reasoning chains significantly improves performance on complex tasks. While revealing these reasoning traces helps users better follow, verify, and learn from the model’s problem-solving process, it also makes them highly vulnerable to unauthorized distillation. To mitigate this risk, proprietary model providers often adopt aggressive protection strategies, such as replacing detailed reasoning with brief summaries, which deprive users of valuable intermediate information. To address this trade-off, we propose PART, an information-preserving antidistillation reformulation of reasoning traces. Motivated by the difference between how humans understand reasoning traces and how LLMs exploit them for supervised fine-tuning, we design a simple but effective two-step reformulation: removing self-talk behaviors and reordering sub-conclusions. A small auxiliary model is trained to perform this reformulation, incurring minimal computational overhead. Extensive experiments demonstrate that PART consistently disrupts distillation across student models of different sizes and types on various reasoning benchmarks. For instance, when training on reformulated traces, even the performance of a large 32B student model decreases from 54.17 to 46.88 on AIME 2024, corresponding to a 13.5% degradation.
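The 13.5% figure in this abstract is a relative drop, not a difference in points. A short check of the arithmetic, using only the two AIME 2024 scores quoted above (the helper name is hypothetical):

```python
def relative_drop(before, after):
    """Relative degradation in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


absolute_drop = round(54.17 - 46.88, 2)               # difference in score points
relative_drop_pct = round(relative_drop(54.17, 46.88), 1)
```

The student loses 7.29 points in absolute terms, which corresponds to the 13.5% relative degradation reported.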
[89] LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings
Ting Li,Yang Yang,Yipeng Yu,Liang Yao,Guoqing Chao,Ruifeng Xu
Main category: cs.CL
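TL;DR: The paper proposes LLMAtKGE, a framework based on large language models (LLMs) for attacking knowledge graph embeddings (KGE) that also generates human-readable explanations. Structured prompting and filtering address the LLM’s input limits and hesitation, and experiments show strong attack performance together with explanation capability.
Details
Motivation: Existing black-box attacks on knowledge graph embeddings cannot produce human-readable explanations and generalize poorly. The strong text comprehension and generation abilities of LLMs open new possibilities here.
Contribution: 1. LLMAtKGE, an LLM-based attack framework for knowledge graph embeddings. 2. A structured prompting scheme that casts the attack as multiple-choice questions, improving the LLM’s use of context. 3. Semantics-based and centrality-based filters that address input-length limits. 4. Precomputed high-order adjacency and LLM fine-tuning to improve filtering.
Method: 1. Formulate the attack as multiple-choice questions through structured prompts. 2. Compress the candidate set with semantics-based and centrality-based filters. 3. Precompute high-order adjacency and fine-tune the LLM on a triple classification task to strengthen filtering. 4. Validate attack performance and explanation capability experimentally.
Result: On two widely used knowledge graph datasets, LLMAtKGE outperforms the strongest black-box baselines, provides human-readable explanations, and is competitive with white-box methods.
Insight: LLMs can be applied successfully to knowledge graph attacks, with structured prompting and filtering handling the practical problems of input limits and explanation generation.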
Abstract: Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model’s link-prediction ability by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines, provides explanations via reasoning, and is competitive with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.
[90] Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models
Georg Ahnert,Anna-Carolina Haensch,Barbara Plank,Markus Strohmaier
Main category: cs.CL
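TL;DR: The paper systematically studies how 8 closed-ended Survey Response Generation Methods affect predicted survey responses, reporting results from 32 million simulated responses. Restricted Generation Methods perform best overall, and reasoning output does not consistently improve alignment.
Details
Motivation: To identify best practice for simulating closed-ended human survey responses with large language models (LLMs), which are typically trained to generate open-ended text. Prior work uses a wide range of methods, and no standard practice has emerged.
Contribution: (1) A systematic comparison of 8 generation methods; (2) an experimental evaluation over 32 million simulated responses; (3) the finding that Restricted Generation Methods perform best and that reasoning does not consistently help; (4) practical recommendations.
Method: The 8 closed-ended Survey Response Generation Methods are applied to 4 political attitude surveys and 10 open-weight language models, and alignment is measured at both the individual and subpopulation level.
Result: Restricted Generation Methods perform best overall (e.g., forcing the answer into a specific format), and reasoning output does not consistently improve alignment.
Insight: The choice of response generation method has a significant effect on simulated results and should be made carefully. Reasoning ability does not necessarily carry over to closed-ended tasks.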
Abstract: Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 million simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.
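The abstract does not define Restricted Generation Methods in detail. One common way to restrict a closed-ended answer, sketched here purely as an assumption, is to discard every next-token score except those of the permitted answer labels and renormalise over what remains. The function name, the token strings, and the logit values are hypothetical.

```python
import math


def restricted_choice(token_logits, allowed):
    """Renormalise logits over the allowed answer tokens and return (argmax, probabilities).

    Assumption: token_logits maps candidate next tokens to raw scores, and
    allowed lists the answer labels of a closed-ended question.
    """
    options = [t for t in allowed if t in token_logits]
    if not options:
        raise ValueError("no allowed option has a logit")
    m = max(token_logits[t] for t in options)
    weights = {t: math.exp(token_logits[t] - m) for t in options}
    total = sum(weights.values())
    probs = {t: w / total for t, w in weights.items()}
    best = max(options, key=lambda t: probs[t])
    return best, probs


# Hypothetical next-token scores: the unrestricted argmax is the open-ended
# token "I", which is not a valid answer on a 1-5 scale.
logits = {"I": 3.2, "1": 1.0, "2": 2.0, "3": 0.5, "The": 2.9}
answer, probs = restricted_choice(logits, allowed=["1", "2", "3", "4", "5"])
```

With the restriction in place the model answers "2", and the returned probabilities form a distribution over valid options only, which can be compared directly with human response distributions.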
[91] MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models
Bo Cheng,Xu Wang,Jinda Liu,Yi Chang,Yuan Wu
Main category: cs.CL
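TL;DR: The paper proposes MeTA-LoRA, a two-stage optimization framework that improves the data efficiency of multi-task adaptation for large language models. Task-specific LoRA adapters first adapt quickly from a few samples, and the second stage aggregates multi-task gradients to promote knowledge transfer, substantially reducing data requirements.
Details
Motivation: LoRA is highly effective in single-task fine-tuning, but in multi-task learning it struggles to exploit inter-task knowledge efficiently and often needs substantial task-specific data. MeTA-LoRA is proposed to address this.
Contribution: The MeTA-LoRA framework, whose two-stage optimization markedly improves data efficiency in multi-task learning while matching or surpassing traditional full-data LoRA fine-tuning.
Method: Two stages: 1. Learn task-specific LoRA adapters from a few samples per task. 2. Update the shared LoRA adapter by aggregating gradients from multiple tasks, promoting knowledge transfer.
Result: In both multi-task and multilingual learning scenarios, MeTA-LoRA matches or surpasses traditional full-data LoRA fine-tuning while using significantly less task-specific data.
Insight: A small amount of per-task data combined with cross-task knowledge sharing is enough to adapt large language models efficiently when data is limited.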
Abstract: Low-Rank Adaptation (LoRA) has emerged as one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting large language models (LLMs) to downstream tasks. While highly effective in single-task settings, it struggles to efficiently leverage inter-task knowledge in complex multi-task learning scenarios, often requiring substantial task-specific data to achieve optimal performance. To address this limitation, we introduce MeTA-LoRA, a two-stage optimization framework that significantly improves data efficiency in multi-task adaptation. In the first stage, task-specific LoRA adapters are learned using only a few samples from each involved dataset, enabling rapid adaptation without large-scale supervision. In the second stage, the shared LoRA adapter is updated by aggregating gradients from multiple tasks to promote knowledge transfer across tasks, further reducing data usage by leveraging common patterns. In both multi-task learning and multilingual learning scenarios, our method matches or surpasses the performance of traditional full-data LoRA fine-tuning approaches, while using significantly less task-specific data.
[92] StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Zehao Chen,Rong Pan,Haoran Li
Main category: cs.CL
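TL;DR: The paper proposes StoryBox, a hybrid bottom-up approach to long-form story generation based on multi-agent simulation: agents interact in a dynamic sandbox environment, and the events they generate are assembled into a long story.
Details
Motivation: Inspired by how human writers work, the authors want a method that simulates interactions between characters and their environment so that stories emerge naturally, avoiding the rigidity of traditional top-down approaches.
Contribution: A hybrid bottom-up framework for long-form story generation in which multi-agent simulation drives dynamic, organic plot development, producing coherent stories of more than 10,000 words.
Method: Multiple agents interact in a sandbox environment. Their behaviors and their interactions with one another and the environment produce emergent events, which form the foundation of the story. Large language models are used throughout.
Result: The system achieves state-of-the-art performance across several metrics, and the generated stories maintain coherence and consistency.
Insight: A hybrid bottom-up approach can balance structure against freedom and offers a scalable solution for long-form story generation.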
Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.
[93] Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Siheng Xiong,Ali Payani,Faramarz Fekri
Main category: cs.CL
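TL;DR: The paper proposes Multi-Path Plan Aggregation (MPPA), a framework that strengthens language model reasoning by exploring and aggregating multiple candidate plans, and pairs it with online Step-DPO for training, improving performance on long chain-of-thought reasoning tasks.
Details
Motivation: Existing single-pass reasoning is prone to CoT derailment, where the reasoning chain drifts off course, and the problem is most severe for smaller models with long chains. Analyzing the hierarchy of reasoning steps, the authors find that most errors originate in planning steps, which motivates their approach.
Contribution: 1. The MPPA framework, which generates and aggregates multiple candidate plans to improve the reasoning process; 2. Online Step-DPO, which uses Twisted Sequential Monte Carlo (TSMC) to provide efficient stepwise supervision, addressing the efficiency and stability problems of training on long trajectories.
Method: 1. MPPA generates and aggregates multiple candidate plans following a variable interval schedule; 2. A lightweight LoRA module implements the plan aggregation policy; 3. Online Step-DPO, combined with TSMC, is used for training.
Result: On math, science, and logical reasoning benchmarks, using only 10% of the SFT data and 5% of the preference pairs, MPPA outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline.
Insight: Improving planning steps is more effective than directly optimizing execution steps, and stepwise supervision (such as Step-DPO) has advantages over outcome-reward RL on long trajectories.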
Abstract: Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
[94] ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
Xin Gui,King Zhu,JinCheng Ren,Qianben Chen,Zekun Moore Wang,Yizhi LI,Xinpeng Liu,Xiaowan Li,Wenli Ren,Linyu Miao,Tianrui Qin,Ziqi Shu,He Zhu,Xiangru Tang,Dingfeng Shi,Jiaheng Liu,Yuchen Eleanor Jiang,Minghao Liu,Ge Zhang,Wangchunshu Zhou
Main category: cs.CL
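TL;DR: The paper introduces the Acadreason benchmark, designed to evaluate the high-level reasoning of large language models (LLMs) and agents on academic problems across multiple domains.
Details
Motivation: Existing evaluations focus mainly on math/code contests or general tasks, and there is no rigorous benchmark for complex reasoning over academic knowledge.
Contribution: The Acadreason benchmark, consisting of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy.
Method: Problems are sourced from top-tier publications and undergo rigorous annotation and quality control. More than 10 mainstream LLMs and agents are evaluated.
Result: Most LLMs score below 20 points, and even the cutting-edge GPT-5 reaches only 16. Agents score higher, but none exceeds 40 points.
Insight: Acadreason exposes the capability gap of LLMs and agents on academic research tasks and sets a challenge for future work.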
Abstract: In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.
[95] Scaling Language-Centric Omnimodal Representation Learning
Chenghao Xiao,Hou Pong Chan,Hao Zhang,Weiwen Xu,Mahani Aljunied,Yu Rong
Main category: cs.CL
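TL;DR: The paper proposes a Language-Centric Omnimodal Embedding framework (LCO-Emb). It shows that multimodal large language models (MLLMs) acquire implicit cross-modal alignment during generative pretraining, so that contrastive learning acts as a lightweight refinement stage. The authors also propose a Generation-Representation Scaling Law (GRSL), under which stronger generative ability improves representation quality.
Details
Motivation: Recent multimodal embedding methods perform well, but the reasons behind their advantage remain underexplored. The paper investigates the implicit cross-modal alignment that MLLMs acquire during generative pretraining and uses this finding to improve representation learning.
Contribution: (1) The LCO-Emb framework, which exploits the implicit alignment in MLLMs for efficient representation learning; (2) the GRSL, which shows a positive relationship between generative ability and representation quality; (3) theoretical and experimental validation of the GRSL.
Method: Analyze anisotropy and kernel similarity structure to confirm implicit cross-modal alignment in MLLMs; design LCO-Emb with contrastive learning as a lightweight refinement; state the GRSL and derive theoretically how generative quality bounds representation performance.
Result: LCO-Emb achieves state-of-the-art performance across diverse backbones and benchmarks. The GRSL is validated on a low-resource visual-document retrieval task, where continual generative pretraining before contrastive learning further improves embedding capability.
Insight: Improving generative ability is an effective route to better representations, and the implicit alignment acquired during generative pretraining offers a new perspective on multimodal representation learning.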
Abstract: Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM’s generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM’s generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model’s embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
[96] When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Lingfei Qian,Xueqing Peng,Yan Wang,Vincent Jim Zhang,Huan He,Hanley Smith,Yi Han,Yueru He,Haohang Li,Yupeng Cao,Yangyang Yu,Alejandro Lopez-Lira,Peng Lu,Jian-Yun Nie,Guojun Xiong,Jimin Huang,Sophia Ananiadou
Main category: cs.CL
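TL;DR: The paper introduces Agent Market Arena (AMA), the first benchmark for evaluating trading agents based on large language models (LLMs) in real time across multiple markets, filling a gap in existing research.
Details
Motivation: It remains unclear whether LLM-based trading agents can reason and adapt in live markets, and existing studies mostly test models rather than agents and have limited coverage. AMA aims to provide a fair, continuous evaluation framework.
Contribution: AMA integrates verified trading data, expert-checked news, and diverse agent architectures, enabling the first lifelong, real-time evaluation of LLM trading agents across multiple markets.
Method: AMA implements four agents (a single-agent baseline, agents with different risk styles, and an agent with memory-based reasoning) and evaluates them live on cryptocurrency and stock markets across several LLM backbones.
Result: Agent frameworks show markedly distinct behavioral patterns (from aggressive to conservative), whereas model backbones contribute less to outcome variation. AMA provides a reproducible basis for evaluating the financial reasoning of LLM agents.
Insight: Agent design plays the key role in trading behavior, while the choice of model matters comparatively little, which points to where future agent design and optimization should focus.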
Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.
[97] Demystifying Reinforcement Learning in Agentic Reasoning
Zhaochen Yu,Ling Yang,Jiaru Zou,Shuicheng Yan,Mengdi Wang
Main category: cs.CL
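TL;DR: Through a systematic study, the paper identifies key design principles for reinforcement learning in agentic reasoning and proposes practices along three axes (data, algorithm, and reasoning mode) that markedly improve the agentic reasoning of small models.
Details
Motivation: Agentic reinforcement learning (RL) has shown promise for improving the agentic reasoning of large language models (LLMs), but its key design principles and best practices remain unclear. The paper aims to fill this gap.
Contribution: (1) High-quality datasets of real end-to-end tool-use trajectories; (2) a summary of exploration-friendly RL techniques; (3) evidence that a strategy with fewer tool calls works better; (4) a demonstration that small models can outperform larger ones on agentic reasoning tasks.
Method: The study covers three perspectives: (1) data: replace stitched synthetic trajectories with real end-to-end tool-use trajectories; (2) algorithm: use exploration-friendly techniques (such as appropriate reward shaping and maintaining policy entropy); (3) reasoning mode: adopt a more efficient deliberative strategy.
Result: With these practices, 4B-sized models outperform 32B-sized models on agentic reasoning and achieve strong results on four challenging benchmarks (including AIME2024 and GPQA-Diamond).
Insight: High-quality datasets, exploration-friendly techniques, and an efficient reasoning mode are the keys to better agentic RL. With the right design, small models can match larger models in reasoning ability.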
Abstract: Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL: clip higher, overlong reward shaping, and maintaining adequate policy entropy can improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
[98] Are Large Reasoning Models Interruptible?
Tsung-Han Wu,Mihran Miroyan,David M. Chan,Trevor Darrell,Narges Norouzi,Joseph E. Gonzalez
Main category: cs.CL
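TL;DR: The paper examines the robustness of Large Reasoning Models (LRMs) in dynamic settings and shows that static evaluation overestimates performance: accuracy can drop sharply when the model is interrupted or its context changes.
Details
Motivation: Traditional evaluation assumes a static setting, but modern reasoning tasks such as assistive programming involve long reasoning times during which the context may change. The behavior of LRMs in such dynamic scenarios therefore needs to be tested.
Contribution: The paper challenges the static-evaluation assumption, introduces two dynamic scenarios (interruptions and dynamic context), and exposes the performance drops and novel failure modes that LRMs exhibit in them.
Method: Using mathematics and programming benchmarks, the study evaluates LRMs under interruptions and dynamic context and analyzes the causes of performance loss and the resulting failure modes.
Result: Static evaluation overestimates LRM robustness. In dynamic scenarios even state-of-the-art models can fail, with performance dropping by up to 60%, and they show novel failure modes such as reasoning leakage, panic, and self-doubt.
Insight: Interruptions and context changes significantly affect LRM performance, so future model design and evaluation should account for dynamic conditions to improve reliability in practice.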
Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
cs.CV [Back]
[99] TinyViT-Batten: Few-Shot Vision Transformer with Explainable Attention for Early Batten-Disease Detection on Pediatric MRI
Khartik Uppalapati,Bora Yimenicioglu,Shakeel Abdulkareem,Adan Eftekhari,Bhavya Uppalapati,Viraj Kamath
Main category: cs.CV
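TL;DR: TinyViT-Batten is a framework built on a small Vision Transformer (ViT) for detecting early Batten disease from limited pediatric brain MRI data. By distilling a large ViT and applying metric-based few-shot learning, the model achieves strong accuracy and explainability.
Details
Motivation: Batten disease is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and easily missed. Existing methods usually need large amounts of labeled data, which the rarity of the disease makes unavailable, so an efficient few-shot model is needed.
Contribution: 1. Proposes TinyViT-Batten, a small ViT with only 5M parameters; 2. Optimizes the model in the few-shot setting with metric learning (prototypical loss); 3. Uses Grad-CAM to make predictions explainable; 4. Outperforms 3D-ResNet and Swin-Tiny baselines on a multi-site dataset.
Method: 1. Distill a large ViT into a lightweight TinyViT; 2. Fine-tune with prototypical loss in 5-shot episodes; 3. Generate Grad-CAM maps that highlight the regions driving each prediction.
Result: On 79 Batten-disease MRIs and 90 controls, the model reaches about 91% accuracy, an area under the ROC curve of at least 0.95, sensitivity above 90%, and specificity of about 90%.
Insight: A lightweight model combined with few-shot learning can address data scarcity in rare diseases; Grad-CAM explanations help clinicians understand and trust the AI predictions.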
Abstract: Batten disease (neuronal ceroid lipofuscinosis) is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and often missed. We propose TinyViT-Batten, a few-shot Vision Transformer (ViT) framework to detect early Batten disease from pediatric brain MRI with limited training cases. We distill a large teacher ViT into a 5M-parameter TinyViT and fine-tune it using metric-based few-shot learning (prototypical loss with 5-shot episodes). Our model achieves high accuracy (approximately 91%) and area under ROC of at least 0.95 on a multi-site dataset of 79 genetically confirmed Batten-disease MRIs (27 CLN3 from the Hochstein natural-history study, 32 CLN2 from an international longitudinal cohort, 12 early-manifestation CLN2 cases reported by Cokal et al., and 8 public Radiopaedia scans) together with 90 age-matched controls, outperforming 3D-ResNet and Swin-Tiny baselines. We further integrate Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight disease-relevant brain regions, enabling explainable predictions. The model's small size and strong performance (sensitivity greater than 90%, specificity approximately 90%) demonstrate a practical AI solution for early Batten disease detection.
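The prototypical loss mentioned in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the 2-D embeddings, the class names, and the helper functions `prototypes`, `sq_dist`, and `prototypical_loss` are all hypothetical, and squared Euclidean distance is assumed as the metric. Each class prototype is the mean of its support embeddings, and the loss is the cross-entropy over a softmax of negative distances from the query to each prototype.

```python
import math

def prototypes(support):
    """Mean embedding per class. support maps label -> list of embeddings."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototypical_loss(query, true_label, protos):
    """Cross-entropy over a softmax of negative distances to the prototypes."""
    logits = {c: -sq_dist(query, p) for c, p in protos.items()}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[true_label]

# Toy 2-way, 5-shot episode with made-up 2-D embeddings.
support = {
    "batten": [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0], [0.9, 1.0], [1.1, 1.0]],
    "control": [[-1.0, -0.9], [-0.8, -1.1], [-1.2, -1.0], [-0.9, -1.0], [-1.1, -1.0]],
}
protos = prototypes(support)
loss_near = prototypical_loss([0.9, 1.0], "batten", protos)    # query near its prototype
loss_far = prototypical_loss([-0.9, -1.0], "batten", protos)   # query near the wrong prototype
```

A query close to its own class prototype yields a small loss, while a query close to the other prototype yields a large one; in training, the gradient of this loss would be back-propagated into the embedding network.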
[100] Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition
Ranjan Sapkota,Manoj Karkee
Main category: cs.CV
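TL;DR: This paper reviews the evolution of the Ultralytics YOLO family, covering the innovations, performance comparisons, and application scenarios of YOLO26, YOLO11, YOLOv8, and YOLOv5, and outlines future directions.
Details
Motivation: YOLO object detectors are widely used in computer vision and pattern recognition, but their evolution, performance differences, and open challenges lack a systematic summary. This paper aims to fill that gap.
Contribution: Provides a detailed review of the YOLO family, summarizes the key innovations of the latest releases such as YOLO26, and gives a quantitative comparison through benchmarks on the MS COCO dataset. It also discusses deployment strategies and real-world applications.
Method: Traces the evolution of the series by analyzing YOLO26's innovations (such as Distribution Focal Loss removal and NMS-free inference) alongside the modular design and task-assignment approaches of YOLO11, YOLOv8, and YOLOv5.
Result: YOLO26 performs strongly on MS COCO, with a better balance of speed and accuracy than YOLOv5, YOLOv8, and YOLO11. Comparisons with other advanced detectors such as RT-DETR also show it to be competitive.
Insight: The evolution of the YOLO series centers on balancing efficiency and accuracy; future directions include dense-scene optimization, hybrid CNN-Transformer architectures, and open-vocabulary detection.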
Abstract: This paper presents a comprehensive overview of the Ultralytics YOLO (You Only Look Once) family of object detectors, focusing on the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26, alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM. Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches.
[101] OmniSAT: Compact Action Token, Faster Auto Regression
Huaihai Lyu,Chaofan Chen,Senwei Xie,Pengwei Wang,Xiansheng Chen,Shanghang Zhang,Changsheng Xu
Main category: cs.CV
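TL;DR: The paper proposes OmniSAT, a compact and transferable action representation. By normalizing value ranges and temporal horizons and combining B-spline encoding with multi-stage residual quantization, it substantially shortens training sequences and lowers target entropy; a cross-embodiment learning strategy further improves performance.
Details
Motivation: In Vision-Language-Action (VLA) models, auto-regressive methods are efficient but suffer from the long, high-dimensional sequences induced by action chunks, and existing compression methods fall short in either reconstruction quality or compression efficiency.
Contribution: 1. Proposes OmniSAT, a compact action representation; 2. Achieves efficient compression through multi-stage residual quantization; 3. Adds a cross-embodiment learning strategy that uses heterogeneous data for supervision.
Method: 1. Normalize value ranges and temporal horizons and apply B-spline encoding; 2. Apply multi-stage residual quantization to the position, rotation, and gripper subspaces; 3. Pre-train on the Droid dataset; 4. Develop a cross-embodiment learning strategy.
Result: OmniSAT shortens training sequences by 6.8× and lowers target entropy. In real-robot and simulation experiments it achieves higher compression while preserving reconstruction quality, and it speeds up auto-regressive training convergence.
Insight: Through compact representation and cross-embodiment learning, OmniSAT models action sequences efficiently and offers a new option for large-scale pretraining.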
Abstract: Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches: diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression suffers from either poor reconstruction or inefficient compression. Motivated by this, we introduce an Omni Swift Action Tokenizer (OmniSAT), which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8× and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos. Across diverse real-robot and simulation experiments, OmniSAT achieves higher compression while preserving reconstruction quality, enabling faster AR training convergence and better model performance.
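To make the coarse-to-fine idea behind multi-stage residual quantization concrete, here is a minimal scalar sketch. It is an illustration under stated assumptions, not OmniSAT's tokenizer: the hand-picked codebooks, the scalar input, and the helper names `nearest`, `rq_encode`, and `rq_decode` are hypothetical, whereas the paper quantizes learned representations of the position, rotation, and gripper subspaces. Each stage quantizes whatever residual the previous stage left behind, so later tokens refine earlier ones.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to the scalar x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def rq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    tokens, residual = [], x
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual -= cb[idx]
    return tokens

def rq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    return sum(cb[i] for i, cb in zip(tokens, codebooks))

# Hypothetical coarse, medium, and fine codebooks (not learned).
codebooks = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]
x = 0.635
tokens = rq_encode(x, codebooks)
# Reconstruction error after using 1, 2, and 3 stages.
errors = [abs(rq_decode(tokens[:k], codebooks[:k]) - x) for k in (1, 2, 3)]
```

Three small tokens stand in for one continuous value, and the reconstruction error shrinks as more stages are used, which is the coarse-to-fine granularity the abstract describes.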
[102] Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Times Series
Zhengsen Xu,Yimin Zhu,Zack Dewis,Mabel Heffring,Motasem Alkayid,Saeid Taleghanidoozdoozan,Lincoln Linlin Xu
Main category: cs.CV
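TL;DR: The paper proposes KAMamba, a knowledge-aware Mamba method for joint change detection and classification on MODIS time series, improving accuracy through a knowledge-driven transition matrix and multi-task learning.
Details
Motivation: Mixed pixels, spatial-spectral-temporal information coupling, and background class heterogeneity make change detection in MODIS time series highly challenging, which calls for a new method.
Contribution: 1. Designs a knowledge-driven transition-matrix approach (KAT-loss); 2. Proposes a multi-task learning framework (PreC-loss, PostC-loss, and Chg-loss); 3. Develops spatial-spectral-temporal Mamba modules (SSTMamba); 4. Introduces a sparse and deformable Mamba backbone (SDMamba) to improve efficiency.
Method: 1. Use KAT-loss to exploit knowledge about class transitions; 2. Train the model jointly under the multi-task framework; 3. Use SSTMamba modules to disentangle spatial-spectral-temporal information; 4. Use SDMamba to reduce computational cost.
Result: On the MODIS dataset for Saskatchewan, Canada, average F1 for change detection improves by about 1.5-6%, and OA, AA, and Kappa for LULC classification improve by about 2%.
Insight: Combining knowledge-driven constraints with multi-task learning improves both change detection and classification, while the sparse and deformable design reduces the computational burden.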
Abstract: Although change detection using MODIS time series is critical for environmental monitoring, it is a highly challenging task due to key MODIS difficulties, e.g., mixed pixels, spatial-spectral-temporal information coupling effect, and background class heterogeneity. This paper presents a novel knowledge-aware Mamba (KAMamba) for enhanced MODIS change detection, with the following contributions. First, to leverage knowledge regarding class transitions, we design a novel knowledge-driven transition-matrix-guided approach, leading to a knowledge-aware transition loss (KAT-loss) that can enhance detection accuracies. Second, to improve model constraints, a multi-task learning approach is designed, where three losses, i.e., pre-change classification loss (PreC-loss), post-change classification loss (PostC-loss), and change detection loss (Chg-loss), are used to improve model learning. Third, to disentangle information coupling in MODIS time series, novel spatial-spectral-temporal Mamba (SSTMamba) modules are designed. Last, to improve Mamba model efficiency and reduce computational cost, a sparse and deformable Mamba (SDMamba) backbone is used in SSTMamba. On the MODIS time-series dataset for Saskatchewan, Canada, we evaluate the method on land-cover change detection and LULC classification; results show about 1.5-6% gains in average F1 for change detection over baselines, and about 2% improvements in OA, AA, and Kappa for LULC classification.
[103] Multi Camera Connected Vision System with Multi View Analytics: A Comprehensive Survey
Muhammad Munsif,Waqas Ahmad,Amjid Ali,Mohib Ullah,Adnan Hussain,Sung Wook Baik
Main category: cs.CV
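TL;DR: This survey is the first to unify multi-view multi-camera tracking, re-identification, and action understanding in a single framework; it proposes a new taxonomy, summarizes the state of the art and the research challenges, and outlines future research directions.
Details
Motivation: Most existing work focuses on a single task or on single-view systems and overlooks the potential of multi-camera collaboration and multi-view data analysis. This paper aims to fill that gap and give a comprehensive view of connected vision systems.
Contribution: Proposes a new taxonomy that divides CVS into four key parts; systematically organizes and summarizes the relevant datasets, methods, and evaluation metrics; identifies open research questions and challenges and discusses directions for future technical development.
Method: A survey that classifies and analyzes multi-view multi-camera tracking, re-identification, and action understanding in detail, drawing on recent research and datasets.
Result: Summarizes current results and performance metrics and shows the potential of multi-camera collaboration for improving system performance.
Insight: Multi-camera collaboration and multi-view data analysis are key to making connected vision systems more robust and adaptable; future research should address emerging topics such as privacy protection and federated learning.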
Abstract: Connected Vision Systems (CVS) are transforming a variety of applications, including autonomous vehicles, smart cities, surveillance, and human-robot interaction. These systems harness multi-view multi-camera (MVMC) data to provide enhanced situational awareness through the integration of MVMC tracking, re-identification (Re-ID), and action understanding (AU). However, deploying CVS in real-world, dynamic environments presents a number of challenges, particularly in addressing occlusions, diverse viewpoints, and environmental variability. Existing surveys have focused primarily on isolated tasks such as tracking, Re-ID, and AU, often neglecting their integration into a cohesive system. These reviews typically emphasize single-view setups, overlooking the complexities and opportunities provided by multi-camera collaboration and multi-view data analysis. To the best of our knowledge, this survey is the first to offer a comprehensive and integrated review of MVMC that unifies MVMC tracking, Re-ID, and AU into a single framework. We propose a unique taxonomy to better understand the critical components of CVS, dividing it into four key parts: MVMC tracking, Re-ID, AU, and combined methods. We systematically arrange and summarize the state-of-the-art datasets, methodologies, results, and evaluation metrics, providing a structured view of the field’s progression. Furthermore, we identify and discuss the open research questions and challenges, along with emerging technologies such as lifelong learning, privacy, and federated learning, that need to be addressed for future advancements. The paper concludes by outlining key research directions for enhancing the robustness, efficiency, and adaptability of CVS in complex, real-world applications. We hope this survey will inspire innovative solutions and guide future research toward the next generation of intelligent and adaptive CVS.
[104] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
Dwip Dalal,Gautam Vashishtha,Utkarsh Mishra,Jeonghwan Kim,Madhav Kanda,Hyeonjeong Ha,Svetlana Lazebnik,Heng Ji,Unnat Jain
Main category: cs.CV
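TL;DR: AttWarp is a lightweight method that applies attention-guided image warping at test time to give MLLMs more resolution on query-relevant regions, preserving global context while improving fine-grained perception accuracy.
Details
Motivation: MLLMs tend to miss small details and spatial relations in cluttered scenes, which causes errors in fine-grained perception tasks. AttWarp aims to improve performance by reallocating resolution through attention-guided image warping.
Contribution: Proposes AttWarp, which uses an MLLM's cross-modal attention to warp the input image dynamically, improving the model's capture of fine-grained information without changing model weights or architecture.
Method: Derives an attention map from the MLLM's cross-modal attention and uses it to guide a rectilinear warp of the image, concentrating resolution on the regions the model deems important while preserving global context.
Result: Across five benchmarks and four MLLMs, AttWarp consistently improves accuracy and compositional reasoning and reduces hallucinations, outperforming four baselines.
Insight: Attention-guided warping reallocates information effectively: the same MLLMs perform better on warped inputs, which underlines the value of dynamic resolution allocation.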
Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM’s cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
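One way to picture rectilinear, attention-guided warping is to resample each image axis independently so that high-attention rows and columns receive more output pixels. The sketch below handles a single axis and rests on assumptions the abstract does not confirm: it inverts the cumulative distribution of a marginal attention profile, and the function name `warp_coords`, the smoothing term `eps`, and the toy profile are hypothetical. Applying the same mapping to rows and to columns would give an axis-aligned warp.

```python
def warp_coords(marginal, out_size, eps=0.1):
    """Map out_size evenly spaced output positions to source coordinates in
    [0, len(marginal)] so that high-attention cells receive more samples.
    eps keeps every cell from collapsing to zero width."""
    density = [m + eps for m in marginal]
    total = sum(density)
    cdf = [0.0]
    for d in density:
        cdf.append(cdf[-1] + d / total)
    coords, j = [], 0
    for k in range(out_size):
        u = (k + 0.5) / out_size  # centre of output pixel k in [0, 1]
        while j + 1 < len(density) and cdf[j + 1] < u:
            j += 1
        frac = (u - cdf[j]) / (cdf[j + 1] - cdf[j])
        coords.append(j + frac)  # fractional source position
    return coords

# Toy attention profile along one axis: all mass on the third of four cells.
marginal = [0.0, 0.0, 1.0, 0.0]
coords = warp_coords(marginal, out_size=8)
in_focus = sum(1 for c in coords if 2.0 <= c < 3.0)
```

With this toy profile, six of the eight output samples land inside the attended cell, while the remaining samples still cover the rest of the axis in order, so the global layout is preserved. A full implementation would interpolate pixel values at the returned coordinates.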
[105] Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning
Yufei Wang,Adriana Kovashka,Loretta Fernández,Marc N. Coutanche,Seth Wiener
Main category: cs.CV
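TL;DR: The paper studies a new setting for foreign language learning in a multimodal context, analyzes which image and text features help human participants infer the meaning of unfamiliar words, and examines how well AI systems can reason about participant performance.
Details
Motivation: In foreign language learning, learners often infer the meaning of unfamiliar words from multimodal context such as an image and a sentence. Which features support this inference is unclear, and the ability of AI systems in this area still needs improvement.
Contribution: 1) Analyzes, through human studies, how features of multimodal data relate to inference success; 2) Explores the ability of AI systems to predict human performance and directions for improving it.
Method: Runs human studies with different image-text pairs, records how well participants infer unfamiliar words, and analyzes how data features (such as visual and textual cues) correlate with success. The reasoning ability of AI systems is tested as well.
Result: Only some intuitive features correlate strongly with participant performance, so predictive features need further study. AI systems show room for improvement in reasoning about human performance.
Insight: In multimodal learning, relying on intuitive features alone may be insufficient, and more effective features need to be identified. AI systems are not yet reliable on this task but could improve with better reasoning ability.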
Abstract: We investigate a new setting for foreign language learning, where learners infer the meaning of unfamiliar words in a multimodal context of a sentence describing a paired image. We conduct studies with human participants using different image-text pairs. We analyze the features of the data (i.e., images and texts) that make it easier for participants to infer the meaning of a masked or unfamiliar word, and what language backgrounds of the participants correlate with success. We find that only some intuitive features have strong correlations with participant performance, prompting the need for further investigation of predictive features for success in these tasks. We also analyze the ability of AI systems to reason about participant performance, and discover promising future directions for improving this reasoning ability.
[106] Task-Aware Resolution Optimization for Visual Large Language Models
Weiqing Luo,Zhen Tan,Yifan Li,Xinyu Zhao,Kwonjoon Lee,Behzad Dariush,Tianlong Chen
Main category: cs.CV
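TL;DR: The paper proposes a task-aware resolution optimization method that improves the performance of visual large language models (VLLMs) across tasks and validates it experimentally.
Details
Motivation: Existing visual large language models such as LLaVA usually assume a fixed resolution for all downstream tasks, which leads to subpar performance. The study finds that the resolution preferred by a task correlates with image complexity and model uncertainty, which motivates a task-aware resolution optimization method.
Contribution: 1) The first comprehensive study of the resolution preferences of vision-language tasks, revealing their correlation with image complexity and model uncertainty; 2) An empirical formula for determining the optimal resolution; 3) A parameter-efficient fine-tuning technique that extends the visual input resolution of pre-trained VLLMs to the optimal resolution.
Method: 1) Analyze how resolution preference relates to image complexity and model uncertainty; 2) Design an empirical formula to compute the optimal resolution; 3) Develop a parameter-efficient fine-tuning technique to extend the input resolution of VLLMs.
Result: Experiments on a range of vision-language tasks show that the method improves VLLM performance.
Insight: Different tasks need different resolutions; combining image complexity and model uncertainty makes it possible to determine the optimal resolution more efficiently and so improve model performance.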
Abstract: Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing that resolution preferences correlate with image complexity and with the uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula to determine the optimal resolution for a given vision-language task, combining these two factors. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.
[107] Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation
Zhi Chen,Xin Yu,Xiaohui Tao,Yan Li,Zi Huang
Main category: cs.CV
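TL;DR: This paper proposes the CAPEL framework, which uses cluster-aware prompt ensemble learning to improve few-shot adaptation of vision-language models and avoids the class-centroid shift caused by conventional ensembling.
Details
Motivation: Existing vision-language models such as CLIP achieve zero-shot transfer with ensembles of context prompts, but conventional prompt ensembling averages features, which shifts class centroids away from the true distribution and hurts performance.
Contribution: 1. Proposes the CAPEL framework, which ensembles prompts in the classification logit space rather than the feature space; 2. Introduces a cluster-preserving regularization term that keeps prompts discriminative; 3. Proposes an adaptive prompt weighting technique that dynamically adjusts the weights of flawed or ambiguous prompts.
Method: 1. Classify images into sub-class clusters, each represented by a distinct prompt; 2. Ensemble prompts in the classification logit space; 3. Optimize prompt fine-tuning with cluster regularization and adaptive weighting.
Result: CAPEL aligns better with the visual feature distribution and is robust across datasets and tasks.
Insight: In vision-language models, preserving the cluster structure of prompts is more effective than simple feature averaging, and dynamically adjusting prompt weights improves performance further.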
Abstract: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
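The centroid-shift argument can be seen in a toy example that compares feature-space averaging with logit-space ensembling. This is a sketch under assumptions, not CAPEL itself: the 2-D unit-vector prompt embeddings, the class names, and the log-sum-exp aggregation with a fixed `scale` are hypothetical choices, since the abstract does not specify how per-prompt logits are combined.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_avg_score(image, prompts):
    """Conventional ensembling: average prompt features, then compare."""
    mean = [sum(p[i] for p in prompts) / len(prompts) for i in range(len(image))]
    return dot(image, normalize(mean))

def logit_ensemble_score(image, prompts, scale=10.0):
    """Score each prompt separately, then aggregate the logits.
    Log-sum-exp is an assumed aggregation, not confirmed by the source."""
    logits = [scale * dot(image, p) for p in prompts]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))

# Hypothetical unit-norm prompt embeddings: "dog" has two distinct clusters.
dog_prompts = [[1.0, 0.0], [0.0, 1.0]]
cat_prompts = [[0.6, 0.8], [0.8, 0.6]]
image = [1.0, 0.0]  # lies exactly on the first dog cluster

avg_margin = feature_avg_score(image, dog_prompts) - feature_avg_score(image, cat_prompts)
logit_margin = logit_ensemble_score(image, dog_prompts) - logit_ensemble_score(image, cat_prompts)
```

Averaging collapses both classes onto the same diagonal direction, so the averaged-feature margin is essentially zero even though the image sits on a dog cluster; keeping the prompts separate and combining their logits recovers a clear positive margin for the correct class.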
[108] CHUG: Crowdsourced User-Generated HDR Video Quality Dataset
Shreshth Saini,Alan C. Bovik,Neil Birkbeck,Yilin Wang,Balu Adsumilli
Main category: cs.CV
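TL;DR: CHUG is the first large-scale user-generated HDR video quality dataset, filling the gap left by existing PGC-HDR datasets. It comprises 856 source videos and 5,992 videos in total after transcoding, with more than 210,000 subjective ratings, and supports research on no-reference HDR video quality assessment.
Details
Motivation: Existing HDR-VQA datasets mainly target professionally generated content (PGC), while the challenges of user-generated HDR content (UGC-HDR), such as diverse capture conditions and compression distortions, remain under-studied. A more realistic evaluation benchmark is needed.
Contribution: 1) CHUG, the first large-scale subjective quality dataset for UGC-HDR; 2) 856 source videos plus versions transcoded at multiple resolutions and bitrates, 5,992 videos in total; 3) 211,848 subjective ratings collected by crowdsourcing; 4) A public dataset that advances NR-HDR-VQA research.
Method: 1) Collect 856 UGC-HDR source videos; 2) Transcode them at multiple resolutions and bitrates to simulate real-world scenarios; 3) Run a large-scale subjective rating study on Amazon Mechanical Turk.
Result: CHUG is the first benchmark targeting UGC-HDR and supports analysis of UGC-specific distortions, providing data for no-reference quality assessment.
Insight: Quality assessment for UGC-HDR must account for diverse scenes and complex distortions; the diversity of CHUG fills a research gap and can drive new NR-HDR-VQA algorithms.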
Abstract: High Dynamic Range (HDR) videos enhance visual experiences with superior brightness, contrast, and color depth. The surge of User-Generated Content (UGC) on platforms like YouTube and TikTok introduces unique challenges for HDR video quality assessment (VQA) due to diverse capture conditions, editing artifacts, and compression distortions. Existing HDR-VQA datasets primarily focus on professionally generated content (PGC), leaving a gap in understanding real-world UGC-HDR degradations. To address this, we introduce CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, the first large-scale subjective study on UGC-HDR quality. CHUG comprises 856 UGC-HDR source videos, transcoded across multiple resolutions and bitrates to simulate real-world scenarios, totaling 5,992 videos. A large-scale study via Amazon Mechanical Turk collected 211,848 perceptual ratings. CHUG provides a benchmark for analyzing UGC-specific distortions in HDR videos. We anticipate CHUG will advance No-Reference (NR) HDR-VQA research by offering a large-scale, diverse, and real-world UGC dataset. The dataset is publicly available at: https://shreshthsaini.github.io/CHUG/.
[109] SpectralCA: Bi-Directional Cross-Attention for Next-Generation UAV Hyperspectral Vision
D. V. Brovko
Main category: cs.CV
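TL;DR: The paper proposes SpectralCA, a bi-directional cross-attention mechanism for UAV hyperspectral vision. By modifying the Mobile 3D Vision Transformer (MDvT) to fuse spectral and spatial features, it improves UAV perception efficiency for navigation, object detection, and terrain classification while reducing parameters and inference time.
Details
Motivation: Demand is growing for UAVs that operate reliably in complex environments with interference, poor visibility, or camouflage. Hyperspectral imaging (HSI), with its fine-grained material recognition and object differentiation, is seen as key to improving UAV computer vision.
Contribution: Proposes the SpectralCA module, which fuses spectral and spatial features through bi-directional cross-attention; designs a hybrid 2D/3D convolutional architecture that improves perception efficiency and enables real-time operation.
Method: Modifies MDvT by introducing the SpectralCA block, which uses bi-directional cross-attention to fuse spectral and spatial features; experiments are run on the WHU-Hi-HongHu dataset.
Result: Evaluated by Overall Accuracy, Average Accuracy, and the Kappa coefficient, the architecture performs well and improves the efficiency of UAV perception tasks.
Insight: Fusing spectral and spatial features is key to performance in hyperspectral vision tasks, and bi-directional cross-attention is well suited to them.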
Abstract: The relevance of this research lies in the growing demand for unmanned aerial vehicles (UAVs) capable of operating reliably in complex environments where conventional navigation becomes unreliable due to interference, poor visibility, or camouflage. Hyperspectral imaging (HSI) provides unique opportunities for UAV-based computer vision by enabling fine-grained material recognition and object differentiation, which are critical for navigation, surveillance, agriculture, and environmental monitoring. The aim of this work is to develop a deep learning architecture integrating HSI into UAV perception for navigation, object detection, and terrain classification. Objectives include: reviewing existing HSI methods, designing a hybrid 2D/3D convolutional architecture with spectral-spatial cross-attention, training, and benchmarking. The methodology is based on the modification of the Mobile 3D Vision Transformer (MDvT) by introducing the proposed SpectralCA block. This block employs bi-directional cross-attention to fuse spectral and spatial features, enhancing accuracy while reducing parameters and inference time. Experimental evaluation was conducted on the WHU-Hi-HongHu dataset, with results assessed using Overall Accuracy, Average Accuracy, and the Kappa coefficient. The findings confirm that the proposed architecture improves UAV perception efficiency, enabling real-time operation for navigation, object recognition, and environmental monitoring tasks. Keywords: SpectralCA, deep learning, computer vision, hyperspectral imaging, unmanned aerial vehicle, object detection, semi-supervised learning.
[110] HeadsUp! High-Fidelity Portrait Image Super-Resolution
Renjie Li,Zihao Zhu,Xiaoyu Wang,Zhengzhong Tu
Main category: cs.CV
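TL;DR: The HeadsUp! paper proposes a single-step diffusion model for high-quality portrait image super-resolution, addressing the boundary artifacts that existing approaches introduce when they blend different models to process portrait photos.
Details
Motivation: Existing image super-resolution techniques focus either on generic real-world images or on strictly aligned facial images (face super-resolution). Handling portrait photos requires blending different models, which introduces boundary artifacts. Human perception is especially sensitive to facial fidelity, so a method that restores and upscales seamlessly is needed.
Contribution: 1. Proposes HeadsUp, a single-step diffusion model that restores and upscales portraits seamlessly and end to end; 2. Designs a face supervision mechanism that guides the model to focus on the facial region; 3. Introduces a reference-based mechanism that reduces identity ambiguity when restoring low-quality faces; 4. Builds PortraitSR-4K, a high-quality 4K portrait super-resolution dataset.
Method: Builds on a single-step diffusion model and adds a face supervision mechanism and a reference-based identity restoration mechanism, so that the model focuses on the facial region and reduces identity ambiguity.
Result: HeadsUp achieves state-of-the-art performance on the PortraitISR task while performing comparably or better on general image and aligned face datasets.
Insight: Processing portraits end to end with a single model avoids the boundary artifacts of blended models, and combining face supervision with a reference mechanism improves facial fidelity and identity consistency.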
Abstract: Portrait pictures, which typically feature both human subjects and natural backgrounds, are one of the most prevalent forms of photography on social media. Existing image super-resolution (ISR) techniques generally focus either on generic real-world images or strictly aligned facial images (i.e., face super-resolution). In practice, separate models are blended to handle portrait photos: the face specialist model handles the face region, and the general model processes the rest. However, these blending approaches inevitably introduce blending or boundary artifacts around the facial regions due to different model training recipes, while human perception is particularly sensitive to facial fidelity. To overcome these limitations, we study the portrait image super-resolution (PortraitISR) problem, and propose HeadsUp, a single-step diffusion model that is capable of seamlessly restoring and upscaling portrait images in an end-to-end manner. Specifically, we build our model on top of a single-step diffusion model and develop a face supervision mechanism to guide the model in focusing on the facial region. We then integrate a reference-based mechanism to help with identity restoration, reducing face ambiguity in low-quality face restoration. Additionally, we have built a high-quality 4K portrait image ISR dataset dubbed PortraitSR-4K, to support model training and benchmarking for portrait images. Extensive experiments show that HeadsUp achieves state-of-the-art performance on the PortraitISR task while maintaining comparable or higher performance on both general image and aligned face datasets.
[111] Denoising Diffusion as a New Framework for Underwater Images
Nilesh Jain,Elie Alhajjar
Main category: cs.CV
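TL;DR: This paper proposes a new framework based on denoising diffusion models to improve the quality and diversity of underwater images, and uses Controlnet to further enhance datasets, in order to overcome the limitations of existing underwater image datasets.
Details
Motivation: Underwater images are essential for ocean research and environmental monitoring, but existing images are of poor quality (low visibility, blur, color distortion, and noise), and the datasets they rely on lack diversity and high-quality samples. Existing methods generalize poorly, so new approaches are needed.
Contribution: 1) Proposes using a denoising diffusion model to expand underwater image datasets and increase diversity (for example stereo, wide-angle, and macro images); 2) Introduces Controlnet to evaluate and raise dataset quality in support of marine ecosystem research.
Method: 1) Generate diverse underwater images with a denoising diffusion model; 2) Evaluate and enhance image quality with Controlnet.
Result: The approach can produce richer underwater image datasets with higher image quality, supporting more accurate analysis of marine ecosystems.
Insight: Denoising diffusion models are useful beyond conventional image generation: they can address data scarcity and quality problems in specific domains such as underwater imaging and suggest an approach for similar problems.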
Abstract: Underwater images play a crucial role in ocean research and marine environmental monitoring since they provide quality information about the ecosystem. However, the complex and remote nature of the environment results in poor image quality with issues such as low visibility, blurry textures, color distortion, and noise. In recent years, research in image enhancement has proven to be effective but also presents its own limitations, like poor generalization and heavy reliance on clean datasets. One of the challenges herein is the lack of diversity and the low quality of images included in these datasets. Also, most existing datasets consist only of monocular images, a fact that limits the representation of different lighting conditions and angles. In this paper, we propose a new plan of action to overcome these limitations. On one hand, we call for expanding the datasets using a denoising diffusion model to include a variety of image types such as stereo, wide-angled, macro, and close-up images. On the other hand, we recommend enhancing the images using Controlnet to evaluate and increase the quality of the corresponding datasets, and hence improve the study of the marine ecosystem. Tags - Underwater Images, Denoising Diffusion, Marine ecosystem, Controlnet
[112] Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making
Fan Zuo,Donglin Zhou,Jingqin Gao,Kaan Ozbay
Main category: cs.CV
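TL;DR: The paper proposes an end-to-end framework based on AI and language models that uses existing traffic camera infrastructure for large-scale traffic monitoring. A fine-tuned YOLOv11 model and a novel viewpoint normalization method enable high-resolution, real-time traffic analysis, and a domain-specific large language model generates automated summaries of traffic patterns.
Details
Motivation: Traffic monitoring faces high costs, dynamic camera viewpoints, and massive data volumes, and needs an efficient solution with minimal human intervention.
Contribution: 1. An end-to-end AI-based framework; 2. A graph-based viewpoint normalization method; 3. Automated data summarization with a domain-specific large language model.
Method: 1. Use a fine-tuned YOLOv11 to extract traffic density and classification metrics in real time; 2. Handle dynamic camera viewpoints with a graph-based method; 3. Use an LLM to process large-scale video data and generate summaries.
Result: During the early rollout of New York City's congestion pricing, the system measured a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions followed by signs of rebound, and increased pedestrian and cyclist activity.
Insight: Example-based prompts improve the numerical accuracy of the LLM and reduce hallucinations; the system offers a practical solution for large-scale, policy-relevant traffic monitoring.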
Abstract: Accurate, scalable traffic monitoring is critical for real-time and long-term transportation management, particularly during disruptions such as natural disasters, large construction projects, or major policy changes like New York City’s first-in-the-nation congestion pricing program. However, widespread sensor deployment remains limited due to high installation, maintenance, and data management costs. While traffic cameras offer a cost-effective alternative, existing video analytics struggle with dynamic camera viewpoints and massive data volumes from large camera networks. This study presents an end-to-end AI-based framework leveraging existing traffic camera infrastructure for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. To address inconsistencies from non-stationary pan-tilt-zoom cameras, we introduce a novel graph-based viewpoint normalization method. A domain-specific large language model was also integrated to process massive data from a 24/7 video stream to generate frequent, automated summaries of evolving traffic patterns, a task far exceeding manual capabilities. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025. Results show a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales. Experiments showed that example-based prompts improved LLM’s numerical accuracy and reduced hallucinations. These findings demonstrate the framework’s potential as a practical, infrastructure-ready solution for large-scale, policy-relevant traffic monitoring with minimal human intervention.
[113] FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering
Lishen Qu,Zhihao Liu,Jinshan Pan,Shihao Zhou,Jinglei Shi,Duosheng Chen,Jufeng Yang
Main category: cs.CV
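TL;DR: The paper proposes FlareX, a physics-informed dataset built with a three-stage method (parameterized template creation, illumination-aware 2D synthesis, and physics-engine-based 3D rendering) that produces diverse and realistic lens flare data.
Details
Motivation: Existing datasets are usually 2D data synthesized by overlaying artificial flare templates on background images. They lack flare diversity and ignore physical principles, so models trained on them generalize poorly to real scenes.
Contribution: Proposes the FlareX dataset, which combines 2D and 3D perspectives with 9,500 2D templates and 3,000 flare image pairs rendered in 3D, and a masking approach that obtains flare-free images from real photos to evaluate model performance.
Method: 1) Parameterized template creation; 2) Illumination-aware 2D synthesis; 3) Physics-engine-based 3D rendering.
Result: Experiments show that the FlareX dataset improves model generalization to real-world images.
Insight: Combining physical principles with multi-perspective data generation addresses the data diversity problem in lens flare removal.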
Abstract: Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, 2D synthesis that respects the laws of illumination, and physics-engine-based 3D rendering. The result is a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.
[114] BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes
Lishen Qu,Zhihao Liu,Shihao Zhou,Yaqi Luo,Jie Liang,Hui Zeng,Lei Zhang,Jufeng Yang
Main category: cs.CV
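TL;DR: This paper presents the BurstDeflicker dataset for removing flicker in dynamic scenes, which arises from the interaction between rolling shutter cameras and AC-powered lighting. It builds a large-scale, diverse benchmark using three strategies: synthetic data, real captures, and a green-screen method.
Details
Motivation: Flicker artifacts are caused by the interaction between the row-wise exposure of rolling shutter cameras and the time-varying brightness of AC-powered lighting, and appear as dark bands in the image. They degrade image quality and also interfere with high-level vision tasks such as object detection and tracking. The lack of a large-scale, realistic dataset has held back research on the problem.
Contribution: The main contribution is BurstDeflicker, a large-scale benchmark dataset that combines synthetic data, real captures, and a green-screen method for studying flicker removal in dynamic scenes.
Method: 1. A Retinex-based synthesis pipeline that generates diverse flicker patterns in a controllable way; 2. 4,000 captured real-world flicker images; 3. A green-screen method that preserves real flicker degradation in dynamic scenes.
Result: Experiments validate the dataset and show its potential to advance flicker removal research.
Insight: Flicker is a challenge for image quality and for high-level vision tasks alike. Collecting data with multiple strategies helps models capture the spatial and temporal characteristics of flicker and generalize better.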
Abstract: Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.
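The banding mechanism described in this abstract can be illustrated with a toy model. The sketch below is not the paper's Retinex-based pipeline; it assumes a lamp whose intensity varies as a cosine at twice the mains frequency, and a rolling shutter whose rows start exposing at evenly spaced times. The function names and all parameter values (`readout_time`, `exposure`, `depth`) are hypothetical choices for illustration.

```python
import math


def row_brightness(row, rows, readout_time, exposure, mains_hz=50.0, depth=0.5):
    """Mean light collected by one sensor row (toy model).

    Assumed lamp intensity: I(t) = 1 + depth * cos(2*pi*(2*mains_hz)*t).
    """
    flicker_hz = 2.0 * mains_hz              # assumed: intensity peaks twice per mains cycle
    omega = 2.0 * math.pi * flicker_hz
    start = readout_time * row / rows        # rolling shutter: later rows start later
    end = start + exposure
    # Closed-form integral of I(t) over [start, end], averaged over the exposure.
    integral = exposure + depth * (math.sin(omega * end) - math.sin(omega * start)) / omega
    return integral / exposure


def banding_profile(rows=480, readout_time=1 / 30, exposure=1 / 2000, **light):
    """Per-row brightness of one frame; dips in this list are the dark bands."""
    return [row_brightness(r, rows, readout_time, exposure, **light) for r in range(rows)]
```

In this toy model a short exposure such as 1/2000 s produces strong row-to-row variation, while an exposure equal to one full flicker period (1/100 s for 50 Hz mains) integrates the variation away and yields a flat profile.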
[115] MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output
Yanyuan Chen,Dexuan Xu,Yu Huang,Songkun Zhan,Hanpin Wang,Dongxue Chen,Xueping Wang,Meikang Qiu,Hang Li
Main category: cs.CV
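TL;DR: This paper proposes MIMO, a unified medical vision language model that addresses the input and output limitations of existing models through visual referring multimodal input and pixel grounding multimodal output.
Details
Motivation: Existing medical vision language models rely only on text instructions as input and text answers as output; they lack direct understanding of visual clues in the image and any connection to its key regions.
Contribution: 1) Proposes the MIMO model, which supports visual referring input and pixel grounding output; 2) Builds the MIMOSeg dataset of 895K samples, covering tasks with multimodal input and output.
Method: MIMO combines visual and textual clues to understand medical images and can ground the medical terms in its textual output within the image; the model is trained on the MIMOSeg dataset.
Result: Experiments verify that MIMO performs well on several downstream medical multimodal tasks and offers both visual referring and pixel grounding capabilities.
Insight: Combining visual referring with pixel grounding can markedly improve medical vision language models, and building a multimodal dataset is key.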
Abstract: Currently, medical vision language models are widely used in medical vision question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.
[116] Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Junan Chen,Trung Thanh Nguyen,Takahiro Komamizu,Ichiro Ide
Main category: cs.CV
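TL;DR: This paper proposes Q-Adapter, a lightweight visual adapter module for efficiently fine-tuning multimodal large language models (MLLMs) on video captioning. It reaches performance competitive with full fine-tuning while requiring only 1.4% of the parameters.
Details
Motivation: As models grow, full fine-tuning becomes computationally expensive. Existing parameter-efficient fine-tuning (PEFT) methods focus mainly on the language components, while the handling of visual information in multimodal tasks remains underexplored.
Contribution: Q-Adapter introduces learnable query tokens and a gating layer that extract sparse, caption-relevant features without relying on external textual supervision, balancing performance and parameter efficiency.
Method: Q-Adapter adds query tokens and a gating layer to the Vision Encoder to focus on extracting caption-relevant visual features.
Result: On the MSR-VTT and MSVD datasets, Q-Adapter achieves state-of-the-art results among PEFT methods on BLEU@4, METEOR, ROUGE-L, and CIDEr.
Insight: Q-Adapter's design offers a scalable optimization strategy for video-language modeling and shows the potential for balancing performance against efficiency.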
Abstract: Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard “pre-training followed by fine-tuning” paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and lacks sufficient understanding of visual information during fine-tuning the model. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into Vision Encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among the methods that take the PEFT approach across BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to methods that take the full fine-tuning approach while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning. These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.
[117] P-4DGS: Predictive 4D Gaussian Splatting with 90× Compression
Henan Wang,Hanxin Zhu,Xinliang Gong,Tianyu He,Xin Li,Zhibo Chen
Main category: cs.CV
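TL;DR: P-4DGS is a prediction-based 4D Gaussian Splatting method that exploits spatial-temporal correlations and an adaptive quantization strategy to sharply reduce the storage cost of dynamic scenes, achieving up to 90× compression.
Details
Motivation: In dynamic 3D scene reconstruction (4D reconstruction), existing methods overlook temporal and spatial redundancy, which leads to heavy memory consumption. P-4DGS aims to solve this with an efficient compressed representation.
Contribution: 1. Proposes a 3D anchor point prediction module that exploits spatial-temporal correlations; 2. Combines adaptive quantization with entropy coding for efficient compression; 3. Validates up to 90× compression and the fastest rendering speed on synthetic and real-world datasets.
Method: 1. Design a 3D anchor point-based spatial-temporal prediction module; 2. Apply adaptive quantization and context-based entropy coding; 3. Optimize the dynamic Gaussian Splatting representation.
Result: P-4DGS achieves up to 40× and 90× compression on synthetic and real-world scenes, respectively, with a storage footprint of only about 1MB, while maintaining state-of-the-art reconstruction quality and the fastest rendering speed.
Insight: Video compression techniques can be transferred to 4D Gaussian Splatting, effectively addressing the storage problem of dynamic scenes and offering a new direction for real-time dynamic reconstruction.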
Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance, especially for dynamic 3D scene reconstruction (i.e., 4D reconstruction). However, despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes, leading to prohibitive memory consumption. To address this, we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling. Inspired by intra- and inter-frame prediction techniques commonly used in video compression, we first design a 3D anchor point-based spatial-temporal prediction module to fully exploit the spatial-temporal correlations across different 3D Gaussian primitives. Subsequently, we employ an adaptive quantization strategy combined with context-based entropy coding to further reduce the size of the 3D anchor points, thereby achieving enhanced compression efficiency. To evaluate the rate-distortion performance of our proposed P-4DGS in comparison with other dynamic 3DGS representations, we conduct extensive experiments on both synthetic and real-world datasets. Experimental results demonstrate that our approach achieves state-of-the-art reconstruction quality and the fastest rendering speed, with a remarkably low storage footprint (around 1MB on average), achieving up to 40× and 90× compression on synthetic and real-world scenes, respectively.
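The reason prediction helps compression can be shown with a generic predictive-coding sketch. This is not the P-4DGS module: it assumes a fixed uniform quantization step rather than an adaptive one, and it uses the empirical Shannon entropy of the quantized symbols as a stand-in for the cost under an ideal entropy coder. All function names are hypothetical.

```python
import math
from collections import Counter


def quantize(values, step):
    """Uniform scalar quantization to integer symbols (fixed step, an assumption)."""
    return [round(v / step) for v in values]


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def predictive_cost(values, prediction, step):
    """Bits per symbol when coding raw values versus prediction residuals."""
    raw = quantize(values, step)
    residual = quantize([v - p for v, p in zip(values, prediction)], step)
    return entropy_bits(raw), entropy_bits(residual)
```

When the prediction is close to the true values, the residuals cluster around zero, so their symbol distribution is far more peaked than that of the raw values and costs fewer bits to code.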
[118] Complementary and Contrastive Learning for Audio-Visual Segmentation
Sitong Gong,Yunzhi Zhuge,Lu Zhang,Pingping Zhang,Huchuan Lu
Main category: cs.CV
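TL;DR: This paper proposes the Complementary and Contrastive Transformer (CCFormer), a novel framework that merges multi-scale visual features with audio data and uses a parallel bilateral architecture and contrastive learning to strengthen cross-modal complementarity, improving audio-visual segmentation performance.
Details
Motivation: Existing audio-visual segmentation methods fall short in handling local and global information, spatial-temporal dynamics, and cross-modal alignment, which limits segmentation accuracy and robustness.
Contribution: 1. Proposes the CCFormer framework, which combines a parallel bilateral architecture with a Multi-query Transformer Module to strengthen cross-modal complementarity and capture spatial-temporal dynamics; 2. Introduces Bi-modal Contrastive Learning (BCL) to promote alignment between modalities; 3. Achieves state-of-the-art performance on multiple datasets.
Method: 1. The Early Integration Module (EIM) merges multi-scale visual features with audio data; 2. The Multi-query Transformer Module (MTM) dynamically learns audio queries and models frame-level and video-level relations; 3. Bi-modal Contrastive Learning (BCL) aligns the features of the two modalities.
Result: CCFormer sets new state-of-the-art results on the S4, MS3, and AVSS datasets.
Insight: Combining a parallel bilateral architecture with contrastive learning effectively improves performance on cross-modal tasks, particularly in capturing spatial-temporal dynamics and aligning modalities.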
Abstract: Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs’ limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer
[119] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
Kaitao Chen,Shaohao Rui,Yankai Jiang,Jiamin Wu,Qihao Zheng,Chunfeng Song,Xiaosong Wang,Mu Zhou,Mianxin Liu
Main category: cs.CV
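TL;DR: ViTAR is a novel medical vision-language model (VLM) that emulates the iterative reasoning process of human experts (“think-act-rethink-answer”), significantly improving the accuracy and trustworthiness of medical image diagnosis.
Details
Motivation: Existing medical VLMs typically rely on single-pass reasoning and neglect localized visual cues, whereas human experts iteratively focus on and refine regions of interest. ViTAR aims to narrow this machine-human perception gap.
Contribution: Proposes the ViTAR framework, which emulates the iterative diagnostic behavior of human experts through multi-step visual reasoning; builds high-quality interactive datasets (1K instruction examples plus 16K visual question answering samples); designs a two-stage training strategy (supervised fine-tuning followed by reinforcement learning).
Method: ViTAR performs iterative reasoning through the cognitive chain “think-act-rethink-answer” and is trained in two stages: supervised fine-tuning first guides the cognitive trajectories, then reinforcement learning optimizes decision-making.
Result: Experiments show that ViTAR outperforms strong state-of-the-art models. Visual attention analysis shows that it increasingly focuses on clinically critical regions and maintains high attention allocation to visual tokens throughout reasoning.
Insight: Embedding expert-style iterative thinking chains into VLMs improves both performance and the trustworthiness of medical AI. The attention analysis offers a mechanistic explanation for the improvement.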
Abstract: Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of “think-act-rethink-answer”. ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, we curate 16K visual question answering training samples for fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the “think” to “rethink” rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.
[120] DREAM: A Benchmark Study for Deepfake REalism AssessMent
Bo Peng,Zichuan Wang,Sheng Yu,Xiaochuan Jin,Wei Wang,Jing Dong
Main category: cs.CV
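TL;DR: This paper presents DREAM, a new benchmark study on assessing the visual realism of deepfake videos. Using a large-scale dataset with human annotations, it compares 16 assessment methods.
Details
Motivation: Deepfake technology threatens information credibility, yet the assessment of subjectively perceived realism has received little study.
Contribution: Proposes the DREAM benchmark, comprising a diverse deepfake video dataset, large-scale human-annotated realism scores and textual descriptions, and a comprehensive analysis of 16 assessment methods.
Method: Uses large-scale human-annotated realism scores and textual descriptions, proposes a description-aligned CLIP method, and compares it with a range of existing methods.
Result: The DREAM benchmark lays a foundation for deepfake realism assessment and shows how the performance of different methods varies.
Insight: Realism assessment of deepfakes not only supports detection techniques but can also improve the generation process and help predict the social influence of deepfakes.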
Abstract: Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on the problem of deepfake detection that aims to reliably tell deepfakes apart from real ones, in an objective way. On the other hand, the subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the visual realism assessment of deepfakes, which is defined as the automatic assessment of deepfake visual realism that approximates human perception of deepfakes. It is important for evaluating the quality and deceptiveness of deepfakes, which can be used for predicting the influence of deepfakes on the Internet, and it also has potential for improving the deepfake generation process by serving as a critic. This paper promotes this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It is comprised of a deepfake video dataset of diverse quality, a large scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent large vision language model based methods and a newly proposed description-aligned CLIP method. The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.
[121] Collaborative Learning of Semantic-Aware Feature Learning and Label Recovery for Multi-Label Image Recognition with Incomplete Labels
Zhi-Fen He,Ren-Dong Xie,Bo Li,Bin Liu,Jin-Yan Hu
Main category: cs.CV
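TL;DR: This paper proposes CLSL, a collaborative learning method for multi-label image recognition with incomplete labels that addresses the two challenges of semantic-aware feature learning and missing label recovery in a single framework.
Details
Motivation: Multi-label image recognition with incomplete labels faces two core challenges: semantic-aware feature learning and missing label recovery. Existing methods fail to unify the two effectively, which limits performance.
Contribution: 1. Designs a semantic-related feature learning module and a semantic-guided feature enhancement module; 2. Proposes a collaborative learning framework that dynamically enhances feature discriminability and adaptively recovers missing labels.
Method: 1. The semantic-related feature learning module discovers semantic information and label correlations; 2. The semantic-guided feature enhancement module aligns the visual and semantic feature spaces; 3. The collaborative framework jointly optimizes feature learning and label recovery.
Result: On the MS-COCO, VOC2007, and NUS-WIDE datasets, CLSL outperforms existing methods for multi-label recognition with incomplete labels.
Insight: The collaborative learning framework improves performance and robustness through bidirectional optimization of feature learning and label recovery, offering a new approach to incomplete-label tasks.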
Abstract: Multi-label image recognition with incomplete labels is a critical learning task and has emerged as a focal topic in computer vision. However, this task is confronted with two core challenges: semantic-aware feature learning and missing label recovery. In this paper, we propose a novel Collaborative Learning of Semantic-aware feature learning and Label recovery (CLSL) method for multi-label image recognition with incomplete labels, which addresses the two aforementioned challenges in a unified learning framework. More specifically, we design a semantic-related feature learning module to learn robust semantic-related features by discovering semantic information and label correlations. Then, a semantic-guided feature enhancement module is proposed to generate high-quality discriminative semantic-aware features by effectively aligning visual and semantic feature spaces. Finally, we introduce a collaborative learning framework that integrates semantic-aware feature learning and label recovery, which can not only dynamically enhance the discriminability of semantic-aware features but also adaptively infer and recover missing labels, forming a mutually reinforced loop between the two processes. Extensive experiments on three widely used public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that CLSL outperforms the state-of-the-art multi-label image recognition methods with incomplete labels.
[122] Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning
Pîrvu Mihai-Cristian,Leordeanu Marius
Main category: cs.CV
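TL;DR: The paper proposes PHG-MAE, a model that builds probabilistic hyper-graphs by randomly masking multi-modal data and merges MAE pre-training and fine-tuning into a single training framework, improving multi-modal multi-task learning performance.
Details
Motivation: Current multi-modal vision tasks depend on large amounts of labeled data. Self-supervised pre-training methods such as MAE reduce the need for labels but usually require a separate fine-tuning step. PHG-MAE aims to unify the theory of neural graphs with the MAE approach, dynamically generating hyper-graphs through random masking of multi-modal data to simplify training and improve performance.
Contribution: 1. Proposes the PHG-MAE model, which samples from the distribution of hyper-graphs by randomly masking multi-modal data; 2. Combines pre-training and fine-tuning into a unified training framework; 3. Supports inference-time ensembles and knowledge distillation; 4. Releases an automated data-pipeline tool and an extended dataset.
Method: 1. Mask entire modalities at random instead of only image patches; 2. Sample from the distribution of hyper-edges on each forward pass; 3. Merge MAE pre-training and downstream fine-tuning into a single training loop; 4. Improve model performance through ensembles and knowledge distillation.
Result: PHG-MAE performs well in multi-modal UAV scenes and supports knowledge distillation into models with few parameters with little loss in performance. The extended dataset and tools are publicly released.
Insight: Through dynamic hyper-graph modeling and multi-modal masking, PHG-MAE provides an efficient self-supervised framework for multi-task learning that is applicable to complex scenarios such as autonomous driving.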
Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn’t require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. In order to streamline the process of integrating external pre-trained experts for computer vision multi-modal multi-task learning (MTL) scenarios, we developed a data-pipeline software. Using this tool, we have created and released a fully-automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
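The two ideas in this abstract that lend themselves to a short illustration are masking whole modalities at random and aggregating several masked forward passes at inference time. The sketch below is an assumption-laden toy, not the PHG-MAE code: `predict` is a hypothetical stand-in for a trained model that maps the visible modalities to a scalar, the keep probability is arbitrary, and plain averaging is assumed as the aggregation rule.

```python
import random


def sample_modality_mask(modalities, rng, keep_prob=0.5):
    """Hide whole modalities at random; always keep at least one visible."""
    visible = [m for m in modalities if rng.random() < keep_prob]
    if not visible:
        visible = [rng.choice(modalities)]
    return visible


def ensemble_predict(predict, inputs, n_members, seed=0):
    """Average the predictions of several randomly masked forward passes."""
    rng = random.Random(seed)
    names = sorted(inputs)
    total = 0.0
    for _ in range(n_members):
        visible = sample_modality_mask(names, rng)
        # Each member only sees its own random subset of modalities.
        total += predict({name: inputs[name] for name in visible})
    return total / n_members
```

Each sampled subset plays the role of one hyper-edge from the visible modalities to the output; averaging over subsets is one simple way to form the inference-time ensemble the abstract mentions.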
[123] Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework
Meijun Zhou,Gang Mei,Zhengjing Ma,Nengxiong Xu,Jianbing Peng
Main category: cs.CV
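TL;DR: This paper proposes a novel and universal framework that uses a vision foundation model to track the spatiotemporal evolution of large-scale landslide scars, with a focus on continuous changes before and after failure and their value for early warning.
Details
Motivation: Most existing studies focus on single-phase or dual-phase landslide identification and struggle to track how landslide scars evolve in space and time. To address this, the authors propose a new framework that combines optical remote sensing images with video segmentation.
Contribution: Proposes a universal framework based on a vision foundation model that reconstructs discrete optical remote sensing images into a continuous video sequence, enabling continuous tracking of the spatiotemporal evolution of landslide scars.
Method: Within a knowledge-guided, auto-propagation, and interactive refinement paradigm, remote sensing images are converted into a video sequence, and a video segmentation model tracks the changes in the landslide scars.
Result: The framework was validated on two representative cases (the Baige landslide and the Sela landslide), capturing both pre-failure precursors and post-failure evolution.
Insight: Reconstructing remote sensing images into video sequences offers a new approach to dynamic monitoring of geological hazards, and combining it with a vision foundation model shows potential for hazard early warning.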
Abstract: Tracking the spatiotemporal evolution of large-scale landslide scars is critical for understanding the evolution mechanisms and failure precursors, enabling effective early-warning. However, most existing studies have focused on single-phase or pre- and post-failure dual-phase landslide identification. Although these approaches delineate post-failure landslide boundaries, it is challenging to track the spatiotemporal evolution of landslide scars. To address this problem, this study proposes a novel and universal framework for tracking the spatiotemporal evolution of large-scale landslide scars using a vision foundation model. The key idea behind the proposed framework is to reconstruct discrete optical remote sensing images into a continuous video sequence. This transformation enables a vision foundation model, which is developed for video segmentation, to be used for tracking the evolution of landslide scars. The proposed framework operates within a knowledge-guided, auto-propagation, and interactive refinement paradigm to ensure the continuous and accurate identification of landslide scars. The proposed framework was validated through application to two representative cases: the post-failure Baige landslide and the active Sela landslide (2017-2025). Results indicate that the proposed framework enables continuous tracking of landslide scars, capturing both failure precursors critical for early warning and post-failure evolution essential for assessing secondary hazards and long-term stability.
[124] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting
Jiahui Lu,Haihong Xiao,Xueyan Zhao,Wenxiong Kang
Main category: cs.CV
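TL;DR: Gesplat is a framework based on 3D Gaussian Splatting (3DGS) that targets inaccurate camera poses and insufficient viewpoint coverage in sparse-view settings. By combining the VGGT foundation model, a hybrid Gaussian representation, and flow-based depth regularization, Gesplat achieves robust 3D reconstruction and novel view synthesis without accurate camera poses.
Details
Motivation: NeRF and 3DGS have advanced 3D reconstruction and novel view synthesis, but they depend heavily on accurate camera poses and dense viewpoint coverage. This limits their use in sparse-view settings, a limitation Gesplat aims to overcome.
Contribution: 1) Proposes the Gesplat framework, which reconstructs scenes from sparse images without accurate camera poses; 2) Uses the VGGT foundation model to provide initial poses and point clouds; 3) Designs a hybrid Gaussian representation, a graph-guided attribute refinement module, and flow-based depth regularization.
Method: 1) Initialize poses and point clouds with the VGGT foundation model; 2) Optimize position and shape with a hybrid Gaussian representation; 3) Apply graph-guided attribute refinement and flow-based depth regularization to improve reconstruction quality.
Result: Gesplat performs well on both forward-facing and large-scale complex datasets, with more robust reconstruction and novel view synthesis than other pose-free methods.
Insight: By combining a foundation model with dedicated refinement modules, Gesplat shows the potential for high-quality 3D reconstruction from sparse views and offers a new direction for practical applications.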
Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have advanced 3D reconstruction and novel view synthesis, but remain heavily dependent on accurate camera poses and dense viewpoint coverage. These requirements limit their applicability in sparse-view settings, where pose estimation becomes unreliable and supervision is insufficient. To overcome these challenges, we introduce Gesplat, a 3DGS-based framework that enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images. Unlike prior works that rely on COLMAP for sparse point cloud initialization, we leverage the VGGT foundation model to obtain more reliable initial poses and dense point clouds. Our approach integrates several key innovations: 1) a hybrid Gaussian representation with dual position-shape optimization enhanced by inter-view matching consistency; 2) a graph-guided attribute refinement module to enhance scene details; and 3) flow-based depth regularization that improves depth estimation accuracy for more effective supervision. Comprehensive quantitative and qualitative experiments demonstrate that our approach achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.
[125] Cooperative Pseudo Labeling for Unsupervised Federated Classification
Kuangpu Guo,Lijun Sheng,Yongcan Yu,Jian Liang,Zilei Wang,Ran He
Main category: cs.CV
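TL;DR: This paper extends Unsupervised Federated Learning (UFL) to classification for the first time and proposes a new method, FedCoPL, which estimates and adjusts pseudo label distributions to achieve global class balance and uses a partial prompt aggregation protocol to improve collaboration and personalization.
Details
Motivation: Unsupervised Federated Learning (UFL) has mostly been used for representation learning and clustering. The strong zero-shot prediction capabilities of vision language models such as CLIP create a new opportunity to tackle classification under the UFL paradigm, but it remains largely unexplored.
Contribution: 1. Extends UFL to classification for the first time; 2. Proposes FedCoPL, which addresses class imbalance by adjusting pseudo label distributions; 3. Designs a partial prompt aggregation protocol that combines global and local knowledge to improve performance.
Method: In FedCoPL, clients estimate and upload their pseudo label distributions, and the server adjusts them to avoid global class imbalance. A partial prompt aggregation protocol aggregates the visual prompts (general image features) at the server and keeps the text prompts (personalized local knowledge) on each client.
Result: Experiments show that FedCoPL significantly outperforms baseline methods.
Insight: Combining CLIP's zero-shot capability with the distributed nature of federated learning can efficiently address classification in UFL; the partial aggregation protocol balances global collaboration against personalization.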
Abstract: Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, Federated Cooperative Pseudo Labeling (FedCoPL). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at https://github.com/krumpguo/FedCoPL.
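The partial prompt aggregation protocol described in this abstract can be sketched as one communication round. This is a minimal illustration under stated assumptions, not the FedCoPL implementation: prompts are represented as plain lists of floats, the server uses an unweighted mean (the abstract does not specify the aggregation rule), and the dictionary keys `visual_prompt` and `text_prompt` are hypothetical names.

```python
def aggregate_visual_prompts(client_states):
    """Server side: unweighted mean of the clients' visual prompt vectors (assumed rule)."""
    n = len(client_states)
    dim = len(client_states[0]["visual_prompt"])
    return [sum(c["visual_prompt"][i] for c in client_states) / n for i in range(dim)]


def partial_aggregation_round(client_states):
    """One round: share the averaged visual prompt, keep each text prompt local."""
    shared = aggregate_visual_prompts(client_states)
    return [
        {
            "visual_prompt": list(shared),             # replaced by the global average
            "text_prompt": list(c["text_prompt"]),     # never leaves the client
        }
        for c in client_states
    ]
```

After a round, every client holds the same visual prompt but its own text prompt, which mirrors the split between general image features and personalized knowledge that the abstract describes.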
[126] Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models
Minbin Huang,Runhui Huang,Chuanyang Zheng,Jingyao Li,Guoxuan Chen,Han Shi,Hong Cheng
Main category: cs.CV
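TL;DR: To address the mismatch between the reasoning chain and the final answer that arises when multimodal large language models are trained with reinforcement learning, the authors propose Answer-Consistent Reinforcement Learning (ACRE). By introducing a consistency-verification reward, the method significantly improves the consistency between reasoning and answer and yields gains on video and math reasoning tasks.
Details
Motivation: Conventional reinforcement learning improves answer accuracy but can leave the reasoning chain inconsistent with the final answer, hurting interpretability and reliability. The authors seek a new method to address this.
Contribution: Proposes ACRE, which adds an auxiliary consistency check to the GRPO algorithm and designs a consistency-verification reward, significantly improving the consistency between reasoning and answer.
Method: After the model generates a reasoning chain and an initial answer, ACRE shuffles the answer options and asks the model to predict a second answer from the same reasoning chain. The consistency-verification reward favors answers that are both consistent and correct and penalizes inconsistent or incorrect ones.
Result: On video reasoning and multimodal math reasoning tasks, ACRE improves over the GRPO baseline by an average of 2.2% and 1.5%, respectively.
Insight: The consistency-verification reward not only improves the consistency between reasoning and answer but also reduces reliance on spurious patterns such as option ordering bias, making the model more robust.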
Abstract: Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, leading to situations where models produce inconsistency between the reasoning trace and final answer. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To this end, we propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement for Video Reasoning and Math Reasoning tasks over the GRPO baseline.
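The consistency-verification reward described in this abstract can be sketched as a small function. The abstract only states that a high reward is granted when both answers agree and are correct and that a lower reward is assigned otherwise, so the concrete values `1.0` and `0.0`, the function names, and the choice to compare answers by option text rather than by letter are all assumptions made for illustration.

```python
import random


def shuffle_options(options, rng):
    """Return a shuffled copy of the answer options for the second prompt."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled


def consistency_reward(first_choice, second_choice, correct, high=1.0, low=0.0):
    """High reward only if both answers agree and are correct.

    Choices are compared by option text, so reordering the option letters
    between the two prompts does not affect the comparison.
    """
    if first_choice == second_choice == correct:
        return high
    return low
```

For example, a model that picks the correct option text before shuffling but a different option after shuffling receives the low reward, which is the reasoning-answer mismatch the method is meant to penalize.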
[127] Uncertainty-Aware Post-Detection Framework for Enhanced Fire and Smoke Detection in Compact Deep Learning Models
Aniruddha Srinivas Joshi,Godwyn James William,Shreyas Srinivas Joshi
Main category: cs.CV
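TL;DR: This paper proposes an uncertainty-aware post-detection framework that enhances fire and smoke detection in compact deep learning models such as YOLOv5n and YOLOv8n. The method rescales detection confidences using statistical uncertainty together with domain-relevant visual cues, improving precision and recall.
Details
Motivation: Existing vision-based fire and smoke detection methods struggle to balance efficiency and reliability, and the limited capacity of compact deep learning models often causes false positives and missed detections. Conventional post-detection methods rely only on spatial overlap and cannot handle cluttered or ambiguous scenes well.
Contribution: Proposes a lightweight uncertainty-aware post-detection framework that refines detection confidences by combining statistical uncertainty with visual features (color, edge, and texture), improving robustness without modifying the base model.
Method: A lightweight Confidence Refinement Network uses uncertainty estimates and visual features to adjust detection scores, leaving the base model unchanged.
Result: Experiments on the D-Fire dataset show improved precision, recall, and mean average precision (mAP) over existing baselines, with only modest computational overhead.
Insight: Refining confidences at the post-detection stage is an effective way to make compact deep learning models more robust; in cluttered or ambiguous scenes in particular, combining uncertainty with multiple visual cues can substantially reduce false positives and missed detections.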
Abstract: Accurate fire and smoke detection is critical for safety and disaster response, yet existing vision-based methods face challenges in balancing efficiency and reliability. Compact deep learning models such as YOLOv5n and YOLOv8n are widely adopted for deployment on UAVs, CCTV systems, and IoT devices, but their reduced capacity often results in false positives and missed detections. Conventional post-detection methods such as Non-Maximum Suppression and Soft-NMS rely only on spatial overlap, which can suppress true positives or retain false alarms in cluttered or ambiguous fire scenes. To address these limitations, we propose an uncertainty-aware post-detection framework that rescales detection confidences using both statistical uncertainty and domain-relevant visual cues. A lightweight Confidence Refinement Network integrates uncertainty estimates with color, edge, and texture features to adjust detection scores without modifying the base model. Experiments on the D-Fire dataset demonstrate improved precision, recall, and mean average precision compared to existing baselines, with only modest computational overhead. These results highlight the effectiveness of post-detection rescoring in enhancing the robustness of compact deep learning models for real-world fire and smoke detection.
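To make concrete what it means for conventional post-detection to rely only on spatial overlap, here is a minimal sketch of standard greedy Non-Maximum Suppression. It illustrates the baseline the abstract contrasts against, not the proposed Confidence Refinement Network; the box format `(x1, y1, x2, y2)` and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # The decision uses only box geometry and score rank, nothing about image content.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because the keep-or-drop decision depends only on overlap and score order, a confident false alarm can suppress a nearby true detection, which is the failure mode the abstract attributes to overlap-only post-processing.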
[128] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Rui Chen,Bin Liu,Changtao Miao,Xinghao Wang,Yi Li,Tao Gong,Qi Chu,Nenghai Yu
Main category: cs.CV
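TL;DR: This paper proposes the In-Context Forensic Chain (ICFC), a training-free framework that uses multi-modal large language models (MLLMs) for interpretable image manipulation detection and localization. With a hierarchical reasoning pipeline, adaptive filtering, and a multi-modal knowledge base, ICFC clearly outperforms existing training-free methods and in some cases matches fully supervised ones.
Details
Motivation: Image tampering poses an increasingly serious security threat. Existing supervised methods depend on costly pixel-level annotations, while weakly supervised and training-free alternatives underperform and lack interpretability. The study therefore proposes a training-free, interpretable solution.
Contribution: 1. Proposes the ICFC framework, applying the reasoning ability of multi-modal large language models to training-free image manipulation detection and localization; 2. Designs a hierarchical reasoning pipeline and an adaptive knowledge base construction method that improve performance and interpretability; 3. Achieves superior results on multiple benchmarks.
Method: 1. Build a knowledge base using the multi-modal capabilities of MLLMs; 2. Extract reliable information through objectified rules and adaptive filtering; 3. Apply a coarse-to-fine progressive reasoning pipeline that mirrors expert workflows.
Result: Across multiple datasets, ICFC surpasses existing training-free methods and in some cases matches weakly and fully supervised methods, while also providing text-level interpretability.
Insight: Training-free methods can achieve strong manipulation detection by exploiting the reasoning ability of multi-modal models; the hierarchical reasoning design and the knowledge base construction are the key factors.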
Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
[129] Multi Class Parkinsons Disease Detection Based on Finger Tapping Using Attention-Enhanced CNN BiLSTM
Abu Saleh Musa Miah,Najmul Hassan,Md Maruf Al Hossain,Yuichi Okuyama,Jungpil Shin
Main category: cs.CV
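TL;DR: The paper proposes a multi-class Parkinson's disease (PD) detection system based on an attention-enhanced CNN-BiLSTM. It analyzes motion features derived from finger-tapping videos to distinguish five PD severity levels.
Details
Motivation: Existing gesture-based PD recognition systems are not accurate enough, which hampers clinical management and intervention development. The study aims to improve automatic PD severity assessment by combining deep learning with attention. Contribution: A hybrid deep learning framework that integrates CNN, BiLSTM, and attention mechanisms for multi-class PD severity classification.
Method: 1) Derive temporal, frequency, and amplitude features from wrist and hand movements in finger-tapping videos; 2) capture local spatial dependencies with a Conv1D MaxPooling block; 3) model temporal dynamics with a BiLSTM; 4) apply attention to focus on the most informative temporal features; 5) concatenate the CNN features with the attention-enhanced BiLSTM outputs for classification.
Result: The model performs strongly in distinguishing the five severity classes, suggesting that combining spatial-temporal representations with attention can improve automated PD severity detection.
Insight: The study shows the potential of combining deep learning with non-invasively captured motion data for medical diagnosis, offering clinicians a supporting tool for PD monitoring.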
Abstract: Effective clinical management and intervention development depend on accurate evaluation of Parkinson's disease (PD) severity. Many researchers have worked on developing gesture-based PD recognition systems; however, their accuracy is not satisfactory. In this study, we propose a multi-class Parkinson's disease detection system based on finger tapping using an attention-enhanced CNN BiLSTM. We collected finger tapping videos and derived temporal, frequency, and amplitude based features from wrist and hand movements. We then propose a hybrid deep learning framework integrating CNN, BiLSTM, and attention mechanisms for multi-class PD severity classification from video-derived motion features. First, the input sequence is reshaped and passed through a Conv1D MaxPooling block to capture local spatial dependencies. The resulting feature maps are fed into a BiLSTM layer to model temporal dynamics. An attention mechanism focuses on the most informative temporal features, producing a context vector that is further processed by a second BiLSTM layer. CNN-derived features and attention-enhanced BiLSTM outputs are concatenated, followed by dense and dropout layers, before the final softmax classifier outputs the predicted PD severity level. The model demonstrated strong performance in distinguishing between the five severity classes, suggesting that integrating spatial-temporal representations with attention mechanisms can improve automated PD severity detection, making it a promising non-invasive tool to support clinicians in PD monitoring and progression tracking.
[130] DeepFusionNet: Autoencoder-Based Low-Light Image Enhancement and Super-Resolution
Halil Hüseyin Çalışkan,Talha Koruk
Main category: cs.CV
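TL;DR: DeepFusionNet is an autoencoder-based architecture for low-light image enhancement and super-resolution that reports high SSIM and PSNR scores with a small parameter count.
Details
Motivation: Low-light and blurry images degrade computer vision applications. Existing autoencoder-based methods have many parameters, high computational cost, and low SSIM and PSNR scores, so a more efficient solution is needed. Contribution: Proposes the DeepFusionNet architecture for low-light enhancement and super-resolution with few parameters (about 2.5 million for low-light enhancement, about 100 thousand for super-resolution) while reporting strong PSNR and SSIM.
Method: An autoencoder architecture, DeepFusionNet, is applied to both low-light image enhancement and super-resolution.
Result: On the LOL-v1 dataset, the low-light enhancement model reaches an SSIM of 92.8% and a PSNR of 26.30; the super-resolution model reaches a PSNR of 25.30 and an SSIM of 80.7% on its validation set.
Insight: Autoencoders can serve both low-light enhancement and super-resolution; an efficient architecture keeps performance high with few parameters and offers an alternative to GAN-based super-resolution.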
Abstract: Computer vision and image processing applications suffer from dark and low-light images, particularly during real-time image transmission. Currently, low light and dark images are converted to bright and colored forms using autoencoders; however, these methods often achieve low SSIM and PSNR scores and require high computational power due to their large number of parameters. To address these challenges, the DeepFusionNet architecture has been developed. According to the results obtained with the LOL-v1 dataset, DeepFusionNet achieved an SSIM of 92.8% and a PSNR score of 26.30, while containing only approximately 2.5 million parameters. On the other hand, conversion of blurry and low-resolution images into high-resolution and blur-free images has gained importance in image processing applications. Unlike GAN-based super-resolution methods, an autoencoder-based super resolution model has been developed that contains approximately 100 thousand parameters and uses the DeepFusionNet architecture. According to the results of the tests, the DeepFusionNet based super-resolution method achieved a PSNR of 25.30 and a SSIM score of 80.7 percent according to the validation set.
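Since the results above are reported as PSNR values, a short sketch of how PSNR is conventionally computed may help in reading them. This uses the standard definition, 10·log10(MAX² / MSE); the 8-bit peak value of 255 and the flat-list image representation are assumptions for illustration, and the abstract does not state the paper's exact evaluation settings.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    `max_value` is the peak pixel value (255 assumed for 8-bit images).
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the output is closer to the reference; identical images give infinity.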
[131] Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer
Yecong Wan,Mingwen Shao,Renlong Wu,Wangmeng Zuo
Main category: cs.CV
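TL;DR: Color3D is a highly adaptable framework for colorizing static and dynamic 3D scenes from monochromatic inputs, producing visually diverse, chromatically vibrant reconstructions with flexible user-guided control.
Details
Motivation: Existing methods focus on static scenes and enforce multi-view consistency by averaging color variations, which sacrifices chromatic richness and controllability. Color3D aims to avoid this while ensuring cross-view and cross-time consistency. Contribution: Recasts complicated 3D colorization as a more tractable single-image problem, using a personalized colorizer to obtain both color diversity and consistency.
Method: A single key view is colorized, and a personalized colorizer is fine-tuned to learn a scene-specific deterministic color mapping that it propagates to novel views and time steps. The 3D scene is then reconstructed with a Gaussian splatting representation in Lab color space.
Result: Experiments on diverse static and dynamic 3D colorization benchmarks show that Color3D delivers more consistent and chromatically rich renderings with precise user control.
Insight: Reducing 3D colorization to a single-image colorization task shows how color diversity and user control can be increased without sacrificing consistency.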
Abstract: In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations which inevitably sacrifice both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. Project Page https://yecongwan.github.io/Color3D/.
[132] ReMix: Towards a Unified View of Consistent Character Generation and Editing
Benjia Zhou,Bin Fu,Pei Cheng,Yanru Wang,Jiayuan Fan,Tao Chen
Main category: cs.CV
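TL;DR: ReMix is a unified framework for character-consistent generation and editing. It combines a ReMix Module with IP-ControlNet to address the limits of existing methods in semantic consistency and spatial controllability.
Details
Motivation: Methods built on large-scale text-to-image diffusion models rarely unify character-consistent generation and editing in one framework: generation-based approaches struggle with fine-grained identity consistency, while editing-based methods often lose spatial controllability and instruction alignment. Contribution: 1. Proposes the ReMix framework, unifying character-consistent generation and editing; 2. designs the ReMix Module and IP-ControlNet, which handle semantic editing and pixel-level consistency respectively; 3. introduces an ε-equivariant latent space and supports multiple tasks.
Method: 1. The ReMix Module uses the multimodal reasoning ability of MLLMs to edit semantic features; 2. IP-ControlNet extends ControlNet to decouple semantic and layout cues; 3. the ε-equivariant latent space jointly denoises the reference and target images.
Result: Experiments show that ReMix is effective and efficient across tasks including personalized generation, image editing, and style transfer.
Insight: 1. Decoupling semantics from layout addresses the limits of earlier methods; 2. a design inspired by convergent evolution and quantum decoherence promotes feature alignment.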
Abstract: Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It comprises two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an ε-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence, i.e., where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.
[133] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
Zhenjie Mao,Yuhuan Yang,Chaofan Ma,Dongsheng Jiang,Jiangchao Yao,Ya Zhang,Yanfeng Wang
Main category: cs.CV
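TL;DR: SaFiRe is a framework that mimics the human two-phase cognitive process (global understanding, then detail-oriented refinement) and uses Mamba's scan-then-update property to handle referential ambiguity in Referring Image Segmentation (RIS). It is evaluated on a new benchmark, aRefCOCO.
Details
Motivation: Current RIS methods mainly handle simple noun phrases (such as "red car") and neglect ambiguous real-world expressions, such as object-distracting and category-implicit ones. SaFiRe aims to fill this gap. Contribution: 1) Proposes the SaFiRe framework, which models the human cognitive process for RIS; 2) introduces the aRefCOCO benchmark for evaluating models under ambiguous referring expressions.
Method: SaFiRe uses Mamba's scan-then-update property to process the input in two phases, global understanding followed by multi-cycle detail refinement, with linear complexity.
Result: SaFiRe outperforms state-of-the-art baselines on both the standard datasets and the new benchmark.
Insight: Modeling the human cognitive process and keeping complexity linear are the key factors in improving RIS performance.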
Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions–short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process–first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
[134] SparseUWSeg: Active Sparse Point-Label Augmentation for Underwater Semantic Segmentation
César Borja,Carlos Plou,Rubén Martinez-Cantín,Ana C. Murillo
Main category: cs.CV
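TL;DR: SparseUWSeg is a framework for underwater semantic segmentation that combines active sampling with sparse label propagation to improve segmentation performance.
Details
Motivation: Semantic segmentation of underwater scenes requires dense annotation, which is costly. Sparse point labels are easier to obtain but raise two questions: which points to annotate and how to propagate the sparse information. Contribution: Proposes SparseUWSeg, which combines active sampling with hybrid label propagation, and releases a simple interactive annotation tool that integrates these algorithms.
Method: An active sampling strategy guides annotators toward high-value points; the sparse labels are then propagated with a hybrid of SAM2 and superpixel-based methods.
Result: On two underwater datasets, SparseUWSeg gains up to +5% mIoU over D+NN.
Insight: Sparse annotation combined with active sampling and hybrid propagation can lower annotation cost while improving segmentation performance.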
Abstract: Semantic segmentation is essential to automate underwater imagery analysis for ecology monitoring purposes. Unfortunately, fine-grained underwater scene analysis is still an open problem even for top-performing segmentation models. The high cost of obtaining dense, expert-annotated segmentation labels hinders the supervision of models in this domain. While sparse point-labels are easier to obtain, they introduce challenges regarding which points to annotate and how to propagate the sparse information. We present SparseUWSeg, a novel framework that addresses both issues. SparseUWSeg employs an active sampling strategy to guide annotators, maximizing the value of their point labels. Then, it propagates these sparse labels with a hybrid approach that leverages the best of both SAM2 and superpixel-based methods. Experiments on two diverse underwater datasets demonstrate the benefits of SparseUWSeg over state-of-the-art approaches, achieving up to +5% mIoU over D+NN. Our main contribution is the design and release of a simple but effective interactive annotation tool, integrating our algorithms. It enables ecology researchers to leverage foundation models and computer vision to efficiently generate high-quality segmentation masks to process their data.
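The headline number for this entry is a gain in mIoU, so a small sketch of the metric may be useful. This follows the usual definition (per-class intersection over union, averaged over classes that appear in the prediction or the ground truth); the flat label lists and the choice to skip absent classes are assumptions, and the paper's exact protocol is not given in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    `pred` and `target` are equal-length sequences of integer class labels,
    one per pixel. Classes absent from both are skipped (an assumption).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with predictions `[0, 0, 1, 1]` against ground truth `[0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.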
[135] ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis
Cristiano Patrício,Luís F. Teixeira,João C. Neves
Main category: cs.CV
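TL;DR: ViConEx-Med is a transformer-based framework that uses multi-concept learnable tokens to jointly predict and localize visual concepts, yielding human-understandable visual concept explanations. It reports strong results in medical image analysis.
Details
Motivation: Existing concept-based models treat concepts as numerical attributes and offer no visual explanation that localizes them, which limits their usefulness in high-stakes settings such as medicine. ViConEx-Med aims to fill this gap. Contribution: A new transformer framework, ViConEx-Med, that jointly predicts and localizes visual concepts through multi-concept tokens, producing concept-level localization maps while maintaining high predictive accuracy.
Method: Introduces multi-concept learnable tokens and specialized attention layers that process visual and text-based concept tokens, combining the localization and prediction tasks.
Result: On synthetic and real-world medical datasets, ViConEx-Med outperforms prior concept-based models and is competitive with black-box models in both concept detection and localization precision.
Insight: ViConEx-Med points to a direction for building inherently interpretable models grounded in visual concepts, particularly for high-stakes domains such as medicine.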
Abstract: Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.
[136] TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
Zixu Zhao,Yang Zhan
Main category: cs.CV
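TL;DR: The paper proposes the TCMA framework for drone cross-modal text-video retrieval and builds a fine-grained dataset, DVTMD; its multi-granularity alignment improves retrieval performance.
Details
Motivation: UAVs produce massive volumes of video, but text-video retrieval remains underexplored in this domain, mainly because existing datasets have coarse and redundant captions. A fine-grained dataset and an effective multi-granularity alignment method are therefore needed. Contribution: 1. Builds DVTMD, a drone text-video dataset with fine-grained annotations; 2. proposes the TCMA framework, which achieves multi-granularity alignment through global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment; 3. designs a Word and Patch Selection module and a Text-Adaptive Dynamic Temperature Mechanism to further refine local alignment.
Method: TCMA has three parts: global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. In addition, a Word and Patch Selection module filters irrelevant content, and a dynamic temperature mechanism adapts attention sharpness to the text type.
Result: On DVTMD and CapERA, TCMA achieves state-of-the-art performance, including 45.5% R@1 for text-to-video retrieval and 42.8% R@1 for video-to-text retrieval.
Insight: A fine-grained dataset and multi-granularity alignment improve text-video retrieval in the drone domain, and the dynamic temperature mechanism offers a new angle on cross-modal attention modeling.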
Abstract: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection, producing massive volumes of aerial videos. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. While text-video retrieval has advanced in natural video domains, the UAV domain remains underexplored due to limitations in existing datasets, such as coarse and redundant captions. Thus, in this work, we construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions. The annotations capture multiple complementary aspects, including human actions, objects, background settings, environmental conditions, and visual style, thereby enhancing text-video correspondence and reducing redundancy. Building on this dataset, we propose the Text-Conditioned Multi-granularity Alignment (TCMA) framework, which integrates global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. To further refine local alignment, we design a Word and Patch Selection module that filters irrelevant content, as well as a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to text type. Extensive experiments on DVTMD and CapERA establish the first complete benchmark for drone text-video retrieval. Our TCMA achieves state-of-the-art performance, including 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval, demonstrating the effectiveness of our dataset and method. The code and dataset will be released.
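The retrieval results above are stated as R@1. A minimal sketch of Recall@K follows, assuming the common benchmark setup in which query `i` has exactly one correct item at index `i` of a similarity matrix. That one-to-one pairing is an assumption for illustration; DVTMD pairs each video with several captions, so the paper's actual evaluation will differ in detail.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item ranks in the top k.

    `similarity[i][j]` is the score between query i and candidate j.
    The ground truth for query i is assumed to be candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix must be non-empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

Text-to-video and video-to-text scores come from the same computation applied to the similarity matrix and to its transpose.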
[137] Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification
Haohua Dong,Ana Manzano Rodríguez,Camille Guinaudeau,Shin’ichi Satoh
Main category: cs.CV
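TL;DR: The paper proposes pseudo-balancing, a strategy for mitigating bias in face gender classification within semi-supervised learning. It needs no ground-truth labels and debiases the model using only unlabeled images from a race-balanced dataset. Experiments show improved fairness with preserved or improved accuracy.
Details
Motivation: Face gender classification models often amplify biases in their training data, leading to uneven performance across gender and racial subgroups. Existing methods usually rely on ground-truth labels, which are often unavailable in practice, so a debiasing method that needs no labels is required. Contribution: Proposes pseudo-balancing, which enforces demographic balance during the pseudo-label selection stage of semi-supervised learning and thereby mitigates model bias. It requires only unlabeled data.
Method: The core of pseudo-balancing is enforcing demographic balance during pseudo-label selection. The experiments cover two conditions: (1) fine-tuning a biased gender classifier with unlabeled images from the FairFace dataset; (2) stress-testing with intentionally imbalanced training data to simulate controlled bias. Models are evaluated on the All-Age-Faces (AAF) benchmark.
Result: Pseudo-balancing reaches 79.81% overall accuracy, a 6.53% improvement over the baseline, and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap narrows to 5.01%.
Insight: Even without label supervision, a demographically balanced (or moderately skewed) unlabeled dataset can be enough to debias existing computer vision models, which offers a low-cost option for practical use.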
Abstract: Face gender classification models often reflect and amplify demographic biases present in their training data, leading to uneven performance across gender and racial subgroups. We introduce pseudo-balancing, a simple and effective strategy for mitigating such biases in semi-supervised learning. Our method enforces demographic balance during pseudo-label selection, using only unlabeled images from a race-balanced dataset without requiring access to ground-truth annotations. We evaluate pseudo-balancing under two conditions: (1) fine-tuning a biased gender classifier using unlabeled images from the FairFace dataset, and (2) stress-testing the method with intentionally imbalanced training data to simulate controlled bias scenarios. In both cases, models are evaluated on the All-Age-Faces (AAF) benchmark, which contains a predominantly East Asian population. Our results show that pseudo-balancing consistently improves fairness while preserving or enhancing accuracy. The method achieves 79.81% overall accuracy - a 6.53% improvement over the baseline - and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap is narrowed to just 5.01%. These findings suggest that even in the absence of label supervision, access to a demographically balanced or moderately skewed unlabeled dataset can serve as a powerful resource for debiasing existing computer vision models.
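The abstract says demographic balance is enforced during pseudo-label selection but does not spell out the rule. One plausible reading is sketched below: keep confident predictions, then take the same number of samples from every (group, predicted label) cell. The function name, the `group` field, the confidence threshold, and the equal-quota rule are all hypothetical choices for illustration, not the authors' procedure.

```python
from collections import defaultdict


def select_balanced(predictions, min_conf=0.8):
    """Pick pseudo-labels so every (group, label) cell contributes equally.

    `predictions` is a list of (sample_id, group, predicted_label, confidence).
    `group` stands for dataset-provided demographic metadata (hypothetical).
    Returns a list of (sample_id, pseudo_label) pairs.
    """
    cells = defaultdict(list)
    for sample_id, group, label, conf in predictions:
        if conf >= min_conf:  # discard low-confidence predictions
            cells[(group, label)].append((conf, sample_id))
    if not cells:
        return []
    # Equal quota: the size of the smallest non-empty cell.
    quota = min(len(items) for items in cells.values())
    selected = []
    for key in sorted(cells):
        top = sorted(cells[key], reverse=True)[:quota]  # most confident first
        selected.extend((sample_id, key[1]) for _, sample_id in top)
    return selected
```

A limitation of this sketch is that a cell with no confident predictions is simply absent, so the output is balanced only across cells that the model already predicts with confidence.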
[138] B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
Feng Xiao,Hongbin Xu,Hai Ci,Wenxiong Kang
Main category: cs.CV
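TL;DR: The paper proposes B2N3D, a progressive relational learning framework that extends relation modeling in 3D object grounding from binary to n-ary relationships, improving global perception in multi-modal relational understanding. With a grouped supervision loss and a multi-modal network with hybrid attention, it outperforms prior methods on the ReferIt3D and ScanRefer benchmarks.
Details
Motivation: Current 3D object grounding methods model only pairwise object relationships and ignore the global perceptual significance of n-ary combinations, which makes alignment difficult when a description involves multiple spatial relationships. Contribution: 1. Proposes a progressive relational learning framework that extends relation modeling from binary to n-ary; 2. designs a grouped supervision loss to cope with the absence of specific annotations for referred objects in the training data; 3. uses a multi-modal network with hybrid attention to localize the target within n-ary combinations.
Method: 1. Build a scene graph with n-ary relationships; 2. use the grouped supervision loss to facilitate n-ary relational learning; 3. apply hybrid attention in the multi-modal network to further localize the target.
Result: The method outperforms the state of the art on the ReferIt3D and ScanRefer benchmarks, supporting the advantage of n-ary relational perception in 3D localization.
Insight: Modeling n-ary relationships captures the global information in multi-modal descriptions more completely, which improves 3D object grounding under complex spatial relationships.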
Abstract: Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships for pairwise objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state-of-the-art, and proves the advantages of the n-ary relational perception in 3D localization.
[139] From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology
Yizhi Wang,Li Chen,Qiang Huang,Tian Guan,Xi Deng,Zhiyuan Shen,Jiawen Li,Xinrui Chen,Bin Hu,Xitong Ling,Taojie Zhu,Zirui Huang,Deshui Yu,Yan Liu,Jiurun Chen,Lianghui Zhu,Qiming He,Yiqing Liu,Diwei Shi,Hanzhong Liu,Junbo Hu,Hongyi Gao,Zhen Song,Xilong Zhao,Chao He,Ming Zhao,Yonghong He
Main category: cs.CV
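TL;DR: The paper presents CerS-Path, a diagnostic system specialized for cervical pathology and built through self-supervised learning and multimodal enhancement, reporting improved diagnostic accuracy and generalizability.
Details
Motivation: Diagnosing cervical cancer requires complex histopathological assessment. Existing deep learning models fall short in accuracy and generalizability, and general foundation models struggle to capture subspecialty-specific features. Contribution: Develops the CerS-Path system, which builds a cervical-specific feature extractor through two pretraining stages (self-supervised learning and multimodal enhancement) and supports eight diagnostic functions.
Method: Self-supervised learning on approximately 190 million tissue patches from 140,000 slides trains the feature extractor, followed by multimodal enhancement with 2.5 million image-text pairs and integration with multiple downstream diagnostic functions.
Result: In prospective testing on 3,173 cases across five centers, the system maintained 99.38% screening sensitivity and excellent generalizability.
Insight: Combining self-supervised learning with multimodal enhancement can substantially improve subspecialty diagnostic models and offers a reference for AI applications in other subspecialties.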
Abstract: Cervical cancer remains a major malignancy, necessitating extensive and complex histopathological assessments and comprehensive support tools. Although deep learning shows promise, these models still lack accuracy and generalizability. General foundation models offer a broader reach but remain limited in capturing subspecialty-specific features and task adaptability. We introduce the Cervical Subspecialty Pathology (CerS-Path) diagnostic system, developed through two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches from 140,000 slides to build a cervical-specific feature extractor, and multimodal enhancement with 2.5 million image-text pairs, followed by integration with multiple downstream diagnostic functions. Supporting eight diagnostic functions, including rare cancer classification and multimodal Q&A, CerS-Path surpasses prior foundation models in scope and clinical applicability. Comprehensive evaluations demonstrate a significant advance in cervical pathology, with prospective testing on 3,173 cases across five centers maintaining 99.38% screening sensitivity and excellent generalizability, highlighting its potential for subspecialty diagnostic translation and cervical cancer screening.
[140] Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images
Chuangchuang Tan,Xiang Ming,Jinglu Wang,Renshuai Tao,Bin Li,Yunchao Wei,Yao Zhao,Yan Lu
Main category: cs.CV
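TL;DR: The paper introduces the AnomReason benchmark and the AnomAgent pipeline for detecting and explaining semantic anomalies in AI-generated images, and validates them experimentally.
Details
Motivation: AI-generated content (AIGC) can look realistic yet often contains semantic anomalies, such as implausible object configurations or commonsense violations, which undermine the reliability of the images. The paper aims to address this and improve the semantic trustworthiness of AIGC. Contribution: 1. Introduces the AnomReason benchmark with structured annotations; 2. develops the AnomAgent multi-agent pipeline, which enables large-scale, high-quality annotation; 3. proposes the semantic matching metrics SemAP and SemF1 and uses them to validate the approach.
Method: AnomAgent, a multi-agent pipeline, generates structured annotations with lightweight human verification. Models are trained on AnomReason and evaluated with SemAP and SemF1.
Result: Models fine-tuned on AnomReason outperform baselines on the semantic matching metrics and prove useful for explainable deepfake detection and for assessing the semantic reasonableness of image generators.
Insight: Semantic anomaly detection is central to AIGC trustworthiness; combining a multi-agent pipeline with lightweight human verification produces large-scale, high-quality annotations efficiently; structured annotations and semantic metrics give AIGC research new tools.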
Abstract: The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large-scale benchmark with structured annotations as quadruples (Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi-agent pipeline (AnomAgent) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17B GPT-4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
[141] MRI Brain Tumor Detection with Computer Vision
Jack Krolik,Jake Lynn,John Henry Rudden,Dmytro Vremenko
Main category: cs.CV
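TL;DR: The paper explores deep learning for automated detection and segmentation of brain tumors in MRI scans, combining several machine learning models and reporting promising gains in diagnostic accuracy and efficiency.
Details
Motivation: Conventional brain tumor diagnosis relies on manual analysis of MRI images, which is slow and error-prone. The potential of deep learning in medical imaging motivates this study, which aims to raise the automation and accuracy of diagnosis. Contribution: 1) Classifies brain tumors with logistic regression, CNNs, and ResNet; 2) uses U-Net and EfficientDet to improve tumor localization and segmentation; 3) illustrates the practical value of deep learning in medical imaging.
Method: 1) Classification with logistic regression, CNNs, and ResNet; 2) semantic segmentation with U-Net; 3) anchor-based object detection with EfficientDet to localize tumors.
Result: The results show promising improvements in the accuracy and efficiency of brain tumor detection and segmentation.
Insight: Deep learning has broad prospects in medical image analysis; future work could further optimize the models and incorporate multimodal data to improve clinical usefulness.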
Abstract: This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.
[142] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
Yuxiang Lai,Jike Zhong,Ming Li,Yuheng Li,Xiaofeng Yang
Main category: cs.CV
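TL;DR: The paper examines whether a large vision model (LVM) can be applied directly to medical imaging tasks in a zero-shot setting, even though the model was never trained on medical data. The LVM performs well on organ segmentation, denoising, super-resolution, and radiotherapy motion prediction, and surpasses specialized baselines in motion prediction.
Details
Motivation: Prompted by the zero-shot generalization that large generative models have recently shown across domains, the authors test whether autoregressive video modeling principles transfer directly to medical imaging tasks without fine-tuning on medical data. Contribution: Demonstrates zero-shot capability of a general-purpose video model in the medical domain, with state-of-the-art spatial accuracy in radiotherapy motion prediction, laying groundwork for future medical foundation models.
Method: A large vision model (LVM) based on autoregressive video modeling is evaluated zero-shot on four medical imaging tasks (organ segmentation, denoising, super-resolution, motion prediction).
Result: The LVM is evaluated on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no exposure to medical data, it performs strongly on all tasks and surpasses specialized DVF-based and generative baselines in motion prediction.
Insight: The work highlights the potential of video models in medicine: general-purpose video models may serve as unified learners and reasoners and could become a core component of future medical foundation models.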
Abstract: Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.
[143] Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting
Abdelrhman Elrawy,Emad A. Mohammed
Main category: cs.CV
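TL;DR: The paper proposes opacity-gradient-driven density control, which makes 3D Gaussian Splatting more efficient and compact in few-shot settings, cutting the primitive count substantially with only a modest loss in reconstruction quality.
Details
Motivation: 3D Gaussian Splatting (3DGS) tends to overfit and produce bloated reconstructions in few-shot scenarios. Existing methods such as FSGS improve quality but significantly increase the primitive count. This paper aims to make core 3DGS optimization more efficient and compact. Contribution: 1. Proposes a new densification trigger that replaces the positional gradient with the opacity gradient as a lightweight proxy for rendering error. 2. Combines it with a conservative pruning schedule and a standard depth-correlation loss, improving efficiency and compactness.
Method: 1. A densification trigger based on the opacity gradient. 2. A conservative pruning schedule that prevents destructive optimization cycles. 3. A depth-correlation loss for geometric guidance.
Result: On the 3-view LLFF dataset, the model is over 40% more compact than FSGS (32k vs. 57k primitives); on Mip-NeRF 360 the reduction is approximately 70%, with only a modest drop in reconstruction metrics.
Insight: The opacity gradient is an effective lightweight proxy for guiding density control; pairing aggressive densification with conservative pruning is the key to efficient, compact reconstruction.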
Abstract: 3D Gaussian Splatting (3DGS) struggles in few-shot scenarios, where its standard adaptive density control (ADC) can lead to overfitting and bloated reconstructions. While state-of-the-art methods like FSGS improve quality, they often do so by significantly increasing the primitive count. This paper presents a framework that revises the core 3DGS optimization to prioritize efficiency. We replace the standard positional gradient heuristic with a novel densification trigger that uses the opacity gradient as a lightweight proxy for rendering error. We find this aggressive densification is only effective when paired with a more conservative pruning schedule, which prevents destructive optimization cycles. Combined with a standard depth-correlation loss for geometric guidance, our framework demonstrates a fundamental improvement in efficiency. On the 3-view LLFF dataset, our model is over 40% more compact (32k vs. 57k primitives) than FSGS, and on the Mip-NeRF 360 dataset, it achieves a reduction of approximately 70%. This dramatic gain in compactness is achieved with a modest trade-off in reconstruction metrics, establishing a new state-of-the-art on the quality-vs-efficiency Pareto frontier for few-shot view synthesis.
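The "over 40% more compact" figure can be checked directly from the primitive counts quoted in the abstract. The short calculation below is only that arithmetic check; the helper name is made up for illustration.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - ours / baseline


# Primitive counts quoted in the abstract for the 3-view LLFF setting.
llff = relative_reduction(32_000, 57_000)
print(f"LLFF reduction: {llff:.1%}")  # prints: LLFF reduction: 43.9%
```

So 32k against 57k primitives is a reduction of about 44%, consistent with the "over 40%" wording.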
[144] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Donglin Huang,Yongyuan Li,Tianhang Liu,Junming Huang,Xiaoda Yang,Chi Wang,Weiwei Xu
Main category: cs.CV
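TL;DR: VividAnimator is an end-to-end framework for audio- and pose-driven half-body human animation. With a pre-trained Hand Clarity Codebook (HCC), a Dual-Stream Audio-Aware Module (DSAA), and a Pose Calibration Trick (PCT), it addresses the stiff heads and blurry hands seen in existing methods.
Details
Motivation: Existing audio- and pose-driven human animation methods handle head motion and hand detail poorly, mainly because audio correlates only weakly with head movement and hands are structurally complex. Contribution: 1) A pre-trained HCC that mitigates hand degradation; 2) a DSAA module that models lip synchronization and head pose dynamics separately; 3) a PCT that refines pose conditions to ensure natural transitions.
Method: HCC, DSAA, and PCT are combined to jointly improve hand texture fidelity, lip synchronization, and natural head dynamics.
Result: Experiments show that VividAnimator outperforms existing methods in hand detail, gesture realism, and identity consistency.
Insight: Modeling the audio-driven parts separately (lip synchronization and head dynamics) and adding hand priors are key to better animation quality.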
Abstract: Existing audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that VividAnimator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.
[145] SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation
Sayan Mandal,Divyadarshini Karthikeyan,Manas Paldhe
Main category: cs.CV
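TL;DR: SAM2LoRA is a parameter-efficient fine-tuning strategy that adapts Segment Anything Model 2 (SAM2) to retinal fundus image segmentation using low-rank adapters (LoRA) and a composite loss function, substantially reducing the number of trainable parameters while improving performance.
Details
Motivation: Although SAM2 performs well in low-resource settings, fine-tuning it remains challenging. An efficient method is needed that reduces training overhead while improving cross-dataset segmentation performance. Contribution: 1) SAM2LoRA, which uses low-rank adapters (LoRA) to cut trainable parameters to fewer than 5% of the original; 2) a composite loss function (combining BCE, SoftDice, and FocalTversky) for network tuning.
Method: 1) Integrates low-rank adapters (LoRA) into SAM2's image encoder and mask decoder; 2) guides training with the composite loss; 3) evaluates on 11 fundus segmentation datasets.
Result: Under cross-dataset training, SAM2LoRA performs strongly on blood vessel and optic disc segmentation, with Dice scores of up to 0.86 and 0.93 and AUC values of up to 0.98 and 0.99, respectively, reaching state-of-the-art performance.
Insight: 1) Low-rank adapters are an effective means of parameter-efficient fine-tuning; 2) a composite loss is essential for multi-task segmentation; 3) SAM2LoRA has broad application potential in low-resource settings.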
Abstract: We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentation BCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.
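The composite loss named in the abstract can be illustrated with a minimal, framework-free sketch. It combines binary cross-entropy, soft Dice, and Focal Tversky terms over flattened pixel probabilities. The function names, the equal term weights, and the hyperparameters (`alpha`, `beta`, `gamma`, `smooth`) are illustrative assumptions, not values taken from the paper.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """One minus the soft Dice coefficient."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, smooth=1.0):
    """(1 - Tversky index) ** gamma; alpha weights false negatives, beta false positives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    tversky = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
    return max(0.0, 1.0 - tversky) ** gamma


def composite_loss(probs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms (equal weights are an assumption)."""
    w_bce, w_dice, w_ft = weights
    return (w_bce * bce_loss(probs, targets)
            + w_dice * soft_dice_loss(probs, targets)
            + w_ft * focal_tversky_loss(probs, targets))
```

In this sketch the BCE term penalizes per-pixel errors, while the Dice and Tversky terms act on region overlap, which matters for thin structures such as vessels where foreground pixels are rare.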
[146] From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries
Joy Hsu,Emily Jin,Jiajun Wu,Niloy J. Mitra
Main category: cs.CV
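TL;DR: FactoredScenes is a framework that generates realistic 3D scenes by learning room structure and the variation of object poses. It factors scenes into hierarchically organized program concepts and object poses, generates high-level programs with language models, and predicts object poses while retrieving 3D objects to place.
Details
Motivation: Real-world scenes such as those in ScanNet are difficult to capture and data are limited, and generating varied object poses remains challenging. A method that exploits both scene structure and pose variation is needed. Contribution: Proposes the FactoredScenes framework, which introduces a hierarchical factored representation, generates programs with language models, and learns a program-conditioned object pose prediction model.
Method: 1. Use a factored representation that splits scenes into room programs and object poses; 2. learn a library of reusable layout patterns; 3. generate programs with language models; 4. predict object poses and place objects conditioned on the program.
Result: The generated scenes are realistic and difficult to distinguish from real ScanNet scenes.
Insight: Hierarchical factorization and program generation effectively capture scene structure and pose variation, offering a new approach to 3D scene generation.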
Abstract: Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.
[147] Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis
Yu-Hsuan Lin
Main category: cs.CV
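TL;DR: This paper proposes a multimodal framework that combines vision-language reasoning (CLIP), object detection (YOLO-World), and motion analysis (MOG2) to classify traffic congestion levels on an ordinal scale, significantly outperforming unimodal baselines.
Details
Motivation: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. Existing unimodal methods struggle to capture the complexity of traffic scenes, so multimodal information is needed to improve classification. Contribution: The main contribution is a multimodal framework combining vision-language reasoning, object detection, and motion analysis that classifies congestion on an ordinal scale (levels 1 to 5) and significantly outperforms baselines in performance and semantic consistency.
Method: Three modules: 1) CLIP for vision-language reasoning; 2) YOLO-World for object detection; 3) MOG2 background subtraction for motion analysis. Motion-based weighting additionally improves the interpretability of the classification.
Result: The model reaches 76.7% accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines.
Insight: Combining multimodal information effectively improves the performance and semantic consistency of congestion classification. Future work could refine it with vehicle sizing and finer-grained density metrics.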
Abstract: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework combining open-vocabulary visual-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show the model achieves 76.7 percent accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework’s effectiveness in preserving ordinal structure and leveraging visual-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.
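Because the abstract reports a Quadratic Weighted Kappa, a short sketch of that metric may help show why it suits ordinal labels: disagreements are penalized by the squared distance between levels, so predicting level 2 for a true level 1 costs far less than predicting level 5. The function name and the 1-to-`num_classes` label convention are assumptions made for this illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """QWK for integer labels in 1..num_classes."""
    k = num_classes
    n = len(y_true)
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - 1][p - 1] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            num += weight * observed[i][j]
            den += weight * hist_true[i] * hist_pred[j] / n
    if den == 0.0:
        return 1.0  # all labels and predictions fall in one class
    return 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.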
[148] PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion
Linlian Jiang,Rui Ma,Li Gu,Ziqiang Wang,Xinxin Zuo,Yang Wang
Main category: cs.CV
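TL;DR: PointMAC is a meta-learning framework that achieves test-time adaptation for point cloud completion through self-supervised auxiliary objectives and a MAML strategy, without additional supervision.
Details
Motivation: Existing point cloud completion models cannot adapt at test time to novel structures and sensor-induced distortions, which limits their robustness in safety-critical applications. Contribution: Proposes PointMAC, the first application of meta-auxiliary test-time adaptation to point cloud completion, achieving sample-level refinement through self-supervised objectives and an adaptive gradient-balancing mechanism.
Method: Uses a MAML-based meta-auxiliary learning strategy to optimize two self-supervised auxiliary objectives (structural and sensor-level incompleteness) and adapts the encoder on the fly at inference.
Result: Achieves state-of-the-art results on synthetic, simulated, and real-world datasets, showing that PointMAC produces high-quality completions.
Insight: Test-time adaptation and self-supervised auxiliary objectives are effective ways to improve the robustness of point cloud completion, especially for novel structures or sensor noise.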
Abstract: Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.
[149] Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure
Saurabh Kataria,Ayca Ermis,Lovely Yeswanth Panchumarthi,Minxiao Wang,Xiao Hu
Main category: cs.CV
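TL;DR: This paper proposes Vision4PPG, which uses vision foundation models (VFMs) to process PPG signals by converting them into 2D image representations (such as the STFT), achieving leading performance on multiple physiological tasks.
Details
Motivation: PPG sensors in wearable and clinical devices provide non-invasive, real-time physiological data, but conventional approaches mostly rely on specialized foundation models or time-series models. The authors explore the potential of vision foundation models for PPG signals and find that they outperform conventional approaches. Contribution: 1) Vision4PPG, which converts PPG signals into 2D image representations and processes them with VFMs; 2) evidence of SOTA VFM performance on multiple tasks, including blood pressure estimation; 3) generalization to other 2D representations (such as STFT phase and recurrence plots).
Method: 1) Convert 1D PPG signals into 2D images via the STFT and similar transforms; 2) extract features with recent VFMs (such as DINOv3 and SIGLIP-2); 3) optimize the model with parameter-efficient fine-tuning (PEFT).
Result: Vision4PPG reaches SOTA performance on tasks such as blood pressure estimation and performs well on six other tasks.
Insight: Vision foundation models can process PPG signals effectively, opening a new research direction; PEFT also keeps the approach computationally efficient and suitable for clinical use.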
Abstract: Photoplethysmography (PPG) sensors in wearable and clinical devices provide valuable physiological insights in a non-invasive and real-time fashion. Specialized Foundation Models (FMs) or repurposed time-series FMs are used to benchmark physiological tasks. Our experiments with fine-tuning FMs reveal that Vision FMs (VFMs) can also be utilized for this purpose and, in fact, surprisingly lead to state-of-the-art (SOTA) performance on many tasks, notably blood pressure estimation. We leverage VFMs by simply transforming one-dimensional PPG signals into image-like two-dimensional representations, such as the Short-Time Fourier Transform (STFT). Using the latest VFMs, such as DINOv3 and SIGLIP-2, we achieve promising performance on other vital signs and blood lab measurement tasks as well. Our proposal, Vision4PPG, unlocks a new class of FMs to achieve SOTA performance with notable generalization to other 2D input representations, including STFT phase and recurrence plots. Our work improves upon prior investigations of vision models for PPG by conducting a comprehensive study, comparing them to state-of-the-art time-series FMs, and demonstrating the general PPG processing ability by reporting results on six additional tasks. Thus, we provide clinician-scientists with a new set of powerful tools that is also computationally efficient, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques.
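The 1D-to-2D step described in the abstract can be sketched with a small, dependency-free STFT. The code below slides a Hann window over a signal, takes a discrete Fourier transform of each frame, and stacks the magnitudes into a frequency-by-time array that could be rendered as an image. The window size, hop length, and function name are illustrative assumptions; the paper's actual preprocessing settings are not given here.

```python
import cmath
import math


def stft_magnitude(signal, window_size=64, hop=16):
    """Return STFT magnitudes as a 2D list indexed [frequency_bin][frame]."""
    # Hann window reduces spectral leakage at frame boundaries.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (window_size - 1))
              for n in range(window_size)]
    num_bins = window_size // 2 + 1  # non-negative frequencies of a real signal
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(window_size)]
        spectrum = []
        for k in range(num_bins):
            # Direct DFT of one bin; O(N^2) overall, fine for a small sketch.
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n in range(window_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    # Transpose so rows are frequency bins and columns are time frames.
    return [[frame[k] for frame in frames] for k in range(num_bins)]
```

For real workloads a library FFT would replace the direct DFT; the direct form is used here only to keep the example self-contained.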
[150] Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection
Blessing Agyei Kyem,Joshua Kofi Asamoah,Eugene Denteh,Andrews Danyo,Armstrong Aboah
Main category: cs.CV
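TL;DR: The paper proposes Crack-Segmenter, a fully self-supervised crack detection framework that uses multi-scale feature extraction, attention, and an adaptive feature fusion module to achieve effective pixel-level crack segmentation without manual annotations, outperforming existing supervised methods.
Details
Motivation: Conventional crack detection depends on costly, time-consuming pixel-level annotation, which limits scalability for large-scale infrastructure monitoring. This work explores effective crack detection without manual annotations. Contribution: Proposes the self-supervised framework Crack-Segmenter, comprising three modules, the Scale-Adaptive Embedder (SAE), the Directional Attention Transformer (DAT), and Attention-Guided Fusion (AGF), for fully annotation-free crack detection.
Method: 1. The SAE module extracts multi-scale features; 2. the DAT module uses attention to preserve crack continuity; 3. the AGF module performs adaptive feature fusion. The whole framework is trained with self-supervised learning.
Result: Experiments on ten public datasets show that Crack-Segmenter outperforms 13 supervised methods on mIoU, Dice score, XOR, and HD.
Insight: Annotation-free crack detection is not only feasible but can surpass supervised methods, offering a low-cost, efficient solution for large-scale infrastructure monitoring.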
Abstract: Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates further pavement crack detection research.
[151] Identifying bias in CNN image classification using image scrambling and transforms
Sai Teja Erukude
Main category: cs.CV
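TL;DR: Using image tiling and transform techniques, the paper examines hard-to-detect bias in CNN image classification and proposes two methods to distinguish contextual information from background noise.
Details
Motivation: CNNs perform well in image classification, but their "black box" nature makes it hard for users to understand how features are selected, which can lead to decisions biased by background information. The author aims to identify these hidden biases. Contribution: Proposes two methods (randomly shuffling image tiles and applying several image transforms) to identify background noise and contextual information in CNN classification, tested on six datasets with demonstrated effectiveness.
Method: 1. Split images into non-overlapping tiles and shuffle them randomly; 2. apply image transforms such as the Fourier transform, wavelet transform, and median filter, and their combinations, to recover the background noise information the CNN uses for classification.
Result: Experiments show the methods effectively distinguish contextual information from background noise and can detect the presence of background noise without relying on background information.
Insight: Image transforms and tile shuffling help reveal hidden biases in CNN classification, offering a new approach to understanding and improving model interpretability.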
Abstract: CNNs are now prevalent as the primary choice for most machine vision problems due to their superior rate of classification and the availability of user-friendly libraries. These networks effortlessly identify and select features in a non-intuitive data-driven manner, making it difficult to determine which features were most influential. This leads to a "black box", where users cannot know how the image data are analyzed but rely on empirical results. Therefore, the decision-making process can be biased by background information that is difficult to detect. Here we discuss examples of such hidden biases and propose techniques for identifying them, methods to distinguish between contextual information and background noise, and explore whether CNNs learn from irrelevant features. One effective approach to identify dataset bias is to classify blank background parts of the images. However, in some situations a blank background in the images is not available, making it more difficult to separate the foreground information from the blank background. Such parts of the image can also be considered contextual learning, not necessarily bias. To overcome this, we propose two approaches that were tested on six different datasets, including natural, synthetic, and hybrid datasets. The first method involves dividing images into smaller, non-overlapping tiles of various sizes, which are then shuffled randomly, making classification more challenging. The second method involves the application of several image transforms, including Fourier, Wavelet transforms, and Median filter, and their combinations. These transforms help recover background noise information used by the CNN to classify images. Results indicate that this method can effectively distinguish between contextual information and background noise, and alert on the presence of background noise even without the need to use background information.
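The first method in the abstract, shuffling non-overlapping tiles, can be sketched as follows. The image is a plain 2D list of pixel values, and the function name, the seed parameter, and the requirement that both dimensions divide evenly by the tile size are assumptions made for this illustration.

```python
import random


def shuffle_tiles(image, tile, seed=None):
    """Split a 2D image into tile x tile blocks and return a copy with blocks shuffled."""
    height, width = len(image), len(image[0])
    if height % tile or width % tile:
        raise ValueError("image dimensions must be divisible by the tile size")
    rows, cols = height // tile, width // tile

    # Cut the image into blocks in row-major order.
    tiles = [[row[c * tile:(c + 1) * tile] for row in image[r * tile:(r + 1) * tile]]
             for r in range(rows) for c in range(cols)]

    random.Random(seed).shuffle(tiles)

    # Paste the shuffled blocks back onto an empty canvas.
    out = [[None] * width for _ in range(height)]
    for idx, block in enumerate(tiles):
        r, c = divmod(idx, cols)
        for dy in range(tile):
            out[r * tile + dy][c * tile:(c + 1) * tile] = block[dy]
    return out
```

Shuffling destroys object shape and layout while keeping local texture and noise statistics, so a classifier that remains accurate on shuffled images is likely relying on the latter.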
[152] AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Xinlong Chen,Yue Ding,Weihong Lin,Jingyun Hua,Linli Yao,Yang Shi,Bozhou Li,Yuanxing Zhang,Qiang Liu,Pengfei Wan,Liang Wang,Tieniu Tan
Main category: cs.CV
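TL;DR: AVoCaDO is an audiovisual video captioning model driven by the temporal orchestration of audio and visual modalities; two-stage training significantly improves caption quality and temporal coherence.
Details
Motivation: Audiovisual video captioning aims to generate semantically rich descriptions with temporally aligned visual and auditory events, but challenges in temporal coherence and multimodal alignment limit existing methods. Contribution: 1. AVoCaDO, an audiovisual video captioner based on temporal orchestration; 2. a two-stage training pipeline (SFT and GRPO) that improves temporal coherence and dialogue accuracy; 3. a new dataset of 107K high-quality, temporally aligned audiovisual captions.
Method: 1. AVoCaDO SFT: supervised fine-tuning on the new dataset; 2. AVoCaDO GRPO: tailored reward functions optimize temporal coherence and dialogue accuracy while regularizing caption length.
Result: Significantly outperforms existing open-source models on four audiovisual video captioning benchmarks and performs well in visual-only settings (VDC and DREAM-1K).
Insight: Temporal orchestration is crucial for multimodal captioning, and two-stage training with tailored reward functions effectively improves generation quality.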
Abstract: Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
[153] Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes
Zhao-Yang Wang,Jieneng Chen,Jiang Liu,Yuxiang Guo,Rama Chellappa
Main category: cs.CV
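TL;DR: Mesh-Gait is a novel multi-modal gait recognition framework that reconstructs 3D representations directly from 2D silhouettes, combining the strengths of 2D and 3D for efficient and robust gait recognition.
Details
Motivation: Conventional gait recognition methods (based on 2D silhouettes or skeletons) perform poorly under viewpoint changes, occlusion, and noise, while multi-modal methods (incorporating 3D information) are robust but computationally expensive. Mesh-Gait aims to address both by fusing 2D and 3D information efficiently. Contribution: Proposes the Mesh-Gait framework, the first to reconstruct 3D heatmaps as an intermediate representation, efficiently combining 2D silhouettes with 3D information while avoiding the high cost of direct 3D reconstruction.
Method: Mesh-Gait progressively reconstructs accurate 3D heatmaps under supervised learning, extracts features from them, and fuses these with 2D silhouette features. The loss is computed between the reconstructed 3D joints, virtual markers, and 3D meshes and their ground truth.
Result: Mesh-Gait achieves state-of-the-art accuracy on multi-modal gait recognition.
Insight: Intermediate 3D heatmaps are an efficient and robust representation for feature extraction, retaining 3D geometric information while avoiding complex direct 3D reconstruction.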
Abstract: Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
[154] Guided Image Feature Matching using Feature Spatial Order
Chin-Hung Teng,Ben-Jian Dong
Main category: cs.CV
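TL;DR: The paper proposes an image feature matching method that integrates feature spatial order into a progressive matching framework, estimating match probabilities and filtering out unnecessary matches to significantly improve matching efficiency and accuracy.
Details
Motivation: Conventional image feature matching is inefficient with large numbers of feature points and relies on epipolar geometry. Feature spatial order, as an independent concept, can complement epipolar geometry and improve matching efficiency. Contribution: 1. Integrates feature spatial order into a progressive matching framework; 2. combines it with epipolar geometry to further improve efficiency and accuracy; 3. proposes an image alignment method that removes the effect of image rotation.
Method: 1. Build a feature spatial order model from initially matched features; 2. use the model to compute the possible spatial range of subsequent matches and filter out unnecessary ones; 3. combine with epipolar geometry to refine matching.
Result: Experiments show the method significantly improves matching efficiency and accuracy on a standard benchmark dataset, simulated images, and real images.
Insight: Feature spatial order is an effective complement to epipolar geometry; a progressive framework and image alignment can significantly improve feature matching performance.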
Abstract: Image feature matching plays a vital role in many computer vision tasks. Although many image feature detection and matching techniques have been proposed over the past few decades, it is still time-consuming to match feature points in two images, especially for images with a large number of detected features. Feature spatial order can estimate the probability that a pair of features is correct. Since it is a completely independent concept from epipolar geometry, it can be used to complement epipolar geometry in guiding feature matching in a target region so as to improve matching efficiency. In this paper, we integrate the concept of feature spatial order into a progressive matching framework. We use some of the initially matched features to build a computational model of feature spatial order and employ it to calculate the possible spatial range of subsequent feature matches, thus filtering out unnecessary feature matches. We also integrate it with epipolar geometry to further improve matching efficiency and accuracy. Since the spatial order of feature points is affected by image rotation, we propose a suitable image alignment method from the fundamental matrix of epipolar geometry to remove the effect of image rotation. To verify the feasibility of the proposed method, we conduct a series of experiments, including a standard benchmark dataset, self-generated simulated images, and real images. The results demonstrate that our proposed method is significantly more efficient and has more accurate feature matching than the traditional method.
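The epipolar half of the guidance described in the abstract can be sketched with the standard point-to-epipolar-line distance. Given a fundamental matrix F and a point x in the first image, a correct match in the second image should lie close to the line F x. The function names and the pixel threshold are illustrative assumptions, and the paper's feature spatial order model is not reproduced here.

```python
import math


def epipolar_distance(F, x, x_prime):
    """Pixel distance from x_prime (image 2) to the epipolar line F @ x of x (image 1)."""
    xh = (x[0], x[1], 1.0)  # homogeneous coordinates
    a, b, c = (sum(F[r][k] * xh[k] for k in range(3)) for r in range(3))
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x_prime[0] + b * x_prime[1] + c) / norm


def filter_candidates(F, x, candidates, max_dist=2.0):
    """Keep only candidate matches within max_dist pixels of the epipolar line."""
    return [p for p in candidates if epipolar_distance(F, x, p) <= max_dist]
```

For a rectified stereo pair, where epipolar lines are image rows, the distance reduces to the difference in row coordinates, which makes the behavior easy to check by hand.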
[155] Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis
Zhao-Yang Wang,Zhimin Shao,Jieneng Chen,Rama Chellappa
Main category: cs.CV
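TL;DR: The paper proposes Combo-Gait, a multi-modal, multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for gait recognition and human attribute analysis, fusing features with a unified transformer and outperforming existing methods.
Details
Motivation: Existing gait recognition methods usually rely on a single modality (2D or 3D), which fails to capture the full geometric and dynamic complexity of gait and overlooks other information gait carries (such as human attributes). Contribution: 1. The multi-modal, multi-task framework Combo-Gait, combining 2D and 3D features; 2. a unified transformer for feature fusion and attribute-related representation learning; 3. validation on the BRIAR datasets.
Method: 1. Fuse 2D temporal silhouettes with 3D SMPL features; 2. use a unified transformer for multi-modal feature fusion; 3. use multi-task learning to perform gait recognition and human attribute (age, BMI, gender) estimation jointly.
Result: On the BRIAR datasets (long range, extreme pitch angles), Combo-Gait outperforms existing methods on both gait recognition and attribute estimation.
Insight: Multi-modal and multi-task learning improve the robustness of gait recognition and extract additional information from gait (such as human attributes), offering a new approach to gait analysis in real-world scenarios.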
Abstract: Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
[156] Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling
Jyotirmay Nag Setu,Kevin Desai,John Quarles
Main category: cs.CV
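TL;DR: The paper proposes a method based on transfer learning and temporal modeling that predicts cybersickness severity from VR gameplay videos, providing a practical tool for assessing and mitigating discomfort in VR environments.
Details
Motivation: Virtual reality (VR) is spreading rapidly across many domains, but cybersickness hinders its wider adoption. Existing multimodal deep learning methods rely on VR sensor data, while the potential of video data remains underexplored. Contribution: Fills the gap in video-based cybersickness prediction by combining transfer learning (InceptionV3) with temporal modeling (LSTM), improving classification accuracy (68.4%).
Method: Two steps: 1) extract high-level visual features from the videos with an InceptionV3 model pretrained on ImageNet; 2) capture temporal dynamics with an LSTM network to predict cybersickness severity.
Result: The method reaches 68.4% classification accuracy on video data, outperforming existing models trained solely on video data.
Insight: The study shows the potential of video data for cybersickness prediction and lays the groundwork for future video-based temporal modeling, helping improve the VR user experience.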
Abstract: With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.
[157] Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
Suyang Xi,Chenxi Yang,Hong Ding,Yiqing Ni,Catherine C. Liu,Yunhao Liu,Chengqi Zhang
Main category: cs.CV
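TL;DR: The paper proposes HuLiRAG, a framework that mimics how humans process visual information (a "what-where-reweight" cascade) to address hallucination by multimodal large language models in fine-grained visual question answering. It combines open-vocabulary detection with spatial resolution, significantly improving reliability and consistency.
Details
Motivation: Existing multimodal large language models (MLLMs) are prone to hallucination in fine-grained visual question answering, for example misidentifying object identities, positions, and relations. Retrieval-augmented generation (RAG) partly addresses this but does not mirror how humans process global and local information, which limits reasoning. Contribution: HuLiRAG introduces a "what-where-reweight" cascade that stages multimodal reasoning in three steps: open-vocabulary detection (what), spatial resolution with SAM masks (where), and prioritization by local versus global alignment (reweight). This design significantly improves the accuracy and reliability of fine-grained visual question answering.
Method: A three-step cascade: 1) anchor queries to candidate objects via open-vocabulary detection (what); 2) resolve spatial information with SAM masks (where); 3) adaptively weight local against global alignment (reweight). Mask-guided fine-tuning additionally injects spatial evidence into generation, imposing an explicit constraint on answers.
Result: Experiments show HuLiRAG significantly reduces hallucination in fine-grained visual question answering while improving factual consistency and grounding accuracy. The method achieves strong performance on multimodal question answering.
Insight: The core insight is that mimicking the hierarchical, global-to-local way humans process visual information effectively addresses MLLM hallucination in fine-grained tasks. The design improves reliability and offers a new methodology for multimodal reasoning.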
Abstract: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a "what-where-reweight" cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
[158] MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation
Kangjian Zhu,Haobo Jiang,Yigong Zhang,Jianjun Qian,Jian Yang,Jin Xie
Main category: cs.CV
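TL;DR: MonoSE(3)-Diffusion proposes a monocular SE(3) diffusion framework that solves markerless robot pose estimation as a conditional denoising diffusion process, significantly improving performance.
Details
Motivation: Existing pose estimation methods typically use fixed-scale perturbations, which limits pose diversity and robustness. This paper uses a diffusion model to generate more diverse training poses and a timestep-aware progressive refinement strategy to improve prediction accuracy. Contribution: 1. A visibility-constrained diffusion process that keeps pose perturbations within the camera field of view, improving generalization. 2. A timestep-aware reverse process that progressively refines pose estimates, improving robustness and accuracy.
Method: 1. Diffusion process: generates diverse training poses under visibility constraints. 2. Reverse process: a timestep-aware denoising network progressively refines pose estimates following a coarse-to-fine schedule.
Result: Performs strongly on the DREAM and RoboKeyGen benchmarks, reaching an AUC of 66.75 on the most challenging dataset, a 32.3% gain over the state of the art.
Insight: The timestep awareness of the diffusion model captures multi-scale information for pose estimation, improving performance and robustness, particularly in complex scenes.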
Abstract: We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses to noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network's generalization capability. Furthermore, the reverse process iteratively predicts the poses with the denoising network and refines pose estimates by sampling from the diffusion posterior of the current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scales, which guide the denoising network to achieve more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.
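The idea of a visibility-constrained perturbation can be illustrated with a toy sketch that handles only the translation part of a pose. Noise grows with the timestep, and a sample is accepted only if the perturbed point still projects inside the image under a pinhole camera. Rejection sampling, the linear noise schedule, the intrinsics dictionary, and all function names are assumptions for illustration; the paper's actual SE(3) diffusion, including rotations, is not reproduced.

```python
import random


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def in_view(point, fx, fy, cx, cy, width, height):
    """True if the point is in front of the camera and lands inside the image."""
    if point[2] <= 0.0:
        return False
    u, v = project(point, fx, fy, cx, cy)
    return 0.0 <= u < width and 0.0 <= v < height


def perturb_translation(t, step, num_steps, max_sigma, intrinsics, rng, max_tries=100):
    """Add timestep-scaled Gaussian noise to t, resampling until the result is in view."""
    sigma = max_sigma * step / num_steps  # assumed linear schedule
    for _ in range(max_tries):
        candidate = tuple(c + rng.gauss(0.0, sigma) for c in t)
        if in_view(candidate, **intrinsics):
            return candidate
    return t  # fall back to the unperturbed translation
```

A usage call would pass a hypothetical intrinsics dictionary such as `{"fx": 500.0, "fy": 500.0, "cx": 320.0, "cy": 240.0, "width": 640, "height": 480}` together with a seeded `random.Random` instance for reproducibility.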
[159] On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection
Tai Le-Gia,Ahn Jaehyun
Main category: cs.CV
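TL;DR: The paper proposes the CoDeGraph algorithm, which builds an image-level graph and uses community detection to filter consistent anomalies, significantly improving zero-shot industrial anomaly detection.
Details
Motivation: Existing representation-based methods perform poorly on recurring similar defects (consistent anomalies), which hurts anomaly classification and segmentation. Contribution: Proposes the CoDeGraph algorithm and a theoretical explanation of the "neighbor-burnout" phenomenon, significantly improving zero-shot anomaly detection.
Method: Builds an image-level graph, identifies and filters consistent anomalies, and uses community detection to improve similarity computation.
Result: On MVTec AD, AC reaches 98.3% AUROC, and AS improves by 4.2% in F1 and 5.4% in AP.
Insight: The difference in similarity behavior between normal and anomalous patches (stable and gradual vs. abrupt spikes) offers a new approach to the consistent anomaly problem.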
Abstract: Zero-shot image anomaly classification (AC) and segmentation (AS) are vital for industrial quality control, detecting defects without prior training data. Existing representation-based methods compare patch features with nearest neighbors in unlabeled test images but struggle with consistent anomalies (similar defects recurring across multiple images), resulting in poor AC/AS performance. We introduce Consistent-Anomaly Detection Graph (CoDeGraph), a novel algorithm that identifies and filters consistent anomalies from similarity computations. Our key insight is that normal patches in industrial images show stable, gradually increasing similarity to other test images, while consistent-anomaly patches exhibit abrupt similarity spikes after exhausting a limited set of similar matches, a phenomenon we term "neighbor-burnout." CoDeGraph constructs an image-level graph, with images as nodes and edges connecting those with shared consistent-anomaly patterns, using community detection to filter these anomalies. We provide a theoretical foundation using Extreme Value Theory to explain the effectiveness of our approach. Experiments on MVTec AD with the ViT-L-14-336 backbone achieve 98.3% AUROC for AC and AS performance of 66.8% (+4.2%) F1 and 68.1% (+5.4%) AP over state-of-the-art zero-shot methods. Using the DINOv2 backbone further improves segmentation, yielding 69.1% (+6.5%) F1 and 71.9% (+9.2%) AP, demonstrating robustness across architectures.
[160] Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation
Chen Zhong,Yuxuan Yang,Xinyue Zhang,Ruohan Ma,Yong Guo,Gang Li,Jupeng Li
Main category: cs.CV
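TL;DR: The paper proposes a new medical image segmentation framework that simulates the decision-making process of a clinical panel and uses expert disagreement to produce more robust segmentations.
Details
Motivation: Medical image segmentation annotations exhibit inter-rater variability (IRV), and simply averaging annotations, as conventional methods do, discards valuable clinical uncertainty. The paper aims to use this disagreement to improve segmentation robustness and trustworthiness. Contribution: 1) A framework that simulates clinical panel decision-making; 2) an Expert Signature Generator (ESG) and a Simulated Consultation Module (SCM), which respectively learn expert styles and generate the final segmentation; 3) SOTA results on CBCT and MRI datasets.
Method: 1) The ESG learns each expert's distinctive segmentation style and represents it in a latent space; 2) the SCM simulates expert consultation by sampling from the latent space to generate the final segmentation.
Result: Reaches Dice scores of 92.11% and 90.72% on the CBCT and MRI datasets, respectively, outperforming conventional methods.
Insight: Treating expert disagreement as a useful signal rather than noise offers a more robust and trustworthy path for medical AI.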
Abstract: Medical image segmentation annotation suffers from inter-rater variability (IRV) due to differences in annotators’ expertise and the inherent blurriness of medical images. Standard approaches that simply average expert labels are flawed, as they discard the valuable clinical uncertainty revealed in disagreements. We introduce a fundamentally new approach with our group decision simulation framework, which works by mimicking the collaborative decision-making process of a clinical panel. Under this framework, an Expert Signature Generator (ESG) learns to represent individual annotator styles in a unique latent space. A Simulated Consultation Module (SCM) then intelligently generates the final segmentation by sampling from this space. This method achieved state-of-the-art results on challenging CBCT and MRI datasets (92.11% and 90.72% Dice scores). By treating expert disagreement as a useful signal instead of noise, our work provides a clear path toward more robust and trustworthy AI systems for healthcare.
[161] Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment
Junhao Dong,Dejia Liu,Ruiqi Ding,Zongxing Chen,Yingjie Huang,Zhu Meng,Jianbo Zhao,Zhicheng Zhao,Fei Su
Main category: cs.CV
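TL;DR: The paper presents the MultiTIPS dataset and a multimodal prognostic framework for comprehensive assessment after TIPS, covering survival, complication, and portal pressure prediction. The framework comprises three modules (dual-option segmentation, multimodal interaction, and multi-task prediction) and addresses the shortcomings of current methods in annotation efficiency, generalizability, and multi-endpoint prediction.
Details
Motivation: Prognostic assessment for TIPS suffers from high annotation cost, poor generalizability of unimodal methods, and incomplete single-endpoint prediction; a public dataset and a more reliable prediction framework are needed.
Contribution: (1) MultiTIPS, the first public multi-center dataset for TIPS; (2) a multimodal prognostic framework with three modules: dual-option segmentation, multimodal interaction, and multi-task prediction; (3) three techniques, MGRA, POD, and CGPE, that improve model accuracy and generalizability.
Method: (1) Dual-option segmentation combines semi-supervised and foundation-model-based segmentation to reduce annotation requirements; (2) multimodal interaction uses MGRA, POD, and CGPE for cross-modal feature interaction; (3) multi-task prediction uses a staged training strategy to jointly optimize survival, PPG, and OHE prediction.
Result: On MultiTIPS, the proposed method outperforms existing approaches and shows strong generalization and interpretability.
Insight: Multimodal interaction and staged multi-task training are key to improving prognostic model performance. The public dataset and framework are an important resource for clinical research.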
Abstract: Transjugular intrahepatic portosystemic shunt (TIPS) is an established procedure for portal hypertension, but provides variable survival outcomes and frequent overt hepatic encephalopathy (OHE), indicating the necessity of accurate preoperative prognostic modeling. Current studies typically build machine learning models from preoperative CT images or clinical characteristics, but face three key challenges: (1) labor-intensive region-of-interest (ROI) annotation, (2) poor reliability and generalizability of unimodal methods, and (3) incomplete assessment from single-endpoint prediction. Moreover, the lack of publicly accessible datasets constrains research in this field. Therefore, we present MultiTIPS, the first public multi-center dataset for TIPS prognosis, and propose a novel multimodal prognostic framework based on it. The framework comprises three core modules: (1) dual-option segmentation, which integrates semi-supervised and foundation model-based pipelines to achieve robust ROI segmentation with limited annotations and facilitate subsequent feature extraction; (2) multimodal interaction, where three techniques, multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE), are introduced to enable cross-modal feature interaction and complementary representation integration, thus improving model accuracy and robustness; and (3) multi-task prediction, where a staged training strategy is used to perform stable optimization of survival, portal pressure gradient (PPG), and OHE prediction for comprehensive prognostic assessment. Extensive experiments on MultiTIPS demonstrate the superiority of the proposed method over state-of-the-art approaches, along with strong cross-domain generalization and interpretability, indicating its promise for clinical application. The dataset and code are available.
[162] When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Jinjin Cao,Zhiyang Chen,Zijun Wang,Liyuan Ma,Weijian Luo,Guojun Qi
Main category: cs.CV
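TL;DR: The paper proposes Cross-Modal Guidance (CMG), a training-free decoding method that tackles hallucination in VLMs by contrasting the model with a copy whose visual-language attention is degraded, significantly reducing language bias without harming model capability.
Details
Motivation: Existing VLMs tend to generate fluent responses that are unrelated to the image (hallucination). The paper analyzes how language bias causes this problem and proposes a remedy.
Contribution: CMG, which adaptively masks the attention weights of the most influential image tokens to emphasize visual context awareness, reducing hallucination without additional training.
Method: Contrast the output distributions of the original model and a model with degraded visual-language attention, where the degradation adaptively masks the attention weights of key image tokens.
Result: Experiments show that CMG significantly improves the performance of different VLMs on hallucination benchmarks at no additional training cost.
Insight: Adjusting the attention mechanism can suppress language bias while preserving the visual-language understanding of VLMs.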
Abstract: Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are fluent in language but irrelevant to the images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. Comprehensive experiments demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also quantitatively show that CMG can improve the performance of different VLMs on hallucination-specific benchmarks and generalize effectively.
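The idea of decoding from the difference between two output distributions can be sketched as follows. The abstract does not give the exact combination rule, so the formula below, (1 + alpha) * full - alpha * degraded on the logits, is the common contrastive-decoding form and an assumption here; `guided_distribution` and `alpha` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def guided_distribution(logits_full, logits_degraded, alpha=1.0):
    """Contrast the full model against the attention-degraded model.

    Assumed rule (not confirmed by the abstract): tokens whose logit drops
    when visual attention is degraded are boosted, since they depend on the image.
    """
    combined = [(1 + alpha) * f - alpha * d
                for f, d in zip(logits_full, logits_degraded)]
    return softmax(combined)
```

With `alpha=0` the function returns the ordinary softmax of the full model, so the guidance strength can be tuned from that baseline.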
[163] DAGLFNet: Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation
Chuang Chen,Wenyi Ge
Main category: cs.CV
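TL;DR: DAGLFNet is a pseudo-image-based semantic segmentation framework that improves point cloud segmentation through global-local feature fusion and multi-branch feature extraction, with a deep feature-guided attention mechanism to refine feature fusion.
Details
Motivation: Semantic segmentation of point clouds is essential for high-precision mapping and autonomous navigation, but existing methods fall short in feature fusion and discriminability; a more efficient, better-performing method is needed.
Contribution: DAGLFNet, comprising a Global-Local Feature Fusion Encoding module, a Multi-Branch Feature Extraction network, and a deep feature-guided attention mechanism, which together improve point cloud segmentation performance.
Method: The Global-Local Feature Fusion Encoding module and the Multi-Branch Feature Extraction network capture local and global features, and the deep feature-guided attention mechanism refines cross-channel feature fusion.
Result: 69.83% and 78.65% on the SemanticKITTI and nuScenes validation sets, respectively, while balancing high performance with real-time capability.
Insight: Pseudo-image representations hold promise for point cloud segmentation, and combining global-local feature fusion with attention can substantially improve performance.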
Abstract: Environmental perception systems play a critical role in high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor that provides accurate 3D point cloud data. How to efficiently process unstructured point clouds while extracting structured semantic information remains a significant challenge, and in recent years, numerous pseudo-image-based representation methods have emerged to achieve a balance between efficiency and performance. However, they often overlook the structural and semantic details of point clouds, resulting in limited feature fusion and discriminability. In this work, we propose DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. First, the Global-Local Feature Fusion Encoding module is used to enhance the correlation among local features within a set and capture global contextual information. Second, the Multi-Branch Feature Extraction network is employed to capture more neighborhood information and enhance the discriminability of contour features. Finally, a Feature Fusion via Deep Feature-guided Attention mechanism is introduced to improve the precision of cross-channel feature fusion. Experimental evaluations show that DAGLFNet achieves 69.83% and 78.65% on the validation sets of SemanticKITTI and nuScenes, respectively. The method balances high performance with real-time capability, demonstrating great potential for LiDAR-based real-time applications.
[164] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition
Deng Li,Jun Shao,Bohao Xing,Rong Gao,Bihan Wen,Heikki Kälviäinen,Xin Liu
Main category: cs.CV
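TL;DR: The paper proposes MSF-Mamba, an efficient model for micro-gesture recognition (MGR) that introduces a motion-aware state fusion module and a multiscale variant (MSF-Mamba+), markedly improving the modeling of local spatiotemporal dependencies.
Details
Motivation: MGR requires accurate modeling of both long-range and local spatiotemporal dependencies, and existing approaches each have limits: CNNs struggle with long-range dependencies, and Transformers are computationally expensive. Mamba is efficient but lacks local spatiotemporal modeling.
Contribution: MSF-Mamba, which strengthens local spatiotemporal modeling and motion awareness through a motion-aware state fusion module (based on central frame difference) and a multiscale design (MSF-Mamba+).
Method: Builds on the state space model (SSM)-based Mamba framework, adding a motion-aware state fusion module and a multiscale fusion strategy that dynamically weights states across scales.
Result: On two public MGR datasets, both MSF-Mamba and MSF-Mamba+ achieve state-of-the-art performance, surpassing CNN, Transformer, and SSM baselines while remaining efficient.
Insight: Strengthening Mamba's local spatiotemporal modeling, combined with motion awareness and dynamic multiscale weighting, makes it possible to capture the subtle motion cues of micro-gestures efficiently.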
Abstract: Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a motion-aware design, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion Mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba+ supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba+ to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
[165] Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng,Guangyi Chen,Tianpei Gu,Lingjing Kong,Yan Li,Zeyu Tang,Kun Zhang
Main category: cs.CV
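TL;DR: The paper proposes a self-refinement framework based on a Triangular Consistency principle and validates that vision-language models (VLMs) can generate high-quality supervised data themselves and learn autonomously without external input.
Details
Motivation: To explore the potential of VLMs trained without supervised instruction data and to validate their self-refinement ability, reducing reliance on external supervision.
Contribution: (1) A self-refinement framework based on the Triangular Consistency principle; (2) evidence that VLMs can generate and filter their own supervision data; (3) a theoretical analysis explaining the self-refinement mechanism.
Method: (1) Multi-task instruction tuning to enable generation of image-question-answer triplets; (2) filtering of low-quality data with the Triangular Consistency principle; (3) updating the model on the filtered data.
Result: Experiments show consistent improvements across multiple benchmarks without external supervision.
Insight: VLMs have an inherent self-refinement ability; future work can further explore their learning mechanism.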
Abstract: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.
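The filtering step (2) can be illustrated with a small sketch: mask one corner of the image-query-answer triangle, regenerate it from the other two, and keep the triplet only if the regenerated elements agree with the originals. This is not the authors' implementation; `regen_question`, `regen_answer`, the token-overlap similarity, and the threshold are all hypothetical stand-ins for model calls and a scoring rule the abstract does not specify.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets; a crude stand-in for a real similarity score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def triangular_filter(triplets, regen_question, regen_answer, threshold=0.5):
    """Keep (image, question, answer) triplets whose masked elements are reconstructed consistently.

    regen_question(image, answer) and regen_answer(image, question) are
    hypothetical callables standing in for the VLM.
    """
    kept = []
    for image, question, answer in triplets:
        q_hat = regen_question(image, answer)   # question masked, rebuilt from image + answer
        a_hat = regen_answer(image, question)   # answer masked, rebuilt from image + question
        score = min(token_overlap(question, q_hat), token_overlap(answer, a_hat))
        if score >= threshold:
            kept.append((image, question, answer))
    return kept
```

Taking the minimum of the two scores means a triplet survives only if every reconstruction is consistent, which matches the "any masked elements" wording of the principle.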
[166] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Qunzhong Wang,Jie Liu,Jiajun Liang,Yilei Jiang,Yuanxing Zhang,Jinyuan Chen,Yaozhi Zheng,Xintao Wang,Pengfei Wan,Xiangyu Yue,Jiaheng Liu
Main category: cs.CV
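TL;DR: VR-Thinker is a framework that strengthens video reward models through thinking-with-image reasoning, addressing the limitations of conventional reward models in handling visual inputs and markedly improving evaluation accuracy on long videos.
Details
Motivation: Existing multimodal reward models (RMs) have two main problems with visual inputs: the inputs consume much of the context budget, causing loss of detail, and packing all visual information into the initial prompt aggravates hallucination and forgetting.
Contribution: The VR-Thinker framework, which uses visual reasoning operations (e.g., select frame) and a configurable visual memory window so that the RM can actively acquire and update visual evidence, improving reasoning fidelity and reliability.
Method: Visual reasoning is activated through a reinforcement fine-tuning pipeline: a cold start on visual chain-of-thought data, rejection sampling fine-tuning on selected high-quality samples, and finally Group Relative Policy Optimization (GRPO) to strengthen reasoning.
Result: State-of-the-art accuracy among open-source models on several video preference benchmarks; for example, a 7B VR-Thinker reaches 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video.
Insight: Thinking-with-image reasoning significantly improves multimodal reward models, especially on long videos.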
Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
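Stage (iii) relies on GRPO, whose central step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization, under the assumption of a population standard deviation and a small `eps` for numerical safety; it is not the authors' training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts: A_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.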
[167] Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks
Xi Zhang,Xiaolin Wu
Main category: cs.CV
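TL;DR: The paper proposes a new look-up table (LUT) method that expands the receptive field of convolution kernels and introduces optimization techniques, improving CNN inference performance without increasing table size.
Details
Motivation: Existing LUT methods have limited receptive fields (due to the combinatorial explosion of table size) and therefore perform poorly on high-level tasks. The work aims to expand the receptive field while keeping space complexity unchanged.
Contribution: A method for learning an optimal lattice vector quantizer that adaptively allocates quantization resolution, plus irregular dilated convolutions and a U-shaped cascaded LUT structure that expand the receptive field and capture multi-level contextual information.
Method: Learn a lattice vector quantizer to improve the quantized approximation of CNN kernels, and introduce irregular dilated convolutions and a U-shaped cascaded LUT structure to expand the receptive field.
Result: The method balances speed, accuracy, and memory efficiency, and significantly outperforms existing LUT methods.
Insight: Optimizing the quantization strategy and capturing multi-level context can substantially improve the efficiency and performance of CNN inference at a fixed table size.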
Abstract: Recently, several look-up table (LUT) methods were developed to greatly expedite the inference of CNNs in a classical strategy of trading space for speed. However, these LUT methods suffer from a common drawback of limited receptive field of the convolution kernels due to the combinatorial explosion of table size. This research aims to expand the CNN receptive field with a fixed table size, thereby enhancing the performance of LUT-driven fast CNN inference while maintaining the same space complexity. To achieve this goal, various techniques are proposed. The main contribution is a novel approach of learning an optimal lattice vector quantizer that adaptively allocates the quantization resolution across data dimensions based on their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than scalar quantizer as used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.
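The combinatorial explosion mentioned above follows from simple counting: a table indexed by n quantized inputs with q levels each needs q^n entries. The sketch below makes that arithmetic concrete; the specific level counts and byte sizes in the usage note are illustrative assumptions, not figures from the paper.

```python
def lut_entries(levels_per_input, num_inputs):
    """Number of table entries when each of num_inputs indices takes levels_per_input values."""
    return levels_per_input ** num_inputs

def lut_bytes(levels_per_input, num_inputs, outputs_per_entry=1, bytes_per_output=1):
    """Storage needed if every entry holds outputs_per_entry values of bytes_per_output bytes."""
    return lut_entries(levels_per_input, num_inputs) * outputs_per_entry * bytes_per_output
```

For example, `lut_entries(256, 4)` (a 2x2 receptive field over full 8-bit inputs) already equals 2^32, while going from 4 to 9 inputs multiplies the entry count by q^5, which is why expanding the receptive field at a fixed table size requires smarter quantization.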
[168] Unified Open-World Segmentation with Multi-Modal Prompts
Yang Liu,Yufei Yin,Chenchen Jing,Muzhi Zhu,Hao Chen,Yuling Xi,Bo Feng,Hao Wang,Shiyu Li,Chunhua Shen
Main category: cs.CV
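TL;DR: COSINE is a unified open-world segmentation model that combines open-vocabulary segmentation and in-context segmentation, using multi-modal prompts (e.g., text and image).
Details
Motivation: Existing open-vocabulary and in-context segmentation methods differ in architecture and learning objectives, which limits their generality and performance.
Contribution: COSINE, which unifies open-world segmentation with multi-modal prompts, using foundation models to extract representations and a SegDecoder to align them and model their interaction.
Method: Foundation models extract representations of the image and the multi-modal prompts; a SegDecoder aligns these representations and models their interaction to produce masks for prompts of different granularities.
Result: Experiments show significant performance gains on both open-vocabulary and in-context segmentation, and the synergy of multi-modal prompts outperforms single-modality approaches.
Insight: Combining visual and textual prompts significantly improves generalization, and a unified framework overcomes the limitations of earlier separate pipelines.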
Abstract: In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.
[169] Layout-Independent License Plate Recognition via Integrated Vision and Language Models
Elham Shabaninia,Fatemeh Asadi-zeydabadi,Hossein Nezamabadi-pour
Main category: cs.CV
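TL;DR: Proposes a layout-independent license plate recognition method that achieves high accuracy by combining vision and language models.
Details
Motivation: Conventional license plate recognition relies on manual layout classification or heuristic corrections and struggles with diverse plate layouts and challenging real-world conditions.
Contribution: A unified recognition framework combining a vision transformer with an iterative language model, which learns the structural patterns and formatting rules of license plates without explicit classification or correction.
Method: The system consists of a high-precision detection network followed by a recognition stage that integrates a vision transformer and an iterative language model, performing character recognition and post-OCR refinement in a single process.
Result: Higher accuracy and robustness on multiple international datasets, outperforming recent segmentation-free approaches.
Insight: Embedding pattern analysis in the recognition stage and combining visual and language modelling improves the adaptability and robustness of license plate recognition.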
Abstract: This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.
[170] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates
Binyu Zhao,Wei Zhang,Zhaonian Zou
Main category: cs.CV
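TL;DR: The paper proposes MCE, a general framework for handling missing modalities under imbalanced missing rates. MCE dynamically balances per-modality learning progress and strengthens feature representations, yielding significant performance gains.
Details
Motivation: In multi-modal learning, imbalanced missing rates leave some modalities under-trained and their features degraded; existing methods overlook sample-level modality utility and declining feature quality.
Contribution: MCE, with two synergistic components, Learning Capability Enhancement (LCE) and Representation Capability Enhancement (RCE), which dynamically balance learning progress and improve feature quality.
Method: LCE uses multi-level factors to dynamically balance modality-specific learning progress; RCE improves feature semantics and robustness through subset prediction and cross-modal completion tasks.
Result: On four multi-modal benchmarks, MCE outperforms existing methods under various missing configurations.
Insight: Dynamically balancing modality learning progress and strengthening feature representations are key to handling imbalanced missing rates.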
Abstract: Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.
[171] GLOFNet – A Multimodal Dataset for GLOF Monitoring and Prediction
Zuha Fatima,Muhammad Anser Sohaib,Muhammad Talha,Sidra Sultana,Ayesha Kanwal,Nazia Perwaiz
Main category: cs.CV
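TL;DR: GLOFNet is a multimodal dataset designed to support research on monitoring and predicting glacial lake outburst floods (GLOFs). It integrates Sentinel-2 multispectral imagery, NASA ITS_LIVE glacier velocity data, and MODIS land surface temperature records, and addresses data fragmentation and unimodality through preprocessing and multimodal harmonization.
Details
Motivation: GLOFs are rare but highly destructive high-mountain hazards, yet predictive research is limited by fragmented, unimodal data; a unified multimodal dataset is needed to support forecasting.
Contribution: GLOFNet, a public multimodal dataset that integrates spatial monitoring, glacier kinematics, and land surface temperature data, providing a structured benchmark for rare hazard prediction.
Method: The dataset combines Sentinel-2, ITS_LIVE, and MODIS data, harmonized after preprocessing with cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding.
Result: Analysis reveals seasonal cycles in glacier velocity, a long-term warming trend (about 0.8 K per decade), and spatial heterogeneity in cryospheric conditions.
Insight: By addressing class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a foundation for applying multimodal deep learning to rare hazard prediction.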
Abstract: Glacial Lake Outburst Floods (GLOFs) are rare but destructive hazards in high mountain regions, yet predictive research is hindered by fragmented and unimodal data. Most prior efforts emphasize post-event mapping, whereas forecasting requires harmonized datasets that combine visual indicators with physical precursors. We present GLOFNet, a multimodal dataset for GLOF monitoring and prediction, focused on the Shisper Glacier in the Karakoram. It integrates three complementary sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records spanning over two decades. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding, followed by harmonization across modalities. Exploratory analysis reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. The resulting dataset, GLOFNet, is publicly available to support future research in glacial hazard prediction. By addressing challenges such as class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction.
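Of the preprocessing steps listed, cyclical encoding is the least self-explanatory: it maps a periodic quantity such as day of year onto a circle so that the end of one cycle sits next to the start of the next. The following is a generic sketch of the technique, not the dataset's actual pipeline; the choice of a 365-day period in the usage note is an assumption.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. day of year) to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)
```

With `period=365`, day 364 and day 0 receive nearly identical encodings, whereas a raw day index would place them at opposite ends of the range.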
[172] Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
Shizhen Zhao,Jiahui Liu,Xin Wen,Haoru Tan,Xiaojuan Qi
Main category: cs.CV
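TL;DR: The paper studies the potential of pre-trained vision foundation models (such as DINOv2) for OOD detection and proposes a Mixture of Feature Experts (MoFE) module and a Dynamic-β Mixup strategy to improve performance in large semantic spaces.
Details
Motivation: Pre-trained vision foundation models excel at many computer vision tasks, but their potential for OOD detection is underexplored. The paper investigates their OOD detection ability and proposes improvements for large semantic spaces.
Contribution: (1) Finding that a pre-trained DINOv2 model performs strongly on OOD detection, reaching performance comparable to the state of the art without complex designs. (2) The MoFE module, which refines decision boundaries by partitioning features into subspaces. (3) The Dynamic-β Mixup strategy, which adapts mixing weights to per-category learning difficulty.
Method: (1) Systematically evaluate DINOv2 and other foundation models on OOD detection. (2) Design the MoFE module, which partitions the feature space into subspaces to capture complex data distributions. (3) Propose Dynamic-β Mixup, which samples interpolation weights from a dynamic beta distribution to improve feature learning.
Result: Experiments show that MoFE and Dynamic-β Mixup significantly improve OOD detection in large semantic spaces and outperform baseline methods.
Insight: (1) Pre-trained foundation models can be used directly for OOD detection without fine-tuning. (2) In large semantic spaces, decision boundaries become more complex and benefit from feature partitioning and dynamic mixing.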
Abstract: Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.
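The Dynamic-β Mixup idea, sampling the interpolation weight from a beta distribution whose shape depends on category difficulty, can be sketched as below. The abstract does not say how difficulty maps to the distribution's parameters, so `dynamic_beta_params` (including its direction: harder classes get weights closer to 1) is a hypothetical mapping chosen only for illustration.

```python
import random

def dynamic_beta_params(difficulty, base=1.0, scale=4.0):
    """Hypothetical mapping from a difficulty score in [0, 1] to Beta(a, b) parameters.

    Assumption: harder classes get a larger first parameter, which shifts the
    sampled weight toward 1 so that the hard sample dominates the mix.
    """
    d = min(max(difficulty, 0.0), 1.0)
    return base + scale * d, base

def mixup(x_i, x_j, difficulty_i, rng=random):
    """Standard mixup, lam * x_i + (1 - lam) * x_j, with lam drawn from the dynamic beta."""
    a, b = dynamic_beta_params(difficulty_i)
    lam = rng.betavariate(a, b)
    mixed = [lam * u + (1.0 - lam) * v for u, v in zip(x_i, x_j)]
    return mixed, lam
```

In this sketch a difficulty of 0 reduces to Beta(1, 1), the uniform weighting, so the static case is recovered as a special case.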
[173] A Simple and Better Baseline for Visual Grounding
Jingchao Wang,Wenlong Zhang,Dingjiang Huang,Hong Wang,Yefeng Zheng
Main category: cs.CV
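TL;DR: The paper proposes FSVG, a simple yet effective method for visual grounding that uses a feature selection mechanism to cut computational overhead while maintaining high accuracy.
Details
Motivation: Existing visual grounding methods perform well but rely on complicated iterative procedures and caching, which incur heavy computational overhead. The paper aims to design a simpler baseline that reduces this burden.
Contribution: FSVG, a feature selection-based baseline for visual grounding that processes the linguistic and visual modalities in parallel, avoids complicated iterative procedures, and introduces a similarity-based feature selection mechanism to lower computational cost.
Method: FSVG integrates the linguistic and visual modalities directly into one network architecture, uses language in parallel to guide visual feature extraction, and applies similarity-based feature selection to retain only language-related visual features.
Result: Experiments on several benchmark datasets show that FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods.
Insight: Simplifying the network architecture and adding feature selection can improve both the efficiency and the performance of visual grounding.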
Abstract: Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task, which involves linguistic and visual modalities, a recent line of research selects only the linguistically relevant visual regions for object localization to reduce computational overhead. Despite its impressive performance, this approach runs iteratively over different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify implementation, we propose FSVG, a simple yet effective feature selection-based baseline for visual grounding. Specifically, we directly encapsulate the linguistic and visual modalities into a single network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate interaction between the linguistic and visual modalities and extract effective visual features. Furthermore, to reduce computational cost, we introduce a similarity-based feature selection mechanism during visual feature learning that exploits only language-related visual features for faster prediction. Extensive experiments on several benchmark datasets show that FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
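A similarity-based feature selection step of the kind described can be sketched as scoring each visual token against a language feature and keeping the best-matching fraction. The cosine score, the single pooled `text_feature`, and the `keep_ratio` parameter are assumptions for illustration; the abstract does not specify FSVG's actual scoring or selection rule.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def select_visual_tokens(visual_tokens, text_feature, keep_ratio=0.5):
    """Keep the visual tokens most similar to the text feature.

    Returns the kept indices (in their original order) and the kept tokens.
    """
    k = max(1, int(len(visual_tokens) * keep_ratio))
    ranked = sorted(range(len(visual_tokens)),
                    key=lambda i: cosine(visual_tokens[i], text_feature),
                    reverse=True)
    kept = sorted(ranked[:k])  # restore spatial order for downstream layers
    return kept, [visual_tokens[i] for i in kept]
```

Later layers then operate on roughly `keep_ratio` of the tokens, which is where the computational saving would come from.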
[174] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu,Liangyu Chen,Jiazhen Liu,Mingkang Zhu,Zhisheng Zhong,Bei Yu,Jiaya Jia
Main category: cs.CV
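TL;DR: ViSurf is a unified post-training paradigm that combines the strengths of supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR); by injecting ground-truth labels and introducing reward control strategies, it significantly improves the performance of large vision-and-language models.
Details
Motivation: SFT and RLVR each have limitations: SFT often yields sub-optimal performance, and RLVR struggles with tasks beyond the model's internal knowledge. ViSurf aims to combine their strengths in a more effective post-training scheme.
Contribution: (1) ViSurf, which unifies the SFT and RLVR training objectives; (2) injection of ground-truth labels into RLVR rollouts, providing external supervision and internal reinforcement simultaneously; (3) three reward control strategies to improve training.
Method: ViSurf derives a unified objective by analyzing the SFT and RLVR objectives and injects ground-truth labels into the RLVR rollouts. Three reward control strategies are designed to stabilize training.
Result: Across multiple benchmarks, ViSurf outperforms SFT alone, RLVR alone, and two-stage SFT→RLVR, validating its design.
Insight: Combining external supervision with internal reinforcement is an effective way to improve model performance, and reward control strategies are important for training stability.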
Abstract: Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, which outperforms individual SFT, RLVR, and two-stage SFT → RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
[175] OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment
Yiting Lu,Fengbin Guan,Yixin Gao,Yan Zhong,Xinge Peng,Jiakang Yuan,Yihao Liu,Bo Zhang,Xin Li,Zhibo Chen,Weisi Lin
Main category: cs.CV
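TL;DR: The paper proposes OmniQuality-R, a unified reward modeling framework that turns multi-task quality reasoning into continuous, interpretable reward signals for policy optimization. Drawing on the design of subjective experiments, it builds a reasoning-enhanced reward modeling dataset and applies GRPO for post-training; STD filtering and entropy gating stabilize training and improve downstream generalization.
Details
Motivation: Current visual evaluation methods are usually limited to a single task, and a unified multi-task quality assessment framework is lacking.
Contribution: OmniQuality-R, a unified modeling framework that produces continuous reward signals through multi-task reasoning, together with GRPO, STD filtering, and entropy gating to improve policy training.
Method: (1) Build a reasoning-enhanced reward modeling dataset; (2) apply GRPO for post-training; (3) introduce STD filtering and entropy gating to stabilize training.
Result: Evaluated on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
Insight: Multi-task reasoning combined with continuous reward signals improves the stability and generalization of policy optimization.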
Abstract: Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. The design is inspired by subjective experiments, where participants are given task-specific instructions outlining distinct assessment principles prior to evaluation. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
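A Gaussian-based reward gives partial credit to predictions that land near the target score, which is what makes it suitable for continuous score prediction. The sketch below shows one plausible form; the exact formula and the width `sigma` are assumptions, as the abstract names the reward type but not its parameters.

```python
import math

def gaussian_reward(predicted, target, sigma=0.5):
    """Reward in (0, 1] that decays smoothly with the squared error; sigma is a hypothetical width."""
    return math.exp(-((predicted - target) ** 2) / (2.0 * sigma ** 2))
```

Unlike a binary exact-match reward, this still separates a prediction that is off by 0.1 from one that is off by 2.0, so rollouts within a group receive distinct rewards more often.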
[176] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
Peiyin Chen,Zhuowei Yang,Hui Feng,Sheng Jiang,Rui Yan
Main category: cs.CV
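TL;DR: DEMO is a flow-matching generative framework for audio-driven talking-portrait video synthesis that offers fine-grained control over lip motion, head pose, and eye gaze.
Details
Motivation: Diffusion-based audio-driven talking-head generation has advanced quickly, but producing temporally coherent videos with fine-grained motion control remains difficult. Contribution: 1) A motion auto-encoder that builds a structured latent space in which motion factors are represented independently and approximately orthogonalized; 2) Optimal-transport-based flow matching with a Transformer predictor, applied on the disentangled motion space, to generate smooth motion trajectories.
Method: DEMO disentangles the motion space with a motion auto-encoder, then uses flow matching and a Transformer to generate smooth, audio-conditioned motion trajectories.
Result: Across multiple benchmarks, DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity.
Insight: Combining fine-grained motion disentanglement with flow-based generative modeling offers a new paradigm for controllable talking-head video synthesis.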
Abstract: Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
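For readers unfamiliar with optimal-transport flow matching, the sketch below shows the standard straight-line formulation: a point on the path between a noise sample `x0` and a data sample `x1`, the constant target velocity, the regression loss, and Euler sampling. It assumes the simplest variant (no minimum noise scale) and leaves out audio conditioning and the Transformer predictor, so it illustrates the general technique rather than DEMO's code; all names are hypothetical.

```python
def ot_path_point(x0, x1, t):
    """Point at time t on the straight line from x0 (noise) to x1 (data)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


def euler_integrate(velocity_fn, x0, steps=10):
    """Sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: a "perfect" velocity field transports x0 onto x1.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
sample = euler_integrate(lambda x, t: target_velocity(x0, x1), x0)
```

In a real model, `velocity_fn` would be the trained network evaluated on the current latent motion state, the time step, and the audio features.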
[177] A Machine Learning Perspective on Automated Driving Corner Cases
Sebastian Schmidt,Julius Körner,Stephan Günnemann
Main category: cs.CV
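TL;DR: The paper proposes a machine learning perspective on corner case recognition in automated driving. By taking the data distribution into account, it unifies existing approaches and performs strongly on several benchmarks.
Details
Motivation: Traditional approaches treat corner cases in automated driving as isolated problems and lack a data-coverage and generalization perspective; the paper addresses this with a machine learning approach based on the data distribution. Contribution: 1. A new, distribution-based perspective on corner case recognition; 2. Release of the fog-augmented Lost & Found dataset; 3. Validation of the approach on multiple benchmarks.
Method: A framework that recognizes corner cases automatically by analyzing the data distribution with machine learning models, together with an extension of established out-of-distribution detection benchmarks.
Result: The approach unifies existing corner case taxonomies, performs strongly on detection tasks, and supports the analysis of combined corner cases through the new dataset.
Insight: Corner case recognition should start from the data distribution rather than from isolated scenarios, which gives automated driving safety a more general theoretical basis.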
Abstract: For high-stakes applications, like autonomous driving, a safe operation is necessary to prevent harm, accidents, and failures. Traditionally, difficult scenarios have been categorized into corner cases and addressed individually. However, this example-based categorization is not scalable and lacks a data coverage perspective, neglecting the generalization to training data of machine learning models. In our work, we propose a novel machine learning approach that takes the underlying data distribution into account. Based on our novel perspective, we present a framework for effective corner case recognition for perception on individual samples. In our evaluation, we show that our approach (i) unifies existing scenario-based corner case taxonomies under a distributional perspective, (ii) achieves strong performance on corner case detection tasks across standard benchmarks for which we extend established out-of-distribution detection benchmarks, and (iii) enables analysis of combined corner cases via a newly introduced fog-augmented Lost & Found dataset. These results provide a principled basis for corner case recognition, underlining our manual specification-free definition.
[178] Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping
Hao Shan,Ruikai Li,Han Jiang,Yizhe Fan,Ziyang Yan,Bohan Li,Xiaoshuai Hao,Hao Zhao,Zhiyong Cui,Yilong Ren,Haiyang Yu
Main category: cs.CV
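TL;DR: This paper presents the first systematic study of temporal stability in online high-definition (HD) mapping. It proposes a multi-dimensional stability evaluation framework with new metrics and a unified mAS score. Large-scale experiments show that accuracy (mAP) and stability (mAS) are largely independent performance dimensions, and the paper analyzes how model design affects both.
Details
Motivation: Online HD mapping is a core module in autonomous driving, but spatial displacement of onboard sensors makes mapping results unstable, which poses challenges for downstream tasks. Existing models focus on per-frame mapping accuracy and overlook stability. Contribution: 1. The first temporal stability evaluation framework for online HD mapping; 2. New metrics along three dimensions (Presence, Localization, and Shape Stability) and the mAS score; 3. Experiments on 42 models and variants showing that accuracy and stability are largely independent; 4. An analysis of how model design affects performance.
Method: A multi-dimensional stability evaluation framework that combines the new metrics with the mAS score, used in large-scale experiments to measure how model design affects accuracy and stability.
Result: Accuracy (mAP) and stability (mAS) are largely independent performance dimensions; the paper identifies the architectural and training factors that improve one or both.
Insight: Temporal stability should be treated as a core evaluation criterion alongside accuracy in order to build more reliable autonomous driving systems.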
Abstract: As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame’s mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.
[179] Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Gaojian Wang,Feng Lin,Tong Wu,Zhisheng Yan,Kui Ren
Main category: cs.CV
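TL;DR: The paper proposes FS-VFM, a self-supervised pre-training framework that learns representations of real face images to improve generalization across face security tasks. It combines masked image modeling (MIM) and instance discrimination (ID) through three learning objectives (3C), and introduces the CRFR-P masking strategy and a self-distillation mechanism. A lightweight adapter, FS-Adapter, transfers the pre-trained model efficiently. Experiments show that FS-VFM outperforms a range of existing methods.
Details
Motivation: Unlabeled real face data is abundant; the question is how to learn robust, transferable facial representations from it that generalize across face security tasks. Contribution: 1. FS-VFM, a first attempt at learning fundamental representations of real faces through self-supervised pre-training; 2. The 3C learning objectives, which combine MIM and ID; 3. The CRFR-P masking strategy and a self-distillation mechanism; 4. The lightweight FS-Adapter.
Method: 1. CRFR-P masking, which promotes intra-region consistency and inter-region coherency; 2. A self-distillation mechanism that couples MIM with ID; 3. ViTs as the vision foundation model; 4. FS-Adapter for efficient transfer learning.
Result: On 11 public benchmarks, FS-VFM generalizes better than a range of existing methods, including foundation models from natural and facial domains and task-specific methods.
Insight: 1. Combining local and global learning objectives makes facial representations more robust; 2. A lightweight adapter is an effective way to transfer a pre-trained model efficiently.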
Abstract: With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
[180] AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
Yu Li,Menghan Xia,Gongye Liu,Jianhong Bai,Xintao Wang,Conglang Zhang,Yuxuan Lin,Ruihang Chu,Pengfei Wan,Yujiu Yang
Main category: cs.CV
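TL;DR: The paper proposes AdaViewPlanner, a two-stage paradigm that adapts pre-trained text-to-video (T2V) models to viewpoint planning in 4D scenes, indicating the potential of video generation models for real-world 4D interaction.
Details
Motivation: T2V models are strong visual simulators; the paper explores their potential as implicit world models and adapts them to viewpoint planning in 4D scenes. Contribution: 1. A two-stage paradigm that adapts a pre-trained T2V model to viewpoint prediction; 2. An adaptive learning branch and a camera extrinsic diffusion branch; 3. Experiments showing that the method outperforms existing competitors and validating the key technical designs.
Method: 1. Inject the 4D scene representation into the pre-trained T2V model through an adaptive learning branch; 2. Formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process; 3. Add a camera extrinsic diffusion branch that takes the generated video and the 4D scene as input.
Result: The method outperforms existing competitors on viewpoint planning, and ablation studies validate the key technical designs.
Insight: Video generation models can act as implicit world models and offer a new route to viewpoint planning in 4D interaction tasks.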
Abstract: Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating its potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditional generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, by taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in real world.
[181] Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li,Chaolei Tan,Haoxuan Chen,Jianxin Ma,Jian-Fang Hu,Wei-Shi Zheng,Jianhuang Lai
Main category: cs.CV
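TL;DR: This survey is the first comprehensive review of image-to-video transfer learning based on image-language foundation models (ILFM). It classifies existing methods into frozen features and modified features, analyzes how they perform across video-text tasks, and points out directions for future research.
Details
Motivation: Growing demand from video-text research has driven interest in transferring from the image domain, which reduces the data and compute cost of training video-language foundation models from scratch. Contribution: 1. The first systematic survey of image-to-video transfer learning; 2. A taxonomy with two transfer strategies (frozen features and modified features); 3. An experimental analysis of how the strategies perform on downstream tasks.
Method: Strategies are classified as frozen features (the original ILFM representations are preserved) or modified features (the original representations are adjusted), and their applications to video-text tasks are described in detail.
Result: The experimental analysis compares different transfer learning paradigms on video understanding tasks.
Insight: ILFM-based transfer learning is an efficient solution for video-text tasks; future work could explore more dynamic feature adjustment or better cross-modal alignment.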
Abstract: Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding/generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, succeeds in alleviating the substantial data and computational requirements associated with training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field, which begins by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories: frozen features and modified features, depending on whether the original representations from ILFM are preserved or undergo modifications. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis to investigate the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a structured roadmap for advancing video-text learning based on existing ILFM, and to inspire future research directions in this rapidly evolving domain.
[182] MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation
Yuxiang Luo,Qing Xu,Hai Huang,Yuqi Ouyang,Zhen Chen,Wenting Duan
Main category: cs.CV
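TL;DR: MSM-Seg is a new dual-memory segmentation framework that combines multi-modal and inter-slice information with category-agnostic prompts and achieves superior performance in multi-modal brain tumor segmentation.
Details
Motivation: Existing prompt-based segmentation methods ignore cross-modal correlations and rely on category-specific prompts, which limits their use in real-world scenarios. Contribution: 1. A dual-memory segmentation paradigm built on modality-and-slice memory attention (MSMA) that integrates multi-modal and inter-slice information. 2. A multi-scale category-agnostic prompt encoder (MCP-Encoder). 3. A modality-adaptive fusion decoder (MF-Decoder).
Method: 1. MSMA exploits cross-modal and inter-slice relationships among the input scans. 2. The MCP-Encoder provides tumor region guidance. 3. The MF-Decoder uses complementary information across modalities to improve segmentation accuracy.
Result: On multi-modal MRI datasets, MSM-Seg outperforms existing methods on metastases and glioma segmentation.
Insight: Combining category-agnostic prompts with multi-modal and inter-slice information improves segmentation performance and suits complex clinical scenarios.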
Abstract: Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While the recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose a MSM-Seg framework for multi-modal brain tumor segmentation. The MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with the efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.
[183] Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding
Xinyu Yang,Zheheng Jiang,Feixiang Zhou,Yihang Zhu,Na Lv,Nan Xing,Huiyu Zhou
Main category: cs.CV
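TL;DR: The paper proposes the State-Specific Model (SSM), a framework that unifies and enhances action detection and anticipation. Critical state compression, action pattern learning, and a cross-temporal interaction module address redundancy and noise in untrimmed videos; experiments on multiple datasets show superior performance.
Details
Motivation: Untrimmed videos contain substantial redundant information and noise, and existing methods often overlook how the agent's intention influences the action. Contribution: 1. The SSM framework, which unifies action detection and anticipation; 2. A critical state compression module that reduces redundancy; 3. Action pattern learning to model action dynamics; 4. A cross-temporal interaction module that captures the mutual influence between intention and past information.
Method: 1. The critical state compression module compresses frame sequences; 2. Action pattern learning builds a state-transition graph; 3. The cross-temporal interaction module models the interplay between intention and past information.
Result: SSM outperforms existing methods on EPIC-Kitchens-100, THUMOS'14, and other datasets, which supports the importance of action dynamics learning and cross-temporal interaction.
Insight: Modeling action dynamics and the interaction between intention and past information is essential for action understanding and opens a direction for future research.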
Abstract: Action understanding, encompassing action detection and anticipation, plays a crucial role in numerous practical applications. However, untrimmed videos are often characterized by substantial redundant information and noise. Moreover, in modeling action understanding, the influence of the agent’s intention on the action is often overlooked. Motivated by these issues, we propose a novel framework called the State-Specific Model (SSM), designed to unify and enhance both action detection and anticipation tasks. In the proposed framework, the Critical State-Based Memory Compression module compresses frame sequences into critical states, reducing information redundancy. The Action Pattern Learning module constructs a state-transition graph with multi-dimensional edges to model action dynamics in complex scenarios, on the basis of which potential future cues can be generated to represent intention. Furthermore, our Cross-Temporal Interaction module models the mutual influence between intentions and past as well as current information through cross-temporal interactions, thereby refining present and future features and ultimately realizing simultaneous action detection and anticipation. Extensive experiments on multiple benchmark datasets – including EPIC-Kitchens-100, THUMOS’14, TVSeries, and the introduced Parkinson’s Disease Mouse Behaviour (PDMB) dataset – demonstrate the superior performance of our proposed framework compared to other state-of-the-art approaches. These results highlight the importance of action dynamics learning and cross-temporal interactions, laying a foundation for future action understanding research.
[184] Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
Xuankai Zhang,Junjin Xiao,Qing Zhang
Main category: cs.CV
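TL;DR: The paper presents a unified framework for high-quality dynamic Gaussian Splatting from defocused and motion-blurred monocular videos. A blur prediction network and a dynamic Gaussian densification strategy address the inability of existing methods to handle both kinds of blur at once, and experiments show better performance than the state of the art.
Details
Motivation: Existing methods are designed for either defocus blur or motion blur and cannot handle both. Although both can be modeled as blur-kernel-based convolution, estimating accurate blur kernels is inherently difficult. Contribution: 1. A unified framework that handles defocused and motion-blurred monocular videos together; 2. A blur prediction network that uses blur-related scene and camera information under a blur-aware sparsity constraint; 3. A dynamic Gaussian densification strategy that compensates for the lack of Gaussians in incomplete regions; 4. Improved novel view synthesis by using unseen-view information to constrain scene optimization.
Method: 1. Estimate reliable per-pixel blur kernels with the blur prediction network; 2. Apply the dynamic Gaussian densification strategy; 3. Optimize the scene with unseen-view information.
Result: The method outperforms the state of the art in photorealistic novel view synthesis from defocused and motion-blurred monocular videos.
Insight: The blur prediction network and dynamic Gaussian densification are the key innovations for handling both kinds of blur at once, and they show how much blur information matters in scene reconstruction.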
Abstract: This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code and trained model will be made publicly available.
[185] Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection
Laura Weihl,Nejc Novak,Stefan H. Bengtson,Malte Pedersen
Main category: cs.CV
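TL;DR: The paper explores visual anomaly detection (VAD) based on deep neural networks for automatically identifying anomalous events in underwater monitoring video. It introduces AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluates four VAD models across different scenes.
Details
Motivation: Underwater video monitoring is an effective way to assess marine biodiversity, but the vast volume of uneventful footage makes manual inspection impractical, so automated visual anomaly detection is needed. Contribution: 1. AURA, the first multi-annotator benchmark dataset for underwater VAD; 2. An evaluation of four VAD models across different scenes; 3. Evidence that the amount of training data and the variability of visual content strongly affect VAD performance; 4. A practical approach for supporting scientific exploration and scalable biodiversity monitoring.
Method: 1. Visual anomaly detection with deep neural networks; 2. Model evaluation on the multi-annotator AURA dataset; 3. An analysis of how training data volume and visual content variability affect performance.
Result: The performance of current VAD models varies dramatically and is highly sensitive to the amount of training data and the variability of visual content. Soft and consensus labels prove valuable.
Insight: 1. Visual anomaly detection is promising for underwater monitoring, but data variability must be addressed; 2. Multi-annotator labeling helps make models more robust; 3. The quantity and quality of training data are critical for VAD performance.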
Abstract: Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame selection strategies to extract meaningful video segments. Our comparison against multiple annotators reveals that VAD performance of current models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines “normal” scenes. Our results highlight the value of soft and consensus labels and offer a practical approach for supporting scientific exploration and scalable biodiversity monitoring.
[186] Restricted Receptive Fields for Face Verification
Kagan Ozturk,Aman Bhatta,Haiyu Wu,Patrick Flynn,Kevin W. Bowyer
Main category: cs.CV
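TL;DR: The paper proposes a face similarity metric based on restricted receptive fields. It decomposes global similarity into local patch contributions, which makes the model's decisions inherently interpretable.
Details
Motivation: The decision process of deep neural networks is usually a black box, and common post-hoc methods (such as pixel importance) may not reflect the model's actual reasoning. The authors therefore design an inherently interpretable model. Contribution: A locally additive face similarity metric that provides explanations without post-hoc analysis, shown to be competitive with small patches (28x28) and to surpass the state of the art with larger patches (56x56).
Method: The similarity between two face images is defined as the sum of patch-level similarity scores, each computed with a restricted receptive field. Experiments use 112x112 face images with 28x28 and 56x56 patches.
Result: The method is competitive with 28x28 patches and surpasses state-of-the-art methods with 56x56 patches.
Insight: With a locally additive design, interpretability and performance can coexist without post-hoc analysis, which offers a new approach to designing inherently interpretable models.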
Abstract: Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model’s actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.
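The locally additive idea can be sketched in a few lines: if each patch has its own embedding, the global score is the sum of per-patch scores, and those per-patch scores are themselves the explanation. The sketch assumes non-overlapping patches and cosine similarity as the patch-level score; the abstract does not specify either, and the function names are hypothetical.

```python
import math


def patch_grid(image_size=112, patch_size=28):
    """Top-left corners of non-overlapping patches (an assumed layout)."""
    steps = range(0, image_size, patch_size)
    return [(row, col) for row in steps for col in steps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def additive_similarity(patches_a, patches_b):
    """Global similarity as the sum of patch-level scores.

    Returns the total and the per-patch scores, so each patch's
    contribution to the decision can be read off directly.
    """
    scores = [cosine(a, b) for a, b in zip(patches_a, patches_b)]
    return sum(scores), scores


# Toy embeddings for two patches per face; real ones would come from a
# network whose receptive field is limited to a single patch.
face_a = [[1.0, 0.0], [0.0, 2.0]]
face_b = [[1.0, 0.0], [2.0, 0.0]]
total, per_patch = additive_similarity(face_a, face_b)
```

Under this layout, a 112x112 image yields 16 patches of 28x28 or 4 patches of 56x56.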
[187] EGD-YOLO: A Lightweight Multimodal Framework for Robust Drone-Bird Discrimination via Ghost-Enhanced YOLOv8n and EMA Attention under Adverse Condition
Sudipto Sarkar,Mohammad Asif Hasan,Khondokar Ashik Shahriar,Fablia Labiba,Nahian Tasnim,Sheikh Anawarul Haq Fattah
Main category: cs.CV
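TL;DR: EGD-YOLOv8n is a lightweight multimodal object detection framework. With a Ghost-enhanced YOLOv8n and EMA attention, it discriminates drones from birds efficiently under adverse conditions.
Details
Motivation: Telling drones from birds reliably is essential for airspace safety and security systems. Existing methods fall short in computational efficiency and in robustness under adverse conditions. Contribution: The EGD-YOLOv8n model, which combines RGB and infrared images and uses Ghost modules and EMA attention to improve feature extraction while staying lightweight and real-time.
Method: Ghost modules reduce redundant computation, EMA attention strengthens feature representation, and a special detection head adapts to objects of different shapes and sizes. Three versions are trained: RGB, infrared, and multimodal.
Result: The multimodal version performs best on the VIP CUP 2025 dataset, combining high accuracy with real-time speed on common GPUs.
Insight: Multimodal data and efficient attention mechanisms can markedly improve detection under adverse conditions while keeping the model lightweight and real-time.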
Abstract: Identifying drones and birds correctly is essential for keeping the skies safe and improving security systems. Using the VIP CUP 2025 dataset, which provides both RGB and infrared (IR) images, this study presents EGD-YOLOv8n, a new lightweight yet powerful model for object detection. The model improves how image features are captured and understood, making detection more accurate and efficient. It uses smart design changes and attention layers to focus on important details while reducing the amount of computation needed. A special detection head helps the model adapt to objects of different shapes and sizes. We trained three versions: one using RGB images, one using IR images, and one combining both. The combined model achieved the best accuracy and reliability while running fast enough for real-time use on common GPUs.
[188] Structured Spectral Graph Learning for Multi-label Abnormality Classification in 3D Chest CT Scans
Theo Di Piazza,Carole Lazarus,Olivier Nempont,Loic Boussel
Main category: cs.CV
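TL;DR: The paper proposes a graph-based 2.5D framework for multi-label abnormality classification in 3D chest CT scans. It represents each 3D volume as a structured graph and uses spectral graph convolution to capture inter-slice dependencies, achieving strong cross-dataset generalization.
Details
Motivation: Multi-label classification of 3D chest CT is challenging because of the complex spatial relationships in volumetric data and the wide variability of abnormalities. 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers need extensive pre-training on domain-specific data. Contribution: 1. A new graph-based framework that represents 3D CT volumes as structured graphs; 2. Spectral graph convolution to model inter-slice dependencies; 3. Strong cross-dataset generalization; 4. Ablation studies that validate the design.
Method: The 3D CT volume is represented as a graph whose nodes are axial slice triplets; spectral graph convolution models the dependencies between slices while keeping computational complexity compatible with clinical deployment.
Result: Trained and evaluated on datasets from three independent institutions, the method generalizes well across datasets and is competitive with state-of-the-art visual encoders. Ablations measure the effect of aggregation strategies, edge-weighting schemes, and graph connectivity patterns.
Insight: Graph modeling captures the complex spatial relationships in 3D CT data without the large-scale pre-training that other approaches depend on. The approach also transfers to radiology report generation and abdominal CT data.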
Abstract: With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data. This work extends our previous contribution presented at the MICCAI 2025 EMERGE Workshop.
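A rough illustration of the graph construction and one graph-convolution step follows. It makes several assumptions that the abstract does not confirm: triplets are non-overlapping, neighbouring triplets are connected in a chain with unit edge weights, and the spectral filter is the common first-order approximation (symmetrically normalized adjacency with self-loops). The paper's actual connectivity and filter may differ; all names are hypothetical.

```python
import math


def slice_triplets(num_slices):
    """Group axial slice indices into non-overlapping triplets (one node each).

    Trailing slices that do not fill a triplet are dropped.
    """
    return [(i, i + 1, i + 2) for i in range(0, num_slices - 2, 3)]


def chain_adjacency(n):
    """Connect each node to its neighbour along the axial direction."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1.0
    return adj


def normalized_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(adj, features, weight):
    """One propagation step: normalized adjacency @ features @ weight."""
    return matmul(matmul(normalized_adjacency(adj), features), weight)


nodes = slice_triplets(9)                 # three nodes for a 9-slice toy volume
adj = chain_adjacency(len(nodes))
out = graph_conv(adj, [[1.0], [1.0], [1.0]], [[1.0]])
```

Each output row mixes a node's own features with those of its axial neighbours, which is how inter-slice context enters the representation.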
[189] Full segmentation annotations of 3D time-lapse microscopy images of MDA231 cells
Aleksandra Melnikova,Petr Matula
Main category: cs.CV
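TL;DR: The paper provides high-quality, publicly available segmentation annotations for 3D time-lapse microscopy images of MDA231 cells, filling a data gap in the field, and verifies their consistency and accuracy.
Details
Motivation: Segmentation annotations of volumetric images are essential for image processing, but public datasets lack high-quality 3D annotations of many dynamic targets. Contribution: The first publicly available full 3D time-lapse segmentation annotations, verified for consistency with existing tracking markers and for segmentation accuracy.
Method: Several annotators manually labeled 3D time-lapse images of MDA231 cells; the annotations were compared with the tracking markers provided by the Cell Tracking Challenge (CTC) and with the automatically generated silver truth.
Result: The new annotations are consistent with the CTC tracking markers, their segmentation accuracy against the 2D gold truth is within inter-annotator variability, and they represent the images better than the automatically generated silver truth.
Insight: Manually annotated 3D segmentation captures the shapes of complex dynamic targets more faithfully and can be used for training cell segmentation and analyzing dynamic shapes.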
Abstract: High-quality, publicly available segmentation annotations of image and video datasets are critical for advancing the field of image processing. In particular, annotations of volumetric images of a large number of targets are time-consuming and challenging. In (Melnikova, A., & Matula, P., 2025), we presented the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes. Concretely, three distinct humans annotated two sequences of MDA231 human breast carcinoma cells (Fluo-C3DL-MDA231) from the Cell Tracking Challenge (CTC). This paper aims to provide a comprehensive description of the dataset and accompanying experiments that were not included in (Melnikova, A., & Matula, P., 2025) due to limitations in publication space. Namely, we show that the created annotations are consistent with the previously published tracking markers provided by the CTC organizers and the segmentation accuracy measured based on the 2D gold truth of CTC is within the inter-annotator variability margins. We compared the created 3D annotations with automatically created silver truth provided by CTC. We have found the proposed annotations better represent the complexity of the input images. The presented annotations can be used for testing and training cell segmentation, or analyzing 3D shapes of highly dynamic objects.
[190] Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
Zhaofang Qian,Hardy Chen,Zeyu Wang,Li Zhang,Zijun Wang,Xiaoke Huang,Hui Liu,Xianfeng Tang,Zeyu Zheng,Haoqin Tu,Cihang Xie,Yuyin Zhou
Main category: cs.CV
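TL;DR: The paper introduces EarthWhere, a benchmark that evaluates the geolocation ability of vision-language models (VLMs) at country and street scales. An evaluation of 13 state-of-the-art VLMs shows limited performance and regional bias.
Details
Motivation: The ability of existing VLMs to geolocate images has not been evaluated thoroughly, even though the task matters in practice. Contribution: The EarthWhere benchmark, with geolocation tasks at country and street scales, human-verified key visual clues, and a Shapley-reweighted thinking score.
Method: The benchmark contains 810 globally distributed images, split into WhereCountry (country scale) and WhereStreet (street scale). Metrics include Acc@k and hierarchical path scores, plus the proposed thinking score.
Result: Gemini-2.5-Pro achieves the best average accuracy (56.32%); the strongest open-weight model, GLM-4.5V, reaches 34.71%. Web search and reasoning do not necessarily help when visual clues are limited, and models show regional bias.
Insight: Models show marked regional bias in geolocation and are limited by the visual clues available, which highlights both the difficulty of the task and the room for improvement.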
Abstract: Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue’s marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
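The Acc@k metric described above (share of predictions within k km of the true coordinates) can be sketched with the haversine great-circle distance. The benchmark's own distance computation is not specified in the abstract; haversine on a sphere with mean Earth radius 6371 km is an assumption, as are the function names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def acc_at_k(predictions, truths, k_km):
    """Fraction of predictions that fall within k_km of the ground truth."""
    hits = sum(
        1 for pred, truth in zip(predictions, truths)
        if haversine_km(*pred, *truth) <= k_km
    )
    return hits / len(truths)


# Toy example: one exact hit, one prediction about 111 km off.
preds = [(0.0, 0.0), (0.0, 0.0)]
truths = [(0.0, 0.0), (1.0, 0.0)]
strict = acc_at_k(preds, truths, k_km=1)
loose = acc_at_k(preds, truths, k_km=200)
```

Reporting the metric at several radii separates coarse, country-level correctness from fine, street-level correctness.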
[191] Topological Alignment of Shared Vision-Language Embedding Space
Junwon You,Dasol Kang,Jae-Hun Jung
Main category: cs.CV
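TL;DR: The paper proposes ToMCLIP, a topological alignment method that improves the global geometry of multilingual vision-language embedding spaces and uses topology-preserving constraints to improve cross-modal alignment.
Details
Motivation: In existing contrastive vision-language models, cross-modal alignment on multilingual tasks is biased toward English, and existing methods neglect the global geometry of the shared embedding space. Contribution: The ToMCLIP framework, which uses a topological alignment loss and a graph sparsification strategy to improve the structural coherence of multilingual embedding spaces and to improve zero-shot classification and multilingual retrieval.
Method: Persistent homology defines the topological alignment loss; a graph sparsification strategy approximates the persistence diagram, with theoretical error bounds.
Result: ToMCLIP achieves higher zero-shot accuracy on CIFAR-100 and stronger multilingual retrieval on xFlickr&CO.
Insight: Topological alignment is a general method for representation learning that can improve the global structure of embedding spaces.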
Abstract: Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
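To make the persistent-homology idea concrete, the sketch below computes only 0-dimensional persistence, where every connected component is born at scale 0 and dies at the length of a minimum-spanning-tree edge, and then compares two embedding sets through their sorted death times. This is a deliberately simplified stand-in: the paper's actual loss, homology dimensions, and graph sparsification strategy are not given in the abstract, and all names here are hypothetical.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Death times of connected components in a Vietoris-Rips filtration.

    These equal the minimum-spanning-tree edge lengths, found here with
    Kruskal's algorithm and a union-find structure. Returned ascending.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:       # merging two components kills one of them
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


def h0_alignment_gap(points_a, points_b):
    """Squared difference between sorted H0 death times of two point sets.

    Assumes both sets contain the same number of points.
    """
    deaths_a, deaths_b = h0_death_times(points_a), h0_death_times(points_b)
    return sum((x - y) ** 2 for x, y in zip(deaths_a, deaths_b))


english = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
shifted = [(5.0, 5.0), (6.0, 5.0), (8.0, 5.0)]   # same shape, translated
stretched = [(0.0, 0.0), (2.0, 0.0), (6.0, 0.0)]  # same layout, scaled by 2
```

The gap is zero for the translated copy and positive for the stretched one, which shows that the comparison depends on the shape of the point cloud rather than on where it sits in the space.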
[192] SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model
Honghui Yuan,Keiji Yanai
Main category: cs.CV
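TL;DR: SceneTextStylizer is a training-free, diffusion-based framework for scene text style transfer that enables flexible, high-fidelity style editing of text while preserving readability and stylistic consistency.
Details
Motivation: Existing scene text editing methods are typically limited to content replacement and simple styles and cannot perform free-form style transfer. This paper targets flexible, localized style editing of scene text.
Contribution: 1. Proposes SceneTextStylizer, a training-free diffusion framework that supports prompt-guided text style transfer; 2. designs a feature injection module that combines diffusion model inversion with self-attention; 3. introduces a region control mechanism and a Fourier-transform-based style enhancement module to improve editing precision and visual quality.
Method: 1. Feature injection module: transfers style features through diffusion model inversion and self-attention; 2. region control mechanism: applies a distance-based changing mask at each denoising step; 3. style enhancement module: reinforces stylistic richness using the Fourier transform.
Result: Experiments show that SceneTextStylizer outperforms existing methods in visual fidelity and text preservation.
Insight: 1. A training-free diffusion framework can carry out complex style editing effectively; 2. combining region control with style enhancement markedly improves the quality and flexibility of text editing.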
Abstract: With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles, which lack the ability of free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.
[193] FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Chunyu Xie,Bin Wang,Fanjing Kong,Jincheng Li,Dawei Liang,Ji Ao,Dawei Leng,Yuhui Yin
Main category: cs.CV
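TL;DR: FG-CLIP 2 is a bilingual (English and Chinese) fine-grained vision-language alignment model. By combining region-text matching, long-caption modeling, and multiple discriminative objectives, it substantially improves fine-grained alignment and performs strongly on 29 datasets.
Details
Motivation: Existing models such as CLIP perform well on global alignment but are limited on fine-grained details (e.g., object attributes and spatial relations) and on multilingual support, especially in non-English settings. FG-CLIP 2 aims to fill this gap and advance bilingual (English and Chinese) fine-grained vision-language alignment.
Contribution: 1. Proposes FG-CLIP 2, which supports fine-grained alignment in both English and Chinese; 2. designs the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions; 3. releases a new benchmark for Chinese multimodal understanding.
Method: 1. Uses region-text matching and long-caption modeling to provide fine-grained supervision; 2. trains with multiple discriminative objectives; 3. adds the TIC loss to separate semantically similar captions.
Result: Across 29 datasets and 8 tasks, FG-CLIP 2 outperforms existing methods and achieves state-of-the-art results in both languages.
Insight: Fine-grained alignment benefits from multimodal supervision and multilingual data; the TIC loss improves the model's ability to distinguish semantically similar text.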
Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
[194] DKPMV: Dense Keypoints Fusion from Multi-View RGB Frames for 6D Pose Estimation of Textureless Objects
Jiahong Chen,Jinghao Wang,Zi Wang,Ziwen Wang,Banglei Guan,Qifeng Yu
Main category: cs.CV
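TL;DR: DKPMV is a dense keypoint fusion method over multi-view RGB images for 6D pose estimation of textureless objects. A three-stage pose optimization strategy, together with attentional aggregation and symmetry-aware training, yields substantial performance gains.
Details
Motivation: 6D pose estimation of textureless objects matters for industrial robotics, but existing methods either rely on depth data or under-exploit multi-view geometric cues, which limits their performance.
Contribution: 1. Proposes DKPMV, a dense keypoint fusion method that uses only RGB images; 2. designs a three-stage progressive pose optimization strategy; 3. enhances the keypoint network with attentional aggregation and symmetry-aware training.
Method: 1. Dense keypoint fusion: predicts dense keypoints from multi-view RGB images and fuses them; 2. three-stage pose optimization: refines the pose progressively; 3. attentional aggregation and symmetry-aware training: improve keypoint prediction accuracy.
Result: On the ROBI dataset, DKPMV outperforms existing multi-view RGB methods and surpasses RGB-D methods in the majority of cases.
Insight: RGB-only input can still deliver accurate 6D pose estimation through dense keypoint fusion and geometric optimization; symmetry-aware training is key to resolving ambiguities on symmetric objects.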
Abstract: 6D pose estimation of textureless objects is valuable for industrial robotic applications, yet remains challenging due to the frequent loss of depth information. Current multi-view methods either rely on depth data or insufficiently exploit multi-view geometric cues, limiting their performance. In this paper, we propose DKPMV, a pipeline that achieves dense keypoint-level fusion using only multi-view RGB images as input. We design a three-stage progressive pose optimization strategy that leverages dense multi-view keypoint geometry information. To enable effective dense keypoint fusion, we enhance the keypoint network with attentional aggregation and symmetry-aware training, improving prediction accuracy and resolving ambiguities on symmetric objects. Extensive experiments on the ROBI dataset demonstrate that DKPMV outperforms state-of-the-art multi-view RGB approaches and even surpasses the RGB-D methods in the majority of cases. The code will be available soon.
[195] Towards Distribution-Shift Uncertainty Estimation for Inverse Problems with Generative Priors
Namhoon Kim,Sara Fridovich-Keil
Main category: cs.CV
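TL;DR: This paper proposes an uncertainty estimation method for inverse problems with generative priors. It detects distribution shift without a calibration dataset and applies to computational imaging problems.
Details
Motivation: Generative models show strong potential for inverse problems such as medical image reconstruction, but they may hallucinate features when test data fall outside the training distribution. Existing methods require a calibration dataset or provide only heuristic estimates, and they do not directly quantify the uncertainty caused by distribution shift.
Contribution: Proposes an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost.
Method: Hypothesizes that reconstructions of in-distribution images remain stable under random measurement variations while reconstructions of out-of-distribution (OOD) images do not, and uses this stability as a proxy for detecting distribution shift.
Result: In tomographic reconstruction experiments on MNIST digits, reconstructions of OOD digits show higher variability and higher reconstruction error, supporting the indicator.
Insight: The method offers a lightweight guardrail for deploying generative priors: it allows aggressive measurement reduction for in-distribution cases while automatically warning when the prior is applied out of distribution.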
Abstract: Generative models have shown strong potential as data-driven priors for solving inverse problems such as reconstructing medical images from undersampled measurements. While these priors improve reconstruction quality with fewer measurements, they risk hallucinating features when test images lie outside the training distribution. Existing uncertainty quantification methods in this setting (i) require an in-distribution calibration dataset, which may not be available, (ii) provide heuristic rather than statistical estimates, or (iii) quantify uncertainty from model capacity or limited measurements rather than distribution shift. We propose an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost. Our key hypothesis is that reconstructions of in-distribution images remain stable under random measurement variations, while reconstructions of out-of-distribution (OOD) images exhibit greater instability. We use this stability as a proxy for detecting distribution shift. Our proposed OOD indicator is efficiently computable for any computational imaging inverse problem; we demonstrate it on tomographic reconstruction of MNIST digits, where a learned proximal network trained only on digit “0” is evaluated on all ten digits. Reconstructions of OOD digits show higher variability and correspondingly higher reconstruction error, validating this indicator. These results suggest a deployment strategy that pairs generative priors with lightweight guardrails, enabling aggressive measurement reduction for in-distribution cases while automatically warning when priors are applied out of distribution.
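The stability hypothesis in the abstract can be sketched in a few lines. The snippet below is a toy illustration, not the authors' indicator: `measure` and `reconstruct` are caller-supplied placeholders, and the score is simply the mean per-pixel variance of reconstructions obtained from several random measurement draws. The toy prior (fill unobserved pixels with 0) and the flat 16-pixel "images" are assumptions made only so the example runs on its own.

```python
import random
import statistics


def instability_score(image, measure, reconstruct, n_trials=8, seed=0):
    """Mean per-pixel variance of reconstructions across random measurement draws."""
    rng = random.Random(seed)
    recons = [reconstruct(measure(image, rng)) for _ in range(n_trials)]
    per_pixel_var = [statistics.pvariance(vals) for vals in zip(*recons)]
    return sum(per_pixel_var) / len(per_pixel_var)


# Toy stand-ins (hypothetical): observe a random half of the pixels, and let the
# "prior" fill every unobserved pixel with 0, as if trained only on blank images.
def random_subsample(image, rng):
    return [x if rng.random() < 0.5 else None for x in image]


def zero_prior_reconstruct(measurement):
    return [0.0 if v is None else v for v in measurement]


in_dist = [0.0] * 16  # matches the toy prior
ood = [1.0] * 16      # does not match the toy prior
in_score = instability_score(in_dist, random_subsample, zero_prior_reconstruct)
ood_score = instability_score(ood, random_subsample, zero_prior_reconstruct)
```

With this toy prior the in-distribution image is reconstructed identically on every draw, so its score is zero, while the out-of-distribution image gets a different reconstruction each time and scores higher. A real pipeline would replace the two stand-in functions with the actual forward operator and the learned reconstruction network.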
[196] IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation
Zeteng Lin,Xingxing Li,Wen You,Xiaoyang Li,Zehan Lu,Yujun Cai,Jing Tang
Main category: cs.CV
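TL;DR: This paper proposes IUT-Plug, a plug-in module grounded in an Image Understanding Tree (IUT), to help existing vision language models preserve logic, object identity, and style in interleaved image-text generation.
Details
Motivation: Existing vision language models (such as GPT-4 and DALL-E) struggle to preserve logic, object identity, and style in multimodal image-text generation, which limits their generalization in complex scenarios.
Contribution: Proposes IUT-Plug, which uses a dynamic IUT extraction module and a cross-modal consistency mechanism to mitigate context drift in multimodal QA tasks.
Method: A two-stage framework: (1) a dynamic IUT extraction module parses visual scenes into hierarchical symbolic structures; (2) a coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.
Result: Experiments show that IUT-Plug improves accuracy on established benchmarks and alleviates three critical forms of context drift in multimodal QA scenarios.
Insight: Explicit structured reasoning combined with a cross-modal consistency mechanism can improve the performance of vision language models on multimodal tasks.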
Abstract: Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
[197] Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
Sanchit Sinha,Oana Frunza,Kashif Rasul,Yuriy Nevmyvaka,Aidong Zhang
Main category: cs.CV
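TL;DR: Chart-RVR couples GRPO with verifiable rewards to make large vision-language models (LVLMs) more robust and explainable on chart reasoning. It narrows the out-of-distribution (OOD) performance gap and reaches state-of-the-art results on several benchmarks.
Details
Motivation: Current LVLMs perform well on chart reasoning but falter on OOD data and when asked to generate explanatory reasoning chains, which limits their explainability and reliability.
Contribution: Proposes the Chart-RVR framework, which combines GRPO with three verifiable rewards (chart-type classification, chart table reconstruction, and process conformity) to improve both performance and explainability on chart reasoning.
Method: Fine-tunes models with Group Relative Policy Optimization (GRPO) and three automatically verifiable rewards covering chart-type classification, table reconstruction, and conformity of the reasoning process.
Result: The Chart-RVR-3B series achieves state-of-the-art results among models of comparable size on six chart-reasoning benchmarks, narrows the OOD performance gap, and produces more interpretable reasoning chains.
Insight: Pairing verifiable rewards with GRPO improves model performance and also makes the reasoning process more transparent and reliable, which shows promise for explainable chart reasoning.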
Abstract: The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability and showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.
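The group-relative part of GRPO can be shown with a short sketch: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The weighted sum in `total_reward` is a hypothetical way of combining the three reward signals named in the abstract; the paper's actual reward shaping and weights are not given here, and the use of the population standard deviation is likewise an assumption.

```python
import statistics


def total_reward(type_correct, table_score, process_score, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three verifiable reward signals."""
    w_type, w_table, w_proc = weights
    return w_type * float(type_correct) + w_table * table_score + w_proc * process_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (mean 0, roughly unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed: responses that score above their siblings receive positive advantages and the rest receive negative ones.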
[198] Mixup Helps Understanding Multimodal Video Better
Xiaoyu Ma,Ding Ding,Hao Chen
Main category: cs.CV
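TL;DR: The paper proposes Multimodal Mixup (MM) and Balanced Multimodal Mixup (B-MM), which apply Mixup at the multimodal feature level and dynamically adjust per-modality mixing ratios to address modality imbalance and overfitting in multimodal video understanding.
Details
Motivation: In multimodal video understanding, strong modalities tend to dominate learning and suppress the contribution of weaker ones, which leads to overfitting and weaker generalization.
Contribution: 1. Proposes Multimodal Mixup (MM), which generates virtual feature-label pairs at the multimodal feature level to reduce overfitting; 2. extends it to Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratio of each modality to address modality imbalance.
Method: 1. MM applies Mixup at the aggregated multimodal feature level to generate virtual samples; 2. B-MM adjusts the mixing ratios dynamically according to each modality's contribution to the learning objective.
Result: Experiments on several datasets show that the proposed methods improve generalization and multimodal robustness.
Insight: Dynamically adjusting modality mixing ratios balances the contributions of different modalities more effectively and improves performance on multimodal tasks.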
Abstract: Multimodal video understanding plays a crucial role in tasks such as action recognition and emotion classification by combining information from different modalities. However, multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones. To address this challenge, we first propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting by generating virtual feature-label pairs. While MM effectively improves generalization, it treats all modalities uniformly and does not account for modality imbalance during training. Building on MM, we further introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective. Extensive experiments on several datasets demonstrate the effectiveness of our methods in improving generalization and multimodal robustness.
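As a rough illustration of Mixup applied to an already aggregated multimodal feature, the sketch below draws a mixing coefficient from a Beta distribution and forms a convex combination of two feature vectors and their one-hot labels. The function name, the `alpha` default, and the list-based features are assumptions for this example. B-MM's per-modality ratio adjustment is not shown because the abstract does not specify its rule.

```python
import random


def mixup_features(feat_a, label_a, feat_b, label_b, alpha=0.4, rng=None):
    """Mix two aggregated feature vectors and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    feat = [lam * x + (1 - lam) * y for x, y in zip(feat_a, feat_b)]
    label = [lam * p + (1 - lam) * q for p, q in zip(label_a, label_b)]
    return feat, label, lam
```

The mixed label stays a valid probability vector, so the usual cross-entropy loss against soft targets can be used on the virtual pair without modification.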
[199] A Survey on Agentic Multimodal Large Language Models
Huanjin Yao,Ruifei Zhang,Jiaxing Huang,Jingyi Zhang,Yibo Wang,Bo Fang,Ruolin Zhu,Yongcheng Jing,Shunyu Liu,Guanbin Li,Dacheng Tao
Main category: cs.CV
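TL;DR: The paper presents a comprehensive survey of Agentic Multimodal Large Language Models (Agentic MLLMs), discusses how they differ from conventional MLLM-based agents, and proposes a conceptual framework that organizes the research along three dimensions.
Details
Motivation: With the rise of autonomous agentic systems, AI agents are shifting from static, passive, and domain-specific to dynamic, proactive, and generalizable. Interest in agentic AI is growing, and it is seen as a possible path toward AGI, which calls for a systematic survey of Agentic MLLMs.
Contribution: Proposes a conceptual framework that organizes Agentic MLLM research along three dimensions (agentic internal intelligence, agentic external tool invocation, and agentic environment interaction) and compiles open-source training frameworks, datasets, and application directions as resources for the field.
Method: Through a literature survey, the paper analyzes the core characteristics and functions of Agentic MLLMs and builds a three-dimensional taxonomy: 1) agentic internal intelligence; 2) agentic external tool invocation; 3) agentic environment interaction.
Result: The paper summarizes the state of research and development directions for Agentic MLLMs, collects open-source tools and datasets, and outlines future research directions.
Insight: The key to Agentic MLLMs lies in dynamic planning, proactive tool invocation, and adaptation to the environment; combining multimodality with large language models in this way is presented as an important path toward general-purpose AI.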
Abstract: With the recent emergence of revolutionary autonomous agentic systems, the research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and the characteristics that distinguish them from conventional MLLM-based agents. We establish a conceptual framework that organizes agentic MLLMs along three fundamental dimensions: (i) agentic internal intelligence, which functions as the system’s commander, enabling accurate long-horizon planning through reasoning, reflection, and memory; (ii) agentic external tool invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (iii) agentic environment interaction, which situates models within virtual or physical environments, allowing them to take actions, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. To further accelerate research in this area for the community, we compile open-source training frameworks as well as training and evaluation datasets for developing agentic MLLMs. Finally, we review the downstream applications of agentic MLLMs and outline future research directions for this rapidly evolving field. To continuously track developments, we will also actively update a public repository at https://github.com/HJYao00/Awesome-Agentic-MLLMs.
[200] Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency
Yuxin Cheng,Binxiao Huang,Taiqiang Wu,Wenyong Zhou,Chenchen Ding,Zhengwu Liu,Graziano Chesi,Ngai Wong
Main category: cs.CV
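TL;DR: PAInpainter is a new 3D Gaussian inpainting method that uses perspective-aware content propagation and multi-view consistency verification to improve inpainting quality and global consistency across views.
Details
Motivation: 3D Gaussian inpainting is important for virtual reality and multimedia applications, but existing methods still struggle with multi-view consistency.
Contribution: Proposes PAInpainter, which achieves high-quality 3D Gaussian inpainting through perspective-aware content propagation and multi-view consistency verification.
Method: Building on pretrained diffusion models, the method iteratively refines the inpainting and optimizes the 3D Gaussian representation using views adaptively sampled from a perspective graph, propagating inpainted content as a prior and verifying consistency across neighboring views.
Result: On the SPIn-NeRF and NeRFiller datasets, PSNR reaches 26.03 dB and 29.51 dB respectively, exceeding existing methods.
Insight: Multi-view consistency verification and perspective-aware content propagation are key factors in improving 3D Gaussian inpainting quality.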
Abstract: 3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.
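For readers unfamiliar with the PSNR figures quoted above, the following sketch shows the standard definition, PSNR = 10 log10(MAX^2 / MSE), on flat lists of pixel values. The function name and the default intensity range of [0, 1] are assumptions for this example; the paper's exact evaluation protocol (for instance, whether scores are computed only inside the inpainted region) is not stated in the abstract.

```python
import math


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.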
[201] ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
Ruihang Xu,Dewei Zhou,Fan Ma,Yi Yang
Main category: cs.CV
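TL;DR: ContextGen is a new Diffusion Transformer framework that uses Contextual Layout Anchoring (CLA) and Identity Consistency Attention (ICA) to address layout control and identity consistency in multi-instance image generation.
Details
Motivation: Multi-instance image generation (MIG) remains difficult for existing diffusion models in terms of layout control and identity consistency, and large-scale annotated datasets for the task are lacking.
Contribution: Proposes the CLA mechanism, which incorporates a layout image into the generation context, and the ICA mechanism, which uses reference images to keep identities consistent; also introduces IMIG-100K, described as the first dataset with detailed layout and identity annotations.
Method: Uses a Diffusion Transformer framework in which CLA handles layout anchoring and ICA handles identity consistency.
Result: Experiments show that ContextGen outperforms existing methods in control precision, identity fidelity, and visual quality.
Insight: Layout and identity consistency are the core challenges of multi-instance generation, and they can be addressed effectively through context integration and attention mechanisms.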
Abstract: Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
[202] COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models
Sanchit Sinha,Guangzhi Xiong,Aidong Zhang
Main category: cs.CV
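TL;DR: The paper proposes COCO-Tree, which augments vision language models (VLMs) with neurosymbolic concept trees learned from large language models to strengthen linguistic reasoning and improve compositional generalization.
Details
Motivation: Existing VLMs are weak at compositional reasoning. Conventional remedies such as better prompt structure or chain-of-thought reasoning have limited effect, while approaches that rely on large language models are either resource-intensive or lack an interpretable reasoning process. COCO-Tree aims to address these issues.
Contribution: 1. Proposes COCO-Tree, which strengthens VLM linguistic reasoning with neurosymbolic concept trees; 2. designs a beam-search-inspired reasoning process that improves compositionality performance and provides a rationale for predictions; 3. validates the method on several benchmarks.
Method: 1. Learns neurosymbolic concept trees from large language models; 2. combines these trees with VLM outputs and reasons over them with a beam-search-inspired procedure; 3. evaluates on several compositionality benchmarks.
Result: Experiments on four benchmarks (Winoground, EqBench, ColorSwap, and SugarCrepe) show that COCO-Tree improves compositional generalization by 5-10% and applies to VLMs of different sizes.
Insight: 1. Neurosymbolic methods can combine the linguistic strengths of LLMs with the visual strengths of VLMs; 2. the beam-search-inspired reasoning process makes the model more interpretable; 3. the method offers an efficient and interpretable direction for improving compositional reasoning in VLMs.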
Abstract: Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research works have attempted to improve compositionality performance through techniques such as improved prompt structure and chain-of-thought reasoning. A more recent line of work attempts to impart additional reasoning to VLMs using well-trained Large Language Models (LLMs), which are far superior to VLMs in linguistic understanding, to compensate for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present COCO-Tree, a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve the VLM’s linguistic reasoning. COCO-Tree’s beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks (Winoground, EqBench, ColorSwap, and SugarCrepe) across seven open-source VLMs of varying sizes demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
[203] High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
Runyang Feng,Hyung Jin Chang,Tze Ho Elden Tse,Boeun Kim,Yi Chang,Yixing Gao
Main category: cs.CV
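TL;DR: The paper proposes a new framework that extends Mamba (a state space model) to learn global and local high-resolution spatiotemporal representations separately for video-based human pose estimation (VHPE), addressing both the imbalance between global and local dynamic modeling and the computational cost of existing methods.
Details
Motivation: Existing VHPE methods unify spatiotemporal learning in a single type of structure, which makes it hard to balance global and local dynamic modeling, and modeling global dependencies has quadratic complexity. Mamba can model long-range dependencies with linear complexity but is restricted to 1D sequential data.
Contribution: 1. Proposes Global Spatiotemporal Mamba, which uses a 6D selective space-time scan and spatial- and temporal-modulated scan merging to extract global representations from high-resolution sequences efficiently. 2. Designs Local Refinement Mamba, based on a windowed space-time scan, to enhance the high-frequency details of local keypoint motion.
Method: 1. Global Spatiotemporal Mamba: performs a 6D selective space-time scan and merges the scan results with spatial and temporal modulation. 2. Local Refinement Mamba: uses a windowed space-time scan to refine local motion details.
Result: Experiments on four benchmark datasets show that the model outperforms existing VHPE methods while achieving a better computational trade-off.
Insight: Separating global and local dynamic modeling and exploiting Mamba's linear complexity can relieve the computational bottleneck and performance limits of high-resolution spatiotemporal modeling.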
Abstract: Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.
[204] GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
Shasha Guo,Liang Pang,Xi Wang,Yanling Wang,Huawei Shen,Jing Zhang
Main category: cs.CV
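TL;DR: GeoVLMath uses a reinforcement learning framework with a cross-modal reward to improve how vision-language models construct auxiliary lines in geometry reasoning, and it achieves competitive and often superior results on auxiliary-line reasoning benchmarks.
Details
Motivation: Auxiliary lines are essential for solving complex geometry problems but remain challenging for large vision-language models (LVLMs). Current image editing models struggle to draw auxiliary lines with geometric precision, so the authors generate textual descriptions of auxiliary-line constructions instead.
Contribution: 1. Proposes a reinforcement learning framework that uses a cross-modal reward to score how well an auxiliary-line description matches the target diagram; 2. develops GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry; 3. builds the AuxSolidMath dataset for training and evaluation.
Method: 1. Generates textual descriptions of auxiliary lines rather than drawing them directly; 2. designs a cross-modal reward that measures how well the description matches a ground-truth auxiliary-line diagram; 3. optimizes the model with GRPO-based reinforcement learning.
Result: At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs.
Insight: Describing auxiliary lines in text rather than editing the image fits the representational strengths of LVLMs, and the reinforcement learning framework improves geometric reasoning.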
Abstract: Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
[205] GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Hongxiang Li,Yaowei Li,Bin Lin,Yuwei Niu,Yuhang Yang,Xiaoshuang Huang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Long Chen
Main category: cs.CV
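TL;DR: GIR-Bench is a comprehensive benchmark for unified multimodal models that evaluates the alignment between understanding and generation, reasoning-driven image generation, and multi-step reasoning in editing.
Details
Motivation: The community lacks a rigorous benchmark for systematically evaluating how consistent multimodal models are between understanding and generation, and how well they generalize to complex visual tasks.
Contribution: Proposes the GIR-Bench benchmark with three subsets (UGC, T2I, and Edit) and a dedicated evaluation pipeline for each, filling a gap in the evaluation of multimodal models.
Method: Evaluates models from three complementary perspectives (understanding-generation consistency, reasoning-centric generation, and multi-step reasoning in editing) and designs task-specific evaluation pipelines to mitigate bias.
Result: Experiments show that unified models are more capable on reasoning-driven tasks, but a gap between understanding and generation persists.
Insight: GIR-Bench exposes the shortfall of current multimodal models in aligning understanding with generation and points to directions for future research.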
Abstract: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable of reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.
[206] Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Ganlin Yang,Tianyi Zhang,Haoran Hao,Weiyun Wang,Yibin Liu,Dehui Wang,Guanzhou Chen,Zijian Cai,Junting Chen,Weijie Su,Wengang Zhou,Yu Qiao,Jifeng Dai,Jiangmiao Pang,Gen Luo,Wenhai Wang,Yao Mu,Zhi Hou
Main category: cs.CV
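TL;DR: This paper proposes Vlaser, a vision-language-action model that integrates high-level reasoning with low-level control and uses synergistic embodied reasoning to bridge the gap between upstream VLM reasoning and downstream VLA policy learning.
Details
Motivation: Current research mostly either develops embodied reasoning with VLMs or integrates advanced VLMs into VLA models for end-to-end robot control; few studies directly address the gap between upstream VLM-based reasoning and downstream VLA policy learning.
Contribution: 1. Proposes Vlaser, a model with synergistic vision-language-action embodied reasoning; 2. builds the high-quality Vlaser-6M dataset; 3. systematically analyzes how different VLM initializations affect VLA fine-tuning and offers new insights into mitigating domain shift.
Method: Building on the Vlaser-6M dataset, the authors design a VLA model that combines high-level reasoning with low-level control and run systematic experiments on the effect of VLM initialization on fine-tuning.
Result: Vlaser achieves state-of-the-art performance on a range of embodied reasoning tasks (such as spatial reasoning and task planning), state-of-the-art results on the WidowX benchmark, and competitive performance on the Google Robot benchmark.
Insight: VLM initialization has a marked effect on VLA fine-tuning, and strategies that mitigate domain shift improve model performance.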
Abstract: While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
[207] LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation
Chang Liu,Henghui Ding,Kaining Ying,Lingyi Hong,Ning Xu,Linjie Yang,Yuchen Fan,Mingqi Gao,Jingkun Chen,Yunqi Miao,Gengshen Wu,Zhijin Qin,Jungong Han,Zhixiong Zhang,Shuangrui Ding,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang,Chang Soo Lim,Joonyoung Moon,Donghyeon Cho,Tingmin Li,Yixuan Li,Yang Yang,An Yan,Leilei Cao,Feng Lu,Ran Hong,Youhai Jiang,Fengjie Zhu,Yujie Xie,Hongyang Zhang,Zhihui Liu,Shihai Ruan,Quanzhu Niu,Dengxian Gong,Shihao Chen,Tao Zhang,Yikang Zhou,Haobo Yuan,Lu Qi,Xiangtai Li,Shunping Ji,Ran Hong,Feng Lu,Leilei Cao,An Yan,Alexey Nekrasov,Ali Athar,Daan de Geus,Alexander Hermans,Bastian Leibe
Main category: cs.CV
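TL;DR: This report summarizes the 2025 LSVOS Challenge, focusing on the new MOSEv2 track, which targets robustness in complex video scenes, and on emerging trends such as the use of LLM/MLLM components.
Details
Motivation: Video object segmentation (VOS) still faces challenges in real-world scenes, such as dense small objects and frequent occlusion. The MOSEv2 track was introduced to push long-term consistency and generalization.
Contribution: 1. Adds the MOSEv2 track, which raises task difficulty and introduces more realistic complex scenes; 2. adopts a new evaluation metric, $J\&\dot{F}$, to better assess objects across scales and disappearance/reappearance cases; 3. summarizes emerging trends such as LLM/MLLM components and memory-aware propagation.
Method: 1. The traditional tracks (VOS and RVOS) keep the standard $J$, $F$, and $J\&F$ metrics; 2. the new MOSEv2 track uses $J\&\dot{F}$ as its primary metric; 3. the report reviews datasets, protocols, and top-performing solutions.
Result: The challenge shows the role of LLM/MLLM components in video segmentation and underlines the importance of techniques such as memory-aware propagation.
Insight: Future research directions include stronger language awareness and robustness in complex scenes, where LLMs/MLLMs may become key tools.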
Abstract: This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional tracks of LSVOS that jointly target robustness in realistic video scenarios: Classic VOS (VOS), and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, adverse weather and lighting, etc., pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains standard $J$, $F$, and $J\&F$ metrics for VOS and RVOS, while MOSEv2 adopts $J\&\dot{F}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
[208] CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Fengling Zhu,Boshi Liu,Jingyu Hua,Sheng Zhong
Main category: cs.CV
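TL;DR: The paper proposes CoDefend, a multimodal defense method that uses diffusion-based purification and prompt optimization to make multimodal large language models (MLLMs) more robust to adversarial attacks.
Details
Motivation: MLLMs perform well on vision and text tasks but are vulnerable to adversarial attacks. Existing defenses such as adversarial training and input purification have limitations, including high computational cost, degraded image quality, and weak generalization. Contribution: The main contribution is CoDefend, which combines supervised diffusion denoising with prompt optimization and substantially improves defense against adversarial attacks in multimodal tasks.
Method: A supervised diffusion-based denoising framework (a diffusion model fine-tuned on paired adversarial and clean images) plus a prompt optimization mechanism that strengthens resistance to unseen attacks.
Result: Experiments on image captioning and visual question answering show that CoDefend substantially improves robustness and transfers well to unknown attacks.
Insight: Supervised diffusion denoising is an effective approach to multimodal defense and offers a new route to deploying MLLMs safely in practice.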
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy-violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion-based denoising framework that leverages paired adversarial-clean image datasets to fine-tune diffusion models with directional, task-specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher-quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion-based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real-world applications.
[209] Compositional Zero-Shot Learning: A Survey
Ans Munir,Faisal Z. Qureshi,Mohsen Ali,Muhammad Haris Khan
Main category: cs.CV
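TL;DR: This paper is the first comprehensive survey of Compositional Zero-Shot Learning (CZSL). It systematically reviews recent CZSL methods, proposes a disentanglement-based taxonomy, and analyzes strengths, weaknesses, and future research directions.
Details
Motivation: CZSL matters in computer vision because a model must recognize unseen combinations of known attributes and objects at inference time, while training data cannot cover every possible combination. The contextual nature of appearance (for example, a "small" cat and an "old" cat look quite different) makes the problem harder. Contribution: 1) the first comprehensive survey dedicated to CZSL; 2) a disentanglement-based taxonomy that groups methods into four families; 3) a detailed comparison of the core strengths and limitations of the different methods; 4) a list of important open challenges and future research directions.
Method: The survey classifies CZSL methods by degree of disentanglement into four families: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. It analyzes how each family performs under different problem settings, such as closed-world and open-world CZSL.
Result: The survey analyzes the strengths and weaknesses of the methods and points to the potential of cross-modal disentanglement for handling contextuality and compositionality. It also summarizes the field's current limitations and possible avenues for progress.
Insight: 1) modeling contextuality is essential for CZSL; 2) the choice of disentanglement strategy significantly affects performance; 3) cross-modal methods may perform better in the open-world setting; 4) the community needs to pay more attention to data-efficient models that generalize well.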
Abstract: Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, "small" cats appear visually distinct from "older" ones, and "wet" cars differ significantly from "wet" cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review the state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement, with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. Papers studied in this survey with their official code are available on our GitHub: https://github.com/ans92/Compositional-Zero-Shot-Learning
[210] MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps
Jiahui Lei,Kyle Genova,George Kopanas,Noah Snavely,Leonidas Guibas
Main category: cs.CV
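TL;DR: The paper proposes the MoMap (Motion Map) representation for predicting future 3D scene motion from a single input image, and learns the motion distribution with a diffusion model.
Details
Motivation: To learn semantically and functionally meaningful 3D motion priors from real videos, so that future scene motion can be predicted from a single image. Contribution: A pixel-aligned MoMap representation and a large-scale MoMap database, used to train a diffusion model that generates plausible, semantically consistent 3D scene motion.
Method: 1. Design the MoMap representation; 2. Build a MoMap database from more than 50,000 real videos; 3. Train a diffusion model to generate motion; 4. Propose a new MoMap-based pipeline for 2D video synthesis.
Result: Experiments show that the generated 3D scene motion is both plausible and semantically consistent.
Insight: MoMap is an efficient 3D motion representation; combined with a diffusion model, it can significantly improve the semantic and functional quality of generated motion.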
Abstract: This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
[211] Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment
Chen Liu,Wenfang Yao,Kejing Yin,William K. Cheung,Jing Qin
Main category: cs.CV
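TL;DR: DiPro is a multimodal framework that models disease progression through spatiotemporal disentanglement and multiscale alignment, addressing redundancy in CXR sequences and temporal misalignment with EHR data.
Details
Motivation: Longitudinal multimodal data (such as EHR and CXR) is essential for modeling disease progression, but CXR sequences are redundant and temporally misaligned with EHR data. Contribution: The DiPro framework, which uses region-aware disentanglement and multiscale alignment to extract disease-relevant dynamics and synchronize the modalities.
Method: Disentangle static (anatomy) and dynamic (pathology progression) features in CXRs, then align them with EHR data through local and global synchronization.
Result: On the MIMIC dataset, DiPro achieves state-of-the-art performance on disease progression identification and ICU prediction tasks.
Insight: Disentanglement and multiscale alignment are key to multimodal disease modeling and can significantly improve dynamic feature extraction and temporal consistency.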
Abstract: Longitudinal multimodal data, including electronic health records (EHR) and sequential chest X-rays (CXRs), is critical for modeling disease progression, yet remains underutilized due to two key challenges: (1) redundancy in consecutive CXR sequences, where static anatomical regions dominate over clinically-meaningful dynamics, and (2) temporal misalignment between sparse, irregular imaging and continuous EHR data. We introduce $\texttt{DiPro}$, a novel framework that addresses these challenges through region-aware disentanglement and multi-timescale alignment. First, we disentangle static (anatomy) and dynamic (pathology progression) features in sequential CXRs, prioritizing disease-relevant changes. Second, we hierarchically align these static and dynamic CXR features with asynchronous EHR data via local (pairwise interval-level) and global (full-sequence) synchronization to model coherent progression pathways. Extensive experiments on the MIMIC dataset demonstrate that $\texttt{DiPro}$ could effectively extract temporal clinical dynamics and achieve state-of-the-art performance on both disease progression identification and general ICU prediction tasks.
[212] Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
Hao Tang,Shengfeng He,Jing Qin
Main category: cs.CV
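TL;DR: SynTrans is a novel framework that significantly improves few-shot learning by synergistically transferring knowledge from large multimodal models such as CLIP.
Details
Motivation: Existing few-shot learning methods usually rely on semantic knowledge from small models, which may contain noise and bias. SynTrans addresses this by drawing on the diverse knowledge of large multimodal models. Contribution: The SynTrans framework, which realizes synergistic knowledge transfer from large multimodal models and optimizes few-shot classifiers through bidirectional visual-semantic knowledge bridging.
Method: SynTrans combines an unsupervised proxy task for distilling visual knowledge, a training-free synergistic knowledge mining module, and visual weight generation and semantic weight reconstruction modules.
Result: Strong results on four few-shot learning datasets, significantly outperforming current state-of-the-art methods.
Insight: The diverse knowledge in large multimodal models can effectively mitigate data scarcity in few-shot learning, and synergistic knowledge transfer can significantly improve performance.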
Abstract: Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data’s inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
[213] video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory
Guangzhi Sun,Yixuan Li,Xiaodong Wu,Yudong Yang,Wei Li,Zejun Ma,Chao Zhang
Main category: cs.CV
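TL;DR: video-SALMONN S is a streaming audio-visual LLM and the first able to process 3-hour videos at 1 FPS and 360p under a fixed memory budget. A test-time-training (TTT) memory module and a prompt-dependent memory reader give it high-quality understanding of long videos.
Details
Motivation: Current video-understanding LLMs face memory and time constraints on long videos; a method is needed that can continuously process high-frame-rate, high-resolution video streams. Contribution: 1. A TTT memory module that captures long-range dependencies by dynamically updating token representations; 2. A prompt-dependent memory reader that selectively retrieves relevant context; 3. Processing of videos up to 3 hours long under fixed memory.
Method: 1. The TTT memory module replaces token merging and dynamically updates token representations; 2. The TTT module is optimized with a Hessian-free conjugate-gradient procedure (TTT_HF); 3. The prompt-dependent memory reader retrieves relevant content from fixed-size memory.
Result: The 8B-parameter model reaches 74.2% overall and 67.8% on the long split of Video-MME, outperforming both offline and streaming baselines.
Insight: A dynamically updated TTT module and selective memory retrieval are key to handling long videos, and the approach could be extended to more complex AI agent tasks.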
Abstract: Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline, fixed-frame-number methods require the stream length to adapt frame rates; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that continually updates token representations to capture long-range dependencies by replacing token merging, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.
[214] Validation of an Artificial Intelligence Tool for the Detection of Sperm DNA Fragmentation Using the TUNEL In Situ Hybridization Assay
Byron Alexander Jacobs,Aqeel Morris,Ifthakaar Shaik,Frando Lin
Main category: cs.CV
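TL;DR: The study validates an AI-based tool that detects sperm DNA fragmentation (SDF) through digital analysis of phase-contrast microscopy images.
Details
Motivation: Conventional semen analysis cannot assess sperm DNA fragmentation (SDF), a key parameter in male fertility assessment. The study aims to develop a non-destructive method that improves the accuracy and efficiency of SDF detection. Contribution: A morphology-assisted ensemble AI model that combines image processing techniques with a transformer-based machine learning model (GC-ViT) to predict SDF from phase-contrast images.
Method: With the TUNEL assay as the gold standard, the authors develop an ensemble AI framework and benchmark it against a pure transformer vision model and a morphology-only model.
Result: The proposed model achieves a sensitivity of 60% and a specificity of 75%, offering a new approach to non-destructive, real-time sperm selection.
Insight: The study shows the potential of AI in reproductive medicine: combining morphology with deep learning can enable efficient assessment of sperm DNA integrity.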
Abstract: Sperm DNA fragmentation (SDF) is a critical parameter in male fertility assessment that conventional semen analysis fails to evaluate. This study presents the validation of a novel artificial intelligence (AI) tool designed to detect SDF through digital analysis of phase contrast microscopy images, using the terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay as the gold standard reference. Utilising the established link between sperm morphology and DNA integrity, the present work proposes a morphology-assisted ensemble AI model that combines image processing techniques with state-of-the-art transformer-based machine learning models (GC-ViT) for the prediction of DNA fragmentation in sperm from phase contrast images. The ensemble model is benchmarked against a pure transformer "vision" model as well as a "morphology-only" model. Promising results show the proposed framework is able to achieve sensitivity of 60% and specificity of 75%. This non-destructive methodology represents a significant advancement in reproductive medicine by enabling real-time sperm selection based on DNA integrity for clinical diagnostic and therapeutic applications.
[215] Multiview Manifold Evidential Fusion for PolSAR Image Classification
Junfei Shi,Haojia Zhang,Haiyan Jin,Junhuai Li,Xiaogang Song,Yuanfan Guo,Haonan Su,Weisi Lin
Main category: cs.CV
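TL;DR: The paper proposes Multiview Manifold Evidential Fusion (MMEFnet) for polarimetric synthetic aperture radar (PolSAR) image classification, addressing the way conventional fusion methods ignore differences between views and uncertainty.
Details
Motivation: Conventional PolSAR classification methods usually concatenate multi-features directly or combine them with deep networks. They ignore that the views (covariance matrices and multi-features) lie on different manifolds, and they account for neither view importance nor uncertainty. Contribution: 1) MMEFnet, which unifies PolSAR manifold learning and evidence fusion in one architecture; 2) modeling of covariance matrices on the Hermitian Positive Definite (HPD) manifold and of multi-features on the Grassmann manifold; 3) a trusted multiview evidence fusion method based on Dempster-Shafer theory.
Method: 1) Represent covariance matrices on the HPD manifold and multi-features on the Grassmann manifold; 2) build a separate kernel metric learning network for each to learn its manifold representation; 3) replace the softmax classifier with evidence fusion that quantifies the uncertainty of each view; 4) fuse the evidence with Dempster-Shafer theory.
Result: Experiments on three real-world PolSAR datasets show that MMEFnet outperforms existing methods in accuracy, robustness, and interpretability.
Insight: Explicitly modeling each view's manifold structure and uncertainty improves fusion, and evidence theory gives the classification reliable interpretability.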
Abstract: Polarimetric Synthetic Aperture Radar (PolSAR) covariance matrices and their extracted multi-features - such as scattering angle, entropy, texture, and boundary descriptors - provide complementary and physically interpretable information for image classification. Traditional fusion strategies typically concatenate these features or employ deep learning networks to combine them. However, the covariance matrices and multi-features, as two complementary views, lie on different manifolds with distinct geometric structures. Existing fusion methods also overlook the varying importance of different views and ignore uncertainty, often leading to unreliable predictions. To address these issues, we propose a Multiview Manifold Evidential Fusion (MMEFnet) method to effectively fuse these two views. It gives a new framework to integrate PolSAR manifold learning and evidence fusion into a unified architecture. Specifically, covariance matrices are represented on the Hermitian Positive Definite (HPD) manifold, while multi-features are modeled on the Grassmann manifold. Two different kernel metric learning networks are constructed to learn their manifold representations. Subsequently, a trusted multiview evidence fusion, replacing the conventional softmax classifier, estimates belief mass and quantifies the uncertainty of each view from the learned deep features. Finally, a Dempster-Shafer theory-based fusion strategy combines evidence, enabling a more reliable and interpretable classification. Extensive experiments on three real-world PolSAR datasets demonstrate that the proposed method consistently outperforms existing approaches in accuracy, robustness, and interpretability.
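The evidence-fusion step described above can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so this assumes the reduced Dempster's rule commonly used in trusted multi-view classification, where each view yields per-class belief masses plus one uncertainty mass. The function names and the evidence values are hypothetical and are not taken from the paper.

```python
def evidence_to_opinion(evidence):
    """Turn non-negative per-class evidence into belief masses and an uncertainty mass.

    Assumes a Dirichlet parameterisation alpha_k = e_k + 1 (an assumption,
    not confirmed by the abstract), so beliefs and uncertainty sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def combine_opinions(b1, u1, b2, u2):
    """Reduced Dempster's rule for two views over the same class set."""
    # Conflict: mass assigned to different classes by the two views.
    conflict = sum(
        b1[i] * b2[j]
        for i in range(len(b1))
        for j in range(len(b2))
        if i != j
    )
    scale = 1.0 / (1.0 - conflict)
    belief = [
        scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1)
        for k in range(len(b1))
    ]
    uncertainty = scale * u1 * u2
    return belief, uncertainty


# Hypothetical evidence from a covariance-matrix view and a multi-feature view.
b_cov, u_cov = evidence_to_opinion([9.0, 1.0, 0.0])
b_feat, u_feat = evidence_to_opinion([4.0, 2.0, 0.0])
b_fused, u_fused = combine_opinions(b_cov, u_cov, b_feat, u_feat)
print(b_fused, u_fused)
```

When the two views agree, the fused uncertainty is lower than either view's own uncertainty, which is the property that makes this rule attractive as a replacement for a plain softmax output.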
[216] CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
Zhenyu Lu,Liupeng Li,Jinpeng Wang,Yan Feng,Bin Chen,Ke Chen,Yaowei Wang
Main category: cs.CV
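TL;DR: CoPRS is a positional perception model based on Multi-modal Chain-of-Thought (MCoT). It links language reasoning to image segmentation through a differentiable, interpretable heatmap that serves as a positional prior, improving both the interpretability of the reasoning and segmentation accuracy.
Details
Motivation: Existing reasoning segmentation methods either connect a language model's hidden features directly to a mask decoder or represent positions as text, which limits interpretability and semantic detail. CoPRS aims to address this. Contribution: 1. CoPRS, an MCoT-based positional perception model; 2. A differentiable heatmap used as a positional prior, which makes the reasoning process more interpretable; 3. A lightweight decoder that connects reasoning directly to segmentation.
Method: 1. Use MCoT to produce a clear reasoning process; 2. Generate the heatmap positional prior with a learnable concentration token; 3. Decode the heatmap into precise masks with a lightweight decoder.
Result: On the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported performance, supporting a consistent association between the reasoning output and mask generation.
Insight: Heatmap quality directly affects mask quality, indicating a strong association between reasoning-driven concentration and segmentation accuracy.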
Abstract: Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.
[217] Reliable Cross-modal Alignment via Prototype Iterative Construction
Xiang Ma,Litian Xu,Lexin Fang,Caiming Zhang,Lizhen Cui
Main category: cs.CV
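TL;DR: The paper proposes PICO, a framework that uses prototype iterative construction to reliably suppress interference from style information in cross-modal alignment.
Details
Motivation: Conventional cross-modal alignment methods assume that embeddings contain only semantic information and ignore interference from non-semantic information (such as style), which leads to information bias or even loss. Contribution: PICO, which quantifies the probability that each feature column represents semantic information, uses that probability as the weight during embedding interaction, and applies prototype iterative construction to keep the semantic probability reliable.
Method: PICO uses a performance-feedback-based weighting function; the authors prove theoretically that it assigns higher weights to prototypes that bring larger performance improvements.
Result: Across various benchmarks and model backbones, PICO outperforms existing methods by 5.2%-14.1%.
Insight: Separating semantic from style information and aligning only the semantic part is key to better cross-modal alignment; prototype iterative construction provides reliable support for doing so.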
Abstract: Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities. The most reliable foundation for achieving this objective lies in the semantic consistency between matched pairs. Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment, which inevitably leads to information bias or even loss. This non-semantic information primarily manifests as stylistic variations in the data, which we formally define as style information. An intuitive approach is to separate style from semantics, aligning only the semantic information. However, most existing methods distinguish them based on feature columns, which cannot represent the complex coupling relationship between semantic and style information. In this paper, we propose PICO, a novel framework for suppressing style interference during embedding interaction. Specifically, we quantify the probability of each feature column representing semantic information, and regard it as the weight during the embedding interaction. To ensure the reliability of the semantic probability, we propose a prototype iterative construction method. The key operation of this method is a performance feedback-based weighting function, and we have theoretically proven that the function can assign higher weight to prototypes that bring higher performance improvements. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of PICO, outperforming state-of-the-art methods by 5.2%-14.1%.
[218] BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
Bryan Chen Zhengyu Tan,Zheng Weihua,Zhengyuan Liu,Nancy F. Chen,Hwaran Lee,Kenny Tsu Wei Choo,Roy Ka-Wei Lee
Main category: cs.CV
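TL;DR: BLEnD-Vis is a multimodal, multicultural benchmark for evaluating the robustness and cross-modal consistency of everyday cultural knowledge in vision-language models (VLMs).
Details
Motivation: Existing VLM evaluations mainly assess static recall or isolated visual tasks and lack systematic tests of cultural understanding. BLEnD-Vis aims to fill this gap by evaluating models across cultures. Contribution: The BLEnD-Vis benchmark, with 313 culturally grounded question templates covering 16 regions, over 21,000 multiple-choice question instances, and 4,916 images, providing a systematic evaluation tool for cultural understanding and multimodal integration.
Method: Building on the BLEnD dataset, the authors design three multiple-choice formats: (i) a text-only baseline, (ii) an inverted text-only variant, and (iii) a VQA-style multimodal version. The formats are validated through human annotation.
Result: Current VLMs show significant fragility in cultural knowledge: performance drops under linguistic rephrasing, and although visual cues often help, cross-modal consistency is low, particularly for lower-resource regions.
Insight: BLEnD-Vis exposes the limits of VLMs in cultural understanding and multimodal integration, and points toward more culturally competent models.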
Abstract: As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
[219] FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Shengming Yuan,Xinyu Lyu,Shuailong Wang,Beitao Chen,Jingkuan Song,Lianli Gao
Main category: cs.CV
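TL;DR: The paper proposes FlexAC, a framework that flexibly controls the strength of associative reasoning in multimodal large language models (MLLMs) to balance faithfulness and creativity.
Details
Motivation: MLLMs need different degrees of associative reasoning for different tasks, but existing methods cannot modulate it flexibly. FlexAC aims to close this gap through analysis of internal mechanisms and dynamic modulation. Contribution: 1) Shows that middle layers play the key role in the associative behavior of MLLMs; 2) Proposes FlexAC, a lightweight, training-free, hallucination-guided framework; 3) Reports experiments with clear gains over baselines in creativity and hallucination reduction.
Method: FlexAC analyzes intermediate-layer representations, derives steering vectors from hallucinations, adaptively calibrates the association strength, and adds task-specific associative direction vectors derived from a few target-domain samples.
Result: Up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines.
Insight: Associative reasoning is multi-dimensional and task-dependent; FlexAC's dynamic modulation offers a new way to make MLLMs adaptable across diverse tasks.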
Abstract: Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
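To make the idea of a steering vector concrete, here is a minimal sketch. It assumes the common difference-of-means construction: average the hidden states of high-association examples, subtract the average of low-association examples, and add a scaled copy of the result to a hidden state at inference time. The abstract does not spell out FlexAC's exact construction or calibration, so the function names, the toy vectors, and the `alpha` parameter are all hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(high_assoc_states, low_assoc_states):
    """Difference of means between two groups of hidden states (assumed construction)."""
    mu_high = mean_vector(high_assoc_states)
    mu_low = mean_vector(low_assoc_states)
    return [h - l for h, l in zip(mu_high, mu_low)]


def steer(hidden_state, vector, alpha):
    """Shift a hidden state along the steering direction.

    alpha > 0 pushes toward more associative behaviour; alpha < 0 pushes away from it.
    """
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy 2-dimensional hidden states standing in for a middle-layer representation.
direction = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
more_faithful = steer([0.0, 0.0], direction, alpha=-0.5)
print(direction, more_faithful)
```

A single signed scalar is the simplest possible control knob; the paper describes adaptive calibration and task-specific vectors on top of this basic mechanism.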
[220] Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Rohit Gupta,Anirban Roy,Claire Christensen,Sujeong Kim,Sarah Gerard,Madeline Cincebeaux,Ajay Divakaran,Todd Grindal,Mubarak Shah
Main category: cs.CV
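TL;DR: The paper proposes a class-prototype-based supervised contrastive learning method for multi-label, fine-grained classification of educational videos, with a multimodal transformer network that captures the interaction between visual and audio cues.
Details
Motivation: As young children consume more online media, data-driven tools are urgently needed to help educators select suitable educational content. The paper focuses on multi-label, fine-grained classification of two kinds of educational video: literacy and math. Contribution: 1. A novel class-prototype-based supervised contrastive learning method; 2. A multimodal transformer network that learns video embeddings and captures the interaction between visual and audio cues; 3. APPROVE, an expert-annotated educational video dataset.
Method: 1. Learn one prototype per class; 2. Design a loss that minimizes the distance between a class prototype and samples of that class and maximizes the distance to samples of other classes; 3. Encode the visual and audio features of each video with a multimodal transformer network.
Result: The method outperforms strong baselines on APPROVE, YouTube-8M, and COIN.
Insight: The interaction between visual and audio cues is essential for understanding educational videos; combining class prototypes with supervised contrastive learning handles multi-label, fine-grained classification effectively.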
Abstract: The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to select appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include "letter names" and "letter sounds", and math codes include "counting" and "sorting". We pose this as a fine-grained multilabel classification problem, as videos can contain multiple types of educational content and the content classes can be visually similar (e.g., "letter names" vs. "letter sounds"). We propose a novel class-prototype-based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class, and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues is crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as YouTube-8M and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE
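The pull/push behaviour of a prototype-based loss can be sketched in a few lines. This is not the paper's loss, whose exact form the abstract does not give; it is a simple margin-based stand-in, assuming squared Euclidean distance and a hinge on the push term. The function names, the `margin` parameter, and the toy values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_loss(embedding, labels, prototypes, margin=1.0):
    """Multi-label prototype loss for one sample (illustrative stand-in).

    For each class: if the sample carries the label, pull it toward the
    class prototype; otherwise push it away until it is at least `margin`
    (in squared distance) from that prototype.
    """
    total = 0.0
    for has_label, prototype in zip(labels, prototypes):
        distance = squared_distance(embedding, prototype)
        if has_label:
            total += distance                      # pull term
        else:
            total += max(0.0, margin - distance)   # push term (hinge)
    return total / len(prototypes)


# Two hypothetical class prototypes in a 2-D embedding space.
prototypes = [[0.0, 0.0], [3.0, 4.0]]
well_placed = prototype_loss([0.0, 0.0], [1, 0], prototypes)
badly_placed = prototype_loss([0.0, 0.0], [0, 1], prototypes)
print(well_placed, badly_placed)
```

Because each class is handled independently, a sample with several positive labels is pulled toward several prototypes at once, which is what makes this family of losses a natural fit for multi-label data.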
[221] Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features
Masoumeh Chapariniya,Pierre Vuillecard,Jean-Marc Odobez,Volker Dellwo,Teodora Vukovic
Main category: cs.CV
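TL;DR: The study shows that the dynamics of facial expressions, rather than static appearance, can be used to identify individuals. It disentangles facial shape from expression dynamics with the FLAME 3D morphable model and validates the approach with a contrastive learning model on natural conversation data.
Details
Motivation: To test whether facial dynamics alone are enough to identify a person once static appearance is excluded. Contribution: 1. Demonstrates the role of facial dynamics in identity recognition; 2. Introduces the drift-to-noise ratio (DNR) to quantify how reliably shape and expression are separated.
Method: Use the FLAME 3D model to separate facial shape from expression dynamics parameters, then classify with a Conformer model trained with supervised contrastive learning.
Result: 61.14% accuracy on a 1,429-way classification task, far above chance; DNR is negatively correlated with recognition performance.
Insight: Facial dynamics carry strong identity signatures, but unstable shape estimation undermines dynamic identification.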
Abstract: This work investigates whether individuals can be identified solely through the pure dynamical components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14% accuracy on 1,429-way classification (458 times above chance), demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape-expression separation by measuring across-session shape changes relative to within-session variability. DNR strongly negatively correlates with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.
[222] A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images
Yuxuan Chen,Ruotong Yang,Zhengyang Zhang,Mehreen Ahmed,Yanming Wang
Main category: cs.CV
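TL;DR: The paper proposes an automated scale bar detection and extraction framework for scanning electron microscopy (SEM) images that combines multimodal processing with a large language model (LLM), significantly improving efficiency and accuracy.
Details
Motivation: Scale bar detection in SEM images currently relies on manual work, which is slow and error-prone. The study aims to solve this with an automated framework and make the interpretation of scientific images more efficient and reliable. Contribution: 1) A multimodal automated scale bar detection framework; 2) Auto-DG for generating a diverse dataset; 3) A hybrid OCR system that improves text recognition; 4) An LLM used as a reasoning engine and intelligent assistant.
Method: The framework has four phases: 1) Auto-DG generates a synthetic dataset; 2) scale bar object detection; 3) a hybrid OCR system (DenseNet + CRNN) extracts the information; 4) an LLM verifies the results and suggests next steps.
Result: Scale bar detection reaches 100% precision, 95.8% recall, and 99.2% mAP at IoU=0.5; the hybrid OCR system reaches 89% precision, 65% recall, and a 75% F1 score, outperforming mainstream OCR engines.
Insight: Using an LLM as a reasoning engine can make automated scientific image analysis more reliable and more intelligent, and the approach may extend to other modalities or multi-task settings.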
Abstract: Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time-consuming and prone to errors. To address this issue, we propose a multi-modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto-DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto-DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow-up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.
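As a quick consistency check on the OCR figures quoted above, the F1 score is the harmonic mean of precision and recall. The short sketch below applies that standard definition to the reported 89% precision and 65% recall; the function name is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported hybrid-OCR precision and recall on the Auto-DG dataset.
ocr_f1 = f1_score(0.89, 0.65)
print(f"{ocr_f1:.3f}")  # prints 0.751, matching the reported 75% F1
```

The harmonic mean sits closer to the lower of the two inputs, which is why an 89% precision paired with a 65% recall yields an F1 of about 75% and not the arithmetic mean of 77%.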
[223] Exploring and Leveraging Class Vectors for Classifier Editing
Jaeik Kim,Jaeyoung Do
Main category: cs.CV
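TL;DR: The paper introduces Class Vectors for editing image classifiers, addressing the limited flexibility or high cost of existing methods. Adjustments in latent space or weight space enable efficient editing and high-level concept manipulation.
Details
Motivation: An image classifier's behavior is fixed after training, which makes post hoc editing difficult, especially when forgetting specific classes or adapting to distribution shifts. Existing methods are either narrow in scope or costly.
Contribution: Proposes Class Vectors, which enable class-specific editing in both latent space and weight space and support efficient, flexible high-level concept operations such as class arithmetic.
Method: Class Vectors capture class-specific representation adjustments. Classifier behavior is then adjusted directly in latent space or weight space, exploiting linearity and orthogonality for efficient operations.
Result: Class Vectors are validated in applications including unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
Insight: The linearity and orthogonality of Class Vectors make them an efficient new tool for classifier editing, supporting semantic adjustment and high-level concept operations across multiple application scenarios.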
Abstract: Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class’s adaptation in the latent space. We show that Class Vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.
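The abstract describes Class Vectors as class-specific representation adjustments that can steer latent features. A minimal sketch of that idea follows, under a loudly stated assumption: the class vector is taken here as the difference between a class's mean latent feature after and before fine-tuning. That definition, and the names `class_vector`, `steer`, and `alpha`, are hypothetical illustrations rather than the authors' formulation.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def class_vector(pre_latents: List[Vector], post_latents: List[Vector]) -> Vector:
    """ASSUMPTION: class vector = mean latent after fine-tuning minus mean latent before."""
    pre_mean = mean_vector(pre_latents)
    post_mean = mean_vector(post_latents)
    return [after - before for before, after in zip(pre_mean, post_mean)]


def steer(latent: Vector, vector: Vector, alpha: float) -> Vector:
    """Move a latent feature along a class vector; a negative alpha moves it back."""
    return [z + alpha * v for z, v in zip(latent, vector)]


# Toy latents for one class before and after fine-tuning.
v_cat = class_vector([[0.0, 0.0], [2.0, 0.0]], [[1.0, 1.0], [3.0, 1.0]])
undone = steer([2.0, 1.0], v_cat, alpha=-1.0)
```

Because steering is a plain vector addition, several class vectors can be added or subtracted together, which is the sense in which simple "class arithmetic" becomes possible.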
[224] Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Jian Lan,Zhicheng Liu,Udo Schlegel,Raoyuan Zhao,Yihong Liu,Hinrich Schütze,Michael A. Hedderich,Thomas Seidl
Main category: cs.CV
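TL;DR: The paper proposes HaDola, a framework that uses human uncertainty (HU) for data selection and automatic labeling to improve supervised fine-tuning (SFT) in visual question answering, reducing reliance on costly annotated data.
Details
Motivation: Vision-language models (VLMs) perform well on visual question answering but still depend on large labeled datasets for SFT. Existing methods ignore the HU distribution, which hurts performance and leaves models poorly calibrated.
Contribution: 1. Systematically evaluates how human uncertainty affects SFT; 2. Proposes the HaDola framework, which optimizes training through data selection and automatic labeling; 3. Reduces annotation cost while improving model performance and calibration.
Method: HaDola has four stages: 1. discriminate; 2. self-annotate; 3. error trigger; 4. training. It iteratively removes harmful samples, prioritizes informative ones, and bootstraps from a small seed set.
Result: Experiments on VQAv2 and VizWiz show that HaDola, bootstrapping from a seed set of 5% of the data, matches or outperforms existing baselines with less training data, while improving accuracy and calibration.
Insight: Making good use of human uncertainty (for example, removing high-HU samples) is more effective than simply scaling up the dataset, which suggests a new way to cut annotation cost.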
Abstract: Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) – variation in human confidence across annotations – but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages – discriminate, self-annotate, error trigger, and training – to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
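To make "human uncertainty" concrete, here is one way it could be quantified and used to filter samples. This is a hedged sketch: measuring HU as the normalized entropy of the annotators' answers is an assumption made for illustration, and `human_uncertainty`, `select_low_hu`, and the 0.5 threshold are hypothetical names and values, not HaDola's actual discriminate stage.

```python
import math
from collections import Counter
from typing import Dict, List


def human_uncertainty(answers: List[str]) -> float:
    """ASSUMPTION: HU = entropy of the answer distribution, scaled to [0, 1]."""
    n = len(answers)
    if n <= 1:
        return 0.0
    counts = Counter(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # log(n) is the entropy when every annotator gives a different answer.
    return max(0.0, entropy / math.log(n))


def select_low_hu(samples: Dict[str, List[str]], threshold: float = 0.5) -> List[str]:
    """Keep only sample ids whose annotations agree enough (HU <= threshold)."""
    return [sid for sid, answers in samples.items()
            if human_uncertainty(answers) <= threshold]


annotations = {
    "q1": ["yes"] * 10,                                 # full agreement
    "q2": ["red"] * 8 + ["orange"] * 2,                 # mild disagreement
    "q3": [f"answer_{i}" for i in range(10)],           # no agreement at all
}
kept = select_low_hu(annotations)
```

A filter of this kind mirrors the abstract's finding that high-HU samples contribute little or even degrade performance, so dropping them can be preferable to training on everything.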
[225] $Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization
Lin Zhu,Yifeng Yang,Xinbing Wang,Qinying Gu,Nanyang Ye
Main category: cs.CV
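TL;DR: The paper proposes ΔEnergy, a new OOD score based on the energy change during vision-language alignment, which substantially improves both OOD detection and OOD generalization. Maximizing a lower bound of ΔEnergy (termed EBM) yields strong theoretical and empirical results.
Details
Motivation: In real downstream tasks, vision-language models (VLMs) encounter both in-distribution (ID) and out-of-distribution (OOD) data, and OOD data includes covariate shifts (such as changes in image style) and semantic shifts (such as unseen classes). VLMs therefore need to generalize better to covariate-shifted OOD data while reliably detecting semantic-shifted OOD classes.
Contribution: 1. Proposes the ΔEnergy OOD score, which clearly outperforms the conventional energy score; 2. Proposes EBM, which maximizes a lower bound of ΔEnergy to improve OOD detection and generalization together; 3. Proves theoretically that EBM yields a domain-consistent Hessian that benefits OOD generalization; 4. Provides a unified fine-tuning framework.
Method: 1. Introduces the ΔEnergy score, based on the energy change during vision-language re-alignment; 2. Proposes EBM, which maximizes a lower bound of ΔEnergy; 3. Analyzes the properties of the Hessian under EBM; 4. Designs a unified fine-tuning framework.
Result: On challenging OOD detection and generalization benchmarks, the method outperforms recent approaches by 10% to 25% in AUROC.
Insight: Optimizing the ΔEnergy energy change improves OOD detection and, through a domain-consistent Hessian, also improves OOD generalization, revealing a link between energy change and model robustness.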
Abstract: Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs’ generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named $\Delta\mathrm{Energy}$. $\Delta\mathrm{Energy}$ significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, $\Delta\mathrm{Energy}$ can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for $\Delta\mathrm{Energy}$ (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs’ robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
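The energy-change idea in this abstract can be illustrated with a small numeric sketch. It assumes the commonly used energy score (negative temperature-scaled log-sum-exp over image-text similarities) and models "re-alignment" by overwriting the largest cosine similarity with a low value, as the abstract describes. The temperature, the low value of 0.0, and the function names are assumptions for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List


def energy(similarities: List[float], temperature: float = 0.01) -> float:
    """Energy score: -T * logsumexp(s / T), computed in a numerically stable way."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return -temperature * log_sum_exp


def delta_energy(similarities: List[float], low_value: float = 0.0,
                 temperature: float = 0.01) -> float:
    """Energy change after forcing the maximum similarity down to low_value (ASSUMED form)."""
    top = max(range(len(similarities)), key=similarities.__getitem__)
    realigned = list(similarities)
    realigned[top] = low_value
    return energy(realigned, temperature) - energy(similarities, temperature)


# A closed-set image has one dominant class similarity; an open-set image does not.
closed_set_change = delta_energy([0.30, 0.10, 0.12])
open_set_change = delta_energy([0.13, 0.12, 0.11])
```

In this toy setting the closed-set sample shows the larger energy change, which matches the observation that motivates using the change as an OOD score.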
[226] When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Samer Al-Hamadani
Main category: cs.CV
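TL;DR: The paper presents the first cost-effectiveness analysis comparing a conventionally supervised YOLO object detector with zero-shot detection by a vision-language model (Gemini Flash 2.5), revealing the economic and efficiency trade-offs across scenarios.
Details
Motivation: Conventional object detection depends on large amounts of manually annotated data, which is expensive, while zero-shot VLM detection needs no annotation but is less accurate. The paper compares the economics of the two approaches to support practical deployment decisions.
Contribution: 1) A systematic cost-effectiveness comparison of supervised and zero-shot detection; 2) quantitative break-even thresholds; 3) decision frameworks based on deployment volume, category stability, budget, and accuracy requirements.
Method: YOLO and Gemini Flash 2.5 are evaluated on stratified samples from COCO and on a diverse product image dataset, and the results are analyzed with a Total Cost of Ownership (TCO) model.
Result: Supervised YOLO reaches 91.2% accuracy on standard categories (versus 68.5% for zero-shot Gemini) but requires costly annotation. Zero-shot Gemini reaches 52.3% accuracy on diverse product categories, where YOLO cannot detect untrained classes, and has a lower cost per correct detection at 100,000 inferences.
Insight: Choosing a detection approach requires weighing economics together with efficiency: zero-shot methods are preferable at small scale or when categories change, while supervised methods are more economical at large scale with stable categories.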
Abstract: Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.
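Two of the figures in this abstract follow from simple arithmetic that is easy to verify. The sketch below checks the daily-volume equivalence and the percentage-point gap, and shows a generic cost-per-correct-detection helper. The helper's inputs in the last example are made-up placeholders, not the paper's Total Cost of Ownership model.

```python
def daily_volume(total_inferences: float, days: int = 365) -> float:
    """Inferences per day needed to reach a total within the given number of days."""
    return total_inferences / days


def cost_per_correct_detection(total_cost: float, inferences: int, accuracy: float) -> float:
    """Total spend divided by the number of correct detections."""
    return total_cost / (inferences * accuracy)


# Break-even volume from the abstract: 55 million inferences over one year.
per_day = daily_volume(55_000_000)

# Accuracy gap on standard categories, in percentage points.
accuracy_gap = 91.2 - 68.5

# HYPOTHETICAL inputs, only to show how the metric behaves.
example_cpcd = cost_per_correct_detection(total_cost=50.0, inferences=100_000, accuracy=0.5)
print(f"{per_day:,.0f} images/day, gap {accuracy_gap:.1f} points")
```

The daily figure comes out at roughly 150,700, consistent with the "151,000 images daily" quoted in the abstract once rounded to the nearest thousand.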
[227] sketch2symm: Symmetry-aware sketch-to-shape generation via semantic bridging
Yan Zhou,Mingji Li,Xiantao Zeng,Jie Lin,Yuexia Zhou
Main category: cs.CV
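TL;DR: Sketch2Symm generates symmetry-aware 3D shapes from sparse sketches through semantic bridging and symmetry constraints, markedly improving reconstruction quality.
Details
Motivation: Sketch inputs are abstract and sparse, so they lack sufficient semantic and geometric information.
Contribution: Proposes a two-stage generation method that combines semantic bridging (sketch-to-image translation) with symmetry constraints to strengthen geometric consistency.
Method: 1. Semantic bridging: sketch-to-image translation enriches the sparse sketch features; 2. Symmetry constraints: structural regularity is used as a geometric prior.
Result: Outperforms existing methods on several metrics (Chamfer Distance, Earth Mover’s Distance, F-Score).
Insight: Semantic bridging and symmetry constraints effectively compensate for the limitations of sketch input and improve 3D reconstruction quality.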
Abstract: Sketch-based 3D reconstruction remains a challenging task due to the abstract and sparse nature of sketch inputs, which often lack sufficient semantic and geometric information. To address this, we propose Sketch2Symm, a two-stage generation method that produces geometrically consistent 3D shapes from sketches. Our approach introduces semantic bridging via sketch-to-image translation to enrich sparse sketch representations, and incorporates symmetry constraints as geometric priors to leverage the structural regularity commonly found in everyday objects. Experiments on mainstream sketch datasets demonstrate that our method achieves superior performance compared to existing sketch-based reconstruction methods in terms of Chamfer Distance, Earth Mover’s Distance, and F-Score, verifying the effectiveness of the proposed semantic bridging and symmetry-aware design.
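Chamfer Distance, one of the reported metrics, can also serve to illustrate what a symmetry prior measures. The sketch below uses the squared-distance, mean-per-direction convention (conventions vary between papers) and, as an assumption for illustration, scores symmetry by comparing a point set with its mirror image across the plane x = 0. The `symmetry_error` helper is hypothetical and is not the constraint used in Sketch2Symm.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a: List[Point], points_b: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor squared distance, both directions."""
    a_to_b = sum(min(squared_distance(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(squared_distance(q, p) for p in points_a) for q in points_b) / len(points_b)
    return a_to_b + b_to_a


def reflect_x(points: List[Point]) -> List[Point]:
    """Mirror a point set across the plane x = 0."""
    return [(-x, y, z) for x, y, z in points]


def symmetry_error(points: List[Point]) -> float:
    """ASSUMED symmetry measure: Chamfer distance between a shape and its mirror image."""
    return chamfer_distance(points, reflect_x(points))


symmetric_shape = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
lopsided_shape = [(1.0, 0.0, 0.0)]
```

The brute-force nearest-neighbor search is quadratic in the number of points, which is fine for a demonstration but would normally be replaced by a spatial index for real point clouds.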
[228] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Haomin Wang,Jinhui Yin,Qi Wei,Wenguang Zeng,Lixin Gu,Shenglong Ye,Zhangwei Gao,Yaohui Wang,Yanting Zhang,Yuanqi Li,Yanwen Guo,Wenhai Wang,Kai Chen,Yu Qiao,Hongjie Zhang
Main category: cs.CV
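TL;DR: The paper proposes InternSVG, a unified approach to SVG (Scalable Vector Graphics) modeling that uses multimodal large language models (MLLMs) to address fragmented datasets, poor transfer of methods across tasks, and high structural complexity. Its core contributions are the SAgoge dataset, the SArena benchmark, and the InternSVG model, which together unify SVG understanding, editing, and generation.
Details
Motivation: SVG modeling faces fragmented datasets, poor transferability across tasks, and high structural complexity, and needs a unified solution.
Contribution: 1. Presents SAgoge, the largest and most comprehensive multimodal SVG dataset to date. 2. Designs the SArena benchmark, which covers a broad range of tasks and difficulty levels. 3. Proposes the InternSVG model, designed for SVG tasks, which achieves positive transfer across tasks through special tokens and two-stage training.
Method: 1. The SAgoge dataset covers static and dynamic SVGs and supports tasks at multiple levels. 2. The SArena benchmark provides standardized task definitions and evaluation. 3. The InternSVG model introduces SVG-specific special tokens, subword-based embedding initialization, and two-stage training that progresses from simple static SVGs to complex animations.
Result: Experiments on SArena and an existing benchmark show that InternSVG substantially outperforms leading open and proprietary models.
Insight: A unified MLLM framework, combined with a large-scale dataset and a targeted training strategy, can effectively handle the complexity and diversity of SVG modeling.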
Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
[229] DocReward: A Document Reward Model for Structuring and Stylizing
Junpeng Liu,Yuzhong Zhao,Bowen Cao,Jiayu Ding,Yilin Jia,Tengchao Lv,Yupan Huang,Shaohan Huang,Nan Yang,Li Dong,Lei Cui,Tao Ge,Xun Wang,Huitian Jiao,Sun Mao,FNU Kartik,Si-Qing Chen,Wai Lam,Furu Wei
Main category: cs.CV
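TL;DR: DocReward is a document reward model that evaluates the structural and stylistic quality of documents, filling the gap left by existing techniques that overlook visual structure and style.
Details
Motivation: Existing automated document generation focuses mainly on textual quality and neglects the visual structure and style that matter for readability and engagement, so a new reward model is needed to close this gap.
Contribution: Proposes DocReward, trained on the multi-domain DocPair dataset, which evaluates document structure and style and outperforms GPT-4o and GPT-5 on human-ranked test data.
Method: DocReward is trained with the Bradley-Terry loss, optimizing the model on the ranking between high-professionalism and low-professionalism documents.
Result: DocReward exceeds GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, and achieves a 60.8% win rate in a document generation task.
Insight: Document structure and style are essential to professionalism and user experience, and DocReward provides effective guidance for automated document generation.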
Abstract: Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, they primarily focus on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap arises mainly from the absence of suitable reward models to guide agentic workflows toward producing documents with stronger structural and stylistic quality. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. We construct a multi-domain dataset DocPair of 117K paired documents, covering 32 domains and 267 document types, each including a high- and low-professionalism document with identical content but different structure and style. This enables the model to evaluate professionalism comprehensively, and in a textual-quality-agnostic way. DocReward is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. To assess the performance of reward models, we create a test dataset containing document bundles ranked by well-educated human evaluators. Notably, DocReward outperforms GPT-4o and GPT-5 in accuracy by 30.6 and 19.4 percentage points, respectively, demonstrating its superiority over baselines. In an extrinsic evaluation of document generation, DocReward achieves a significantly higher win rate of 60.8%, compared to GPT-5’s 37.7% win rate, demonstrating its utility in guiding generation agents toward producing human-preferred documents.
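The Bradley-Terry loss named in the abstract has a compact closed form for a pair of scalar reward scores: the negative log-sigmoid of the score difference. The sketch below shows that standard pairwise form; the function name and the example scores are illustrative, and nothing here reflects DocReward's actual architecture or training code.

```python
import math


def bradley_terry_loss(score_preferred: float, score_other: float) -> float:
    """Pairwise loss -log(sigmoid(s_preferred - s_other)), written as a stable softplus."""
    diff = score_preferred - score_other
    # -log(sigmoid(d)) == softplus(-d) == max(-d, 0) + log1p(exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# The professional document should outscore its low-professionalism twin.
agrees_with_ranking = bradley_terry_loss(score_preferred=2.0, score_other=0.0)
contradicts_ranking = bradley_terry_loss(score_preferred=0.0, score_other=2.0)
undecided = bradley_terry_loss(score_preferred=1.0, score_other=1.0)
```

The loss is small when the scores agree with the annotated ranking and grows roughly linearly when they contradict it, which is the penalty behavior the abstract describes.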
[230] Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation
Joshua Niemeijer,Jan Ehrhardt,Heinz Handels,Hristina Uzunova
Main category: cs.CV
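TL;DR: The paper proposes an uncertainty-aware ControlNet that introduces an uncertainty mechanism so that ControlNet can be trained with data from unlabeled domains. It generates synthetic labeled data for the target domain, bridging the domain gap and markedly improving downstream task performance.
Details
Motivation: Generative models can produce high-quality image data, but existing ControlNets tend to reproduce the original training distribution, which limits how much they can help downstream tasks. The paper uses data from unlabeled domains and uncertainty guidance to generate synthetic target-domain data and address the domain gap.
Contribution: 1. Proposes an uncertainty-aware ControlNet framework that combines uncertainty control from unlabeled data with semantic control from labeled data; 2. Generates synthetic target-domain data without additional supervision, markedly improving downstream tasks such as segmentation; 3. Validates the method on medical OCT images and traffic scenes.
Method: 1. Adds an uncertainty control mechanism to the network, using data from the unlabeled domain to generate high-uncertainty data; 2. Trains uncertainty control and semantic control jointly to generate synthetic labeled data for the target domain; 3. Verifies generality and robustness on Home-OCT and traffic scenes.
Result: Experiments show that the generated synthetic data markedly improves segmentation in the target domain (such as low-quality Home-OCT images and traffic scenes) without extra annotation.
Insight: Uncertainty-guided data generation adapts flexibly to arbitrary domain shifts without strictly learning an image style, offering an efficient solution for cross-domain tasks.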
Abstract: Generative Models are a valuable tool for the controlled creation of high-quality image data. Controlled diffusion models like the ControlNet have allowed the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, like semantic segmentation, are trained. However, this augmentation effect is limited since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where typically high-quality Spectralis images are available with given ground truth segmentations, enabling the training of segmentation networks. The recent development in Home-OCT devices, however, yields retinal OCTs with lower quality and a large domain shift, such that off-the-shelf segmentation networks cannot be applied for this type of data. Synthesizing annotated images from the Home-OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty-guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic scene experiment.
[231] Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
Shijie Zhao,Xuanyu Zhang,Weiqi Li,Junlin Li,Li Zhang,Tianfan Xue,Jian Zhang
Main category: cs.CV
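TL;DR: The paper studies the generalization ability of image quality assessment (IQA) models trained with reinforcement learning (RL) and proposes RALI, a new algorithm that uses contrastive learning to align images directly with the generalizable text representations learned through RL, greatly reducing inference time and parameter count.
Details
Motivation: RL-based IQA models generalize very well, but their inference energy use and latency are extremely high, which limits practical use. The paper aims to explain the generalization mechanism and offer a more efficient alternative.
Contribution: 1. Shows that RL-trained IQA models use their reasoning ability to convert redundant visual representations into compact, cross-domain aligned text representations; 2. Proposes RALI, which needs no RL reasoning process at inference and greatly reduces model complexity and inference time.
Method: Extensive experiments verify the generalization mechanism of RL-trained MLLMs. RALI then uses contrastive learning to align images with the text representations learned through RL.
Result: On the quality scoring task, RALI achieves generalization comparable to reasoning-based models while needing less than 5% of their parameters and inference time.
Insight: The generalization of RL models comes from reasoning that transforms visual representations, and aligning images directly with the text representations achieves a similar effect efficiently, pointing toward lightweight IQA models.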
Abstract: Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in specific scenarios. Through extensive experiments, this paper verifies and elaborates that through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
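The abstract says only that RALI "employs contrastive learning" to align images with text representations. As a rough illustration of what such an objective can look like, the sketch below implements a one-directional InfoNCE loss over cosine similarities. Choosing InfoNCE, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions; the paper may use a different contrastive formulation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(image_embs: List[Vector], text_embs: List[Vector],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy of picking text i for image i among all texts in the batch."""
    total = 0.0
    for i, image in enumerate(image_embs):
        logits = [cosine(image, text) / temperature for text in text_embs]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(images, aligned_texts)
swapped_loss = info_nce(images, swapped_texts)
```

Minimizing a loss of this kind pulls each image embedding toward its paired text representation, so at inference only the image encoder is needed, in line with the abstract's claim that no LLM has to be loaded.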
[232] Robust Ego-Exo Correspondence with Long-Term Memory
Yijun Hu,Bing Fan,Xin Gu,Haiqing Ren,Dongfang Liu,Heng Fan,Libo Zhang
Main category: cs.CV
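TL;DR: The paper proposes LM-EEC, an ego-exo correspondence (EEC) framework built on SAM 2. A dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE) address feature fusion and long-term memory across ego and exo views, substantially improving performance.
Details
Motivation: Existing methods for object correspondence between egocentric and exocentric views struggle with large viewpoint changes, occlusions, and small objects. SAM 2 excels at video segmentation but breaks down on the EEC task because of ineffective feature fusion and limited long-term memory.
Contribution: Proposes the LM-EEC framework, comprising (i) a Memory-View MoE module that adaptively assigns feature weights and (ii) a dual-memory bank system that retains critical long-term information while removing redundancy.
Method: Uses a dual-memory architecture and an MoE-inspired adaptive feature routing module, together with a compression strategy for long-term information storage.
Result: On the EgoExo4D benchmark, LM-EEC outperforms existing methods and the SAM 2 baseline, setting a new state of the art.
Insight: Adaptive feature routing and long-term memory management are key to solving ego-exo correspondence.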
Abstract: Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.
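The routing idea in this abstract, assigning contribution weights to each expert feature, can be pictured with a very small example. The sketch below fuses expert features per channel using softmax gate weights. It covers only a channel-wise branch, takes the gate logits as given rather than learning them, and uses hypothetical names throughout, so it should be read as an illustration of soft routing rather than the Memory-View MoE module itself.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_channels(expert_features: List[List[float]],
                   gate_logits: List[List[float]]) -> List[float]:
    """Fuse expert_features[e][c] with per-channel softmax weights over the experts."""
    n_experts = len(expert_features)
    n_channels = len(expert_features[0])
    fused = []
    for c in range(n_channels):
        weights = softmax([gate_logits[e][c] for e in range(n_experts)])
        fused.append(sum(w * expert_features[e][c] for e, w in enumerate(weights)))
    return fused


# Two experts (for example, an ego-view and an exo-view feature), two channels.
features = [[1.0, 2.0], [3.0, 4.0]]
balanced = route_channels(features, gate_logits=[[0.0, 0.0], [0.0, 0.0]])
first_expert_wins = route_channels(features, gate_logits=[[100.0, 100.0], [-100.0, -100.0]])
```

With equal logits the output is the plain average of the experts, and with strongly skewed logits it collapses to a single expert, which shows the range of behaviors a learned gate can select between.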
[233] Enhancing Maritime Domain Awareness on Inland Waterways: A YOLO-Based Fusion of Satellite and AIS for Vessel Characterization
Geoffery Agorku,Sarah Hernandez,Hayley Hames,Cade Wagner
Main category: cs.CV
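TL;DR: The paper proposes a YOLO v11-based framework that fuses high-resolution satellite imagery with AIS data to improve maritime domain awareness on inland waterways and overcome the limitations of AIS-only monitoring.
Details
Motivation: Maritime Domain Awareness (MDA) on inland waterways suffers from the vulnerabilities of cooperative systems such as AIS. The paper fuses non-cooperative satellite imagery with AIS to compensate for AIS shortcomings and make vessel monitoring more accurate and reliable.
Contribution: 1. Proposes a novel fusion framework that combines satellite imagery with AIS data for vessel detection and characterization; 2. Develops a dataset of 4,550 annotated instances; 3. Achieves high accuracy on several tasks, including vessel classification and operational status detection.
Method: 1. Uses the YOLO v11 model to detect and classify vessels; 2. Fuses the visual detections with AIS trajectory data; 3. Evaluates the model on classification, operational status, directionality, and other tasks.
Result: 1. Vessel classification F1 score of 95.8%; 2. Operational status F1 score of 99.4%; 3. Directionality accuracy of 93.8%; 4. Barge count mean absolute error of 2.4.
Insight: Fusing satellite imagery with AIS data can substantially improve inland waterway monitoring, in particular by compensating for AIS limitations such as the "dark vessel" problem. Multi-modal deep learning could extend the approach further.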
Abstract: Maritime Domain Awareness (MDA) for inland waterways remains challenged by cooperative system vulnerabilities. This paper presents a novel framework that fuses high-resolution satellite imagery with vessel trajectory data from the Automatic Identification System (AIS). This work addresses the limitations of AIS-based monitoring by leveraging non-cooperative satellite imagery and implementing a fusion approach that links visual detections with AIS data to identify dark vessels, validate cooperative traffic, and support advanced MDA. The You Only Look Once (YOLO) v11 object detection model is used to detect and characterize vessels and barges by vessel type, barge cover, operational status, barge count, and direction of travel. An annotated data set of 4,550 instances was developed from $5{,}973~\mathrm{mi}^2$ of Lower Mississippi River imagery. Evaluation on a held-out test set demonstrated vessel classification (tugboat, crane barge, bulk carrier, cargo ship, and hopper barge) with an F1 score of 95.8%; barge cover (covered or uncovered) detection yielded an F1 score of 91.6%; operational status (staged or in motion) classification reached an F1 score of 99.4%. Directionality (upstream, downstream) yielded 93.8% accuracy. The barge count estimation resulted in a mean absolute error (MAE) of 2.4 barges. Spatial transferability analysis across geographically disjoint river segments showed accuracy was maintained as high as 98%. These results underscore the viability of integrating non-cooperative satellite sensing with AIS fusion. This approach enables near-real-time fleet inventories, supports anomaly detection, and generates high-quality data for inland waterway surveillance. Future work will expand annotated datasets, incorporate temporal tracking, and explore multi-modal deep learning to further enhance operational scalability.
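One part of the fusion step, linking visual detections with AIS data to identify dark vessels, can be sketched as a nearest-position check. The example below flags any detection that has no AIS report within a distance threshold, using the haversine great-circle distance. The 200 m threshold, the function names, and the coordinates are hypothetical, and a real pipeline would also need time alignment and one-to-one assignment, which this sketch omits.

```python
import math
from typing import List, Tuple

LatLon = Tuple[float, float]
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_dark_vessels(detections: List[LatLon], ais_positions: List[LatLon],
                      max_distance_m: float = 200.0) -> List[int]:
    """Indices of detections with no AIS position within max_distance_m (ASSUMED rule)."""
    dark = []
    for index, (lat, lon) in enumerate(detections):
        has_match = any(haversine_m(lat, lon, a_lat, a_lon) <= max_distance_m
                        for a_lat, a_lon in ais_positions)
        if not has_match:
            dark.append(index)
    return dark


# Made-up coordinates: the first detection sits next to an AIS report, the second does not.
satellite_detections = [(30.0000, -91.0000), (30.5000, -91.2000)]
ais_reports = [(30.0005, -91.0005)]
dark_indices = find_dark_vessels(satellite_detections, ais_reports)
```

Detections that do match an AIS report serve the complementary purpose mentioned in the abstract, validating cooperative traffic.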
[234] Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion
Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui
Main category: cs.CV
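TL;DR: The paper proposes VGDCFusion, a vision-language-model-guided degradation-coupled fusion network that tightly couples degradation modeling with image fusion and substantially improves infrared and visible image fusion under degraded conditions.
Details
Motivation: Existing infrared and visible image fusion methods assume high-quality inputs and rely on manual pre-processing when images are degraded, which hurts performance. The paper addresses this disconnect between degradation handling and image fusion.
Contribution: Proposes VGDCFusion, which tightly couples degradation modeling with the fusion process, and designs the SPDCE and JPDCF modules for modality-specific degradation awareness and for cross-modal degradation perception with feature fusion, respectively.
Method: Uses a VLM for degradation-aware perception and guided suppression. SPDCE extracts modality-specific degradation features and models degradation suppression, and JPDCF performs cross-modal degradation perception and feature fusion.
Result: Experiments show that VGDCFusion significantly outperforms existing fusion methods across various degradation scenarios.
Insight: Coupling degradation awareness with image fusion effectively improves fusion under degradation, and introducing a VLM opens a new direction for degradation modeling.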
Abstract: Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handling degraded images, these methods heavily rely on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes a joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.
[235] VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment
Qing Li,Huifang Feng,Xun Gong,Yu-Shen Liu
Main category: cs.CV
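TL;DR: The paper proposes VA-GS, which enhances the geometric representation of 3D Gaussian Splatting through view alignment (VA). It combines edge-aware image cues, a visibility-aware photometric alignment loss, and normal-based constraints to improve surface reconstruction and novel view synthesis.
Details
Motivation: 3D Gaussian Splatting excels at high-quality, real-time view synthesis, but its surface reconstruction accuracy still needs improvement. Because Gaussians are discrete and unstructured, relying on image rendering loss alone leads to inaccurate geometry and inconsistent multi-view alignment.
Contribution: 1. Introduces edge-aware image cues to improve surface boundary delineation; 2. Proposes a visibility-aware photometric alignment loss to strengthen cross-view geometric consistency; 3. Adds normal-based constraints to refine the spatial orientation of Gaussians; 4. Uses deep image feature embeddings to make the learned geometry more robust.
Method: 1. Incorporates edge-aware cues into the rendering loss; 2. Designs a visibility-aware photometric alignment loss that models occlusions; 3. Refines the spatial orientation of Gaussians with normal-based constraints; 4. Enforces cross-view consistency through deep feature embeddings.
Result: On standard benchmarks, VA-GS achieves state-of-the-art performance in both surface reconstruction and novel view synthesis.
Insight: Combining geometric consistency constraints with multiple forms of supervision (such as normals and deep features) can substantially improve the geometric representation of 3D Gaussian Splatting.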
Abstract: 3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.
[236] AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Zhiwei Jin,Xiaohui Song,Nan Wang,Yafei Liu,Chao Li,Xin Li,Ruichen Wang,Zhihao Li,Qi Qi,Long Cheng,Dongze Hao,Quanlong Zheng,Yanhao Zhang,Haobo Ji,Jian Ma,Zhitong Zheng,Zhenyi Lin,Haolin Deng,Xin Zou,Xiaojie Yin,Ruilin Wang,Liankai Cai,Haijing Liu,Yuqing Qiu,Ke Chen,Zixian Li,Chi Xie,Huafei Li,Chenxing Li,Chuangchuang Wang,Kai Tang,Zhiguang Zhu,Kai Tang,Wenmei Gao,Rui Wang,Jun Wu,Chao Liu,Qin Xie,Chen Chen,Haonan Lu
Main category: cs.CV
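TL;DR: AndesVL is an efficient mobile-side multimodal large language model (MLLM) designed for edge devices, with 0.6B to 4B parameters, built on the Qwen3 LLM with various visual encoders, and comparable in performance to similar open-source models.
Details
Motivation: Cloud-based MLLMs such as GPT-4o and Gemini are powerful but cannot run on edge devices with limited memory, power, and computing capacity, so efficient mobile-side MLLMs are needed.
Contribution: 1. Proposes AndesVL, an efficient mobile-side MLLM that supports a variety of vision tasks; 2. Details the model architecture, training pipeline, and data, covering tasks such as text-rich image understanding, multi-image understanding, general VQA, and hallucination mitigation; 3. Proposes the 1+N LoRA method to further improve model efficiency.
Method: Builds models from 0.6B to 4B parameters on the Qwen3 LLM combined with various visual encoders. The training pipeline covers data preparation and multi-task training, and the 1+N LoRA technique is introduced to optimize the model.
Result: AndesVL performs strongly on open-source benchmarks across multiple task areas, on par with state-of-the-art models of similar scale.
Insight: Mobile-side MLLMs can reach high performance while staying efficient, and techniques such as LoRA show promise for model compression and optimization.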
Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3’s LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoR
[237] Towards Fast and Scalable Normal Integration using Continuous Components
Francesco Milano,Jen Jen Chung,Lionel Ott,Roland Siegwart
Main category: cs.CV
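TL;DR: The paper proposes a fast and scalable normal integration method that recasts the problem as estimating the relative scales of continuous components, drastically reducing the number of optimization variables and enabling efficient large-scale reconstruction from normal maps.
Details
Motivation: Conventional normal integration requires an iterative global optimization, which is computationally expensive and scales poorly to high-resolution normal maps. The paper addresses this efficiency problem.
Contribution: 1. Recasts normal integration as the estimation of relative scales of continuous components; 2. Proposes a heuristic for estimating continuous components; 3. Designs a component merging strategy and a technique for rebalancing optimization terms.
Method: 1. Extracts continuous components from the initial normal map with a heuristic; 2. Reduces the number of optimization variables by merging components and rebalancing optimization terms; 3. Iteratively optimizes the relative scales for efficient integration.
Result: Achieves state-of-the-art results on the standard normal integration benchmark and an order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.
Insight: The notion of continuous components turns pixel-level optimization into a higher-level optimization problem, significantly lowering computational complexity.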
Abstract: Surface normal integration is a fundamental problem in computer vision, dealing with the objective of reconstructing a surface from its corresponding normal map. Existing approaches require an iterative global optimization to jointly estimate the depth of each pixel, which scales poorly to larger normal maps. In this paper, we address this problem by recasting normal integration as the estimation of relative scales of continuous components. By constraining pixels belonging to the same component to jointly vary their scale, we drastically reduce the number of optimization variables. Our framework includes a heuristic to accurately estimate continuous components from the start, a strategy to rebalance optimization terms, and a technique to iteratively merge components to further reduce the size of the problem. Our method achieves state-of-the-art results on the standard normal integration benchmark in as little as a few seconds and achieves one-order-of-magnitude speedup over pixel-level approaches on large-resolution normal maps.
[238] Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Ruiping Liu,Junwei Zheng,Yufan Chen,Zirui Wang,Kunyu Peng,Kailun Yang,Jiaming Zhang,Marc Pollefeys,Rainer Stiefelhagen
Main category: cs.CV
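TL;DR: Situat3DChange is a large-scale dataset supporting three situation-aware change understanding tasks that follow the perception-action model, covering perception tasks and an action task. The dataset combines human observations with multimodal information, and the authors propose SCReasoner to compare point clouds efficiently.
Details
Motivation: Current 3D datasets and evaluation benchmarks usually study either dynamic scenes or dynamic situations in isolation, which leaves the understanding of dynamic environments incomplete. The authors therefore propose Situat3DChange, a dataset for situation-aware change understanding.
Contribution: 1. Introduces the large-scale Situat3DChange dataset, with 121K question-answer pairs, 36K change descriptions, and 17K rearrangement instructions. 2. Proposes SCReasoner for efficient point cloud comparison. 3. Demonstrates the dataset's value for training through data scaling and cross-domain transfer experiments.
Method: 1. Builds the dataset from 11K human observations, combining egocentric and allocentric perspectives with spatial relations. 2. Proposes SCReasoner, which compares point clouds with minimal parameter overhead.
Result: A comprehensive evaluation on the Situat3DChange tasks shows both the progress and the limitations of MLLMs in dynamic scene understanding. Data scaling and cross-domain experiments demonstrate the dataset's task-agnostic effectiveness.
Insight: A situation-aware dataset that combines human observations with multimodal information helps improve the understanding of dynamic environments. SCReasoner's lightweight design offers an efficient option for 3D MLLMs.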
Abstract: Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.
[239] LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Jianhao Yuan,Fabio Pizzati,Francesco Pinto,Lars Kunze,Ivan Laptev,Paul Newman,Philip Torr,Daniele De Martini
Main category: cs.CV
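TL;DR: LikePhys is a training-free method that evaluates intuitive physics understanding in video diffusion models using an ELBO-based likelihood surrogate, and it is validated across several physics domains.
Details
Motivation: Intuitive physics understanding is essential for building general-purpose simulators of the physical world, but existing evaluation methods struggle to separate physical correctness from visual appearance.
Contribution: Proposes the LikePhys method, designs the PPE evaluation metric, and shows on a multi-domain physics benchmark that it outperforms existing baselines.
Method: Uses the denoising objective as an ELBO-based likelihood surrogate and evaluates a model's physics understanding on valid-invalid video pairs.
Result: The PPE metric aligns strongly with human preference and reveals differences in model capability, as well as improvement trends, across physics domains.
Insight: Scaling model capacity and inference settings improves physics understanding, but complex and chaotic dynamics remain challenging.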
Abstract: Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
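The pairwise evaluation described above can be illustrated with a minimal sketch. The abstract does not give the exact PPE formula, so this assumes PPE is the fraction of valid-invalid pairs in which the model does not assign the lower denoising loss (the likelihood surrogate) to the physically valid video. The function name and the loss values are hypothetical.

```python
from typing import Sequence


def plausibility_preference_error(
    valid_losses: Sequence[float], invalid_losses: Sequence[float]
) -> float:
    """Fraction of pairs where the valid video does NOT get the lower loss.

    Assumption: a lower denoising loss stands in for a higher likelihood,
    so a model with good intuitive physics should prefer the valid video.
    """
    if len(valid_losses) != len(invalid_losses):
        raise ValueError("losses must come in valid/invalid pairs")
    if not valid_losses:
        raise ValueError("need at least one pair")
    errors = sum(1 for v, i in zip(valid_losses, invalid_losses) if v >= i)
    return errors / len(valid_losses)


# Hypothetical per-video denoising losses for four valid/invalid pairs.
valid = [0.10, 0.30, 0.25, 0.40]
invalid = [0.20, 0.25, 0.35, 0.50]
ppe = plausibility_preference_error(valid, invalid)
print(ppe)  # only the second pair is ranked wrongly
```

Under this reading, a lower value means the model prefers physically valid videos more often.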
[240] mmWalk: Towards Multi-modal Multi-view Walking Assistance
Kedi Ying,Ruiping Liu,Chongyan Chen,Mingzhe Tao,Hao Shi,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
Main category: cs.CV
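TL;DR: The paper builds mmWalk, a simulated multi-modal, multi-view dataset that supports safe outdoor navigation for people with blindness or low vision, and derives the mmWalkVQA benchmark, which exposes the weaknesses of existing vision-language models on risk assessment and navigation tasks.
Details
Motivation: People with blindness or low vision lack a holistic understanding of the scene when walking in complex environments, so a multi-modal, multi-view dataset and assistive technology are needed to improve safe navigation.
Contribution: 1) Builds the mmWalk dataset with multi-view sensor data and accessibility features; 2) generates the mmWalkVQA benchmark with over 69k visual question-answer items; 3) demonstrates the limitations of existing vision-language models and provides a fine-tuned model.
Method: 1) Manually control and record 120 scenario-categorized walking trajectories; 2) collect over 559k synchronized panoramic images across RGB, depth, and semantic modalities; 3) design and evaluate the mmWalkVQA benchmark.
Result: Existing vision-language models struggle with risk assessment and navigation tasks in zero-shot and few-shot settings; validation on real-world datasets shows that fine-tuning on mmWalk is effective.
Insight: Integrating multi-modal data and capturing real-world complexity are key to better navigation for people with blindness or low vision; future work should further improve multi-modal understanding in models.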
Abstract: Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
[241] ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
Liu Yang,Huiyu Duan,Ran Tao,Juntao Cheng,Sijing Wu,Yunhao Li,Jing Liu,Xiongkuo Min,Guangtao Zhai
Main category: cs.CV
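TL;DR: The paper presents ODI-Bench, a new benchmark for omnidirectional image (ODI) understanding with 2,000 high-quality omnidirectional images and over 4,000 manually annotated QA pairs. Experiments show that current multimodal large language models (MLLMs) handle omnidirectional images poorly. The authors also propose Omni-CoT, a training-free method that significantly improves MLLMs through chain-of-thought reasoning over textual and visual cues.
Details
Motivation: Omnidirectional images (ODIs) are widely used in VR, AR, and embodied intelligence, but the ability of MLLMs to understand omnidirectional environments has not been studied thoroughly. Dedicated benchmarks and methods are needed to fill this gap.
Contribution: 1. Introduces the ODI-Bench benchmark, covering 10 fine-grained tasks; 2. runs extensive experiments on 20 MLLMs; 3. proposes the training-free Omni-CoT method, which significantly improves MLLM understanding of omnidirectional environments.
Method: 1. Build ODI-Bench with 2,000 omnidirectional images and over 4,000 QA pairs; 2. propose Omni-CoT, which strengthens omnidirectional understanding through chain-of-thought reasoning over textual and visual information.
Result: Current MLLMs have limited understanding of omnidirectional images, while Omni-CoT significantly improves performance without any training.
Insight: The distinctive spatial properties of omnidirectional images pose new challenges for MLLMs, and chain-of-thought reasoning that combines text and vision is an effective way to improve their understanding.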
Abstract: Omnidirectional images (ODIs) provide a full 360x180 field of view and are widely adopted in VR, AR, and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method that significantly enhances MLLMs’ comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon publication.
[242] SNAP: Towards Segmenting Anything in Any Point Cloud
Aniket Gupta,Hanhui Wang,Charles Saunders,Aruni RoyChowdhury,Hanumant Singh,Huaizu Jiang
Main category: cs.CV
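TL;DR: SNAP is a unified model for interactive 3D point cloud segmentation that supports point and text prompts across domains. It trains on multiple datasets and uses domain-adaptive normalization to avoid negative transfer, and it performs strongly on several benchmarks.
Details
Motivation: Current 3D point cloud segmentation methods are limited to a single domain or a single form of interaction, and training on multiple datasets tends to cause negative transfer, which limits generality and practicality.
Contribution: Proposes the SNAP model, which supports cross-domain segmentation with point and text prompts, uses domain-adaptive normalization to prevent negative transfer, and generalizes well across multiple datasets.
Method: Combines multi-dataset training with domain-adaptive normalization; automatically generates mask proposals and matches them against CLIP embeddings of text queries, enabling open-vocabulary and panoptic segmentation.
Result: State-of-the-art performance on 8 of 9 zero-shot benchmarks for spatial prompts and competitive results on all 5 text-prompt benchmarks, showing that a unified model can match or exceed specialized domain-specific methods.
Insight: A unified model can generalize through domain adaptation and cross-domain training, providing a practical tool for large-scale 3D annotation.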
Abstract: Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. The project page is at https://neu-vi.github.io/SNAP/
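The matching step between mask proposals and text queries can be pictured with a small sketch. It assumes each mask proposal and each text query has already been turned into an embedding vector and that matching picks the query with the highest cosine similarity; the abstract does not state the exact rule. The toy 3-dimensional vectors and the function names are hypothetical stand-ins for real CLIP embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def label_masks(
    mask_embeddings: List[Sequence[float]],
    text_embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Assign each mask proposal the text query it is most similar to."""
    return [
        max(text_embeddings, key=lambda name: cosine(mask, text_embeddings[name]))
        for mask in mask_embeddings
    ]


# Toy embeddings (hypothetical); real ones would come from an encoder.
masks = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.0, 0.1]]
queries = {"car": [1.0, 0.0, 0.0], "tree": [0.0, 1.0, 0.0]}
labels = label_masks(masks, queries)
print(labels)
```

A practical system would likely also apply a similarity threshold so that masks matching no query stay unlabeled; that detail is omitted here.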
[243] Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Walid Elbarz,Mohamed Bourriz,Hicham Hajji,Hamd Ait Abdelali,François Bourzeix
Main category: cs.CV
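TL;DR: The paper systematically evaluates three foundation models (HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset) on hyperspectral crop classification. The SpectralEarth model performs best with an overall accuracy of 93.5%, ahead of DOFA (62.6%) and HyperSigma (34.5%).
Details
Motivation: Hyperspectral crop classification matters for agricultural applications, but the potential of foundation models in this area remains underexplored. The paper aims to fill this gap and provide guidance for practical use.
Contribution: The main contribution is a comprehensive performance evaluation of the HyperSigma, DOFA, and SpectralEarth pre-trained models, showing the potential of foundation models for hyperspectral crop classification.
Method: The three foundation models are fine-tuned on manually labeled data from a training region and then evaluated on an independent test region. Metrics are overall accuracy (OA), average accuracy (AA), and F1-score.
Result: The SpectralEarth pre-trained model performs best (OA = 93.5%), followed by DOFA (OA = 62.6%) and HyperSigma (OA = 34.5%). A compact SpectralEarth variant trained from scratch also reaches an OA of 91%.
Insight: Model architecture is critical for generalization across regions and sensors. The success of the SpectralEarth model suggests that large-scale pre-training data and multitemporal information help hyperspectral crop classification considerably.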
Abstract: Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.
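The three reported metrics can be computed from a confusion matrix, as in the sketch below. It uses the standard definitions: OA is the fraction of correctly classified samples, AA is the mean per-class recall, and F1 is averaged over classes. The abstract does not say how F1 is averaged, so macro averaging is an assumption here, and the confusion matrix is made-up illustration data.

```python
from typing import Dict, List


def classification_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """OA, AA and macro F1 from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    diag = [confusion[k][k] for k in range(n)]
    row_sums = [sum(confusion[k]) for k in range(n)]  # true count per class
    col_sums = [sum(confusion[r][k] for r in range(n)) for k in range(n)]

    oa = sum(diag) / total
    recalls = [diag[k] / row_sums[k] if row_sums[k] else 0.0 for k in range(n)]
    aa = sum(recalls) / n
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum)
    f1s = [
        2 * diag[k] / (row_sums[k] + col_sums[k])
        if (row_sums[k] + col_sums[k])
        else 0.0
        for k in range(n)
    ]
    return {"OA": oa, "AA": aa, "F1": sum(f1s) / n}


# Made-up two-class example: 10 samples per true class.
metrics = classification_metrics([[8, 2], [1, 9]])
print(metrics)
```

OA and AA coincide in this balanced toy case; they diverge when class sizes differ, which is why crop-mapping studies often report both.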
[244] MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
Hongyu Zhu,Lin Chen,Mounim A. El-Yacoubi,Mingsheng Shang
Main category: cs.CV
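TL;DR: MS-Mix is an emotion-aware multimodal mixing framework for data augmentation. With an adaptive mixing strategy and a sentiment alignment loss, it addresses semantic inconsistency in multimodal sentiment analysis and significantly improves model performance.
Details
Motivation: Multimodal sentiment analysis (MSA) is limited by scarce annotated data, and existing Mixup augmentation introduces semantic inconsistency and label ambiguity in multimodal tasks. An emotion-aware mixing mechanism is needed to optimize sample selection and mixing ratios.
Contribution: 1. Proposes the SASS strategy, which avoids mixing samples with contradictory emotions; 2. designs the SIG module to compute modality-specific mixing ratios dynamically; 3. introduces the SAL loss to align prediction distributions across modalities.
Method: MS-Mix combines the SASS strategy, the SIG module, and the SAL loss, optimizing the mixing process through emotional intensity and modality alignment.
Result: On three benchmark datasets with six state-of-the-art backbones, MS-Mix consistently outperforms existing methods.
Insight: Emotion-aware mixing is key to multimodal data augmentation, and computing modality-specific mixing ratios dynamically effectively improves model robustness.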
Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module that uses multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities; and (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities and incorporates a Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
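For orientation, the sketch below shows plain Mixup applied with a separate, fixed mixing ratio per modality. It is not the MS-Mix method: in MS-Mix the ratios come from the SIG module and samples are chosen by SASS, neither of which is reproduced here. The feature vectors, the ratios, and the rule for mixing the label (using the mean ratio) are all assumptions for illustration.

```python
from typing import Dict, List


def mixup(a: List[float], b: List[float], lam: float) -> List[float]:
    """Plain Mixup: convex combination lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


# Hypothetical per-modality features of two samples.
sample_i: Dict[str, List[float]] = {"text": [1.0, 0.0], "audio": [2.0, 4.0]}
sample_j: Dict[str, List[float]] = {"text": [0.0, 1.0], "audio": [0.0, 0.0]}

# Fixed ratios stand in for the dynamically predicted ones (assumption).
ratios = {"text": 0.75, "audio": 0.5}
mixed = {m: mixup(sample_i[m], sample_j[m], ratios[m]) for m in ratios}

# Assumption: the sentiment label is mixed with the mean modality ratio.
label_i, label_j = 1.0, -1.0
mean_ratio = sum(ratios.values()) / len(ratios)
mixed_label = mean_ratio * label_i + (1.0 - mean_ratio) * label_j
print(mixed, mixed_label)
```

The example mixes a positive and a negative sample on purpose: this is the kind of contradictory pair that, according to the abstract, the SASS strategy is meant to avoid.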
[245] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Yicheng Xu,Yue Wu,Jiashuo Yu,Ziang Yan,Tianxiang Jiang,Yinan He,Qingsong Zhao,Kai Chen,Yu Qiao,Limin Wang,Manabu Okumura,Yi Wang
Main category: cs.CV
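TL;DR: ExpVid is the first benchmark focused on understanding and reasoning over scientific experiment videos. It evaluates multimodal large language models (MLLMs) on fine-grained and long-horizon tasks in laboratory settings.
Details
Motivation: Existing benchmarks neglect the fine-grained, long-horizon nature of real laboratory work, so the actual capabilities of MLLMs on scientific experiment videos are poorly understood.
Contribution: Introduces the ExpVid benchmark, proposes a new three-level task hierarchy (fine-grained perception, procedural understanding, scientific reasoning) to evaluate MLLMs systematically, and reveals their performance gaps on scientific experiment videos.
Method: Using a vision-centric annotation pipeline that combines automated generation with multi-disciplinary expert validation, the authors build a dataset from peer-reviewed video publications and evaluate 19 MLLMs.
Result: MLLMs excel at coarse-grained recognition but struggle with fine-grained disambiguation, tracking state changes, and linking experimental procedures to scientific outcomes. There is a notable gap between proprietary and open-source models in high-order reasoning.
Insight: ExpVid is not only a diagnostic tool but also a roadmap for developing MLLMs that can serve as trustworthy partners in scientific experimentation.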
Abstract: Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
[246] EvoCAD: Evolutionary CAD Code Generation with Vision Language Models
Tobias Preintner,Weixuan Yuan,Adrian König,Thomas Bäck,Elena Raponi,Niki van Stein
Main category: cs.CV
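TL;DR: EvoCAD combines vision language models with evolutionary algorithms to generate symbolic representations of CAD objects. Evaluated with GPT-4V and GPT-4o, it outperforms existing methods.
Details
Motivation: The work combines the generative ability of large language models with the optimization strengths of evolutionary algorithms, proposing EvoCAD to generate high-quality CAD objects.
Contribution: 1. Proposes EvoCAD, which combines vision language models and evolutionary algorithms to generate CAD objects; 2. introduces two new metrics based on the Euler characteristic to assess semantic similarity; 3. shows clear performance gains on the CADPrompt benchmark.
Method: Samples multiple CAD objects and optimizes them with an evolutionary approach driven by vision language and reasoning language models; the method is assessed using GPT-4V and GPT-4o.
Result: EvoCAD outperforms existing methods at generating topologically correct objects, and the newly proposed metrics complement existing spatial metrics.
Insight: Combining the generative ability of language models with the optimization ability of evolutionary algorithms improves the quality of generated CAD objects, and topological metrics add a new dimension to evaluation.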
Abstract: Combining large language models with evolutionary computation algorithms represents a promising research direction leveraging the remarkable generative and in-context learning capabilities of LLMs with the strengths of evolutionary algorithms. In this work, we present EvoCAD, a method for generating computer-aided design (CAD) objects through their symbolic representations using vision language models and evolutionary optimization. Our method samples multiple CAD objects, which are then optimized using an evolutionary approach with vision language and reasoning language models. We assess our method using GPT-4V and GPT-4o, evaluating it on the CADPrompt benchmark dataset and comparing it to prior methods. Additionally, we introduce two new metrics based on topological properties defined by the Euler characteristic, which capture a form of semantic similarity between 3D objects. Our results demonstrate that EvoCAD outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, which can be efficiently evaluated using our two novel metrics that complement existing spatial metrics.
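The Euler characteristic that the two new metrics build on can be computed directly from a mesh, as sketched below. The abstract does not define the metrics themselves, so this only shows the underlying quantity, chi = V - E + F, for a triangle mesh; the function name and the example meshes are illustrative. For a closed orientable surface, chi = 2 - 2g, where g is the number of holes (genus), which is why it captures a coarse notion of shape.

```python
from typing import Iterable, Set, Tuple

Face = Tuple[int, int, int]


def euler_characteristic(num_vertices: int, faces: Iterable[Face]) -> int:
    """Return V - E + F for a triangle mesh given as vertex-index triples."""
    edges: Set[Tuple[int, int]] = set()
    num_faces = 0
    for a, b, c in faces:
        num_faces += 1
        # Store each edge with sorted endpoints so shared edges count once.
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return num_vertices - len(edges) + num_faces


# Closed surface (tetrahedron): V=4, E=6, F=4, so chi = 2 (sphere-like).
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
chi_closed = euler_characteristic(4, tetrahedron)

# Open surface (single triangle): V=3, E=3, F=1, so chi = 1 (disk-like).
chi_open = euler_characteristic(3, [(0, 1, 2)])
print(chi_closed, chi_open)
```

Comparing chi between a generated object and its reference would flag, for example, a missing through-hole even when the bounding volumes agree, which is consistent with the abstract's claim that topological metrics complement spatial ones.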
[247] NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection
Krittin Chaowakarn,Paramin Sangwongngam,Nang Htet Htet Aung,Chalie Charoenlarpnopparut
Main category: cs.CV
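TL;DR: NV3D strengthens 3D object detection with normal vectors. It extracts local features using KNN and PCA and proposes two sampling strategies that reduce the data volume, outperforming the baseline on the KITTI dataset.
Details
Motivation: Multi-modal methods face challenges in feature alignment, while purely local feature extraction can be too simplistic for complex 3D detection tasks, so a more effective approach is needed.
Contribution: Proposes the NV3D model, which extracts local features from normal vectors, and designs two sampling strategies and an element-wise attention fusion method, improving detection performance.
Method: Computes per-voxel normal vectors with KNN and PCA, applies normal vector density-based and FOV-aware bin-based sampling, and fuses features with an attention mechanism.
Result: On the KITTI validation set, NV3D exceeds the baseline Voxel R-CNN by 2.61% and 4.23% mAP in car and cyclist detection, respectively, and still outperforms the baseline in car detection after roughly 55% of the data is filtered out.
Insight: Normal vectors represent an object's spatial shape effectively, and the sampling strategies reduce data volume substantially while maintaining performance.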
Abstract: Recent studies in 3D object detection for autonomous vehicles aim to enrich features through the utilization of multi-modal setups or the extraction of local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and gaining features locally can be oversimplified for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which uses local features acquired from voxel neighbors in the form of normal vectors computed on a per-voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, or cyclists. During the normal vector extraction process, NV3D offers two distinct sampling strategies: normal vector density-based sampling and FOV-aware bin-based sampling, allowing elimination of up to 55% of data while maintaining performance. In addition, we apply element-wise attention fusion, which accepts voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and demonstrates superior performance in car and cyclist detection owing to their spatial shapes. On the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP), greater than the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both samplings, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.
[248] IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
Yinan Chen,Jiangning Zhang,Teng Hu,Yuxiang Zeng,Zhucun Xue,Qingdong He,Chengjie Wang,Yong Liu,Xiaobin Hu,Shuicheng Yan
Main category: cs.CV
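TL;DR: IVEBench is a modern benchmark suite designed for assessing instruction-guided video editing. It addresses the shortcomings of existing benchmarks in diversity, task coverage, and evaluation metrics.
Details
Motivation: Instruction-guided video editing is an emerging research direction, but existing benchmarks do not support its evaluation adequately: source diversity is low, task coverage is narrow, and evaluation metrics are incomplete.
Contribution: IVEBench provides a diverse database of 600 high-quality source videos spanning 7 semantic dimensions, covers 8 categories of editing tasks, and establishes a three-dimensional evaluation protocol.
Method: Editing prompts are generated and refined through large language models and expert review. Evaluation combines traditional metrics with multimodal large language model-based assessments of video quality, instruction compliance, and video fidelity.
Result: Experiments show that IVEBench benchmarks state-of-the-art instruction-guided video editing methods effectively and yields comprehensive, human-aligned evaluation outcomes.
Insight: IVEBench's diversity and multi-dimensional protocol make it a useful tool for standardized evaluation of instruction-guided video editing.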
Abstract: Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to support the evaluation of instruction-guided video editing adequately and further suffer from limited source diversity, narrow task coverage and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos, spanning seven semantic dimensions, and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.
[249] InfiniHuman: Infinite 3D Human Creation with Precise Control
Yuxuan Xue,Xianghui Xie,Margaret Kostyrko,Gerard Pons-Moll
Main category: cs.CV
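TL;DR: InfiniHuman is a framework that uses foundation models to generate effectively unlimited, diverse, and controllable 3D human data, removing the constraint that expensive data capture places on conventional approaches.
Details
Motivation: Producing 3D human data with conventional methods is costly and yields limited diversity. InfiniHuman aims to solve this in an automated, scalable way.
Contribution: Proposes InfiniHumanData, a pipeline that automatically generates a multi-modal dataset, and InfiniHumanGen, a generative model, enabling high-quality, controllable, and scalable 3D human generation.
Method: Uses vision-language and image generation models to create the dataset automatically (InfiniHumanData), and builds a diffusion-based generative pipeline on top of it (InfiniHumanGen).
Result: Generates 111K diverse 3D human identities; experiments show significant improvements in visual quality, generation speed, and controllability.
Insight: Existing foundation models can greatly reduce the cost of data generation while enabling high-quality, controllable 3D content generation.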
Abstract: Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. A user study shows that our automatically generated identities are indistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.
[250] FACE: Faithful Automatic Concept Extraction
Dipkamal Bhusal,Michael Clifford,Sara Rampazzi,Nidhi Rastogi
Main category: cs.CV
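TL;DR: FACE is a concept extraction framework with KL divergence regularization that makes explanations of deep learning models more faithful.
Details
Motivation: Existing automatic concept discovery methods do not align the extracted concepts well with the model's true decision-making process, so the explanations lack faithfulness.
Contribution: Proposes the FACE framework, which combines NMF with KL divergence regularization to keep concepts consistent with the model's predictions.
Method: Augments NMF with a KL divergence regularization term and incorporates classifier supervision during concept learning to improve predictive consistency.
Result: On ImageNet, COCO, and CelebA, FACE outperforms existing methods on faithfulness and sparsity metrics.
Insight: The theoretical analysis shows that minimizing the KL divergence bounds the deviation in predictive distributions, which promotes faithful local linearity in the concept space.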
Abstract: Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model’s true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model’s original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
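The claim that a small KL divergence limits how far two predictive distributions can drift apart can be illustrated with Pinsker's inequality, TV(p, q) <= sqrt(KL(p || q) / 2). This is a standard bound chosen here for illustration; the paper's own guarantee may take a different form. The two distributions below are hypothetical class probabilities from an original model and a concept-based reconstruction.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical predictions: original model vs. concept-based prediction.
p_model = [0.7, 0.2, 0.1]
p_concept = [0.6, 0.3, 0.1]

kl = kl_divergence(p_model, p_concept)
tv = total_variation(p_model, p_concept)
pinsker_bound = math.sqrt(kl / 2.0)
print(kl, tv, pinsker_bound)  # tv never exceeds pinsker_bound
```

Driving the KL term toward zero during concept learning therefore also forces the total variation between the two predictions toward zero.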
[251] Beyond ‘Templates’: Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View
Jinyu Zhang,Haitao Lin,Jiashu Hou,Xiangyang Xue,Yanwei Fu
Main category: cs.CV
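TL;DR: The paper proposes a category-agnostic framework that needs no templates or CAD models and simultaneously predicts an object's 6D pose, size, and dense shape from a single RGB-D image, with strong zero-shot generalization.
Details
Motivation: Existing methods depend on object-specific priors (such as CAD models or templates) or multi-stage pipelines, which limits generalization across categories. The work aims to remove these limitations for more flexible, general object understanding.
Contribution: 1. Proposes a unified category-agnostic framework; 2. fuses dense 2D features with partial 3D point clouds; 3. uses a Transformer encoder with a Mixture-of-Experts; 4. achieves strong zero-shot generalization.
Method: The model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, with parallel decoders for pose-size estimation and shape reconstruction.
Result: Reaches state-of-the-art accuracy on four benchmarks spanning over 300 categories, runs in real time at 28 FPS, and shows strong zero-shot generalization.
Insight: Dense features and the Mixture-of-Experts help address pose-shape entanglement, and training on synthetic data alone still generalizes strongly to real-world objects.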
Abstract: Estimating an object’s 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks (SOPE, ROPE, ObjaversePose, and HANDAL) spanning over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating remarkably strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.
[252] Bayesian Topological Convolutional Neural Nets
Sarah Harkins Dayton,Hayden Everett,Ioannis Schizas,David L. Boothe Jr.,Vasileios Maroulas
Main category: cs.CV
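TL;DR: The paper proposes a new Bayesian topological convolutional neural network (Bayesian Topological CNN) that combines topology-aware learning with Bayesian sampling. It targets three weaknesses of conventional CNNs: the need for large training sets, overconfident predictions, and poor uncertainty quantification.
Details
Motivation: Conventional CNNs need large amounts of training data, produce overconfident predictions, and quantify uncertainty poorly. Bayesian neural networks (BNNs) and topological CNNs do not fully solve these problems, so a more efficient and robust hybrid is needed.
Contribution: The main contribution is a new Bayesian topological CNN that combines topology-aware learning with Bayesian sampling to accelerate training and reduce calibration error. A consistency condition further adjusts the prior distributions and improves performance.
Method: Introduces Bayesian sampling into the CNN by placing prior distributions on network parameters and learning the posteriors, while using topological information to accelerate training. A consistency condition embedded in the learning cost further adjusts the priors.
Result: On benchmark image classification datasets, the method outperforms conventional CNNs, BNNs, and topological CNNs, especially when training data is limited or corrupted. It also quantifies uncertainty better than standard BNNs and more readily identifies unseen out-of-distribution data.
Insight: Combining topological learning with Bayesian methods improves the efficiency and robustness of CNNs and offers a more efficient, trustworthy approach to image classification.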
Abstract: Convolutional neural networks (CNNs) have been established as the main workhorse in image data processing; nonetheless, they require large amounts of data to train, often produce overconfident predictions, and frequently lack the ability to quantify the uncertainty of their predictions. To address these concerns, we propose a new Bayesian topological CNN that promotes a novel interplay between topology-aware learning and Bayesian sampling. Specifically, it utilizes information from important manifolds to accelerate training while reducing calibration error by placing prior distributions on network parameters and properly learning appropriate posteriors. One important contribution of our work is the inclusion of a consistency condition in the learning cost, which can effectively modify the prior distributions to improve the performance of our novel network architecture. We evaluate the model on benchmark image classification datasets and demonstrate its superiority over conventional CNNs, Bayesian neural networks (BNNs), and topological CNNs. In particular, we supply evidence that our method provides an advantage in situations where training data is limited or corrupted. Furthermore, we show that the new model allows for better uncertainty quantification than standard BNNs since it can more readily identify examples of out-of-distribution data on which it has not been trained. Our results highlight the potential of our novel hybrid approach for more efficient and robust image classification.
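The abstract speaks of reducing calibration error without naming a specific measure. A common choice is the expected calibration error (ECE), sketched below as an assumption: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the bin count, and the toy predictions are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean |confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need matching, non-empty inputs")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes last
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(mean_conf - accuracy)
    return ece


# Toy predictions: two very confident and correct, two unsure with one miss.
ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55], [True, True, True, False]
)
print(ece)
```

An overconfident network would show high confidence in bins where accuracy is much lower, which produces a large ECE.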
[253] DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Haoran Feng,Dizhe Zhang,Xiangtai Li,Bo Du,Lu Qi
Main category: cs.CV
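TL;DR: DiT360 is a DiT-based framework that generates high-quality panoramic images through hybrid training on perspective and panoramic data. Its core novelty is a set of inter-domain transformation and intra-domain augmentation modules, combined with image-level and token-level supervision, which improve boundary consistency and image fidelity.
Details
Motivation: Existing panoramic image generation methods fall short in geometric fidelity and photorealism because large-scale, high-quality, real-world panoramic data is lacking. DiT360 addresses this from a data-centric perspective rather than relying on model design alone.
Contribution: 1) Proposes a hybrid training framework that combines perspective and panoramic data; 2) introduces inter-domain transformation and intra-domain augmentation modules; 3) designs image-level and token-level supervision to improve generation quality.
Method: 1) Image level: injects cross-domain knowledge through perspective image guidance and panoramic refinement; 2) token level: applies hybrid supervision (circular padding, yaw loss, and cube loss); 3) experiments cover text-to-panorama, inpainting, and outpainting tasks.
Result: Across eleven quantitative metrics, DiT360 achieves better boundary consistency and image fidelity than the baselines.
Insight: A data-centric hybrid training strategy markedly improves panoramic image generation; cross-domain knowledge and token-level supervision are the key ingredients.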
Abstract: In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.
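Circular padding, mentioned above as the tool for boundary continuity, can be shown on a tiny grid. In an equirectangular panorama the left and right edges are neighbors on the sphere, so padding each row with columns wrapped around from the opposite side lets a local operator see across the seam. The sketch works on a plain list-of-lists; where exactly DiT360 applies the padding in its pipeline is not reproduced here, and the function name is illustrative.

```python
from typing import List, TypeVar

T = TypeVar("T")


def circular_pad_width(rows: List[List[T]], pad: int) -> List[List[T]]:
    """Pad each row on both sides with values wrapped from the other side."""
    if pad == 0:
        return [list(row) for row in rows]
    padded = []
    for row in rows:
        if not 0 < pad <= len(row):
            raise ValueError("pad must be between 1 and the row width")
        # Left padding comes from the right edge, right padding from the left.
        padded.append(row[-pad:] + row + row[:pad])
    return padded


# A 2x4 "panorama": columns 0 and 3 are adjacent on the sphere.
grid = [[1, 2, 3, 4], [5, 6, 7, 8]]
wrapped = circular_pad_width(grid, 1)
print(wrapped)
```

With zero padding instead, an operator centered on the first column would see artificial empty context on its left, which is one source of visible seams in generated panoramas.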
[254] Point Prompting: Counterfactual Tracking with Video Diffusion Models
Ayush Shrivastava,Sanyam Mehta,Daniel Geng,Andrew Owens
Main category: cs.CV
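TL;DR: This paper proposes a zero-shot point tracking method built on pretrained video diffusion models, which trace motion trajectories by visually marking points. Counterfactual generation keeps the marker visible, and the resulting tracks outperform those of prior zero-shot methods.
Details
Motivation: Trackers and video generators focus on motion analysis and motion synthesis respectively, two closely related tasks. This paper explores how pretrained video diffusion models can be used for zero-shot point tracking, filling a gap in this area. Contribution: (1) A zero-shot point tracking method based on visually marking points; (2) a counterfactual generation technique that keeps the marker visible; (3) experiments showing that the method outperforms existing zero-shot methods and is competitive with specialized self-supervised models.
Method: A distinctively colored marker is placed at the query point, and the rest of the video is regenerated from an intermediate noise level, which propagates the marker and traces the point's trajectory. To keep the marker visible during counterfactual generation, the unedited initial frame is used as a negative prompt.
Result: Across multiple image-conditioned video diffusion models, the method outperforms prior zero-shot methods, persists through occlusions, and often approaches the performance of specialized self-supervised models.
Insight: Video diffusion models show strong potential for zero-shot tracking, suggesting that motion synthesis can be used to solve analysis tasks and offering a new angle for cross-task model design.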
Abstract: Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point’s trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these “emergent” tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.
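The two ends of this pipeline, drawing the marker and reading it back, can be sketched without a diffusion model. The code below is a hypothetical illustration only: frames are nested lists of RGB tuples, the marker is a small red square, and the readout is a simple color threshold followed by a centroid. The abstract does not describe how the authors extract tracks from generated frames, so the `locate_marker` readout is an assumption.

```python
def place_marker(frame, x, y, color=(255, 0, 0), radius=1):
    """Return a copy of the frame with a square marker centered at (x, y)."""
    height, width = len(frame), len(frame[0])
    marked = [list(row) for row in frame]
    for yy in range(max(0, y - radius), min(height, y + radius + 1)):
        for xx in range(max(0, x - radius), min(width, x + radius + 1)):
            marked[yy][xx] = color
    return marked


def locate_marker(frame, color=(255, 0, 0), tol=30):
    """Return the (x, y) centroid of pixels close to `color`, or None."""
    xs, ys = [], []
    for yy, row in enumerate(frame):
        for xx, pixel in enumerate(row):
            if all(abs(p - c) <= tol for p, c in zip(pixel, color)):
                xs.append(xx)
                ys.append(yy)
    if not xs:
        return None  # marker not visible in this frame, e.g. occluded
    return (sum(xs) / len(xs), sum(ys) / len(ys))


if __name__ == "__main__":
    blank = [[(0, 0, 0) for _ in range(5)] for _ in range(5)]
    first_frame = place_marker(blank, x=2, y=3)
    print(locate_marker(first_frame))
```

In a real pipeline the frames passed to `locate_marker` would come from the regenerated video; returning `None` gives a natural way to flag frames where the point is hidden.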
[255] CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
Chengqi Duan,Kaiyue Sun,Rongyao Fang,Manyuan Zhang,Yan Feng,Ying Luo,Yufang Liu,Ke Wang,Peng Pei,Xunliang Cai,Hongsheng Li,Yi Ma,Xihui Liu
Main category: cs.CV
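TL;DR: CodePlot-CoT introduces a code-driven Chain-of-Thought paradigm that solves mathematical problems requiring visual assistance by generating executable plotting code and rendering it into “visual thought” images. The paper also presents Math-VR, the first large-scale bilingual dataset for this setting, and a specialized image-to-code converter.
Details
Motivation: Existing large language models and unified multimodal models are limited on mathematical problems that require visual assistance, in particular because they cannot generate precise, controllable images. Contribution: 1) CodePlot-CoT, a code-driven visual reasoning paradigm; 2) Math-VR, the first large-scale bilingual dataset and benchmark for such problems; 3) an image-to-code converter specialized for mathematical figures; 4) open-sourced datasets, code, and pretrained models.
Method: A vision-language model generates text reasoning together with executable plotting code, and the code is rendered into images that help solve the problem. The model is trained on data built from Math-VR with the specialized converter.
Result: CodePlot-CoT improves over the base model by up to 21% on the Math-VR benchmark.
Insight: Code-driven visual reasoning offers a new approach to complex mathematical problems, and the open-sourced resources should help advance multimodal mathematical reasoning.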
Abstract: Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for “thinking with images” in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as “visual thought”, to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into codes. Finally, using these training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
cs.AI [Back]
[256] The Geometry of Reasoning: Flowing Logics in Representation Space
Yufa Zhou,Yixiao Wang,Xunjian Yin,Shuyan Zhou,Anru R. Zhang
Main category: cs.AI
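TL;DR: This paper proposes a geometric framework that models the reasoning of large language models (LLMs) as flows in representation space, disentangles logical structure from semantics, and formalizes the analysis of reasoning through geometric quantities such as position, velocity, and curvature.
Details
Motivation: To study how language models “think” in representation space and to test whether they internalize logic beyond surface form. Contribution: 1. A geometric framework that models LLM reasoning as flows in representation space; 2. a theory in which logical statements act as local controllers of these flows; 3. experimental methods for visualizing and quantifying reasoning flows.
Method: Logic is separated from semantics by pairing the same natural deduction propositions with varied semantic carriers; reasoning flows are analyzed through geometric quantities; controlled experiments are designed using learned representation proxies.
Result: The study supports the view that LLM reasoning corresponds to smooth flows in representation space and that logical statements control the local velocity of these flows.
Insight: The geometric perspective provides new tools and an analytical basis for studying LLM reasoning behavior and interpretability.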
Abstract: We study how large language models (LLMs) “think” through their representation space. We propose a novel geometric framework that models an LLM’s reasoning as flows – embedding trajectories evolving where logic goes. We disentangle logical structure from semantics by employing the same natural deduction propositions with varied semantic carriers, allowing us to test whether LLMs internalize logic beyond surface form. This perspective connects reasoning with geometric quantities such as position, velocity, and curvature, enabling formal analysis in representation and concept spaces. Our theory establishes: (1) LLM reasoning corresponds to smooth flows in representation space, and (2) logical statements act as local controllers of these flows’ velocities. Using learned representation proxies, we design controlled experiments to visualize and quantify reasoning flows, providing empirical validation of our theoretical framework. Our work serves as both a conceptual foundation and practical tools for studying reasoning phenomena, offering a new lens for interpretability and formal analysis of LLMs’ behavior.
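For a discrete sequence of embeddings, velocity and curvature can be approximated with finite differences. The sketch below is one possible discretization, not the paper's definitions, which the abstract does not give: velocity is the difference between consecutive points, and curvature is the Menger curvature of three consecutive points, that is, the reciprocal of the radius of the circle through them. The function names are hypothetical.

```python
import math


def velocities(trajectory):
    """Finite-difference velocity between consecutive trajectory points."""
    return [
        [b_i - a_i for a_i, b_i in zip(a, b)]
        for a, b in zip(trajectory, trajectory[1:])
    ]


def menger_curvature(p, q, r):
    """Curvature of the circle through three points (0.0 if degenerate)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(p, r)
    if a == 0 or b == 0 or c == 0:
        return 0.0
    s = (a + b + c) / 2
    # Heron's formula; clamp at zero to absorb rounding error.
    area_sq = max(s * (s - a) * (s - b) * (s - c), 0.0)
    return 4 * math.sqrt(area_sq) / (a * b * c)


if __name__ == "__main__":
    straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    arc = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    print(velocities(straight))
    print(menger_curvature(*straight), menger_curvature(*arc))
```

Because Menger curvature depends only on pairwise distances, the same code applies unchanged to high-dimensional embedding vectors.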
[257] The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs
Xi Fang,Weijie Xu,Yuchong Zhang,Stephanie Eckman,Scott Nickleach,Chandan K. Reddy
Main category: cs.AI
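TL;DR: The paper examines how user memory affects emotional reasoning in LLMs and finds that different user profiles lead to systematically divergent emotional interpretations, with advantaged profiles receiving more accurate ones.
Details
Motivation: As personalized AI systems increasingly incorporate long-term user memory, understanding how that memory shapes LLM emotional reasoning is critical for avoiding the reinforcement of social inequalities. Contribution: Shows that LLM emotional reasoning is systematically biased, with advantaged user profiles receiving more accurate interpretations, which indicates that personalization mechanisms can embed social hierarchies.
Method: 15 LLMs are evaluated on human-validated emotional intelligence tests, and the effect of user profiles on emotional interpretation is analyzed.
Result: LLMs show significant disparities across demographic factors in emotion understanding and supportive recommendation tasks; personalization may reinforce social inequalities.
Insight: Designers of personalized AI systems should guard against user memory reinforcing social bias, and emotional reasoning needs to be made fairer.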
Abstract: When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion understanding and supportive recommendation tasks, indicating that personalization mechanisms can embed social hierarchies into models’ emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may inadvertently reinforce social inequalities.
[258] A Layered Intuition – Method Model with Scope Extension for LLM Reasoning
Hong Su
Main category: cs.AI
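TL;DR: The paper proposes a layered intuition-method model with scope extension to address LLM reasoning on unseen problems more systematically. Intuition supplies rapid first-reaction answers and methods decompose problems; vertical, horizontal, temporal, and spatial extensions build a knowledge network, and an entropy measure quantifies the diversity of extensions.
Details
Motivation: Existing LLM reasoning relies mainly on direct matrix mappings and struggles to address unseen problems systematically. This work aims to improve adaptability and reasoning by integrating an intuition-method layered model with scope extension. Contribution: 1. A layered intuition-method model that combines intuitive responses with method-based decomposition; 2. vertical, horizontal, temporal, and spatial extensions that build a knowledge network; 3. the entropy of method extension, which quantifies the diversity of extensions.
Method: 1. The intuition layer provides rapid answers, while the method layer decouples questions and solutions; 2. the reasoning scope is broadened through vertical (cause analysis), horizontal (parallel and generalized issues), temporal, and spatial extensions; 3. the extensions are organized into systematic knowledge trees that form a knowledge network.
Result: The paper argues that systematic extension improves an LLM's ability to solve unseen problems and that the entropy measure captures the diversity of extensions.
Insight: Scope extension, including the temporal and spatial dimensions, gives LLM reasoning broader coverage, and the entropy measure could be generalized to evaluate other extension methods.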
Abstract: Existing studies have introduced method-based reasoning and scope extension as approaches to enhance Large Language Model (LLM) performance beyond direct matrix mappings. Building on these foundations, this paper summarizes and integrates these ideas into a unified Intuition-Method Layered Model with Scope Extension, designed to address indirected (unseen) issues more systematically. In this framework, intuition-based thinking provides rapid first-reaction answers, while method-based thinking decouples questions and solutions into transferable reasoning units. Scope extension is then applied to broaden applicability, including vertical (cause analysis), horizontal (parallel and generalized issues), and for the first time, temporal and spatial extensions, which expand reasoning across time and contextual dimensions. These extensions are organized into systematic knowledge trees that interconnect into a knowledge network, thereby increasing adaptability. To quantitatively evaluate this process, we propose the entropy of method extension, which measures the independence and diversity of extensions as an indicator of the system’s capacity to solve unseen questions. By logically connecting existing approaches with new extensions and introducing an entropy-based evaluation framework, this work advances toward a more robust and extensible reasoning paradigm for LLMs in real-world problem-solving.
[259] Revisiting Model Interpolation for Efficient Reasoning
Taiqiang Wu,Runming Yang,Tao Liu,Jiahao Wang,Ngai Wong
Main category: cs.AI
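TL;DR: The paper systematically revisits the simplest model merging method, direct interpolation of two models' weights, shows that it follows a three-stage evolutionary paradigm, and offers a practical framework for balancing reasoning performance against cost.
Details
Motivation: Model interpolation is a simple but underexplored merging method; the authors revisit its performance and behavior in order to provide an efficient and effective reasoning solution. Contribution: 1. Reveals the three-stage evolutionary paradigm of model interpolation; 2. shows that a strategically interpolated model surpasses sophisticated model merging baselines in both efficiency and effectiveness; 3. provides a practical framework for crafting models with precisely targeted reasoning capabilities.
Method: The weights of two models are interpolated directly, the resulting behavior along the reasoning trajectory is analyzed systematically, and the findings are validated with ablation studies on model layers, modules, and decoding strategies.
Result: Experiments show that a strategically interpolated model outperforms sophisticated merging methods in both efficiency and effectiveness.
Insight: The dynamics of model interpolation offer a principled guide to the performance-cost trade-off, and a simple method can still deliver efficient reasoning.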
Abstract: Model merging, typically on Instruct and Thinking models, has shown remarkable performance for efficient reasoning. In this paper, we systematically revisit the simplest merging method that interpolates two weights directly. Particularly, we observe that model interpolation follows a three-stage evolutionary paradigm with distinct behaviors on the reasoning trajectory. These dynamics provide a principled guide for navigating the performance-cost trade-off. Empirical results demonstrate that a strategically interpolated model surprisingly surpasses sophisticated model merging baselines on both efficiency and effectiveness. We further validate our findings with extensive ablation studies on model layers, modules, and decoding strategies. Ultimately, this work demystifies model interpolation and offers a practical framework for crafting models with precisely targeted reasoning capabilities. Code is available at https://github.com/wutaiqiang/MI.
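Direct weight interpolation is a per-parameter convex combination, theta = (1 - lambda) * theta_A + lambda * theta_B. The sketch below shows this on toy state dictionaries that map parameter names to flat lists of floats. The dictionary layout, the function name, and the convention that lambda = 0 returns model A are assumptions for illustration; they are not taken from the authors' repository.

```python
def interpolate_weights(weights_a, weights_b, lam):
    """Return (1 - lam) * weights_a + lam * weights_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
    return merged


if __name__ == "__main__":
    instruct = {"layer0.weight": [0.0, 2.0]}   # toy stand-in for model A
    thinking = {"layer0.weight": [4.0, 2.0]}   # toy stand-in for model B
    for lam in (0.0, 0.25, 1.0):
        print(lam, interpolate_weights(instruct, thinking, lam))
```

Sweeping lambda over [0, 1] and evaluating each merged model is the natural way to observe how behavior changes along the interpolation path.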
cs.GR [Back]
[260] VLM-Guided Adaptive Negative Prompting for Creative Generation
Shelly Golan,Yotam Nitzan,Zongze Wu,Or Patashnik
Main category: cs.GR
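TL;DR: The paper proposes VLM-Guided Adaptive Negative Prompting, a training-free, inference-time method in which a vision-language model (VLM) analyzes intermediate outputs of the generation process and adaptively steers generation away from conventional visual concepts, promoting novel yet valid creative images. Creativity is evaluated through novelty and validity in the CLIP embedding space, and experiments show consistent gains in novelty with negligible computational overhead.
Details
Motivation: Text-to-image diffusion models struggle to generate genuinely novel content, and existing approaches to creative generation either rely on interpolating image features, which restricts exploration to predefined categories, or require time-intensive embedding optimization or model fine-tuning. A training-free, efficient way to steer models toward novel and valid creative content is therefore needed. Contribution: 1. A training-free, inference-time method in which a VLM adaptively produces negative prompts to guide creative image generation; 2. an evaluation of novelty and validity in the CLIP embedding space; 3. an extension to complex scenarios, including coherent sets of creative objects.
Method: A VLM analyzes intermediate outputs during generation and dynamically produces negative prompts that steer away from conventional visual concepts, encouraging novel content. The method integrates into existing diffusion pipelines without additional training.
Result: Experiments show that the method outperforms existing approaches in creative novelty with negligible computational overhead, and that it extends to creative generation in complex scenarios.
Insight: Steering generation with dynamic negative prompts can noticeably improve creative output without changing the model architecture, offering an efficient route for creative diffusion models.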
Abstract: Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in the unexplored spaces between familiar domains. While text-to-image diffusion models excel at rendering photorealistic scenes that faithfully match user prompts, they still struggle to generate genuinely novel content. Existing approaches to enhance generative creativity either rely on interpolation of image features, which restricts exploration to predefined categories, or require time-intensive procedures such as embedding optimization or model fine-tuning. We propose VLM-Guided Adaptive Negative-Prompting, a training-free, inference-time method that promotes creative image generation while preserving the validity of the generated object. Our approach utilizes a vision-language model (VLM) that analyzes intermediate outputs of the generation process and adaptively steers it away from conventional visual concepts, encouraging the emergence of novel and surprising outputs. We evaluate creativity through both novelty and validity, using statistical metrics in the CLIP embedding space. Through extensive experiments, we show consistent gains in creative novelty with negligible computational overhead. Moreover, unlike existing methods that primarily generate single objects, our approach extends to complex scenarios, such as generating coherent sets of creative objects and preserving creativity within elaborate compositional prompts. Our method integrates seamlessly into existing diffusion pipelines, offering a practical route to producing creative outputs that venture beyond the constraints of textual descriptions.
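Negative prompting in diffusion pipelines is commonly implemented by substituting the negative-prompt prediction for the unconditional branch of classifier-free guidance, so that the combined prediction is eps_neg + s * (eps_pos - eps_neg). The sketch below shows that combination on plain lists, plus a hypothetical helper that grows the negative prompt as new conventional concepts are reported. The VLM query itself is not modeled, and nothing here is taken from the authors' code.

```python
def guided_noise(eps_positive, eps_negative, scale):
    """Combine positive and negative predictions, pushing away from the negative."""
    if len(eps_positive) != len(eps_negative):
        raise ValueError("predictions must have the same length")
    return [n + scale * (p - n) for p, n in zip(eps_positive, eps_negative)]


def update_negative_prompt(negatives, detected_concept):
    """Append a newly detected conventional concept, skipping duplicates."""
    if detected_concept and detected_concept not in negatives:
        return negatives + [detected_concept]
    return list(negatives)


if __name__ == "__main__":
    negatives = []
    # Hypothetical concepts a VLM might report at successive denoising steps.
    for concept in ["tabby cat", "tabby cat", "golden retriever"]:
        negatives = update_negative_prompt(negatives, concept)
    print(", ".join(negatives))
    print(guided_noise([1.0, 2.0], [0.0, 2.0], scale=3.0))
```

With a scale of 1 the combination reduces to the positive prediction alone, so the scale controls how strongly generation is pushed away from the accumulated negative concepts.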
eess.IV [Back]
[261] Generative Latent Video Compression
Zongyu Guo,Zhaoyang Jia,Jiahao Li,Xiaoyi Zhang,Bin Li,Yan Lu
Main category: eess.IV
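TL;DR: GLVC is a generative latent video compression framework that uses a pretrained continuous tokenizer to project video frames into a perceptually aligned latent space, improving the balance between rate-distortion and perception. It performs strongly on multiple benchmarks, and a user study indicates that it maintains stable temporal coherence while rivaling the latest neural video codecs at nearly half their rate.
Details
Motivation: Balancing rate-distortion against perceptual quality is a key challenge in video compression, especially the flickering caused by frame-wise quality fluctuations. GLVC addresses this by drawing on the strengths of latent generative models. Contribution: 1. The GLVC framework, which uses a pretrained continuous tokenizer to map video frames into a latent space; 2. a codec architecture redesigned for the latent domain, with unified intra/inter coding and a recurrent memory mechanism; 3. state-of-the-art performance on multiple benchmarks.
Method: 1. Project frames into the latent space with a pretrained continuous tokenizer; 2. design the codec architecture for the latent domain; 3. add unified intra/inter coding and a recurrent memory mechanism.
Result: GLVC achieves state-of-the-art DISTS and LPIPS scores, and the user study indicates that it rivals the latest neural video codecs at nearly half their rate.
Insight: Latent generative models can balance perceptual quality and compression performance in video coding; the unified design and the recurrent mechanism are key to temporal coherence.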
Abstract: Perceptual optimization is widely recognized as essential for neural compression, yet balancing the rate-distortion-perception tradeoff remains challenging. This difficulty is especially pronounced in video compression, where frame-wise quality fluctuations often cause perceptually optimized neural video codecs to suffer from flickering artifacts. In this paper, inspired by the success of latent generative models, we present Generative Latent Video Compression (GLVC), an effective framework for perceptual video compression. GLVC employs a pretrained continuous tokenizer to project video frames into a perceptually aligned latent space, thereby offloading perceptual constraints from the rate-distortion optimization. We redesign the codec architecture explicitly for the latent domain, drawing on extensive insights from prior neural video codecs, and further equip it with innovations such as unified intra/inter coding and a recurrent memory mechanism. Experimental results across multiple benchmarks show that GLVC achieves state-of-the-art performance in terms of DISTS and LPIPS metrics. Notably, our user study confirms GLVC rivals the latest neural video codecs at nearly half their rate while maintaining stable temporal coherence, marking a step toward practical perceptual video compression.
[262] Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework
Shanzhi Yin,Bolin Chen,Xinju Wu,Ru-Ling Liao,Jie Chen,Shiqi Wang,Yan Ye
Main category: eess.IV
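TL;DR: This paper proposes an efficient 3D human avatar compression framework that uses compact human priors and a canonical-to-target transformation to achieve high-quality 3D human avatar video compression at ultra-low bit rates.
Details
Motivation: Existing 3D human avatar compression methods fall short in bit-rate efficiency and leave redundancy in the modeling, which limits the adoption of immersive multimedia experiences. Contribution: 1) Network-free training of a canonical Gaussian avatar; 2) a human-prior template that captures temporal motion; 3) efficient compression that separates appearance from motion parameters.
Method: 1) Train a canonical Gaussian avatar with articulated splatting; 2) represent temporal motion with compact parameters; 3) generate the target avatar through a Linear Blend Skinning transformation.
Result: Experiments show that the method significantly outperforms conventional 2D/3D codecs and existing dynamic 3D Gaussian splatting compression methods on mainstream multi-view human video datasets.
Insight: Separating appearance from temporal motion is key to efficient 3D compression and offers a workable path toward immersive experiences in metaverse applications.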
Abstract: This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming the canonical avatar via a Linear Blend Skinning transformation, facilitating temporally coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression methods in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in metaverse applications.
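Linear Blend Skinning moves each canonical point by a weighted sum of per-joint rigid transforms, v' = sum_j w_j (R_j v + t_j), with weights that sum to one. The sketch below applies this to a single 3D point using plain lists. It is a generic illustration of the standard formula rather than the paper's pipeline; how the 94 per-frame parameters map to joint transforms is not stated in the abstract and is not modeled here.

```python
def linear_blend_skinning(point, weights, rotations, translations):
    """Deform one canonical 3D point by blending per-joint rigid transforms."""
    if not (len(weights) == len(rotations) == len(translations)):
        raise ValueError("need one weight, rotation, and translation per joint")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    deformed = [0.0, 0.0, 0.0]
    for w, rot, trans in zip(weights, rotations, translations):
        # Rigid transform of the point by this joint: R @ v + t.
        moved = [
            sum(rot[i][k] * point[k] for k in range(3)) + trans[i]
            for i in range(3)
        ]
        for i in range(3):
            deformed[i] += w * moved[i]
    return deformed


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    result = linear_blend_skinning(
        point=[1.0, 1.0, 1.0],
        weights=[0.5, 0.5],
        rotations=[identity, identity],
        translations=[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    )
    print(result)
```

In a Gaussian avatar the same transform would be applied to every Gaussian center, which is why only the per-frame pose parameters need to be transmitted once the canonical avatar is shared.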
cs.CY [Back]
[263] Stop DDoS Attacking the Research Community with AI-Generated Survey Papers
Jianghao Lin,Rong Shan,Jiachen Zhu,Yunjia Xi,Yong Yu,Weinan Zhang
Main category: cs.CY
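TL;DR: This position paper argues that the flood of AI-generated survey papers threatens the research community in a way that resembles a DDoS attack, and it calls for norms that protect the quality of the scientific record.
Details
Motivation: AI-generated survey papers are proliferating, filling preprint platforms with low-quality, redundant, or even hallucinated content and eroding the research community's trust and efficiency. Contribution: Introduces the concept of the “survey paper DDoS attack”, calls for norms on AI-assisted review writing, and proposes Dynamic Live Surveys, community-maintained repositories that are updated continuously.
Method: The argument is supported by quantitative trend analysis, quality audits, and a discussion of cultural impact.
Result: The paper makes the case that safeguarding the quality of survey papers is necessary and proposes remedies.
Insight: Misuse of AI tools can have lasting negative effects on scholarship; transparency and expert oversight are urgently needed.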
Abstract: Survey papers are foundational to the scholarly progress of research communities, offering structured overviews that guide both novices and experts across disciplines. However, the recent surge of AI-generated surveys, especially enabled by large language models (LLMs), has transformed this traditionally labor-intensive genre into a low-effort, high-volume output. While such automation lowers entry barriers, it also introduces a critical threat: the phenomenon we term the “survey paper DDoS attack” to the research community. This refers to the unchecked proliferation of superficially comprehensive but often redundant, low-quality, or even hallucinated survey manuscripts, which floods preprint platforms, overwhelms researchers, and erodes trust in the scientific record. In this position paper, we argue that we must stop uploading massive amounts of AI-generated survey papers (i.e., survey paper DDoS attack) to the research community, by instituting strong norms for AI-assisted review writing. We call for restoring expert oversight and transparency in AI usage and, moreover, developing new infrastructures such as Dynamic Live Surveys, community-maintained, version-controlled repositories that blend automated updates with human curation. Through quantitative trend analysis, quality audits, and cultural impact discussion, we show that safeguarding the integrity of surveys is no longer optional but imperative to the research community.
cs.CR [Back]
[264] ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Guan-Yan Yang,Tzu-Yu Cheng,Ya-Wen Teng,Farn Wanga,Kuo-Hui Yeh
Main category: cs.CR
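TL;DR: The paper proposes ArtPerception, a framework that uses ASCII art to bypass LLM safety measures through a two-phase method designed for efficiency.
Details
Motivation: Existing LLM safety alignment relies mainly on semantic interpretation and is therefore vulnerable to attacks that use non-standard data representations; this gap needs to be addressed. Contribution: 1. ArtPerception, a novel black-box jailbreak framework based on ASCII art; 2. a pre-test that determines model-specific parameters so that the attack is efficient; 3. the Modified Levenshtein Distance (MLD) evaluation metric.
Method: 1. Phase 1 runs a one-time, model-specific pre-test of ASCII art recognition; 2. Phase 2 uses the results to launch a one-shot attack.
Result: The approach is demonstrated on four SOTA open-source LLMs, transfers to leading commercial models, and is analyzed against defenses such as LLaMA Guard and Azure's content filters.
Insight: LLM security requires defending against a multi-modal space of interpretations; vulnerabilities can exist even with text-only inputs.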
Abstract: The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM’s recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework’s real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude Sonnet 3.7, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure’s content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.
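The recognition metric builds on Levenshtein distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The abstract does not specify how the distance is modified, so the sketch below shows only the standard, unmodified algorithm as background, for instance to compare a model's transcription of a rendered word against the intended word.

```python
def levenshtein(a, b):
    """Standard edit distance using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("HELLO", "HELL0"))
```

A distance of zero means the model recognized the string exactly, while larger values indicate progressively worse recognition.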
[265] Bag of Tricks for Subverting Reasoning-based Safety Guardrails
Shuo Chen,Zhen Han,Haokun Chen,Bailan He,Shengyun Si,Jingpei Wu,Philip Torr,Volker Tresp,Jindong Gu
Main category: cs.CR
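TL;DR: The paper exposes the fragility of reasoning-based safety guardrails, proposes a set of attacks that bypass them, shows high attack success rates across several LRMs, and stresses the urgency of stronger alignment techniques.
Details
Motivation: Although reasoning-based safety guardrails such as deliberative alignment show strong defense in Large Reasoning Models (LRMs), they are extremely vulnerable to subtle manipulation of the input prompt, which can lead to even more harmful results. Contribution: (1) Identifies the fragility of reasoning-based guardrails; (2) proposes a set of methods that bypass them, covering white-, gray-, and black-box settings; (3) verifies high attack success rates (above 90%) on multiple LRMs and open-sources the code.
Method: (1) Bypass the guardrails by manipulating a few template tokens; (2) propose a range of attacks, from manual template modification to fully automated optimization; (3) evaluate the attacks on multiple LRMs and API services.
Result: The attacks exceed a 90% success rate across 5 benchmarks on the gpt-oss series, and evaluations on other open-source LRMs indicate that the vulnerabilities are systemic.
Insight: The paper shows that LRM safety guardrails are fragile and stresses the need for stronger alignment techniques to prevent malicious misuse, especially for open-source models.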
Abstract: Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs’ reasoning ability, these guardrails help the models to assess the safety of user inputs before generating final responses. The powerful reasoning ability can analyze the intention of the input query and will refuse to assist once it detects the harmful intent hidden by the jailbreak methods. Such guardrails have shown a significant boost in defense, such as the near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts, and once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can successfully bypass the seemingly powerful guardrails and lead to explicit and harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert the reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Along with the potential for scalable implementation, these methods also achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on gpt-oss series on both local host models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-sourced LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.
[266] SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
Zonghao Ying,Yangguang Shao,Jianle Gan,Gan Xu,Junjie Shen,Wenxin Zhang,Quanchen Zou,Junzheng Shi,Zhenfei Yin,Mingchuan Zhang,Aishan Liu,Xianglong Liu
Main category: cs.CR
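TL;DR: SecureWebArena is a holistic security evaluation benchmark for web agents built on large vision-language models (LVLMs), addressing the limited coverage and attack-vector diversity of existing benchmarks.
Details
Motivation: Existing security benchmarks cover only narrow scenarios, such as user-level prompt manipulation, and fail to capture the full range of LVLM web agent vulnerabilities, so a more comprehensive evaluation is needed. Contribution: 1. The first holistic security benchmark for LVLM-based web agents, covering multiple attack vectors; 2. diverse simulated web environments and high-quality task trajectories; 3. a multi-layered evaluation protocol that analyzes an agent's internal reasoning, behavioral trajectory, and task outcome.
Method: 1. Build six simulated web environments and 2,970 task trajectories; 2. define a taxonomy of six attack vectors spanning user-level and environment-level manipulations; 3. apply a three-layer evaluation protocol (reasoning, behavior, outcome).
Result: Experiments show that all tested LVLM agents are vulnerable to subtle adversarial manipulations and reveal trade-offs between model specialization and security.
Insight: 1. Security problems are widespread among LVLM agents; 2. specialized models are not always more secure; 3. multi-dimensional evaluation exposes agent vulnerabilities more precisely.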
Abstract: Large vision-language model (LVLM)-based web agents are emerging as powerful tools for automating complex online tasks. However, when deployed in real-world environments, they face serious security risks, motivating the design of security evaluation benchmarks. Existing benchmarks provide only partial coverage, typically restricted to narrow scenarios such as user-level prompt manipulation, and thus fail to capture the broad range of agent vulnerabilities. To address this gap, we present SecureWebArena, the first holistic benchmark for evaluating the security of LVLM-based web agents. SecureWebArena first introduces a unified evaluation suite comprising six simulated but realistic web environments (e.g., e-commerce platforms, community forums) and includes 2,970 high-quality trajectories spanning diverse tasks and attack settings. The suite defines a structured taxonomy of six attack vectors spanning both user-level and environment-level manipulations. In addition, we introduce a multi-layered evaluation protocol that analyzes agent failures across three critical dimensions: internal reasoning, behavioral trajectory, and task outcome, facilitating a fine-grained risk analysis that goes far beyond simple success metrics. Using this benchmark, we conduct large-scale experiments on 9 representative LVLMs, which fall into three categories: general-purpose, agent-specialized, and GUI-grounded. Our results show that all tested agents are consistently vulnerable to subtle adversarial manipulations and reveal critical trade-offs between model specialization and security. By providing (1) a comprehensive benchmark suite with diverse environments and a multi-layered evaluation pipeline, and (2) empirical insights into the security challenges of modern LVLM-based web agents, SecureWebArena establishes a foundation for advancing trustworthy web agent deployment.
cs.RO [Back]
[267] Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback
Shaokai Wu,Yanbiao Ji,Qiuchang Li,Zhiyi Zhang,Qichen He,Wenyuan Xie,Guodong Zhang,Bayram Bayramli,Yue Ding,Hongtao Lu
Main category: cs.RO
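TL;DR: Dejavu is a post-deployment learning framework that augments a frozen Vision-Language-Action (VLA) policy with an Experience Feedback Network (EFN), letting embodied agents condition action prediction on retrieved execution memories and keep learning after deployment.
Details
Motivation: Once deployed, embodied agents cannot update their knowledge to improve task performance, which limits their adaptability and robustness. Contribution: Dejavu, a general post-deployment learning framework that combines an EFN with a frozen VLA policy so that agents can keep improving their behavior through experience feedback.
Method: The EFN retrieves successful prior action experiences and is trained with reinforcement learning using semantic similarity rewards, so that predicted actions align with past successful behaviors.
Result: Experiments show that the EFN significantly improves adaptability, robustness, and task success rates.
Insight: The work offers a workable path toward continual post-deployment learning for embodied agents.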
Abstract: Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire new useful knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework called Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN automatically identifies contextually successful prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards on EFN to ensure that the predicted actions align with past successful behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit “learning from experience” despite fixed weights. Experiments across diverse embodied tasks show that EFN significantly improves adaptability, robustness, and success rates over frozen baselines. These results highlight a promising path toward embodied agents that continually refine their behavior after deployment.
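The memory side of this idea can be sketched as nearest-neighbor retrieval over stored trajectories. The code below is a hypothetical, simplified stand-in: each memory holds an observation embedding, the action taken, and a success flag, and retrieval returns the most similar successful entries by cosine similarity. The EFN's learned retrieval and its reinforcement-learning training are not modeled, and the record layout is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def retrieve(memory, query, k=1):
    """Return the k successful memories most similar to the query embedding."""
    successful = [m for m in memory if m["success"]]
    ranked = sorted(
        successful,
        key=lambda m: cosine(m["embedding"], query),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    memory = [
        {"embedding": [1.0, 0.0], "action": "grasp", "success": True},
        {"embedding": [0.0, 1.0], "action": "push", "success": True},
        {"embedding": [1.0, 0.0], "action": "drop", "success": False},
    ]
    print(retrieve(memory, query=[0.9, 0.1], k=1))
    # New trajectories are appended during deployment while weights stay fixed.
    memory.append({"embedding": [0.5, 0.5], "action": "slide", "success": True})
```

Appending to `memory` at deployment time is what lets behavior change without any weight update, which mirrors the "learning from experience" framing in the abstract.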
[268] X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng,Jianxiong Li,Zhihao Wang,Dongxiu Liu,Xirui Kang,Yuchun Feng,Yinan Zheng,Jiayin Zou,Yilun Chen,Jia Zeng,Ya-Qin Zhang,Jiangmiao Pang,Jingjing Liu,Tai Wang,Xianyuan Zhan
Main category: cs.RO
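TL;DR: X-VLA is a soft-prompted Transformer model for cross-embodiment Vision-Language-Action (VLA) tasks that trains effectively on heterogeneous data while adding minimal parameters.
Details
Motivation: Generalist VLA models need to exploit diverse robot datasets effectively, but the heterogeneity of cross-embodiment data limits their generality and adaptability. Contribution: X-VLA, which uses soft prompts, implemented as separate learnable embeddings for each data source, to exploit diverse robot data while keeping the model simple and scalable.
Method: Each data source is assigned learnable embeddings that serve as embodiment-specific prompts, combined with a flow-matching-based Transformer architecture.
Result: Across 6 simulations and 3 real-world robots, X-VLA-0.9B achieves SOTA performance on a range of benchmarks and shows strong adaptability and flexibility.
Insight: Soft prompts can address the heterogeneity of cross-embodiment data and suggest a new direction for building generalist VLA models.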
Abstract: Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which in unity empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
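The soft-prompt idea amounts to keeping a small table of learnable vectors per data source and prepending the matching vectors to the token sequence before it enters the Transformer. The sketch below shows only that bookkeeping with plain lists; the source names, prompt length, and initialization scale are hypothetical, and no training or flow-matching head is included.

```python
import random


def init_soft_prompts(sources, prompt_len, dim, seed=0):
    """Create one table of small random prompt vectors per data source."""
    rng = random.Random(seed)
    return {
        source: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                 for _ in range(prompt_len)]
        for source in sources
    }


def prepend_prompt(prompts, source, tokens):
    """Prepend the embodiment-specific prompt vectors to a token sequence."""
    if source not in prompts:
        raise KeyError(f"no soft prompt registered for source {source!r}")
    return prompts[source] + list(tokens)


if __name__ == "__main__":
    # "robot_a" and "robot_b" are placeholder names for two data sources.
    prompts = init_soft_prompts(["robot_a", "robot_b"], prompt_len=2, dim=4)
    observation_tokens = [[0.0] * 4 for _ in range(3)]
    sequence = prepend_prompt(prompts, "robot_a", observation_tokens)
    print(len(sequence), "tokens enter the encoder")
```

The added parameter count is `len(sources) * prompt_len * dim`, which is why the approach adds few parameters relative to the shared backbone.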
[269] SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
Zhuoheng Gao,Jiyao Zhang,Zhiyong Xie,Hao Dong,Zhaofei Yu,Rongmei Chen,Guozhang Chen,Tiejun Huang
Main category: cs.RO
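TL;DR: SpikeGrasp is a neuro-inspired 6-DoF grasp pose detection framework that processes raw spike streams from stereo spike cameras directly, avoids explicit 3D point cloud reconstruction, and outperforms traditional methods in cluttered and textureless scenes.
Details
Motivation: Existing robotic grasping systems rely on explicit 3D point clouds, which differs from how biological vision works. The authors explore a new paradigm that is closer to biological visual processing. Contribution: The SpikeGrasp framework, which processes spike streams directly without point cloud reconstruction; a large-scale synthetic benchmark dataset; and evidence of the method's advantages and data efficiency.
Method: Stereo spike streams are fused, and a recurrent spiking neural network iteratively refines grasp pose hypotheses without reconstructing a point cloud.
Result: SpikeGrasp surpasses traditional point-cloud-based baselines in cluttered and textureless scenes and shows strong data efficiency.
Insight: Mimicking the biological visuomotor pathway may enable more efficient and fluid robotic manipulation, particularly for dynamic objects.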
Abstract: Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.
[270] Into the Unknown: Towards using Generative Models for Sampling Priors of Environment Uncertainty for Planning in Configuration Spaces
Subhransu S. Bhattacharjee,Hao Lu,Dylan Campbell,Rahul Shome
Main category: cs.RO
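TL;DR: This paper proposes a sampling-based pipeline that uses pretrained generative models to produce, in a zero-shot manner, probabilistic priors that capture environmental uncertainty and spatio-semantic relationships for configuration-space planning. On a Matterport3D benchmark, the approach recovers occupancy and target-location uncertainty in unobserved regions.
Details
Motivation: Priors are vital for planning under partial observability but hard to obtain in practice. To address this, the paper uses generative models to produce priors that carry spatio-semantic information automatically.
Contribution: 1. A generative-model-based sampling pipeline that produces probabilistic priors zero-shot; 2. RGB-D point cloud samples formulated to be directly usable in configuration-space planning; 3. Validation on a Matterport3D benchmark.
Method: 1. Pretrained generative models produce complete RGB-D point cloud samples with occupancy and target semantics; 2. partial observations serve as the conditioning input; 3. the sampling pipeline yields diverse, clean 3D point clouds for motion planning.
Result: Experiments show that the generated priors recover commonsense spatial semantics consistent with ground truth and that the resulting 3D point clouds are usable in motion planning, which highlights the potential of generative models for robotic planning.
Insight: Generative models can serve as a rich source of priors, giving robotic planning diverse and effective representations of environmental uncertainty.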
Abstract: Priors are vital for planning under partial observability, yet difficult to obtain in practice. We present a sampling-based pipeline that leverages large-scale pretrained generative models to produce probabilistic priors capturing environmental uncertainty and spatio-semantic relationships in a zero-shot manner. Conditioned on partial observations, the pipeline recovers complete RGB-D point cloud samples with occupancy and target semantics, formulated to be directly useful in configuration-space planning. We establish a Matterport3D benchmark of rooms partially visible through doorways, where a robot must navigate to an unobserved target object. Effective priors for this setting must represent both occupancy and target-location uncertainty in unobserved regions. Experiments show that our approach recovers commonsense spatial semantics consistent with ground truth, yielding diverse, clean 3D point clouds usable in motion planning, highlighting the promise of generative models as a rich source of priors for robotic planning.
[271] SCOOP’D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy
Kuanning Wang,Yongchong Gu,Yuqian Fu,Zeyu Shangguan,Sicheng He,Xiangyang Xue,Yanwei Fu,Daniel Seita
Main category: cs.RO
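TL;DR: The paper proposes SCOOP’D, which learns to scoop mixed liquid-solid materials with a sim-to-real generative policy and shows promising zero-shot performance across diverse scenarios.
Details
Motivation: Autonomous robotic scooping is broadly useful in daily life and disaster response, but developing a general policy is very challenging because of complex tool-object interactions and the complicated dynamics of deformable materials such as granular media or liquids.
Contribution: Proposes SCOOP’D, which generates demonstrations in simulation (OmniGibson) and trains a diffusion-based generative policy to imitate them from observational input, enabling zero-shot deployment in diverse real-world scenarios.
Method: Demonstrations are collected in OmniGibson by algorithmic procedures that rely on privileged state information; a diffusion model learns the generative policy; the learned policy is applied directly in real-world scenarios.
Result: Across 465 trials in diverse scenarios, SCOOP’D outperforms all baselines and handles scooping tasks of different difficulty levels.
Insight: Simulation data combined with generative policies can effectively address the manipulation of complex deformable materials, offering a new approach to learning difficult robot tasks.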
Abstract: Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP’D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as “Level 1” and “Level 2.” SCOOP’D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page is at https://scoopdiff.github.io/.
cs.CE
[272] Comparative Evaluation of Neural Network Architectures for Generalizable Human Spatial Preference Prediction in Unseen Built Environments
Maral Doctorarastoo,Katherine A. Flanigan,Mario Bergés,Christopher McComb
Main category: cs.CE
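TL;DR: Using synthetic data, this paper compares graph neural networks (GNNs), convolutional neural networks (CNNs), and feedforward neural networks (FFNNs) to study how well each architecture generalizes when predicting human spatial preferences in unseen built environments.
Details
Motivation: Predicting human spatial preferences in built environments is essential for developing Cyber-Physical-Social Infrastructure Systems (CPSIS), but how well existing models generalize to unseen environments remains unclear.
Contribution: 1) Compares the generalization of GNNs, CNNs, and FFNNs on unseen spatial layouts; 2) proposes a generalizability score based on the area under the precision-recall curve (AUC-PR).
Method: 1) Generate data from a synthetic pocket park environment; 2) train and evaluate GNN, CNN, and FFNN models; 3) compute the generalizability score from AUC-PR.
Result: GNNs perform best at predicting human spatial preferences, with notably stronger generalization in unseen environments.
Insight: GNNs capture spatial and contextual dependencies better, which makes them a strong choice for predicting human preferences.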
Abstract: The capacity to predict human spatial preferences within built environments is instrumental for developing Cyber-Physical-Social Infrastructure Systems (CPSIS). A significant challenge in this domain is the generalizability of preference models, particularly their efficacy in predicting preferences within environmental configurations not encountered during training. While deep learning models have shown promise in learning complex spatial and contextual dependencies, it remains unclear which neural network architectures are most effective at generalizing to unseen layouts. To address this, we conduct a comparative study of Graph Neural Networks, Convolutional Neural Networks, and standard feedforward Neural Networks using synthetic data generated from a simplified pocket park environment. Beginning with this illustrative case study allows for controlled analysis of each model’s ability to transfer learned preference patterns to unseen spatial scenarios. The models are evaluated based on their capacity to predict preferences influenced by heterogeneous physical, environmental, and social features. The generalizability score is calculated using the area under the precision-recall curve for the seen and unseen layouts. This score is appropriate for imbalanced data and provides insight into the suitability of each neural network architecture for preference-aware human behavior modeling in unseen built environments.
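Illustrative sketch (not from the paper): the abstract says the generalizability score is computed from the area under the precision-recall curve on seen and unseen layouts, but does not give the formula. The snippet below assumes, purely for illustration, that the score is the ratio of unseen to seen AUC-PR, with AUC-PR approximated step-wise as average precision. The function names `average_precision` and `generalizability_score` are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 0/1 ground-truth preferences; scores: model confidences.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


def generalizability_score(seen, unseen):
    """ASSUMED definition: AUC-PR on unseen layouts divided by AUC-PR on seen ones.

    seen / unseen: (labels, scores) tuples. A value near 1 means little degradation.
    """
    return average_precision(*unseen) / average_precision(*seen)


seen = ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])    # perfectly ranked
unseen = ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # one negative ranked too high
print(generalizability_score(seen, unseen))
```

Average precision is used here because, as the abstract notes, precision-recall summaries remain informative when positives are rare.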
cs.IR
[273] CardRewriter: Leveraging Knowledge Cards for Long-Tail Query Rewriting on Short-Video Platforms
Peiyuan Gong,Feiran Zhu,Yaqi Yin,Chenglei Dai,Chao Zhang,Kai Zheng,Wentian Bao,Jiaxin Mao,Yi Zhang
Main category: cs.IR
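TL;DR: CardRewriter is an LLM-based framework that integrates domain-specific knowledge cards to improve long-tail query rewriting on short-video platforms. It improves rewriting quality and user-facing metrics and has been deployed on Kuaishou.
Details
Motivation: User queries on short-video platforms, especially long-tail ones, often suffer from spelling errors, incomplete phrasing, or ambiguous intent, so retrieved results miss user expectations. Existing LLMs perform poorly on proprietary content (short videos, live streams, and so on), which motivates a rewriting method that incorporates domain knowledge.
Contribution: Proposes CardRewriter, which aggregates multi-source knowledge and generates knowledge cards that guide the LLM toward user intent; designs a two-stage training pipeline (supervised fine-tuning followed by group relative policy optimization) with a tailored reward that balances query relevance and retrieval effectiveness.
Method: 1. Aggregate multi-source knowledge for each query and summarize it into a knowledge card; 2. use the card to guide the LLM's query rewriting; 3. train in two stages with supervised fine-tuning and group relative policy optimization.
Result: Offline experiments show that CardRewriter substantially improves rewriting quality for queries targeting proprietary content; online A/B tests confirm significant gains in long-view rate (LVR) and click-through rate (CTR), along with a notable reduction in initiative query reformulation rate (IQRR).
Insight: Introducing domain-specific knowledge compensates for LLM weaknesses on proprietary content; the staged training pipeline and the tailored reward are key to practical rewriting quality.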
Abstract: Short-video platforms have rapidly become a new generation of information retrieval systems, where users formulate queries to access desired videos. However, user queries, especially long-tail ones, often suffer from spelling errors, incomplete phrasing, and ambiguous intent, resulting in mismatches between user expectations and retrieved results. While large language models (LLMs) have shown success in long-tail query rewriting within e-commerce, they struggle on short-video platforms, where proprietary content such as short videos, live streams, micro dramas, and user social networks falls outside their training distribution. To address this challenge, we introduce CardRewriter, an LLM-based framework that incorporates domain-specific knowledge to enhance long-tail query rewriting. For each query, our method aggregates multi-source knowledge relevant to the query and summarizes it into an informative and query-relevant knowledge card. This card then guides the LLM to better capture user intent and produce more effective query rewrites. We optimize CardRewriter using a two-stage training pipeline: supervised fine-tuning followed by group relative policy optimization, with a tailored reward system balancing query relevance and retrieval effectiveness. Offline experiments show that CardRewriter substantially improves rewriting quality for queries targeting proprietary content. Online A/B testing further confirms significant gains in long-view rate (LVR) and click-through rate (CTR), along with a notable reduction in initiative query reformulation rate (IQRR). Since September 2025, CardRewriter has been deployed on Kuaishou, one of China’s largest short-video platforms, serving hundreds of millions of users daily.
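Illustrative sketch (not from the paper): group relative policy optimization scores each sampled rewrite relative to the other rewrites generated for the same query. The snippet below shows that normalization step together with a blended reward. The linear blend, the weight `w_relevance`, and the use of the population standard deviation are assumptions made for illustration; the abstract only states that the reward balances query relevance and retrieval effectiveness.

```python
from statistics import mean, pstdev


def combined_reward(relevance, retrieval, w_relevance=0.5):
    """HYPOTHETICAL reward: linear blend of relevance and retrieval effectiveness."""
    return w_relevance * relevance + (1.0 - w_relevance) * retrieval


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rewrites sampled for the same query.

    Each advantage is (reward - group mean) / (group std + eps), so rewrites
    that beat their siblings get positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate rewrites for one query: (relevance, retrieval effectiveness).
candidates = [(0.9, 0.7), (0.4, 0.8), (0.9, 0.2), (0.1, 0.1)]
rewards = [combined_reward(rel, ret) for rel, ret in candidates]
print(group_relative_advantages(rewards))
```

Because the baseline is the group mean, no separate value network is needed to estimate it.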
[274] REGENT: Relevance-Guided Attention for Entity-Aware Multi-Vector Neural Re-Ranking
Shubham Chatterjee
Main category: cs.IR
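TL;DR: The paper proposes REGENT, which uses entity-guided attention to combine fine-grained lexical matching with high-level semantic reasoning and significantly improves neural re-ranking performance.
Details
Motivation: Existing neural re-rankers struggle with complex information needs and long, content-rich documents, mainly because they lack intelligent content selection around key entities and concepts. Humans naturally anchor their understanding on key entities and concepts, and the paper aims to mimic this ability.
Contribution: Proposes REGENT, which the authors describe as the first work to successfully integrate entity semantics directly into neural attention, establishing a new paradigm for entity-aware information retrieval.
Method: REGENT uses entities as a "semantic skeleton" to guide attention, combining fine-grained lexical matching with high-level semantic reasoning for content selection and precise matching.
Result: REGENT reaches new state-of-the-art performance on three challenging datasets, with up to 108% improvement over BM25, and consistently outperforms strong baselines including ColBERT and RankVicuna.
Insight: The paper highlights the value of integrating entity semantics directly into the attention mechanism and points to a direction for future entity-aware retrieval models.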
Abstract: Current neural re-rankers often struggle with complex information needs and long, content-rich documents. The fundamental issue is not computational; it is intelligent content selection: identifying what matters in lengthy, multi-faceted texts. While humans naturally anchor their understanding around key entities and concepts, neural models process text within rigid token windows, treating all interactions as equally important and missing critical semantic signals. We introduce REGENT, a neural re-ranking model that mimics human-like understanding by using entities as a “semantic skeleton” to guide attention. REGENT integrates relevance guidance directly into the attention mechanism, combining fine-grained lexical matching with high-level semantic reasoning. This relevance-guided attention enables the model to focus on conceptually important content while maintaining sensitivity to precise term matches. REGENT achieves new state-of-the-art performance on three challenging datasets, providing up to 108% improvement over BM25 and consistently outperforming strong baselines including ColBERT and RankVicuna. To our knowledge, this is the first work to successfully integrate entity semantics directly into neural attention, establishing a new paradigm for entity-aware information retrieval.
[275] FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection
Daniel Berhane Araya,Duoduo Liao
Main category: cs.IR
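TL;DR: FinVet is a new multi-agent framework that combines RAG pipelines with external fact-checking agents to improve the transparency and accuracy of financial misinformation detection.
Details
Motivation: Misinformation in financial markets can cause huge losses, and existing methods lack transparency and provide limited attribution to credible sources.
Contribution: Proposes FinVet, which integrates RAG with external fact-checking, dynamically adjusts its verification strategy, and returns multi-faceted results (evidence, source attribution, confidence scores, and uncertainty flags).
Method: Two RAG pipelines and external fact-checking are combined through confidence-weighted voting; an adaptive three-tier process selects the verification strategy based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model-based analysis.
Result: On the FinFact dataset, FinVet reaches an F1 score of 0.85, a 10.4% improvement over the best individual pipeline and a 37% improvement over standalone RAG.
Insight: A dynamic multi-agent approach that combines RAG with external fact-checking can markedly improve financial misinformation detection while increasing transparency and trustworthiness.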
Abstract: Financial markets face growing threats from misinformation that can trigger billions in losses in minutes. Most existing approaches lack transparency in their decision-making and provide limited attribution to credible sources. We introduce FinVet, a novel multi-agent framework that integrates two Retrieval-Augmented Generation (RAG) pipelines with external fact-checking through a confidence-weighted voting mechanism. FinVet employs adaptive three-tier processing that dynamically adjusts verification strategies based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model-based analysis. Unlike existing methods, FinVet provides evidence-backed verdicts, source attribution, confidence scores, and explicit uncertainty flags when evidence is insufficient. Experimental evaluation on the FinFact dataset shows that FinVet achieves an F1 score of 0.85, which is a 10.4% improvement over the best individual pipeline (fact-check pipeline) and 37% improvement over standalone RAG approaches.
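Illustrative sketch (not from the paper): the abstract describes a confidence-weighted voting mechanism over two RAG pipelines and an external fact-checker, plus explicit uncertainty flags. The snippet below shows one simple way such a vote could work. The tuple format, the `min_share` threshold, and the `"uncertain"` label are hypothetical choices, not FinVet's actual interface.

```python
from collections import defaultdict


def confidence_weighted_vote(verdicts, min_share=0.5):
    """Combine (source, label, confidence) verdicts into one decision.

    Each label accumulates the confidence of the sources voting for it.
    If the winning label holds less than `min_share` of the total weight
    (an ASSUMED rule), the result is flagged as uncertain.
    """
    totals = defaultdict(float)
    for _source, label, confidence in verdicts:
        totals[label] += confidence
    total_weight = sum(totals.values())
    if total_weight == 0:
        return "uncertain", 0.0
    label, weight = max(totals.items(), key=lambda kv: kv[1])
    share = weight / total_weight
    if share < min_share:
        return "uncertain", share
    return label, share


verdicts = [
    ("rag_pipeline_1", "false", 0.9),
    ("rag_pipeline_2", "true", 0.4),
    ("fact_check", "false", 0.7),
]
print(confidence_weighted_vote(verdicts))
```

Returning the winning share alongside the label keeps the decision inspectable, in the spirit of the confidence scores the abstract mentions.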
[276] MTMD: A Multi-Task Multi-Domain Framework for Unified Ad Lightweight Ranking at Pinterest
Xiao Yang,Peifeng Yin,Abe Engle,Jinfeng Zhuang,Ling Leng
Main category: cs.IR
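TL;DR: The paper proposes a Multi-Task Multi-Domain (MTMD) framework for unified lightweight ad ranking that addresses the challenges of combining multi-task learning with multi-domain data, improving both offline loss and online cost efficiency.
Details
Motivation: In ad recommendation systems the lightweight ranking layer is critical, but it must optimize multiple tasks (for example CTR and CVR) across multi-domain data (for example different ad products and serving surfaces). Conventional multi-task learning struggles to handle these requirements in a single model.
Contribution: 1) A unified multi-task multi-domain framework; 2) a mixture-of-experts architecture that learns both domain-specialized and shared knowledge; 3) a domain adaptation module that encourages knowledge transfer; 4) constraints on the modeling of different prediction tasks.
Method: Built on the Two-Tower paradigm, MTMD combines a mixture-of-experts architecture and a domain adaptation module with multi-task learning to rank across ad products and serving surfaces in one model.
Result: Offline loss improves by 12% to 36%, online cost per click drops by 2%, and the framework replaces 9 production models.
Insight: A mixture-of-experts architecture and explicit knowledge transfer are key to making multi-task learning effective across multiple domains.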
Abstract: The lightweight ad ranking layer, living after the retrieval stage and before the fine ranker, plays a critical role in the success of a cascaded ad recommendation system. Because there are multiple optimization tasks depending on the ad domain, e.g., Click Through Rate (CTR) for click ads and Conversion Rate (CVR) for conversion ads, as well as multiple surfaces where an ad is served (home feed, search, or related item recommendation) with diverse ad products (shopping or standard ad), it is a challenging problem in industry to do joint holistic optimization in the lightweight ranker such that the overall platform’s value, advertiser’s value, and user’s value are maximized. Deep Neural Network (DNN)-based multitask learning (MTL) can handle multiple goals naturally, with each prediction head mapping to a particular optimization goal. However, in practice, it is unclear how to unify data from different surfaces and ad products into a single model. It is critical to learn domain-specialized knowledge and explicitly transfer knowledge between domains to make MTL effective. We present a Multi-Task Multi-Domain (MTMD) architecture under the classic Two-Tower paradigm, with the following key contributions: 1) handle different prediction tasks, ad products, and ad serving surfaces in a unified framework; 2) propose a novel mixture-of-experts architecture to learn both the specialized knowledge of each domain and the common knowledge shared between domains; 3) propose a domain adaptation module to encourage knowledge transfer between experts; 4) constrain the modeling of different prediction tasks. MTMD improves the offline loss value by 12% to 36%, mapping to a 2% online reduction in cost per click. We have deployed this single MTMD framework into production for Pinterest ad recommendation, replacing 9 production models.
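Illustrative sketch (not from the paper): the abstract describes a mixture-of-experts design that learns both domain-specialized and shared knowledge, without giving the exact layout. The snippet below shows the generic gating step such designs build on: a softmax gate weights the outputs of several experts and sums them. The split into one shared and two domain experts and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, gate_logits):
    """Weighted sum of expert output vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


# HYPOTHETICAL setup: one shared expert and two domain experts
# (for example "search" and "home feed"), each emitting a 2-d vector.
shared, search_expert, feed_expert = [0.2, 0.4], [1.0, 0.0], [0.0, 1.0]
# Gate logits for a search request favor the search expert.
print(mix_experts([shared, search_expert, feed_expert], [0.5, 2.0, -1.0]))
```

In a trained model the gate logits would come from a learned function of the request's domain features rather than fixed constants.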
cs.SE
[277] A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
Jiale Guo,Suizhi Huang,Mei Li,Dong Huang,Xingsheng Chen,Regina Zhang,Zhijiang Guo,Han Yu,Siu-Ming Yiu,Christian Jensen,Pietro Lio,Kwok-Yan Lam
Main category: cs.SE
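TL;DR: This paper is a comprehensive survey of LLM-empowered agentic systems for software engineering. It focuses on how benchmarks and solutions relate, bridging the gap between evaluation and solution approaches.
Details
Motivation: As LLM use in software engineering shifts from traditional rule-based systems to sophisticated agentic systems, the field lacks a comprehensive understanding of how benchmarks and solutions interconnect, which hinders systematic progress and evaluation.
Contribution: Presents what the authors describe as the first holistic analysis of LLM-empowered software engineering, proposes a taxonomy spanning solutions and benchmarks, and traces the evolution from simple prompt engineering to complex agentic systems.
Method: Analyzes 150+ recent papers along two dimensions: solutions (prompt-based, fine-tuning-based, and agent-based paradigms) and benchmarks (code generation, translation, repair, and other tasks), and presents a unified pipeline from task specification to final deliverables.
Result: The field has evolved from prompt engineering to complex agentic systems that incorporate planning, reasoning, memory mechanisms, and tool augmentation; the survey connects 50+ benchmarks with their corresponding solution strategies.
Insight: The paper identifies critical research gaps and proposes future directions, including multi-agent collaboration frameworks, self-evolving code generation systems, and the integration of formal verification with LLM-based methods.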
Abstract: The integration of LLMs into software engineering has catalyzed a paradigm shift from traditional rule-based systems to sophisticated agentic systems capable of autonomous problem-solving. Despite this transformation, the field lacks a comprehensive understanding of how benchmarks and solutions interconnect, hindering systematic progress and evaluation. This survey presents the first holistic analysis of LLM-empowered software engineering, bridging the critical gap between evaluation and solution approaches. We analyze 150+ recent papers and organize them into a comprehensive taxonomy spanning two major dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, covering code generation, translation, repair, and other tasks. Our analysis reveals how the field has evolved from simple prompt engineering to complex agentic systems incorporating planning and decomposition, reasoning and self-refinement, memory mechanisms, and tool augmentation. We present a unified pipeline that illustrates the complete workflow from task specification to final deliverables, demonstrating how different solution paradigms address varying complexity levels across software engineering tasks. Unlike existing surveys that focus on isolated aspects, we provide full-spectrum coverage connecting 50+ benchmarks with their corresponding solution strategies, enabling researchers to identify optimal approaches for specific evaluation criteria. Furthermore, we identify critical research gaps and propose actionable future directions, including multi-agent collaboration frameworks, self-evolving code generation systems, and integration of formal verification with LLM-based methods. This survey serves as a foundational resource for researchers and practitioners seeking to understand, evaluate, and advance LLM-empowered software engineering systems.
cs.SD
[278] VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu,Wenfu Wang,Zuchao Li,Chenxing Li,Yiyang Zhao,Hanzhao Li,Liqiang Zhang,Meng Yu,Dong Yu
Main category: cs.SD
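TL;DR: VCB Bench is a high-quality Chinese benchmark for voice chatbots built entirely on real human speech. It addresses the gaps in language coverage, speech realism, and multi-dimensional evaluation left by existing benchmarks.
Details
Motivation: Existing benchmarks for multimodal conversational systems based on large audio language models (LALMs) are mostly English-centric, rely on synthetic speech, and lack comprehensive multi-dimensional evaluation. VCB Bench aims to close these gaps and advance Chinese voice conversational models.
Contribution: Presents VCB Bench, a high-quality benchmark built on real Chinese speech that covers three dimensions (instruction following, knowledge understanding, and robustness) and offers a standardized methodology and practical insights.
Method: Builds a dataset entirely from real human speech, defines three evaluation dimensions (instruction following, knowledge understanding, and robustness), and evaluates representative LALMs on them.
Result: Experiments reveal notable performance gaps among current LALMs and point to directions for improvement. VCB Bench provides reproducible, fine-grained evaluation.
Insight: Real speech data and multi-dimensional evaluation matter for developing voice conversational models; current LALMs still have room to improve in Chinese settings.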
Abstract: Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited – they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) – a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
[279] Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
KiHyun Nam,Jongmin Choi,Hyeongkeun Lee,Jungwoo Heo,Joon Son Chung
Main category: cs.SD
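TL;DR: Diffusion-Link is a lightweight diffusion-based module that narrows the audio-text modality gap and thereby improves the coupling between multimodal encoders and large language models. On automatic audio captioning it achieves state-of-the-art results without external knowledge.
Details
Motivation: Even with contrastive audio-language pretraining, a persistent audio-text modality gap limits the benefit of coupling multimodal encoders with large language models (LLMs). This work addresses the gap with a diffusion probabilistic model.
Contribution: Proposes Diffusion-Link, a lightweight diffusion-based modality-bridging module; applies diffusion-based modality bridging to automatic audio captioning (AAC) for the first time, to the authors' knowledge; and shows gains in both modality-gap analysis and the downstream task.
Method: Diffusion-Link is a lightweight network with three residual MLP blocks that generatively maps audio embeddings into the text-embedding distribution. It is trained on the output embeddings of the frozen multimodal encoder.
Result: Diffusion-Link substantially reduces the modality gap on similarity and geometric criteria and yields relative gains of up to 52.5% (zero-shot) and 7.5% (fully supervised) on AudioCaps captioning.
Insight: Closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction.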
Abstract: Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link
cs.MA
[280] The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
Thi-Nhung Nguyen,Linhao Luo,Thuy-Trang Vu,Dinh Phung
Main category: cs.MA
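TL;DR: This paper studies the emergence, propagation, and amplification of stereotypical bias in multi-agent systems (MAS). It finds that MAS are generally less robust to bias than single-agent systems, but that cooperative and debate-based communication can mitigate bias amplification.
Details
Motivation: Bias in large language models (LLMs) has been studied extensively, but bias dynamics in multi-agent systems remain largely unexplored. As MAS become more common, understanding how bias emerges and propagates in them becomes increasingly important.
Contribution: Presents a comprehensive study of stereotypical bias in MAS, analyzing how internal specialization, the underlying LLMs, and inter-agent communication protocols affect bias robustness, propagation, and amplification, and identifying factors that mitigate bias.
Method: The authors simulate social contexts in which agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios, using three bias benchmarks.
Result: MAS are generally less robust to bias than single-agent systems, but cooperative and debate-based communication can reduce bias amplification, and more robust underlying LLMs improve overall system stability.
Insight: Bias in MAS often emerges early through in-group favoritism, which highlights factors to consider when designing fair and robust multi-agent systems.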
Abstract: Bias in large language models (LLMs) remains a persistent challenge, manifesting in stereotyping and unfair treatment across social groups. While prior research has primarily focused on individual models, the rise of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and largely unexplored dynamics in bias emergence and propagation. In this work, we present a comprehensive study of stereotypical bias in MAS, examining how internal specialization, underlying LLMs and inter-agent communication protocols influence bias robustness, propagation, and amplification. We simulate social contexts where agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are generally less robust than single-agent systems, with bias often emerging early through in-group favoritism. However, cooperative and debate-based communication can mitigate bias amplification, while more robust underlying LLMs improve overall system stability. Our findings highlight critical factors shaping fairness and resilience in multi-agent LLM systems.
[281] Automating Structural Engineering Workflows with Large Language Model Agents
Haoran Liang,Yufa Zhou,Mohammad Talebi Kalaleh,Qipei Mei
Main category: cs.MA
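TL;DR: The paper introduces MASSE, an LLM-based multi-agent system that automates structural engineering workflows, sharply reducing expert workload while improving reliability and accuracy.
Details
Motivation: Structural engineering has substantial economic impact, yet its core workflows have remained largely unchanged for decades and are due for modernization and automation.
Contribution: Presents MASSE, which the authors describe as the first multi-agent system for structural engineering; it is training-free and can be deployed immediately in professional environments.
Method: Uses the complex reasoning, long-horizon planning, and tool-use abilities of LLMs in a multi-agent system that automates tasks such as interpreting design codes, executing load calculations, and verifying structural capacities.
Result: In validation on real-world case studies, MASSE reduces expert workload from approximately two hours to a few minutes while improving reliability and accuracy.
Insight: LLM-based multi-agent systems have considerable potential to improve efficiency and accuracy in traditional engineering domains.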
Abstract: We introduce MASSE, the first Multi-Agent System for Structural Engineering, effectively integrating large language model (LLM)-based agents with real-world engineering workflows. Structural engineering is a fundamental yet traditionally stagnant domain, with core workflows remaining largely unchanged for decades despite its substantial economic impact and global market size. Recent advancements in LLMs have significantly enhanced their ability to perform complex reasoning, long-horizon planning, and precise tool utilization – capabilities well aligned with structural engineering tasks such as interpreting design codes, executing load calculations, and verifying structural capacities. We present a proof-of-concept showing that most real-world structural engineering workflows can be fully automated through a training-free LLM-based multi-agent system. MASSE enables immediate deployment in professional environments, and our comprehensive validation on real-world case studies demonstrates that it can reduce expert workload from approximately two hours to mere minutes, while enhancing both reliability and accuracy in practical engineering scenarios.
cs.LG
[282] Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments
Zhao Tong,Chunlin Gong,Yimeng Gu,Haichao Shi,Qiang Liu,Shu Wu,Xiao-Yu Zhang
Main category: cs.LG
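TL;DR: The paper proposes a group-adaptive adversarial learning method that makes fake news detectors more robust to malicious comments by categorizing comments, generating diverse attacks, and dynamically adjusting the training focus.
Details
Motivation: Existing fake news detection models perform well in standard settings but are not robust to malicious comments, in particular adversarial comments written by real users or generated by large language models. Such comments subtly shift model decisions and degrade detection.
Contribution: 1. A comprehensive evaluation of comment attacks; 2. a group-adaptive adversarial training strategy comprising comment categorization, diverse attack generation, and dynamic adjustment of the training focus; 3. validation on benchmark datasets.
Method: 1. Divide adversarial comments into three categories: perceptual, cognitive, and societal; 2. use LLMs to generate diverse, category-specific attacks; 3. apply a Dirichlet-based adaptive sampling mechanism (InfoDirichlet Adjusting Mechanism) that dynamically adjusts the training focus.
Result: Experiments show that the method maintains strong detection accuracy while substantially increasing robustness to a wide range of adversarial comment perturbations.
Insight: Psychologically grounded categories organize adversarial comments effectively, and dynamically adjusting the training focus improves robustness more efficiently.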
Abstract: The spread of fake news online distorts public judgment and erodes trust in social media platforms. Although recent fake news detection (FND) models perform well in standard settings, they remain vulnerable to adversarial comments, authored by real users or by large language models (LLMs), that subtly shift model decisions. In view of this, we first present a comprehensive evaluation of comment attacks on existing fake news detectors and then introduce a group-adaptive adversarial training strategy to improve the robustness of FND models. To be specific, our approach comprises three steps: (1) dividing adversarial comments into three psychologically grounded categories: perceptual, cognitive, and societal; (2) generating diverse, category-specific attacks via LLMs to enhance adversarial training; and (3) applying a Dirichlet-based adaptive sampling mechanism (InfoDirichlet Adjusting Mechanism) that dynamically adjusts the learning focus across different comment categories during training. Experiments on benchmark datasets show that our method maintains strong detection accuracy while substantially increasing robustness to a wide range of adversarial comment perturbations.
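Illustrative sketch (not from the paper): step (3) of the abstract describes Dirichlet-based adaptive sampling across the three comment categories, but the update rule of the InfoDirichlet Adjusting Mechanism is not given. The snippet below assumes, for illustration only, that the Dirichlet concentration of each category grows with its share of the recent loss, so harder categories are sampled more often. All function names and the update rule are hypothetical.

```python
import random

CATEGORIES = ("perceptual", "cognitive", "societal")


def sample_dirichlet(alphas, rng):
    """Draw mixture proportions from Dirichlet(alphas) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def update_alphas(alphas, losses, step=1.0):
    """ASSUMED rule: add each category's share of the total loss to its alpha."""
    total = sum(losses)
    if total == 0:
        return list(alphas)
    return [a + step * (loss / total) for a, loss in zip(alphas, losses)]


def allocate_batch(proportions, batch_size):
    """Turn proportions into per-category counts that sum to batch_size."""
    counts = [int(p * batch_size) for p in proportions]
    remainder = max(0, batch_size - sum(counts))
    # Give leftover slots to the categories with the largest proportions.
    order = sorted(range(len(proportions)), key=lambda i: proportions[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return dict(zip(CATEGORIES, counts))


rng = random.Random(0)
alphas = [1.0, 1.0, 1.0]
# Pretend the detector currently struggles most with cognitive attacks.
alphas = update_alphas(alphas, losses=[0.2, 0.6, 0.2])
print(allocate_batch(sample_dirichlet(alphas, rng), batch_size=32))
```

Sampling the proportions, rather than using the loss shares directly, keeps some exploration of categories that currently look easy.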
[283] Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
Yue Huang,Hang Hua,Yujun Zhou,Pengcheng Jing,Manish Nagireddy,Inkit Padhi,Greta Dolcetti,Zhangchen Xu,Subhajit Chaudhury,Ambrish Rawat,Liubov Nedoshivina,Pin-Yu Chen,Prasanna Sattigeri,Xiangliang Zhang
Main category: cs.LG
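TL;DR: The paper builds a foundational guardrail for general agentic systems using synthetic data. It addresses the problem that existing guardrails mostly act after execution and closes three gaps: data, model, and evaluation.
Details
Motivation: Most existing guardrails intervene only after an agent has executed an action, which is hard to scale and leaves little room for controllable supervision at the plan level. This paper intervenes at the planning stage to prevent potential harm.
Contribution: 1) AuraGen, a controllable engine for synthesizing data; 2) Safiron, a foundational guardrail model; 3) Pre-Exec Bench, a benchmark for pre-execution safety.
Method: 1) AuraGen synthesizes benign trajectories, injects category-labeled risks, and filters outputs with an automated reward model; 2) Safiron combines a cross-planner adapter with a compact guardian model; 3) training proceeds in two stages with a broadly explored data recipe.
Result: Safiron consistently outperforms strong baselines on Pre-Exec Bench, and ablations distill actionable practices.
Insight: Intervening at the planning stage is key to preventing agent risks; synthetic data and an adapter that unifies input formats are effective tools for controllable supervision.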
Abstract: While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
[284] Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
Hehe Fan,Yi Yang,Mohan Kankanhalli,Fei Wu
Main category: cs.LG
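TL;DR: Translution is a new operation that unifies the adaptive identification capability of self-attention with the relative encoding advantage of convolution. A lightweight variant, α-Translution, addresses the large parameter count, and both outperform standard self-attention on several tasks.
Details
Motivation: Self-attention and convolution each have strengths and weaknesses: self-attention adaptively identifies relevant elements but relies on absolute positional embedding, while convolution encodes elements in a relative manner but is limited by a fixed kernel size and cannot select elements adaptively. The authors aim to unify the strengths of both.
Contribution: Proposes the Translution operation, which unifies the adaptive capability of self-attention and the relative encoding advantage of convolution, and designs the lightweight variant α-Translution to address the computational cost.
Method: Translution combines adaptive identification with relative encoding; α-Translution reduces the parameter count to make the operation practical.
Result: On computer vision and natural language processing tasks, Translution (including α-Translution) achieves higher accuracy than self-attention.
Insight: Unifying the strengths of self-attention and convolution can improve model performance, but careful parameter design is needed to keep the computational cost manageable.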
Abstract: When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these elements effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet its fixed kernel size limits its ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.
[285] RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
Jinghao Zhang,Naishan Zheng,Ruilin Li,Dongzhou Cheng,Zheming Liang,Feng Zhao,Jiaqi Wang
Main category: cs.LG
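TL;DR: RLFR is a reinforcement learning framework that shapes rewards with flow fields in latent space to improve the reasoning of large language models (LLMs). It addresses the tendency of RLVR with binary verification to overlook valuable exploration.
Details
Motivation: RLVR with binary verification tends to overlook potentially valuable exploration in reasoning trajectories, and golden Process Reward Models (PRMs) are expensive to annotate. RLFR aims to shape rewards at low cost using flow-field signals from latent space.
Contribution: 1. Proposes the RLFR framework, which uses flow-field reward signals from latent space to improve LLM reasoning; 2. shows that the expressive latent space is underexplored; 3. compresses off-policy expert data into a reference for reward signals, exploiting the context dependence captured in hidden states.
Method: RLFR constructs flow fields over model latents from off-policy high-quality data and on-policy rejection sampling data, and quantifies the velocity deviations of policy latents within them as the reward signal.
Result: Experiments show that flow rewards are reliable on both language and multimodal reasoning benchmarks, suggesting a promising paradigm for reward shaping with auxiliary signals.
Insight: Flow rewards in latent space capture context dependence efficiently and offer a new approach to reinforcement learning for LLMs.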
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, involving entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where the flow fields of model latents are constructed from both off-policy high-quality data and on-policy rejection sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space is much underexplored. Moreover, RLFR is able to compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotation, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
[286] Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
Michael Y. Hu,Benjamin Van Durme,Jacob Andreas,Harsh Jhamtani
Main category: cs.LG
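TL;DR: The paper proposes ECHO, a framework that improves the sample efficiency of language model agents through hindsight trajectory rewriting: it generates optimized trajectories from failed attempts, enabling more efficient online learning in environments where interaction is costly.
Details
Motivation: Language model agents learn with poor sample efficiency in novel environments, which limits their usefulness where interaction is costly (for example, with humans or physical systems). Existing approaches make limited use of a language model's ability to generate or reason about counterfactual trajectories. Contribution: 1. Proposes ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay to language model agents; 2. Designs an LM-based hindsight rule and an update rule that store and reuse past experience efficiently.
Method: 1. Hindsight rule: the language model identifies relevant subgoals and generates optimized trajectories for them; 2. Update rule: compressed trajectory representations are maintained in memory. The framework is evaluated on stateful versions of XMiniGrid and PeopleJoinQA.
Result: ECHO outperforms vanilla language agent baselines by up to 80% across XMiniGrid and PeopleJoinQA; on XMiniGrid it also outperforms sophisticated agent architectures such as Reflexion and AWM, showing faster adaptation to novel environments.
Insight: Having the language model generate optimized trajectories directly turns failed attempts into useful experience and markedly improves sample efficiency, which is practical for settings where interaction is costly.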
Abstract: Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs’ abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.
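The hindsight rule and update rule described above can be illustrated with a toy sketch. This is not the authors' implementation: in ECHO the language model itself proposes subgoals and rewrites trajectories, whereas here a deterministic rule stands in for it. The `Step` format and the function names `hindsight_rewrite` and `update_memory` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical trajectory format: (action taken, state reached).
Step = Tuple[str, str]


def hindsight_rewrite(trajectory: List[Step]) -> Dict[str, List[str]]:
    """For every state reached, return the shortest action prefix that reaches it.

    Stand-in for the LM-driven hindsight rule: each visited state is treated
    as an alternative goal that the (possibly failed) attempt did achieve.
    """
    rewritten: Dict[str, List[str]] = {}
    actions: List[str] = []
    for action, state in trajectory:
        actions.append(action)
        # The first visit yields the shortest prefix within this trajectory.
        rewritten.setdefault(state, list(actions))
    return rewritten


def update_memory(memory: Dict[str, List[str]], new: Dict[str, List[str]]) -> None:
    """Keep only the shortest known action sequence per goal (a compression stand-in)."""
    for goal, actions in new.items():
        if goal not in memory or len(actions) < len(memory[goal]):
            memory[goal] = actions


memory: Dict[str, List[str]] = {}
failed_attempt = [
    ("forward", "hallway"),
    ("left", "kitchen"),
    ("back", "hallway"),
    ("right", "study"),
]
update_memory(memory, hindsight_rewrite(failed_attempt))
update_memory(memory, hindsight_rewrite([("right", "study")]))
```

After both updates, the memory holds one synthetic positive example per reached goal, and the longer route to `study` has been replaced by the shorter one.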
[287] Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Hengyuan Zhang,Shiping Yang,Xiao Liang,Chenming Shang,Yuxuan Jiang,Chaofan Tao,Jing Xiong,Hayden Kwok-Hay So,Ruobing Xie,Angel X. Chang,Ngai Wong
Main category: cs.LG
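TL;DR: The paper proposes PerSyn, a router-guided multi-teacher distillation strategy that tailors synthetic data to each student model so that it learns more effectively.
Details
Motivation: Recent studies show that stronger teacher models are not always the best teachers: there is a mismatch between teacher outputs and student learnability, which calls for a personalized data synthesis method. Contribution: 1. Proposes the PerSyn framework, built on a "Route then Generate" paradigm; 2. Designs a query-level router that dynamically selects the optimal teacher; 3. Validates the approach on instruction tuning and math reasoning.
Method: PerSyn has two steps: a router first assigns each prompt to its optimal teacher, and each teacher then generates data only for its assigned prompts. This is more efficient than the conventional "Generate then Select" approach.
Result: Experiments show that PerSyn matches or outperforms all baselines across model families and scales.
Insight: Personalized data synthesis improves how effectively the student learns, and the router design is the key component. Future work could refine the routing strategy or extend it to other settings.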
Abstract: Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new "Route then Generate" paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional "Generate then Select" paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruction tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.
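A minimal sketch of the routing step may help make "Route then Generate" concrete. The abstract does not say how learnability and quality are combined; the weighted sum, the `alpha` parameter, and the name `route_prompts` are assumptions made purely for illustration.

```python
from typing import Dict, List


def route_prompts(
    quality: Dict[str, List[float]],
    learnability: Dict[str, List[float]],
    alpha: float = 0.5,
) -> List[str]:
    """Assign each prompt index to the teacher with the highest combined score.

    quality[t][i]      : predicted response quality of teacher t on prompt i
    learnability[t][i] : predicted learnability of that response for the student
    alpha              : hypothetical trade-off weight (not from the paper)
    """
    teachers = sorted(quality)
    n_prompts = len(quality[teachers[0]])
    assignment = []
    for i in range(n_prompts):
        best = max(
            teachers,
            key=lambda t: alpha * quality[t][i] + (1 - alpha) * learnability[t][i],
        )
        assignment.append(best)
    return assignment


quality = {"large": [0.9, 0.9], "small": [0.7, 0.6]}
learnability = {"large": [0.2, 0.8], "small": [0.9, 0.7]}
assignment = route_prompts(quality, learnability)
```

Only the assigned teacher would then generate a response for each prompt, which is where the efficiency gain over generating with every teacher comes from. In this toy input the first prompt goes to the smaller teacher because its output is easier for the student to learn from.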
[288] Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Xiaoyun Zhang,Xiaojian Yuan,Di Huang,Wang You,Chen Hu,Jingqing Ruan,Kejiang Chen,Xing Hu
Main category: cs.LG
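TL;DR: The paper revisits entropy regularization in reinforcement learning for large language models (LLMs) and proposes Adaptive Entropy Regularization (AER), which adjusts the coefficient dynamically to account for task difficulty and exploration needs, improving reasoning performance.
Details
Motivation: In Reinforcement Learning with Verifiable Rewards (RLVR), policy entropy collapse makes the policy overly deterministic, which limits exploration and reasoning performance. Conventional entropy regularization is unstable because it relies on a fixed coefficient, so its potential goes unrealized. Contribution: Proposes Adaptive Entropy Regularization (AER), with three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment, which together balance exploration and exploitation.
Method: The entropy regularization coefficient is adjusted dynamically to suit task difficulty and exploration needs, keeping policy entropy within a moderate range and avoiding both entropy collapse and excessive randomness.
Result: On multiple mathematical reasoning benchmarks, AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
Insight: The potential of entropy regularization has been underestimated. Adjusting the coefficient dynamically adapts it to the task, and keeping policy entropy in a moderate range is essential for balancing exploration and exploitation.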
Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
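Two of the three components named above, the initial-anchored target entropy and the dynamic global coefficient adjustment, can be sketched as a simple feedback controller. The update rule, `ratio`, `step_size`, and `max_coef` below are assumptions for illustration; the abstract does not specify the actual formulas.

```python
def target_entropy(initial_entropy: float, ratio: float = 0.5) -> float:
    """Anchor the target to a fraction of the initial policy entropy.

    `ratio` is a hypothetical hyperparameter; the abstract only states that
    the target lies in a moderate range below the initial level.
    """
    return ratio * initial_entropy


def update_coefficient(
    coef: float,
    entropy: float,
    target: float,
    step_size: float = 0.01,
    max_coef: float = 0.1,
) -> float:
    """Raise the entropy bonus when entropy is below target, lower it when above."""
    coef += step_size * (target - entropy)
    # Clamp so the bonus never becomes a penalty and never dominates the reward.
    return min(max(coef, 0.0), max_coef)


target = target_entropy(initial_entropy=1.0, ratio=0.5)
coef = 0.01
for observed_entropy in [0.9, 0.6, 0.3, 0.2]:  # entropy drifting downward
    coef = update_coefficient(coef, observed_entropy, target)
```

A proportional controller of this kind is one common way to hold a statistic near a set point; it is used here only because it is short, not because the paper is known to use it.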
[289] EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
Daniel Scalena,Leonidas Zotos,Elisabetta Fersini,Malvina Nissim,Ahmet Üstün
Main category: cs.LG
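TL;DR: The paper proposes EAGer, a training-free generation method that allocates compute dynamically based on token-wise entropy, improving both the performance and the efficiency of reasoning models.
Details
Motivation: Reasoning language models typically spend the same compute budget on every prompt when generating candidate sequences, ignoring differences in prompt complexity. EAGer is proposed to improve efficiency and reduce redundant computation. Contribution: EAGer uses the token-wise entropy distribution to allocate compute dynamically, branching into multiple reasoning paths only where uncertainty is high, which improves both efficiency and performance.
Method: EAGer identifies high-uncertainty positions through token entropy, branches only there, and reallocates the saved budget to the instances that most need exploration of alternative paths.
Result: On complex reasoning benchmarks such as AIME 2025, EAGer reallocates the budget without access to target labels and achieves the best trade-off between reasoning length and Pass@k. When target labels are available, it generates up to 65% fewer tokens and improves Pass@k by up to 37% compared with Full Parallel Sampling.
Insight: Dynamic compute allocation is key to improving reasoning language models, and exploiting model uncertainty reduces redundant computation.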
Abstract: With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, but allocates the same compute budget to each prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through the token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.
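The entropy test that gates branching can be written in a few lines. This is a sketch of the general idea only: the threshold value and the helper names `token_entropy`, `should_branch`, and `count_branch_points` are assumptions, and the paper's actual branching and budget-reallocation logic is not reproduced here.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs: List[float], threshold: float) -> bool:
    """Branch into an additional reasoning path only at high-entropy tokens."""
    return token_entropy(probs) > threshold


def count_branch_points(step_distributions: List[List[float]], threshold: float) -> int:
    """Count positions in a generated sequence where branching would trigger."""
    return sum(should_branch(p, threshold) for p in step_distributions)


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
n_branches = count_branch_points([confident, uncertain, confident], threshold=1.0)
```

A prompt whose sequence never crosses the threshold consumes a single path, and the unused budget can then go to prompts that branch often.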
[290] Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?
Zhengyu Chen,Jinluan Yang,Teng Xiao,Ruochen Zhou,Luan Zhang,Xiangyu Xi,Xiaowei Shi,Wei Wang,Jinggang Wang
Main category: cs.LG
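TL;DR: The paper studies how well tool-augmented reinforcement learning for large language models (LLMs) generalizes across domains, and proposes the TGRL framework to promote domain-agnostic learning and skill transfer.
Details
Motivation: LLMs perform well at tool use and reasoning, but how that ability generalizes across domains remains underexplored. The paper tests whether RL-driven tool use transfers effectively beyond the training domain. Contribution: 1) Proposes the TGRL framework for domain-agnostic tool use and skill transfer; 2) Introduces a standardized tool interface and a dual-component reward system; 3) Uses an XML-based prompt template that encourages modular planning.
Method: 1) A standardized tool interface abstracts away domain-specific details; 2) A dual-component reward system incentivizes generalizable behavior; 3) An XML template explicitly separates thinking, tool calls, and responses.
Result: Experiments show that RL-based tool use learned on mathematical tasks transfers effectively to other domains, with strong task performance and high token efficiency.
Insight: Tool-augmented RL has cross-domain generalization potential, and standardized design and modular thinking are key to it.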
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathematical problem-solving tasks. Despite the restricted training domain, we evaluate the agent’s performance across several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks can be effectively transferred to complex tasks in other domains, enabling great task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration, encompassing: (i) a standardized tool interface that abstracts domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes rewards to incentivize generalizable behaviors like tool efficiency and reasoning abstraction, ensuring alignment and robustness across domain shifts; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.
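To illustrate what separating thinking, tool calls, and responses with an XML-style template might look like in practice, here is a small parsing sketch. The tag names `think`, `tool_call`, and `response` are hypothetical; the abstract does not list the template's actual tags, and treating a `response` block as the explicit termination signal is likewise an assumption.

```python
import re
from typing import Dict, List

# Hypothetical tag names; the paper's template may use different ones.
TAGS = ("think", "tool_call", "response")


def parse_turn(text: str) -> Dict[str, List[str]]:
    """Extract the contents of each tagged section from one model turn."""
    return {
        tag: [
            match.strip()
            for match in re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        ]
        for tag in TAGS
    }


def is_terminated(text: str) -> bool:
    """Treat a turn as final once it contains a response block."""
    return bool(parse_turn(text)["response"])


turn = "<think>need a square root</think><tool_call>print(2 ** 0.5)</tool_call>"
parsed = parse_turn(turn)
```

Because the structure is the same regardless of the task domain, a parser like this can drive the tool loop without domain-specific handling.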
[291] ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models
Gareth Seneque,Lap-Hang Ho,Nafise Erfanian Saeedi,Jeffrey Molendijk,Ariel Kupermann,Tim Elson
Main category: cs.LG
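TL;DR: ENIGMA is a new approach to large language model (LLM) training that jointly improves reasoning, alignment, and robustness through information geometry, treating an organisation's policies as directions on the model's information manifold. A single-loop trainer combines GRPO, a SAMI-style auxiliary loss, and Sinkhorn regularisation. The proposed Sufficiency Index (SI) guides the selection of principles, and experiments show that high-SI principles predict steadier training and better performance.
Details
Motivation: Improving reasoning, alignment, and robustness in LLMs currently requires complex multi-stage optimisation. ENIGMA aims to unify these goals under an information-geometric view and simplify the training pipeline. Contribution: 1. Proposes the ENIGMA framework, which jointly optimises reasoning, alignment, and robustness; 2. Introduces the Sufficiency Index (SI) for selecting principles; 3. Combines GRPO, SAMI, and Sinkhorn regularisation in one trainer; 4. Validates the approach with an information-geometry analysis.
Method: 1. GRPO (a critic-free reinforcement learning method) with CoT-format rewards; 2. A SAMI-style symmetric InfoNCE auxiliary loss; 3. Entropic Sinkhorn optimal-transport regularisation; 4. The SI metric, which measures how strongly the CoT encodes the policies.
Result: In experiments on 1B-parameter LLMs, high-SI principles improve training stability and benchmark performance, and the information-geometry analysis confirms the expected structural change in the model manifold.
Insight: Reasoning, alignment, and robustness may be projections of a single information-geometric objective. ENIGMA achieves principled reasoning without a reward model, offering a path to trusted capability.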
Abstract: We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation’s policies/principles as directions to move on a model’s information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format-only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce InfoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model’s CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single information-geometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability.
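For readers unfamiliar with the symmetric InfoNCE auxiliary mentioned above, the following sketch shows the generic form of such a loss over a square score matrix. It assumes that `scores[i][j]` is a precomputed compatibility score between principle `i` and response `j`, with matched pairs on the diagonal; how ENIGMA computes these scores is not covered here, and the function names are hypothetical.

```python
import math
from typing import List


def _row_cross_entropy(scores: List[List[float]]) -> float:
    """Mean cross-entropy of each row, with the diagonal entry as the positive."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def symmetric_info_nce(scores: List[List[float]]) -> float:
    """Average the row-wise and column-wise contrastive losses."""
    transposed = [list(col) for col in zip(*scores)]
    return 0.5 * (_row_cross_entropy(scores) + _row_cross_entropy(transposed))


uninformative = [[0.0] * 3 for _ in range(3)]
aligned = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
loss_uninformative = symmetric_info_nce(uninformative)
loss_aligned = symmetric_info_nce(aligned)
```

With no information the loss equals `log(n)`, and it falls toward zero as matched pairs score higher than mismatched ones, which is why the negated loss plus `log(n)` serves as a mutual-information lower bound.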
[292] ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Yuhang Li,Chenchen Zhang,Ruilin Lv,Ao Liu,Ken Deng,Yuanxing Zhang,Jiaheng Liu,Wiggin Zhou,Bo Zhou
Main category: cs.LG
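TL;DR: ReLook is a vision-grounded reinforcement learning framework that invokes a multimodal LLM (MLLM) as a tool to close the generate-diagnose-refine loop for front-end code generation.
Details
Motivation: Large Language Models (LLMs) excel at algorithmic code generation but struggle with front-end development, where correctness depends on rendered pixels and interaction. ReLook targets this gap. Contribution: Proposes the ReLook framework, which uses an MLLM as both visual critic and feedback source, optimises front-end code generation with reinforcement learning, and prevents reward hacking and behavioral collapse through a zero-reward rule and Forced Optimization.
Method: During training, the MLLM serves as visual critic and feedback source, combined with a zero-reward rule for invalid renders and Forced Optimization. At inference, the critic is decoupled and a lightweight self-edit loop is run.
Result: On three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation.
Insight: Combining visual feedback with reinforcement learning improves the quality and robustness of front-end code generation.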
Abstract: While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code with screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
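The zero-reward rule and the Forced Optimization acceptance rule can be sketched as follows. Candidate revisions are represented as hypothetical `(rendered_ok, critic_score)` pairs; in the actual system these would come from rendering the page and querying the MLLM critic, and the detail that the first draft is always accepted as the starting point is an assumption of this sketch.

```python
from typing import List, Tuple

# Hypothetical candidate format: (did the page render?, critic score).
Candidate = Tuple[bool, float]


def reward(rendered_ok: bool, critic_score: float) -> float:
    """Zero-reward rule: an invalid render earns nothing, whatever the critic says."""
    return critic_score if rendered_ok else 0.0


def forced_optimization(candidates: List[Candidate]) -> List[int]:
    """Return indices of accepted revisions; only strict improvements are admitted."""
    accepted: List[int] = []
    best = float("-inf")  # assumption: the first draft is accepted as the baseline
    for index, (rendered_ok, critic_score) in enumerate(candidates):
        r = reward(rendered_ok, critic_score)
        if r > best:
            accepted.append(index)
            best = r
    return accepted


candidates = [(True, 0.4), (True, 0.3), (False, 0.9), (True, 0.6)]
accepted = forced_optimization(candidates)
```

The third candidate has the highest critic score but does not render, so it is rejected; the accepted sequence therefore has strictly increasing reward.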
[293] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
Nianyi Lin,Jiajie Zhang,Lei Hou,Juanzi Li
Main category: cs.LG
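TL;DR: BGPO is a memory-efficient reinforcement learning algorithm for diffusion large language models. It maximizes a specially constructed lower bound of the ELBO-based objective, removing the memory overhead of earlier methods and allowing larger sample sizes and better performance.
Details
Motivation: Existing methods approximate the likelihood of diffusion LLMs with Monte Carlo samples and must retain the forward computational graph of every sample, which causes large memory overhead, restricts the sample size, and degrades the approximation. Contribution: Proposes Boundary-Guided Policy Optimization (BGPO), whose lower bound is linear in the samples and equivalent to the ELBO-based objective in on-policy training, enabling low memory use, large sample sizes, and better RL performance.
Method: BGPO constructs a lower bound in the form of a linear sum in which each term depends on a single MC sample, so gradients can be accumulated across samples with constant memory. In on-policy training, both the value and the gradient of the bound equal those of the ELBO-based objective.
Result: Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs on math problem solving, code generation, and planning tasks.
Insight: A carefully designed linear lower bound resolves the memory problem while preserving equivalence with the original objective, suggesting a direction for designing efficient RL algorithms.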
Abstract: A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated as a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
[294] QeRL: Beyond Efficiency – Quantization-enhanced Reinforcement Learning for LLMs
Wei Huang,Yi Ge,Shuai Yang,Yicheng Xiao,Huizi Mao,Yujun Lin,Hanrong Ye,Sifei Liu,Ka Chun Cheung,Hongxu Yin,Yao Lu,Xiaojuan Qi,Song Han,Yukang Chen
Main category: cs.LG
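TL;DR: QeRL is a quantization-enhanced reinforcement learning framework that combines NVFP4 quantization with LoRA to make RL training of large language models (LLMs) more efficient, and uses quantization noise to strengthen policy exploration and improve training outcomes.
Details
Motivation: RL training of LLMs is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these costs and also explores the potential benefit of quantization noise for policy exploration. Contribution: QeRL combines NVFP4 quantization with LoRA to improve RL training efficiency, and introduces an Adaptive Quantization Noise (AQN) mechanism that adjusts noise dynamically. It is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU.
Method: QeRL pairs NVFP4 quantization with LoRA to reduce memory overhead and accelerate the rollout phase of RL, and introduces AQN to adjust quantization noise dynamically for better exploration.
Result: Experiments show a speedup of more than 1.5x in the rollout phase. On the 7B model, QeRL matches full-parameter fine-tuning (GSM8K 90.8%, MATH 500 77.4%), with faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA.
Insight: Quantization noise increases policy entropy, which strengthens exploration and helps discover better strategies. QeRL indicates that quantization can improve RL training itself, in addition to efficiency, when the noise is adjusted dynamically.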
Abstract: We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs’ reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
[295] Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
Filippo Rinaldi,Aniello Panariello,Giacomo Salici,Fengyuan Liu,Marco Ciccone,Angelo Porrello,Simone Calderara
Main category: cs.LG
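TL;DR: The paper proposes GradFix, a method that uses gradient sign structure to transport task vectors across pre-trained models, avoiding repeated fine-tuning.
Details
Motivation: When a new version of a foundation model is released, practitioners usually have to repeat full fine-tuning even if the task was already solved on the previous version. The paper aims to avoid this repetition by transporting task vectors. Contribution: Proposes GradFix, which approximates the ideal gradient sign structure of the new model and transfers the task vector using only a handful of labeled samples, with no additional fine-tuning.
Method: GradFix computes a few gradients at the target model and masks the source task vector accordingly, producing an update that is locally aligned with the target loss landscape.
Result: On vision and language benchmarks, GradFix consistently outperforms naive task vector addition and few-shot fine-tuning.
Insight: The sign structure of the gradients is the key to successful task vector transfer, and local alignment with the target loss landscape improves transfer performance.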
Abstract: When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
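The masking idea can be illustrated on plain lists of numbers. The sketch below assumes one plausible reading of the abstract: keep a task-vector entry only when its sign agrees with the descent direction (the negative gradient) at the target model. The exact masking rule used by GradFix may differ, and `sign_mask` and `first_order_change` are hypothetical names.

```python
from typing import List


def sign_mask(task_vector: List[float], target_grad: List[float]) -> List[float]:
    """Zero out entries whose sign disagrees with the descent direction -grad."""
    return [t if t * g < 0 else 0.0 for t, g in zip(task_vector, target_grad)]


def first_order_change(update: List[float], grad: List[float]) -> float:
    """First-order estimate of the loss change when `update` is added to the weights."""
    return sum(u * g for u, g in zip(update, grad))


task_vector = [0.5, -0.2, 0.3, -0.4]   # from fine-tuning the old base model
target_grad = [-1.0, -2.0, 0.5, 0.1]   # few-shot gradient at the new base model

masked = sign_mask(task_vector, target_grad)
naive_change = first_order_change(task_vector, target_grad)
masked_change = first_order_change(masked, target_grad)
```

Every surviving entry contributes a non-positive term to the inner product with the gradient, so the masked update cannot increase the loss to first order, which is consistent with the first-order descent guarantee the abstract mentions. In this toy case, naive addition would slightly increase the loss while the masked update decreases it.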
[296] Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Luca Scimeca,Thomas Jiralerspong,Berton Earnshaw,Jason Hartford,Yoshua Bengio
Main category: cs.LG
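TL;DR: The paper proposes spectrally anisotropic Gaussian diffusion (SAGD), which steers diffusion models by replacing the usual isotropic forward noise with a structured covariance, better matching the data distribution, improving generation quality, and enabling selective omission.
Details
Motivation: Diffusion Probabilistic Models (DPMs) achieve strong generative performance, but their inductive biases remain largely implicit. The paper aims to build inductive biases explicitly into training and sampling to better model the target data distribution. Contribution: 1. Proposes spectrally anisotropic Gaussian diffusion (SAGD), which replaces isotropic noise with a structured covariance; 2. Derives the score relation for anisotropic covariances; 3. Shows improved results on vision datasets and the ability to omit selected content.
Method: An anisotropic noise operator replaces the forward covariance with a frequency-diagonal covariance, allowing designated frequency bands to be emphasized or suppressed while the forward process stays Gaussian. The theory shows that the learned score converges to the true data score and that anisotropy reshapes the probability-flow path.
Result: Experiments show that SAGD outperforms standard diffusion across several vision datasets and can learn while ignoring known corruptions confined to specific frequency bands.
Insight: Anisotropic noise offers a simple, interpretable handle on the inductive bias of diffusion models, and band-level control shows promise for new data modeling tasks.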
Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t \to 0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
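Noise with a frequency-diagonal covariance can be produced by scaling white Gaussian noise independently per frequency. The one-dimensional sketch below uses a hand-written discrete Fourier transform so that it needs only the standard library; the band-pass weighting and all function names are illustrative assumptions, and the paper's operator works on images and also covers power-law weightings.

```python
import cmath
import random
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def band_pass_weights(n: int, low: int, high: int) -> List[float]:
    """Weight 1 for frequencies in [low, high], 0 elsewhere (symmetric in k and n - k)."""
    return [1.0 if low <= min(k, n - k) <= high else 0.0 for k in range(n)]


def anisotropic_noise(n: int, weights: List[float], rng: random.Random) -> List[float]:
    """White Gaussian noise reshaped by per-frequency weights; still Gaussian."""
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    shaped = [w * c for w, c in zip(weights, dft(white))]
    # Symmetric real weights keep the spectrum conjugate-symmetric, so the
    # inverse transform is real up to rounding error.
    return [value.real for value in idft(shaped)]


rng = random.Random(0)
n = 16
noise = anisotropic_noise(n, band_pass_weights(n, low=2, high=4), rng)
```

Because the shaping is linear, the result remains Gaussian, which matches the abstract's point that the forward process stays Gaussian. A forward step would then mix this noise with the data in the usual way, with the noise carrying energy only in the selected band.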
[297] Semantic-Cohesive Knowledge Distillation for Deep Cross-modal Hashing
Changchang Sun,Vickie Chen,Yan Yan
Main category: cs.LG
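TL;DR: The paper proposes SODA, a semantic-cohesive knowledge distillation scheme for cross-modal hashing that addresses the lack of explicit interaction between multi-label semantic extraction and the raw multimodal data. SODA introduces multi-label information as a new textual modality and uses a cross-modal teacher network to distill semantic characteristics, learning a better Hamming space. Experiments show it outperforms existing methods.
Details
Motivation: Existing deep cross-modal hashing methods do not let multi-label semantic extraction interact explicitly with the raw multimodal data, so the learned semantic information is incompatible with the heterogeneous multimodal data, which hinders bridging the modality gap. Contribution: 1. Proposes the semantic-cohesive knowledge distillation scheme SODA, which treats multi-label information as a new textual modality; 2. Designs a cross-modal teacher network that distills semantic characteristics between the image and label modalities; 3. Learns a well-mapped Hamming space that serves as prior knowledge for the student network.
Method: 1. Reformulate multi-label information as label prompts in the text modality; 2. Train a cross-modal teacher network to distill semantic characteristics; 3. Use the resulting Hamming space to guide the cross-modal student network.
Result: Experiments on two benchmark datasets show that SODA outperforms state-of-the-art methods.
Insight: Treating multi-labels as a textual modality and applying knowledge distillation makes cross-modal semantic learning explicit, which improves cross-modal hashing.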
Abstract: Recently, deep supervised cross-modal hashing methods have achieved compelling success by learning semantic information in a self-supervised way. However, they still suffer from the key limitation that the multi-label semantic extraction process fails to explicitly interact with raw multimodal data, making the learned representation-level semantic information incompatible with the heterogeneous multimodal data and hindering the performance of bridging the modality gap. To address this limitation, in this paper, we propose a novel semantic cohesive knowledge distillation scheme for deep cross-modal hashing, dubbed SODA. Specifically, the multi-label information is introduced as a new textual modality and reformulated as a set of ground-truth label prompts, depicting the semantics presented in the image like the text modality. Then, a cross-modal teacher network is devised to effectively distill cross-modal semantic characteristics between image and label modalities and thus learn a well-mapped Hamming space for the image modality. In a sense, such a Hamming space can be regarded as a kind of prior knowledge to guide the learning of the cross-modal student network and comprehensively preserve the semantic similarities between the image and text modalities. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over the state-of-the-art methods.
[298] Deep Neural Networks Inspired by Differential Equations
Yongshuai Liu,Lianfang Wang,Kuilin Qin,Qinghua Zhang,Faqiang Wang,Li Cui,Jun Liu,Yuping Duan,Tieyong Zeng
Main category: cs.LG
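TL;DR: The paper reviews deep neural network architectures and dynamic modeling methods inspired by differential equations, focusing on ODE- and SDE-based network models, their performance, and their characteristics, with the aim of improving interpretability and generalization.
Details
Motivation: The success of deep learning comes with open challenges in theoretical understanding, interpretability, and generalization. A differential equations perspective offers a unified framework and a systematic design methodology for addressing them. Contribution: Systematically summarizes ODE- and SDE-based deep neural network architectures and dynamic modeling methods, and outlines research directions for combining differential equations with deep learning.
Method: Examines deterministic dynamical network models based on ODEs and stochastic dynamical network models based on SDEs, and illustrates their characteristics and performance through numerical comparisons.
Result: The review indicates that models inspired by differential equations show promise for interpretability and generalization.
Insight: Combining differential equations with deep learning offers new ideas for intelligent computational methods, particularly for improving interpretability and generalization.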
Abstract: Deep learning has become a pivotal technology in fields such as computer vision, scientific computing, and dynamical systems, significantly advancing these disciplines. However, neural networks persistently face challenges related to theoretical understanding, interpretability, and generalization. To address these issues, researchers are increasingly adopting a differential equations perspective to propose a unified theoretical framework and systematic design methodologies for neural networks. In this paper, we provide an extensive review of deep neural network architectures and dynamic modeling methods inspired by differential equations. We specifically examine deep neural network models and deterministic dynamical network constructs based on ordinary differential equations (ODEs), as well as regularization techniques and stochastic dynamical network models informed by stochastic differential equations (SDEs). We present numerical comparisons of these models to illustrate their characteristics and performance. Finally, we explore promising research directions in integrating differential equations with deep learning to offer new insights for developing intelligent computational methods that boast enhanced interpretability and generalization capabilities.
[299] Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry
Atharv Goel,Sharat Agarwal,Saket Anand,Chetan Arora
Main category: cs.LG
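TL;DR: NCAL-R exploits the geometric regularities of deep networks and proposes two complementary signals (Class-Mean Alignment Perturbation and Feature Fluctuation) for selecting reliable samples, reducing the impact of label noise and distribution shift and outperforming conventional AL methods on ImageNet-100 and CIFAR100.
Details
Motivation: Conventional active learning (AL) methods perform poorly under noisy labels or distribution shift, and often amplify annotation errors by selecting mislabeled or redundant samples. A more reliable approach is needed. Contribution: Proposes the NCAL-R framework, which selects samples using the Class-Mean Alignment Perturbation and Feature Fluctuation signals, making AL more robust to noisy labels and distribution shift.
Method: Combines Class-Mean Alignment Perturbation (how a sample affects inter-class geometry) with Feature Fluctuation (the temporal instability of representations across training checkpoints) to select samples.
Result: NCAL-R outperforms standard AL baselines on ImageNet-100 and CIFAR100, with higher accuracy and stronger generalization.
Insight: Exploiting the geometric regularities of deep networks makes active learning more reliable and robust, which suits real-world labeling pipelines.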
Abstract: Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. In practice, annotators make mistakes, rare categories are ambiguous, and conventional AL heuristics (uncertainty, diversity) often amplify such errors by repeatedly selecting mislabeled or redundant samples. We propose Reliable Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision. Our method introduces two complementary signals: (i) a Class-Mean Alignment Perturbation score, which quantifies how candidate samples structurally stabilize or distort inter-class geometry, and (ii) a Feature Fluctuation score, which captures temporal instability of representations across training checkpoints. By combining these signals, NCAL-R prioritizes samples that both preserve class separation and highlight ambiguous regions, mitigating the effect of noisy or redundant labels. Experiments on ImageNet-100 and CIFAR100 show that NCAL-R consistently outperforms standard AL baselines, achieving higher accuracy with fewer labels, improved robustness under synthetic label noise, and stronger generalization to out-of-distribution data. These results suggest that incorporating geometric reliability criteria into acquisition decisions can make Active Learning less brittle to annotation errors and distribution shifts, a key step toward trustworthy deployment in real-world labeling pipelines. Our code is available at https://github.com/Vision-IIITD/NCAL.
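The Feature Fluctuation signal is described as the temporal instability of a sample's representation across training checkpoints. One simple way to quantify that, shown below, is the mean squared deviation of the feature vector from its average over checkpoints. This specific formula and the name `feature_fluctuation` are assumptions; the paper's exact definition is not given in the abstract.

```python
from typing import List


def feature_fluctuation(features_per_checkpoint: List[List[float]]) -> float:
    """Mean squared deviation of one sample's features from their checkpoint average."""
    n = len(features_per_checkpoint)
    dim = len(features_per_checkpoint[0])
    mean = [sum(f[d] for f in features_per_checkpoint) / n for d in range(dim)]
    return sum(
        sum((f[d] - mean[d]) ** 2 for d in range(dim))
        for f in features_per_checkpoint
    ) / n


# Features of two candidate samples at three checkpoints (toy values).
stable_sample = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
unstable_sample = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

stable_score = feature_fluctuation(stable_sample)
unstable_score = feature_fluctuation(unstable_sample)
```

A sample whose representation keeps moving between checkpoints receives a higher score, flagging it as lying in an ambiguous region.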
[300] Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
Lianghuan Huang,Yingshan Chang
Main category: cs.LG
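TL;DR: The paper studies the difference between decodability and causality in vision transformers (ViTs). Using activation patching and linear probes, it finds that the two do not coincide, exposing a gap between what information is present and what is used.
Details
Motivation: To disentangle decodability from causality in neural networks and clarify their distinct roles, in order to better understand how models work internally. Contribution: Shows experimentally that decodability and causality in ViTs are independent dimensions, revealing a mismatch between the presence of information and its functional use.
Method: Activation patching tests the causal role of spatial and CLS tokens, while linear probes measure how decodable count information is at different depths.
Result: Middle-layer object tokens are strongly causal but weakly decodable, whereas final-layer object tokens are highly decodable but functionally inert. The CLS token becomes decodable in the middle layers but acquires causal influence only in the final layers.
Insight: Decodability and causality are complementary, and their divergence can help expose hidden computational circuits.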
Abstract: Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability (the recoverability of information from hidden states) and causality (the extent to which those states functionally influence outputs). In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation (what information is present versus what is used) and that their divergence can expose hidden computational circuits.
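The distinction between decodable and causal can be made concrete with a two-unit toy model, which is far simpler than a ViT and is offered only as an illustration of the two measurement techniques. All functions below are hypothetical: the hidden layer stores the input in both units, but the readout uses only unit 0.

```python
from typing import List


def hidden(x: float) -> List[float]:
    """Both units encode x, so x is linearly decodable from either one."""
    return [x, 2.0 * x + 1.0]


def readout(h: List[float]) -> float:
    """Only unit 0 influences the output; unit 1 is functionally inert."""
    return 3.0 * h[0]


def decode_from_unit_1(h: List[float]) -> float:
    """A perfect linear probe on unit 1."""
    return (h[1] - 1.0) / 2.0


def causal_effect(clean_x: float, corrupt_x: float, unit: int) -> float:
    """Activation patching: fraction of the clean-vs-corrupt output gap restored
    by copying one unit's clean activation into the corrupted run."""
    clean_out = readout(hidden(clean_x))
    corrupt_out = readout(hidden(corrupt_x))
    patched = hidden(corrupt_x)
    patched[unit] = hidden(clean_x)[unit]
    return (readout(patched) - corrupt_out) / (clean_out - corrupt_out)


effect_unit_0 = causal_effect(clean_x=2.0, corrupt_x=5.0, unit=0)
effect_unit_1 = causal_effect(clean_x=2.0, corrupt_x=5.0, unit=1)
probe_value = decode_from_unit_1(hidden(5.0))
```

Unit 1 is perfectly decodable yet patching it changes nothing, mirroring the final-layer object tokens described above; a probe alone would overstate its importance.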
[301] Decomposer Networks: Deep Component Analysis and Synthesis
Mohsen Joneidi
Main category: cs.LG
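TL;DR: Decomposer Networks (DecompNet) is a semantic autoencoder that factorizes an input into multiple interpretable components through parallel branches, using a residual update rule that makes the branches compete.
Details
Motivation: Classical autoencoders compress an input into a single latent representation, which makes multi-component semantics hard to capture. DecompNet aims to decompose the input semantically using parallel branches and a residual update rule. Contribution: Proposes the first semantic autoencoder based on an all-but-one residual update rule, which enforces explicit competition among branches and yields parsimonious, semantically meaningful representations.
Method: A Gauss-Seidel style block-coordinate descent is unrolled into a differentiable network, and components are separated through parallel branches and residual inputs.
Result: The paper positions DecompNet relative to linear decomposition methods (PCA, NMF) and object-centric architectures (MONet, IODINE, and others), and argues that it yields more semantically meaningful representations.
Insight: Introducing competition through the residual update rule enables decomposition into multiple semantic components, which suits the analysis of complex inputs.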
Abstract: We propose Decomposer Networks (DecompNet), a semantic autoencoder that factorizes an input into multiple interpretable components. Unlike classical autoencoders that compress an input into a single latent representation, the Decomposer Network maintains N parallel branches, each assigned a residual input defined as the original signal minus the reconstructions of all other branches. By unrolling a Gauss–Seidel style block-coordinate descent into a differentiable network, DecompNet enforces explicit competition among components, yielding parsimonious, semantically meaningful representations. We situate our model relative to linear decomposition methods (PCA, NMF), deep unrolled optimization, and object-centric architectures (MONet, IODINE, Slot Attention), and highlight its novelty as the first semantic autoencoder to implement an all-but-one residual update rule.
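The all-but-one residual rule can be demonstrated with linear stand-ins for the branches. In the sketch below each branch is a projection onto a fixed direction rather than a learned autoencoder, which is an assumption made to keep the example small; `project` and `decompose` are hypothetical names.

```python
from typing import List


def project(v: List[float], direction: List[float]) -> List[float]:
    """Orthogonal projection of v onto a single direction (the toy 'branch')."""
    coef = sum(a * b for a, b in zip(v, direction)) / sum(b * b for b in direction)
    return [coef * b for b in direction]


def decompose(
    x: List[float], directions: List[List[float]], sweeps: int = 50
) -> List[List[float]]:
    """Gauss-Seidel sweeps: each branch re-fits the signal minus all other branches."""
    components = [[0.0] * len(x) for _ in directions]
    for _ in range(sweeps):
        for i, direction in enumerate(directions):
            # All-but-one residual: subtract every reconstruction except branch i's own.
            residual = [
                x[j] - sum(components[k][j] for k in range(len(components)) if k != i)
                for j in range(len(x))
            ]
            components[i] = project(residual, direction)
    return components


x = [3.0, 4.0]
components = decompose(x, directions=[[1.0, 0.0], [1.0, 1.0]])
reconstruction = [sum(c[j] for c in components) for j in range(len(x))]
```

Because each branch only sees what the others have not explained, the branches compete for the signal. For these two non-orthogonal directions the sweeps converge to the unique split `x = -1 * [1, 0] + 4 * [1, 1]`.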
[302] INR-Bench: A Unified Benchmark for Implicit Neural Representations in Multi-Domain Regression and Reconstruction
Linfei Li,Fengyi Zhang,Zhong Wang,Lin Zhang,Ying Shen
Main category: cs.LG
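TL;DR: The paper introduces INR-Bench, a unified benchmark for evaluating implicit neural representations (INRs) on multi-domain regression and reconstruction tasks, and analyzes how model architecture, positional encoding, and nonlinear primitives affect the response to signals of different frequencies.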
Details
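Motivation: Implicit neural representations (INRs) perform well in signal processing tasks thanks to their continuity and infinite resolution, but the factors behind their effectiveness and limitations remain underexplored. To better understand these factors, the authors start from Neural Tangent Kernel (NTK) theory and analyze how model architecture, positional encoding, and nonlinear primitives affect the response to signal frequency. Contribution: 1. Introduces INR-Bench, the first comprehensive benchmark designed specifically for multimodal INR tasks. 2. Analyzes how model architectures (classic MLP and emerging KAN), positional encoding, and nonlinear primitives affect the response to signal frequency. 3. Provides evaluation results for 56 Coordinate-MLP variants and 22 Coordinate-KAN models across 9 multimodal tasks.
Method: 1. Uses Neural Tangent Kernel (NTK) theory to analyze the frequency response of the models. 2. Designs the INR-Bench benchmark, covering multiple model variants (56 Coordinate-MLP and 22 Coordinate-KAN). 3. Evaluates model performance on 9 multimodal tasks spanning forward and inverse problems.
Result: INR-Bench provides a robust platform that highlights the strengths and limitations of different neural models and lays a foundation for future research.
Insight: 1. The NTK-based analysis shows that model architecture, positional encoding, and nonlinear primitives are key determinants of INR performance. 2. The experiments suggest that emerging KAN models may outperform classic MLP models on some tasks, especially when handling high-frequency signals.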
Abstract: Implicit Neural Representations (INRs) have gained success in various signal processing tasks due to their advantages of continuity and infinite resolution. However, the factors influencing their effectiveness and limitations remain underexplored. To better understand these factors, we leverage insights from Neural Tangent Kernel (NTK) theory to analyze how model architectures (classic MLP and emerging KAN), positional encoding, and nonlinear primitives affect the response to signals of varying frequencies. Building on this analysis, we introduce INR-Bench, the first comprehensive benchmark specifically designed for multimodal INR tasks. It includes 56 variants of Coordinate-MLP models (featuring 4 types of positional encoding and 14 activation functions) and 22 Coordinate-KAN models with distinct basis functions, evaluated across 9 implicit multimodal tasks. These tasks cover both forward and inverse problems, offering a robust platform to highlight the strengths and limitations of different neural models, thereby establishing a solid foundation for future research. The code and dataset are available at https://github.com/lif314/INR-Bench.
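The model counts in the abstract follow from a simple grid: 4 positional encodings times 14 activation functions gives the 56 Coordinate-MLP variants, alongside 22 Coordinate-KAN models. The sketch below only enumerates that grid. The labels `pe_0`, `act_0`, and `kan_basis_0` are placeholders, since the abstract does not name the individual encodings, activations, or basis functions.

```python
from itertools import product

# Placeholder labels: the abstract gives only the counts, not the names.
positional_encodings = [f"pe_{i}" for i in range(4)]
activations = [f"act_{i}" for i in range(14)]
kan_variants = [f"kan_basis_{i}" for i in range(22)]

# Every Coordinate-MLP variant pairs one positional encoding with one activation.
mlp_variants = list(product(positional_encodings, activations))

print(len(mlp_variants), len(kan_variants), len(mlp_variants) + len(kan_variants))
```

Under this reading the benchmark evaluates 78 model configurations in total on each of the 9 tasks.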
[303] ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios
Yuval Golbari,Navve Wasserman,Gal Vardi,Michal Irani
Main category: cs.LG
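TL;DR: ImpMIA is a membership inference attack that exploits the implicit bias of neural networks; it needs no reference models and performs well in more realistic privacy-attack scenarios.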
Details
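Motivation: Existing black-box membership inference attacks rely on unrealistic assumptions (such as known training hyperparameters and a known training data distribution), and their performance drops when those assumptions fail, so a more general method is needed. Contribution: Proposes ImpMIA, a white-box attack that uses the KKT optimality conditions and implicit bias theory to identify training samples directly from the model weights, removing the dependence on reference models and unrealistic assumptions.
Method: Building on maximum-margin implicit bias theory and the KKT conditions, identifies training samples by how well their gradients reconstruct the model parameters.
Result: In the realistic setting where only the model weights and a superset of the training data are available, ImpMIA outperforms existing black-box and white-box attacks.
Insight: Implicit bias theory can be used for privacy attacks, and white-box attacks become more practical as more models are released publicly.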
Abstract: Determining which data samples were used to train a model, known as Membership Inference Attack (MIA), is a well-studied and important problem with implications for data privacy. Black-box methods presume access only to the model’s outputs and often rely on training auxiliary reference models. While they have shown strong empirical performance, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. In this paper, we demonstrate that removing these assumptions leads to a significant drop in the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks and hence removes the need to rely on any reference models and their assumptions. ImpMIA is a white-box attack, a setting that assumes access to model weights and is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). Building on maximum-margin implicit bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples. This is done by finding the samples whose gradients most strongly reconstruct the trained model’s parameters. As a result, ImpMIA achieves state-of-the-art performance compared to both black-box and white-box attacks in realistic settings where only the model weights and a superset of the training data are available.
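The idea of finding samples whose gradients reconstruct the parameters can be sketched for a linear model, where the gradient of the margin y * (w . x) with respect to w is simply y * x. The code below is a toy under stated assumptions, not the paper's procedure: the data are made up, the function name `kkt_coefficients` is hypothetical, and the nonnegative coefficients are fitted with plain projected gradient descent. Candidates that receive a clearly positive coefficient are flagged as likely training members.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kkt_coefficients(w, grads, steps=2000, lr=0.1):
    """Projected gradient descent for min over lam >= 0 of ||w - sum_i lam_i * grads[i]||^2."""
    lam = [0.0] * len(grads)
    for _ in range(steps):
        recon = [sum(l * g[k] for l, g in zip(lam, grads)) for k in range(len(w))]
        err = [r - t for r, t in zip(recon, w)]
        # Gradient of the squared error w.r.t. lam_i is 2 * err . grads[i]; clip at zero.
        lam = [max(0.0, l - lr * 2.0 * dot(err, g)) for l, g in zip(lam, grads)]
    return lam


# Made-up candidate pool: samples 0 and 1 are "members", sample 2 is not.
candidates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
labels = [1.0, -1.0, 1.0]
grads = [[y * v for v in x] for x, y in zip(candidates, labels)]

# Weights built to satisfy the stationarity condition w = 1.0 * grads[0] + 0.5 * grads[1].
w = [0.5, -0.5, 0.0]

scores = kkt_coefficients(w, grads)
members = [i for i, s in enumerate(scores) if s > 1e-3]
print(scores, members)
```

For a deep network the per-sample gradients would come from automatic differentiation rather than the closed form used here, and the scoring rule and threshold are design choices this sketch does not attempt to reproduce.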
[304] Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer
Qiyi Tong,Olivia Nocentini,Marta Lagomarsino,Kuanqi Cai,Marta Lorenzini,Arash Ajoudani
Main category: cs.LG
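TL;DR: The paper proposes MLCM-KD (Multi-Level Cross-Modal Knowledge Distillation), a framework that, combined with Dual-Injected Knowledge Distillation (DIKD), transfers knowledge from RGB images to thermal images for lightweight facial landmark detection (FLD), significantly improving the accuracy and efficiency of thermal FLD.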
Details
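Motivation: Conventional thermal FLD methods are limited by the lack of rich visual cues, while existing RGB-to-thermal cross-modal methods are computationally expensive or introduce structural artifacts, so they cannot meet practical deployment needs. Contribution: 1. Proposes the MLCM-KD framework, which decouples high-fidelity RGB-to-thermal knowledge transfer from model compression; 2. Designs the bidirectional DIKD mechanism, which uses closed-loop supervision to ensure semantic alignment of modality-invariant features and achieve robust knowledge transfer.
Method: 1. Adopts a multi-level knowledge distillation strategy that extracts features at multiple levels from the RGB teacher model; 2. Uses the DIKD dual-injection mechanism to guide the thermal student toward modality-invariant features and to validate its representations.
Result: Sets a new SOTA on public thermal FLD benchmarks, markedly outperforming previous methods while greatly reducing computational overhead.
Insight: Bidirectional knowledge distillation (DIKD) with closed-loop supervision significantly narrows the modality gap between RGB and thermal data, which shows the importance of modality-invariant feature learning in cross-modal tasks.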
Abstract: Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student’s learned representations by feeding them back into the frozen teacher’s prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.
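The two directions of DIKD described above can be sketched as a two-term loss: one term pulls the student's features toward the teacher's, and the other feeds the student's features through the frozen teacher head and checks the prediction. This is a toy under assumptions, not the paper's loss: the function names, the fixed linear head, the weights `alpha` and `beta`, and the choice of landmark targets for the second term are all hypothetical.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_head(features):
    """Frozen teacher prediction head (a fixed linear map in this toy)."""
    weights = [[1.0, 0.0], [0.5, 0.5]]
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def dual_injected_loss(student_feat, teacher_feat, target, alpha=1.0, beta=1.0):
    """Feature guidance plus validation of student features through the frozen head."""
    guide = mse(student_feat, teacher_feat)            # teacher-to-student direction
    validate = mse(teacher_head(student_feat), target)  # student features into teacher head
    return alpha * guide + beta * validate


teacher_feat = [2.0, 4.0]
target = teacher_head(teacher_feat)  # [2.0, 3.0] in this toy

aligned = dual_injected_loss([2.0, 4.0], teacher_feat, target)
misaligned = dual_injected_loss([1.0, 4.0], teacher_feat, target)
print(aligned, misaligned)
```

The second term is what closes the loop: student features that drift from the teacher's feature space produce wrong predictions when passed through the frozen head, so they are penalized even if a separate student head could still decode them.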