Table of Contents

cs.CL

[1] Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support

Anandi Dutta,Shivani Mruthyunjaya,Jessica Saddington,Kazi Sifatul Islam

Main category: cs.CL

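TL;DR: This paper presents Mentalic Net, a conversational AI for mental health support based on retrieval-augmented generation (RAG), evaluated with a rigorous framework covering accuracy, empathy, trustworthiness, privacy, and bias.

Motivation: Large language models (LLMs) hold great promise but face challenges in sensitive domains such as mental health. The authors therefore built a mental health support chatbot that emphasizes safe and meaningful use.

Contribution: The main contributions are a RAG-based mental health support AI, a comprehensive evaluation framework, and a development strategy that combines human-in-the-loop oversight with responsible technology development.

Method: The approach integrates a RAG framework with prompt engineering and fine-tunes a pre-trained model on novel datasets.

Result: Mentalic Net achieves a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges.

Insight: Human-in-the-loop oversight and a long-term, responsible development strategy are essential for AI applications in sensitive domains.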

Abstract: The emergence of large language models (LLMs) has unlocked boundless possibilities, along with significant challenges. In response, we developed a mental health support chatbot designed to augment professional healthcare, with a strong emphasis on safe and meaningful application. Our approach involved rigorous evaluation, covering accuracy, empathy, trustworthiness, privacy, and bias. We employed a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets. The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies, recognizing both their potential to change lives and the risks they may pose if not carefully managed.

[2] Do MLLMs Really Understand the Charts?

Xiao Zhang,Dongyuan Li,Liuyu Xiang,Yao Zhang,Cheng Zhong,Zhaofeng He

Main category: cs.CL

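TL;DR: The paper questions whether multimodal large language models (MLLMs) truly understand charts, introduces the CRBench benchmark, and proposes ChartReasoner, which mimics human visual reasoning and substantially improves chart understanding.

Motivation: Current MLLMs show severe hallucinations and performance degradation in chart understanding, especially on non-annotated charts. The authors argue that MLLMs rely on recognition rather than reasoning, which calls for an approach closer to human visual reasoning.

Contribution: 1. A comprehensive chart reasoning benchmark, CRBench; 2. ChartReasoner, a method that mimics human behavior through visual reasoning and substantially improves performance.

Method: ChartReasoner incorporates visual reasoning into chart understanding and mimics human value estimation, reducing hallucinations and improving accuracy. It is validated on CRBench and on public benchmarks.

Result: ChartReasoner-3B/7B outperforms GPT-4o and Gemini-2.5-Flash on CRBench and shows general visual reasoning ability on public benchmarks.

Insight: MLLMs currently depend mainly on recognition rather than reasoning. Methods that incorporate human-like visual reasoning can markedly improve chart understanding while reducing hallucinations.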

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. Therefore, a question arises: Do MLLMs really understand the charts? Since a human is capable of understanding charts and estimating the values by visual reasoning, we first carefully establish a comprehensive Chart Reasoning Benchmark CRBench to rigorously evaluate the visual reasoning abilities of MLLMs on non-annotated charts. We argue that MLLMs are primarily relying on recognition rather than reasoning to interpret the charts. To steer MLLMs to reasonable chart understanding, we propose ChartReasoner that mimics human behavior by grounding their estimation in chart understanding. Extensive results on the proposed CRBench show that ChartReasoner-3B/7B achieves superior performance in chart reasoning, even compared to GPT-4o and Gemini-2.5-Flash. More importantly, ChartReasoner also demonstrates the visual reasoning abilities in general chart comprehension on public benchmarks, leading to significant performance gains and enabling MLLMs to rationally understand the charts. The code and dataset will be publicly available upon publication.

[3] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis

Shiqin Han,Manning Gao,Menghua Jiang,Yuncheng Jiang,Haifeng Hu,Sijie Mai

Main category: cs.CL

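TL;DR: This paper proposes an Uncertainty-Aware Collaborative System (U-ACS) that combines a multimodal large language model (MLLM) with a lightweight model for multimodal sentiment analysis, cutting computational cost while maintaining high performance.

Motivation: MLLMs perform strongly but are computationally expensive, whereas lightweight models are efficient but less accurate. U-ACS is proposed to reconcile this performance-efficiency trade-off.

Contribution: An uncertainty-aware collaborative system that allocates computation dynamically through an uncertainty-driven cascade and adds strategies for handling ambiguous or conflicting predictions.

Method: In the cascade, the small model handles low-uncertainty samples and escalates high-uncertainty samples to the MLLM. Weighted averaging combines predictions of similar polarity, and prompt-based cross-verification resolves conflicting predictions.

Result: State-of-the-art performance on benchmark datasets at a fraction of the computational cost of a standalone MLLM.

Insight: Allocating computation dynamically by sample difficulty balances performance and efficiency, which suits resource-constrained deployments.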

Abstract: The advent of Multimodal Large Language Models (MLLMs) has significantly advanced the state-of-the-art in multimodal machine learning, yet their substantial computational demands present a critical barrier to real-world deployment. Conversely, smaller, specialized models offer high efficiency but often at the cost of performance. To reconcile this performance-efficiency trade-off, we propose a novel Uncertainty-Aware Collaborative System (U-ACS) that synergistically orchestrates a powerful MLLM (e.g., HumanOmni) and a lightweight baseline model for multimodal sentiment analysis. The core of our system is an uncertainty-driven cascade mechanism, where the efficient small model first acts as a rapid filter for all input samples. Only those samples yielding high predictive uncertainty, thereby indicating greater difficulty, are selectively escalated to the MLLM for more sophisticated analysis. Furthermore, our system introduces advanced strategies to handle ambiguous or conflicting predictions, including weighted averaging for predictions of similar polarity and a prompt-based cross-verification to resolve conflicting predictions when both models exhibit high uncertainty. This sample-difficulty-aware approach allows for a dynamic allocation of computational resources, drastically reducing inference costs while retaining the high accuracy of MLLM. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.
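
The uncertainty-driven cascade described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes predictive entropy as the uncertainty measure and a fixed threshold, and `toy_small`, `toy_large`, and `threshold` are hypothetical names. The weighted averaging and prompt-based cross-verification steps are omitted.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_predict(sample, small_model, large_model, threshold=0.5):
    """Keep the small model's answer when it is confident; otherwise escalate.

    small_model returns class probabilities, large_model returns a label.
    The entropy threshold is an assumed hyperparameter.
    """
    probs = small_model(sample)
    if entropy(probs) <= threshold:
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, "small"
    return large_model(sample), "large"

# Hypothetical stand-ins for the lightweight model and the MLLM.
def toy_small(sample):
    return [0.9, 0.05, 0.05] if sample == "easy" else [0.4, 0.35, 0.25]

def toy_large(sample):
    return 2

easy = cascade_predict("easy", toy_small, toy_large)
hard = cascade_predict("hard", toy_small, toy_large)
print(easy, hard)
```

In this toy run the confident sample stays with the small model and the ambiguous one is escalated, so the expensive model is called only where the cheap one is unsure.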

[4] Benchmarking GPT-5 for biomedical natural language processing

Yu Hou,Zaifu Zhan,Rui Zhang

Main category: cs.CL

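TL;DR: The paper evaluates GPT-5 and GPT-4o on a standardized biomedical NLP benchmark under zero-, one-, and five-shot prompting across six task families. GPT-5 performs best overall and exceeds the supervised state of the art on MedQA, but extraction and summarization still need improvement.

Motivation: The rapid growth of biomedical literature calls for scalable NLP solutions. GPT-4 narrowed the gap with task-specific systems, especially in question answering, but its performance in other areas was uneven, so the next generation of models needs evaluation.

Contribution: An updated standardized BioNLP benchmark, a comprehensive evaluation of GPT-5 and GPT-4o, and practical guidance for biomedical NLP system design.

Method: Zero-, one-, and five-shot prompting is tested on 12 datasets using fixed prompt templates, identical decoding parameters, and batch inference, with prior results for GPT-4, GPT-3.5, and LLaMA-2-13B included for comparison.

Result: Under five-shot prompting, GPT-5 reaches a macro-average score of 0.557, versus 0.506 for GPT-4 and 0.508 for GPT-4o. It reaches 94.1% accuracy on MedQA, well above the previous supervised state of the art, but still trails domain-specific baselines on disease NER and summarization.

Insight: GPT-5 is a general-purpose model suitable for reasoning-oriented biomedical QA, while precision-critical extraction and summarization still favor fine-tuned or hybrid approaches. The benchmark shows where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely needed.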

Abstract: The rapid expansion of biomedical literature has heightened the need for scalable natural language processing (NLP) solutions. While GPT-4 substantially narrowed the gap with task-specific systems, especially in question answering, its performance across other domains remained uneven. We updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification. Using fixed prompt templates, identical decoding parameters, and batch inference, we report primary metrics per dataset and include prior results for GPT-4, GPT-3.5, and LLaMA-2-13B for comparison. GPT-5 achieved the strongest overall benchmark performance, with macro-average scores rising to 0.557 under five-shot prompting versus 0.506 for GPT-4 and 0.508 for GPT-4o. On MedQA, GPT-5 reached 94.1% accuracy, exceeding the previous supervised state of the art by over fifty points, and attained parity with supervised systems on PubMedQA (0.734). In extraction tasks, GPT-5 delivered major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), outperforming GPT-4 and GPT-4o, though summarization and disease NER still lagged behind domain-specific baselines. These results establish GPT-5 as a general-purpose model now offering deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches. The benchmark delineates where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely required, providing actionable guidance for BioNLP system design as frontier models advance.

[5] Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study

Xuan Yao,Qianteng Wang,Xinbo Liu,Ke-Wei Huang

Main category: cs.CL

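TL;DR: This study presents the first systematic evaluation of large language models (LLMs) on CFA exam questions, using zero-shot prompting and a retrieval-augmented generation (RAG) pipeline, and finds that reasoning-specialized models perform best.

Motivation: LLMs have great potential in finance, but systematic evaluation in professional settings is lacking, particularly for an analytically demanding setting such as the CFA exams.

Contribution: 1. A financial reasoning benchmark based on CFA exam questions; 2. A RAG pipeline that substantially improves model performance; 3. An analysis of the main causes of model failure.

Method: 1. Evaluate LLMs on 1,560 questions from mock exams across the three CFA levels; 2. Compare multi-modal, reasoning-specialized, and lightweight models; 3. Test under zero-shot prompting and with RAG.

Result: Reasoning-specialized models perform best in the zero-shot setting, and RAG yields substantial gains in complex scenarios. Knowledge gaps are the primary failure mode, while text readability has minimal impact.

Insight: LLM applications in finance need both reasoning ability and retrieval of domain knowledge. RAG matters most for complex tasks, and cost-performance optimization requires weighing model capability.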

Abstract: The rapid advancement of large language models presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA, among the most rigorous professional certifications globally, which mirror real-world financial analysis complexity. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.

[6] Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

David Berghaus,Armin Berger,Lars Hillebrand,Kostadin Cvejoski,Rafet Sifa

Main category: cs.CL

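TL;DR: The paper benchmarks eight multimodal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on invoice processing, comparing direct image processing with structured parsing, and finds that the former generally performs better.

Motivation: To examine how multimodal LLMs differ in invoice document processing and to inform the choice of models and strategies for automated document systems.

Contribution: A systematic evaluation of multimodal models on invoice processing and a comparison of direct image processing against structured parsing.

Method: Eight models are tested with zero-shot prompting on three openly available invoice datasets under both processing strategies.

Result: Direct image processing generally outperforms structured parsing, though performance varies with model type and document characteristics.

Insight: Native image processing works better on invoice documents, but model and strategy selection should account for the specific characteristics of the documents.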

Abstract: This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.

[7] COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions

Swarnadeep Bhar,Omar Naim,Eleni Metheniti,Bastien Navarri,Loïc Cabannes,Morteza Ezzabady,Nicholas Asher

Main category: cs.CL

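TL;DR: COCORELI is a hybrid agent framework that addresses the limitations of large language models (LLMs) in following complex instructions, minimizing hallucination, and spatial reasoning.

Motivation: Existing LLMs tend to hallucinate when following complex instructions and perform poorly on spatial reasoning tasks. COCORELI aims to improve these abilities through a cooperative, compositional approach.

Contribution: 1. A hybrid agent framework combining medium-sized LLMs with novel abstraction mechanisms; 2. A discourse module that parses instructions and dynamically learns high-level representations of the environment; 3. Validation on collaborative construction tasks and the ToolBench API completion task.

Method: 1. Medium-sized LLM agents; 2. Abstraction mechanisms and a discourse module for parsing instructions; 3. In-context learning of dynamic, high-level representations of the environment.

Result: On natural collaborative construction tasks, COCORELI outperforms single-LLM CoT and agentic LLM systems that use larger LLMs. It reduces hallucinations, identifies missing information, and updates its learned objects.

Insight: With a cooperative, compositional approach, medium-sized LLMs can complete complex tasks more efficiently, without the resource cost of relying on a single large LLM.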

Abstract: We present COCORELI, a hybrid agent framework designed to tackle the limitations of large language models (LLMs) in tasks requiring: following complex instructions, minimizing hallucination, and spatial reasoning. COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module to parse instructions to in-context learn dynamic, high-level representations of the environment. Experiments on natural collaborative construction tasks show that COCORELI outperforms single-LLM CoT and agentic LLM systems, all using larger LLMs. It manages to largely avoid hallucinations, identify missing information, ask for clarifications, and update its learned objects. COCORELI’s abstraction abilities extend beyond ENVIRONMENT, as shown in the ToolBench API completion task.

[8] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Alice Schiavone,Marco Fraccaro,Lea Marie Pehrson,Silvia Ingala,Rasmus Bonnevie,Michael Bachmann Nielsen,Vincent Beliveau,Melanie Ganz,Desmond Elliott

Main category: cs.CL

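TL;DR: MOSAIC is a multilingual, taxonomy-agnostic, and computationally efficient approach to radiological report classification. Built on the compact open-access model MedGemma-4B, it supports zero-/few-shot prompting and lightweight fine-tuning, approaches expert-level performance, and is suited to clinical settings.

Motivation: Radiology reports contain rich clinical information, but existing approaches have limitations: rule-based methods struggle with linguistic variability, supervised models require costly annotation, and large LLM-based systems are closed-source or resource-intensive. Most are also restricted to English and a single taxonomy. MOSAIC aims to address these limitations.

Contribution: 1. MOSAIC, a multilingual, taxonomy-agnostic approach to radiological report classification; 2. A design built on the compact open-access model MedGemma-4B for efficient deployment; 3. Evaluation on seven datasets in four languages (English, Spanish, French, and Danish) spanning multiple taxonomies, with performance approaching expert level.

Method: 1. Uses the compact open-access model MedGemma-4B; 2. Supports zero-/few-shot prompting and lightweight fine-tuning; 3. Uses data augmentation to improve low-data performance.

Result: 1. Mean macro F1 of 88 across five chest X-ray datasets; 2. Requires only 24 GB of GPU memory; 3. On Danish reports, 80 annotated samples with data augmentation reach a weighted F1 of 82 (86 with the full 1,600-sample training set).

Insight: 1. Compact open models have strong potential in medicine; 2. Multilingual and few-shot adaptability are key for clinical deployment; 3. Data augmentation can mitigate annotation scarcity.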

Abstract: Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
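
The abstract reports both a macro F1 and a weighted F1, and the two can diverge on imbalanced labels. The sketch below shows the difference on made-up single-label data; the labels and predictions are illustrative assumptions and are unrelated to the MOSAIC datasets.

```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), or 0.0 when undefined."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores[c] * y_true.count(c) for c in labels) / len(y_true)

# Illustrative, imbalanced toy data (hypothetical labels).
labels = ["normal", "finding"]
y_true = ["normal"] * 8 + ["finding"] * 2
y_pred = ["normal"] * 8 + ["normal", "finding"]

macro = macro_f1(y_true, y_pred, labels)
weighted = weighted_f1(y_true, y_pred, labels)
print(round(macro, 3), round(weighted, 3))
```

Because the rare class is the one the toy classifier gets wrong, the macro score comes out lower than the weighted score, which is why the two figures in the abstract are not directly comparable.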

[9] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

Kushan Mitra,Dan Zhang,Hannah Kim,Estevam Hruschka

Main category: cs.CL

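TL;DR: The paper introduces RECAP (REwriting Conversations for Agent Planning), a new benchmark for evaluating and improving intent rewriting, aimed at the challenge of intent understanding in open-domain dialogue systems.

Motivation: Real-world dialogues are often ambiguous, dynamic, and underspecified, which makes intent detection difficult. Traditional classification-based approaches generalize poorly in open-ended settings, leading to poor planning.

Contribution: The RECAP benchmark, an LLM-based evaluator, a prompt-based rewriting approach, and DPO-based rewriters that yield further gains.

Method: A prompt-based rewriting approach and fine-tuned DPO-based rewriters.

Result: The prompt-based approach outperforms baselines on intent rewriting, and the DPO-based rewriters further improve planning utility.

Insight: Intent rewriting is a critical and tractable way to improve agent planning in open-domain dialogue systems.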

Abstract: Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agent planning in open-domain dialogue systems.

[10] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Shengyin Sun,Yiming Li,Xing Li,Yingzhao Lian,Weizhe Lin,Hui-Ling Zhen,Zhiyuan Yang,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Chen Ma

Main category: cs.CL

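TL;DR: The paper introduces the first comprehensive benchmark for evaluating speculative decoding methods for accelerating LLM test-time scaling, and finds that simple n-gram-based methods effectively capture repetitive patterns, showing unique potential for speeding up inference.

Motivation: Test-time scaling improves LLM reasoning but is computationally inefficient because it generates redundant reasoning traces. Speculative decoding could mitigate this, but its effectiveness in the structured, repetition-rich setting of test-time scaling has been little studied.

Contribution: 1. The first benchmark for evaluating speculative decoding as an accelerator for test-time scaling; 2. Experimental evidence that n-gram-based methods capture repetitive patterns effectively; 3. A proposal to combine n-gram-based methods with other approaches to balance acceleration across repetitive and diverse reasoning.

Method: The benchmark covers three categories of speculative decoding (model-based, training-based, and n-gram-based) and runs experiments under representative test-time scaling paradigms such as Best-of-N sampling and multi-round thinking.

Result: n-gram-based methods capture repetitive patterns efficiently and show unique potential in test-time scaling. The results point to the value of combining them with model-based or training-based methods.

Insight: Repetitive reasoning paths in test-time scaling can be accelerated efficiently by simple methods such as n-grams, while diverse reasoning paths call for combination with more sophisticated methods.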

Abstract: Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
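
To illustrate why n-gram drafting suits repetitive traces, here is a minimal sketch of one common variant: copy the tokens that followed the most recent earlier occurrence of the current suffix, then keep the longest prefix the target model agrees with. The function names, the toy target model, and the greedy sequential verification are assumptions for illustration; real systems verify all draft tokens in a single forward pass, and the benchmark's specific methods are not described in the abstract.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last n tokens. Returns [] if none is found."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards, skipping the position of the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []

def verify(context, draft, target_next_token):
    """Greedy verification: accept draft tokens while they match the target
    model's own next token; on the first mismatch, emit the target's token."""
    accepted = []
    for tok in draft:
        expected = target_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted

# Hypothetical target model that repeats the cycle a, b, c, d.
CYCLE = ["a", "b", "c", "d"]

def cyclic_target(context):
    return CYCLE[len(context) % len(CYCLE)]

context = ["a", "b", "c", "d", "a", "b"]
draft = ngram_draft(context)
accepted = verify(context, draft, cyclic_target)
print(draft, accepted)
```

On this fully repetitive toy sequence every drafted token is accepted, so four tokens are produced for one verification step; on non-repetitive text the draft would be empty or rejected early, which is consistent with the abstract's point about combining n-gram methods with others.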

[11] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

Hao Wen,Yifan Su,Feifei Zhang,Yunxin Liu,Yunhao Liu,Ya-Qin Zhang,Yuanchun Li

Main category: cs.CL

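TL;DR: ParaThinker proposes a new parallel thinking paradigm that generates diverse reasoning paths in parallel at inference time and synthesizes them into a final answer, substantially improving LLM reasoning and avoiding the bottleneck of conventional sequential scaling.

Motivation: Current LLMs improve reasoning by scaling compute sequentially (generating long chains of thought), but gains diminish as compute increases. The authors attribute this to "Tunnel Vision" and argue that it is a flaw of the scaling strategy rather than a limit of model capability.

Contribution: The ParaThinker framework, which introduces native thought parallelism to LLMs: it generates multiple reasoning paths in parallel and synthesizes them, substantially improving reasoning with negligible latency overhead.

Method: ParaThinker trains the model to generate diverse reasoning paths in parallel and to synthesize them into a final answer, thereby sidestepping Tunnel Vision.

Result: On reasoning benchmarks, ParaThinker substantially outperforms sequential LLMs (average accuracy gains of 12.3% for the 1.5B model and 7.5% for the 7B model with 8 parallel paths) while adding only 7.1% latency overhead.

Insight: Scaling in parallel (width) is more effective and efficient than scaling sequentially (depth), and is a key direction for scaling LLM reasoning.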

Abstract: Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model’s capability but a flaw in the scaling strategy itself, a phenomenon we term “Tunnel Vision”, where a model’s imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model’s latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.

[12] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Ryo Takahashi,Naoki Saito,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CL

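TL;DR: The paper proposes a discrete prompt tuning method that recursively uses a black-box multimodal large language model (MLLM) for personalized visual emotion recognition (VER), addressing the limitations of MLLMs on personalized tasks.

Motivation: MLLMs perform well on general VER but favor majority viewpoints and familiar patterns, which limits their performance on personalized tasks in practice. A method is needed that adapts to individual differences in emotion.

Contribution: A discrete prompt tuning method that updates the prompt by recursively selecting the best natural language representation, adapting a black-box MLLM efficiently and improving personalized VER.

Method: 1. Generate multiple candidate natural language prompts through discrete prompt tuning. 2. Recursively select the best prompt and use it for the next update, refining recognition step by step. 3. Rely on the black-box MLLM's capabilities, avoiding direct fine-tuning of the model.

Result: The method substantially improves the accuracy of personalized VER over applying the MLLM directly.

Insight: Prompt engineering can exploit a black-box MLLM effectively without large-scale changes to the model itself, offering an efficient and scalable solution for personalized tasks.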

Abstract: Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans’ prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.

[13] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Ravi Shankar,Sheng Wong,Lin Li,Magdalena Bachmann,Alex Silverthorne,Beth Albert,Gabriel Davis Jones

Main category: cs.CL

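TL;DR: The paper proposes an energy-based model (EBM) for reliable abstention in retrieval-augmented generation (RAG) systems for healthcare, with the strongest gains on semantically hard near-distribution queries.

Motivation: In safety-critical domains such as healthcare, incorrect answers from RAG systems can cause serious harm, so a reliable abstention mechanism is needed.

Contribution: An EBM that learns an energy landscape over a dense semantic corpus and substantially improves abstention on semantically hard cases.

Method: The EBM learns an energy landscape over the semantic corpus and is compared against a calibrated softmax baseline and a kNN density heuristic.

Result: On semantically hard near-distribution queries, the EBM reaches AUROC 0.961 versus 0.950 for softmax and reduces FPR@95 (0.235 vs 0.331).

Insight: Energy scoring is a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.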

Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women’s health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM’s advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
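
For readers unfamiliar with the two reported metrics, the sketch below computes AUROC and FPR@95 from raw scores. It assumes that higher scores mean "safe to answer" (for an EBM this could be negative energy) and that FPR@95 denotes the false positive rate at the threshold accepting 95% of answerable queries, which is the usual reading; the scores themselves are made up.

```python
import math

def auroc(pos, neg):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_tpr(pos, neg, tpr=0.95):
    """False positive rate at the highest threshold that still accepts
    at least `tpr` of the positives."""
    ranked = sorted(pos, reverse=True)
    k = math.ceil(tpr * len(ranked))  # positives that must be accepted
    threshold = ranked[k - 1]
    return sum(q >= threshold for q in neg) / len(neg)

# Hypothetical confidence scores: answerable queries vs. should-abstain queries.
pos = [5.0, 4.5, 4.0, 3.8, 3.5, 3.2, 3.0, 2.8, 2.5, 2.0]
neg = [3.1, 2.2, 1.5, 1.0, 0.5]

print(auroc(pos, neg), fpr_at_tpr(pos, neg))
```

A lower FPR@95 means fewer should-abstain queries slip through when the system is tuned to answer nearly all legitimate ones, which is the quantity most relevant to the safety argument in the abstract.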

[14] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR

Xinnian Zhao,Hugo Van Hamme

Main category: cs.CL

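TL;DR: This paper proposes a prompt-based weakly supervised learning method that uses TV subtitles as context prompts rather than as direct supervision signals to improve automatic speech recognition (ASR) transcription accuracy. Through iterative refinement and a weighted attention mechanism, it produces high-quality pseudo-labeled datasets.

Motivation: TV subtitles are readily available but imprecisely aligned with the audio, so they cannot serve directly as supervised targets for verbatim transcription. The paper aims to exploit the contextual information in subtitles through a more effective weakly supervised training method.

Contribution: 1. A prompt-based weakly supervised training framework that treats TV subtitles as context prompts; 2. A weighted attention mechanism that strengthens the model's focus on relevant subtitle tokens; 3. High-quality pseudo-labeled datasets for training robust ASR systems.

Method: 1. Recast TV subtitles as context prompts rather than direct supervision signals; 2. Iteratively generate pseudo transcripts as training targets, refining the model step by step; 3. Apply a weighted attention mechanism during inference to emphasize relevant subtitle tokens.

Result: Experiments show significant improvements in transcription accuracy, and the resulting pseudo-labeled datasets provide a high-quality basis for training ASR systems.

Insight: The contextual information in TV subtitles works well as an auxiliary cue in weakly supervised training rather than as a direct supervision signal. Iterative refinement and attention weighting progressively improve transcript quality.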

Abstract: This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Although TV subtitles are readily available, their imprecise alignment with corresponding audio limits their applicability as supervised targets for verbatim transcription. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. This design enables the model to handle discrepancies between spoken audio and subtitle text. Instead, generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement. To further enhance the process, we introduce a weighted attention mechanism that emphasizes relevant subtitle tokens during inference. Our experiments demonstrate significant improvements in transcription accuracy, highlighting the effectiveness of the proposed method in refining transcripts. These enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems.

[15] Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

Charles Moslonka,Hicham Randrianarivo,Arthur Garnier,Emmanuel Malherbe

Main category: cs.CL

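TL;DR: The paper proposes a hallucination detection method based on Entropy Production Rate (EPR) and supervised learning. It works on a single generation from a black-box LLM and needs only a few token log-probabilities.

Motivation: Hallucinations in black-box LLM outputs for question answering seriously undermine reliability. Conventional methods need multiple queries or extensive data access, which is hard to obtain through black-box APIs.

Contribution: 1. An EPR metric derived from token log-probabilities; 2. A supervised model that uses top-token features from a single generation to improve detection; 3. Evidence of the method's efficiency and practicality in black-box API settings.

Method: 1. Extract token log-probabilities produced during non-greedy decoding; 2. Compute EPR as a baseline uncertainty indicator; 3. Augment EPR with a supervised model that needs only top-token features from a single generation.

Result: Across multiple QA datasets and LLMs, the method significantly outperforms EPR alone and achieves high performance with only a small set of log-probabilities (e.g., top <10 per token).

Insight: Hallucinations in black-box LLMs can be detected effectively from token-level entropy dynamics, without multiple queries or extensive data, which suits API-constrained deployments.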

Abstract: Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) metric that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves hallucination detection over using EPR alone. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top <10 per token), confirming its practical efficiency and suitability for these API-constrained deployments. This work provides a readily deployable technique to enhance the trustworthiness of LLM responses from a single generation pass in QA and Retrieval-Augmented Generation (RAG) systems, with its utility further demonstrated in a finance framework analyzing responses to queries on annual reports from an industrial dataset.
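
The abstract does not give the EPR formula, so the following is only a stand-in: it averages per-token entropy over the top-k log-probabilities that an API exposes, after renormalizing them, and also returns the per-rank entropic contributions that a learned detector could take as features. All function names and the example log-probabilities are assumptions.

```python
import math

def rank_contributions(top_logprobs):
    """Per-rank terms -p_i * log(p_i) after renormalizing the exposed
    top-k probabilities so that they sum to 1."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return [-(p / total) * math.log(p / total) for p in probs if p > 0]

def token_entropy(top_logprobs):
    """Entropy (nats) of the truncated, renormalized next-token distribution."""
    return sum(rank_contributions(top_logprobs))

def mean_token_entropy(sequence):
    """Average per-token entropy over one generated sequence.
    `sequence` is a list with one top-k log-probability list per token."""
    return sum(token_entropy(t) for t in sequence) / len(sequence)

# Hypothetical top-3 log-probabilities for two three-token generations.
confident = [[math.log(0.97), math.log(0.02), math.log(0.01)]] * 3
uncertain = [[math.log(0.40), math.log(0.35), math.log(0.25)]] * 3

print(mean_token_entropy(confident), mean_token_entropy(uncertain))
```

The flatter distributions yield a higher average entropy, which is the kind of single-pass uncertainty signal the paper builds on; the actual EPR definition and the supervised model should be taken from the paper itself.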

[16] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

Krithi Shailya,Akhilesh Kumar Mishra,Gokul S Krishnan,Balaraman Ravindran

Main category: cs.CL

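TL;DR: The paper empirically examines geographic, demographic, and economic biases in academic recommendations from open-source LLMs (LLaMA-3.1-8B, Gemma-7B, and Mistral-7B), reveals a preference for Global North institutions and reinforcement of gender stereotypes, and proposes a multi-dimensional evaluation framework.

Motivation: To assess the societal biases that large language models (LLMs) may propagate in education recommendations, with the goal of fair and globally accessible higher-education recommendations.

Contribution: (1) An empirical analysis of bias in three open-source LLMs; (2) a multi-dimensional evaluation framework; (3) evidence of systemic bias and of the need for mitigation.

Method: (1) Generate over 25,000 recommendations from 360 simulated user profiles; (2) analyze bias along geographic, demographic, and economic dimensions; (3) quantify bias with a multi-dimensional framework.

Result: The LLMs show strong biases: they favor Global North institutions, reinforce gender stereotypes, and repeat institutions frequently. LLaMA-3.1 achieves the highest diversity but remains biased.

Insight: Systemic bias in education recommendations is a real concern. Fairness considerations need to be built into model training and deployment to promote equitable global access to education.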

Abstract: Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.

[17] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Rushi Wang,Jiateng Liu,Cheng Qian,Yifan Shen,Yanzhou Pan,Zhaozhuo Xu,Ahmed Abbasi,Heng Ji,Denghui Zhang

Main category: cs.CL

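TL;DR: This paper studies how mixed and inappropriate external context affects LLM outputs and proposes RW-Steering, a method based on the Rescorla-Wagner model that uses two-stage fine-tuning to teach the model to identify and ignore inappropriate content, improving response quality and safety.

Details Motivation: Real-world external context often mixes relevant and inappropriate content, which makes LLM outputs unreliable. The paper investigates how LLMs handle mixed context and proposes a remedy.

Contribution: 1) The Poisoned Context Testbed; 2) an adaptation of the Rescorla-Wagner model to quantify the influence of mixed context; 3) RW-Steering, a fine-tuning method that makes LLMs more robust to inappropriate content.

Method: 1) Build mixed-context data with the Poisoned Context Testbed; 2) quantify LLM behavior with the adapted Rescorla-Wagner model; 3) apply RW-Steering, a two-stage fine-tuning procedure that teaches the model to ignore inappropriate signals.

Result: RW-Steering improves response quality by 39.8% and reverses the undesirable behavior curve LLMs show under mixed context.

Insight: LLMs tend to absorb content that is less prevalent in the context, so a small amount of inappropriate content can substantially degrade output quality; RW-Steering offers a general solution that does not require extensive supervision.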

Abstract: Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
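
For readers unfamiliar with the Rescorla-Wagner model, the sketch below implements the classic textbook update rule, in which every cue present on a trial shares one prediction error: ΔV_i = α_i · β · (λ − ΣV). It is not the paper's adapted model, whose exact form is not given in the abstract; the cue names and parameter values are hypothetical.

```python
def rescorla_wagner_step(strengths, present, alpha, beta, lam):
    """One trial of the classic Rescorla-Wagner update.

    strengths: dict mapping cue -> associative strength V
    present:   cues present on this trial
    alpha:     dict mapping cue -> salience in [0, 1]
    beta:      learning rate tied to the outcome, in [0, 1]
    lam:       maximum associative strength the outcome supports
    """
    present = list(present)
    total = sum(strengths[c] for c in present)  # combined prediction
    error = lam - total                         # prediction error shared by all cues
    updated = dict(strengths)
    for cue in present:
        updated[cue] = strengths[cue] + alpha[cue] * beta * error
    return updated


# Hypothetical cues standing in for two competing context signals.
v = {"relevant": 0.0, "inappropriate": 0.0}
alpha = {"relevant": 0.3, "inappropriate": 0.3}
for _ in range(50):
    v = rescorla_wagner_step(v, ["relevant", "inappropriate"], alpha, beta=1.0, lam=1.0)
print(v)
```

Because the cues compete for the same prediction error, their strengths sum toward λ rather than each reaching λ on its own, which is the kind of competition between signals the abstract refers to.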

[18] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Rohit Patel

Main category: cs.CL

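TL;DR: This paper gives a systematic exposition of key algorithms for model training with reinforcement learning (SFT, REINFORCE, TRPO, PPO, GRPO, DPO, and others), using simplified, explicit descriptions to remove ambiguity and build intuition, and closes by proposing GRAPE as a direction for future research.

Details Motivation: Existing explanations of RL algorithms usually assume prior knowledge, omit details, or are overly complex. The paper aims to give readers, LLM researchers in particular, a clear understanding through simplified and explicit descriptions.

Contribution: 1. A systematic, simplified summary of key algorithms (SFT, REINFORCE, and others); 2. GRAPE (Generalized Relative Advantage Policy Evolution), proposed as a future research direction.

Method: 1. Develop each algorithm step by step in simplified, explicit notation, avoiding superfluous abstraction; 2. connect the concepts to concrete LLM settings.

Result: Readers get a clear, intuitive understanding of the algorithms with reduced cognitive overhead.

Insight: 1. Simplified descriptions help remove ambiguity; 2. GRAPE points to a possible research direction that combines relative advantage with policy evolution.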

Abstract: This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
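
Of the algorithms listed, GRPO is the one whose name describes its core step: each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization as it is commonly stated, (reward − group mean) / group standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the paper's own notation is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for one prompt.

    eps avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt, scored 1 (correct) or 0 (incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses above the group mean get positive advantages and those below get negative ones, so no separately learned value function is needed to form the baseline.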

[19] VaccineRAG: Boosting Multimodal Large Language Models’ Immunity to Harmful RAG Samples

Qixin Sun,Ziqin Wang,Hengyuan Zhao,Yilin Li,Kaiyou Song,Linjiang Huang,Xiaolin Hu,Qingpei Guo,Si Liu

Main category: cs.CL

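TL;DR: VaccineRAG improves multimodal LLMs' resistance to harmful retrieved samples by introducing a Chain-of-Thought-based dataset and an improved training method, Partial-GRPO.

Details Motivation: The effectiveness of existing retrieval-augmented generation (RAG) methods is often limited by irrelevant or misleading retrieved samples; addressing this is needed to improve model performance.

Contribution: 1) The VaccineRAG dataset, for evaluating and improving models' immunity to harmful samples; 2) Partial-GRPO, which improves the model's ability to learn complex Chain-of-Thought content.

Method: 1) Build a dataset with varying positive/negative sample ratios; 2) use Chain-of-Thought analysis to improve sample discrimination; 3) apply Partial-GRPO, which models the output as multiple components to handle complex sequences.

Result: Experiments validate VaccineRAG and Partial-GRPO, showing markedly better resistance to harmful samples.

Insight: Systematic sample design and component-wise modeling of outputs can improve LLM robustness and accuracy in complex retrieval tasks.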

Abstract: Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs’ performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG employs a benchmark to evaluate models using data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models’ sample-discrimination capabilities by prompting LLMs to generate explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to enhance the model’s ability to learn long-sequence complex CoT content, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, our model can make more informed preference selections for complex sequences, thereby enhancing its capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme. The code and dataset will be publicly released soon.

[20] Behavioral Fingerprinting of Large Language Models

Zehua Pei,Hui-Ling Zhen,Ying Zhang,Zhiyuan Yang,Xing Li,Xianzhi Yu,Mingxuan Yuan,Bei Yu

Main category: cs.CL

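TL;DR: The paper proposes "Behavioral Fingerprinting," a framework that captures the varied behavioral characteristics of LLMs beyond traditional performance evaluation, revealing marked differences in interactive behavior and their link to alignment strategies.

Details Motivation: Current LLM evaluation focuses on performance metrics and overlooks behavioral diversity. The paper uses the Behavioral Fingerprinting framework to analyze models' cognitive and interactive styles more fully.

Contribution: The Behavioral Fingerprinting framework, a Diagnostic Prompt Suite, and an automated pipeline in which a powerful LLM acts as an impartial judge to assess behavioral differences across models.

Method: Analyze 18 models across capability tiers with the Diagnostic Prompt Suite and the automated evaluation pipeline, separating core capabilities from alignment-related behaviors.

Result: Core capabilities such as abstract and causal reasoning are converging among top models, while alignment-related behaviors such as sycophancy and semantic robustness vary widely; models also cluster around ISTJ/ESTJ default personas.

Insight: A model's interactive behavior is determined not by its scale or reasoning power but directly by developer alignment strategies, which suggests that behavioral diversity can be achieved through alignment interventions.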

Abstract: Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model’s intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model’s interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting

[21] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach

Nithyashree Sivasubramaniam

Main category: cs.CL

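TL;DR: The paper proposes a dual-stage approach combining a transformer with a large language model (LLM) to improve automatic speech recognition (ASR) for silent speech interfaces (SSIs), substantially reducing word error rate (WER).

Details Motivation: SSIs have advanced in speech generation, but recognition and post-processing of the synthesized speech still suffer from phonetic ambiguity and noise. The paper addresses this with an enhanced ASR framework.

Contribution: 1. A dual-stage ASR framework combining a transformer and an LLM; 2. a substantial recognition improvement for SSIs: a 16% relative and 6% absolute WER reduction.

Method: 1. A transformer-based acoustic model captures full utterance context; 2. an LLM post-processes the output to ensure linguistic consistency.

Result: Against a 36% WER baseline, the method achieves a 16% relative and 6% absolute reduction.

Insight: Combining the transformer's context modeling with the LLM's language processing improves SSI recognition, particularly under noise and ambiguity.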

Abstract: Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
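
The two reduction figures describe the same change in different units: a 6-point absolute drop from a 36% baseline gives a 30% WER, and 6/36 is roughly 16.7% relative, which the abstract reports as 16%. The sketch below shows the standard word-level edit-distance definition of WER and that arithmetic; the function names are illustrative, and the paper's evaluation tooling is not specified in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, new):
    """Fraction of the baseline error that was removed."""
    return (baseline - new) / baseline


# 36% baseline WER minus 6 absolute points gives 30%.
print(relative_reduction(36.0, 30.0))
```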

[22] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations

Martha O. Dimgba,Sharon Oba,Ameeta Agrawal,Philippe J. Giabbanelli

Main category: cs.CL

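TL;DR: Using model explanations (the BAME method), the paper reduces gender and ethnicity bias in AI-generated stories, improving demographic representation without modifying model parameters.

Details Motivation: Language models propagate social bias in their output, particularly in the representation of gender and ethnicity, which motivates methods to mitigate it.

Contribution: BAME, which uses model-generated explanations to guide prompt engineering and improves the representation of demographic groups in AI-generated stories.

Method: Analyze bias patterns in generated stories and use model explanations to adjust prompt design. The evaluation covers 25 occupational groups and multiple language models.

Result: BAME improves demographic representation by 2% to 20%, showing its effectiveness in making generative AI fairer.

Insight: A model's own explanations can be used to reduce bias, which highlights the role of transparency in building fair AI systems.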

Abstract: Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.

[23] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Sophie Jaffer,Simeon Sayer

Main category: cs.CL

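TL;DR: The study compares a BERT model trained and tested entirely on Swahili data with an English-trained model given Swahili inputs translated into English. The natively trained Swahili model performs better, with a much lower error rate than the translate-then-English pipeline.

Details Motivation: To examine how equitably multilingual LLMs perform in non-English languages and to show how English-dominated training data may disadvantage non-English speakers.

Contribution: Experimental evidence that, even with high-quality translation, a natively trained model outperforms feeding translated input to an English model, underscoring the importance of native-language training.

Method: Use two monolingual BERT models, one trained on Swahili data and one on English data; translate Swahili news data into English for the English model, and compare against the native Swahili model processing the original text.

Result: The native Swahili model has an error rate of 0.36%, well below the 1.47% of the translated-input English model, indicating that translation does not fully bridge representational differences between languages.

Insight: Native-language training matters for model performance; multilingual evaluation and dataset development need more attention so that AI deployment does not widen existing digital divides.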

Abstract: As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results prove that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47% respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, suggesting that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on addressing broader dataset development for underrepresented languages and renewed attention to multilingual model evaluation, ensuring the reinforcing effect of global AI deployment on existing digital divides is reduced.

[24] Analysis of Voluntarily Reported Data Post Mesh Implantation for Detecting Public Emotion and Identifying Concern Reports

Indu Bala,Lewis Mitchell,Marianne H Gillam

Main category: cs.CL

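TL;DR: The study uses natural language processing to analyze emotions and sentiment polarity in patient reports, finds increases in "Concern Reports" and emotional intensity in specific periods, and shows the value of sentiment analysis for improving medical practice.

Details Motivation: To identify postoperative problems by analyzing patient emotion and sentiment polarity, and thereby improve preoperative counselling, postoperative care, and the patient experience.

Contribution: Uses NLP to identify emotion patterns and their change over time, and introduces the "Concern Report" category, offering a new perspective for medical practice.

Method: Apply the NRC Emotion Lexicon and TextBlob to classify patient reports into eight emotions and assess sentiment polarity.

Result: Concern Reports and emotional intensity rose during 2011-2012 and 2017-2018, indicating shifts in patient experience.

Insight: Sentiment analysis can serve as a tool in medical practice, helping clinicians understand patient needs and improve care.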

Abstract: Mesh implants are widely utilized in hernia repair surgeries, but postoperative complications present a significant concern. This study analyzes patient reports from the Manufacturer and User Facility Device Experience (MAUDE) database spanning 2000 to 2021 to investigate the emotional aspects of patients following mesh implantation using Natural Language Processing (NLP). Employing the National Research Council Canada (NRC) Emotion Lexicon and TextBlob for sentiment analysis, the research categorizes patient narratives into eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and assesses sentiment polarity. The goal is to discern patterns in patient sentiment over time and to identify reports signaling urgent concerns, referred to as “Concern Reports,” thereby understanding shifts in patient experiences in relation to changes in medical device regulation and technological advancements in healthcare. The study detected an increase in Concern Reports and higher emotional intensity during the periods of 2011-2012 and 2017-2018. Through temporal analysis of Concern Reports and overall sentiment, this research provides valuable insights for healthcare practitioners, enhancing their understanding of patient experiences post-surgery, which is critical for improving preoperative counselling, postoperative care, and preparing patients for mesh implant surgeries. The study underscores the importance of emotional considerations in medical practices and the potential for sentiment analysis to inform and enhance patient care.
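
Lexicon-based emotion tagging of the kind described amounts to looking up each word of a narrative in a word-to-emotion table and tallying the matches. The sketch below illustrates the idea with a tiny made-up lexicon; the entries are hypothetical and are not taken from the NRC Emotion Lexicon, and the study's criteria for flagging a Concern Report are not reproduced because the abstract does not state them.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT real NRC Emotion Lexicon entries.
TOY_LEXICON = {
    "pain": {"fear", "sadness"},
    "infection": {"fear", "disgust"},
    "angry": {"anger"},
    "relief": {"joy", "trust"},
    "hope": {"anticipation", "joy"},
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count how many word-level emotion associations appear in a narrative."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    return counts


print(emotion_counts("Severe pain and infection after surgery"))
```

Aggregating such counts by report year would give the kind of temporal emotion profile the abstract describes.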

[25] Advancing SLM Tool-Use Capability using Reinforcement Learning

Dhruvi Paprunia,Vansh Kharidia,Pankti Doshi

Main category: cs.CL

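TL;DR: The paper uses reinforcement learning, specifically GRPO, to improve the tool-use ability of small language models (SLMs), addressing their weaker performance under limited resources.

Details Motivation: LLMs are strong at tool use but resource-hungry; SLMs, trained on more limited data, are weaker at it. An efficient way to make SLMs more useful is needed.

Contribution: Applies reinforcement learning (GRPO) to SLM tool use, improving accuracy and adaptability while avoiding the high compute cost of conventional fine-tuning.

Method: Train SLMs with Group Relative Policy Optimization (GRPO) to improve their performance on tool-use tasks.

Result: GRPO significantly improves SLM tool-use accuracy, making SLMs more practical in real applications.

Insight: Reinforcement learning offers an efficient, scalable path to better SLM tool use in resource-constrained settings.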

Abstract: Large Language Models (LLMs) have progressed beyond simple text creation, and tool use has become increasingly important for complex, real-world tasks. Tool use in LLMs refers to their ability to utilize external resources such as APIs, databases, or software functions to extend their functionality beyond generating text. Tools are used for tasks such as performing calculations, making API calls to retrieve the current time and date, and more. This capability enables models to fetch real-time data, execute commands, or solve problems requiring dynamic interaction, making it indispensable for applications like AI agents in virtual assistants, robotic control, or automated workflows. However, while LLMs are usually adept at tool use, their vast resource requirements and computational complexity restrict their use in many use cases. As a result, there is an increasing need for more compact and efficient Small Language Models (SLMs). SLMs struggle with tool use compared to LLMs, as seen in Table 1. SLMs are typically trained on smaller, more specific datasets, resulting in a narrower knowledge base and limited contextual understanding compared to LLMs. This research addresses these challenges by using Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs. Unlike conventional fine-tuning approaches that require heavy computation and often lack adaptability, our method provides an efficient, effective solution that significantly boosts SLM tool-use accuracy, increasing their practical utility.

[26] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn’s Disease

Zvi Badash,Hadas Ben-Atya,Naama Gavrielov,Liam Hazan,Gili Focht,Ruth Cytter-Kuint,Talar Hagopian,Dan Turner,Moti Freiman

Main category: cs.CL

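TL;DR: HSMP-BERT is a prompt-based BERT model for extracting structured data from Hebrew radiology reports, targeting the sparsely represented multi-organ findings of Crohn’s disease. It substantially outperforms the baselines and shows promise for low-resource languages.

Details Motivation: Extracting structured clinical information from radiology reports is hard, especially in low-resource languages such as Hebrew and for Crohn’s disease, whose multi-organ findings are sparsely represented.

Contribution: HSMP-BERT, which markedly improves structured extraction accuracy (F1 0.83±0.08) and is validated in a low-resource setting.

Method: Combine Hierarchical Structured Matching Prediction (HSMP) with BERT; hierarchical inference cuts runtime 5.1×, and the data are divided with a multilabel-stratified split.

Result: On 24 organ-finding combinations, HSMP-BERT reaches F1 0.83±0.08, well above the zero-shot baseline (F1 0.49±0.07) and standard fine-tuning (F1 0.30±0.27).

Insight: Beyond more efficient extraction, the method surfaces clinical associations in Crohn’s disease (for example, between ileal wall thickening and stenosis) and shows AI's potential in low-resource settings.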

Abstract: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is pronounced in Crohn’s disease, with sparsely represented multi-organ findings. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn’s patients imaged 2010-2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. The data were divided with a multilabel-stratified split (66% train+validation; 33% test) that preserves label prevalence. Performance was evaluated with accuracy, F1, Cohen’s κ, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with >15 positives, HSMP-BERT achieved mean F1 0.83±0.08 and κ 0.65±0.17, outperforming the SMP zero-shot baseline (F1 0.49±0.07, κ 0.06±0.07) and standard fine-tuning (F1 0.30±0.27, κ 0.27±0.34; paired t-test p < 10^-7). Hierarchical inference cuts runtime 5.1× vs. traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn’s disease and demonstrating AI’s potential in low-resource settings.
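
The abstract reports Cohen's κ alongside F1. κ corrects raw agreement for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e), which is why a baseline can have a moderate F1 yet a κ near zero. The sketch below is a plain implementation of that standard formula; the function name and the toy labels are illustrative and unrelated to the study's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labelings match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over categories.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Toy example: model predictions vs. annotator labels for one finding.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```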

[27] Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

David Anderson,Galia Benitez,Margret Bjarnadottir,Shriyan Reyya

Main category: cs.CL

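TL;DR: The study uses GPT, a large language model, to analyze over 200,000 Spanish-language newspaper articles about violence, reconstructing Colombia's historical memory and supporting policy analysis.

Details Motivation: Colombia long lacked systematic documentation of violence; the study uses LLMs to help fill this historical gap.

Contribution: 1. An LLM-built dataset on violence; 2. a demonstration of LLMs' potential for historical research and policy analysis.

Method: Use GPT to read over 200,000 Spanish newspaper articles and answer questions about them, producing a dataset used for descriptive analysis and a policy study.

Result: Part of the historical record is reconstructed, and the relationship between violence and coca crop eradication is analyzed.

Insight: LLMs provide a new tool for large-scale text analysis in support of historical and policy research.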

Abstract: Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia’s historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.

[28] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation

Zaifu Zhan,Shuang Zhou,Min Zeng,Kai Yu,Meijia Song,Xiaoyi Chen,Jun Wang,Yu Hou,Rui Zhang

Main category: cs.CL

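TL;DR: The paper systematically evaluates the effect of quantization on 12 large language models across biomedical NLP tasks. Quantization cuts GPU memory requirements by up to 75% while preserving performance, supporting local deployment of large models where resources are limited.

Details Motivation: LLMs hold great promise for biomedical NLP, but their compute demands and data privacy constraints hinder local deployment in healthcare settings.

Contribution: Shows that quantization enables efficient local deployment of LLMs, balancing performance against resource requirements.

Method: Evaluate quantized versions of 12 models on 8 benchmark datasets covering 4 key tasks: named entity recognition, relation extraction, multi-label classification, and question answering.

Result: Quantization reduces GPU memory requirements by up to 75%, allowing 70B-parameter models to run on 40GB consumer-grade GPUs with minimal performance loss.

Insight: Quantization is an effective way to connect advances in AI with real clinical use, especially where data privacy and limited resources are constraints.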

Abstract: Large language models have demonstrated remarkable capabilities in biomedical natural language processing, yet their rapid growth in size and computational requirements present a major barrier to adoption in healthcare settings where data privacy precludes cloud deployment and resources are limited. In this study, we systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks: named entity recognition, relation extraction, multi-label classification, and question answering. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks, enabling the deployment of 70B-parameter models on 40GB consumer-grade GPUs. In addition, domain-specific knowledge and responsiveness to advanced prompting methods are largely maintained. These findings provide significant practical and guiding value, highlighting quantization as a practical and effective strategy for enabling the secure, local deployment of large yet high-capacity language models in biomedical contexts, bridging the gap between technical advances in AI and real-world clinical translation.
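
The 75% figure and the 40GB claim are consistent with simple weight-storage arithmetic. The sketch below works this through under two assumptions that the abstract does not state: that the unquantized weights are stored in 16 bits and the quantized ones in 4 bits, and that only weight memory is counted (activations, KV cache, and quantization metadata are ignored).

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


params = 70e9                             # a 70B-parameter model
fp16_gb = weight_memory_gb(params, 16)    # assumed 16-bit baseline
int4_gb = weight_memory_gb(params, 4)     # assumed 4-bit quantization
reduction = 1 - int4_gb / fp16_gb

print(fp16_gb, int4_gb, reduction)
```

Under these assumptions the weights shrink from about 140 GB to about 35 GB, a 75% reduction that leaves only a small margin on a 40GB device for everything else.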

[29] Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Faruk Alpay,Taylan Alpay

Main category: cs.CL

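TL;DR: The paper studies fine-grained control of transformer-based language models through interventions at three levels (prompts, activations, and weights) and proposes a unified framework.

Details Motivation: Transformer models excel at NLP tasks, but fine-grained control remains difficult. The study aims to improve controllability and robustness through principled interventions.

Contribution: A unified framework covering prompt-level steering, activation interventions, and weight-space edits; a theoretical result that minimal weight updates can achieve targeted behavior changes; and experiments showing over 90% success in sentiment control and factual edits.

Method: Formalize controllable text generation as an optimization problem addressed via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning; analyze robustness and safety implications.

Result: The approach achieves over 90% success in sentiment control and factual edits while preserving base performance, though trade-offs between generalization and specificity remain.

Insight: The work flags the ethical dual-use risks of model control and the need for rigorous evaluation, and lays groundwork for controllable, robust language models.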

Abstract: Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and weights. We formalize controllable text generation as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning. We introduce a unified framework encompassing prompt-level steering, activation interventions, and weight-space edits. We analyze robustness and safety implications, including adversarial attacks and alignment mitigations. Theoretically, we show minimal weight updates can achieve targeted behavior changes with limited side-effects. Empirically, we demonstrate >90% success in sentiment control and factual edits while preserving base performance, though generalization-specificity trade-offs exist. We discuss ethical dual-use risks and the need for rigorous evaluation. This work lays groundwork for designing controllable and robust language models.

[30] Spoken in Jest, Detected in Earnest: A Systematic Review of Sarcasm Recognition – Multimodal Fusion, Challenges, and Future Prospects

Xiyuan Gao,Shekhar Nayak,Matt Coler

Main category: cs.CL

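TL;DR: This systematic review is the first to focus on speech-based sarcasm recognition. It traces the evolution from unimodal to multimodal approaches, covers datasets, feature extraction, and classification methods, and stresses the importance of cross-cultural and multilingual sarcasm recognition.

Details Motivation: Sarcasm is common in human communication but poses challenges in interpersonal and human-machine interaction. Prior work has focused on text-based detection and underexplored the role of speech. The review argues for using speech data in automatic sarcasm recognition to improve social interaction and machine understanding of complex language.

Contribution: The first systematic survey of progress in speech-based sarcasm recognition; it charts the shift from unimodal to multimodal methods, identifies limitations in datasets and cross-cultural research, and proposes future directions.

Method: The review covers feature extraction from traditional acoustic features to deep learning-based representations, and classification from unimodal approaches to multimodal fusion.

Result: The review finds limitations in current datasets for sarcasm recognition in speech, documents the evolution of feature extraction and classification methods, and emphasizes cross-cultural and multilingual research.

Insight: Sarcasm should be treated as a multimodal phenomenon rather than a purely text-based problem; cross-cultural and multilingual research is essential for better recognition.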

Abstract: Sarcasm, a common feature of human communication, poses challenges in interpersonal interactions and human-machine interactions. Linguistic research has highlighted the importance of prosodic cues, such as variations in pitch, speaking rate, and intonation, in conveying sarcastic intent. Although previous work has focused on text-based sarcasm detection, the role of speech data in recognizing sarcasm has been underexplored. Recent advancements in speech technology emphasize the growing importance of leveraging speech data for automatic sarcasm recognition, which can enhance social interactions for individuals with neurodegenerative conditions and improve machine understanding of complex human language use, leading to more nuanced interactions. This systematic review is the first to focus on speech-based sarcasm recognition, charting the evolution from unimodal to multimodal approaches. It covers datasets, feature extraction, and classification methods, and aims to bridge gaps across diverse research domains. The findings include limitations in datasets for sarcasm recognition in speech, the evolution of feature extraction techniques from traditional acoustic features to deep learning-based representations, and the progression of classification methods from unimodal approaches to multimodal fusion techniques. In so doing, we identify the need for greater emphasis on cross-cultural and multilingual sarcasm recognition, as well as the importance of addressing sarcasm as a multimodal phenomenon, rather than a text-based challenge.

[31] Sample-efficient Integration of New Modalities into Large Language Models

Osman Batur İnce,André F. T. Martins,Oisin Mac Aodha,Edoardo M. Ponti

Main category: cs.CL

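TL;DR: The paper proposes SEMI, a sample-efficient method for integrating new modalities into large language models; a hypernetwork and a shared projector greatly reduce the paired data required.

Details Motivation: Multimodal foundation models cannot cover the entire, ever-evolving space of modalities, and existing methods for integrating a new modality need large amounts of paired data, which makes them unsuitable for low-resource modalities.

Contribution: SEMI, which uses a hypernetwork and a shared projector for sample-efficient modality integration with far less data.

Method: Train a hypernetwork on high-resource modalities so that it generates an adapter from a few samples of a new modality; use isometric transformations to increase the diversity of training modalities.

Result: SEMI substantially improves sample efficiency when integrating new modalities; for example, training the projector from scratch needs 64× more data to match 32-shot SEMI.

Insight: A hypernetwork with a shared projector can efficiently extend the modality coverage of foundation models, which suits low-resource settings.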

Abstract: Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is unfeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector (placed between modality-specific encoders and an LLM) to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64× more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
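
An isometric transformation changes the coordinates of an embedding space while preserving all pairwise distances, so a transformed encoder looks like a different encoder with the same geometry. The sketch below uses a Householder reflection as one simple example of an isometry; the abstract does not say which isometries the authors use, so this choice and the function name are assumptions.

```python
import math


def householder_reflect(x, v):
    """Reflect vector x across the hyperplane orthogonal to v (an isometry)."""
    vv = sum(c * c for c in v)
    if vv == 0:
        raise ValueError("v must be non-zero")
    scale = 2 * sum(a * b for a, b in zip(x, v)) / vv
    return [a - scale * b for a, b in zip(x, v)]


# Two hypothetical embeddings and an arbitrary reflection direction.
x, y, v = [1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [1.0, 1.0, 0.0]
before = math.dist(x, y)
after = math.dist(householder_reflect(x, v), householder_reflect(y, v))
print(before, after)  # the two distances agree up to floating-point error
```

Because distances are unchanged, any task solvable from the original embeddings remains solvable from the transformed ones, while the projector still has to adapt to a new coordinate system.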

[32] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Aisha Alansari,Hamzah Luqman

Main category: cs.CL

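TL;DR: The paper presents AraHalluEval, the first fine-grained hallucination evaluation framework for Arabic and multilingual LLMs, focused on generative question answering and summarization, and shows that factual hallucinations are widespread in these tasks.

Details Motivation: Research on LLM hallucination has concentrated on English, with little evaluation in the Arabic context. Given Arabic's wide use and its importance in global communication, closing this gap is pressing.

Contribution: 1. AraHalluEval, the first fine-grained hallucination evaluation framework for Arabic and multilingual LLMs; 2. an evaluation of 12 models on generative question answering and summarization, including Arabic pre-trained, multilingual, and reasoning-based models; 3. 12 fine-grained hallucination indicators for assessing the factual consistency and faithfulness of outputs.

Method: An evaluation framework with 12 fine-grained hallucination indicators quantifies hallucination in Arabic and multilingual LLMs on generative question answering and summarization. Experiments cover 12 models and compare them to gauge how widespread hallucination is.

Result: Factual hallucinations are more prevalent than faithfulness errors across all models and tasks. The Arabic pre-trained model Allam shows lower hallucination rates and performance comparable to reasoning-based models.

Insight: The Arabic pre-trained model Allam hallucinates less than multilingual models, which suggests that language-specific training may improve factual consistency and faithfulness.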

Abstract: Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval

[33] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

Yushi Sun,Kai Sun,Yifan Ethan Xu,Xiao Yang,Xin Luna Dong,Nan Tang,Lei Chen

Main category: cs.CL

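TL;DR: KERAG proposes a knowledge-graph-based retrieval-augmented generation method (KG-RAG) that retrieves a broader subgraph and combines it with LLM chain-of-thought reasoning, improving both the coverage and the quality of question answering.

Details Motivation: Traditional knowledge graph question answering (KGQA) methods rely on semantic parsing, which retrieves only a narrow slice of knowledge and often suffers from low coverage because of rigid schema requirements and semantic ambiguity. KERAG aims to address this with more flexible retrieval and augmented generation.

Contribution: 1. Proposes a KG-RAG pipeline that widens the retrieval scope to improve coverage; 2. Combines a retrieval-filtering-summarization approach with LLM chain-of-thought reasoning to reduce noise and improve answer quality.

Method: KERAG works in three steps: 1. retrieve a broad subgraph; 2. filter out irrelevant information; 3. summarize the subgraph and generate the answer using LLM chain-of-thought reasoning.

Result: Experiments show that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.

Insight: Widening the scope of knowledge retrieval and pairing it with LLM reasoning can improve question answering performance, including on complex questions.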

Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation and thus often suffers from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
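
The retrieve, filter, and summarize steps described for KERAG can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy triples, the function names, and the keyword-overlap filter, which stands in for whatever learned filtering the authors use. The final step only renders text that an LLM prompt could consume; no model is called.

```python
# Toy knowledge graph as (subject, relation, object) triples; illustrative data only.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "release_year", "2010"),
    ("Christopher Nolan", "born_in", "London"),
    ("Interstellar", "directed_by", "Christopher Nolan"),
]


def retrieve_subgraph(entity, hops=2):
    """Step 1: collect every triple within `hops` of the topic entity."""
    frontier, seen, subgraph = {entity}, {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for s, r, o in TRIPLES:
            if (s in frontier or o in frontier) and (s, r, o) not in subgraph:
                subgraph.append((s, r, o))
                next_frontier.update({s, o} - seen)
        seen |= next_frontier
        frontier = next_frontier
    return subgraph


def filter_triples(subgraph, question):
    """Step 2: keep triples whose relation words overlap the question (hypothetical filter)."""
    words = set(question.lower().replace("?", "").split())
    return [t for t in subgraph if words & set(t[1].split("_"))]


def summarize(triples):
    """Step 3: render the surviving triples as text for an LLM prompt."""
    return "; ".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)


if __name__ == "__main__":
    graph = retrieve_subgraph("Inception", hops=2)
    print(summarize(filter_triples(graph, "Who directed Inception?")))
```

Retrieving two hops deliberately pulls in more than the question needs, which is why a filtering step sits between retrieval and generation.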

[34] Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization

Lee Kezar,Zed Sehyr,Jesse Thomason

Main category: cs.CL

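TL;DR: This paper studies how two phonological inductive biases, parameter disentanglement and phonological semi-supervision, improve isolated sign recognition and the reconstruction quality of unseen signs with a vector-quantized autoencoder.

Details Motivation: Sign language datasets are often not representative in vocabulary, so models must generalize to unseen signs. Vector quantization can learn discrete, token-like representations, but whether the learned units capture spurious correlations that hinder out-of-vocabulary performance has not been evaluated.

Contribution: Proposes two phonological inductive biases and shows experimentally that they improve the reconstruction and identification of unseen signs, providing a linguistically motivated quantitative analysis of how to learn generalizable sign language representations.

Method: Combines parameter disentanglement (an architectural bias) with phonological semi-supervision (a regularization technique) and uses a vector-quantized autoencoder to learn discrete representations of signs.

Result: Compared with a controlled baseline, the proposed model's representations are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification.

Insight: Explicit, linguistically motivated inductive biases can improve the generalization of learned sign language representations.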

Abstract: Sign language datasets are often not representative in terms of vocabulary, underscoring the need for models that generalize to unseen signs. Vector quantization is a promising approach for learning discrete, token-like representations, but it has not been evaluated whether the learned units capture spurious correlations that hinder out-of-vocabulary performance. This work investigates two phonological inductive biases: Parameter Disentanglement, an architectural bias, and Phonological Semi-Supervision, a regularization technique, to improve isolated sign recognition of known signs and reconstruction quality of unseen signs with a vector-quantized autoencoder. The primary finding is that the learned representations from the proposed model are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification compared to a controlled baseline. This work provides a quantitative analysis of how explicit, linguistically-motivated biases can improve the generalization of learned representations of sign language.
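
To make the vector-quantization step concrete, here is a minimal sketch of the codebook lookup at the heart of any vector-quantized autoencoder. It is not the authors' architecture. The separate codebook per phonological parameter is only a loose analogue of parameter disentanglement, and the parameter names `handshape` and `location`, the codebook values, and the function names are all assumptions.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def encode_sign(streams, codebooks):
    """Quantize each parameter stream with its own codebook, yielding discrete token ids."""
    return {
        name: [quantize(v, codebooks[name])[0] for v in vectors]
        for name, vectors in streams.items()
    }


if __name__ == "__main__":
    # Hypothetical 2-D codebooks and encoder outputs.
    codebooks = {"handshape": [[0, 0], [1, 1]], "location": [[0, 1], [1, 0]]}
    streams = {"handshape": [[0.1, 0.2], [0.9, 0.8]], "location": [[0.9, 0.1]]}
    print(encode_sign(streams, codebooks))
```

An unseen sign can still be encoded, because it is expressed as a new combination of existing codebook entries rather than as a new whole-sign label.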

[35] A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Cheng Peng,Xinyu Dong,Mengxian Lyu,Daniel Paredes,Yaoyun Zhang,Yonghui Wu

Main category: cs.CL

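TL;DR: This paper studies large language models (LLMs) for patient information extraction, focusing on how model architecture, fine-tuning strategy, and multi-task instruction tuning affect performance, with experiments on several clinical datasets.

Details Motivation: LLMs have advanced rapidly on clinical natural language processing (NLP) tasks, but how best to use them for patient information extraction still needs exploration. The paper aims to fill this gap and to identify methods that make clinical information extraction more robust and generalizable.

Contribution: 1) Compares encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama) on clinical concept and relation extraction; 2) Compares prompt-based parameter-efficient fine-tuning (PEFT) with traditional full-size fine-tuning; 3) Explores a multi-task instruction tuning framework to improve zero-shot and few-shot learning performance.

Method: 1) Benchmarks LLMs with different architectures across five datasets; 2) Compares traditional fine-tuning with PEFT; 3) Evaluates generalization through multi-task instruction tuning across four datasets with a leave-one-dataset-out strategy.

Result: According to the summary, generative LLMs outperform encoder models on some tasks, PEFT is efficient when resources are limited, and multi-task instruction tuning further improves zero-shot and few-shot performance.

Insight: The paper points to the potential of generative LLMs for clinical information extraction and to the role of multi-task learning and few-shot tuning in generalization, offering practical guidance for future work.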

Abstract: Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs’ effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.
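
The leave-one-dataset-out strategy mentioned in the abstract can be sketched in a few lines: each dataset is held out once as the unseen evaluation set while the model is instruction-tuned on the rest. The dataset names below are placeholders, not the datasets used in the paper.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out, training_sets) so that each dataset serves once as the unseen test set."""
    for held_out in datasets:
        yield held_out, [name for name in datasets if name != held_out]


if __name__ == "__main__":
    # Placeholder names; the paper's four datasets are not listed in this digest.
    for test_set, train_sets in leave_one_dataset_out(["A", "B", "C", "D"]):
        print(f"tune on {train_sets}, evaluate zero-/few-shot on {test_set}")
```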

[36] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

Zucheng Liang,Wenxin Wei,Kaijie Zhang,Hongyi Chen

Main category: cs.CL

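TL;DR: Building on the MQUAKE framework, this paper proposes a multi-hop question decomposition method to improve how large language models (LLMs) answer complex questions. Experiments with LLAMA3 show that multi-hop decomposition outperforms answering the complex question directly, both before and after training.

Details Motivation: Answering complex questions accurately remains a major challenge for LLMs. The paper aims to improve comprehension and reasoning through multi-hop question decomposition.

Contribution: 1) Proposes a multi-hop question decomposition method based on the MQUAKE framework; 2) Validates the method with LLAMA3 before and after training; 3) Shows that multi-hop decomposition gives a clear advantage on complex questions.

Method: Partitions and converts the MQUAKE-T dataset into a single-hop format and a multi-hop format, fine-tunes LLAMA3 with LoRA, and compares multi-hop decomposition against direct answering.

Result: Multi-hop decomposition outperforms direct answering both before and after training. Fine-tuning improves both approaches, but multi-hop decomposition keeps its advantage.

Insight: Multi-hop question decomposition improves LLM comprehension and reasoning on complex questions, and the gain holds both without training and after fine-tuning.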

Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM’s ability to answer complex questions.
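
As an illustration of multi-hop decomposition, the sketch below resolves a complex question as a chain of single-hop lookups, feeding each intermediate answer into the next hop. The `lookup` callable stands in for an LLM answering one sub-question. The fact table and all names are hypothetical and are not taken from MQUAKE-T.

```python
# Hypothetical single-hop facts: (entity, relation) -> answer.
FACTS = {
    ("Hamlet", "author"): "William Shakespeare",
    ("William Shakespeare", "country of citizenship"): "England",
}


def answer_multi_hop(entity, relations, lookup):
    """Answer a complex question hop by hop; returns the final answer and the reasoning trace."""
    trace = []
    for relation in relations:
        entity = lookup(entity, relation)  # one single-hop sub-question
        trace.append((relation, entity))
    return entity, trace


if __name__ == "__main__":
    # "What is the country of citizenship of the author of Hamlet?"
    final, steps = answer_multi_hop(
        "Hamlet", ["author", "country of citizenship"], lambda e, r: FACTS[(e, r)]
    )
    print(final, steps)
```

The direct-answering baseline would instead pose the whole question in one call, with no intermediate entity made explicit.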

[37] Enhancing Diversity in Large Language Models via Determinantal Point Processes

Yilei Chen,Souradip Chakraborty,Lorenz Wolf,Ioannis Ch. Paschalidis,Aldo Pacchiano

Main category: cs.CL

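TL;DR: This paper proposes DQO, a training method that uses determinantal point processes (DPPs) to jointly optimize large language models (LLMs) for quality and semantic diversity. Experiments on several tasks show substantially higher semantic diversity without sacrificing model quality.

Details Motivation: Supervised fine-tuning and reinforcement learning improve LLM performance on downstream tasks but often reduce output diversity, leading to narrow, canonical responses. Existing remedies either operate only at inference time or focus only on lexical differences.

Contribution: Proposes the DPP-based DQO training method, which measures diversity with the determinant of a kernel similarity matrix over embedded responses and jointly optimizes quality and diversity during training.

Method: 1. Sample and embed a group of responses for each prompt; 2. Build a kernel-based similarity matrix over the response embeddings; 3. Use its determinant to measure diversity as the volume spanned by the embeddings; 4. Jointly optimize this diversity measure together with model quality.

Result: Across instruction-following, summarization, story generation, and reasoning tasks, DQO substantially improves semantic diversity while maintaining model quality.

Insight: Using DPPs to optimize semantic diversity directly at training time offers a new route to richer LLM outputs.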

Abstract: Supervised fine-tuning and reinforcement learning are two popular methods for post-training large language models (LLMs). While improving the model’s performance on downstream tasks, they often reduce the model’s output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
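
The determinant-based diversity measure can be shown with a small standard-library sketch. The RBF kernel, the `gamma` value, and the function names are assumptions, since the digest does not say which kernel or embedding model DQO uses. The point is only that near-duplicate embeddings give a determinant near 0, while well-separated embeddings give a determinant near 1.

```python
import math


def rbf_kernel_matrix(embeddings, gamma=1.0):
    """K[i][j] = exp(-gamma * ||e_i - e_j||^2); the RBF kernel is an assumed choice."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [[math.exp(-gamma * sq_dist(a, b)) for b in embeddings] for a in embeddings]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, det = len(m), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular matrix: the responses span no volume
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def diversity_score(embeddings, gamma=1.0):
    """DPP-style diversity: determinant of the kernel similarity matrix."""
    return determinant(rbf_kernel_matrix(embeddings, gamma))


if __name__ == "__main__":
    print(diversity_score([[0, 0], [0.01, 0], [0, 0.01]]))  # near-duplicates
    print(diversity_score([[0, 0], [3, 0], [0, 3]]))        # well separated
```

In a training objective, a score like this (typically its logarithm) would be combined with a quality reward. That combination is not shown here.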

[38] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

Gunmay Handa,Zekun Wu,Adriano Koshiyama,Philip Treleaven

Main category: cs.CL

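TL;DR: This paper systematically studies personality manipulation in large language models (LLMs) using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). It proposes a unified evaluation framework and a stability analysis that offer guidance for deployment.

Details Motivation: As LLMs are increasingly used in customer service and agentic scenarios, demand is growing for research on personality manipulation mechanisms and their trade-offs, yet the effects and limits of existing methods remain unclear.

Contribution: 1) Builds a contrastive dataset with balanced high/low trait responses; 2) Proposes a unified evaluation framework based on Δ analysis; 3) Develops trait purification techniques; 4) Proposes a three-level stability framework that quantifies robustness.

Method: Compares ICL, PEFT, and MS, and introduces trait purification and a three-level stability analysis framework.

Result: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control. Openness is uniquely challenging, and agreeableness is the trait most resistant to ICL.

Insight: Personality manipulation can serve as a multi-level probe into behavioral representation, and mechanistic steering (MS) is a lightweight alternative to fine-tuning for both deployment and interpretability.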

Abstract: Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.
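
The abstract says the contrastive high/low dataset enables steering vector computation. One common way to do that, assumed here rather than confirmed as the authors' exact procedure, is to take the difference of mean activations between high-trait and low-trait responses and add a scaled copy to a hidden state at inference time. The activations below are made-up two-dimensional vectors.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(high_acts, low_acts):
    """Difference of mean activations: high-trait minus low-trait (assumed recipe)."""
    return [h - l for h, l in zip(mean_vector(high_acts), mean_vector(low_acts))]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; alpha sets strength and sign."""
    return [x + alpha * v for x, v in zip(hidden, vector)]


if __name__ == "__main__":
    high = [[1.0, 2.0], [3.0, 2.0]]  # hypothetical activations for high-trait responses
    low = [[0.0, 1.0], [2.0, 1.0]]   # hypothetical activations for low-trait responses
    v = steering_vector(high, low)
    print(v, steer([0.0, 0.0], v, alpha=2.0))
```

A negative `alpha` pushes toward the low-trait end. No weights change, which is why the abstract can describe this family of methods as lightweight runtime control.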

[39] PRIM: Towards Practical In-Image Multilingual Machine Translation

Yanzhi Tian,Zeming Liu,Zhengyang Liu,Chong Feng,Xin Li,Heyan Huang,Yuhang Guo

Main category: cs.CL

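TL;DR: This paper introduces the task of Practical In-Image Multilingual Machine Translation (IIMMT) for real-world scenarios and annotates the PRIM dataset. Its end-to-end model, VisTrans, addresses complex backgrounds, varied fonts, and multilingual translation.

Details Motivation: Existing In-Image Machine Translation (IIMT) research relies mainly on synthetic data, which differs markedly from real-world conditions. The IIMMT task is proposed to close this gap.

Contribution: 1) Annotates the PRIM dataset of real-world one-line text images; 2) Proposes the end-to-end model VisTrans, which processes visual text and background information separately, supports multilingual translation, and improves visual quality.

Method: VisTrans processes the visual text and the background information in an image separately, preserving multilingual translation capability while improving visual quality.

Result: Experiments indicate that VisTrans achieves better translation quality and visual effect than other models.

Insight: For real-world in-image translation, separating visual text from background information is key, and models also need to support multiple translation directions flexibly.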

Abstract: In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data, with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM; it processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effect than other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.

[40] Using LLMs for Multilingual Clinical Entity Linking to ICD-10

Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva

Main category: cs.CL

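TL;DR: This paper proposes an automated system that links clinical terms to ICD-10 codes in multiple languages using large language models (LLMs) and a multistage pipeline.

Details Motivation: Clinical entity linking is a key step in extracting structured clinical information, but manual coding is time-consuming and error-prone. Automatically assigning ICD-10 codes to multilingual clinical text can simplify the work of healthcare professionals and ensure consistent coding.

Contribution: Proposes a multistage pipeline that combines clinical dictionaries with GPT-4.1 in-context learning to link multilingual clinical terms to ICD-10 codes.

Method: Clinical dictionaries first match the unambiguous terms; for terms that do not match the dictionary, GPT-4.1 predicts the ICD-10 code through in-context learning.

Result: Promising results on Spanish and Greek benchmarks: 0.89 F1 for categories and 0.78 F1 for subcategories on CodiEsp (Spanish), and 0.85 F1 on ElCardioCC (Greek).

Insight: LLMs show strong potential for clinical entity linking, especially in multilingual settings, and combining dictionary matching with in-context learning can improve accuracy and efficiency.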

Abstract: The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
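
The two-stage pipeline described above (dictionary match first, LLM fallback second) can be sketched as follows. The dictionary contents, the function names, and the `llm_predict` callable are all hypothetical. The callable is a stub standing in for an in-context-learning request to a model such as GPT-4.1, and no real API is called.

```python
def link_terms(terms, dictionary, llm_predict):
    """Stage 1: accept a dictionary hit only when it is unambiguous. Stage 2: defer the rest to the LLM."""
    links = {}
    for term in terms:
        codes = dictionary.get(term.strip().lower(), [])
        if len(codes) == 1:
            links[term] = (codes[0], "dictionary")
        else:
            links[term] = (llm_predict(term), "llm")
    return links


if __name__ == "__main__":
    # Tiny illustrative dictionary: I10 is essential hypertension; "diabetes" is
    # ambiguous between E10 (type 1) and E11 (type 2), so it goes to stage 2.
    dictionary = {"essential hypertension": ["I10"], "diabetes": ["E10", "E11"]}
    stub_llm = lambda term: "E11"  # placeholder prediction, not a real model call
    print(link_terms(["Essential hypertension", "diabetes"], dictionary, stub_llm))
```

Recording the source of each code (`"dictionary"` or `"llm"`) makes it easy to audit which links came from the model.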

[41] PLaMo 2 Technical Report

Preferred Networks: Kaizaburo Chubachi,Yasuhiro Fujita,Shinichi Hemmi,Yuta Hirokawa,Toshiki Kataoka,Goro Kobayashi,Kenichi Maehashi,Calvin Metzger,Hiroaki Mikami,Shogo Murai,Daisuke Nishino,Kento Nozawa,Shintarou Okada,Daisuke Okanohara,Shunta Saito,Shotaro Sano,Shuji Suzuki,Daisuke Tanaka,Avinash Ummadisingu,Hanqin Wang,Sixue Wang,Tianqi Xu

Main category: cs.CL

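TL;DR: PLaMo 2 is a series of Japanese-focused large language models with a hybrid architecture. Through efficient pruning and training optimizations, it performs strongly on Japanese tasks.

Details Motivation: To overcome the scarcity of Japanese data and improve computational efficiency while raising model performance on Japanese tasks.

Contribution: 1. A hybrid Samba-based architecture that supports 32K-token contexts; 2. An efficient pruning methodology that yields an 8B model with performance comparable to the team's previous 100B model; 3. Synthetic data and an optimized training pipeline that improve performance on Japanese tasks.

Method: 1. Hybrid Samba-based architecture transitioned to full attention via continual pre-training; 2. Pre-training on synthetic corpora; 3. Weight reuse and structured pruning; 4. Post-training with SFT and DPO; 5. Inference optimization with vLLM and quantization.

Result: Achieves state-of-the-art results on Japanese benchmarks, outperforming similarly sized open models.

Insight: Synthetic data and pruning are central to the performance of the smaller models, and an optimized training pipeline can markedly improve performance on language-specific tasks.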

Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

[42] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen,Wei Sun,Qixiang Yin,Lingxing Kong,Zhixing Tan,Jiajun Zhang

Main category: cs.CL

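TL;DR: ACE-RL is a reinforcement learning framework with an adaptive constraint-enhanced reward. It addresses data scarcity and coarse-grained quality optimization in long-form generation and significantly outperforms existing baselines.

Details Motivation: To address two main limitations of current LLMs in long-form generation: the scarcity of high-quality training data and the coarse granularity of quality optimization.

Contribution: Proposes the ACE-RL framework, which uses adaptive constraint criteria and a reward mechanism to turn subjective quality evaluation into constraint verification, improving long-form generation.

Method: 1. Automatically deconstruct each instruction into fine-grained constraint criteria; 2. Design a reward mechanism based on constraint satisfaction; 3. Optimize the generation model with reinforcement learning.

Result: On WritingBench, ACE-RL outperforms existing SFT and RL baselines by 20.70% and 7.32% respectively, and the top-performing model surpasses GPT-4o by 7.10%.

Insight: Adaptive constraints and a constraint-based reward allow LLMs to be trained more effectively to produce high-quality long-form text across diverse scenarios.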

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address this issue, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
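
The idea of converting subjective quality evaluation into constraint verification can be sketched as a reward equal to the fraction of constraints a response satisfies. This scoring rule and the simple Python predicates are assumptions for illustration. The digest does not specify how ACE-RL verifies constraints or aggregates them into a reward.

```python
def constraint_reward(response, constraints):
    """Reward in [0, 1]: the fraction of constraint checks that the response passes."""
    if not constraints:
        return 0.0
    passed = sum(1 for check in constraints.values() if check(response))
    return passed / len(constraints)


if __name__ == "__main__":
    # Hypothetical constraints derived from one instruction.
    constraints = {
        "mentions the budget": lambda r: "budget" in r.lower(),
        "at least five words": lambda r: len(r.split()) >= 5,
        "ends with a period": lambda r: r.strip().endswith("."),
    }
    print(constraint_reward("The budget is fine", constraints))
    print(constraint_reward("The budget covers all planned travel.", constraints))
```

A fine-grained reward of this kind gives the policy partial credit for each satisfied requirement instead of a single holistic score.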

[43] Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts

Julius Neumann,Robert Lange,Yuni Susanti,Michael Färber

Main category: cs.CL

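TL;DR: This paper studies how to optimize small Transformer models (BERT and RoBERTa) for multi-label sentiment classification of short texts, examining continued domain-specific pre-training, data augmentation, and classification head architecture. Data augmentation helps, continued pre-training on augmented data can introduce noise, and changes to the classification head bring only marginal benefits.

Details Motivation: Sentiment classification of short texts faces class imbalance, limited training samples, and subjective labels, all made worse by the limited context in short texts. The paper aims to optimize small Transformer models for this task.

Contribution: Evaluates the effect of domain-specific pre-training, data augmentation, and classification head modifications on model performance, and offers practical guidance for optimizing BERT-based models in resource-constrained settings.

Method: Examines three factors: (1) continued domain-specific pre-training, (2) generative data augmentation, and (3) architectural variations of the classification head. Experiments use short-text datasets.

Result: Data augmentation improves classification performance, continued pre-training on augmented datasets can introduce noise rather than boost accuracy, and classification head modifications yield only marginal benefits.

Insight: For short-text sentiment analysis, data augmentation is an effective optimization, but continued pre-training can backfire, and the classification head offers little room for improvement.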

Abstract: Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels – issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluated three key factors influencing model performance: (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples, specifically generative data augmentation, and (3) architectural variations of the classification head. Our experiment results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.

[44] Masked Diffusion Language Models with Frequency-Informed Training

Despoina Kosmopoulou,Efthymios Georgiou,Vaggelis Dorovatas,Georgios Paraskevopoulos,Alexandros Potamianos

Main category: cs.CL

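TL;DR: This paper presents a masked diffusion language modeling framework for data-efficient training in the BabyLM 2025 Challenge. With frequency-informed masking and several noise scheduling strategies, it is competitive under limited data and shows the potential of diffusion models for language learning.

Details Motivation: To study how to train language models efficiently under the data limits of the BabyLM challenge and to explore whether diffusion models suit this setting.

Contribution: 1. Proposes a diffusion language modeling framework with frequency-informed masking that prioritizes learning from rare tokens; 2. Studies multiple noise scheduling strategies and noise weighting schemes; 3. Validates the method on the BabyLM benchmark.

Method: 1. Apply diffusion training objectives to language modeling; 2. Introduce frequency-informed masking that prioritizes rare tokens; 3. Explore noise schedules (including two-mode approaches) and noise weighting schemes within the NELBO objective.

Result: On the BabyLM benchmark, performance is competitive with hybrid autoregressive-masked baselines, showing that diffusion-based training is viable under data constraints.

Insight: With targeted design choices such as frequency-informed masking, diffusion models can learn language effectively, offering a new option for data-restricted settings.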

Abstract: We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
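
One simple way to realize frequency-informed masking is to make each token's masking probability proportional to its inverse corpus frequency, then rescale so the average rate matches a target. This inverse-frequency rule, the function name, and the counts are assumptions for illustration. The digest does not give the authors' actual weighting, nor how they preserve the theoretical validity they mention.

```python
def masking_probabilities(tokens, corpus_counts, mask_rate=0.15):
    """Per-token mask probabilities that favor rare tokens and average to `mask_rate`.

    Note: the final clamp to 1.0 can lower the average if a token is extremely rare.
    """
    weights = [1.0 / corpus_counts[t] for t in tokens]
    scale = mask_rate * len(tokens) / sum(weights)
    return [min(1.0, w * scale) for w in weights]


if __name__ == "__main__":
    counts = {"the": 100, "cat": 10, "zygote": 1}  # hypothetical corpus frequencies
    probs = masking_probabilities(["the", "cat", "zygote"], counts)
    for token, p in zip(["the", "cat", "zygote"], probs):
        print(f"{token}: {p:.4f}")
```

Under uniform masking every token would get 0.15. Here the rare token is masked far more often, so the model is trained to predict it more frequently.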

[45] ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Matteo Bortoletto,Constantin Ruhdorfer,Andreas Bulling

Main category: cs.CL

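TL;DR: This paper proposes ToM-SSI, a new benchmark for evaluating Theory of Mind (ToM) in multimodal, socially rich environments, and reveals the limits of current models on its new tasks.

Details Motivation: Existing ToM benchmarks, mostly variations of the Sally-Anne test, are too simplified to capture the dynamics of complex social interaction or reasoning about the mental states of multiple agents.

Contribution: Proposes ToM-SSI, a multimodal ToM benchmark with group interactions, which broadens the scope and complexity of ToM evaluation.

Method: Designs situated environments in which up to four agents communicate and move, covering mixed cooperative-obstructive settings and testing a model's ability to reason about several agents' mental states in parallel.

Result: Current models remain severely limited, especially on the new tasks, which highlights critical gaps for future research.

Insight: ToM evaluation needs to reflect the complexity of real social settings, and multi-agent, multimodal interaction opens new directions for future work.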

Abstract: Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents’ mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models’ performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.

[46] Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework

David Herrera-Poyatos,Carlos Peláez-González,Cristina Zuheros,Virilo Tejedor,Rosana Montes,Francisco Herrera

Main category: cs.CL

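TL;DR: This paper introduces the TAXAL framework, which fuses cognitive, functional, and causal dimensions into a unified approach for designing, evaluating, and deploying explanations for LLMs.

Details Motivation: LLMs are increasingly deployed in high-risk domains, where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods focus on surface outputs and fail to capture reasoning pathways and systemic impacts.

Contribution: Proposes the TAXAL framework, which unites three dimensions, cognitive (user understanding), functional (practical utility), and causal (faithful reasoning), into a comprehensive, role-sensitive foundation for LLM explainability.

Method: TAXAL synthesizes existing methods (such as post-hoc attribution, dialogic interfaces, and explanation-aware prompting), situates them within the triadic fusion model, and demonstrates its applicability through case studies in law, education, healthcare, and public services.

Result: The case studies show how explanation strategies adapt to institutional constraints and stakeholder roles, supporting trustworthy and context-sensitive LLM applications.

Insight: Explainability is a sociotechnical practice as well as a technical problem, and TAXAL offers both a conceptual foundation and practical design patterns for trustworthy LLM applications.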

Abstract: Large Language Models (LLMs) are increasingly being deployed in high-risk domains where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods, focused on surface outputs, do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs. We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework that unites three complementary dimensions: cognitive (user understanding), functional (practical utility), and causal (faithful reasoning). TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings. Our analysis synthesizes existing methods, ranging from post-hoc attribution and dialogic interfaces to explanation-aware prompting, and situates them within the TAXAL triadic fusion model. We further demonstrate its applicability through case studies in law, education, healthcare, and public services, showing how explanation strategies adapt to institutional constraints and stakeholder roles. By combining conceptual clarity with design patterns and deployment pathways, TAXAL advances explainability as a technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.

[47] Hunyuan-MT Technical Report

Mao Zheng,Zheng Li,Bingxin Qu,Mingyang Song,Yang Du,Mingrui Sun,Di Wang

Main category: cs.CL

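TL;DR: This report introduces Hunyuan-MT-7B, the authors' first open-source multilingual translation model, which supports bidirectional translation across 33 major languages with special emphasis on translation between Mandarin and several ethnic minority languages and dialects. Hunyuan-MT-Chimera-7B, inspired by the slow-thinking mode, further improves performance.

Details Motivation: To meet diverse multilingual translation needs, especially between Mandarin and ethnic minority languages and dialects, and to improve the generality and performance of translation models.

Contribution: 1. The authors' first open-source multilingual translation model, covering 33 languages; 2. A slow-thinking-inspired model that integrates multiple outputs to improve performance; 3. State-of-the-art performance in the WMT2025 shared task.

Method: 1. A holistic training process: general and MT-oriented pre-training, then SFT, then alignment through RL and weak-to-strong RL; 2. Hunyuan-MT-Chimera-7B integrates multiple outputs generated by Hunyuan-MT-7B under varying parameter settings.

Result: Ranks first in 30 of 31 language pairs in the WMT2025 General Machine Translation shared task, and significantly outperforms translation-specific models of comparable size and most SOTA large models, particularly on translation between Mandarin and minority languages and dialects.

Insight: A translation-specific training process combined with slow-thinking-style output integration can improve multilingual translation, including for low-resource languages.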

Abstract: In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

[48] BEDTime: A Unified Benchmark for Automatically Describing Time Series

Medhasweta Sen,Zachary Gottesman,Jiaxing Qiu,C. Bayan Bruss,Nam Nguyen,Tom Hartvigsen

Main category: cs.CL

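TL;DR: This paper proposes BEDTime, a unified benchmark for evaluating models on time series description through three tasks (recognition, differentiation, and generation) over four unified datasets. Language-only models underperform, vision-language models do well, and time series-language models beat language-only models but still have room to improve.

Details Motivation: Time series models are often introduced together with new datasets, which limits direct comparison, and evaluations that span many tasks make it hard to pinpoint specific capabilities. The paper fills this gap with a unified benchmark that standardizes the evaluation of time series description.

Contribution: 1. Proposes the BEDTime benchmark, which evaluates three time series description tasks in a unified way; 2. Unifies four existing datasets to support model comparison; 3. Reveals performance differences among language, vision-language, and time series-language models and where they fall short.

Method: 1. Define three tasks: recognition (True/False question answering), differentiation (multiple-choice question answering), and generation (open-ended natural language description); 2. Unify four existing datasets; 3. Evaluate 13 state-of-the-art language, vision-language, and time series-language models.

Result: 1. Language-only models largely underperform; 2. Vision-language models are quite successful; 3. Time series-language models outperform language-only models but still have significant room for improvement; 4. All approaches are fragile in robustness tests.

Insight: 1. Time series description calls for time series-specific architectures; 2. Vision models are valuable for these tasks; 3. Time series-language models are promising but need further work; 4. A unified benchmark supports standardized evaluation.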

Abstract: Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate 3 tasks that test a model’s ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple choice question-answering), and (3) generation (open-ended natural language description). We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision–language, and time series–language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, identifying the value of vision models for these tasks and (3) pretrained multimodal time series–language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.

[49] Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation

Abdul Waheed,Chancharik Mitra,Laurie Z. Wang,Deva Ramanan,Bhiksha Raj

Main category: cs.CL

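TL;DR: This paper proposes a difficulty-aware chain-of-thought (CoT) framework that lets a model adjust its reasoning depth to problem complexity, without architectural changes, purely through the post-training data.

Details Motivation: Chain-of-thought reasoning can be needlessly verbose on simple problems, while complex problems still need deep analysis, so a mechanism for adjusting reasoning depth dynamically is needed.

Contribution: 1. Proposes a difficulty-aware reasoning framework that adjusts chain-of-thought length dynamically; 2. Shows that dynamic inference pathways can be obtained through post-training data alone; 3. Combines supervised fine-tuning (SFT) and direct preference optimization (DPO) to balance reasoning length and accuracy.

Method: 1. Curate data so that chain-of-thought length is proportional to problem difficulty; 2. Use SFT to capture patterns such as reasoning length and format; 3. Use DPO to preserve reasoning accuracy; 4. Combine the two to reduce length while maintaining or improving performance.

Result: Models learn to "think proportionally", reasoning minimally on simple problems while keeping depth on complex ones. Both quantitative metrics and qualitative assessments support this.

Insight: Dynamic inference pathways can be achieved through data design without architectural modification, and the combination of SFT and DPO is key.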

Abstract: Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose output for simpler problems. We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. Remarkably, we show that models can be endowed with such dynamic inference pathways without any architectural modifications; we simply post-train on data that is carefully curated to include chain-of-thought traces that are proportional in length to problem difficulty. Our analysis reveals that post-training via supervised fine-tuning (SFT) primarily captures patterns like reasoning length and format, while direct preference optimization (DPO) preserves reasoning accuracy, with their combination reducing length and maintaining or improving performance. Both quantitative metrics and qualitative assessments confirm that models can learn to “think proportionally”, reasoning minimally on simple problems while maintaining depth for complex ones.

[50] CURE: Controlled Unlearning for Robust Embeddings – Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Aysenur Kocak,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci

Main category: cs.CL

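TL;DR: CURE is a lightweight framework that improves the robustness of pre-trained language models by disentangling and suppressing conceptual shortcuts, with clear gains on the IMDB and Yelp datasets.

Details Motivation: Pre-trained language models are susceptible to spurious, concept-driven correlations that impair robustness and fairness. CURE is designed to address this problem.

Contribution: Proposes the CURE framework, which uses a content extractor and a controllable debiasing module to suppress conceptual shortcuts while preserving essential content information.

Method: A content extractor reinforced by a reversal network produces concept-irrelevant representations, and a contrastive-learning debiasing module then adjusts the influence of residual conceptual cues.

Result: F1 improves by 10 points on IMDB and by 2 points on Yelp, with minimal computational overhead.

Insight: CURE offers a flexible, unsupervised blueprint for combating conceptual bias and helps build more reliable language understanding systems.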

Abstract: Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.

[51] Elucidating the Design Space of Decay in Linear Attention

Zhen Qin,Xuyang Shen,Yiran Zhong

Main category: cs.CL

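TL;DR: The paper systematically studies the design space of decay mechanisms in linear-complexity sequence models along four dimensions (parameterization strategy, parameter sharing, decay granularity, and compatibility with relative positional encoding) and reports the key design insights from its experiments.

Details Motivation: Linear attention has drawn attention for its computational efficiency, but the design of its decay mechanism has not been studied systematically. The paper aims to fill this gap by exploring the design space of decay and its effect on model performance.

Contribution: The main contributions are: 1) defining four key design dimensions of decay mechanisms; 2) showing experimentally that effective parameterization strategies are confined to a specific parameter range; 3) identifying the pitfalls of parameter sharing; 4) comparing the performance of scalar and vector decay; 5) analyzing how well RoPE combines with linear attention.

Method: Through systematic experiments on several language modeling tasks, the authors compare the effect of different parameterization strategies, parameter sharing schemes, decay granularities (scalar versus vector), and RoPE positional encoding on model performance.

Result: The experiments show that: 1) effective parameterization configurations are limited to a specific range; 2) arbitrary parameter sharing can make decay values too large or too small and degrade performance; 3) under identical parameterization, vector decay generally outperforms scalar decay, although scalar decay can come out ahead under some alternative parameterization strategies; 4) RoPE brings no tangible benefit to most linear attention mechanisms.

Insight: Decay design requires careful trade-offs; the right choice of parameter sharing and decay granularity depends on the parameterization strategy; and the general usefulness of RoPE may be overestimated for linear attention, so it should be evaluated per setting.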

Abstract: This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.
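
To make the scalar-versus-vector distinction concrete, the following is a minimal NumPy sketch of a decayed linear-attention recurrence. It assumes the common form S_t = diag(decay_t) S_{t-1} + k_t v_t^T with output o_t = S_t^T q_t; the function name and this exact recurrence are illustrative assumptions, not the paper's formulation or code.

```python
import numpy as np

def linear_attention_with_decay(q, k, v, decay):
    """Recurrent linear attention with a decayed state (illustrative sketch).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    decay: shape (T,) for scalar decay (one value per step) or
           shape (T, d_k) for vector decay (one value per key channel).
    Decay values are assumed to lie in (0, 1].
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        # Broadcast a scalar decay to every key channel; keep a vector decay as is.
        lam = np.broadcast_to(np.atleast_1d(decay[t]), (d_k,))
        # Decay the running state, then add the new key-value outer product.
        state = lam[:, None] * state + np.outer(k[t], v[t])
        out[t] = state.T @ q[t]
    return out
```

With all decay values equal to 1 this reduces to plain causal linear attention, and a scalar decay is the special case of a vector decay whose channels share one value, which is why the two granularities can be compared under the same parameterization.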

cs.CV [Back]

[52] Facial Emotion Recognition does not detect feeling unsafe in automated driving

Abel van Elburg,Konstantinos Gkentsidis,Mathieu Sarrazin,Sarah Barendswaard,Varun Kotian,Riender Happee

Main category: cs.CV

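TL;DR: The paper finds that facial emotion recognition cannot reliably detect passengers feeling unsafe in automated driving, and instead proposes a neural network model based on vehicle motion and skin conductance to assess perceived risk.

Details Motivation: Public trust and perceived safety are key to the broad acceptance of automated vehicles, but reliable methods for objectively assessing passengers' perceived risk are lacking.

Contribution: The paper shows the limits of facial emotion recognition for assessing perceived risk and proposes a neural network model based on vehicle motion and skin conductance that predicts passengers' perceived risk more accurately.

Method: A driving simulator experiment collected continuous subjective comfort ratings, motion data, facial expressions, skin conductance, heart rate, and eye tracking data from 32 participants. A neural network model was built on vehicle motion and skin conductance.

Result: The dynamic driving style induced stronger discomfort than the calm style. Of the 24 participants whose facial expressions could be analyzed, only 9 showed a detectable reaction to the critical event, mostly a Happy expression (8 of 9), and Fear was never dominant. The neural network model correlated well with reported perceived risk.

Insight: Facial expression recognition is not a reliable way to assess perceived risk in automated driving, whereas vehicle motion and physiological signals such as skin conductance appear better suited to objective risk assessment.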

Abstract: Trust and perceived safety play a crucial role in the public acceptance of automated vehicles. To understand perceived risk, an experiment was conducted using a driving simulator under two automated driving styles and optionally introducing a crossing pedestrian. Data was collected from 32 participants, consisting of continuous subjective comfort ratings, motion, webcam footage for facial expression, skin conductance, heart rate, and eye tracking. The continuous subjective perceived risk ratings showed significant discomfort associated with perceived risk during cornering and braking followed by relief or even positive comfort on continuing the ride. The dynamic driving style induced a stronger discomfort as compared to the calm driving style. The crossing pedestrian did not affect discomfort with the calm driving style but doubled the comfort decrement with the dynamic driving style. This illustrates the importance of consequences of critical interactions in risk perception. Facial expression was successfully analyzed for 24 participants but most (15/24) did not show any detectable facial reaction to the critical event. Among the 9 participants who did, 8 showed a Happy expression, and only 4 showed a Surprise expression. Fear was never dominant. This indicates that facial expression recognition is not a reliable method for assessing perceived risk in automated vehicles. To predict perceived risk a neural network model was implemented using vehicle motion and skin conductance. The model correlated well with reported perceived risk, demonstrating its potential for objective perceived risk assessment in automated vehicles, reducing subjective bias and highlighting areas for future research.

[53] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang,Ximing Xing,Yiji Cheng,Zhiyuan Zhao,Jiale Tao,Qixun Wang,Ruihuang Li,Xin Li,Mingrui Wu,Xinchi Deng,Chunyu Wang,Qinglin Lu

Main category: cs.CV

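TL;DR: PromptEnhancer improves the output quality of text-to-image (T2I) models through chain-of-thought (CoT) prompt rewriting, without modifying model weights. The rewriter is trained with reinforcement learning against a dedicated reward model, AlignEvaluator, to handle complex prompts more faithfully.

Details Motivation: Existing T2I models struggle with complex prompts (attribute binding, negation, compositional relationships), so generated images do not match user intent. PromptEnhancer aims to close this gap through prompt rewriting.

Contribution: 1. Proposes PromptEnhancer, a universal prompt rewriting framework that requires no changes to T2I model weights; 2. Introduces the AlignEvaluator reward model, which gives fine-grained feedback based on a taxonomy of 24 key points derived from common T2I failure modes; 3. Introduces a high-quality human preference benchmark.

Method: 1. Train a chain-of-thought (CoT) prompt rewriter with reinforcement learning; 2. Use AlignEvaluator to supply fine-grained reward signals; 3. Optimize the rewritten prompts against the 24-point taxonomy of common failures.

Result: Experiments on HunyuanImage 2.1 show that PromptEnhancer significantly improves image-text alignment, especially on semantic and compositional challenges.

Insight: 1. Prompt rewriting is an effective way to improve T2I performance; 2. Fine-grained reward signals are essential for optimizing prompts; 3. The approach could be extended to other generation tasks.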

Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

[54] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

Hongyang Wei,Baixin Xu,Hongbo Liu,Cyrus Wu,Jie Liu,Yi Peng,Peiyu Wang,Zexiang Liu,Jingwen He,Yidan Xietian,Chuanxin Tang,Zidong Wang,Yichen Wei,Liang Hu,Boyi Jiang,William Li,Ying He,Yang Liu,Xuchen Song,Eric Li,Yahui Zhou

Main category: cs.CV

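TL;DR: UniPic 2.0 presents a 2B-parameter DiT model based on SD3.5-Medium that achieves efficient text-to-image generation and editing through architectural changes, large-scale pre-training, and a Progressive Dual-Task Reinforcement strategy (PDTR). The model is then connected to a multimodal component to build the unified UniPic2-Metaquery system.

Details Motivation: Prominent open-source multimodal models tend to prioritize parameter scale over training strategy, which limits their efficiency and performance. The paper aims to improve capability through better training strategy and architecture design.

Contribution: 1. Proposes a modified DiT architecture and the Progressive Dual-Task Reinforcement strategy (PDTR); 2. Builds the UniPic2-SD3.5M-Kontext model, which outperforms competitors with more parameters; 3. Proposes the unified multimodal framework UniPic2-Metaquery.

Method: 1. Modify the SD3.5-Medium architecture and pre-train it at scale; 2. Apply PDTR to strengthen the two tasks in stages; 3. Train jointly with Qwen2.5-VL-7B through a connector.

Result: UniPic2-SD3.5M-Kontext outperforms BAGEL (7B) and Flux-Kontext (12B) on generation and editing. UniPic2-Metaquery delivers top-tier performance across understanding, generation, and editing tasks in one model.

Insight: Staged reinforcement avoids negative interference between tasks while strengthening both. The unified training paradigm scales and generalizes well.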

Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters-including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following the MetaQuery, we connect the UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.

[55] Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

Jingyi Lu,Kai Han

Main category: cs.CV

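TL;DR: Inpaint4Drag is a new method for drag-based image editing that combines bidirectional warping with image inpainting, substantially improving interaction speed and quality.

Details Motivation: Existing drag-based editing methods manipulate the latent space of generative models, which leads to limited precision, delayed feedback, and model-specific constraints.

Contribution: Proposes a general framework that decomposes drag editing into pixel-space bidirectional warping and image inpainting, supporting real-time previews and efficient inpainting.

Method: Image regions are treated as deformable materials; bidirectional warping produces the target image, and any inpainting model then completes the edit.

Result: At 512x512 resolution the method achieves real-time warping previews in 0.01s and inpainting in 0.3s, compared with existing methods that take minutes per edit.

Insight: Because drag inputs are converted into the standard inpainting format, Inpaint4Drag can use any inpainting model without architecture changes and benefits from future improvements in inpainting.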

Abstract: Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance. Project page: https://visual-ai.github.io/inpaint4drag/

[56] DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models

Jin Ma,Mohammed Aldeen,Christopher Salas,Feng Luo,Mashrur Chowdhury,Mert Pesé,Long Cheng

Main category: cs.CV

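TL;DR: DISPATCH is the first diffusion-based defense framework against adversarial patch attacks on object detection, neutralizing attack effects with a "regenerate and rectify" strategy.

Details Motivation: State-of-the-art object detectors are vulnerable to adversarial patch attacks, which can have severe safety consequences. Existing defenses lack generalizability and robustness.

Contribution: Proposes the DISPATCH framework, the first to use the generative power of diffusion models to disarm attack effects; it is attack-agnostic and requires no prior knowledge of the patches.

Method: A "regenerate and rectify" strategy: a diffusion model regenerates the entire image, and a rectification step then replaces adversarial regions with their regenerated benign counterparts.

Result: DISPATCH outperforms existing defenses across multiple detectors and attack types, reaching an mAP.5 of 89.3% on hiding attacks and lowering the success rate of untargeted creating attacks to 24.8%.

Insight: The generative capacity of diffusion models works well for adversarial defense and opens a new direction for attack-agnostic defenses.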

Abstract: Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. Given the current diversity of adversarial patch attacks and potential unknown threats, an ideal defense method should be effective, generalizable, and robust against adaptive attacks. In this work, we introduce DISPATCH, the first diffusion-based defense framework for object detection. Unlike previous works that aim to “detect and remove” adversarial patches, DISPATCH adopts a “regenerate and rectify” strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DISPATCH is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors and attacks demonstrate that DISPATCH consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. Moreover, it maintains strong robustness against adaptive attacks, making it a practical and reliable defense for object detection systems.

[57] WATCH: World-aware Allied Trajectory and pose reconstruction for Camera and Human

Qijun Ying,Zhongyuan Hu,Rui Zhang,Ronghui Li,Yu Lu,Zijiao Zeng

Main category: cs.CV

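TL;DR: WATCH proposes a unified framework that tackles global human motion reconstruction from monocular video by decomposing camera orientation information and integrating camera translation cues.

Details Motivation: Demand for global human motion reconstruction from monocular video is growing in VR, graphics, and robotics, but the task is hampered by depth ambiguity, motion ambiguity, and the entanglement of camera and human movement. Existing methods make insufficient use of camera information.

Contribution: 1. Proposes the WATCH framework, which jointly models camera and human motion; 2. Designs an efficient analytical heading angle decomposition technique; 3. Proposes a camera trajectory integration mechanism inspired by world models.

Method: 1. Use analytical heading angle decomposition to exploit camera orientation information; 2. Integrate camera translation cues through a world-model-inspired mechanism, avoiding the limits of naive hard decoding.

Result: On in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction.

Insight: Jointly modeling camera-human motion relationships is an effective route to global motion reconstruction and offers a new way to integrate camera translation information.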

Abstract: Global human motion reconstruction from in-the-wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates-a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human-motion-centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World-aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard-decoding approaches. Through experiments on in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera-human motion relationships and offers new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction. The code will be available publicly.

[58] Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

MinJu Jeon,Si-Woo Kim,Ye-Chan Kim,HyunGee Kim,Dong-Jin Kim

Main category: cs.CV

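TL;DR: Sali4Vid proposes a saliency-aware video reweighting and adaptive caption retrieval framework that improves dense video captioning by weighting frames according to timestamp annotations and by segmenting videos dynamically to capture scene transitions.

Details Motivation: Current dense video captioning has two limitations: 1) timestamp supervision is applied only to text, while all video frames are treated equally; 2) captions are retrieved from fixed-size video chunks, which ignores scene transitions. Sali4Vid is proposed to address both.

Contribution: 1) Introduces Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights; 2) Proposes Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions.

Method: 1) Saliency-aware video reweighting assigns frame importance dynamically; 2) Semantic-based retrieval segments the video dynamically by frame similarity.

Result: State-of-the-art results on YouCook2 and ViTT confirm the benefit of jointly improving video weighting and caption retrieval.

Insight: Saliency awareness and dynamic segmentation capture key events and scene changes more precisely, which improves caption accuracy.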

Abstract: Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
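
One way to picture "sigmoid-based frame importance weights" is a soft window around each annotated event. The sketch below is an assumption for illustration only: the abstract does not give the formula, and the function name `frame_weights`, the product-of-sigmoids window, and the `sharpness` parameter are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_weights(frame_times, events, sharpness=2.0):
    """Soft importance weight per frame from (start, end) timestamp annotations.

    Each event contributes a smooth window built from two sigmoids: one rising
    at the event start and one falling at the event end. A frame takes the
    largest weight over all events (hypothetical aggregation choice).
    """
    frame_times = np.asarray(frame_times, dtype=float)
    weights = np.zeros_like(frame_times)
    for start, end in events:
        rising = sigmoid(sharpness * (frame_times - start))
        falling = sigmoid(sharpness * (end - frame_times))
        weights = np.maximum(weights, rising * falling)
    return weights

# Frames sampled once per second, one annotated event from t=3s to t=7s.
weights = frame_weights(np.arange(11), [(3.0, 7.0)])
```

Frames well inside the event get weights near 1, frames far outside get weights near 0, and `sharpness` controls how abruptly the weight changes at the event boundaries.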

[59] UAV-Based Intelligent Traffic Surveillance System: Real-Time Vehicle Detection, Classification, Tracking, and Behavioral Analysis

Ali Khanpour,Tianyi Wang,Afra Vahidi-Shams,Wim Ectors,Farzam Nakhaie,Amirhossein Taheri,Christian Claudel

Main category: cs.CV

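TL;DR: The paper presents a UAV-based intelligent traffic surveillance system that detects, classifies, and tracks vehicles and analyzes their behavior in real time. Using multi-scale, multi-angle template matching, Kalman filtering, and homography-based calibration, it performs well in urban traffic monitoring, with a detection precision of 91.8%.

Details Motivation: Traditional traffic monitoring systems, such as fixed cameras and sensors, are constrained by limited coverage, low adaptability, and poor scalability. UAV-based monitoring can overcome these limits and suits complex, changing urban environments.

Contribution: 1. Proposes a complete UAV-based real-time traffic surveillance system. 2. Achieves accurate vehicle detection (91.8% precision), classification, tracking (MOTA 92.1%), and traffic violation analysis. 3. The system is scalable and practical, and suited to smart cities.

Method: The system detects vehicles with multi-scale, multi-angle template matching, tracks them with Kalman filtering, and uses homography-based calibration for accurate spatial localization. Behavioral analysis fuses geofencing, motion filtering, and trajectory deviation analysis.

Result: In the case study the system reaches a detection precision of 91.8%, an F1-score of 90.5%, and tracking MOTA/MOTP of 92.1% and 93.7%, respectively. It also classifies five vehicle types and detects traffic violations.

Insight: UAV surveillance gives urban traffic management a wide-coverage, flexible, infrastructure-independent solution, well suited to real-time analysis and decision support in complex traffic environments.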

Abstract: Traffic congestion and violations pose significant challenges for urban mobility and road safety. Traditional traffic monitoring systems, such as fixed cameras and sensor-based methods, are often constrained by limited coverage, low adaptability, and poor scalability. To address these challenges, this paper introduces an advanced unmanned aerial vehicle (UAV)-based traffic surveillance system capable of accurate vehicle detection, classification, tracking, and behavioral analysis in real-world, unconstrained urban environments. The system leverages multi-scale and multi-angle template matching, Kalman filtering, and homography-based calibration to process aerial video data collected from altitudes of approximately 200 meters. A case study in an urban area demonstrates robust performance, achieving a detection precision of 91.8%, an F1-score of 90.5%, and tracking metrics (MOTA/MOTP) of 92.1% and 93.7%, respectively. Beyond precise detection, the system classifies five vehicle types and automatically detects critical traffic violations, including unsafe lane changes, illegal double parking, and crosswalk obstructions, through the fusion of geofencing, motion filtering, and trajectory deviation analysis. The integrated analytics module supports origin-destination tracking, vehicle count visualization, inter-class correlation analysis, and heatmap-based congestion modeling. Additionally, the system enables entry-exit trajectory profiling, vehicle density estimation across road segments, and movement direction logging, supporting comprehensive multi-scale urban mobility analytics. Experimental results confirm the system’s scalability, accuracy, and practical relevance, highlighting its potential as an enforcement-aware, infrastructure-independent traffic monitoring solution for next-generation smart cities.
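
As a rough illustration of the Kalman filtering step used for tracking, here is a textbook constant-velocity Kalman filter over 2D vehicle positions. The abstract does not describe the motion model or noise settings, so the class name, the constant-velocity assumption, and every parameter value below are hypothetical.

```python
import numpy as np

class ConstantVelocityKalman:
    """Textbook Kalman filter with state [x, y, vx, vy] and position-only measurements."""

    def __init__(self, x, y, dt=1.0, process_var=1.0, meas_var=4.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 100.0                 # large initial uncertainty
        self.F = np.array([[1, 0, dt, 0],          # constant-velocity motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],           # only x and y are observed
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var           # process noise (assumed value)
        self.R = np.eye(2) * meas_var              # measurement noise (assumed value)

    def predict(self):
        """Propagate the state one step ahead and return the predicted position."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, zx, zy):
        """Correct the prediction with a detected position and return the estimate."""
        innovation = np.array([zx, zy]) - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]
```

In a tracker, `predict` would be called once per frame for every track and `update` only when a detection is matched to that track, so the predicted position can bridge frames where the detector misses a vehicle.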

[60] VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation

Mustafa Munir,Alex Zhang,Radu Marculescu

Main category: cs.CV

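TL;DR: VCMamba is a novel vision backbone that combines the strengths of CNNs and multi-directional Mamba SSMs, improving both global and local feature representation while keeping linear complexity.

Details Motivation: CNNs are strong at local feature extraction but lack global modeling ability; ViTs and SSMs such as Mamba are good at global modeling but capture fine-grained local features less effectively. VCMamba aims to combine the strengths of both.

Contribution: Proposes VCMamba, a hybrid backbone that integrates CNNs with multi-directional Mamba SSMs and improves accuracy on vision tasks with fewer parameters.

Method: VCMamba uses a convolutional stem and early-stage convolutional blocks to extract local features, and multi-directional Mamba blocks in later stages to model long-range dependencies, yielding a design with linear complexity in image resolution.

Result: On ImageNet-1K classification and ADE20K semantic segmentation, VCMamba outperforms models such as PlainMamba, Vision GNN, and EfficientFormer.

Insight: Combining CNNs with SSMs improves feature representation and shows that linear-complexity designs are viable for high-resolution tasks.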

Abstract: Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce VCMamba, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba’s effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.

[61] Guideline-Consistent Segmentation via Multi-Agent Refinement

Vanshika Vats,Ashwani Rathee,James Davis

Main category: cs.CV

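TL;DR: The paper proposes a multi-agent, training-free framework that uses an iterative Worker-Supervisor refinement architecture with general-purpose vision-language models to make semantic segmentation strictly follow complex labeling guidelines.

Details Motivation: Real-world semantic segmentation needs not only accurate masks but also strict adherence to complex textual labeling guidelines. Traditional methods rely on expensive task-specific retraining, and existing open-vocabulary segmentation methods struggle with paragraph-length rules.

Contribution: Proposes a training-free multi-agent framework that combines Worker-Supervisor iterative refinement with a lightweight reinforcement learning stop policy, achieving guideline-consistent segmentation while balancing resource use.

Method: A Worker performs the segmentation, a Supervisor critiques it against the retrieved guidelines, and a reinforcement learning stop policy decides when to end the loop, balancing quality against resource use.

Result: The method notably outperforms state-of-the-art baselines on the Waymo and ReasonSeg datasets, showing strong generalization and instruction adherence.

Insight: Multi-agent collaboration with a learned stopping policy handles semantic segmentation under complex guidelines effectively and offers an efficient, flexible solution for practical use.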

Abstract: Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.

[62] Enhancing Self-Driving Segmentation in Adverse Weather Conditions: A Dual Uncertainty-Aware Training Approach to SAM Optimization

Dharsan Ravindran,Kevin Wang,Zhuoyuan Cao,Saleh Abdelrahman,Jeffery Wu

Main category: cs.CV

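TL;DR: The paper proposes a dual uncertainty-aware training approach to improve SAM-based image segmentation for autonomous driving in adverse weather. It improves robustness through multi-step finetuning and the Uncertainty-Aware Adapter (UAT), and validates both on several datasets.

Details Motivation: Vision foundation models such as SAM and SAM2 perform poorly in adverse weather, largely because they lack uncertainty quantification. Inspired by uncertainty-aware training in medical imaging, the authors aim to make segmentation more reliable for autonomous driving in difficult conditions.

Contribution: 1. Proposes a multi-step finetuning procedure for SAM2 that incorporates uncertainty metrics directly into the loss function; 2. Adapts the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving scenes to improve robustness.

Method: 1. Design a multi-step SAM2 finetuning procedure with an uncertainty-aware loss; 2. Adapt the UAT method to segmentation of driving scenes.

Result: Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with the uncertainty-aware loss performs better across diverse driving scenes.

Insight: Explicit uncertainty modeling is valuable for safety-critical autonomous driving, particularly in challenging environments where it improves segmentation performance.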

Abstract: Recent advances in vision foundation models, such as the Segment Anything Model (SAM) and its successor SAM2, have achieved state-of-the-art performance on general image segmentation benchmarks. However, these models struggle in adverse weather conditions where visual ambiguity is high, largely due to their lack of uncertainty quantification. Inspired by progress in medical imaging, where uncertainty-aware training has improved reliability in ambiguous cases, we investigate two approaches to enhance segmentation robustness for autonomous driving. First, we introduce a multi-step finetuning procedure for SAM2 that incorporates uncertainty metrics directly into the loss function, improving overall scene recognition. Second, we adapt the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving contexts. We evaluate both methods on CamVid, BDD100K, and GTA driving datasets. Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with uncertainty-aware loss achieves improved performance across diverse driving scenes. These findings underscore the value of explicit uncertainty modeling for safety-critical autonomous driving in challenging environments.

[63] WatchHAR: Real-time On-device Human Activity Recognition System for Smartwatches

Taeyoung Yeon,Vasco Xu,Henry Hoffmann,Karan Ahuja

Main category: cs.CV

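TL;DR: WatchHAR is a multimodal, fine-grained human activity recognition system that runs entirely on a smartwatch, improving real-time performance and accuracy by optimizing its data preprocessing and inference pipeline.

Details Motivation: Existing HAR systems often depend on external data processing, which raises privacy and latency concerns. WatchHAR aims to solve this with efficient activity recognition that runs fully on the device.

Contribution: 1. Proposes an end-to-end trainable architecture that unifies sensor data preprocessing and inference; 2. Maintains over 90% accuracy across more than 25 activity classes while processing 5x faster.

Method: By optimizing each component of the data pipeline, WatchHAR merges sensor data preprocessing and inference into a single module, improving efficiency and speed.

Result: WatchHAR outperforms existing methods on event detection and activity classification, with processing times of 9.3 ms for activity event detection and 11.8 ms for multimodal activity classification.

Insight: The study shows that smartwatches can serve as standalone, privacy-aware, minimally invasive devices for continuous activity tracking.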

Abstract: Despite advances in practical and multimodal fine-grained Human Activity Recognition (HAR), a system that runs entirely on smartwatches in unconstrained environments remains elusive. We present WatchHAR, an audio and inertial-based HAR system that operates fully on smartwatches, addressing privacy and latency issues associated with external data processing. By optimizing each component of the pipeline, WatchHAR achieves compounding performance gains. We introduce a novel architecture that unifies sensor data preprocessing and inference into an end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy across more than 25 activity classes. WatchHAR outperforms state-of-the-art models for event detection and activity classification while running directly on the smartwatch, achieving 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification. This research advances on-device activity recognition, realizing smartwatches’ potential as standalone, privacy-aware, and minimally-invasive continuous activity tracking devices.

[64] MCANet: A Multi-Scale Class-Specific Attention Network for Multi-Label Post-Hurricane Damage Assessment using UAV Imagery

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

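TL;DR: MCANet is a multi-label classification framework that uses multi-scale representations and class-specific attention to improve multi-label post-hurricane damage assessment from UAV imagery, outperforming several baseline models.

Details Motivation: Existing CNN-based methods struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types, which limits the accuracy and speed of post-hurricane damage assessment.

Contribution: 1) Proposes the MCANet framework, combining multi-scale representation with class-specific attention; 2) Achieves 91.75% mAP on the RescueNet dataset, outperforming several baselines; 3) Improves classification of difficult classes through attention (for example, an improvement of over 6% for 'Road Blocked').

Method: 1) Use Res2Net as the backbone for multi-scale feature extraction; 2) Design a multi-head class-specific residual attention module in which each attention branch focuses on a different spatial granularity; 3) Combine local detail with global context.

Result: On the RescueNet dataset MCANet reaches 91.75% mAP (92.35% with eight attention heads), outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT.

Insight: 1) Multi-scale and class-specific attention work well for complex multi-label classification; 2) Class activation mapping supports model interpretability; 3) Future work could integrate disaster-specific knowledge graphs and multimodal large language models to improve generalization and semantic understanding.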

Abstract: Rapid and accurate post-hurricane damage assessment is vital for disaster response and recovery. Yet existing CNN-based methods struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types. To address these issues, we propose MCANet, a multi-label classification framework that learns multi-scale representations and adaptively attends to spatially relevant regions for each damage category. MCANet employs a Res2Net-based hierarchical backbone to enrich spatial context across scales and a multi-head class-specific residual attention module to enhance discrimination. Each attention branch focuses on different spatial granularities, balancing local detail with global context. We evaluate MCANet on the RescueNet dataset of 4,494 UAV images collected after Hurricane Michael. MCANet achieves a mean average precision (mAP) of 91.75%, outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT. With eight attention heads, performance further improves to 92.35%, boosting average precision for challenging classes such as Road Blocked by over 6%. Class activation mapping confirms MCANet’s ability to localize damage-relevant regions, supporting interpretability. Outputs from MCANet can inform post-disaster risk mapping, emergency routing, and digital twin-based disaster response. Future work could integrate disaster-specific knowledge graphs and multimodal large language models to improve adaptability to unseen disasters and enrich semantic understanding for real-world decision-making.
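
The idea of class-specific residual attention can be sketched as follows: each class gets its own spatial attention map, and the attended score is added as a residual to the plain average-pooled score. This NumPy sketch follows the general class-specific residual attention idea and is not MCANet's code; the function name, the use of one softmax temperature per head, and the values of `temperatures` and `lam` are assumptions.

```python
import numpy as np

def class_specific_residual_attention(features, class_weights,
                                      temperatures=(1.0, 4.0), lam=0.3):
    """Multi-label logits from spatial features (illustrative sketch).

    features: (N, D) array, one D-dimensional feature per spatial location.
    class_weights: (C, D) array, one classifier vector per class.
    Each head uses its own softmax temperature, so heads attend at different
    sharpness levels (assumed stand-in for "spatial granularity").
    """
    scores = features @ class_weights.T               # (N, C) per-location scores
    logits = np.zeros(class_weights.shape[0])
    for temp in temperatures:
        scaled = temp * scores
        scaled = scaled - scaled.max(axis=0, keepdims=True)   # numerical stability
        attn = np.exp(scaled)
        attn /= attn.sum(axis=0, keepdims=True)       # softmax over locations, per class
        attended = (attn * scores).sum(axis=0)        # class-specific pooled score
        logits += scores.mean(axis=0) + lam * attended  # average pooling plus residual
    return logits
```

A low temperature makes a head behave like average pooling over the whole image, while a high temperature makes it focus on the few locations that respond most strongly to a class, which helps with small or localized damage types.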

[65] Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Kaname Yokoyama,Chihiro Nakatani,Norimichi Ukita

Main category: cs.CV

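TL;DR: The paper proposes a method for dynamically detecting human groups in videos that combines a vision-language model (VLM) with temporal-consistency optimization, and it outperforms existing methods on public datasets.

Details Motivation: Conventional group detection methods usually assume that group structure stays fixed throughout a video, which limits their ability to detect groups that change over time. The paper aims for more accurate dynamic group detection by combining local and global features with temporal consistency.

Contribution: 1. Uses a VLM augmented for group detection to extract local and global features; 2. proposes a temporal-consistency optimization that detects changing groups through a graph built over all frames.

Method: 1. Extracts local and global appearance features in each frame with the VLM; 2. builds a temporal graph from the estimated groupness probabilities and detects dynamic groups through global optimization.

Result: Experiments show that the method outperforms state-of-the-art group detection methods on public datasets.

Insight: Combining a VLM with temporal-consistency optimization improves dynamic group detection, particularly in complex scenes.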

Abstract: This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames’ groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

[66] FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

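TL;DR: FloodVision is a zero-shot framework that combines the foundation vision-language model GPT-4o with a domain knowledge graph to estimate urban flood depth, improving both accuracy and generalization.

Details Motivation: Existing computer vision methods for flood depth estimation depend on fixed object detectors and task-specific training, which limits their accuracy and generalization.

Contribution: Proposes FloodVision, a zero-shot framework that pairs GPT-4o's semantic reasoning with a domain knowledge graph for accurate, generalizable flood depth estimation.

Method: Dynamically identifies reference objects in RGB images, retrieves verified heights from the knowledge graph, estimates submergence ratios, and applies statistical outlier filtering to compute the final depth.

Result: On 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, 20.5% lower than the GPT-4o baseline (10.28 cm).

Insight: Combining a vision-language model with a domain knowledge graph can improve the accuracy and generalization of computer vision tasks in complex scenes.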

Abstract: Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response. While recent computer vision methods have enabled flood detection, they suffer from both accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training. To enable accurate depth estimation that can generalize across diverse flood scenarios, this paper presents FloodVision, a zero-shot framework that combines the semantic reasoning abilities of the foundation vision-language model GPT-4o with a structured domain knowledge graph. The knowledge graph encodes canonical real-world dimensions for common urban objects including vehicles, people, and infrastructure elements to ground the model’s reasoning in physical reality. FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph to mitigate hallucination, estimates submergence ratios, and applies statistical outlier filtering to compute final depth values. Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, reducing the GPT-4o baseline error of 10.28 cm by 20.5% and surpassing prior CNN-based methods. The system generalizes well across varying scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps for smart city flood resilience.
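The depth computation described above (reference height times submergence ratio, followed by outlier filtering) can be sketched as below. This is an illustration under stated assumptions, not FloodVision's implementation: the object names and heights in `REFERENCE_HEIGHTS_CM` are hypothetical placeholders, the abstract does not say which outlier filter is used (a median-absolute-deviation rule is swapped in here), and aggregating by the median is likewise an assumption.

```python
import statistics

# Hypothetical canonical heights in centimetres; the paper's knowledge graph
# entries are not given in the abstract.
REFERENCE_HEIGHTS_CM = {
    "car_wheel": 65.0,
    "adult_person": 170.0,
    "stop_sign_post": 213.0,
}


def depth_estimates(observations):
    """Each observation is (object_name, submergence_ratio in [0, 1])."""
    return [REFERENCE_HEIGHTS_CM[name] * ratio for name, ratio in observations]


def filter_outliers(values, k=3.0):
    """Keep values within k scaled MADs of the median (assumed filter)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    limit = k * 1.4826 * mad  # 1.4826 scales MAD to a std-dev equivalent
    return [v for v in values if abs(v - med) <= limit]


def estimate_flood_depth(observations):
    """Median of the per-object depth estimates that survive filtering."""
    return statistics.median(filter_outliers(depth_estimates(observations)))


def relative_reduction(baseline, value):
    """Fractional error reduction relative to a baseline."""
    return (baseline - value) / baseline


if __name__ == "__main__":
    obs = [
        ("car_wheel", 0.50),
        ("adult_person", 0.20),
        ("car_wheel", 0.52),
        ("stop_sign_post", 0.15),
        ("adult_person", 0.90),  # implausible reading, expected to be filtered
    ]
    print(estimate_flood_depth(obs))
    # Reported numbers: 10.28 cm baseline MAE versus 8.17 cm.
    print(round(relative_reduction(10.28, 8.17), 3))
```

The last line checks the reported figure: (10.28 - 8.17) / 10.28 is about 0.205, which matches the stated 20.5% reduction.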

[67] Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

Bangxiang Lan,Ruobing Xie,Ruixiang Zhao,Xingwu Sun,Zhanhui Kang,Gang Yang,Xirong Li

Main category: cs.CV

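TL;DR: The paper proposes a new Hybrid-Tower framework, instantiated as the PIG method, that combines the strengths of Two-Tower and Single-Tower designs. By generating pseudo-queries for fine-grained interaction, it achieves both high efficiency and high effectiveness in text-to-video retrieval.

Details Motivation: Among existing CLIP-based methods, Two-Tower frameworks are efficient but less effective, while Single-Tower frameworks are effective but inefficient. The Hybrid-Tower framework is proposed to resolve this trade-off.

Contribution: 1. Proposes the Hybrid-Tower framework, which combines the advantages of Two-Tower and Single-Tower designs; 2. introduces PIG, a method whose pseudo-query generator enables fine-grained interaction; 3. adds no storage or computational overhead at inference time compared with the Two-Tower framework.

Method: A pseudo-query generator produces a pseudo-query for each video, so that video features interact in a fine-grained manner with the textual features of the pseudo-query before any real query arrives. Inference keeps Two-Tower efficiency while effectiveness approaches that of Single-Tower models.

Result: On five text-video retrieval benchmarks, R@1 improves by 1.6% to 3.9% over the baseline; effectiveness is near state of the art and efficiency matches Two-Tower models.

Insight: By generating pseudo-queries, the Hybrid-Tower framework resolves the tension between efficiency and effectiveness, showing the potential of combining the strengths of different retrieval frameworks.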

Abstract: The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks, the Two-Tower and Single-Tower frameworks, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, as in Single-Tower approaches, which preserves high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
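The R@1 metric quoted above can be illustrated with a short sketch. It assumes the usual benchmark convention that query i has exactly one ground-truth video, video i; the similarity values are made up for the example, and nothing here reproduces PIG itself.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy similarity matrix (hypothetical): 3 queries, 3 videos.
    sim = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print(recall_at_k(sim, k=1))
    print(recall_at_k(sim, k=3))
```

In a Two-Tower setup each entry of the matrix is a cheap dot product between independently encoded text and video features, which is why the abstract stresses adding no inference overhead relative to that baseline.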

[68] Comparative Evaluation of Traditional and Deep Learning Feature Matching Algorithms using Chandrayaan-2 Lunar Data

R. Makharia,J. G. Singla,Amitabh,N. Dube,H. Sharma

Main category: cs.CV

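TL;DR: The paper evaluates five feature matching algorithms on lunar data and finds that the deep learning matcher SuperGlue is more accurate and faster than classical methods, particularly under polar lighting.

Details Motivation: Accurate image registration is critical for lunar exploration, but differences among multi-sensor data (resolution, illumination, and distortion) make registration difficult.

Contribution: 1. Proposes a preprocessing pipeline that includes georeferencing, resolution alignment, and intensity normalization. 2. Compares classical and deep learning algorithms. 3. Finds that SuperGlue performs best in challenging conditions.

Method: Applies five algorithms (SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue) to cross-modality lunar image pairs, with preprocessing used to improve results.

Result: SuperGlue yields the lowest root mean square error and the fastest runtimes; classical methods perform well near the equator but degrade in polar regions.

Insight: Preprocessing and learning-based methods such as SuperGlue are important for image registration under difficult lighting.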

Abstract: Accurate image registration is critical for lunar exploration, enabling surface mapping, resource localization, and mission planning. Aligning data from diverse lunar sensors – optical (e.g., Orbital High Resolution Camera, Narrow and Wide Angle Cameras), hyperspectral (Imaging Infrared Spectrometer), and radar (e.g., Dual-Frequency Synthetic Aperture Radar, Selene/Kaguya mission) – is challenging due to differences in resolution, illumination, and sensor distortion. We evaluate five feature matching algorithms: SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue (a deep learning-based matcher), using cross-modality image pairs from equatorial and polar regions. A preprocessing pipeline is proposed, including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, principal component analysis, and shadow correction. SuperGlue consistently yields the lowest root mean square error and fastest runtimes. Classical methods such as SIFT and AKAZE perform well near the equator but degrade under polar lighting. The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions.
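For readers unfamiliar with the accuracy metric above, the following sketch computes a root mean square error over matched keypoints after registration. The abstract does not define how its RMSE is measured, so the choice of Euclidean residuals between corresponding points is an assumption, and the coordinates are invented.

```python
import math


def registration_rmse(points_a, points_b):
    """RMSE of Euclidean residuals between corresponding 2D points.

    points_a: keypoints from the registered (warped) image.
    points_b: the corresponding keypoints in the reference image.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2
        for (xa, ya), (xb, yb) in zip(points_a, points_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    warped = [(10.0, 10.0), (52.0, 31.0), (80.5, 12.0)]
    reference = [(10.0, 11.0), (50.0, 31.0), (80.5, 12.0)]
    print(registration_rmse(warped, reference))
```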

[69] Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation

Svetlana Pavlitska,Beyza Keskin,Alwin Faßbender,Christian Hubschneider,J. Marius Zöllner

Main category: cs.CV

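TL;DR: The paper studies how to extract predictive uncertainty estimates from mixture-of-experts (MoE) models for semantic segmentation and shows that MoEs give more reliable uncertainty estimates than ensembles under out-of-distribution (OOD) data.

Details Motivation: In safety-critical applications such as traffic scene perception, accurate and well-calibrated predictive uncertainty is essential for model reliability. The paper aims to extract such estimates efficiently from MoE models.

Contribution: 1. Shows that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural changes; 2. compares three methods: predictive entropy, mutual information, and expert variance; 3. finds that MoEs are more reliable than ensembles under OOD data; 4. shows that simple gating mechanisms give better-calibrated routing uncertainty.

Method: 1. Uses an MoE architecture in which a gating network dynamically weights expert predictions; 2. extracts uncertainty via predictive entropy, mutual information, and expert variance; 3. runs experiments on the A2D2 and Cityscapes datasets.

Result: MoEs outperform ensembles under OOD data, simple gates calibrate routing uncertainty better than more complex classwise gates, and adding experts can further improve calibration.

Insight: MoEs are efficient and also provide good uncertainty estimates; simple gating beats more complex gating for routing uncertainty, and the number of experts is an important factor in calibration.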

Abstract: Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantical split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
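The three uncertainty measures named above, plus gate entropy for routing uncertainty, can be sketched for a single pixel as follows. The formulas are the standard mixture-based definitions; weighting each expert by its gate weight is an assumption, since the abstract does not spell out the exact formulation used in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mixture(expert_probs, gate_weights):
    """Gate-weighted average of the experts' class distributions."""
    n_classes = len(expert_probs[0])
    return [
        sum(w * p[c] for w, p in zip(gate_weights, expert_probs))
        for c in range(n_classes)
    ]


def predictive_entropy(expert_probs, gate_weights):
    """Entropy of the combined prediction (total uncertainty)."""
    return entropy(mixture(expert_probs, gate_weights))


def mutual_information(expert_probs, gate_weights):
    """Total entropy minus the expected per-expert entropy (disagreement)."""
    expected = sum(w * entropy(p) for w, p in zip(gate_weights, expert_probs))
    return predictive_entropy(expert_probs, gate_weights) - expected


def expert_variance(expert_probs, gate_weights):
    """Gate-weighted variance of expert probabilities, averaged over classes."""
    mean = mixture(expert_probs, gate_weights)
    per_class = [
        sum(w * (p[c] - mean[c]) ** 2 for w, p in zip(gate_weights, expert_probs))
        for c in range(len(mean))
    ]
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    gates = [0.5, 0.5]                      # gating network output for one pixel
    agree = [[0.9, 0.1], [0.9, 0.1]]        # experts agree
    disagree = [[1.0, 0.0], [0.0, 1.0]]     # experts fully disagree
    for name, probs in (("agree", agree), ("disagree", disagree)):
        print(name, predictive_entropy(probs, gates),
              mutual_information(probs, gates), expert_variance(probs, gates))
    print("gate entropy", entropy(gates))   # routing uncertainty
```

When the experts agree, mutual information and expert variance are zero even though predictive entropy is not, which is what separates disagreement-driven uncertainty from the total.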

[70] Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution

Haosong Liu,Xiancheng Zhu,Huanqiang Zeng,Jianqing Zhu,Jiuwen Cao,Junhui Hou

Main category: cs.CV

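TL;DR: The paper proposes LFMT, a hybrid Mamba-Transformer framework that uses a Subspace Simple Scanning strategy and a dual-stage modeling strategy to capture non-local spatial-angular correlations efficiently in light field super-resolution, yielding clear performance gains.

Details Motivation: Existing Mamba-based methods for light field super-resolution (LFSR) extract features inefficiently and redundantly, and state space models have difficulty fully preserving spatial-angular and disparity information.

Contribution: 1. Proposes the Subspace Simple Scanning (Sub-SS) strategy and the Subspace Simple Mamba Block (SSMB) for more efficient feature extraction; 2. designs a dual-stage modeling strategy that combines SA-RSMB with EPMB/EPTB modules to explore non-local correlations more fully; 3. proposes the hybrid framework LFMT, which integrates the strengths of Mamba and Transformer.

Method: 1. Stage I: SA-RSMB extracts shallow spatial-angular features; 2. Stage II: parallel EPMB and EPTB branches refine deep epipolar-plane features; 3. the Sub-SS strategy and SSMB module provide the efficiency gains.

Result: LFMT outperforms existing methods on both real-world and synthetic light field datasets, with substantial performance gains and low computational complexity.

Insight: Combining Mamba with Transformer compensates for the limitations of each in LFSR; the dual-stage strategy and the parallel-branch design are key to the gains.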

Abstract: Recently, Mamba-based methods, with their advantages in long-range information modeling and linear complexity, have shown great potential in optimizing both computational cost and performance of light field image super-resolution (LFSR). However, current multi-directional scanning strategies lead to inefficient and redundant feature extraction when applied to complex LF data. To overcome this challenge, we propose a Subspace Simple Scanning (Sub-SS) strategy, based on which we design the Subspace Simple Mamba Block (SSMB) to achieve more efficient and precise feature extraction. Furthermore, we propose a dual-stage modeling strategy to address the limitation of state space in preserving spatial-angular and disparity information, thereby enabling a more comprehensive exploration of non-local spatial-angular correlations. Specifically, in stage I, we introduce the Spatial-Angular Residual Subspace Mamba Block (SA-RSMB) for shallow spatial-angular feature extraction; in stage II, we use a dual-branch parallel structure combining the Epipolar Plane Mamba Block (EPMB) and Epipolar Plane Transformer Block (EPTB) for deep epipolar feature refinement. Building upon meticulously designed modules and strategies, we introduce a hybrid Mamba-Transformer framework, termed LFMT. LFMT integrates the strengths of Mamba and Transformer models for LFSR, enabling comprehensive information exploration across spatial, angular, and epipolar-plane domains. Experimental results demonstrate that LFMT significantly outperforms current state-of-the-art methods in LFSR, achieving substantial improvements in performance while maintaining low computational complexity on both real-world and synthetic LF datasets.

[71] PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai,Wenxuan Cheng,Jiedong Zhuang,Jiang-jiang Liu,Hongshen Zhao,Zhenhua Feng,Wankou Yang

Main category: cs.CV

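TL;DR: PropVG is an end-to-end proposal-based visual grounding framework that, according to the authors, is the first to integrate foreground object proposal generation with referential object comprehension without additional detectors. It also introduces a contrastive learning module and a multi-granularity target discrimination module to improve performance.

Details Motivation: Existing visual grounding methods mostly follow an end-to-end direct reference paradigm, which ignores supervision from prospective candidate targets and lacks multi-granularity discrimination, making object identification less robust in complex scenes.

Contribution: 1. Proposes the first end-to-end proposal-driven visual grounding framework that needs no additional detector; 2. designs the Contrastive-based Refer Scoring (CRS) module to strengthen understanding and discrimination of referred objects; 3. proposes the Multi-granularity Target Discrimination (MTD) module, which fuses object-level and semantic-level information.

Method: 1. Integrates proposal generation and referential comprehension in one end-to-end framework; 2. applies contrastive learning at both sentence and word levels in the CRS module; 3. fuses object-level and semantic-level information in the MTD module.

Result: Experiments on several benchmarks (gRefCOCO, Ref-ZOM, and others) demonstrate the effectiveness of PropVG.

Insight: Combining multi-granularity discrimination with contrastive learning improves visual grounding in complex scenes, and a proposal-driven framework can remain efficient without additional detectors.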

Abstract: Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.

[72] TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution

Yifei Jia,Shiyu Cheng,Yu Dong,Guan Li,Dong Tian,Ruixiao Peng,Xuyi Lu,Yu Wang,Wei Yao,Guihua Shan

Main category: cs.CV

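TL;DR: TemporalFlowViz is a parameter-aware visual analytics system for interpreting time-varying flow field data from scramjet combustion simulations. It combines deep learning, dimensionality reduction, and density-based clustering to support expert-driven clustering and pattern discovery.

Details Motivation: Scramjet combustion data are large and high-dimensional, which makes visual interpretation, feature differentiation, and cross-case comparison difficult with conventional methods, so an efficient analysis tool is needed.

Contribution: Presents the TemporalFlowViz system, which integrates ViT feature extraction, dimensionality reduction, density-based clustering, and a vision-language model to support expert-driven analysis of combustion modes and natural-language summaries.

Method: Pretrained Vision Transformers extract high-dimensional embeddings; dimensionality reduction and density-based clustering reveal latent modes; temporal trajectories are built in the embedding space; expert annotations serve as prompts for a vision-language model that generates descriptive summaries.

Result: Case studies and expert feedback validate the system, indicating that it improves hypothesis generation, interpretable pattern discovery, and knowledge discovery in large-scale combustion analysis.

Insight: Combining deep learning with expert knowledge helps with the analysis of high-dimensional time-varying data; coordinated multi-view interaction and natural-language generation are key to usability.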

Abstract: Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis.

[73] Pose-Free 3D Quantitative Phase Imaging of Flowing Cellular Populations

Enze Ye,Wei Lin,Shaochi Ren,Yakun Liu,Xiaoping Li,Hao Wang,He Sun,Feng Pan

Main category: cs.CV

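TL;DR: OmniFHT is a pose-free 3D quantitative phase imaging framework that combines the Fourier diffraction theorem with implicit neural representations (INRs) to enable high-throughput tomographic imaging of flowing cell populations, supporting arbitrary cell geometries and multi-axis rotations.

Details Motivation: Current 3D quantitative phase imaging methods assume that cells undergo uniform single-axis rotation during flow, which limits imaging of irregularly shaped cells and restricts analysis to a subset of the population. OmniFHT aims to remove this limitation.

Contribution: OmniFHT is a pose-free 3D RI reconstruction framework that, according to the authors, enables high-throughput tomographic imaging of entire flowing cell populations for the first time, with high-fidelity reconstruction from sparse projections and restricted angular coverage.

Method: Combines the Fourier diffraction theorem with INRs and jointly optimizes each cell's unknown rotational trajectory and volumetric structure, supporting arbitrary geometries and multi-axis rotations.

Result: OmniFHT produces high-fidelity reconstructions with as few as 10 views or only 120 degrees of angular range.

Insight: The method removes a key restriction of conventional imaging and offers an unbiased approach to morphometric analysis in flow cytometry.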

Abstract: High-throughput 3D quantitative phase imaging (QPI) in flow cytometry enables label-free, volumetric characterization of individual cells by reconstructing their refractive index (RI) distributions from multiple viewing angles during flow through microfluidic channels. However, current imaging methods assume that cells undergo uniform, single-axis rotation, which requires their poses to be known at each frame. This assumption restricts applicability to near-spherical cells and prevents accurate imaging of irregularly shaped cells with complex rotations. As a result, only a subset of the cellular population can be analyzed, limiting the ability of flow-based assays to perform robust statistical analysis. We introduce OmniFHT, a pose-free 3D RI reconstruction framework that leverages the Fourier diffraction theorem and implicit neural representations (INRs) for high-throughput flow cytometry tomographic imaging. By jointly optimizing each cell’s unknown rotational trajectory and volumetric structure under weak scattering assumptions, OmniFHT supports arbitrary cell geometries and multi-axis rotations. Its continuous representation also allows accurate reconstruction from sparsely sampled projections and restricted angular coverage, producing high-fidelity results with as few as 10 views or only 120 degrees of angular range. OmniFHT enables, for the first time, in situ, high-throughput tomographic imaging of entire flowing cell populations, providing a scalable and unbiased solution for label-free morphometric analysis in flow cytometry platforms.

[74] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning

Trixia Simangan,Ahmed Nadeem Abbasi,Yipeng Hu,Shaheer U. Saeed

Main category: cs.CV

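TL;DR: The paper proposes Cryo-RL, a reinforcement learning framework that automates cryoablation planning for prostate cancer, addressing the fact that manual planning is time-consuming and dependent on expert experience.

Details Motivation: Cryoablation is a minimally invasive treatment for prostate cancer whose success depends heavily on accurate planning of cryoprobe placement. Manual planning is currently slow and expertise-dependent, which leads to variable treatment quality.

Contribution: Proposes the Cryo-RL framework, which uses reinforcement learning to model the cryoablation planning process and learn a cryoprobe placement strategy without manually designed plans.

Method: Cryoablation planning is modeled as a Markov decision process. An agent learns a cryoprobe placement policy in a simulated environment, guided by a reward function based on tumour coverage.

Result: On 583 retrospective prostate cancer cases, Cryo-RL improves Dice by over 8 percentage points compared with geometric-optimisation baselines and matches human expert performance while requiring substantially less planning time.

Insight: Reinforcement learning shows potential to deliver clinically viable, efficient, and reproducible cryoablation plans.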

Abstract: Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue by freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved Dice improvements of over 8 percentage points compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
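The quantities that drive the planning problem above (tumour coverage as the reward signal and Dice as the evaluation metric) can be sketched on a small voxel grid. This is a toy illustration only: the grid size, the spherical tumour, and the helper names are assumptions, and the paper's simulator, clinical constraints, and RL agent are not reproduced.

```python
def sphere_mask(shape, center, diameter):
    """Set of voxel coordinates inside a sphere (stand-in for an ice ball)."""
    radius_sq = (diameter / 2.0) ** 2
    cx, cy, cz = center
    return {
        (x, y, z)
        for x in range(shape[0])
        for y in range(shape[1])
        for z in range(shape[2])
        if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius_sq
    }


def ablated_region(shape, probes):
    """Union of ice spheres; each probe is (center, diameter)."""
    region = set()
    for center, diameter in probes:
        region |= sphere_mask(shape, center, diameter)
    return region


def tumour_coverage(tumour, ablated):
    """Fraction of tumour voxels covered; a coverage-style reward signal."""
    return len(tumour & ablated) / len(tumour)


def dice(a, b):
    """Dice overlap between two voxel sets."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    grid = (10, 10, 10)
    tumour = sphere_mask(grid, (5, 5, 5), 4)
    plan = [((5, 5, 5), 6)]  # one oversized ice sphere (hypothetical plan)
    ablated = ablated_region(grid, plan)
    print(tumour_coverage(tumour, ablated), dice(tumour, ablated))
```

The oversized sphere covers the whole tumour but scores below 1 on Dice, which shows why coverage alone would not penalise damage to surrounding tissue.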

Dominik Pegler,David Steyrl,Mengfan Zhang,Alexander Karner,Jozsef Arato,Frank Scharnowski,Filip Melinscak

Main category: cs.CV

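TL;DR: The study investigates whether pretrained computer vision models can predict fear ratings for spider-related images. Three models adapted through transfer learning predict the ratings well, and the results underline the importance of explainability and dataset size.

Details Motivation: Advances in computer vision open new possibilities for clinical applications, such as dynamically adjusting visual stimuli during therapy, so their ability to predict fear needs to be validated.

Contribution: 1) Shows that pretrained models can predict fear ratings; 2) confirms through explainability analysis that the models rely on spider-related features; 3) finds that dataset size has a critical effect on performance.

Method: Adapts three pretrained models via transfer learning to predict fear ratings for 313 images from a standardized dataset, using cross-validation and explainability analysis.

Result: The models achieve an average mean absolute error (MAE) between 10.1 and 11.0 on a 0-100 scale; shrinking the dataset significantly degrades performance. Explainability assessments show that predictions are based on spider features.

Insight: Explainability and a sufficiently large dataset are key to developing emotion-aware therapeutic technology; visual conditions such as distant views or artificial spiders are associated with higher prediction errors.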

Abstract: Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three diverse models using transfer learning to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. The models were evaluated using cross-validation, achieving an average mean absolute error (MAE) between 10.1 and 11.0. Our learning curve analysis revealed that reducing the dataset size significantly harmed performance, though further increases yielded no substantial gains. Explainability assessments showed the models’ predictions were based on spider-related features. A category-wise error analysis further identified visual conditions associated with higher errors (e.g., distant views and artificial/painted spiders). These findings demonstrate the potential of explainable computer vision models in predicting fear ratings, highlighting the importance of both model explainability and a sufficient dataset size for developing effective emotion-aware therapeutic technologies.
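The evaluation protocol above (cross-validated mean absolute error on a 0-100 rating scale) can be sketched as follows. The interleaved fold assignment and the mean-rating baseline are assumptions chosen to keep the example self-contained; the paper's actual folds and fine-tuned models are not described in the abstract.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between ratings and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_items, k):
    """Interleaved k-fold split; returns a list of (train, test) index lists."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def mean_baseline(ratings):
    """Trivial stand-in model: predict the mean rating of the training fold."""
    def fit_predict(train, test):
        mean = sum(ratings[i] for i in train) / len(train)
        return [mean] * len(test)
    return fit_predict


def cross_validated_mae(ratings, k, fit_predict):
    """Average of the per-fold MAE values."""
    fold_maes = []
    for train, test in k_fold_indices(len(ratings), k):
        preds = fit_predict(train, test)
        fold_maes.append(mean_absolute_error([ratings[i] for i in test], preds))
    return sum(fold_maes) / len(fold_maes)


if __name__ == "__main__":
    ratings = [12.0, 55.0, 71.0, 30.0, 88.0, 43.0]  # made-up fear ratings
    print(cross_validated_mae(ratings, 3, mean_baseline(ratings)))
```

A constant-mean baseline like this one gives a reference point for judging whether an MAE of roughly 10 on a 100-point scale reflects real predictive signal.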

[76] SynGen-Vision: Synthetic Data Generation for training industrial vision models

Alpana Dubey,Suma Mani Kuriakose,Nitish Bhardwaj

Main category: cs.CV

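TL;DR: Proposes SynGen-Vision, which trains industrial vision models on synthetic data to detect wear and rust on industrial equipment. The method combines a vision language model with a 3D simulation and rendering engine to generate data for varying rust conditions; the resulting model reaches an mAP50 of 0.87 on real images of rusted objects.

Details Motivation: Wear and tear detection is an important task for predictive maintenance, but collecting real data is expensive and time-consuming, and existing datasets do not cover the range of wear scenarios. An efficient way to generate synthetic data is therefore needed.

Contribution: 1. Proposes a synthetic data generation approach that combines a vision language model with a 3D simulation engine; 2. shows that the synthetic data are effective for rust detection, outperforming other approaches; 3. the approach can be extended to other industrial wear and tear detection scenarios.

Method: 1. Uses a vision language model together with a 3D simulation and rendering engine to produce images of industrial objects under varying rust conditions; 2. trains a computer vision model for rust detection on the generated dataset; 3. tests the trained model on real images of rusted industrial objects.

Result: On rust detection, the model trained on the synthetic data reaches an mAP50 of 0.87 and outperforms other approaches.

Insight: 1. Synthetic data can compensate for a shortage of real data; 2. pairing a vision language model with 3D simulation gives flexibility for industrial vision tasks; 3. the approach can be carried over to similar scenarios.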

Abstract: We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection using the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with an mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.

[77] Evaluating Multiple Instance Learning Strategies for Automated Sebocyte Droplet Counting

Maryam Adelipour,Gustavo Carneiro,Jeongkwon Kim

Main category: cs.CV

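TL;DR: The paper introduces an attention-based multiple instance learning (MIL) framework for automated lipid droplet counting in sebocytes and compares it with a baseline multi-layer perceptron (MLP). The MLP is more stable, while the MIL model is less consistent but occasionally better in specific folds.

Details Motivation: Manual counting of sebocyte lipid droplets is labor-intensive and subjective, which motivates automated solutions.

Contribution: Introduces an attention-based MIL framework and benchmarks it experimentally against a baseline MLP.

Method: An attention-based MIL model built on ResNet-50 features with instance weighting is compared with an MLP trained on aggregated patch-level counts.

Result: Under five-fold cross-validation, the MLP is more stable (mean MAE = 5.6), whereas the attention-based MIL model has a higher mean error (mean MAE = 10.7) and is less consistent, although it is occasionally superior in specific folds.

Insight: Simple bag-level aggregation is a robust baseline, while attention-based MIL needs task-aligned pooling and regularization to realize its potential.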

Abstract: Sebocytes are lipid-secreting cells whose differentiation is marked by the accumulation of intracellular lipid droplets, making their quantification a key readout in sebocyte biology. Manual counting is labor-intensive and subjective, motivating automated solutions. Here, we introduce a simple attention-based multiple instance learning (MIL) framework for sebocyte image analysis. Nile Red-stained sebocyte images were annotated into 14 classes according to droplet counts, expanded via data augmentation to about 50,000 cells. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model leveraging ResNet-50 features with instance weighting. Experiments using five-fold cross-validation showed that the baseline MLP achieved more stable performance (mean MAE = 5.6) compared with the attention-based MIL, which was less consistent (mean MAE = 10.7) but occasionally superior in specific folds. These findings indicate that simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.
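The contrast drawn above between simple bag-level aggregation and attention-based instance weighting can be sketched as follows. The attention form used here (a tanh hidden layer followed by a softmax over instances) is the common attention-MIL pooling and is an assumption about the model; the abstract does not give the architecture, and the small matrices stand in for learned parameters and ResNet-50 features.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(instances):
    """Baseline bag embedding: unweighted mean of the instance features."""
    dim = len(instances[0])
    return [sum(h[d] for h in instances) / len(instances) for d in range(dim)]


def attention_pool(instances, proj, query):
    """Attention-MIL pooling.

    instances: list of feature vectors (one per image patch).
    proj: hidden x dim matrix, query: hidden-sized vector (both would be learned).
    Returns (bag_embedding, attention_weights).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(r * x for r, x in zip(row, h))) for row in proj]
        scores.append(sum(q * v for q, v in zip(query, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [sum(w * h[d] for w, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


if __name__ == "__main__":
    patches = [[0.0, 0.0], [5.0, 0.0], [1.0, 2.0]]   # toy patch features
    proj = [[1.0, 0.0], [0.0, 1.0]]
    query = [1.0, 0.0]
    bag, weights = attention_pool(patches, proj, query)
    print(weights, bag, mean_pool(patches))
```

With a zero query vector the attention weights become uniform and the result equals mean pooling, so the baseline aggregation is a special case that the attention model must learn to improve on.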

[78] UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

Haowang Cui,Rui Chen,Tao Luo,Rui Li,Jiaze Wang

Main category: cs.CV

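TL;DR: UniView is a new method that strengthens novel view synthesis from a single image by using reference images of similar objects, which reduces distortion.

Details Motivation: Current single-image novel view synthesis methods usually rely on ambiguous priors and interpolation for unobserved regions, which leads to severe distortions.

Contribution: Proposes the UniView model, which selects reference images through a retrieval and augmentation system assisted by a multimodal large language model (MLLM), and introduces an adapter module that dynamically generates reference features together with a decoupled triple attention mechanism.

Method: Builds a retrieval and augmentation system that uses an MLLM to select reference images; designs a plug-and-play adapter module with multi-level isolation layers; and uses a decoupled triple attention mechanism to align and integrate multi-branch features.

Result: UniView improves novel view synthesis performance on challenging datasets and outperforms state-of-the-art methods.

Insight: Bringing in external reference information and generating reference features dynamically can mitigate distortion in single-image view synthesis.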

Abstract: The task of synthesizing novel views from a single image is highly ill-posed due to multiple explanations for unobserved areas. Most current methods tend to generate unseen regions from ambiguous priors and interpolation near input views, which often leads to severe distortions. To address this limitation, we propose a novel model dubbed UniView, which can leverage reference images from a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, a plug-and-play adapter module with multi-level isolation layers is introduced to dynamically generate reference features for the target views. Moreover, in order to preserve the details of an original input image, we design a decoupled triple attention mechanism, which can effectively align and integrate multi-branch features into the synthesis process. Extensive experiments have demonstrated that our UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.

[79] Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper

Gehui Chen,Guan’an Wang,Xiaowen Huang,Jitao Sang

Main category: cs.CV

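TL;DR: The paper proposes an efficient video-to-audio generation method based on a Multiple Foundation Model Mapper (MFM-Mapper), which fuses features from dual visual encoders and improves feature alignment, raising both training efficiency and generation quality.

Details Motivation: Existing video-to-audio methods that train models from scratch are resource intensive. Leveraging the cross-modal knowledge transfer of foundation models is an efficient alternative. Prior work fine-tuned a lightweight mapper to connect a visual encoder with a text-to-audio model, but it leaves room for improvement.

Contribution: 1. Introduces MFM-Mapper, which fuses features from dual visual encoders for richer semantic and temporal information; 2. replaces the linear mapper with GPT-2, improving feature alignment by analogy with autoregressive translation; 3. improves training efficiency, reaching competitive performance with only 16% of the training scale.

Method: MFM-Mapper 1. extracts features with dual visual encoders; 2. uses GPT-2 as the mapper to improve cross-modal feature alignment; 3. relies on the generalization ability of foundation models to reduce training cost.

Result: Experiments show that MFM-Mapper achieves better semantic and temporal consistency and, with only 16% of the training scale of previous mapper-based work, remains competitive with models trained at much larger scale.

Insight: 1. Fusing features from multiple encoders can improve cross-modal tasks; 2. autoregressive models such as GPT-2 show promise for feature alignment; 3. combining foundation models is an efficient route to complex tasks.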

Abstract: Recent Video-to-Audio (V2A) generation relies on extracting semantic and temporal features from video to condition generative models. Training these models from scratch is resource intensive. Consequently, leveraging foundation models (FMs) has gained traction due to their cross-modal knowledge transfer and generalization capabilities. One prior work has explored fine-tuning a lightweight mapper network to connect a pre-trained visual encoder with a text-to-audio generation model for V2A. Inspired by this, we introduce the Multiple Foundation Model Mapper (MFM-Mapper). Compared to the previous mapper approach, MFM-Mapper benefits from richer semantic and temporal information by fusing features from dual visual encoders. Furthermore, by replacing a linear mapper with GPT-2, MFM-Mapper improves feature alignment, drawing parallels between cross-modal feature mapping and autoregressive translation tasks. Our MFM-Mapper exhibits remarkable training efficiency. It achieves better semantic and temporal consistency at lower training cost, requiring only 16% of the training scale of previous mapper-based work, while remaining competitive with models trained on a much larger scale.

[80] Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework

Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui

Main category: cs.CV

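TL;DR: The paper proposes GD^2Fusion, a dual-domain (frequency/spatial) framework for infrared and visible image fusion that uses the degradation perception of vision-language models (VLMs) to overcome the limitations of existing methods in dual-source degraded scenarios.

Details Motivation: Current infrared-visible image fusion methods usually assume high-quality inputs and struggle with dual-source degradation. They also require manually selecting several pre-enhancement steps, which causes error accumulation and performance loss.

Contribution: Proposes the GD^2Fusion framework, which uses VLMs for degradation perception and combines them with joint frequency-domain and spatial-domain optimization for more robust infrared and visible image fusion.

Method: 1. The Guided Frequency Modality-Specific Extraction (GFMSE) module performs degradation perception and suppression in the frequency domain; 2. the Guided Spatial Modality-Aggregated Fusion (GSMAF) module performs cross-modal degradation filtering and adaptive feature aggregation in the spatial domain.

Result: Experiments show that GD^2Fusion outperforms existing fusion algorithms and strategies in dual-source degraded scenarios.

Insight: Pairing VLM-based degradation perception with dual-domain joint optimization improves the robustness of image fusion in complex scenes and suggests a direction for multimodal image processing.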

Abstract: Most existing infrared-visible image fusion (IVIF) methods assume high-quality inputs, and therefore struggle to handle dual-source degraded scenarios, typically requiring manual selection and sequential application of multiple pre-enhancement steps. This decoupled pre-enhancement-to-fusion pipeline inevitably leads to error accumulation and performance degradation. To overcome these limitations, we propose Guided Dual-Domain Fusion (GD^2Fusion), a novel framework that synergistically integrates vision-language models (VLMs) for degradation perception with dual-domain (frequency/spatial) joint optimization. Concretely, the designed Guided Frequency Modality-Specific Extraction (GFMSE) module performs frequency-domain degradation perception and suppression and discriminatively extracts fusion-relevant sub-band features. Meanwhile, the Guided Spatial Modality-Aggregated Fusion (GSMAF) module carries out cross-modal degradation filtering and adaptive multi-source feature aggregation in the spatial domain to enhance modality complementarity and structural consistency. Extensive qualitative and quantitative experiments demonstrate that GD^2Fusion achieves superior fusion performance compared with existing algorithms and strategies in dual-source degraded scenarios. The code will be publicly released after acceptance of this paper.

[81] A biologically inspired separable learning vision model for real-time traffic object perception in Dark

Hulin Li,Qiliang Ren,Jun Li,Hanbing Wei,Zheng Liu,Linfang Fan

Main category: cs.CV

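TL;DR: The paper proposes the Separable Learning Vision Model (SLVM), a biologically inspired model for real-time object perception in low-light traffic scenes, and builds Dark-traffic, the largest densely annotated low-light traffic dataset to date.

Details Motivation: Object perception in low-light traffic scenes is hard because of severe illumination degradation, the lack of reliable visual cues, and the absence of a large-scale benchmark dataset.

Contribution: 1. Introduces the Dark-traffic dataset, which supports object detection, instance segmentation, and optical flow estimation; 2. Designs the biologically inspired SLVM, which improves perception under low-light conditions.

Method: SLVM has four key components: a light-adaptive pupillary mechanism (illumination-sensitive feature extraction), a feature-level separable learning strategy (efficient representation), task-specific decoupled branches (multi-task learning), and a spatial misalignment-aware fusion module (multi-feature alignment).

Result: On Dark-traffic, SLVM outperforms RT-DETR by 11.2 percentage points in detection and YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the baseline's optical-flow endpoint error (EPE) by 12.37%. On the LIS benchmark, SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points.

Insight: Biologically inspired separable learning and task-decoupled design can substantially improve perception in low-light scenes while reducing computational overhead.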

Abstract: Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low-light environments. Moreover, no large-scale benchmark specifically focused on low-light traffic scenes is available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection and YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the endpoint error (EPE) of the baseline by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code are released at https://github.com/alanli1997/slvm.
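The endpoint error (EPE) quoted for optical flow is conventionally the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch of that metric follows; the function name and the list-of-pairs input format are assumptions made for illustration, not the paper's evaluation code.

```python
import math


def endpoint_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and true (u, v) flow vectors."""
    if len(pred_flow) != len(true_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    distances = [
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)
    ]
    return sum(distances) / len(distances)


def relative_reduction(baseline_epe, new_epe):
    """Relative EPE reduction, e.g. 0.1237 for a 12.37% reduction."""
    return (baseline_epe - new_epe) / baseline_epe
```

A lower EPE is better, so a positive `relative_reduction` means the new model improves on the baseline.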

[82] Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition

Mohsine El Khayati,Ayyad Maafiri,Yassine Himeur,Hamzah Ali Alkhazaleh,Shadi Atalla,Wathiq Mansoor

Main category: cs.CV

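TL;DR: The paper studies how combining transfer learning with lightweight mobile-enabled convolutional neural networks (MbNets) can improve Arabic Handwritten Character Recognition (AHCR). MobileNet is the top-performing model overall and ShuffleNet stands out in generalization; the highest accuracy, 99%, is obtained on the IFHCDB dataset with MnasNet under full fine-tuning.

Details Motivation: AHCR faces heavy computational requirements and dataset scarcity. The paper addresses both with transfer learning and lightweight networks.

Contribution: Experimentally validates the combination of transfer learning and lightweight networks, establishes MobileNet's advantage in accuracy and efficiency, and outlines directions for future optimization.

Method: Evaluates three training strategies (full fine-tuning, partial fine-tuning, and training from scratch) with four lightweight networks (MobileNet, SqueezeNet, MnasNet, ShuffleNet) on three benchmark datasets (AHCD, HIJJA, IFHCDB).

Result: MobileNet performs best overall. On IFHCDB, MnasNet under full fine-tuning reaches 99% accuracy. Full fine-tuning gives the best overall performance, while partial fine-tuning underperforms.

Insight: Lightweight networks combined with transfer learning are an effective way to improve AHCR in resource-constrained settings; architectural modifications and data augmentation may improve performance further.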

Abstract: The study explores the integration of transfer learning (TL) with mobile-enabled convolutional neural networks (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). Addressing challenges like extensive computational requirements and dataset scarcity, this research evaluates three TL strategies (full fine-tuning, partial fine-tuning, and training from scratch) using four lightweight MbNets: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Experiments were conducted on three benchmark datasets: AHCD, HIJJA, and IFHCDB. MobileNet emerged as the top-performing model, consistently achieving superior accuracy, robustness, and efficiency, with ShuffleNet excelling in generalization, particularly under full fine-tuning. The IFHCDB dataset yielded the highest results, with 99% accuracy using MnasNet under full fine-tuning, highlighting its suitability for robust character recognition. The AHCD dataset achieved competitive accuracy (97%) with ShuffleNet, while HIJJA posed significant challenges due to its variability, achieving a peak accuracy of 92% with ShuffleNet. Notably, full fine-tuning demonstrated the best overall performance, balancing accuracy and convergence speed, while partial fine-tuning underperformed across metrics. These findings underscore the potential of combining TL and MbNets for resource-efficient AHCR, paving the way for further optimizations and broader applications. Future work will explore architectural modifications, in-depth dataset feature analysis, data augmentation, and advanced sensitivity analysis to enhance model robustness and generalizability.

[83] LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao,Xianhang Cheng,Jingyuan Liu,Yujian Zheng,Zhenhui Lin,Meriem Chkir,Hao Li

Main category: cs.CV

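TL;DR: LUIVITON is an end-to-end virtual try-on system that automatically drapes complex multi-layer clothing onto diverse, arbitrarily posed humanoid characters. It splits the problem into clothing-to-SMPL and body-to-SMPL correspondence tasks and solves them with geometric learning and a diffusion model, respectively.

Details Motivation: Virtual try-on for diverse, arbitrarily posed humanoid characters is challenging, especially aligning garments with complex body shapes. A general solution is needed that handles complex geometry and a wide range of characters.

Contribution: 1. Proposes LUIVITON, an end-to-end virtual try-on system that supports multi-layer clothing and diverse humanoid characters; 2. Separates the clothing-to-SMPL and body-to-SMPL correspondence tasks and addresses them with geometric learning and a diffusion model, respectively; 3. Supports fast customization of clothing size and material properties.

Method: 1. Uses SMPL as a proxy representation and decomposes the problem into clothing-to-SMPL and body-to-SMPL correspondence; 2. Clothing-to-SMPL uses a geometric learning-based approach to partial-to-complete shape correspondence prediction; 3. Body-to-SMPL uses a diffusion model with multi-view consistent appearance features and a pre-trained 2D foundation model.

Result: The system handles complex geometries and non-manifold meshes efficiently, generalizes to humans, robots, cartoon subjects, creatures, and aliens, and does not require 2D clothing sewing patterns.

Insight: Separating the correspondence tasks and applying a different method to each yields an efficient and general virtual try-on pipeline that still lets users customize garment properties.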

Abstract: We present LUIVITON, an end-to-end system for fully automated virtual try-on, capable of draping complex, multi-layer clothing onto diverse and arbitrarily posed humanoid characters. To address the challenge of aligning complex garments with arbitrary and highly diverse body shapes, we use SMPL as a proxy representation and separate the clothing-to-body draping problem into two correspondence tasks: 1) clothing-to-SMPL and 2) body-to-SMPL correspondence, where each has its unique challenges. While we address the clothing-to-SMPL fitting problem using a geometric learning-based approach for partial-to-complete shape correspondence prediction, we introduce a diffusion model-based approach for body-to-SMPL correspondence using multi-view consistent appearance features and a pre-trained 2D foundation model. Our method can handle complex geometries, non-manifold meshes, and generalizes effectively to a wide range of humanoid characters – including humans, robots, cartoon subjects, creatures, and aliens, while maintaining computational efficiency for practical adoption. In addition to offering a fully automatic fitting solution, LUIVITON supports fast customization of clothing size, allowing users to adjust clothing sizes and material properties after they have been draped. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available.

[84] Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization

Jingqi Wu,Hanxi Li,Lin Yuanbo Wu,Hao Chen,Deyin Liu,Peng Wang

Main category: cs.CV

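TL;DR: The paper proposes ADClick and ADClick-Seg, which use user clicks and textual descriptions to generate pixel-level anomaly annotations efficiently, substantially improving industrial anomaly detection and reaching state-of-the-art results on MVTec AD.

Details Motivation: Industrial anomaly detection is usually trained only on non-defective samples. Defective samples can be collected, but using them requires costly pixel-level annotation, which limits scalability. The paper addresses this problem.

Contribution: 1. Proposes ADClick, which generates pixel-level anomaly annotations from a few user clicks and a brief textual description; 2. Proposes ADClick-Seg, a cross-modal framework that combines visual features with textual prompts.

Method: 1. ADClick: an Interactive Image Segmentation (IIS) algorithm that generates annotations from user input; 2. ADClick-Seg: aligns visual and textual features via prototypes and combines pixel-level priors with language-guided cues.

Result: On MVTec AD, ADClick reaches AP = 96.1%, and ADClick-Seg reaches AP = 80.0%, PRO = 97.5%, and Pixel-AUROC = 99.1% on the Multi-class task.

Insight: Combining vision and language can substantially improve the efficiency and accuracy of industrial anomaly detection, and interactive annotation is an effective answer to high labeling costs.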

Abstract: Industrial product inspection is often performed using Anomaly Detection (AD) frameworks trained solely on non-defective samples. Although defective samples can be collected during production, leveraging them usually requires pixel-level annotations, limiting scalability. To address this, we propose ADClick, an Interactive Image Segmentation (IIS) algorithm for industrial anomaly detection. ADClick generates pixel-wise anomaly annotations from only a few user clicks and a brief textual description, enabling precise and efficient labeling that significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD). We further introduce ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach for anomaly detection and localization. By combining pixel-level priors with language-guided cues, ADClick-Seg achieves state-of-the-art results on the challenging "Multi-class" AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD).

[85] Scale-interaction transformer: a hybrid cnn-transformer model for facial beauty prediction

Djamel Eddine Boukhari

Main category: cs.CV

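TL;DR: The paper introduces the Scale-Interaction Transformer (SIT), a hybrid CNN-Transformer architecture for facial beauty prediction (FBP) that models interactions between multi-scale facial features and reaches a Pearson correlation of 0.9187 on SCUT-FBP5500.

Details Motivation: FBP depends on a complex interplay of local and global facial features. CNNs extract features well but often process information at a fixed scale, so they can overlook inter-dependencies between features at different levels of granularity.

Contribution: Proposes SIT, which combines CNN feature extraction with the relational modeling of Transformers, and reports a new state of the art on the SCUT-FBP5500 benchmark.

Method: A multi-scale module with parallel convolutions captures facial characteristics at varying receptive fields. The multi-scale representations are framed as a sequence and processed by a Transformer encoder, whose self-attention models their interactions and contextual relationships.

Result: On SCUT-FBP5500, SIT achieves a Pearson correlation of 0.9187, outperforming previous methods.

Insight: Explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP, and hybrid CNN-Transformer models are promising for image regression tasks that need holistic, context-aware understanding.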

Abstract: Automated Facial Beauty Prediction (FBP) is a challenging computer vision task due to the complex interplay of local and global facial features that influence human perception. While Convolutional Neural Networks (CNNs) excel at feature extraction, they often process information at a fixed scale, potentially overlooking the critical inter-dependencies between features at different levels of granularity. To address this limitation, we introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. The SIT first employs a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields. These multi-scale representations are then framed as a sequence and processed by a Transformer encoder, which explicitly models their interactions and contextual relationships via a self-attention mechanism. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. It achieves a Pearson Correlation of 0.9187, outperforming previous methods. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP. The success of the SIT architecture highlights the potential of hybrid CNN-Transformer models for complex image regression tasks that demand a holistic, context-aware understanding.
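The headline number above is a Pearson correlation between predicted and human-rated scores. As a minimal sketch of how that statistic is computed (the function name and the plain-list inputs are assumptions for illustration, not the paper's evaluation script):

```python
import math


def pearson_correlation(predicted, target):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_t = sum(target) / len(target)
    # Covariance and variances, all left unnormalised because the
    # normalisation factors cancel in the ratio below.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)
```

The value lies in [-1, 1]; it measures linear agreement between predictions and ratings and is insensitive to a constant offset or positive rescaling of either list.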

[86] SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Chaolei Wang,Yang Luo,Jing Du,Siyu Chen,Yiping Chen,Ting Han

Main category: cs.CV

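TL;DR: The paper proposes SGS-3D, which achieves high-fidelity 3D instance segmentation by splitting and growing reliable semantic masks, mitigating the error accumulation of conventional 2D-to-3D lifting methods.

Details Motivation: Existing 3D instance segmentation methods based on 2D-to-3D lifting suffer from ambiguous semantic guidance and insufficient depth constraints, which lowers segmentation precision. A new method is needed to improve segmentation quality.

Contribution: 1. Proposes a "split-then-grow" framework (SGS-3D) that jointly uses semantic and geometric information to refine 3D instance segmentation; 2. Introduces a mask filtering strategy that uses geometric primitives to remove ambiguous masks; 3. Builds fine-grained object instances from spatial continuity and high-level features.

Method: 1. Purify and split ambiguous lifted masks; 2. Filter unreliable semantic masks using geometric primitives; 3. Grow complete instances using spatial continuity and high-level features.

Result: On ScanNet200, ScanNet++, and KITTI-360, SGS-3D substantially improves segmentation accuracy and robustness and generalizes across diverse indoor and outdoor environments.

Insight: Jointly using semantic and geometric information improves 3D instance segmentation accuracy, especially in semantically ambiguous scenes.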

Abstract: Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggles to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic masks for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.

[87] SL-SLR: Self-Supervised Representation Learning for Sign Language Recognition

Ariel Basso Madjoukeng,Jérôme Fink,Pierre Poitier,Edith Belise Kenmogne,Benoit Frenay

Main category: cs.CV

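TL;DR: The paper proposes SL-SLR, a self-supervised learning framework that combines a new approach with free-negative pairs and a new data augmentation technique to address the limitations of contrastive learning in sign language recognition, yielding a considerable gain in accuracy.

Details Motivation: Conventional contrastive learning has two problems in sign language recognition: it treats all parts of a video equally, ignoring that some parts are more relevant than others, and movements shared between different signs make negative pairs hard to tell apart. As a result, the learned features are not discriminative.

Contribution: 1. Proposes a new self-supervised framework that combines a free-negative-pairs approach with a data augmentation technique; 2. Improves sign language recognition accuracy across linear evaluation, semi-supervised learning, and transfer between sign languages.

Method: The framework has two core components: 1. a self-supervised approach with free-negative pairs; 2. a data augmentation technique designed for sign language. The two work together to learn more discriminative features.

Result: Experiments show that SL-SLR outperforms several contrastive and self-supervised methods across linear evaluation, semi-supervised learning, and transfer between sign languages.

Insight: Sign language recognition needs to focus on the informative parts of a video, and the free-negative-pairs design mitigates interference from similar movements, which is key to more discriminative features.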

Abstract: Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, in a sign video, only certain parts provide information that is truly useful for its recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.
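To make the "pull positives closer, push negatives apart" objective concrete, here is a minimal sketch of the standard InfoNCE contrastive loss for a single anchor. This is the conventional baseline formulation the abstract critiques, not the paper's free-negative method; the function names, the temperature value, and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to its positive than to negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # clearly different from the anchor
hard_negative = [0.8, 0.2]    # nearly as similar to the anchor as the positive

loss_easy = info_nce(anchor, positive, [easy_negative])
loss_hard = info_nce(anchor, positive, [hard_negative])
```

In this toy setup `loss_hard` is larger than `loss_easy`, which illustrates issue (ii) in the abstract: when different signs share movements, negatives look like positives and the contrastive signal becomes harder to optimize.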

[88] Symbolic Graphics Programming with Large Language Models

Yamei Chen,Haoquan Zhang,Yangyi Huang,Zeju Qiu,Kaipeng Zhang,Yandong Wen,Weiyang Liu

Main category: cs.CV

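TL;DR: The paper studies how well large language models (LLMs) generate symbolic graphics programs (SGPs), introduces the SGP-GenBench benchmark, and proposes reinforcement learning with verifiable rewards, which substantially improves SVG generation quality and semantics and brings performance on par with frontier systems.

Details Motivation: The ability of LLMs to generate SGPs, SVG in particular, is underexplored. Studying it shows how LLMs understand the visual world through program synthesis and how they can produce precise visual content via SGPs.

Contribution: 1) Introduces SGP-GenBench, a benchmark for SGP generation; 2) Proposes a reinforcement learning method with verifiable rewards that substantially improves SVG generation quality; 3) Analyzes how RL training changes model behavior, showing improved object decomposition and scene coherence.

Method: Uses reinforcement learning (RL) with a format-validity gate and a cross-modal reward computed with vision encoders (e.g., SigLIP and DINO) to improve SGP generation. Experiments are based on Qwen-2.5-7B.

Result: The RL method substantially improves SVG generation quality and semantics, reaching performance on par with frontier systems. RL also induces finer decomposition of objects into controllable primitives and contextual details that improve scene coherence.

Insight: Symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding, and RL training improves the model's control over primitives and its handling of detail.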

Abstract: Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs’ ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
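The format-validity gate described above can be sketched as a reward wrapper that returns zero for output that does not parse. The sketch below only checks that the text is well-formed XML with an `svg` root, which is weaker than "renderable"; the function names are hypothetical, and `semantic_score` stands in for the cross-modal reward, which is not implemented here.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"


def is_wellformed_svg(text):
    """True if text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NAMESPACE + "svg")


def gated_reward(svg_text, semantic_score):
    """Return 0.0 for invalid output, otherwise the semantic score.

    semantic_score is any callable mapping SVG text to a float; it is a
    placeholder for an image-text similarity model (an assumption).
    """
    if not is_wellformed_svg(svg_text):
        return 0.0
    return semantic_score(svg_text)
```

Gating before scoring means the model never earns semantic reward for output that cannot be rendered, so validity is learned first.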

[89] COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Yassine Taoudi-Benchekroun,Klim Troyan,Pascal Sager,Stefan Gerber,Lukas Tuggener,Benjamin Grewe

Main category: cs.CV

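TL;DR: COGITAO is a modular and extensible data generation framework and benchmark for systematically studying compositionality and generalization in visual domains. It builds rule-based tasks in grid-like environments, supports composition at adjustable depth, and can generate millions of unique task rules.

Details Motivation: Humans can compose learned concepts and apply them in novel settings, but current machine learning models remain limited in this respect. COGITAO provides a tool for studying this gap systematically.

Contribution: 1) Proposes the COGITAO framework for extensible task generation and benchmarking; 2) Provides a rich set of task rules and control parameters, at a scale far beyond existing datasets; 3) Open-sources all code and datasets.

Method: Drawing on the ARC-AGI problem setting, COGITAO generates tasks by applying a set of composable transformations (28 in total) to objects in grid environments. Task rules support adjustable composition depth and extensive parameter control.

Result: Experiments show that even state-of-the-art vision models fail to generalize to novel combinations of familiar elements, despite strong in-domain performance.

Insight: COGITAO exposes the difficulty of compositionality and generalization in visual reasoning and underlines the need for stronger model designs that can handle complex novel combinations.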

Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI’s problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules – surpassing concurrent datasets by several orders of magnitude – across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
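Composition "at adjustable depth" can be illustrated with a toy example: each transformation maps a grid to a grid, and a task rule is a chain of them. The three transformations and the `compose` helper below are assumptions chosen for illustration and are not COGITAO's actual API or its 28 transformations; for simplicity they act on the whole grid rather than on individual objects.

```python
from functools import reduce


def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror_horizontal(grid):
    """Flip each row left to right."""
    return [row[::-1] for row in grid]


def translate_right(grid):
    """Shift every row one cell to the right with wrap-around."""
    return [row[-1:] + row[:-1] for row in grid]


def compose(*transforms):
    """Build a task rule that applies the transforms in the order given."""
    def rule(grid):
        return reduce(lambda g, t: t(g), transforms, grid)
    return rule


# A depth-2 rule: rotate, then mirror.
rule = compose(rotate90, mirror_horizontal)
result = rule([[1, 2], [3, 4]])
```

With k base transformations and depth d there are up to k^d ordered chains, which is how a small transformation set yields a very large rule space; a generalization test can then hold out specific chains whose individual parts were seen in training.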

[90] WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Zizun Li,Jianjun Zhou,Yifan Wang,Haoyu Guo,Wenzheng Chang,Yang Zhou,Haoyi Zhu,Junyi Chen,Chunhua Shen,Tong He

Main category: cs.CV

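TL;DR: WinT3R is a sliding-window online reconstruction model with a camera token pool that delivers high-quality camera pose estimation and point map reconstruction while keeping real-time performance.

Details Motivation: Existing methods trade reconstruction quality against real-time performance and cannot satisfy both. WinT3R addresses this with a window mechanism and a camera token pool.

Contribution: 1. Introduces a sliding window mechanism that improves information exchange among frames; 2. Proposes a compact camera representation and a global camera token pool that make pose estimation more reliable.

Method: Shares information among frames within a sliding window and combines it with a camera token pool for efficient, high-quality online reconstruction.

Result: Experiments on diverse datasets show state-of-the-art online reconstruction quality, camera pose estimation, and reconstruction speed.

Insight: Combining a window mechanism with a token pool balances computational efficiency and accuracy in real-time, high-quality reconstruction.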

Abstract: We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
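The control flow of window-based streaming with a global pool can be sketched without any model. In the sketch below, frames are plain identifiers and the "camera token" is a labelled placeholder; the window size, stride, function names, and the pool's dictionary layout are all assumptions made for illustration and do not reflect WinT3R's implementation.

```python
from collections import deque


def stream_windows(frames, window_size=4, stride=2):
    """Yield overlapping windows of the most recent frames from a stream."""
    buffer = deque(maxlen=window_size)
    since_last = 0
    for frame in frames:
        buffer.append(frame)
        since_last += 1
        # Emit a window once the buffer is full and `stride` new frames arrived.
        if len(buffer) == window_size and since_last >= stride:
            since_last = 0
            yield list(buffer)


def run_stream(frames, window_size=4, stride=2):
    """Process a stream window by window, keeping one token per frame in a pool."""
    camera_pool = {}  # frame id -> compact token (hypothetical placeholder)
    for window in stream_windows(frames, window_size, stride):
        for frame in window:
            # A real model would predict this token jointly from the window
            # and the pool; here it is only a labelled stand-in.
            camera_pool.setdefault(frame, ("camera_token", frame))
    return camera_pool
```

Overlapping windows let neighbouring frames exchange information, while the pool keeps one small entry per past frame so that global context grows slowly with sequence length.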

[91] FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

Matteo Poggi,Fabio Tosi

Main category: cs.CV

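TL;DR: FlowSeek is an optical flow framework with low hardware requirements. It combines depth foundation models with motion parametrization, trains on a single consumer-grade GPU, and achieves strong cross-dataset performance.

Details Motivation: Current optical flow methods need substantial hardware resources for training, which limits their adoption. FlowSeek combines depth foundation models with low-dimensional motion parametrization to cut training cost while keeping accuracy high.

Contribution: 1. Combines depth foundation models with low-dimensional motion parametrization; 2. Designs a compact yet accurate optical flow architecture; 3. Trains on a single consumer-grade GPU, a hardware budget about 8x lower than most recent methods; 4. Outperforms existing methods on several datasets.

Method: 1. Incorporates a single-image depth foundation model; 2. Uses classical low-dimensional motion parametrization; 3. Designs a compact optical flow network architecture.

Result: Achieves relative improvements of 10% on Sintel Final and 15% on KITTI over the previous state of the art, SEA-RAFT, and also improves on the Spring and LayeredFlow datasets.

Insight: Combining foundation models with classical parametrization enables accurate optical flow estimation on a small hardware budget, which makes practical deployment more feasible.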

Abstract: We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances on the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with a relative improvement of 10 and 15% over the previous state-of-the-art SEA-RAFT, as well as on Spring and LayeredFlow datasets.

cs.DC

[92] STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs

Han Liang,Jiahui Zhou,Zicheng Zhou,Xiaoxi Zhang,Xu Chen

Main category: cs.DC

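TL;DR: STADI is a fine-grained step-patch diffusion parallelism framework for heterogeneous GPU environments that improves resource utilization and load balancing in diffusion model inference.

Details Motivation: Diffusion models are computationally expensive in applications such as image generation, and existing parallel inference schemes underutilize resources in heterogeneous GPU environments, causing workload imbalance.

Contribution: Proposes the STADI framework, which combines a temporal step allocator with an elastic patch parallelism mechanism to reduce inference latency and improve resource utilization.

Method: Uses a computation-aware step allocator with least-common-multiple-minimizing quantization to reduce denoising steps on slower GPUs, together with elastic patch parallelism that assigns image patches of different sizes to GPUs according to their capability.

Result: On heterogeneous GPU clusters, STADI reduces end-to-end inference latency by up to 45% compared with patch parallelism, a state-of-the-art approach, and improves load balancing.

Insight: Fine-grained parallelism across both the temporal and spatial dimensions is key to efficient diffusion model inference on heterogeneous GPUs.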

Abstract: The escalating adoption of diffusion models for applications such as image generation demands efficient parallel inference techniques to manage their substantial computational cost. However, existing diffusion parallelism inference schemes often underutilize resources in heterogeneous multi-GPU environments, where varying hardware capabilities or background tasks cause workload imbalance. This paper introduces Spatio-Temporal Adaptive Diffusion Inference (STADI), a novel framework to accelerate diffusion model inference in such settings. At its core is a hybrid scheduler that orchestrates fine-grained parallelism across both temporal and spatial dimensions. Temporally, STADI introduces a novel computation-aware step allocator applied after warmup phases, using a least-common-multiple-minimizing quantization technique to reduce denoising steps on slower GPUs and execution synchronization. To further minimize GPU idle periods, STADI executes an elastic patch parallelism mechanism that allocates variably sized image patches to GPUs according to their computational capability, ensuring balanced workload distribution through a complementary spatial mechanism. Extensive experiments on both load-imbalanced and heterogeneous multi-GPU clusters validate STADI’s efficacy, demonstrating improved load balancing and mitigation of performance bottlenecks. Compared to patch parallelism, a state-of-the-art diffusion inference framework, our method significantly reduces end-to-end inference latency by up to 45% and significantly improves resource utilization on heterogeneous GPUs.
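The idea of allocating variably sized image patches according to GPU capability can be illustrated with a proportional split. The sketch below divides image rows by relative speed using largest-remainder rounding; it is an assumption-laden illustration, not STADI's allocator, and it ignores the temporal step allocation entirely. The function name and the `speeds` input are hypothetical.

```python
def allocate_patch_rows(total_rows, speeds):
    """Split total_rows image rows among GPUs in proportion to their speeds.

    speeds: relative throughput per GPU (hypothetical measurements).
    Returns one row count per GPU; the counts sum to total_rows.
    """
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total_speed = sum(speeds)
    shares = [total_rows * s / total_speed for s in speeds]
    rows = [int(share) for share in shares]  # round every share down first
    leftover = total_rows - sum(rows)
    # Give the remaining rows to the GPUs with the largest fractional parts.
    order = sorted(range(len(speeds)), key=lambda i: shares[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows
```

For example, with one GPU twice as fast as two others, a 64-row latent is split 32/16/16, so all three devices finish a denoising step at roughly the same time under the stated speed assumption.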

cs.DS

[93] Labelling Data with Unknown References

Adrian de Wynter

Main category: cs.DS

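TL;DR: The paper proposes the "No-Data Algorithm", which establishes trust in an evaluator by successively posing challenges to it, without relying on labelled references, and applies it to LLM evaluation on low-resource languages.

Details Motivation: When labelled references are unavailable, the two usual ways of establishing an evaluator's trustworthiness (testing it, or assuming it "knows" how to label the corpus) both fail, so a new approach is needed.

Contribution: Proposes the No-Data Algorithm, which establishes an evaluator's trustworthiness without reference data, and provides formal proofs, empirical tests, and applications to LLM evaluation on low-resource languages.

Method: Successively poses challenges to the evaluator and judges its trustworthiness from its responses. The algorithm is designed to accept evaluators that actually know how to label the corpus and to reject untrustworthy ones.

Result: Formal proofs and experiments show that the algorithm establishes trustworthiness with high probability, and it is applied to LLMs-as-judges on low-resource languages.

Insight: Evaluators can be verified without labelled data, which opens a path for evaluation in low-resource or unlabelled settings.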

Abstract: An evaluator is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. The two ways to establish trustworthiness are either by testing it, or by assuming the evaluator "knows" somehow the way to label the corpus. However, if labelled references (e.g., a development set) are unavailable, neither of these approaches works: the former requires the data, and the latter is an assumption, not evidence. To address this, we introduce an algorithm (the "No-Data Algorithm") by which to establish trust in an evaluator without any existing references. Our algorithm works by successively posing challenges to said evaluator. We show that this is sufficient to establish trustworthiness w.h.p., in such a way that when the evaluator actually knows the way to label the corpus, the No-Data Algorithm accepts its output; and, conversely, flags untrustworthy evaluators when these are unable to prove it. We present formal proofs of correctness, empirical tests, and applications to LLMs-as-judges on low-resource languages.
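Why repeated challenges can give a "with high probability" guarantee is easiest to see through generic probability amplification. The sketch below assumes, purely for illustration, that an untrustworthy evaluator passes any single challenge with probability at most `pass_probability` and that rounds are independent; neither assumption nor the function name comes from the paper, whose actual protocol and bounds may differ.

```python
import math


def rounds_needed(pass_probability, failure_bound):
    """Smallest k with pass_probability ** k <= failure_bound.

    pass_probability: assumed upper bound on an untrustworthy evaluator
        passing one challenge (hypothetical parameter).
    failure_bound: acceptable probability of wrongly accepting it.
    """
    if not 0 < pass_probability < 1 or not 0 < failure_bound < 1:
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(failure_bound) / math.log(pass_probability))


k = rounds_needed(0.5, 0.01)
```

Under these assumptions the number of challenges grows only logarithmically in the inverse of the error tolerance, which is what makes a challenge-based check practical.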

cs.GR

[94] Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

Haruo Fujiwara,Yusuke Mukuta,Tatsuya Harada

Main category: cs.GR

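TL;DR: The paper proposes an improved text-guided 3D scene stylization method that uses region-based control and multi-view consistency techniques to raise stylization quality and consistency.

Details Motivation: Existing text-driven 3D scene stylization methods struggle to achieve high-quality stylization and view consistency at the same time. Applying style consistently to different regions or objects with semantic correspondence is also challenging.

Contribution: 1. Proposes a reference-based attention-sharing mechanism that improves multi-view style consistency. 2. Uses a grid of depth maps to strengthen view consistency. 3. Proposes the Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss for region-controlled style transfer.

Method: 1. Extends the style-aligned depth-conditioned view generation framework. 2. Uses a grid of multiple depth maps as a single-image reference to strengthen consistency. 3. Introduces a multi-region weighted loss function.

Result: Qualitative and quantitative experiments show that the pipeline improves text-driven 3D stylization in quality and consistency and enables mixing different styles across regions.

Insight: Combining depth information with region segmentation is an effective way to improve 3D stylization.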

Abstract: Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising outcomes. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence is a challenging task. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization.
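The proposed loss builds on the sliced Wasserstein distance, which compares two point sets by projecting them onto random directions and comparing the sorted 1D projections. The sketch below is a plain Monte Carlo estimate for equally sized sets and omits the paper's multi-region importance weighting; the function name, projection count, and seed are assumptions for illustration.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Estimate the squared sliced 2-Wasserstein distance between point sets.

    xs, ys: equally sized lists of points, each a list of floats.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("point sets must be non-empty and of equal size")
    dim = len(xs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # In 1D the optimal transport plan matches sorted projections.
        proj_x = sorted(sum(a * d for a, d in zip(p, direction)) for p in xs)
        proj_y = sorted(sum(a * d for a, d in zip(p, direction)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(proj_x, proj_y)) / len(proj_x)
    return total / num_projections
```

A region-aware variant could compute this quantity separately on the features inside each segmentation mask and combine the results with per-region weights, though the exact weighting used in the paper is not specified here.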

cs.SD

[95] Ecologically Valid Benchmarking and Adaptive Attention: Scalable Marine Bioacoustic Monitoring

Nicholas R. Rasmussen,Rodrigue Rizk,Longwei Wang,KC Santosh

Main category: cs.SD

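TL;DR: The paper proposes GetNetUPAM, a hierarchical nested cross-validation framework that quantifies model stability under ecologically realistic variability, together with the ARPA-N neural architecture, to make underwater acoustic monitoring more accurate and scalable.

Details Motivation: Underwater Passive Acoustic Monitoring (UPAM) is hampered by intrinsic noise and complex signal dependencies, and existing methods do not generalize stably. An evaluation framework that reflects environmental diversity and a robust model architecture are needed.

Contribution: 1) The GetNetUPAM framework validates on site-year blocks to quantify model performance under real ecological diversity. 2) The ARPA-N network uses adaptive pooling and spatial attention to handle irregular spectrogram dimensions.

Method: 1) Hierarchical nested cross-validation (GetNetUPAM) partitions the data into site-year segments so that each validation fold reflects a distinct environmental subset. 2) ARPA-N combines adaptive resolution pooling with spatial attention to extend the receptive field.

Result: Under GetNetUPAM, ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and a marked drop in variability across all metrics, enabling consistent detection across site-year folds.

Insight: Ecologically valid evaluation frameworks such as GetNetUPAM are key to assessing generalization, and adaptive attention handles ambient noise and non-uniform data well, making it suitable for marine bioacoustic monitoring.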

Abstract: Underwater Passive Acoustic Monitoring (UPAM) provides rich spatiotemporal data for long-term ecological analysis, but intrinsic noise and complex signal dependencies hinder model stability and generalization. Multilayered windowing has improved target sound localization, yet variability from shifting ambient noise, diverse propagation effects, and mixed biological and anthropogenic sources demands robust architectures and rigorous evaluation. We introduce GetNetUPAM, a hierarchical nested cross-validation framework designed to quantify model stability under ecologically realistic variability. Data are partitioned into distinct site-year segments, preserving recording heterogeneity and ensuring each validation fold reflects a unique environmental subset, reducing overfitting to localized noise and sensor artifacts. Site-year blocking enforces evaluation against genuine environmental diversity, while standard cross-validation on random subsets measures generalization across UPAM’s full signal distribution, a dimension absent from current benchmarks. Using GetNetUPAM as the evaluation backbone, we propose the Adaptive Resolution Pooling and Attention Network (ARPA-N), a neural architecture for irregular spectrogram dimensions. Adaptive pooling with spatial attention extends the receptive field, capturing global context without excessive parameters. Under GetNetUPAM, ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and a log2-scale order-of-magnitude drop in variability across all metrics, enabling consistent detection across site-year folds and advancing scalable, accurate bioacoustic monitoring.
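Site-year blocking means that all recordings from one (site, year) pair are held out together, so a fold never mixes training and validation clips from the same environment. The sketch below shows a single-level, leave-one-block-out version of that idea; it is not the nested, hierarchical GetNetUPAM procedure, and the record fields `site` and `year` as well as the function name are assumptions for illustration.

```python
from collections import defaultdict


def site_year_folds(records):
    """Yield (held_out_key, train_indices, test_indices), one fold per block.

    records: list of dicts with at least "site" and "year" keys.
    """
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[(record["site"], record["year"])].append(index)
    for key in sorted(blocks):
        test_indices = blocks[key]
        train_indices = [
            i for other, indices in blocks.items() if other != key for i in indices
        ]
        yield key, train_indices, test_indices


records = [
    {"site": "A", "year": 2019},
    {"site": "A", "year": 2019},
    {"site": "A", "year": 2020},
    {"site": "B", "year": 2019},
]
folds = list(site_year_folds(records))
```

Comparing the spread of a metric across these blocked folds with the spread across random folds gives a direct view of how much performance depends on site- and year-specific conditions.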

[96] WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Gagan Mundada,Yash Vishe,Amit Namburi,Xin Xu,Zachary Novack,Julian McAuley,Junda Wu

Main category: cs.SD

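TL;DR: WildScore is a new benchmark that evaluates how well multimodal large language models (MLLMs) interpret real-world music scores and perform complex musical reasoning.

Details Motivation: MLLMs perform strongly on multimodal tasks, but their reasoning in the symbolic music domain remains largely unexplored. WildScore is proposed to fill this gap.

Contribution: 1. Proposes the first in-the-wild multimodal symbolic music reasoning and analysis benchmark. 2. Designs a systematic musicological taxonomy. 3. Frames complex music reasoning as multiple-choice question answering for controlled evaluation.

Method: 1. Sources instances from genuine musical compositions, accompanied by user-generated questions. 2. Proposes a taxonomy with both high-level and fine-grained musicological ontologies. 3. Uses multiple-choice questions to assess models' symbolic music understanding.

Result: Benchmarking state-of-the-art MLLMs reveals both strengths and weaknesses in their visual-symbolic reasoning.

Insight: WildScore exposes both the potential and the persistent challenges of MLLMs in symbolic music reasoning and provides a dataset and methodological framework for further research.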

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs’ capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs’ symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.

cs.AI [Back]

[97] SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing

Hongyi Jing,Jiafu Chen,Chen Rao,Ziqiang Dang,Jiajie Teng,Tianyi Chu,Juncheng Mo,Shuo Fang,Huaizhong Lin,Rui Lv,Chenguang Ma,Lei Zhao

Main category: cs.AI

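TL;DR: SparkUI-Parser proposes a new end-to-end framework that uses continuous coordinate modeling and a modified Hungarian matching algorithm to substantially improve grounding accuracy and parsing capability in GUI perception.

Details Motivation: Existing MLLMs for GUI perception rely on discrete coordinate modeling and are limited to predefined elements, which leads to low accuracy, slow inference, and poor generalization.

Contribution: 1) Proposes an MLLM-based continuous coordinate modeling method; 2) introduces a modified Hungarian matching algorithm to reduce false positives; 3) releases the ScreenParse benchmark.

Method: 1) Performs continuous coordinate modeling with a pre-trained MLLM plus a token router and a coordinate decoder; 2) implements a rejection mechanism through the modified Hungarian algorithm.

Result: Outperforms SOTA methods on multiple benchmarks (ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse).

Insight: Combining continuous modeling with a rejection mechanism markedly improves the robustness and efficiency of GUI parsing.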

Abstract: The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
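
The abstract does not spell out how the Hungarian matching is modified, so the following is only a sketch of one common way to add a reject option to an assignment problem: pad the cost matrix with dummy "no element" columns at a fixed cost. The function name `match_with_rejection`, the `reject_cost` parameter, and the toy costs are assumptions, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimum, but only scales to tiny inputs).

```python
from itertools import permutations


def match_with_rejection(cost, reject_cost):
    """Assign each prediction to a ground-truth element or to None (rejected).

    cost[i][j] is the cost of matching prediction i to ground-truth element j.
    Assumption: rejecting a prediction costs a fixed `reject_cost`.
    """
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    # One dummy column per prediction, so every prediction can be rejected.
    padded = [row + [reject_cost] * n_pred for row in cost]

    best, best_total = None, float("inf")
    # Brute force over one-to-one assignments (stand-in for Hungarian matching).
    for cols in permutations(range(n_gt + n_pred), n_pred):
        total = sum(padded[i][c] for i, c in enumerate(cols))
        if total < best_total:
            best, best_total = cols, total
    return [c if c < n_gt else None for c in best]


# Toy costs: three predicted elements, two real elements on screen.
cost = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.6],
]
assignment = match_with_rejection(cost, reject_cost=0.5)
```

In this toy case the third prediction has no cheap match, so the optimal assignment sends it to a dummy column, which is how a non-existent element would be rejected rather than forced onto a real one.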

[98] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

Yinglin Duan,Zhengxia Zou,Tongwei Gu,Wei Jia,Zhan Zhao,Luyi Xu,Xinzhu Liu,Hao Jiang,Kang Chen,Shuang Qiu

Main category: cs.AI

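TL;DR: LatticeWorld is a multimodal framework built on a lightweight LLM (LLaMA-2-7B) and a rendering engine (e.g., Unreal Engine 5) for generating large-scale, dynamic, interactive 3D worlds, and it substantially improves industrial production efficiency.

Details Motivation: Existing 3D scene generation relies mostly on traditional manual modeling or on generative methods, and falls short in efficiency and realism. LatticeWorld aims to generate 3D worlds efficiently and with high fidelity from multimodal inputs (text and visual).

Contribution: Proposes the LatticeWorld framework, which combines a lightweight LLM with a rendering engine to generate dynamic 3D worlds from multimodal inputs, improving industrial production efficiency by over 90×.

Method: Integrates LLaMA-2-7B with Unreal Engine 5 and accepts textual and visual inputs to generate 3D scenes featuring multi-agent interaction, high-fidelity physics simulation, and real-time rendering.

Result: Experiments show that LatticeWorld performs strongly in scene layout generation and visual fidelity, with an over 90× gain in industrial production efficiency compared with traditional manual production.

Insight: Pairing a lightweight LLM with a rendering engine offers an efficient, high-realism route to 3D world generation and narrows the sim-to-real gap.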

Abstract: Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside the industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a $90\times$ increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

[99] Maestro: Joint Graph & Config Optimization for Reliable AI Agents

Wenxiao Wang,Priyatham Kattakinda,Soheil Feizi

Main category: cs.AI

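TL;DR: Maestro is a framework that jointly optimizes graph structure and node configurations to improve the reliability and performance of LLM agents, and it clearly outperforms existing methods.

Details Motivation: Existing optimizers focus mainly on node configurations and leave the graph fixed, so structural failure modes go unaddressed. Maestro optimizes both jointly to raise overall agent quality.

Contribution: 1. Proposes Maestro, a framework-agnostic joint optimizer that searches over graph structures and node configurations together; 2. uses reflective textual feedback to improve sample efficiency and target specific failures; 3. substantially outperforms existing methods on several benchmarks and applications.

Method: Maestro jointly optimizes graph structure and node configurations under a fixed rollout/token budget, using reflective textual feedback from traces to prioritize edits that address specific failure modes.

Result: On IFBench and HotpotQA, Maestro beats MIPROv2, GEPA, and GEPA+Merge by an average of 12%, 4.9%, and 4.86%, respectively, while using far fewer rollouts than GEPA.

Insight: Searching jointly over structure and configuration addresses deeper problems in LLM agents, including failure modes that prompt optimization alone cannot cover.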

Abstract: Building reliable LLM agents requires decisions at two levels: the graph (which modules exist and how information flows) and the configuration of each node (models, prompts, tools, control knobs). Most existing optimizers tune configurations while holding the graph fixed, leaving structural failure modes unaddressed. We introduce Maestro, a framework-agnostic holistic optimizer for LLM agents that jointly searches over graphs and configurations to maximize agent quality, subject to explicit rollout/token budgets. Beyond numeric metrics, Maestro leverages reflective textual feedback from traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On the IFBench and HotpotQA benchmarks, Maestro consistently surpasses leading prompt optimizers (MIPROv2, GEPA, and GEPA+Merge) by an average of 12%, 4.9%, and 4.86%, respectively; even when restricted to prompt-only optimization, it still leads by 9.65%, 2.37%, and 2.41%. Maestro achieves these results with far fewer rollouts than GEPA. We further show large gains on two applications (interviewer & RAG agents), highlighting that joint graph & configuration search addresses structural failure modes that prompt tuning alone cannot fix.

[100] Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning

Brennen Hill

Main category: cs.AI

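TL;DR: This position paper proposes language-driven hierarchical task structures as explicit world models to improve multi-agent learning, particularly on complex, long-horizon tasks.

Details Motivation: Current multi-agent reinforcement learning often fails on complex tasks because of intractable exploration spaces and sparse rewards, and needs an explicit, structured world model to guide learning.

Contribution: Proposes using large language models to dynamically generate hierarchical task structures, thereby constructing an explicit world model that improves learning efficiency and the strategic sophistication of agent behavior.

Method: A language model decomposes complex goals into manageable subgoals, building a hierarchical world model that gives multi-agent reinforcement learning dense learning signals and an intrinsic curriculum.

Result: Drawing on a systematic review of 2024 research in multi-agent soccer, the paper argues that this approach can substantially improve learning efficiency and strategic behavior on complex multi-agent tasks such as robotic soccer.

Insight: Language models are useful beyond natural language processing: by dynamically generating task structures, they can explicitly guide multi-agent learning and offer a new paradigm for agent training.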

Abstract: The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intractable exploration spaces and sparse rewards. This position paper argues that the next frontier in developing capable agents lies in creating environments that possess an explicit, hierarchical World Model. We contend that this is best achieved through hierarchical scaffolding, where complex goals are decomposed into structured, manageable subgoals. Drawing evidence from a systematic review of 2024 research in multi-agent soccer, we identify a clear and decisive trend towards integrating symbolic and hierarchical methods with multi-agent reinforcement learning (MARL). These approaches implicitly or explicitly construct a task-based world model to guide agent learning. We then propose a paradigm shift: leveraging Large Language Models to dynamically generate this hierarchical scaffold, effectively using language to structure the World Model on the fly. This language-driven world model provides an intrinsic curriculum, dense and meaningful learning signals, and a framework for compositional learning, enabling Agent Models to acquire sophisticated, strategic behaviors with far greater sample efficiency. By building environments with explicit, language-configurable task layers, we can bridge the gap between low-level reactive behaviors and high-level strategic team play, creating a powerful and generalizable framework for training the next generation of intelligent agents.

[101] Towards Ontology-Based Descriptions of Conversations with Qualitatively-Defined Concepts

Barbara Gendron,Gaël Guibon,Mathieu D’aquin

Main category: cs.AI

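TL;DR: Proposes an ontology-based approach that uses linguistic descriptors to give quantitative definitions to qualitative conversational features, and achieves controlled text generation through ontology reasoning and consistency checking, applied to control of CEFR language proficiency levels.

Details Motivation: When large language models act as conversational agents, controllability, and in particular ensuring predictable and user-personalized responses, is a key challenge. This requires quantitative definitions of features that are qualitative in nature.

Contribution: Quantifies qualitative conversational features through linguistic descriptors and builds an ontology that supports reasoning and consistency checking, enabling transparent, controllable text generation.

Method: Defines qualitative concepts with linguistic descriptors, formalizes them in description logic, integrates them into an ontology, and uses the ontology to guide LLM text generation through fine-tuning.

Result: Experiments show that the approach yields consistent and explainable proficiency-level definitions, improving transparency in conversational AI.

Insight: Ontologies and formal description logic can supply quantitative definitions for qualitative conversational features, which is an effective route to more controllable and transparent LLMs.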

Abstract: The controllability of Large Language Models (LLMs) when used as conversational agents is a key challenge, particularly to ensure predictable and user-personalized responses. This work proposes an ontology-based approach to formally define conversational features that are typically qualitative in nature. By leveraging a set of linguistic descriptors, we derive quantitative definitions for qualitatively-defined concepts, enabling their integration into an ontology for reasoning and consistency checking. We apply this framework to the task of proficiency-level control in conversations, using CEFR language proficiency levels as a case study. These definitions are then formalized in description logic and incorporated into an ontology, which guides controlled text generation of an LLM through fine-tuning. Experimental results demonstrate that our approach provides consistent and explainable proficiency-level definitions, improving transparency in conversational AI.

[102] Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework

Jie Chen,Jinhao Jiang,Yingqian Min,Zican Dong,Shijie Wang,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.AI

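TL;DR: Sticker-TTS is a novel test-time scaling framework that coordinates three collaborating large reasoning models (LRMs) to iteratively explore and refine solutions using historical experience, improving both computational efficiency and performance.

Details Motivation: Current test-time scaling methods rely mainly on redundant sampling and ignore historical experience, which limits computational efficiency. Sticker-TTS is proposed to address this.

Contribution: 1. Proposes the Sticker-TTS framework, which uses historical experience to guide iterative refinement; 2. introduces distilled key conditions (stickers) to extract, refine, and reuse information; 3. proposes a two-stage optimization strategy that combines imitation learning with self-improvement.

Method: At its core, the framework runs multiple rounds of reasoning and refinement through three collaborating LRMs and the sticker mechanism. It is trained with a two-stage optimization strategy that combines imitation learning and self-improvement.

Result: On three mathematical reasoning benchmarks (AIME-24, AIME-25, and OlymMATH), Sticker-TTS outperforms self-consistency and advanced reinforcement learning approaches under comparable inference budgets.

Insight: Reusing historical experience through the sticker mechanism can markedly improve reasoning efficiency and performance while avoiding redundant computation.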

Abstract: Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling and ignore the use of historical experience, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions, termed stickers, which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.

eess.IV [Back]

[103] Inferring the Graph Structure of Images for Graph Neural Networks

Mayur S Gowda,John Shi,Augusto Santos,José M. F. Moura

Main category: eess.IV

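TL;DR: The paper improves GNN task performance by finding graph structures for image datasets (such as MNIST and Fashion-MNIST) that replace the traditional grid graph and superpixel methods.

Details Motivation: Image datasets are traditionally represented as grid graphs, but this representation may not fully exploit the relationships between pixels, which limits GNN performance.

Contribution: Proposes building row correlation, column correlation, and product graphs from correlations between pixel values and feeding them to GNNs, improving accuracy on downstream tasks.

Method: Uses correlations between pixel values to construct alternative graph structures (row correlation, column correlation, and product graphs) and passes them as input to GNN models.

Result: Experiments show that the approach improves classification accuracy on MNIST and Fashion-MNIST over the traditional grid graph and superpixel methods.

Insight: The choice of graph structure matters greatly for GNN performance, and correlations between pixels can expose richer graph topology.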

Abstract: Image datasets such as MNIST are a key benchmark for testing Graph Neural Network (GNN) architectures. The images are traditionally represented as a grid graph with each node representing a pixel and edges connecting neighboring pixels (vertically and horizontally). The graph signal is the values (intensities) of each pixel in the image. The graphs are commonly used as input to graph neural networks (e.g., Graph Convolutional Neural Networks (Graph CNNs) [1, 2], Graph Attention Networks (GAT) [3], GatedGCN [4]) to classify the images. In this work, we improve the accuracy of downstream graph neural network tasks by finding alternative graphs to the grid graph and superpixel methods to represent the dataset images, following the approach in [5, 6]. We find row correlation, column correlation, and product graphs for each image in MNIST and Fashion-MNIST using correlations between the pixel values building on the method in [5, 6]. Experiments show that using these different graph representations and features as input into downstream GNN models improves the accuracy over using the traditional grid graph and superpixel methods in the literature.
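
To make the idea of a row correlation graph concrete, here is a minimal sketch. It is not the construction from [5, 6], which the abstract does not detail: connecting two rows when the absolute Pearson correlation of their pixel values exceeds a threshold is an assumption, as are the function names and the `threshold` parameter.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def row_correlation_graph(image, threshold=0.5):
    """Return edges (i, j) between rows whose |correlation| >= threshold.

    Assumption: thresholded absolute correlation defines the edge set.
    """
    n = len(image)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(pearson(image[i], image[j])) >= threshold
    ]


# Toy 4x4 "image" for illustration only.
image = [
    [0, 1, 2, 3],
    [0, 2, 4, 6],
    [3, 2, 1, 0],
    [5, 5, 5, 5],
]
row_edges = row_correlation_graph(image)
# A column correlation graph follows by transposing the image first.
col_edges = row_correlation_graph([list(col) for col in zip(*image)])
```

Unlike the grid graph, the resulting edges depend on the image content, so rows that are far apart can be connected while a constant row stays isolated.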

[104] Multi-modal Uncertainty Robust Tree Cover Segmentation For High-Resolution Remote Sensing Images

Yuanyuan Gui,Wei Li,Yinjian Wang,Xiang-Gen Xia,Mauro Marty,Christian Ginzler,Zuyuan Wang

Main category: eess.IV

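TL;DR: The paper proposes MURTreeFormer, a new multi-modal segmentation framework that addresses the cross-modal uncertainty caused by temporal misalignment in high-resolution remote sensing images, making tree cover segmentation more robust.

Details Motivation: Multi-modal remote sensing data are often acquired at different times, which introduces temporal misalignment and uncertainty between modalities and degrades semantic segmentation accuracy. The paper aims to solve this problem and improve the robustness of tree cover segmentation.

Contribution: 1. Proposes the MURTreeFormer framework, which explicitly models patch-level uncertainty in the auxiliary modalities through a probabilistic latent representation. 2. Introduces a VAE-based resampling mechanism and a gradient magnitude attention (GMA) module, among other techniques, to improve feature fusion and segmentation performance.

Method: 1. Treats one modality as primary and the others as auxiliary. 2. Models uncertainty in the auxiliary modalities via a probabilistic latent representation. 3. Reconstructs uncertain regions with a VAE-based resampling mechanism. 4. Refines the decoder with the GMA module and a lightweight refinement head (RH).

Result: On multi-modal datasets from Shanghai and Zurich, MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.

Insight: 1. Temporal misalignment in multi-modal data can be mitigated by modeling uncertainty. 2. Combining generative models (such as VAEs) with attention mechanisms can further improve the robustness of segmentation.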

Abstract: Recent advances in semantic segmentation of multi-modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities, such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR), has shown superior performance over single-modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross-modal uncertainty, especially in high-resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi-modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch-level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality’s distribution through a VAE-based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree-like structures and to preserve fine-grained spatial details. Extensive experiments on multi-modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.

[105] VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation

Julia Dietlmeier,Oluwabukola Grace Adegboro,Vayangi Ganepola,Claudia Mazo,Noel E. O’Connor

Main category: eess.IV

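TL;DR: The paper proposes VLSM-Ensemble, which ensembles CLIP- and BiomedCLIP-based vision-language models with a low-complexity CNN to improve medical image segmentation, most notably raising the Dice score by 6.3% on the BKAI polyp dataset.

Details Motivation: Existing CLIP- and BiomedCLIP-based vision-language models lag behind more sophisticated architectures (such as CRIS) on image segmentation, so the paper studies how model ensembling can narrow this gap.

Contribution: The main contribution is a method for ensembling a low-complexity CNN with pre-trained vision-language models, which significantly improves segmentation performance and is validated on multiple datasets.

Method: Ensembles pre-trained vision-language segmentation models (such as BiomedCLIPSeg) with a low-complexity CNN, improving segmentation accuracy without elaborate text prompt engineering.

Result: The Dice score improves by 6.3% on the BKAI polyp dataset, with gains of 1% to 6% on other datasets; on some datasets the ensemble outperforms the CRIS model.

Insight: The effect of ensembling varies considerably across datasets, which suggests that it depends on dataset characteristics and points to a direction for future research.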

Abstract: Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on additional four radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at https://github.com/juliadietlmeier/VLSM-Ensemble.
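
For readers unfamiliar with the metric, the sketch below computes the Dice score and shows one simple way two models' outputs could be ensembled. The abstract does not specify the paper's ensembling scheme, so averaging per-pixel probabilities and thresholding is an assumption, as are the function names and the toy values (which are illustrative, not results from the paper).

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2 * intersection / total


def ensemble_mask(prob_a, prob_b, weight_a=0.5, threshold=0.5):
    """Weighted average of two per-pixel probability maps, then threshold.

    Assumption: probability averaging stands in for the paper's ensembling.
    """
    return [
        int(weight_a * a + (1 - weight_a) * b >= threshold)
        for a, b in zip(prob_a, prob_b)
    ]


def binarize(prob, threshold=0.5):
    return [int(p >= threshold) for p in prob]


# Toy flattened masks and probabilities, for illustration only.
target = [1, 1, 0, 0, 1, 0]
vlsm_prob = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]  # hypothetical VLSM output
cnn_prob = [0.8, 0.7, 0.1, 0.3, 0.3, 0.2]   # hypothetical CNN output

dice_vlsm = dice_score(binarize(vlsm_prob), target)
dice_cnn = dice_score(binarize(cnn_prob), target)
dice_ensemble = dice_score(ensemble_mask(vlsm_prob, cnn_prob), target)
```

In this toy case the two models make different mistakes, so averaging corrects both. When the models err on the same pixels, averaging cannot help, which is consistent with the abstract's observation that ensembling behaves differently across datasets.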

cs.LG [Back]

[106] Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning

Jasmine Shone,Shaden Alshammari,Mark Hamilton,Zhening Li,William Freeman

Main category: cs.LG

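TL;DR: Beyond I-Con explores a new dimension of distance measures in representation learning: it replaces the traditional KL divergence with alternative statistical divergences (such as total variation distance) and alternative similarity kernels to overcome KL's limitations, and reports state-of-the-art results on several tasks.

Details Motivation: The KL divergence used in the original I-Con framework can cause optimization problems because it is asymmetric and unbounded, and it may be misaligned with the true objective. More suitable statistical divergences and similarity kernels are therefore worth exploring.

Contribution: 1. Proposes the Beyond I-Con framework, which systematically explores alternative statistical divergences (such as TV distance) and similarity kernels; 2. achieves state-of-the-art performance on unsupervised clustering, supervised contrastive learning, and dimensionality reduction.

Method: 1. For unsupervised clustering of DINO-ViT embeddings, modifies the PMI algorithm to use TV distance; 2. for supervised contrastive learning, replaces KL and the angular kernel with TV distance and a distance-based similarity kernel; 3. for dimensionality reduction, replaces KL with a bounded f-divergence.

Result: 1. State-of-the-art results on unsupervised clustering; 2. better than the standard approach on supervised contrastive learning; 3. better than SNE on dimensionality reduction, including on downstream tasks.

Insight: The choice of statistical divergence and similarity kernel is important for optimization in representation learning, and alternatives such as TV distance can noticeably improve performance.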

Abstract: The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.
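
The two properties of KL divergence that the abstract names, asymmetry and unboundedness, are easy to see on small discrete distributions. The sketch below uses the standard definitions of KL divergence and total variation distance; the function names and example distributions are illustrative and not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with 0 * log(0 / q) = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def tv_distance(p, q):
    """Total variation distance: 0.5 * sum_i |p_i - q_i|, always in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)  # differ: KL is asymmetric
tv_pq, tv_qp = tv_distance(p, q), tv_distance(q, p)      # equal: TV is symmetric

# Disjoint supports: KL is infinite, TV saturates at 1.
kl_disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])
tv_disjoint = tv_distance([1.0, 0.0], [0.0, 1.0])
```

A bounded, symmetric divergence such as TV therefore avoids the arbitrarily large loss values that KL can produce when the learned distribution assigns near-zero probability to an observed neighbor.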

[107] SpikingBrain Technical Report: Spiking Brain-inspired Large Models

Yuqi Pan,Yupeng Feng,Jinghao Zhuang,Siyu Ding,Zehao Liu,Bohan Sun,Yuhong Chou,Han Xu,Xuerui Qiu,Anlin Deng,Anjie Hu,Peng Zhou,Man Yao,Jibin Wu,Jian Yang,Guoliang Sun,Bo Xu,Guoqi Li

Main category: cs.LG

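TL;DR: SpikingBrain is a family of brain-inspired models that targets the efficiency and long-context limitations of Transformer-based large models. Through linear and hybrid-linear attention and adaptive spiking neurons, it achieves efficient training and inference on a non-NVIDIA platform.

Details Motivation: Mainstream Transformer-based large models suffer from high training compute and large inference memory requirements, especially for long contexts. To address this, the authors propose the SpikingBrain model family, based on brain-inspired mechanisms.

Contribution: 1. Proposes the SpikingBrain model family, which supports efficient long-context training and inference; 2. develops linear and hybrid-linear attention architectures with adaptive spiking neurons; 3. designs a dedicated training pipeline and spike coding framework; 4. achieves stable training on a non-NVIDIA platform.

Method: 1. Model architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; 2. algorithmic optimizations: an efficient conversion-based training pipeline and a spike coding framework; 3. system engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware.

Result: SpikingBrain-7B and SpikingBrain-76B reach performance comparable to open-source Transformer baselines using only about 150B tokens of continual pre-training. SpikingBrain-7B attains over 100x speedup in Time to First Token on 4M-token sequences, the 7B model reaches a Model FLOPs Utilization of 23.4% during training, and the spiking scheme achieves 69.15% sparsity.

Insight: Brain-inspired mechanisms such as spiking neurons can substantially improve the efficiency and scalability of large models, particularly for long-sequence processing and stable training on non-NVIDIA hardware.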

Abstract: Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.

cs.RO [Back]

[108] Towards an Accurate and Effective Robot Vision (The Problem of Topological Localization for Mobile Robots)

Emanuela Boros

Main category: cs.RO

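TL;DR: The paper systematically compares the performance of several visual descriptors and methods on topological localization, identifies well-performing configurations, and validates them on a real robot vision task.

Details Motivation: Topological localization is a basic requirement for mobile robots to accomplish tasks, but visual localization is challenged by perceptual ambiguity, sensor noise, and illumination changes.

Contribution: Provides a systematic, quantitative comparison of visual descriptors, distance measures, and classifiers, and validates the best configurations on a practical task.

Method: Evaluates Color Histograms, SIFT, ASIFT, RGB-SIFT, and Bag-of-Visual-Words approaches in combination with different distance measures and classifiers.

Result: The results show that a suitable combination of descriptor configuration, similarity measure, and classifier yields clear gains in localization performance.

Insight: Future work should focus on hierarchical models, feature combinations, and real-time operation, while avoiding the curse of dimensionality.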

Abstract: Topological localization is a fundamental problem in mobile robotics, since robots must be able to determine their position in order to accomplish tasks. Visual localization and place recognition are challenging due to perceptual ambiguity, sensor noise, and illumination variations. This work addresses topological localization in an office environment using only images acquired with a perspective color camera mounted on a robot platform, without relying on temporal continuity of image sequences. We evaluate state-of-the-art visual descriptors, including Color Histograms, SIFT, ASIFT, RGB-SIFT, and Bag-of-Visual-Words approaches inspired by text retrieval. Our contributions include a systematic, quantitative comparison of these features, distance measures, and classifiers. Performance was analyzed using standard evaluation metrics and visualizations, extending previous experiments. Results demonstrate the advantages of proper configurations of appearance descriptors, similarity measures, and classifiers. The quality of these configurations was further validated in the Robot Vision task of the ImageCLEF evaluation campaign, where the system identified the most likely location of novel image sequences. Future work will explore hierarchical models, ranking methods, and feature combinations to build more robust localization systems, reducing training and runtime while avoiding the curse of dimensionality. Ultimately, this aims toward integrated, real-time localization across varied illumination and longer routes.

[109] Pointing-Guided Target Estimation via Transformer-Based Attention

Luca Müller,Hassan Ali,Philipp Allgeuer,Lukáš Gajdošech,Stefan Wermter

Main category: cs.RO

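TL;DR: The paper proposes MM-ITF, a modular architecture that uses inter-modality attention to map human pointing gestures to target object locations, enabling intuitive human-robot interaction.

Details Motivation: In human-robot interaction (HRI), robots need to understand intent that humans convey non-verbally, for example through pointing gestures, to make collaboration more natural and accessible.

Contribution: 1. Proposes the modular MM-ITF architecture, which uses inter-modality attention to predict the object being pointed at; 2. introduces a patch confusion matrix to analyze model performance.

Method: MM-ITF uses Transformer-based inter-modality attention to map pointing gestures in monocular RGB data to target locations and assigns a likelihood score to each candidate target.

Result: Experiments show that the method predicts the intended object accurately, supporting more natural human-robot collaboration.

Insight: Inter-modality attention shows promise for understanding non-verbal intent, and the patch confusion matrix offers a new view of model performance across candidate locations.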

Abstract: Deictic gestures, like pointing, are a fundamental form of non-verbal communication, enabling humans to direct attention to specific objects or locations. This capability is essential in Human-Robot Interaction (HRI), where robots should be able to predict human intent and anticipate appropriate responses. In this work, we propose the Multi-Modality Inter-TransFormer (MM-ITF), a modular architecture to predict objects in a controlled tabletop scenario with the NICOL robot, where humans indicate targets through natural pointing gestures. Leveraging inter-modality attention, MM-ITF maps 2D pointing gestures to object locations, assigns a likelihood score to each, and identifies the most likely target. Our results demonstrate that the method can accurately predict the intended object using monocular RGB data, thus enabling intuitive and accessible human-robot collaboration. To evaluate the performance, we introduce a patch confusion matrix, providing insights into the model’s predictions across candidate object locations. Code available at: https://github.com/lucamuellercode/MMITF.

[110] Robust Model Predictive Control Design for Autonomous Vehicles with Perception-based Observers

Nariman Niknejad,Gokul S. Sankar,Bahare Kiumarsi,Hamidreza Modares

Main category: cs.RO

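TL;DR: The paper proposes a robust model predictive control (MPC) framework for autonomous vehicles that handles non-Gaussian noise through perception-based, set-based state estimation and markedly improves control performance.

Details Motivation: Conventional MPC frameworks usually assume that perception noise is zero-mean Gaussian, whereas noise from deep learning-based perception modules can be biased and heavy-tailed, which threatens the safety and stability of the control system.

Contribution: 1. Proposes set-based state estimation with constrained zonotopes to capture biased, heavy-tailed perception noise; 2. reformulates the robust MPC as a linear program, improving computational efficiency; 3. ensures closed-loop stability through Minkowski-Lyapunov inequalities and contractive invariant sets.

Method: 1. Represents bounded state estimation errors with constrained zonotopes; 2. introduces a Minkowski-Lyapunov cost function with a slack variable to avoid degenerate solutions; 3. derives the largest stabilizing terminal set and its feedback gain via an ellipsoidal approximation of the zonotopes.

Result: Simulations and hardware experiments show that, under heavy-tailed noise, the method significantly outperforms conventional MPC designs based on a Gaussian noise assumption, in both state estimation error bounding and overall control performance.

Insight: 1. The non-Gaussian nature of perception noise matters for control systems; 2. combining set-based state estimation with robust optimization handles uncertainty effectively; 3. the gain in computational efficiency makes the method suitable for real-time control.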

Abstract: This paper presents a robust model predictive control (MPC) framework that explicitly addresses the non-Gaussian noise inherent in deep learning-based perception modules used for state estimation. Recognizing that accurate uncertainty quantification of the perception module is essential for safe feedback control, our approach departs from the conventional assumption of zero-mean noise quantification of the perception error. Instead, it employs set-based state estimation with constrained zonotopes to capture biased, heavy-tailed uncertainties while maintaining bounded estimation errors. To improve computational efficiency, the robust MPC is reformulated as a linear program (LP), using a Minkowski-Lyapunov-based cost function with an added slack variable to prevent degenerate solutions. Closed-loop stability is ensured through Minkowski-Lyapunov inequalities and contractive zonotopic invariant sets. The largest stabilizing terminal set and its corresponding feedback gain are then derived via an ellipsoidal approximation of the zonotopes. The proposed framework is validated through both simulations and hardware experiments on an omnidirectional mobile robot along with a camera and a convolutional neural network-based perception module implemented within a ROS2 framework. The results demonstrate that the perception-aware MPC provides stable and accurate control performance under heavy-tailed noise conditions, significantly outperforming traditional Gaussian-noise-based designs in terms of both state estimation error bounding and overall control performance.
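
As background on the set representation, a plain zonotope is the set {c + Gξ : ‖ξ‖∞ ≤ 1} with center c and generator vectors (the columns of G). The sketch below shows two standard operations on plain zonotopes; it omits the equality constraints on ξ that make a zonotope *constrained*, so it only gives an outer bound in that setting. The function names and numeric values are illustrative assumptions, not taken from the paper.

```python
def interval_hull(center, generators):
    """Axis-aligned bounds of {c + sum_j xi_j * g_j : |xi_j| <= 1}.

    Along dimension i, the radius is sum_j |g_j[i]|.
    """
    radius = [sum(abs(g[i]) for g in generators) for i in range(len(center))]
    return [(c - r, c + r) for c, r in zip(center, radius)]


def minkowski_sum(z1, z2):
    """Minkowski sum of two zonotopes: add centers, concatenate generators."""
    (c1, g1), (c2, g2) = z1, z2
    return [a + b for a, b in zip(c1, c2)], g1 + g2


# State estimate set: center plus two generator vectors (toy values).
state = ([1.0, 0.0], [[0.5, 0.0], [0.25, 0.25]])
# Perception error set with a non-zero center, i.e. a biased error (toy values).
noise = ([0.5, 0.0], [[0.5, 0.5]])

state_bounds = interval_hull(*state)
combined = minkowski_sum(state, noise)
combined_bounds = interval_hull(*combined)
```

A non-zero noise center is how a set-based description can carry a bias that a zero-mean Gaussian model would ignore, and the bounds remain hard limits rather than confidence intervals.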

cs.SI [Back]

[111] Evaluating Cognitive-Behavioral Fixation via Multimodal User Viewing Patterns on Social Media

Yujie Wang,Yunwei Zhao,Jing Yang,Han Han,Shiguang Shan,Jie Zhang

Main category: cs.SI

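TL;DR: The paper proposes a new framework that assesses cognitive-behavioral fixation by analyzing users' multimodal social media engagement patterns, filling a gap in computational detection methods.

Details Motivation: Digital social media platforms often lead users to engage persistently and repetitively with narrow content domains, a phenomenon known as cognitive-behavioral fixation. Although psychology has studied it extensively, computational methods for detecting it remain underexplored.

Contribution: Introduces a multimodal topic extraction module and a cognitive-behavioral fixation quantification module, which together enable adaptive, hierarchical, and interpretable assessment of user behavior.

Method: Combines multimodal data analysis (such as text and visual content) in a hierarchical quantification framework, in which the two modules work together to extract topics and quantify the degree of fixation.

Result: Experiments show that the approach is effective on existing benchmarks and on a newly curated multimodal dataset, laying the groundwork for scalable computational analysis of cognitive fixation.

Insight: Multimodal analysis captures user behavior patterns more completely, and hierarchical quantification improves the accuracy and interpretability of the assessment.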

Abstract: Digital social media platforms frequently contribute to cognitive-behavioral fixation, a phenomenon in which users exhibit sustained and repetitive engagement with narrow content domains. While cognitive-behavioral fixation has been extensively studied in psychology, methods for computationally detecting and evaluating such fixation remain underexplored. To address this gap, we propose a novel framework for assessing cognitive-behavioral fixation by analyzing users’ multimodal social media engagement patterns. Specifically, we introduce a multimodal topic extraction module and a cognitive-behavioral fixation quantification module that collaboratively enable adaptive, hierarchical, and interpretable assessment of user behavior. Experiments on existing benchmarks and a newly curated multimodal dataset demonstrate the effectiveness of our approach, laying the groundwork for scalable computational analysis of cognitive fixation. All code in this project is publicly available for research purposes at https://github.com/Liskie/cognitive-fixation-evaluation.