Table of Contents

cs.CL

[1] Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions

Logé Cécile,Ghori Rehan

Main category: cs.CL

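TL;DR: The paper presents an AI system with two modules, Truth Sleuth and Trend Bender, that automatically detects misinformation in YouTube videos and seeks to influence viewers' opinions through comments.

Motivation: Misinformation spreads rapidly online, and platforms such as YouTube have become a major channel for it, so technical means of intervention are urgently needed.

Contribution: Designs Truth Sleuth, which fact-checks claims using Retrieval-Augmented Generation (RAG), and Trend Bender, which generates comments and improves them through self-evaluation; together they form a closed-loop system.

Method: Truth Sleuth extracts claims from a video and verifies them with RAG over multiple data sources; Trend Bender generates comments from the resulting report and refines them iteratively.

Result: The system shows high fact-checking accuracy on benchmark datasets and potential for user engagement in a real-world deployment.

Insight: AI-driven fact-checking and opinion intervention show potential for countering misinformation and improving the quality of online discussion.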

Abstract: Misinformation poses a significant threat in today’s digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenges misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach, drawing on sources like Wikipedia, Google Search, and Google FactCheck, to accurately assess their veracity, and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system’s capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.

[2] An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

Vimaleswar A,Prabhu Nandan Sahu,Nilesh Kumar Sahu,Haroon R Lone

Main category: cs.CL

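TL;DR: EmoSApp is an offline smartphone conversational app for mental health support. It runs a fine-tuned, quantized large language model (LLaMA-3.2-1B-Instruct) on resource-constrained devices, draws on a mental-health knowledge dataset, and is validated through a student-centered evaluation.

Motivation: Current digital mental health platforms face limited user accessibility, unreliable internet connectivity, and data privacy concerns, which calls for an offline, smartphone-based solution.

Contribution: 1. Proposes EmoSApp, an entirely offline app for mental health conversations; 2. Fine-tunes LLaMA-3.2-1B-Instruct and adapts it to mobile devices; 3. Validates the result with a domain dataset and user evaluation.

Method: 1. Fine-tunes and quantizes LLaMA-3.2-1B-Instruct; 2. Uses a mental-health QA dataset (14,582 pairs) together with multi-turn conversational data; 3. Combines qualitative evaluation with students and quantitative evaluation on standard benchmarks.

Result: Qualitatively, EmoSApp provides coherent, empathetic dialogue and relevant suggestions; quantitatively, results on nine commonsense and reasoning benchmarks support the efficacy of the fine-tuned, quantized model in low-resource settings.

Insight: EmoSApp shows the potential of domain-specific fine-tuning and on-device deployment for mental health support, and offers a blueprint for future portable, secure AI solutions.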

Abstract: Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated “Knowledge dataset” of 14,582 mental-health QA pairs, along with the multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp can respond coherently and empathetically, maintain interactive dialogue, and provide relevant suggestions for users’ mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.

[3] A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations

Isar Nejadgholi,Mona Omidyeganeh,Marc-Antoine Drouin,Jonathan Boisvert

Main category: cs.CL

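TL;DR: The paper proposes a taxonomy for prompt-based Natural Language Explanations (NLEs) that supports their design and evaluation, with the aim of making AI systems more transparent.

Motivation: With the rise of large language models, NLEs have become a key tool for describing model behavior, which calls for a systematic study of their characteristics and governance implications.

Contribution: Proposes an updated Explainable AI (XAI) taxonomy tailored to prompt-based NLEs, covering three dimensions: Context, Generation and Presentation, and Evaluation.

Method: Drawing on the XAI literature, the authors build a three-dimensional taxonomy: Context (task, data, audience, and goals); Generation and Presentation (generation methods, interactivity, and related aspects); and Evaluation (content, user-centered properties, and related criteria).

Result: The taxonomy gives researchers, auditors, and policymakers a framework for characterizing and improving NLEs and for making AI systems more transparent.

Insight: A systematic taxonomy helps address AI governance challenges, in particular by standardizing how NLEs are designed and evaluated.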

Abstract: Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.

[4] PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification

Yogachandran Rahulamathavan,Misbah Farooq,Varuna De Silva

Main category: cs.CL

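TL;DR: PLEX is a perturbation-free local explanation method that uses the LLM's contextual embeddings and a Siamese-style network to cut computational overhead substantially while staying close to LIME and SHAP in accuracy.

Motivation: Perturbation-based XAI methods such as LIME and SHAP are computationally expensive for text classification and especially inefficient with LLMs, so a more efficient explanation method is needed.

Contribution: Proposes PLEX, which combines a Siamese-style network trained once with contextual embeddings to produce efficient local explanations without perturbations, sharply reducing time and compute costs.

Method: Uses contextual embeddings from the LLM and a Siamese-style neural network trained to align with feature importance scores, avoiding repeated perturbation and inference.

Result: On four classification tasks, PLEX agrees with LIME and SHAP more than 92% of the time and performs better in some cases, while reducing time and computational overhead by two and four orders of magnitude, respectively.

Insight: PLEX shows that one-off training can replace repeated perturbation, offering a new route to efficient explanations for LLMs.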

Abstract: Large Language Models (LLMs) excel in text classification, but their complexity hinders interpretability, making it difficult to understand the reasoning behind their predictions. Explainable AI (XAI) methods like LIME and SHAP offer local explanations by identifying influential words, but they rely on computationally expensive perturbations. These methods typically generate thousands of perturbed sentences and perform inferences on each, incurring a substantial computational burden, especially with LLMs. To address this, we propose Perturbation-free Local Explanation (PLEX), a novel method that leverages the contextual embeddings extracted from the LLM and a "Siamese network"-style neural network trained to align with feature importance scores. This one-off training eliminates the need for subsequent perturbations, enabling efficient explanations for any new sentence. We demonstrate PLEX's effectiveness on four different classification tasks (sentiment, fake news, fake COVID-19 news and depression), showing more than 92% agreement with LIME and SHAP. Our evaluation using a "stress test" reveals that PLEX accurately identifies influential words, leading to a similar decline in classification accuracy as observed with LIME and SHAP when these words are removed. Notably, in some cases, PLEX demonstrates superior performance in capturing the impact of key features. PLEX dramatically accelerates explanation, reducing time and computational overhead by two and four orders of magnitude, respectively. This work offers a promising solution for explainable LLM-based text classification.
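One way to make "agreement with LIME and SHAP" concrete is to compare the top-k most influential words that two explainers assign to the same sentence. The sketch below is only an illustration: the abstract does not define its agreement metric, so the top-k overlap used here, the function names, and the toy scores are all assumptions.

```python
def top_k_words(scores, k):
    """Return the k words with the largest absolute importance score."""
    ranked = sorted(scores, key=lambda w: abs(scores[w]), reverse=True)
    return set(ranked[:k])


def top_k_agreement(scores_a, scores_b, k=3):
    """Fraction of the top-k words shared by two explanations (assumed metric)."""
    shared = top_k_words(scores_a, k) & top_k_words(scores_b, k)
    return len(shared) / k


# Toy word-importance scores for one sentence (made-up numbers).
lime_scores = {"movie": 0.05, "terrible": -0.8, "boring": -0.6, "the": 0.01, "plot": -0.2}
plex_scores = {"movie": 0.10, "terrible": -0.7, "boring": -0.5, "the": 0.00, "plot": -0.05}

print(top_k_agreement(lime_scores, plex_scores, k=3))
```

Here the two explainers share "terrible" and "boring" but disagree on the third word, so two of the three top words match.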

[5] Emergence of Hierarchical Emotion Organization in Large Language Models

Bo Zhao,Maya Okawa,Eric J. Bigelow,Rose Yu,Tomer Ullman,Ekdeep Singh Lubana,Hidenori Tanaka

Main category: cs.CL

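TL;DR: The paper studies how large language models (LLMs) naturally form hierarchical emotion trees that align with human psychological models, and it uncovers systematic biases in emotion recognition, especially misclassifications for underrepresented groups.

Motivation: As LLMs increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, a psychological framework, the authors analyze probabilistic dependencies between emotional states in model outputs.

Contribution: Shows that LLMs naturally form hierarchical emotion trees consistent with human psychological models, and reveals systematic biases in emotion recognition, particularly misclassifications for marginalized groups.

Method: Analyzes probabilistic dependencies between emotional states in LLM outputs, guided by the emotion wheel framework from human psychology.

Result: LLMs form hierarchical emotion trees that align with human psychological models, but their emotion recognition is systematically biased, with misclassifications most pronounced for marginalized groups.

Insight: LLMs may internalize aspects of social perception, which points to cognitively grounded theories as a basis for better model evaluations.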

Abstract: As large language models (LLMs) increasingly power conversational agents, understanding how they model users’ emotional states is critical for ethical deployment. Inspired by emotion wheels – a psychological framework that argues emotions organize hierarchically – we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

[6] Language Models for Adult Service Website Text Analysis

Nickolas Freeman,Thanh Nguyen,Gregory Bott,Jason Parton,Collin Francel

Main category: cs.CL

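TL;DR: The paper studies how language models can analyze text from adult service websites (ASWs) to help identify sex trafficking victims. It compares several approaches, including pre-trained transformers and custom transformer models, and finds that the custom models outperform existing models in both performance and resource efficiency.

Motivation: ASW text is often used to identify sex trafficking victims, but its heavy use of emojis, poor grammar, and deliberate obfuscation makes it hard for conventional analysis methods to handle. Efficient language models are needed to meet this challenge.

Contribution: 1. Proposes an efficient custom transformer model suited to ASW text analysis; 2. Demonstrates the model on several tasks (graph decomposition, clustering, and learning emoji embeddings); 3. Outperforms well-known models such as BERT-base, RoBERTa, and ModernBERT.

Method: Compares information retrieval methods, pre-trained transformers, and custom transformer models. Experiments establish the custom model's advantages in resource efficiency and performance, and the model is then applied to graph decomposition, clustering, and emoji understanding on ASW text.

Result: The custom transformer models outperform BERT-base, RoBERTa, and ModernBERT on accuracy, recall, F1 score, and ROC AUC, while also using fewer resources.

Insight: 1. The characteristics of ASW text allow lightweight custom models to be trained and run efficiently; 2. Handling emojis and obfuscated content efficiently is essential for sex trafficking research; 3. The models can support downstream applications and research.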

Abstract: Sex trafficking refers to the use of force, fraud, or coercion to compel an individual to perform commercial sex acts against their will. Adult service websites (ASWs) have been and continue to be linked to sex trafficking, offering a platform for traffickers to advertise their victims. Thus, organizations involved in the fight against sex trafficking often use ASW data when attempting to identify potential sex trafficking victims. A critical challenge in transforming ASW data into actionable insight is text analysis. Previous research using ASW data has shown that ASW ad text is important for linking ads. However, working with this text is challenging due to its extensive use of emojis, poor grammar, and deliberate obfuscation to evade law enforcement scrutiny. We conduct a comprehensive study of language modeling approaches for this application area, including simple information retrieval methods, pre-trained transformers, and custom transformer models. We demonstrate that characteristics of ASW text data allow efficient custom transformer models to be trained with relatively small GPU resources and used efficiently for inference on consumer hardware. Our custom models outperform fine-tuned variants of well-known encoder-only transformer models, including BERT-base, RoBERTa, and ModernBERT, on accuracy, recall, F1 score, and ROC AUC. We demonstrate the use of our best-performing custom configuration on three tasks related to ASW data analysis: (i) decomposing the giant component in a graph representation of ASW data, (ii) clustering ASW ad text, and (iii) using the learned token embeddings to understand the use of emojis in the illicit context we study. The models we develop represent a significant advancement in ASW text analysis, which can be leveraged in a variety of downstream applications and research.

[7] Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao,Chengye Wang,Chuhan Li,Arman Cohan

Main category: cs.CL

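TL;DR: The paper introduces MISS-QA, the first benchmark designed specifically to evaluate how well models understand schematic diagrams in scientific literature. It contains 1,500 expert-annotated examples, is used to test 18 frontier multimodal foundation models, and reveals a significant gap between these models and human experts.

Motivation: Schematic diagrams are an important way of conveying complex research in scientific papers, but whether current multimodal foundation models can understand them remains an open question. The authors design MISS-QA to fill this gap.

Contribution: 1. Proposes MISS-QA, the first benchmark for schematic diagram understanding in scientific literature; 2. Evaluates 18 frontier multimodal foundation models; 3. Reveals the performance gap between models and human experts and points to directions for improvement.

Method: Builds the MISS-QA benchmark from 465 scientific papers and 1,500 expert-annotated examples, and tests models on interpreting schematic diagrams and answering information-seeking questions.

Result: Current models perform significantly worse than human experts on MISS-QA; the analysis of unanswerable questions and the error analysis expose further limitations.

Insight: Multimodal foundation models still fall clearly short at understanding scientific schematic diagrams, and their overall comprehension of multimodal scientific literature needs further improvement.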

Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

[8] Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

David M. Markowitz,Samuel Hardman Taylor

Main category: cs.CL

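TL;DR: The paper examines the relationship between online hate speech and social approval. Analyzing 110 million posts from Parler, it tests two hypotheses from social approval theory and finds no significant association between social approval and the subsequent production or escalation of hate speech.

Motivation: To examine whether online hate speech is motivated by social approval, and to test how well Walther's (2024) social approval theory holds on a specific platform.

Contribution: Through large-scale data analysis, finds that social approval has limited influence on the spread and escalation of hate speech, challenging the generality of the existing theory.

Method: Uses 110 million Parler posts from 2018 to 2021 to analyze the relationship between the hate speech users post and the social approval they receive (for example, upvotes).

Result: The number of upvotes on a hate speech post is unassociated with the amount of hate speech the user produces afterwards; between-person effects show a negative relationship at the post level and mixed results at other time intervals.

Insight: On niche social media platforms, social approval may reinforce hate speech differently than on mainstream platforms, so theoretical models should be re-examined with platform characteristics in mind.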

Abstract: In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther’s (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

[9] LLMs on Trial: Evaluating Judicial Fairness for Large Language Models

Yiran Hu,Zongyue Xue,Haitao Li,Siyuan Zheng,Qingjing Chen,Shaochun Wang,Xihan Zhang,Ning Zheng,Yun Liu,Qingyao Ai,Yiqun Liu,Charles L. A. Clarke,Weixing Shen

Main category: cs.CL

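TL;DR: The paper builds a framework for evaluating the judicial fairness of large language models (LLMs), reveals inconsistency, bias, and imbalanced inaccuracy in LLM judicial decisions, and releases a toolkit to support improvement.

Motivation: LLMs are increasingly used in high-stakes fields, yet their judicial fairness has not been adequately evaluated, which may affect social justice. The authors therefore set out to develop a systematic way to measure and improve it.

Contribution: (1) Builds a comprehensive framework for evaluating judicial fairness; (2) develops three evaluation metrics (inconsistency, bias, and imbalanced inaccuracy); (3) proposes a method for assessing LLM fairness across labels; (4) releases the JudiFair dataset and an accompanying toolkit to support future research.

Method: Selects 65 labels and 161 corresponding values for evaluating LLM judicial fairness, and develops three statistical metrics based on the JudiFair dataset. It also introduces a method for assessing the fairness of multiple LLMs across different labels.

Result: Experiments show that inconsistency, bias, and imbalanced inaccuracy are pervasive across 16 LLMs. Bias is most pronounced on demographic labels, and more accurate predictions exacerbate bias. Adjusting the temperature parameter can influence fairness, whereas model size, release date, and country of origin show no significant effect.

Insight: LLM judicial fairness is shaped by interacting factors, such as the trade-off between bias and accuracy. Future work needs to reduce bias without sacrificing accuracy, and can use the public toolkit to improve LLM fairness.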

Abstract: Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs’ judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
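To give a feel for what an inconsistency metric could measure, the sketch below counts how often a model's prediction changes when the same case facts are presented with different values of a single label. This is an assumed, simplified definition for illustration only; the abstract does not give the paper's statistical formulation, and the record layout and names here are hypothetical.

```python
from collections import defaultdict


def inconsistency_rate(records):
    """Share of cases whose prediction varies across label values.

    `records` is an iterable of (case_id, label_value, prediction) tuples,
    where rows with the same case_id differ only in the label value.
    """
    predictions_by_case = defaultdict(set)
    for case_id, _label_value, prediction in records:
        predictions_by_case[case_id].add(prediction)
    if not predictions_by_case:
        return 0.0
    changed = sum(1 for preds in predictions_by_case.values() if len(preds) > 1)
    return changed / len(predictions_by_case)


# Hypothetical predictions (sentence length in months) for four cases,
# each presented with two values of one demographic label.
records = [
    ("case-1", "value-a", 36), ("case-1", "value-b", 36),
    ("case-2", "value-a", 12), ("case-2", "value-b", 18),
    ("case-3", "value-a", 24), ("case-3", "value-b", 24),
    ("case-4", "value-a", 6), ("case-4", "value-b", 6),
]

print(inconsistency_rate(records))
```

Only the second case receives different predictions across label values, so one case in four is counted as inconsistent.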

[10] How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations

Ikumi Numaya,Shoji Moriya,Shiki Sato,Reina Akama,Jun Suzuki

Main category: cs.CL

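TL;DR: The paper investigates how stylistic similarity shapes user preferences in dialogue generation. It introduces a dataset containing user preferences, subjective stylistic similarity, and objective stylistic similarity annotated by third parties, and finds a strong correlation between subjective similarity and user preference.

Motivation: To study how stylistic similarity affects user preferences in dialogue generation and to separate subjective from objective similarity, a distinction that earlier work has often overlooked.

Contribution: 1. Presents a new dataset annotated with user preferences and stylistic similarity; 2. Shows a strong correlation between subjective stylistic similarity and user preference; 3. Finds that subjective and objective similarity differ.

Method: Builds an open-domain dialogue dataset that collects users' preference ratings, their subjective assessments of stylistic similarity, and objective similarity annotated by third parties, then analyzes the relationships statistically.

Result: Subjective stylistic similarity correlates strongly and positively with user preference, while subjective and objective similarity differ, which underlines the importance of distinguishing the two.

Insight: User preference depends more on subjectively perceived stylistic similarity than on similarity annotated by third parties, which has implications for designing and evaluating dialogue systems.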

Abstract: Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

[11] HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

Seungho Choi

Main category: cs.CL

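TL;DR: The paper proposes HanjaBridge, a Hanja-augmented pre-training method that resolves semantic ambiguity in Korean LLMs, significantly improves Korean language understanding, and yields positive cross-lingual transfer.

Motivation: LLMs perform poorly in low-resource languages such as Korean, partly because homophonous Sino-Korean words are indistinguishable in Hangul script, which causes semantic ambiguity.

Contribution: Proposes HanjaBridge, which presents multiple Hanja candidates so that the model learns contextual disambiguation, combined with knowledge distillation to prevent catastrophic forgetting.

Method: Within a continual pre-training framework, supplies all possible Hanja candidates for each homograph and pairs this with token-level knowledge distillation.

Result: Achieves a 21% relative improvement on the KoBALT benchmark and strong cross-lingual transfer, with no additional compute cost at inference time.

Insight: Reinforcing semantic alignment between Korean and Chinese through Hanja improves Korean understanding and also promotes cross-lingual transfer.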

Abstract: Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe a strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, ensuring practical efficiency with no additional run-time cost.
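The idea of presenting every Hanja candidate for a homograph, rather than committing to one, can be sketched as a small text-augmentation step. The inline `word(candidate|candidate)` format, the tiny lexicon, and the assumption that the input is already split into morphemes are all illustrative choices, not the format used in the paper.

```python
# Tiny illustrative lexicon: Hangul spelling -> possible Hanja readings.
HANJA_CANDIDATES = {
    "의사": ["醫師", "意思", "義士"],  # doctor / intention / righteous martyr
    "사과": ["沙果", "謝過"],          # apple / apology
}


def augment_with_hanja(tokens, candidates=HANJA_CANDIDATES):
    """Attach all Hanja candidates to each ambiguous token.

    Tokens without an entry in the lexicon are passed through unchanged,
    so the model still has to pick the right reading from context.
    """
    augmented = []
    for token in tokens:
        options = candidates.get(token)
        if options:
            augmented.append(f"{token}({'|'.join(options)})")
        else:
            augmented.append(token)
    return augmented


# Pre-segmented input (hypothetical segmentation): "의사 가 왔다".
print(" ".join(augment_with_hanja(["의사", "가", "왔다"])))
```

Because the abstract reports that gains persist when the augmentation is omitted at inference time, a step like this would only be needed when preparing pre-training data.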

[12] Modeling Understanding of Story-Based Analogies Using Large Language Models

Kalit Inani,Keshav Kabra,Vijay Marupudi,Sashank Varma

Main category: cs.CL

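TL;DR: The paper studies the reasoning abilities of large language models (LLMs) on story-based analogy tasks. Comparing against human performance, it evaluates the models' semantic representations and the effect of explicit prompting, and examines how model size and architecture affect performance.

Motivation: Prior work shows that LLMs can extract similarities in analogy tasks but lack human-like reasoning. The paper takes a closer look at LLM performance on story-based analogies to fill gaps in our understanding of their analogical reasoning.

Contribution: 1) Evaluates whether LLM semantic representations capture similarity and dissimilarity in analogies; 2) studies how explicit prompting affects the models' explanations of analogies; 3) analyzes differences between LLM and human performance at the level of individual analogies; 4) examines the effect of model size and architecture.

Method: Uses a story-based analogy task, evaluates LLM semantic representations with sentence embeddings, and adds explicit prompts that ask the model to explain analogies. Experiments compare LLMs of different sizes (8B vs. 70B parameters) and architectures (such as GPT-4 and LLaMA3).

Result: LLM performance on the analogy task differs from human performance, particularly with respect to explicit prompting and semantic representation. Model size and architecture significantly affect performance, with larger models doing better.

Insight: The paper exposes the limits of LLM analogical reasoning, especially in semantic understanding and explicit reasoning, and indicates that changing model size and architecture can improve performance. This adds to our understanding of LLMs' potential as models of human reasoning.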

Abstract: Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
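The embedding-based check described above reduces to a comparison of two similarities: the source story should be closer to the target than to the distractor. The sketch below shows that comparison with cosine similarity on toy vectors; in practice the vectors would come from a sentence-embedding model, which is left out here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def prefers_target(source, target, distractor):
    """True if the source embedding is closer to the target than to the distractor."""
    return cosine(source, target) > cosine(source, distractor)


# Toy 3-dimensional stand-ins for sentence embeddings (made-up numbers).
source = [1.0, 0.0, 1.0]
target = [0.9, 0.1, 0.8]
distractor = [0.0, 1.0, 0.2]

print(prefers_target(source, target, distractor))
```

Running this check per analogy, rather than averaging over a dataset, matches the abstract's emphasis on evaluating individual analogies.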

[13] DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models

Anthony Miyaguchi,David Guecha,Yuwen Chiu,Sidharth Gaur

Main category: cs.CL

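TL;DR: In the eRisk 2025 challenge, the DS@GT team used a prompt-engineering strategy to detect depression with LLMs and analyzed how conversational cues influence symptom prediction.

Motivation: To explore how well large language models (LLMs) and prompt engineering can detect early depression through conversational assessment when ground-truth labels are unavailable.

Contribution: Proposes a prompt design method based on BDI-II criteria that aligns model outputs with the clinical instrument and identifies the conversational cues that influence symptom prediction.

Method: Applies prompt engineering across multiple LLMs to produce structured JSON outputs, evaluated through cross-model agreement and internal consistency.

Result: The best submission ranked second on the official leaderboard, with DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.

Insight: Prompt engineering and conversation analysis can support LLM-based depression detection and suggest new directions for unsupervised or semi-supervised approaches.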

Abstract: This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
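A structured, BDI-II-based output can be validated and compared across models with very little code. The sketch below assumes a hypothetical JSON schema with a single `item_scores` field; the team's actual schema and agreement measure are not given in the abstract. It relies only on the standard BDI-II layout of 21 items scored 0 to 3 and the conventional severity bands for the total score.

```python
import json

NUM_ITEMS = 21  # BDI-II has 21 items, each scored 0-3


def parse_assessment(raw_json):
    """Parse and validate one model's assessment (hypothetical schema)."""
    scores = json.loads(raw_json)["item_scores"]
    if len(scores) != NUM_ITEMS:
        raise ValueError(f"expected {NUM_ITEMS} item scores, got {len(scores)}")
    if any(not isinstance(s, int) or not 0 <= s <= 3 for s in scores):
        raise ValueError("each item score must be an integer from 0 to 3")
    return scores


def severity(total):
    """Map a BDI-II total (0-63) to the conventional severity band."""
    if total <= 13:
        return "minimal"
    if total <= 19:
        return "mild"
    if total <= 28:
        return "moderate"
    return "severe"


def mean_abs_item_difference(scores_a, scores_b):
    """Simple cross-model disagreement: mean absolute per-item difference."""
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)


# Two made-up model outputs for the same conversation.
raw_a = json.dumps({"item_scores": [1] * 21})
raw_b = json.dumps({"item_scores": [1] * 20 + [3]})

scores_a = parse_assessment(raw_a)
scores_b = parse_assessment(raw_b)
print(severity(sum(scores_a)), severity(sum(scores_b)))
print(mean_abs_item_difference(scores_a, scores_b))
```

Validating the structure before scoring matters here because, without ground-truth labels, malformed outputs would otherwise distort the agreement statistics.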

[14] Teach Me Sign: Stepwise Prompting LLM for Sign Language Production

Zhaoyi An,Rei Kawakami

Main category: cs.CL

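TL;DR: The paper proposes TEAM-Sign, an LLM-based method for sign language production that uses a stepwise prompting strategy to draw out the sign language knowledge inside the LLM and to bridge the differences in distribution and grammatical rules between sign and spoken language.

Motivation: Sign language production is complex and follows its own rules, so existing AI techniques, including LLMs, are hard to apply directly. The paper treats sign language as another natural language and uses the LLM's reasoning ability and knowledge to generate it more effectively.

Contribution: 1) Proposes TEAM-Sign, which extracts the LLM's sign language knowledge through stepwise prompting; 2) brings sign language production into an LLM framework and addresses the alignment of distributions and grammar.

Method: 1) Fine-tunes an LLM to learn the correspondence between text and sign language; 2) uses a stepwise prompting strategy to guide the LLM toward output that follows sign language rules. Experiments use the How2Sign and Phoenix14T datasets.

Result: Experiments show that TEAM-Sign effectively combines the LLM's sign language knowledge and reasoning ability to produce output consistent with sign language rules.

Insight: The study shows the potential of LLMs for sign language production, and stepwise prompting offers a new way to handle complex language alignment problems.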

Abstract: Large language models, with their strong reasoning ability and rich knowledge, have revolutionized many AI tasks, but their impact on sign language generation remains limited due to its complexity and unique rules. In this paper, we propose TEAch Me Sign (TEAM-Sign), treating sign language as another natural language. By fine-tuning an LLM, we enable it to learn the correspondence between text and sign language, and facilitate generation. Considering the differences between sign and spoken language, we employ a stepwise prompting strategy to extract the inherent sign language knowledge within the LLM, thereby supporting the learning and generation process. Experimental results on How2Sign and Phoenix14T datasets demonstrate that our approach effectively leverages both the sign language knowledge and reasoning capabilities of the LLM to align the different distributions and grammatical rules between sign and spoken language.

[15] Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection

Lin Tian,Johanne R. Trippas,Marian-Andrei Rizoiu

Main category: cs.CL

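TL;DR: The paper proposes an efficient LoRA-based method for multilingual sexism detection. Hierarchical conditional adapters and unified multilingual training sharply reduce compute and storage costs while keeping performance competitive.

Motivation: Existing sexism detection methods usually need complex preprocessing or ensembles, which makes them computationally expensive. The paper aims to simplify the pipeline with parameter-efficient fine-tuning and to use multilingual capabilities to improve performance.

Contribution: 1. Introduces hierarchical conditional adapter routing that explicitly models label dependencies between tasks; 2. Extends LoRA to all linear layers; 3. Proposes a unified multilingual training strategy that removes the need for language-specific models.

Method: Uses hierarchical LoRA adapters (rank=16, 4-bit QLoRA) with unified multilingual training for three subtasks, fine-tuning only 1.67% of the parameters.

Result: On the multilingual test set, the method reaches ICM-Hard scores of 0.6774 for binary classification, 0.4991 for intention detection, and 0.6519 for multilabel categorization, with markedly better compute and storage efficiency.

Insight: 1. Hierarchical adapters model task dependencies effectively; 2. Parameter-efficient fine-tuning simplifies the pipeline while preserving performance; 3. Unified multilingual training promotes cross-lingual transfer.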

Abstract: This paper presents our approach to EXIST 2025 Task 1, addressing text-based sexism detection in English and Spanish tweets through hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Our method introduces conditional adapter routing that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Unlike conventional LoRA applications that target only attention layers, we apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific patterns. In contrast to complex data processing and ensemble approaches, we show that straightforward parameter-efficient fine-tuning achieves strong performance. We train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires minimal preprocessing and uses standard supervised learning. Our multilingual training strategy eliminates the need for separate language-specific models, achieving 1.7-2.4% F1 improvements through cross-lingual transfer. With only 1.67% trainable parameters compared to full fine-tuning, our approach reduces training time by 75% and model storage by 98%, while achieving competitive performance across all subtasks (ICM-Hard: 0.6774 for binary classification, 0.4991 for intention detection, 0.6519 for multilabel categorization).
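The small trainable-parameter share follows from how LoRA works: for a linear layer with a `d_in x d_out` weight, a rank-`r` adapter adds only `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic on toy layer shapes. The shapes are made up and are not Llama 3.1 8B's, so the printed values do not reproduce the 1.67% figure reported above.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one rank-`rank` LoRA adapter.

    The adapter factorizes the weight update into a (d_in x rank) and a
    (rank x d_out) matrix, hence rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def trainable_fraction(layer_shapes, rank):
    """LoRA parameters as a fraction of the frozen base weights they adapt."""
    base = sum(d_in * d_out for d_in, d_out in layer_shapes)
    added = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return added / base


# Toy block: one attention projection plus two feed-forward projections.
attention_only = [(1024, 1024)]
all_linear = [(1024, 1024), (1024, 4096), (4096, 1024)]

print(sum(lora_params(i, o, 16) for i, o in attention_only))  # adapter size, attention only
print(sum(lora_params(i, o, 16) for i, o in all_linear))      # adapter size, all linear layers
print(trainable_fraction(all_linear, rank=16))
```

Adapting every linear layer increases the adapter's size several times over in this toy block, yet it remains a small fraction of the base weights.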

[16] Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification

Yejun Yoon,Jaeyoon Jung,Seunghyun Yoon,Kunwoo Park

Main category: cs.CL

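TL;DR: HerO 2 is Team HUMANE's improved system for the AVeriTeC task. Document summarization, answer reformulation, and model optimization make fact verification more efficient while keeping performance high.

Motivation: To improve the existing open-source model HerO by raising evidence quality, optimizing veracity prediction, and working within computational constraints, so that the fact verification system is both efficient and high-performing.

Contribution: 1. Improves evidence quality through document summarization and answer reformulation. 2. Uses post-training quantization to reduce compute requirements. 3. Integrates updated language model backbones.

Method: 1. Document summarization and answer reformulation to improve evidence quality. 2. Post-training quantization to reduce computational overhead. 3. Updated language model backbones to improve overall system performance.

Result: HerO 2 ranked second in the AVeriTeC task and had the shortest runtime among the top three systems, showing both efficiency and strong performance.

Insight: Combining model optimization with attention to computational efficiency makes high-performing fact verification possible under resource constraints.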

Abstract: This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.

[17] Journalism-Guided Agentic In-Context Learning for News Stance Detection

Dahyun Lee,Jonghyeon Choi,Jiyoung Han,Kunwoo Park

Main category: cs.CL

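TL;DR: The paper introduces K-News-Stance, the first Korean dataset for article-level stance detection, and proposes the JoA-ICL framework, in which a language model agent predicts the stances of key segments and aggregates them to infer the overall article stance, outperforming existing methods.

Motivation: As online news consumption grows, recommendation systems risk reinforcing filter bubbles and political polarization. Stance detection is needed to support viewpoint-diverse recommendations and analyses of media bias.

Contribution: 1) Releases K-News-Stance, the first Korean article-level stance detection dataset; 2) proposes the JoA-ICL framework, which infers the overall article stance through a segment-level agent.

Method: JoA-ICL uses a language model agent to predict the stances of key structural segments of a news article (such as leads and quotes), then aggregates those stances to infer the stance of the article as a whole.

Result: Experiments show that JoA-ICL outperforms existing stance detection methods and that segment-level agency captures the stance of long articles more effectively. Case studies also show its potential for promoting viewpoint diversity in news recommendations and for uncovering media bias.

Insight: Segment-level stance detection gives a better grasp of the overall stance of long-form articles and opens new directions for news recommendation and media analysis.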

Abstract: As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection – identifying a text’s position on a target – can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 19,650 segment-level stance annotations across 47 societal issues. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments show that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
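The final aggregation step, turning segment-level stances into one article-level stance, can be illustrated with a weighted vote. This is a stand-in only: the abstract does not say how JoA-ICL aggregates segment predictions, and the stance labels, segment types, and weights below are hypothetical.

```python
from collections import Counter


def aggregate_stance(segments, weights=None):
    """Combine (segment_type, stance) predictions into one article stance.

    Each segment votes for its stance with a weight looked up by segment
    type (default 1.0). An empty input or a tie at the top yields "neutral".
    """
    weights = weights or {}
    totals = Counter()
    for segment_type, stance in segments:
        totals[stance] += weights.get(segment_type, 1.0)
    ranked = totals.most_common()
    if not ranked:
        return "neutral"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "neutral"
    return ranked[0][0]


# Hypothetical segment-level predictions for one article.
segments = [
    ("lead", "supportive"),
    ("quote", "oppositional"),
    ("quote", "oppositional"),
    ("conclusion", "supportive"),
]

print(aggregate_stance(segments))                         # unweighted vote
print(aggregate_stance(segments, weights={"lead": 2.0}))  # lead counts double
```

Weighting by segment type is one simple way to encode journalistic structure, for instance by letting the lead count for more than an individual quote.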

[18] LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP

Haowei Yang,Ziyu Shen,Junli Shao,Luyao Men,Xinyue Han,Jing Dong

Main category: cs.CL

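TL;DR: The paper proposes an LLM-augmented clinical NLP pipeline that extracts cardiovascular disease symptoms from unstructured clinical text and, combined with domain-adapted prompt-based inference, improves prediction performance.

Details Motivation: Timely identification and accurate risk assessment of cardiovascular disease (CVD) are key to reducing global mortality. Existing models rely mainly on structured data, while the unstructured text in clinical notes contains important early indicators. NLP built on LLMs can surface this information.

Contribution: 1. Proposes an LLM-augmented clinical NLP pipeline with cardiovascular domain adaptation; 2. addresses contextual hallucination and temporal ambiguity; 3. validates the performance gains on the MIMIC-III and CARDIO-NLP datasets.

Method: 1. Fine-tunes the LLM on the cardiovascular domain; 2. uses prompt-based inference and entity-aware reasoning; 3. adds rule-based verification to handle hallucination and temporal issues.

Result: Improved precision, recall, F1-score, and AUROC, with high clinical relevance as rated by cardiologists (kappa = 0.82).

Insight: LLMs have considerable potential in clinical decision support systems (CDSS), helping turn patient narratives into actionable risk assessments. Handling hallucination and temporal ambiguity remains a focus for future work.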

Abstract: Timely identification and accurate risk stratification of cardiovascular disease (CVD) remain essential for reducing global mortality. While existing prediction models primarily leverage structured data, unstructured clinical notes contain valuable early indicators. This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports. Our approach integrates cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82) assessed by cardiologists. Challenges such as contextual hallucination, which occurs when plausible information contradicts the provided source, and temporal ambiguity, which arises when models struggle with the chronological ordering of events, are addressed using prompt engineering and hybrid rule-based verification. This work underscores the potential of LLMs in clinical decision support systems (CDSS), advancing early warning systems and enhancing the translation of patient narratives into actionable risk assessments.

[19] Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach

Md. Sabbir Hossen,Md. Saiduzzaman,Pabon Shaha

Main category: cs.CL

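TL;DR: The paper proposes a hybrid transformer-based sentiment analysis framework for public sentiment in social media comments about the July Revolution in Bangladesh. It combines several transformer models with traditional classifiers and introduces a new hybrid model, XMB-BERT, which improves sentiment classification accuracy.

Details Motivation: The study aims to decode public sentiment on social media during the July Revolution in Bangladesh and to show the role social media played in driving social change. Bangla is a low-resource language with comparatively little prior work, a gap this study addresses.

Contribution: Proposes the hybrid transformer framework XMB-BERT, combines several pretrained models (BanglaBERT, mBERT, XLM-RoBERTa) with traditional classifiers, and validates the approach on a newly collected dataset of 4,200 Bangla comments, reaching 83.7% accuracy.

Method: (1) Extract text features with several transformer models (e.g., BanglaBERT); (2) reduce dimensionality with PCA; (3) explore eleven machine learning classifiers, with the voting classifier paired with XMB-BERT selected as the final model.

Result: XMB-BERT combined with the voting classifier performs best, reaching 83.7% accuracy and outperforming the other model-classifier combinations, which supports its potential for sentiment analysis in low-resource languages.

Insight: The study highlights the practical value of transformer models for sentiment analysis in low-resource languages such as Bangla and offers a technical reference for monitoring sentiment around similar social events.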

Abstract: The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. Principal Component Analysis (PCA) was used for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an accuracy of 83.7% and outperformed the other model-classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.

[20] Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification

Andres Azqueta-Gavaldón,Joaquin Ramos Cosgrove

Main category: cs.CL

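TL;DR: The paper explores using Large Language Models (LLMs) to improve cross-border entity identification and, by comparison with traditional algorithms, shows the advantages of LLMs in handling linguistic variation and semantic relationships.

Details Motivation: The growth of cross-border financial activity calls for more accurate entity identification. Traditional algorithms are limited because they struggle with linguistic variation and semantic relationships.

Contribution: Applies LLMs to cross-border entity identification and shows experimentally that they improve accuracy and reduce false positives.

Method: Compares traditional algorithms (Jaccard, cosine, and Levenshtein distances), Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba's Qwen 2.5).

Result: Interface-based LLMs perform best, with accuracy above 93%, F1 scores above 96%, and markedly lower false positives (40-80%).

Insight: LLMs are more flexible with complex semantics and linguistic variation, offering a more effective approach to cross-border entity identification.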

Abstract: The growing prevalence of cross-border financial activities in global markets has underscored the necessity of accurately identifying and classifying foreign entities. This practice is essential within the Spanish financial system for ensuring robust risk management, regulatory adherence, and the prevention of financial misconduct. This process involves a labor-intensive entity-matching task, where entities need to be validated against available reference sources. Challenges arise from linguistic variations, special characters, outdated names, and changes in legal forms, complicating traditional matching algorithms like Jaccard, cosine, and Levenshtein distances. These methods struggle with contextual nuances and semantic relationships, leading to mismatches. To address these limitations, we explore Large Language Models (LLMs) as a flexible alternative. LLMs leverage extensive training to interpret context, handle abbreviations, and adapt to legal transitions. We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba’s Qwen 2.5) using a dataset of 65 Portuguese company cases. Results show traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). Interface-based LLMs outperform, achieving accuracies above 93%, F1 scores exceeding 96%, and lower false positives (40-80%).
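To make the traditional baselines concrete, the following is a minimal sketch of two of the measures named in the abstract, token-level Jaccard similarity and Levenshtein edit distance. It is not the authors' implementation; the tokenization (lowercased whitespace split) and the company names in the usage note are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

With the made-up names `"Banco Exemplo SA"` and `"Banco Exemplo S.A."`, the token-level Jaccard score is only 0.5 because the legal-form token differs, which is the kind of surface variation the abstract says complicates such measures.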

[21] MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models

Seif Ahmed,Mohamed T. Younes,Abdelrahman Moustafa,Abdelrahman Allam,Hamza Moustafa

Main category: cs.CL

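TL;DR: The paper presents an ensemble-based multilingual multimodal reasoning system for the ImageCLEF 2025 challenge. By combining several Gemini models with carefully designed prompting strategies, it achieved leading results in the competition.

Details Motivation: To address the challenges of multimodal reasoning in multilingual educational settings, in particular balancing high accuracy against the need for lightweight models.

Contribution: 1. Proposes an ensemble-based multilingual multimodal reasoning system; 2. improves model performance through prompting strategies and cross-lingual augmentation; 3. achieves several leading results in the competition.

Method: 1. Uses Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as the reasoner; 2. designs few-shot and zero-shot prompting strategies; 3. runs an ablation study across several large language models.

Result: 1. Achieved 81.4% overall accuracy in the competition and led 11 of 13 language tracks; 2. prompt optimization raised accuracy on the English validation set from 55.9% to 61.7%.

Insight: Lightweight OCR-VLM ensembles, paired with precise prompting strategies and cross-lingual augmentation, can outperform heavier end-to-end models on multilingual tasks.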

Abstract: We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR-VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.

[22] What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests

Dimitri Staufer

Main category: cs.CL

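TL;DR: The paper examines how to quantify personal data in Large Language Models (LLMs) in support of Right-to-Be-Forgotten requests. It introduces the WikiMem dataset and a model-agnostic metric for identifying individual-fact associations in a model.

Details Motivation: LLMs can memorize and reveal personal information, so a method is needed to comply with the GDPR's Right to Be Forgotten. Existing methods do not effectively identify which personal data a model holds.

Contribution: Introduces the WikiMem dataset (over 5,000 natural language canaries covering 243 human-related properties) and a metric for quantifying individual-fact associations in LLMs.

Method: Ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts, evaluating 15 LLMs of different sizes (410M-70B parameters).

Result: Memorization correlates with the subject's web presence and with model scale.

Insight: Provides a foundation for dynamically constructing forget sets and responding to Right-to-Be-Forgotten requests, filling a gap in identifying personal data at the individual level.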

Abstract: Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU’s GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. We evaluate 200 individuals across 15 LLMs (410M-70B parameters), showing that memorization correlates with subject web presence and model scale. We provide a foundation for identifying memorized personal data in LLMs at the individual level, enabling the dynamic construction of forget sets for machine unlearning and RTBF requests.
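The ranking idea in the abstract can be sketched as follows. This is an illustration under assumptions, not the paper's metric: the per-prompt negative log-likelihoods are taken as already computed and calibrated (the calibration procedure is not described in the abstract), and the `truth_rank` helper and its inputs are hypothetical.

```python
def truth_rank(nll_by_candidate, truth):
    """Rank of the ground-truth value among candidates (1 = lowest mean NLL).

    `nll_by_candidate` maps each candidate value to a list of negative
    log-likelihoods, one per paraphrased prompt. The scores are assumed
    to be calibrated already; calibration itself is not shown here.
    """
    mean_nll = {cand: sum(vals) / len(vals)
                for cand, vals in nll_by_candidate.items()}
    ordered = sorted(mean_nll, key=mean_nll.get)  # most likely first
    return ordered.index(truth) + 1
```

A rank of 1 means the model assigns the true value a lower average negative log-likelihood than every counterfactual, which is the kind of signal that would suggest the association is stored in the model.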

[23] Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Conrad Borchers,Bahar Shahrokhian,Francesco Balzan,Elham Tajik,Sreecharan Sankaranarayanan,Sebastian Simon

Main category: cs.CL

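TL;DR: The paper studies how persona and temperature affect consensus-building and coding accuracy in LLM multi-agent systems (MAS). MAS did not significantly improve accuracy in most cases, but may help improve codebooks and human-AI coding by narrowing ambiguous code applications.

Details Motivation: To examine how different personas and temperature settings affect consensus and accuracy in qualitative coding with LLM multi-agent systems, and to test whether MAS offer an advantage over single-agent coding.

Contribution: 1. Examines the effect of temperature and persona on LLM agent consensus; 2. finds that MAS did not significantly improve accuracy in most cases; 3. open-sources the MAS and experimentation code.

Method: Using six open-source LLMs and 18 experimental configurations, analyzes over 77,000 coding decisions and compares single-agent coding with MAS on consensus-building and accuracy.

Result: Temperature and persona significantly affect whether and when consensus is reached, but MAS did not improve accuracy in most conditions, performing slightly better only in some configurations (low temperature with an assertive persona).

Insight: In qualitative coding, MAS may be better suited to narrowing ambiguous code applications than to directly improving accuracy, and agent systems with diverse personas do not necessarily produce better results.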

Abstract: Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications that could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.

[24] EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

Valle Ruiz-Fernández,Mario Mina,Júlia Falcão,Luis Vasquez-Reina,Anna Sallés,Aitor Gonzalez-Agirre,Olatz Perez-de-Viñaspre

Main category: cs.CL

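TL;DR: The paper introduces EsBBQ and CaBBQ, bias benchmark datasets for Spanish and Catalan, to evaluate social bias in large language models in these languages, addressing the shortage of bias evaluation resources for languages other than English.

Details Motivation: Resources for evaluating LLM bias are lacking for languages other than English (such as Spanish and Catalan) and for social contexts outside the United States. The paper aims to fill this gap.

Contribution: Introduces the Spanish and Catalan bias benchmark datasets EsBBQ and CaBBQ for evaluating social bias in language models in these languages.

Method: Building on the original BBQ dataset, designs a multiple-choice QA task across 10 categories, adapted to the Spanish and Catalan languages and to the social context of Spain.

Result: Models tend to choose the wrong answer in ambiguous scenarios, and high QA accuracy often correlates with greater reliance on social biases.

Insight: Social bias evaluation needs to be tailored to each language and social context, and high accuracy can mask a model's reliance on bias.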

Abstract: Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.

[25] An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

Fares Wael,Youssef Maklad,Ali Hamdi,Wael Elsersy

Main category: cs.CL

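TL;DR: FlowFSM is an LLM-based framework that uses prompt chaining and chain-of-thought reasoning to extract accurate finite-state machines (FSMs) from RFC documents, addressing the scalability and coverage limits of existing techniques.

Details Motivation: Existing FSM extraction techniques are limited in scalability and coverage and by ambiguity in natural language specifications, so a method that extracts FSMs efficiently and accurately is needed.

Contribution: Proposes the FlowFSM framework, which combines LLMs with prompt chaining to extract FSMs from protocol specifications with high precision while minimizing hallucinated state transitions.

Method: Uses prompt chaining and chain-of-thought reasoning to process protocol specifications systematically, identify state transitions, and build structured rule-books.

Result: Experiments on the FTP and RTSP protocols show high extraction precision with few hallucinated transitions.

Insight: Agent-based LLM systems show strong potential for protocol analysis and FSM inference, particularly in cybersecurity and reverse engineering.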

Abstract: Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.
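As an illustration of the kind of artifact an FSM extraction produces, here is a minimal sketch of a transition table and a walker over it. The states and the simplified FTP-style login fragment are hypothetical and hand-written for this example; they are not output from FlowFSM and do not reproduce the paper's rule-book format.

```python
# Hypothetical, simplified login fragment: (state, command) -> next state.
TRANSITIONS = {
    ("INIT", "USER"): "AWAIT_PASSWORD",
    ("AWAIT_PASSWORD", "PASS"): "LOGGED_IN",
    ("LOGGED_IN", "QUIT"): "CLOSED",
}

def run(commands, state="INIT"):
    """Follow the transition table and return the final state.

    Raises ValueError when a command is not allowed in the current state,
    which is how a hallucinated or missing transition would surface.
    """
    for cmd in commands:
        key = (state, cmd)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state} on {cmd}")
        state = TRANSITIONS[key]
    return state
```

Representing the machine as an explicit table makes each extracted transition easy to check against the specification one entry at a time.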

[26] Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Lyzander Marciano Andrylie,Inaya Rahmanisa,Mahardika Krisna Ihsani,Alfan Farizki Wicaksono,Haryo Akbarianto Wibowo,Alham Fikri Aji

Main category: cs.CL

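TL;DR: The paper proposes SAE-LAPE, a method based on sparse autoencoders (SAEs) for capturing language-specific features across languages in large language models (LLMs), and shows where these features sit, from the middle to the final layers, and how they affect multilingual performance.

Details Motivation: Understanding the multilingual mechanisms of LLMs is important, but existing studies mostly focus on individual neurons, whose polysemantic nature makes language-specific features hard to isolate. A new method is needed to identify language-specific features across languages.

Contribution: 1. Proposes SAE-LAPE, which uses feature activation probability to identify language-specific features; 2. finds that these features concentrate in the middle to final layers and are interpretable; 3. shows that they matter for multilingual performance and are useful for language identification.

Method: Uses sparse autoencoders (SAEs) to extract monosemantic features and identifies language-specific ones via feature activation probability (LAPE). Experiments focus on the model's feed-forward network.

Result: Language-specific features appear mainly in the middle to final layers and are interpretable. They can be used for language identification with performance comparable to fastText and with greater interpretability.

Insight: Sparse autoencoders can capture language-specific features across languages, shedding light on how LLMs handle multiple languages internally and providing a new tool for model interpretation and multilingual tasks.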

Abstract: Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features .
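One way to score language specificity from activation probabilities is sketched below. The abstract only says the method is "based on feature activation probability", so the entropy formulation here is an assumption suggested by the LAPE name rather than a confirmed description of SAE-LAPE. The `activation_entropy` helper and the probabilities in the usage note are hypothetical.

```python
import math

def activation_entropy(probs_by_lang):
    """Entropy of a feature's activation probabilities normalized across languages.

    `probs_by_lang` maps a language code to the fraction of tokens in that
    language on which the feature fires. Low entropy means the feature fires
    mostly for one language. This scoring rule is an assumption.
    """
    total = sum(probs_by_lang.values())
    if total <= 0:
        raise ValueError("feature never activates; entropy is undefined")
    entropy = 0.0
    for p in probs_by_lang.values():
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy
```

A feature that fires only on one language, e.g. `{"en": 0.9, "id": 0.0, "ko": 0.0}`, has zero entropy under this score, while one that fires equally on three languages reaches the maximum of `log(3)`.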

[27] KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

Luohe Shi,Zuchao Li,Lefei Zhang,Guoming Liu,Baoyuan Qi,Hai Zhao

Main category: cs.CL

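TL;DR: The paper proposes KV-Latent, which down-samples Key-Value vectors into a latent space to reduce the KV Cache memory footprint and improve inference speed, and modifies the frequency sampling mechanism of Rotary Positional Embedding to improve stability.

Details Motivation: In LLMs built on the Transformer Decoder architecture, the KV Cache that grows during inference has become an efficiency bottleneck in both memory consumption and data transfer bandwidth.

Contribution: 1. Proposes KV-Latent, which reduces the dimensionality of KV vectors to shrink the KV Cache; 2. modifies the frequency sampling mechanism of Rotary Positional Embedding to suit lower-dimensional vectors; 3. validates the method experimentally.

Method: 1. Down-samples Key-Value vectors into a latent space; 2. adjusts the frequency sampling mechanism of Rotary Positional Embedding to avoid noise introduced by higher frequencies; 3. studies the effect of reducing the Key and Value dimensions separately on model performance.

Result: Experiments show that KV-Latent reduces the KV Cache footprint and improves inference speed at low extra training cost (less than 1% of pre-training).

Insight: With a frequency-aware Rotary Positional Embedding, positional information stays stable after dimensionality reduction, offering a new direction for designing efficient LLMs.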

Abstract: Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, both in memory consumption and in data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, covering models both with and without Grouped Query Attention, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing Key and Value components on the model's performance. Our approach allows for the construction of more efficient language model systems and opens up new possibilities for KV Cache saving and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
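The effect of reducing the Key-Value dimension on cache size follows from standard cache-size arithmetic, sketched below. The model configuration is hypothetical and the halved head dimension is chosen only for illustration; neither comes from the paper, which also studies Key and Value reductions separately.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position.

    The leading factor of 2 accounts for storing both K and V;
    bytes_per_elem=2 corresponds to 16-bit floats.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical configuration: 32 layers, 32 KV heads, 4096-token context.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
reduced = kv_cache_bytes(layers=32, kv_heads=32, head_dim=64, seq_len=4096)
```

Here `full` is 2 GiB and `reduced` is 1 GiB, since the cache scales linearly with the per-head dimension of the stored vectors.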

[28] FMC: Formalization of Natural Language Mathematical Competition Problems

Jiaxuan Xie,Chengwu Liu,Ye Yuan,Siqi Li,Zhiping Xiao,Ming Zhang

Main category: cs.CL

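TL;DR: The paper proposes an autoformalization pipeline based on large language models with error feedback and uses it to build FMC, an Olympiad-level dataset of formalized mathematical problems that supports research on formal mathematical reasoning and automated theorem proving.

Details Motivation: Efficient and accurate autoformalization is key to advancing formal mathematical reasoning. Large-scale datasets aligning natural language mathematical problems with formal language are lacking, especially at Olympiad level, so automated methods are needed to fill the gap.

Contribution: 1) Proposes a fully automatic, training-free autoformalization pipeline; 2) builds the Olympiad-level FMC dataset, with 3,922 natural language problems and 9,787 Lean formalizations, of which 64.46% were assessed as at least above-average quality; 3) investigates the formalization and reasoning capabilities of various LLMs.

Method: An LLM-based autoformalization pipeline with an error feedback mechanism, improved through few-shot learning, error feedback, and increased sampling.

Result: Experiments confirm the quality and difficulty of the FMC dataset and show that it is suitable as a benchmark for formal reasoning tasks.

Insight: Error feedback and few-shot learning improve formalization quality, and increasing the number of samples improves results further. FMC provides a new resource for formal mathematical reasoning and automated theorem proving.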

Abstract: Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises 3,922 mathematical problems in natural language and 9,787 in Lean, of which 64.46% were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Experiments with three automated theorem provers on the FMC dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.

[29] Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks

Zewen Bai,Liang Yang,Shengdi Yin,Yuanyuan Sun,Hongfei Lin

Main category: cs.CL

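TL;DR: The paper introduces the first fine-grained Chinese hate speech dataset (STATE ToxiCN), studies Chinese coded hate terms and the ability of large models to interpret hate semantics, and proposes a method that integrates an annotated lexicon to improve detection performance.

Details Motivation: Chinese hate speech detection research suffers from a scarcity of fine-grained annotated data and insufficient work on identifying coded hate terms. The paper aims to fill this gap and improve model interpretability.

Contribution: 1. Introduces STATE ToxiCN, the first fine-grained Chinese hate speech dataset; 2. systematically studies Chinese coded hate terms and the semantic interpretation ability of large models; 3. proposes a method that integrates an annotated lexicon to improve detection performance.

Method: 1. Builds a span-level annotated dataset; 2. studies coded hate terms and the interpretive ability of large models; 3. designs a model enhancement framework that integrates an annotated lexicon.

Result: The proposed method significantly improves Chinese hate speech detection performance and provides valuable resources for related research.

Insight: Fine-grained annotated data and the study of coded terms are essential for improving the interpretability and performance of hate speech detection, and integrating external knowledge such as an annotated lexicon can improve models further.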

Abstract: The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models' deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and evaluate the hate semantic understanding of existing models using it. (2) We conduct the first comprehensive study of Chinese coded hate terms and of LLMs' ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.

[30] Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Andrei Niculae,Adrian Cosma,Cosmin Dumitrache,Emilian Rǎdoi

Main category: cs.CL

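TL;DR: The paper presents Dr.Copilot, a multi-agent large language model system that gives Romanian-speaking doctors real-time feedback along 17 interpretable axes to improve the quality of doctor-patient communication.

Details Motivation: Text-based telemedicine is increasingly common, but the quality of doctor-patient interactions is often judged by how advice is communicated rather than by its clinical accuracy. Dr.Copilot aims to improve the communication of Romanian-speaking doctors through a multi-agent LLM system.

Contribution: Presents Dr.Copilot, a multi-agent LLM system for the Romanian medical setting, with prompts optimized via DSPy and real-time feedback along 17 axes.

Method: The system comprises three LLM agents with prompts automatically optimized via DSPy. It was designed with low-resource Romanian data and deployed using open-weight models.

Result: Empirical evaluation and live deployment with 41 doctors show measurable improvements in user reviews and response quality.

Insight: The paper demonstrates the practical potential of LLMs in low-resource-language medical settings and underlines the importance of communication quality in doctor-patient interaction.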

Abstract: Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr.Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr.Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.

[31] Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

Haoran Jin,Meng Li,Xiting Wang,Zhihao Xu,Minlie Huang,Yantao Jia,Defu Lian

Main category: cs.CL

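TL;DR: The paper proposes Controlled Value Vector Activation (ConVA), a method that directly aligns the internal values of large language models, controlling their value expression effectively without degrading model performance.

Details Motivation: Aligning large language models (LLMs) with human values is important, but current methods fall short on transparency and adaptability. A method that intervenes directly in the model's internal representations is needed for more precise value alignment.

Contribution: 1. Proposes ConVA, which aligns internal values by interpreting and modifying the model's latent representations; 2. designs a context-controlled value vector identification method to avoid bias; 3. introduces a gated value vector activation mechanism that achieves effective control with minimal intervention.

Method: 1. Analyzes how values are encoded in the model's latent representations; 2. identifies value vectors with a context-controlled algorithm; 3. uses a gating mechanism to modify activations dynamically so that value control does not hurt model performance.

Result: Experiments show that ConVA achieves the highest control success rate across 10 basic values while preserving the model's fluency and performance, and maintains target values even under opposing or potentially malicious input prompts.

Insight: Direct intervention in a model's internal representations allows more transparent and flexible value alignment without harming performance. The gating mechanism is key because it achieves effective control with minimal intervention.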

Abstract: Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying the relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective value control with a minimal degree of intervention. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.

[32] Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge

Wenqing Wu,Chengzhi Zhang,Yi Zhao

Main category: cs.CL

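TL;DR: The paper proposes a method that combines human expert knowledge with large language model (LLM) knowledge to assess the method novelty of academic papers. It extracts novelty-related sentences from peer review reports, uses an LLM to summarize methodology sections, fine-tunes pretrained language models (PLMs) on these, and fuses human and LLM knowledge with a sparse attention mechanism, reporting superior performance.

Details Motivation: Traditional novelty assessment relies on experts or on unique reference combinations, but experts have limited knowledge and the effectiveness of the combination method is uncertain. The paper aims to combine LLM knowledge with the judgment of human experts to address these limitations.

Contribution: 1. Proposes a method combining human knowledge and LLMs to predict the method novelty of papers. 2. Designs a text-guided fusion module based on sparse attention to better integrate human and LLM knowledge.

Method: 1. Extracts novelty-related sentences from peer review reports and uses an LLM to summarize the methodology section. 2. Fine-tunes PLMs (e.g., BERT) on this data. 3. Designs a sparse attention fusion module to integrate human and LLM knowledge.

Result: Extensive experiments show that the method outperforms a range of baselines.

Insight: 1. Combining human and LLM knowledge allows a more complete assessment of a paper's novelty. 2. Sparse attention improves the integration of knowledge from multiple sources.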

Abstract: Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear whether unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not. Therefore, our research integrates the knowledge and abilities of LLMs and human experts to address the limitations of novelty assessment. The most common novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLMs to assist pretrained language models (PLMs, e.g., BERT) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use an LLM to summarize the methodology section of the academic paper; both are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared our proposed method with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.

[33] DCR: Quantifying Data Contamination in LLMs Evaluation

Cheng Xu,Nan Yan,Shuhao Guan,Changhong Jin,Yuke Mei,Yibing Guo,M-Tahar Kechadi

Main category: cs.CL

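TL;DR: The paper proposes the Data Contamination Risk (DCR) framework for detecting and quantifying data contamination in the evaluation of large language models (LLMs). A fuzzy inference system produces a unified DCR Factor that adjusts performance metrics to reflect contamination.

Details Motivation: With the rapid advance of large language models, benchmark data contamination (BDC) has become a growing problem, inflating measured performance and undermining the assessment of genuine generalization.

Contribution: Proposes the lightweight, interpretable DCR framework, which detects and quantifies data contamination at four levels (semantic, informational, data, and label) and adjusts performance metrics using the DCR Factor.

Method: Uses a fuzzy inference system to combine contamination scores into a unified DCR Factor, validated on sentiment analysis, fake news detection, and arithmetic reasoning tasks.

Result: Validated on 9 LLMs (0.5B-72B). Accuracy adjusted with the DCR Factor is within 4% average error of the uncontaminated baseline.

Insight: The DCR framework is efficient and transparent, can be integrated into routine evaluation, supports fairer comparisons, and improves the credibility of LLM benchmarking.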

Abstract: The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.

[34] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

LG AI Research,Kyunghoon Bae,Eunbi Choi,Kibong Choi,Stanley Jungkyu Choi,Yemuk Choi,Kyubeen Han,Seokhee Hong,Junwon Hwang,Taewan Hwang,Joonwon Jang,Hyojin Jeon,Kijeong Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Euisoon Kim,Hyosang Kim,Jihoon Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Youchul Kim,Edward Hwayoung Lee,Gwangho Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Young Min Paik,Yongmin Park,Youngyong Park,Sanghyun Seo,Sihoon Yang,Heuiyeen Yeen,Sihyuk Yi,Hyeongu Yun

Main category: cs.CL

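TL;DR: EXAONE 4.0 is a unified large language model that integrates a Non-reasoning mode and a Reasoning mode, supports multiple languages and tool use, comes in two model sizes, and outperforms open-weight models in its class.

Details Motivation: To prepare for the agentic AI era, a language model is needed that keeps strong usability while offering advanced reasoning, along with broader multilingual support.

Contribution: EXAONE 4.0 integrates Non-reasoning and Reasoning modes, extends multilingual support to Spanish in addition to English and Korean, adds agentic tool use, and is offered in two sizes: a mid-size 32B model and a small 1.2B model.

Method: A unified design that combines a Non-reasoning mode (for usability) with a Reasoning mode (for advanced reasoning), with multilingual training to widen language coverage.

Result: EXAONE 4.0 outperforms open-weight models in its class and remains competitive against frontier-class models. The models are publicly available for research purposes.

Insight: Integrating different modes and multilingual support is an effective path to more practical and adaptable language models, paving the way for agentic AI.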

Abstract: This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.

[35] KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?

Soumadeep Saha,Akshay Chaturvedi,Saptarshi Saha,Utpal Garain,Nicholas Asher

Main category: cs.CL

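TL;DR: The paper introduces Causal CoT Graphs (CCGs), which model fine-grained causal dependencies in LLM output from reasoning traces, and builds the KisMATH dataset. Experiments show that reasoning nodes in the CCG mediate the final answer and that language models internally realise similar structures.

Details Motivation: To study the mechanism by which chain-of-thought (CoT) helps large language models in mathematical reasoning, and to explore whether models internally recognise implicit structure.

Contribution: 1. Proposes Causal CoT Graphs (CCGs); 2. Builds the KisMATH dataset; 3. Empirically verifies the mediating role of CCG nodes and the emphasis LLMs place on CCG reasoning paths.

Method: 1. Automatically extracts CCGs from reasoning traces; 2. Runs an empirical analysis of 15 open-weight LLMs on the KisMATH dataset.

Result: 1. Reasoning nodes in the CCG are mediators for the final answer; 2. LLMs internally realise structures akin to CCGs.

Insight: The effectiveness of chain-of-thought may be tied to a model's ability to recognise implicit structure internally, which offers a new angle for future research.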

Abstract: Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks, yet there is no consensus on the mechanism through which this performance boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in the language model output. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K and AIME, and their associated CCGs are compiled into our dataset, KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCG are mediators for the final answer, a condition necessary for reasoning; and (ii) LLMs emphasise reasoning paths given by the CCG, indicating that models internally realise structures akin to our graphs. KisMATH enables controlled, graph-aligned interventions and opens up avenues for further investigation into the role of chain-of-thought in LLM reasoning.
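
To make the graph view concrete, here is a minimal sketch of a reasoning trace stored as a directed acyclic graph, with a helper that lists the nodes lying on some directed path from the question to the final answer. This is only a structural proxy for mediation; the paper's extraction procedure and causal interventions are not reproduced, and the node names, edges, and function names are hypothetical.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node reachable from `start` by following directed edges."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nodes_on_paths(edges, source, target):
    """Nodes strictly between `source` and `target` on at least one directed path."""
    forward = reachable(edges, source)
    reversed_edges = [(dst, src) for src, dst in edges]
    backward = reachable(reversed_edges, target)
    return (forward & backward) - {source, target}


# Hypothetical toy trace: "aside" is a step that never feeds the answer.
edges = [
    ("question", "step1"),
    ("question", "step2"),
    ("step1", "step3"),
    ("step2", "step3"),
    ("step3", "answer"),
    ("question", "aside"),
]
mediators = nodes_on_paths(edges, "question", "answer")
print(sorted(mediators))  # ['step1', 'step2', 'step3']
```

In this toy graph the `aside` node is excluded because no path leads from it to the answer; a real analysis would additionally intervene on node contents rather than rely on graph structure alone.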

[36] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller,Kathryn Ricci,Marc Marone,Antoine Chaffin,Dawn Lawrie,Benjamin Van Durme

Main category: cs.CL

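TL;DR: The paper introduces Ettin, an open suite of paired encoder-only and decoder-only models ranging from 17 million to 1 billion parameters. With a single training recipe, the suite reaches SOTA performance for its sizes in both categories, and the paper examines how well each architecture adapts to the other's tasks.

Details Motivation: The LLM community focuses almost exclusively on decoder-only models, yet encoder-only models remain widely used for classification and retrieval. Earlier comparisons were confounded by differences in parameter counts, training techniques, and datasets, so a comparison under a unified framework is needed.

Contribution: 1. Introduces the Ettin suite of paired encoder-only and decoder-only models at multiple sizes; 2. Reaches SOTA performance for both architectures under the same training recipe; 3. Open-sources the training data, training order segmented by checkpoint, and 200+ checkpoints.

Method: Trains encoder-only and decoder-only models with the same recipe (matched parameter counts, training techniques, and data) and compares them on classification, retrieval, and generation tasks.

Result: Encoder-only models excel at classification and retrieval, while decoders excel at generative tasks. Adapting a model to the other architecture's tasks through continued training underperforms a model trained natively for those tasks (for example, a 400M encoder outperforms a 1B decoder on MNLI).

Insight: Encoder and decoder architectures each have a natural advantage on particular tasks, so cross-objective adaptation is not the best choice. A unified training recipe provides the basis for a fair comparison.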

Abstract: The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

[37] Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

Yanjian Zhang,Guillaume Wisniewski,Nadi Tomeh,Thierry Charnois

Main category: cs.CL

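TL;DR: The paper investigates whether prompting can control the reasoning strategies of large language models (LLMs) and assesses the impact on logical problem-solving. No single strategy consistently improves accuracy, but performance could improve if models adaptively chose the optimal strategy.

Details Motivation: Human reasoning involves multiple strategies, whereas prior work shows that LLMs tend to favor a single strategy, which may limit their ability to handle diverse reasoning problems. This motivates studying whether prompting can control the reasoning strategies LLMs use.

Contribution: Proposes methods to guide LLMs in strategy selection, with the aim of improving their reasoning.

Method: Experimentally studies the effect of strategy choice and proposes methods that guide LLMs toward adaptive strategy selection.

Result: No single strategy consistently improves accuracy, but adaptively selecting the optimal strategy could improve performance.

Insight: The reasoning ability of LLMs can be further refined through flexible strategy selection, for which prompt design is key.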

Abstract: Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs' reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.
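
The gap between fixed strategies and adaptive selection can be illustrated with a small calculation. The sketch below compares per-strategy accuracy with an oracle upper bound that counts a problem as solved if any strategy solves it. The strategy names and correctness flags are made-up illustrative data, not results from the paper.

```python
def accuracy(flags):
    """Fraction of problems answered correctly (flags are 0 or 1)."""
    return sum(flags) / len(flags)


# Hypothetical correctness flags: one list per strategy, one entry per problem.
results = {
    "strategy_a": [1, 0, 1, 0, 1, 0],
    "strategy_b": [0, 1, 1, 0, 0, 1],
    "strategy_c": [1, 1, 0, 0, 0, 0],
}

per_strategy = {name: accuracy(flags) for name, flags in results.items()}

# Oracle selection: a problem counts as solved if at least one strategy solves it.
oracle_flags = [int(any(column)) for column in zip(*results.values())]
oracle_accuracy = accuracy(oracle_flags)

for name, value in per_strategy.items():
    print(f"{name}: {value:.2f}")
print(f"oracle: {oracle_accuracy:.2f}")
```

With this toy data the best fixed strategy solves half of the problems while the oracle solves five of six, which is the kind of headroom that adaptive strategy selection would need to recover.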

cs.CV [Back]

[38] CWNet: Causal Wavelet Network for Low-Light Image Enhancement

Tongshun Zhang,Pingping Liu,Yubing Lu,Mengen Cai,Zijian Zhang,Zhe Zhang,Qiuzhan Zhou

Main category: cs.CV

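TL;DR: This paper proposes CWNet (Causal Wavelet Network) for low-light image enhancement, combining causal reasoning with a wavelet-transform backbone that optimizes the recovery of frequency information, and reports significant gains over prior methods.

Details Motivation: Traditional low-light image enhancement methods often neglect instance-level semantic information and the inherent characteristics of different features. CWNet addresses these limitations through causal reasoning and wavelet transforms.

Contribution: 1) Adopts a causal reasoning perspective to reveal the causal relationships in low-light enhancement; 2) Designs a wavelet transform-based backbone network that optimizes the recovery of frequency information.

Method: 1) Uses causal reasoning and metric learning to separate causal factors from non-causal ones; 2) Introduces an instance-level CLIP semantic loss to maintain causal factor consistency; 3) Proposes a wavelet-transform network to improve frequency recovery.

Result: Significantly outperforms existing methods on multiple datasets and shows robust performance across diverse scenes.

Insight: Combining causal reasoning with wavelet transforms effectively addresses semantic consistency and frequency recovery in low-light image enhancement.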

Abstract: Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at https://github.com/bywlzts/CWNet-Causal-Wavelet-Network.

[39] Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

Jiayuan Chen,Thai-Hoang Pham,Yuanlong Wang,Ping Zhang

Main category: cs.CV

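TL;DR: The paper proposes a method that integrates external biological knowledge to improve microscopy image profiling models on de novo cell lines, improving generalization by disentangling cell line-specific and perturbation-specific features.

Details Motivation: Perturbation screening for de novo cell lines remains challenging because of the significant morphological and biological heterogeneity across cell lines. Existing models generalize poorly to unseen cell lines, so external biological knowledge is needed to make them more robust.

Contribution: 1. Proposes a framework that integrates external biological knowledge; 2. Uses a knowledge graph and transcriptomic features to disentangle cell line-specific and perturbation-specific features; 3. Validates the method on the RxRx datasets.

Method: 1. Builds a knowledge graph from protein interaction data in the STRING and Hetionet databases; 2. Incorporates transcriptomic features from single-cell foundation models to capture cell line-specific representations; 3. Improves generalization through pretraining with disentangled feature learning.

Result: With one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from RxRx19a, the method improves microscopy image profiling, particularly on de novo cell lines.

Insight: Integrating external biological knowledge helps the model learn cell line-specific and perturbation-specific features, making it more applicable to real-world drug discovery tasks.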

Abstract: High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for de novo cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to de novo cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for de novo cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.

[40] Auditing Facial Emotion Recognition Datasets for Posed Expressions and Racial Bias

Rina Khan,Catherine Stinson

Main category: cs.CV

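TL;DR: This study audits two state-of-the-art facial expression recognition (FER) datasets and finds that a significant number of images are posed rather than spontaneous, so reported model performance may not reflect in-the-wild performance. The audited FER models are also biased: they are more likely to predict a negative emotion (such as anger or sadness) for people labeled as not white or determined to have dark skin.

Details Motivation: FER algorithms perform worse on spontaneous expressions, and they tend to perform poorly for people of some races and skin colors. Both problems are linked to how the datasets were collected, so the datasets need to be audited to expose potential biases.

Contribution: Proposes a methodology for identifying spontaneous versus posed images and exposes the presence of posed images in the datasets; through skin color and race analysis, reveals prediction bias in FER models against people who are not white or who have dark skin.

Method: Takes random samples from two FER datasets and manually labels each image as spontaneous or posed; also records skin color, and tests three models trained on each dataset across races and skin tones.

Result: The datasets contain a significant number of posed images, so model performance on spontaneous expressions may be overestimated; the models show a clear bias toward predicting negative emotions for people who are not white or who have dark skin.

Insight: How a dataset is built directly affects model performance and fairness. Future FER research should pay more attention to collecting spontaneous expressions and to racial diversity in order to avoid algorithmic bias.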

Abstract: Facial expression recognition (FER) algorithms classify facial expressions into emotions such as happy, sad, or angry. An evaluative challenge facing FER algorithms is the fall in performance when detecting spontaneous expressions compared to posed expressions. An ethical (and evaluative) challenge facing FER algorithms is that they tend to perform poorly for people of some races and skin colors. These challenges are linked to the data collection practices employed in the creation of FER datasets. In this study, we audit two state-of-the-art FER datasets. We take random samples from each dataset and examine whether images are spontaneous or posed. In doing so, we propose a methodology for identifying spontaneous or posed images. We discover a significant number of images that were posed in the datasets purporting to consist of in-the-wild images. Since the performance of FER models varies between spontaneous and posed images, the performance of models trained on these datasets will not represent the true performance if such models were to be deployed in in-the-wild applications. We also observe the skin color of individuals in the samples, and test three models trained on each of the datasets to predict facial expressions of people from various races and skin tones. We find that the FER models audited were more likely to predict people labeled as not white or determined to have dark skin as showing a negative emotion such as anger or sadness even when they were smiling. This bias makes such models prone to perpetuate harm in real-life applications.

[41] FPC-Net: Revisiting SuperPoint with Descriptor-Free Keypoint Detection via Feature Pyramids and Consistency-Based Implicit Matching

Ionuţ Grigore,Călin-Adrian Popa,Claudiu Leoveanu-Condrei

Main category: cs.CV

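TL;DR: FPC-Net is a descriptor-free keypoint detection method built on feature pyramids and consistency-based implicit matching. It drastically reduces memory usage, at the cost of matching accuracy marginally lower than that of conventional approaches.

Details Motivation: Conventional keypoint matching relies on descriptors, which drives up memory usage. This work aims to remove the need for descriptors and make the system more efficient.

Contribution: Proposes FPC-Net, a descriptor-free keypoint detection and matching method that uses feature pyramids and implicit matching.

Method: Extracts keypoints with feature pyramids and associates them directly through consistency-based implicit matching, avoiding the computation and storage of descriptors.

Result: Matching accuracy is marginally lower, but memory usage drops drastically, which benefits localization systems.

Insight: Implicit matching can reduce system complexity efficiently and offers a new option for resource-constrained settings.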

Abstract: The extraction and matching of interest points are fundamental to many geometric computer vision tasks. Traditionally, matching is performed by assigning descriptors to interest points and identifying correspondences based on descriptor similarity. This work introduces a technique where interest points are inherently associated during detection, eliminating the need for computing, storing, transmitting, or matching descriptors. Although the matching accuracy is marginally lower than that of conventional approaches, our method completely eliminates the need for descriptors, leading to a drastic reduction in memory usage for localization systems. We assess its effectiveness by comparing it against both classical handcrafted methods and modern learned approaches.

[42] Warehouse Spatial Question Answering with LLM Agent

Hsiang-Wei Huang,Jen-Hao Cheng,Kuang-Ming Chen,Cheng-Yen Yang,Bahaa Alattar,Yi-Ru Lin,Pyongkun Kim,Sangwon Kim,Kwangju Kim,Chung-I Huang,Jenq-Neng Hwang

Main category: cs.CV

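TL;DR: This paper proposes a data-efficient approach that combines an LLM agent system with multiple tools to answer spatial questions in complex indoor warehouse scenes, achieving high accuracy and efficiency.

Details Motivation: Existing Multi-modal Large Language Models (MLLMs) perform poorly on spatial understanding tasks. The paper designs a data-efficient LLM agent system to improve spatial reasoning in complex indoor warehouse scenarios.

Contribution: Proposes a solution that pairs an LLM agent with tools to tackle complex spatial question answering, and demonstrates its efficiency and accuracy in warehouse scenes.

Method: Uses an LLM agent system that integrates multiple tools for spatial reasoning and API tool interaction to answer complicated spatial questions.

Result: Experiments on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset show high accuracy and efficiency on tasks such as object retrieval, counting, and distance estimation.

Insight: Integrating tools, rather than relying on large-scale finetuning alone, can effectively improve spatial reasoning, and is especially useful when data is limited.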

Abstract: Spatial understanding has been a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM’s spatial understanding ability. In this paper, we present a data-efficient approach. We propose an LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and interact with API tools to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent

[43] ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Ali Hojjat,Janek Haberer,Soren Pirk,Olaf Landsiedel

Main category: cs.CV

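TL;DR: ThinkingViT is a nested ViT architecture built on progressive thinking stages that dynamically adjusts inference computation to improve efficiency.

Details Motivation: Existing nested Transformer architectures allocate the same amount of compute to every input, ignoring differences in input complexity, which leads to inefficiency.

Contribution: 1) Proposes ThinkingViT, which adjusts compute dynamically; 2) Introduces the Token Recycling mechanism for progressive refinement of inference.

Method: Progressively activates attention heads and terminates inference early when predictions are sufficiently certain; Token Recycling reuses the embeddings from the previous stage.

Result: On ImageNet-1K, accuracy improves over nested baselines by up to 2.0 p.p. at the same throughput and by up to 2.9 p.p. at equal GMACs.

Insight: Dynamically adjusting compute can substantially improve model efficiency, and Token Recycling offers a new direction for optimizing ViTs.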

Abstract: Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at https://github.com/ds-kiel/ThinkingViT.
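
The early-termination control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ThinkingViT implementation: certainty is taken to be the maximum softmax probability, the `0.9` threshold is arbitrary, and each stage is a hypothetical callable that receives the input and the previous stage's embedding (standing in for Token Recycling).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def progressive_inference(stages, x, threshold=0.9):
    """Run stages in order; stop once the top-class probability reaches `threshold`.

    Each stage is a hypothetical callable (x, prev_embedding) -> (logits, embedding).
    Returns (predicted_label, confidence, number_of_stages_used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    embedding = None
    for index, stage in enumerate(stages):
        logits, embedding = stage(x, embedding)
        probs = softmax(logits)
        confidence = max(probs)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), confidence, index + 1


# Toy stand-ins for a small and a larger subnetwork (not real ViT blocks).
def small_stage(x, prev_embedding):
    return [x, 0.0], [x]


def large_stage(x, prev_embedding):
    # Reuses the previous embedding, loosely mimicking conditioning on earlier output.
    return [x + prev_embedding[0] + 2.0, 0.0], [x, prev_embedding[0]]


stages = [small_stage, large_stage]
easy = progressive_inference(stages, 4.0)  # confident after the first stage
hard = progressive_inference(stages, 0.5)  # needs the second stage
print(easy[2], hard[2])  # 1 2
```

The design point is that easy inputs pay only for the small subnetwork, while harder inputs trigger additional computation that builds on what the earlier stage already produced.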

[44] SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition

Quan Bi Pay,Vishnu Monn Baskaran,Junn Yong Loo,KokSheik Wong,Simon See

Main category: cs.CV

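TL;DR: SpaRTAN is a lightweight visual recognition architecture that enhances spatial and channel-wise information processing using kernels with varying receptive fields and a wave-based channel aggregation module, achieving strong parameter efficiency with competitive performance.

Details Motivation: Existing CNN and Transformer architectures suffer from simplicity bias and channel redundancy, so a more efficient design is needed to capture complex spatial features and reduce redundancy.

Contribution: Proposes SpaRTAN, a lightweight network that combines kernels with varying receptive fields and a wave-based channel aggregation module, improving spatial feature extraction and parameter efficiency.

Method: 1. Uses kernels with different kernel sizes and dilation factors to capture multi-order spatial features; 2. Introduces a wave-based channel aggregation module to reduce channel-wise redundancy.

Result: Reaches 77.7% accuracy on ImageNet-1k (3.8M parameters, about 1.0 GFLOPs) and 50.0% AP on COCO (21.5M parameters), surpassing the previous benchmark by 1.2%.

Insight: With dynamic contextualization and multi-order feature extraction, a lightweight design can still deliver strong performance, which suggests a new direction for vision tasks.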

Abstract: The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at https://github.com/henry-pay/SpaRTAN.
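
How kernel size and dilation factor jointly set the spatial extent of a single convolution can be shown with the standard relation for dilated convolutions, k_eff = d * (k - 1) + 1. The sketch below evaluates it for a few configurations; the (kernel size, dilation) pairs are illustrative assumptions, not SpaRTAN's actual settings.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel along one axis."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    return dilation * (kernel_size - 1) + 1


# Hypothetical (kernel_size, dilation) pairs, chosen only for illustration.
configs = [(3, 1), (3, 2), (5, 2), (7, 3)]
extents = {cfg: effective_kernel_size(*cfg) for cfg in configs}

for (kernel_size, dilation), extent in extents.items():
    print(f"k={kernel_size}, d={dilation} -> covers {extent} pixels per axis")
```

A 3x3 kernel with dilation 2 covers the same 5-pixel span as a dense 5x5 kernel while keeping nine weights, which is why mixing kernel sizes and dilations is a parameter-efficient way to vary receptive fields.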

[45] Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection

Yuhu Bai,Jiangning Zhang,Yunkang Cao,Guangyuan Lu,Qingdong He,Xiangtai Li,Guanzhong Tian

Main category: cs.CV

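TL;DR: FiSeCLIP combines feature matching with cross-modal alignment on top of training-free CLIP for zero-shot anomaly detection. It uses batch-based testing with text information to filter noisy features, improving both anomaly classification and segmentation.

Details Motivation: Zero-shot anomaly detection (ZSAD) matters in applications where rare classes are expected, but existing methods lack an efficient batch-based testing scheme and a noise-filtering mechanism, which limits performance.

Contribution: 1. Proposes FiSeCLIP, which combines feature matching with cross-modal alignment; 2. Uses batch-based testing and text information to filter out noisy features; 3. Restores CLIP's local semantic correlation to adapt it to fine-grained anomaly detection.

Method: 1. Uses the other images in the same test batch as reference information for the current image; 2. Applies text information to filter out noisy features; 3. Restores local semantic correlation to make the filtering more accurate.

Result: On MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6% in AU-ROC and +5.7% in F1-max on segmentation.

Insight: Batch-based testing combined with text-based filtering effectively improves zero-shot anomaly detection, and restoring local semantic correlation helps on fine-grained tasks.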

Abstract: With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where the rare classes are essential and expected in many applications. This study introduces FiSeCLIP for ZSAD with training-free CLIP, combining the feature matching with the cross-modal alignment. Testing with the entire dataset is impractical, while batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, so we apply text information to filter out noisy features. In addition, we further explore CLIP’s inherent potential to restore its local semantic correlation, adapting it for fine-grained anomaly detection tasks to enable a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, building a stronger baseline for the direction. For example, on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%/+5.7% in the segmentation metrics AU-ROC/F1-max.
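
The idea of using other images in a batch as mutual references can be sketched with plain feature matching: a patch is scored by how poorly it matches the best-matching patch from any other image. This is a simplified illustration, not FiSeCLIP itself; the text-based filtering step is omitted, the score definition (one minus the maximum cosine similarity) is an assumption, and the toy feature vectors stand in for CLIP patch features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def batch_reference_scores(batch):
    """Score every patch against patches from the other images in the batch.

    `batch` is a list of images; each image is a list of patch feature vectors.
    A high score means no similar patch exists elsewhere in the batch.
    """
    if len(batch) < 2:
        raise ValueError("need at least two images to form references")
    scores = []
    for i, image in enumerate(batch):
        references = [p for j, other in enumerate(batch) if j != i for p in other]
        scores.append(
            [1.0 - max(cosine(patch, ref) for ref in references) for patch in image]
        )
    return scores


# Toy batch: three images with two patches each; one patch is an outlier.
normal = [1.0, 0.0]
odd = [0.0, 1.0]
batch = [[normal, odd], [normal, normal], [normal, normal]]
scores = batch_reference_scores(batch)
print(scores[0])  # [0.0, 1.0]
```

Because the references are unlabeled, an anomalous reference patch could mask a real defect in another image, which is the ambiguity the abstract says the text-based filter is meant to reduce.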

[46] Semantically Informed Salient Regions Guided Radiology Report Generation

Zeyi Hou,Zeqiang Wei,Ruixin Yan,Ning Lang,Xiuzhuang Zhou

Main category: cs.CV

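TL;DR: The paper proposes SISRNet, a radiology report generation method guided by semantically informed salient regions. It uses cross-modal semantics to identify medically critical regions and improves the clinical accuracy of generated reports.

Details Motivation: Because of data bias, existing deep learning methods for radiology report generation often produce reports that are fluent but medically inaccurate, which limits their use in clinical practice.

Contribution: Proposes SISRNet, which explicitly identifies medically critical regions using fine-grained cross-modal semantics and systematically focuses on them during both image modeling and report generation to produce more accurate reports.

Method: Identifies medically salient regions through cross-modal semantic analysis, then prioritizes these regions during image modeling and report generation.

Result: SISRNet outperforms existing methods on the IU-Xray and MIMIC-CXR datasets.

Insight: Explicitly focusing on medically critical regions mitigates the impact of data bias and makes generated radiology reports more clinically useful.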

Abstract: Recent advances in automated radiology report generation from chest X-rays using deep learning algorithms have the potential to significantly reduce the arduous workload of radiologists. However, due to the inherent massive data bias in radiology images, where abnormalities are typically subtle and sparsely distributed, existing methods often produce fluent yet medically inaccurate reports, limiting their applicability in clinical practice. To address this issue effectively, we propose a Semantically Informed Salient Regions-guided (SISRNet) report generation method. Specifically, our approach explicitly identifies salient regions with medically critical characteristics using fine-grained cross-modal semantics. Then, SISRNet systematically focuses on these high-information regions during both image modeling and report generation, effectively capturing subtle abnormal findings, mitigating the negative impact of data bias, and ultimately generating clinically accurate reports. Compared to its peers, SISRNet demonstrates superior performance on widely used IU-Xray and MIMIC-CXR datasets.

[47] A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion

Jie-Wen Li,Zi-Han Ye,Qingyuan Zhou,Jiayi Song,Ying He,Ben Fei,Wen-Ming Chen

Main category: cs.CV

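TL;DR: This paper introduces FootGait3D, a multi-view, high-resolution point cloud dataset of the foot-ankle complex captured during gait, intended for evaluating 3D point cloud completion methods.

Details Motivation: Accurate surface geometry of the foot and ankle is hard to capture during dynamic gait because of occlusions and viewing limitations, and existing datasets typically target whole-body or lower-limb motion without detailed modeling of the foot and ankle.

Contribution: Presents the FootGait3D dataset, with 8,403 multi-view point cloud frames from 46 subjects, supporting 3D completion tasks that recover the full foot-ankle shape from partial views.

Method: Captures data with a five-camera depth sensing system, providing a complete five-view reconstruction plus partial point clouds from four, three, or two views, so completion methods can be evaluated under different occlusion levels.

Result: FootGait3D provides a benchmark for foot-ankle shape completion and supports evaluating both single-modal and multi-modal completion networks.

Insight: The dataset is expected to advance foot-ankle 3D modeling in biomechanics, clinical gait analysis, and robotics, filling a gap in existing data.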

Abstract: Kinematic analysis of the foot-ankle complex during gait is essential for advancing biomechanical research and clinical assessment. Collecting accurate surface geometry data from the foot and ankle during dynamic gait conditions is inherently challenging due to swing foot occlusions and viewing limitations. Thus, this paper introduces FootGait3D, a novel multi-view dataset of high-resolution ankle-foot surface point clouds captured during natural gait. Different from existing gait datasets that typically target whole-body or lower-limb motion, FootGait3D focuses specifically on the detailed modeling of the ankle-foot region, offering a finer granularity of motion data. FootGait3D consists of 8,403 point cloud frames collected from 46 subjects using a custom five-camera depth sensing system. Each frame includes a complete 5-view reconstruction of the foot and ankle (serving as ground truth) along with partial point clouds obtained from only four, three, or two views. This structured variation enables rigorous evaluation of 3D point cloud completion methods under varying occlusion levels and viewpoints. Our dataset is designed for shape completion tasks, facilitating the benchmarking of state-of-the-art single-modal (e.g., PointTr, SnowflakeNet, Anchorformer) and multi-modal (e.g., SVDFormer, PointSea, CSDN) completion networks on the challenge of recovering the full foot geometry from occluded inputs. FootGait3D has significant potential to advance research in biomechanics and multi-segment foot modeling, offering a valuable testbed for clinical gait analysis, prosthetic design, and robotics applications requiring detailed 3D models of the foot during motion. The dataset is now available at https://huggingface.co/datasets/ljw285/FootGait3D.
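
The partial-view setup can be illustrated by enumerating which camera subsets are possible with five cameras. The sketch below lists every four-, three-, and two-view combination; the camera names are hypothetical, and whether the released dataset includes every subset or only selected ones is not stated in the abstract.

```python
from itertools import combinations

# Hypothetical camera identifiers for a five-camera rig.
cameras = ["cam0", "cam1", "cam2", "cam3", "cam4"]

# All possible partial-view subsets for each view count used as completion input.
partial_views = {k: list(combinations(cameras, k)) for k in (4, 3, 2)}
subset_counts = {k: len(subsets) for k, subsets in partial_views.items()}

for k, count in subset_counts.items():
    print(f"{k}-view subsets: {count}")
```

Fewer views mean more possible camera subsets and larger occluded regions, so a completion benchmark can vary both how much is missing and where it is missing.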

[48] A Survey on Interpretability in Visual Recognition

Qiyang Wan,Chengzhi Gao,Ruiping Wang,Xilin Chen

Main category: cs.CV

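TL;DR: This paper surveys interpretability in visual recognition models, proposes a human-centered taxonomy, and discusses evaluation requirements and opportunities opened by new technologies.

Details Motivation: As visual recognition models are increasingly deployed in critical areas such as autonomous driving and medical diagnostics, the need to understand model mechanisms and diagnose failures has driven interpretability research.

Contribution: Proposes a taxonomy based on Intent, Object, Presentation, and Methodology, giving a systematic set of grouping criteria for interpretation methods in visual recognition.

Method: Systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy from a human-centered perspective.

Result: Summarizes the requirements for evaluation metrics in existing work and explores new opportunities enabled by recent technologies such as large multimodal models.

Insight: The survey provides a systematic framework and new research directions for future work on the interpretability of visual recognition models.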

Abstract: In recent years, visual recognition methods have advanced significantly, finding applications across diverse fields. While researchers seek to understand the mechanisms behind the success of these models, there is also a growing impetus to deploy them in critical areas like autonomous driving and medical diagnostics to better diagnose failures, which promotes the development of interpretability research. This paper systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy of methods from a human-centered perspective. The proposed taxonomy categorizes interpretable recognition methods based on Intent, Object, Presentation, and Methodology, thereby establishing a systematic and coherent set of grouping criteria for these XAI methods. Additionally, we summarize the requirements for evaluation metrics and explore new opportunities enabled by recent technologies, such as large multimodal models. We aim to organize existing research in this domain and inspire future investigations into the interpretability of visual recognition models.

[49] KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model

Jie Yang,Wang Zeng,Sheng Jin,Lumin Xu,Wentao Liu,Chen Qian,Zhen Li,Ruimao Zhang

Main category: cs.CV

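TL;DR: KptLLM++ is a multimodal large language model for generic keypoint comprehension. It integrates diverse input modalities guided by user instructions and uses an identify-then-detect paradigm to localize keypoints precisely.

Details Motivation: Existing MLLMs struggle to capture fine-grained semantic information such as keypoints, which are crucial for applications like fine-grained image analysis and behavior recognition.

Contribution: 1) Proposes KptLLM++, an MLLM designed for generic keypoint comprehension; 2) Designs the identify-then-detect paradigm with a chain-of-thought reasoning mechanism; 3) Scales the training dataset to over 500K samples.

Method: Follows the identify-then-detect paradigm: the model first interprets keypoint semantics, then localizes precise positions through a structured chain-of-thought reasoning mechanism. The training set is scaled to over 500K samples covering diverse scenarios.

Result: Achieves SOTA performance on multiple keypoint detection benchmarks, with strong accuracy and generalization.

Insight: KptLLM++ suggests that large language models can solve fine-grained vision tasks through structured reasoning, with implications for human-AI interaction. Large-scale data is important for model performance.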

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has revolutionized image understanding by bridging textual and visual modalities. However, these models often struggle with capturing fine-grained semantic information, such as the precise identification and analysis of object keypoints. Keypoints, as structure-aware, pixel-level, and compact representations of objects, particularly articulated ones, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. In this paper, we propose KptLLM++, a novel multimodal large language model that is specifically designed for generic keypoint comprehension through the integration of diverse input modalities guided by user-defined instructions. By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration. The model is built upon a novel identify-then-detect paradigm, which first interprets keypoint semantics and subsequently localizes their precise positions through a structured chain-of-thought reasoning mechanism. To push the boundaries of performance, we have scaled up the training dataset to over 500K samples, encompassing diverse objects, keypoint categories, image styles, and scenarios with complex occlusions. This extensive scaling enables KptLLM++ to unlock its potential, achieving remarkable accuracy and generalization. Comprehensive experiments on multiple keypoint detection benchmarks demonstrate its state-of-the-art performance, underscoring its potential as a unified solution for fine-grained image understanding and its transformative implications for human-AI interaction.

[50] Jellyfish Species Identification: A CNN Based Artificial Neural Network Approach

Md. Sabbir Hossen,Md. Saiduzzaman,Pabon Shaha,Mostofa Kamal Nasir

Main category: cs.CV

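TL;DR: This study proposes a deep learning framework for detecting and classifying jellyfish species from an underwater image dataset. It combines several feature extractors and classifiers; the best-performing combination, MobileNetV3 with an Artificial Neural Network, reaches 98% accuracy.

Details Motivation: Jellyfish play an important role in marine ecosystems, but their rapid proliferation and ecological impact pose challenges for biodiversity and conservation. Accurate species identification is essential for ecological monitoring and management.

Contribution: 1. Proposes a deep learning framework that combines multiple feature extraction techniques and classifiers; 2. Shows that MobileNetV3 combined with an Artificial Neural Network performs best, at 98% accuracy.

Method: 1. Extracts features with MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16; 2. Pairs them with seven traditional machine learning classifiers and three Feedforward Neural Network classifiers; 3. Also classifies species directly with the CNN models using a softmax function.

Result: The combination of MobileNetV3 and an Artificial Neural Network performs best, reaching 98% accuracy and significantly outperforming the other combinations.

Insight: Deep learning and hybrid frameworks show clear potential for addressing biodiversity challenges and detecting marine species.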

Abstract: Jellyfish, a diverse group of gelatinous marine organisms, play a crucial role in maintaining marine ecosystems but pose significant challenges for biodiversity and conservation due to their rapid proliferation and ecological impact. Accurate identification of jellyfish species is essential for ecological monitoring and management. In this study, we propose a deep learning framework for jellyfish species detection and classification using an underwater image dataset. The framework integrates advanced feature extraction techniques, including MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16, combined with seven traditional machine learning classifiers and three Feedforward Neural Network classifiers for precise species identification. Additionally, we use the softmax function to classify jellyfish species directly with the convolutional neural network models. The combination of the Artificial Neural Network with MobileNetV3 is our best-performing model, achieving an exceptional accuracy of 98%, significantly outperforming other feature extractor-classifier combinations. This study demonstrates the efficacy of deep learning and hybrid frameworks in addressing biodiversity challenges and advancing species detection in marine environments.

[51] Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID

Hankun Liu,Yujian Zhao,Guanglin Niu

Main category: cs.CV

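TL;DR: The paper proposes a multimodal-guided Hard Sample Generation and Learning (HSGL) framework that combines textual and visual modalities and is the first effort to unify the definition, generation, and optimization of hard samples, improving clothes-changing person re-identification (CC-ReID).

Details Motivation: In CC-ReID, hard samples are a bottleneck because of their ambiguity and similarity. Existing methods lack an explicit definition of hard samples and targeted learning strategies, which limits model robustness.

Contribution: Proposes the HSGL framework, the first to use multiple modalities (text and vision) to define, generate, and optimize hard samples within one paradigm. Its core components are Dual-Granularity Hard Sample Generation (DGHSG) and Hard Sample Adaptive Learning (HSAL).

Method: 1. DGHSG: leverages multimodal cues to synthesize semantically consistent coarse- and fine-grained hard samples (positives and negatives).
2. HSAL: introduces a hardness-aware optimization strategy that adjusts distances in feature space based on textual semantic labels.

Result: Achieves state-of-the-art performance on the PRCC and LTCC datasets, and significantly accelerates convergence of the learning procedure.

Insight: Multimodal information, textual semantics in particular, shows promise for defining and optimizing hard samples and can yield more robust solutions for CC-ReID.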

Abstract: Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model’s robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model’s discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.

[52] MMOne: Representing Multiple Modalities in One Scene

Zhifeng Gu,Bing Wang

Main category: cs.CV

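TL;DR: The paper proposes MMOne, a general framework for representing multiple modalities in one scene. It disentangles multimodal information into shared and modality-specific components to resolve property and granularity disparities between modalities.

Details Motivation: Humans perceive the world through multiple modalities, but inherent differences between modalities (such as properties and granularity) cause conflicts that limit the learning of multimodal scene representations.

Contribution: The MMOne framework, comprising a modality modeling module and a multimodal decomposition mechanism, which separates multimodal information effectively and improves the representation of each individual modality.

Method: 1) A modality modeling module with a modality indicator; 2) a multimodal decomposition mechanism that separates multi-modal Gaussians into single-modal Gaussians based on modality differences; 3) disentanglement of shared and modality-specific components.

Result: Experiments show that the method consistently improves the representation capability of each modality and scales to additional modalities.

Insight: By handling modality differences through disentanglement, MMOne yields a compact and efficient multimodal scene representation.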

Abstract: Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.

[53] Assessing Color Vision Test in Large Vision-language Models

Hongfei Ye,Bin Chen,Wenxi Liu,Yu Zhang,Zhao Li,Dandan Ni,Hongyang Chen

Main category: cs.CV

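TL;DR: This paper studies the color vision ability of large vision-language models. It defines a color vision testing task, builds a diverse dataset, analyzes the types of errors the models make, and proposes fine-tuning strategies.

Details Motivation: Large vision-language models are now widely used, yet their color vision ability has not been thoroughly studied. The authors aim to fill this gap by defining a testing task and constructing a dataset.

Contribution: A definition of the color vision testing task, a dataset covering multiple question categories and difficulty levels, and fine-tuning strategies that improve model performance.

Method: The authors design a color vision testing task and construct a diverse dataset. They then analyze the models' error types and propose targeted fine-tuning strategies.

Result: The paper reports how the models perform on color vision tests and evaluates the proposed fine-tuning strategies.

Insight: Large vision-language models have notable weaknesses on color vision tasks, and targeted training can improve their performance.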

Abstract: With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (some of the data is shown in an anonymous repository: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.

[54] Clustering-Guided Multi-Layer Contrastive Representation Learning for Citrus Disease Classification

Jun Chen,Yonghua Yu,Weifu Li,Yaohui Chen,Hong Chen

Main category: cs.CV

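TL;DR: This paper proposes clustering-guided multi-layer contrastive representation learning (CMCRL) for citrus disease classification. It exploits unannotated data and multi-layer contrastive training to improve classification performance substantially.

Details Motivation: Citrus is an economically important crop worldwide, but its yield is often reduced by disease. Existing deep learning methods depend on large amounts of annotated data and struggle with symptom similarity and class imbalance, so a self-supervised approach that can learn from unannotated samples is needed.

Contribution: 1. A new clustering-guided self-supervised multi-layer contrastive representation learning algorithm (CMCRL). 2. A mechanism for contrasting with cluster centroids and a multi-layer contrastive training paradigm, which address the use of unannotated data and symptom similarity. 3. State-of-the-art performance on the public citrus dataset CDD, narrowing the gap with fully supervised methods.

Method: 1. Contrasting with cluster centroids to improve representation learning from unannotated data. 2. A multi-layer contrastive training (MCT) paradigm for learning hierarchical feature representations. 3. A combination of clustering and contrastive learning that adapts to symptom similarity and class imbalance.

Result: On the CDD dataset, CMCRL improves accuracy over existing methods by 4.5%-30.1% and also performs well on F1 score, precision, and recall, indicating robust classification.

Insight: 1. Clustering-guided contrastive learning is an effective way to learn representations without labels. 2. Multi-layer training helps learn richer feature representations. 3. The method remains effective under class imbalance and symptom similarity.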

Abstract: Citrus, as one of the most economically important fruit crops globally, suffers severe yield depressions due to various diseases. Accurate disease detection and classification serve as critical prerequisites for implementing targeted control measures. Recent advancements in artificial intelligence, particularly deep learning-based computer vision algorithms, have substantially decreased time and labor requirements while maintaining the accuracy of detection and classification. Nevertheless, these methods predominantly rely on massive, high-quality annotated training examples to attain promising performance. By introducing two key designs: contrasting with cluster centroids and a multi-layer contrastive training (MCT) paradigm, this paper proposes a novel clustering-guided self-supervised multi-layer contrastive representation learning (CMCRL) algorithm. The proposed method demonstrates several advantages over existing counterparts: (1) optimizing with massive unannotated samples; (2) effective adaptation to the symptom similarity across distinct citrus diseases; (3) hierarchical feature representation learning. The proposed method achieves state-of-the-art performance on the public citrus image set CDD, outperforming existing methods by 4.5%-30.1% accuracy. Remarkably, our method narrows the performance gap with fully supervised counterparts (all samples are labeled). Beyond classification accuracy, our method shows great performance on other evaluation metrics (F1 score, precision, and recall), highlighting the robustness against the class imbalance challenge.

[55] How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

Che Liu,Jiazhen Pan,Weixiang Shen,Wenjia Bai,Daniel Rueckert,Rossella Arcucci

Main category: cs.CV

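TL;DR: This paper comprehensively evaluates general-purpose and medically specialised vision-language models (VLMs) on multiple medical benchmarks. General-purpose models already surpass specialised ones on some tasks, but reasoning remains a bottleneck and reliability falls short of clinical requirements.

Details Motivation: Demand for VLMs in healthcare is growing, but their real performance on medical tasks is unclear. The paper aims to fill this gap with a systematic evaluation that exposes model strengths and weaknesses.

Contribution: A comprehensive evaluation of general-purpose and medically specialised VLMs that reveals how they differ in understanding and reasoning, and points to directions for future model development.

Method: The paper evaluates VLMs with 3B to 72B parameters on eight medical benchmarks (such as MedXpert and PathVQA) and separates performance into understanding and reasoning components.

Result: General-purpose models match or beat specialised models on several benchmarks; reasoning performance is consistently lower than understanding; performance varies widely across benchmarks; no model yet meets the reliability threshold for clinical deployment.

Insight: Future work needs stronger multimodal alignment and more fine-grained evaluation protocols. The complexity of medical tasks makes improved reasoning especially important.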

Abstract: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

[56] A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

Xinkui Zhao,Jinsong Shu,Yangyang Wu,Guanjie Cheng,Zihe Liu,Naibo Wang,Shuiguang Deng,Zhongle Xie,Jianwei Yin

Main category: cs.CV

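TL;DR: The paper proposes MCULoRA, a new method for handling missing modalities in multimodal emotion recognition. It decouples the shared information of modality combinations from their distinct characteristics and dynamically adjusts training ratios, which substantially improves task accuracy.

Details Motivation: In practice, multimodal emotion recognition often faces missing modalities because of sensor failures or privacy requirements. Existing methods suffer performance degradation from conflicting gradients across modality combinations, so a new approach is needed.

Contribution: The MCULoRA framework, which contains a modality combination aware low-rank adaptation (MCLA) module and a dynamic parameter fine-tuning (DPFT) module that decouple modality-combination information and improve learning efficiency.

Method: MCLA decouples the shared and distinct information of modality combinations. DPFT adjusts the training ratio of modality combinations based on the separability of each modality's representation space, enabling parameter-efficient fine-tuning.

Result: On multiple benchmark datasets, MCULoRA substantially outperforms previous incomplete multimodal learning methods.

Insight: Decoupling modality-combination information and dynamically adjusting training ratios are key to improving incomplete multimodal learning.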

Abstract: Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality’s representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
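For readers unfamiliar with low-rank adaptation, the sketch below shows only the generic LoRA update that the method's name refers to: a frozen weight matrix W is augmented by a scaled low-rank product, W + (alpha / r) · B · A. How MCULoRA assigns adapters to modality combinations is not specified in the abstract, so nothing here reproduces MCLA or DPFT; the function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A (generic LoRA, not the paper's variant).

    w : frozen weight, d_out x d_in
    a : trainable down-projection, r x d_in
    b : trainable up-projection, d_out x r
    """
    r = len(a)                      # rank of the adapter
    delta = matmul(b, a)            # d_out x d_in low-rank update
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2x2 weight
b = [[1.0], [0.0]]                  # 2x1
a = [[0.0, 2.0]]                    # 1x2, so r = 1
print(lora_weight(w, a, b, alpha=1.0))  # [[1.0, 2.0], [0.0, 1.0]]
```

Only `a` and `b` would be trained, which is what makes the adaptation parameter-efficient: the adapter has r · (d_in + d_out) parameters instead of d_in · d_out.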

[57] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

X. Feng,H. Yu,M. Wu,S. Hu,J. Chen,C. Zhu,J. Wu,X. Chu,K. Huang

Main category: cs.CV

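TL;DR: The paper proposes NarrLV, the first benchmark for evaluating the narrative ability of long video generation models. It defines the Temporal Narrative Atom (TNA) as the basic narrative unit and designs an automatic evaluation metric based on multimodal large language models (MLLMs); experiments show the metric aligns closely with human judgments.

Details Motivation: Current evaluation of long video generation models relies mainly on benchmarks with simple narrative prompts (e.g., VBench), and no benchmark targets narrative richness. NarrLV is proposed to evaluate narrative expression more comprehensively and to reveal the capability boundaries of these models.

Contribution: 1. NarrLV, the first narrative-centric benchmark for long video generation models; 2. The Temporal Narrative Atom (TNA) as the basic narrative unit, used to measure narrative richness quantitatively; 3. An automatic prompt generation pipeline with a flexibly expandable number of TNAs; 4. An MLLM-based evaluation metric that measures narrative content expression at three progressive levels.

Method: 1. Following film narrative theory, TNA is defined as the basic narrative unit that maintains continuous visual presentation; 2. An automatic prompt generation pipeline produces evaluation prompts with multiple TNAs; 3. A metric over three levels of narrative content expression is built on MLLM-based question generation and answering.

Result: The NarrLV metric aligns closely with human judgments, and the evaluation reveals the capability boundaries of current video generation models in narrative content expression.

Insight: Narrative richness is a key goal of long video generation, and TNA offers a way to quantify it. The MLLM-based automatic metric tracks human judgments closely, which makes large-scale evaluation practical and gives direction for model improvement.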

Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

[58] MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection

Guanghao Wu,Chen Xu,Hai Song,Chong Wang,Qixing Zhang

Main category: cs.CV

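TL;DR: This paper proposes MFGDiffusion, a mask-guided smoke synthesis framework for generating high-quality forest fire smoke images. It addresses the low quality of smoke images produced by existing inpainting models, and experiments show that the synthetic images improve smoke detection performance.

Details Motivation: Forest fire smoke detection is held back by the scarcity of real smoke images. Existing image generation models struggle to keep synthesized smoke consistent with the background. The authors therefore propose a mask-guided synthesis method that produces more realistic and diverse smoke images.

Contribution: 1. A smoke generation framework guided by mask and masked-image features.
2. A new loss function, the mask random difference loss, which improves the continuity of generated smoke around mask edges.
3. Filtering of generated smoke images with a multimodal large language model to improve the diversity and plausibility of the synthetic dataset.

Method: 1. A pre-trained segmentation model and a multimodal model produce smoke masks and image captions.
2. A mask-guided network architecture uses mask and masked-image features to steer smoke generation.
3. The mask random difference loss improves consistency around the mask edges.
4. Generated images are filtered using smoke characteristics and a multimodal large language model.

Result: Experiments show that the generated smoke images are realistic and diverse and effectively improve the performance of forest fire smoke detection models.

Insight: The method combines mask guidance with multimodal LLM filtering to generate high-quality smoke images, offering a way around data scarcity and a possible reference for data augmentation in similar tasks such as defect detection.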

Abstract: Smoke is the first visible indicator of a wildfire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed a pre-trained segmentation model and a multimodal model to obtain smoke masks and image captions. Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and used a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.
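The abstract describes "randomly expanding and eroding the mask edges" but does not give the loss itself. The sketch below illustrates only that perturbation step on a binary mask, assuming a 3x3 neighbourhood and one to two morphological steps; these choices and all function names are hypothetical, and the actual mask random difference loss is not reproduced.

```python
import random

def _window(mask, r, c):
    """Yield values in the 3x3 window around (r, c), skipping out-of-bounds cells."""
    h, w = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                yield mask[rr][cc]

def dilate(mask):
    """Expand the mask: a pixel becomes 1 if any neighbour in its window is 1."""
    return [[1 if any(_window(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]

def erode(mask):
    """Shrink the mask: a pixel stays 1 only if every in-bounds neighbour is 1."""
    return [[1 if all(_window(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]

def random_edge_perturbation(mask, rng, max_steps=2):
    """Randomly pick expansion or erosion and apply it 1..max_steps times."""
    op = rng.choice([dilate, erode])
    for _ in range(rng.randint(1, max_steps)):
        mask = op(mask)
    return mask

def area(mask):
    """Number of foreground pixels."""
    return sum(map(sum, mask))

# 5x5 mask with a 3x3 block of ones in the centre.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
perturbed = random_edge_perturbation(mask, random.Random(0))
print(area(mask), area(perturbed))
```

A loss built on such perturbations could compare generations produced with the original and the perturbed mask near the boundary, but that pairing is a guess rather than something the abstract states.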

[59] ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

Ronggang Huang,Haoxin Yang,Yan Cai,Xuemiao Xu,Huaidong Zhang,Shengfeng He

Main category: cs.CV

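TL;DR: ViewSRD is a framework that tackles 3D visual grounding through structured multi-view decomposition, significantly improving spatial differentiation in complex queries.

Details Motivation: Existing methods struggle to separate targets from anchors in complex multi-anchor queries, and perspective changes make spatial descriptions inconsistent. ViewSRD aims to solve both problems.

Contribution: A structured multi-view decomposition framework, comprising the SRD module, the Multi-TSI module, and a Textual-Scene Reasoning module, that improves 3D visual grounding accuracy.

Method: 1. The SRD module decouples multi-anchor queries into targeted single-anchor statements; 2. The Multi-TSI module fuses textual and scene features across views using shared Cross-modal Consistent View Tokens (CCVTs); 3. Multi-view predictions are combined into a unified 3D grounding result.

Result: ViewSRD outperforms existing methods on 3D visual grounding datasets, especially on complex queries that require precise spatial differentiation.

Insight: Structured multi-view decomposition effectively disentangles spatial relations in complex queries, and cross-modal consistent view tokens help preserve spatial correlations.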

Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.

[60] YOLOatr: Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery

Aon Safdar,Usman Akram,Waseem Anwar,Basit Malik,Mian Ibad Ali

Main category: cs.CV

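TL;DR: The paper proposes YOLOatr, a modified single-stage object detector for automatic target detection and localization in thermal infrared imagery. It addresses the challenges of complex scenes and reports state-of-the-art performance of up to 99.6%.

Details Motivation: Automatic target detection and recognition in thermal infrared imagery for defense and surveillance faces many challenges, including limited datasets, large scale variation, and frequent occlusion, which cause existing deep learning models to underperform.

Contribution: YOLOatr, a single-stage detector based on a modified YOLOv5s, which improves performance through changes to the detection heads, feature fusion in the neck, and a custom augmentation profile.

Method: The detection heads of YOLOv5s are modified, feature fusion in the neck is adjusted, and a custom data augmentation profile is designed for the particularities of thermal infrared imagery.

Result: On the DSIAC MWIR dataset, YOLOatr achieves ATR performance of up to 99.6%, surpassing existing methods.

Insight: Design choices tailored to the characteristics of thermal infrared imagery can markedly improve detection performance, especially in complex military and surveillance scenes.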

Abstract: Automatic Target Detection (ATD) and Recognition (ATR) from Thermal Infrared (TI) imagery in the defense and surveillance domain is a challenging computer vision (CV) task in comparison to the commercial autonomous vehicle perception domain. Limited datasets, peculiar domain-specific and TI modality-specific challenges, i.e., limited hardware, scale invariance issues due to greater distances, deliberate occlusion by tactical vehicles, lower sensor resolution and resultant lack of structural information in targets, effects of weather, temperature, and time of day variations, and varying target to clutter ratios all result in increased intra-class variability and higher inter-class similarity, making accurate real-time ATR a challenging CV task. Resultantly, contemporary state-of-the-art (SOTA) deep learning architectures underperform in the ATR domain. We propose a modified anchor-based single-stage detector, called YOLOatr, based on a modified YOLOv5s, with optimal modifications to the detection heads, feature fusion in the neck, and a custom augmentation profile. We evaluate the performance of our proposed model on a comprehensive DSIAC MWIR dataset for real-time ATR over both correlated and decorrelated testing protocols. The results demonstrate that our proposed model achieves state-of-the-art ATR performance of up to 99.6%.

[61] Detección y Cuantificación de Erosión Fluvial con Visión Artificial

Paúl Maji,Marlon Túquerres,Stalin Valencia,Marcela Valenzuela,Christian Mejia-Escobar

Main category: cs.CV

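TL;DR: This paper proposes an AI-based method that automatically identifies fluvial erosion zones and estimates their area. It uses a YOLOv11 model trained on LiDAR images and photographs, and delivers an interactive web application, EROSCAN.

Details Motivation: Fluvial erosion has a major impact on soil stability and infrastructure. Traditional methods depend on specialist knowledge and manual processing, which is slow and inefficient.

Contribution: An automatic detection method that combines YOLOv11 with LiDAR data, and the interactive tool EROSCAN.

Method: A YOLOv11 model is fine-tuned on LiDAR images and photographs; the data is labeled and segmented on the Roboflow platform.

Result: Erosion patterns are detected with 70% accuracy, and the extent of eroded areas is computed reliably in pixels and square meters.

Insight: Combining computer vision with traditional geographic data can make erosion detection considerably more efficient and supports risk management.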

Abstract: Fluvial erosion is a natural process that can generate significant impacts on soil stability and strategic infrastructures. The detection and monitoring of this phenomenon is traditionally addressed by photogrammetric methods and analysis in geographic information systems. These tasks require specific knowledge and intensive manual processing. This study proposes an artificial intelligence-based approach for automatic identification of eroded zones and estimation of their area. The state-of-the-art computer vision model YOLOv11, adjusted by fine-tuning and trained with photographs and LiDAR images, is used. This combined dataset was segmented and labeled using the Roboflow platform. Experimental results indicate efficient detection of erosion patterns with an accuracy of 70%, precise identification of eroded areas and reliable calculation of their extent in pixels and square meters. As a final product, the EROSCAN system has been developed, an interactive web application that allows users to upload images and obtain automatic segmentations of fluvial erosion, together with the estimated area. This tool optimizes the detection and quantification of the phenomenon, facilitating decision making in risk management and territorial planning.
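The step from a segmentation to an area "in pixels and square meters" can be illustrated with a minimal sketch. It assumes a binary mask and a known, uniform ground sampling distance (meters per pixel side); the abstract does not say how EROSCAN obtains the scale, and `mask_area` and `gsd_m` are hypothetical names.

```python
def mask_area(mask, gsd_m):
    """Return (pixel_count, area_m2) for a binary segmentation mask.

    mask  : list of rows, each a list of 0/1 values (1 = eroded pixel)
    gsd_m : assumed ground sampling distance, meters per pixel side
    """
    pixels = sum(sum(1 for v in row if v) for row in mask)
    # Each pixel covers gsd_m x gsd_m square meters on the ground.
    return pixels, pixels * gsd_m * gsd_m

mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
pixels, area_m2 = mask_area(mask, gsd_m=0.5)
print(pixels, area_m2)  # 6 1.5
```

The conversion is only as good as the scale: oblique photographs have a ground sampling distance that varies across the frame, so a single `gsd_m` is a simplification.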

[62] A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction

Haoxuan Qu,Yujun Cai,Hossein Rahmani,Ajay Kumar,Junsong Yuan,Jun Liu

Main category: cs.CV

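TL;DR: This paper proposes a mixed-primitive Gaussian Splatting (GS) method that improves surface reconstruction quality by combining different types of primitives, with an initialization strategy and a vertex pruning mechanism to support the learning process.

Details Motivation: Existing GS-based methods use a single type of primitive (Gaussian ellipse or Gaussian ellipsoid) for surface reconstruction, which is insufficient for complex and diverse object shapes. A more flexible representation is needed.

Contribution: 1. The first GS framework that supports multiple types of primitives; 2. A compositional splatting strategy, a mixed-primitive-based initialization strategy, and a vertex pruning mechanism; 3. Experiments that validate the method.

Method: 1. A compositional splatting strategy enables splatting and rendering of different primitive types; 2. A mixed-primitive-based initialization strategy sets up the primitives; 3. A vertex pruning mechanism further supports surface representation learning.

Result: Experiments show that the method is effective and reconstructs surfaces accurately.

Insight: Combining several geometric primitives allows complex object surfaces to be represented more flexibly and improves reconstruction quality.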

Abstract: Recently, Gaussian Splatting (GS) has received a lot of attention in surface reconstruction. However, while 3D objects can be of complex and diverse shapes in the real world, existing GS-based methods only limitedly use a single type of splatting primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces during their reconstruction. In this paper, we highlight that this can be insufficient for object surfaces to be represented in high quality. Thus, we propose a novel framework that, for the first time, enables Gaussian Splatting to incorporate multiple types of (geometrical) primitives during its surface reconstruction process. Specifically, in our framework, we first propose a compositional splatting strategy, enabling the splatting and rendering of different types of primitives in the Gaussian Splatting pipeline. In addition, we also design our framework with a mixed-primitive-based initialization strategy and a vertex pruning mechanism to further promote its surface representation learning process to be well executed leveraging different types of primitives. Extensive experiments show the efficacy of our framework and its accurate surface reconstruction performance.

[63] UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Peiran Wu,Yunze Liu,Zhengdong Zhu,Enmin Zhou,Shawn Shen

Main category: cs.CV

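TL;DR: The paper introduces UGC-VideoCap, a new benchmark and model framework for detailed omnimodal captioning of short user-generated videos (UGC). It emphasizes balanced integration of audio and visual content and provides a lightweight 3B-parameter model.

Details Motivation: Existing video captioning benchmarks and models rely too heavily on visual information and overlook the role of audio in conveying scene dynamics and narrative context. The lack of high-quality omnimodal datasets and lightweight models hampers progress.

Contribution: 1. The UGC-VideoCap benchmark, with 1000 TikTok videos, multimodal annotations, and 4000 QA pairs; 2. UGC-VideoCaptioner, a 3B-parameter model trained with a two-stage strategy (supervised fine-tuning followed by GRPO) that adapts efficiently from limited data.

Method: 1. Dataset construction: a three-stage human-in-the-loop annotation pipeline covering audio-only, visual-only, and joint audio-visual semantics; 2. Model training: supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), with the model distilled from Gemini 2.5 Flash.

Result: The benchmark and model provide a high-quality foundation for omnimodal video captioning in unconstrained UGC settings, and the model maintains competitive performance with limited data.

Insight: Balanced integration of audio and visual content is essential for understanding real-world UGC videos, and the two-stage training strategy is data-efficient.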

Abstract: Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omni datasets and lightweight, capable models hampers progress in fine-grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.

[64] Attributes Shape the Embedding Space of Face Recognition Models

Pierrick Leroy,Antonio Mastropietro,Marco Nurisso,Francesco Vaccarino

Main category: cs.CV

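TL;DR: Face recognition (FR) models learn identity information through contrastive losses, yet their embedding space shows a multiscale geometric structure shaped by interpretable facial and image attributes. The authors propose a geometric approach and a physics-inspired alignment metric, and find that models are invariant to different attributes to different degrees.

Details Motivation: Although FR models are trained with contrastive losses that use only identity labels, the geometry of the embedding space correlates with facial attributes (e.g., hair color) and image attributes (e.g., contrast). The authors want to understand how these attributes shape the embedding space and to improve model interpretability.

Contribution: 1. A geometric approach for describing the dependence or invariance of FR models with respect to attributes; 2. A physics-inspired alignment metric; 3. Evidence on how invariant the models are to different attributes.

Method: 1. Analysis of the multiscale geometric structure of the embedding space; 2. A geometric alignment metric; 3. Evaluation of the metric on controlled, simplified models and on widely used FR models fine-tuned with synthetic data for targeted attribute augmentation.

Result: The models show varying degrees of invariance across attributes, which exposes their strengths and weaknesses and enables deeper interpretability.

Insight: Facial and image attributes strongly influence the geometry of FR embedding spaces; understanding this influence can inform model design and robustness.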

Abstract: Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available here: https://github.com/mantonios107/attrs-fr-embs
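As background for the "margin-based triplet losses" mentioned in the abstract, here is a minimal sketch of the standard triplet margin loss on plain Python vectors. It assumes Euclidean distance and a margin of 0.5; both are illustrative choices, not values taken from the paper, and the paper's alignment metric is not reproduced.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

# Same identity close, other identity far: the constraint is satisfied.
print(triplet_margin_loss([0, 0], [0, 1], [3, 4]))  # 0.0
# Same identity far, other identity close: the constraint is violated.
print(triplet_margin_loss([0, 0], [3, 4], [0, 1]))  # 4.5
```

Only identity decides which sample is the positive or the negative, which is why any attribute structure in the embedding space is an emergent property rather than a training target.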

[65] Implementing Adaptations for Vision AutoRegressive Model

Kaif Shaikh,Antoni Kowalczuk,Franziska Boenisch,Adam Dziedzic

Main category: cs.CV

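TL;DR: This paper studies how the Vision AutoRegressive model (VAR) adapts to specific downstream tasks such as medical data generation, and explores differentially private (DP) adaptation strategies. VAR outperforms diffusion models (DMs) in non-DP adaptation but performs poorly under DP adaptation.

Details Motivation: VAR is an alternative to DMs for image generation, but its adaptation strategies, especially privacy-preserving (DP) ones, are underexplored. The paper aims to fill this gap.

Contribution: The paper implements and benchmarks many VAR adaptation strategies and compares VAR with DMs under both DP and non-DP adaptation.

Method: Several adaptation strategies, both non-DP and DP, are applied to pre-trained VAR models and compared with state-of-the-art DM adaptation strategies.

Result: VAR outperforms DMs in non-DP adaptation, but its performance suffers under DP adaptation, which calls for further research.

Insight: Although VAR performs well on image generation, its privacy-preserving adaptation still needs improvement, which is a direction for future work.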

Abstract: Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While for DMs there exist many techniques, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations-ones that aim to preserve privacy of the adaptation data-have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR, and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations, however, the performance of DP suffers, which necessitates further research in private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.

[66] COLI: A Hierarchical Efficient Compressor for Large Images

Haoran Wang,Hanyu Pei,Yang Lyu,Kai Zhang,Li Li,Feng-Lei Fan

Main category: cs.CV

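TL;DR: The paper proposes COLI, a hierarchical and efficient compression framework designed for high-resolution, large-field-of-view images. It improves INR-based compression to address slow compression speed and suboptimal compression ratios.

Details Motivation: The growing use of high-resolution, large-field-of-view imagery calls for efficient compression. Conventional methods often fail to preserve critical details, and data-driven methods generalize poorly. INRs are promising, but on large images they compress slowly and achieve suboptimal compression ratios.

Contribution: The COLI framework, which accelerates INR convergence through a pretraining-finetuning paradigm, mixed-precision training, and a parallelizable objective, and introduces Hyper-Compression to raise compression ratios substantially.

Method: INR-based compression built on NeRV, combined with pretraining-finetuning, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Hyper-Compression is applied as a post-training step.

Result: On two medical imaging datasets, COLI achieves competitive or superior PSNR and SSIM at significantly reduced bits per pixel, while accelerating NeRV training by up to 4 times.

Insight: The potential of INRs for image compression can be further unlocked through better training strategies and post-training techniques.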

Abstract: The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs’ transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
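The two quantities the abstract trades off, PSNR and bits per pixel (bpp), can be made concrete with a short sketch. Because an INR stores network weights instead of pixels, its bpp is taken here as the total weight storage divided by the pixel count; this ignores any container or entropy-coding overhead, and the function names and example figures are hypothetical rather than taken from COLI.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

def inr_bits_per_pixel(num_params, bits_per_param, width, height):
    """Bits per pixel when an image is stored as network weights."""
    return num_params * bits_per_param / (width * height)

# A hypothetical network with 65,536 weights at 16 bits each,
# representing a 1024 x 1024 image.
print(inr_bits_per_pixel(65_536, 16, 1024, 1024))  # 1.0

# Worst-case 8-bit reconstruction error gives 0 dB.
print(psnr([0, 255], [255, 0]))  # 0.0
```

Lowering `bits_per_param` or `num_params` after training reduces bpp directly, which is the lever a post-training compression step operates on; the distortion it introduces shows up as a drop in PSNR.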

[67] 3C-FBI: A Combinatorial method using Convolutions for Circle Fitting in Blurry Images

Esteban Román Catafau,Torbjörn E. M. Nordling

Main category: cs.CV

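TL;DR: This paper proposes 3C-FBI, a combinatorial, convolution-based method for robust circle detection and fitting in blurry images that combines high accuracy with real-time performance.

Details Motivation: Under degraded image quality, conventional circle detection and fitting methods are either insufficiently accurate or computationally inefficient.

Contribution: Proposes an efficient algorithm that combines combinatorial edge-pixel sampling with convolution-based density estimation in parameter space, improving both the accuracy and the speed of circle fitting.

Method: Samples edge pixels (edgels) combinatorially and estimates density in parameter space by convolution to fit circles efficiently.

Result: On real-world medical data and synthetic data, 3C-FBI reaches state-of-the-art accuracy (Jaccard index 0.896) at real-time speed (40.3 fps), and keeps the Jaccard index above 0.95 down to 160x160 resolution.

Insight: Its combination of real-time speed and accuracy gives 3C-FBI broad potential in medical imaging, robotics, and industrial inspection.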

Abstract: This paper addresses the fundamental computer vision challenge of robust circle detection and fitting in degraded imaging conditions. We present Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI), an algorithm that bridges the gap between circle detection and precise parametric fitting by combining (1) efficient combinatorial edge pixel (edgel) sampling and (2) convolution-based density estimation in parameter space. We evaluate 3C-FBI across three experimental frameworks: (1) real-world medical data from Parkinson’s disease assessments (144 frames from 36 videos), (2) controlled synthetic data following established circle-fitting benchmarks, and (3) systematic analysis across varying spatial resolutions and outlier contamination levels. Results show that 3C-FBI achieves state-of-the-art accuracy (Jaccard index 0.896) while maintaining real-time performance (40.3 fps), significantly outperforming classical methods like RCD (6.8 fps) on a standard CPU (i7-10875H). It maintains near-perfect accuracy (Jaccard almost 1.0) at high resolutions (480x480) and reliable performance (Jaccard higher than 0.95) down to 160x160 with up to 20% outliers. In extensive synthetic testing, 3C-FBI achieves a mean Jaccard Index of 0.989 across contamination levels, comparable to modern methods like Qi et al. (2024, 0.991), and surpassing RHT (0.964). This combination of accuracy, speed, and robustness makes 3C-FBI ideal for medical imaging, robotics, and industrial inspection under challenging conditions.
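The two ingredients named in the abstract, fitting circles to combinations of edgels and scoring a fit with the Jaccard index, can be sketched as follows. This is a simplified illustration, not the 3C-FBI algorithm: the convolution-based density estimation is replaced here by a per-parameter median, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def circle_from_three_points(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def fit_circle_by_triplets(edgels):
    """Fit a circle to every edgel triplet and take per-parameter medians.

    The median is a stand-in (an assumption) for density estimation in parameter space.
    """
    fits = []
    for triplet in combinations(edgels, 3):
        circle = circle_from_three_points(*triplet)
        if circle is not None:
            fits.append(circle)
    if not fits:
        raise ValueError("need at least three non-collinear edgels")
    return tuple(median(values) for values in zip(*fits))


def circle_jaccard(c1, c2):
    """Jaccard index (intersection over union) of two discs with positive radii."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:  # disjoint discs
        inter = 0.0
    elif d <= abs(r1 - r2):  # one disc inside the other
        inter = math.pi * min(r1, r2) ** 2
    else:  # lens-shaped overlap; clamp to guard against rounding error
        cos1 = max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
        cos2 = max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
        kite = max(0.0, (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
        inter = r1 * r1 * math.acos(cos1) + r2 * r2 * math.acos(cos2) - 0.5 * math.sqrt(kite)
    union = math.pi * (r1 * r1 + r2 * r2) - inter
    return inter / union


# Noise-free edgels on a circle with centre (3, 4) and radius 5.
truth = (3.0, 4.0, 5.0)
edgels = [(3 + 5 * math.cos(a), 4 + 5 * math.sin(a)) for a in (0.1, 1.0, 2.0, 3.5, 5.0)]
estimate = fit_circle_by_triplets(edgels)
```

Enumerating all triplets grows cubically with the number of edgels, which is why a practical method would sample combinations rather than exhaust them.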

[68] CATVis: Context-Aware Thought Visualization

Tariq Mehmood,Hamza Ahmad,Muhammad Haroon Shakeel,Murtaza Taj

Main category: cs.CV

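TL;DR: The paper proposes CATVis, a novel 5-stage framework for decoding visual representations from EEG signals. Cross-modal alignment and re-ranking enable context-aware EEG-to-image generation that outperforms existing methods.

Details Motivation: EEG signals are complex and noisy, which makes decoding visual representations from them difficult. The authors address this with a framework that combines cross-modal alignment with a generative model to produce high-quality EEG-to-image results.

Contribution: 1) a novel 5-stage framework, CATVis; 2) context-aware image generation through cross-modal alignment and re-ranking; 3) experiments showing gains over existing methods in classification accuracy, generation accuracy, and semantic alignment.

Method: 1) an EEG encoder for concept classification; 2) alignment of EEG and text embeddings in CLIP feature space; 3) caption refinement via re-ranking; 4) weighted interpolation of concept and caption embeddings for richer semantics; 5) image generation with a pre-trained Stable Diffusion model.

Result: The method improves classification accuracy by 13.43% and generation accuracy by 15.21% over SOTA approaches and reduces Fréchet Inception Distance (FID) by 36.61%, indicating better semantic alignment and image quality.

Insight: Combining cross-modal alignment with a generative model makes it possible to decode high-quality visual representations from EEG signals, offering a new approach for vision applications of brain-computer interfaces.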

Abstract: EEG-based brain-computer interfaces (BCIs) have shown promise in various applications, such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We thus propose a novel 5-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. We enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking. Experimental results demonstrate that our method generates high-quality images aligned with visual stimuli, outperforming SOTA approaches by 13.43% in Classification Accuracy, 15.21% in Generation Accuracy and reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.
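Stages (3) and (4) of the pipeline can be illustrated with a small NumPy sketch. The abstract does not specify the scoring function or the interpolation weight, so cosine similarity, the weight `alpha`, and both function names below are assumptions made for illustration.

```python
import numpy as np


def rerank_by_cosine(query, candidates):
    """Return candidate indices sorted by descending cosine similarity to the query."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-12
    similarities = candidates @ query / norms
    return np.argsort(-similarities)


def interpolate_embeddings(concept, caption, alpha):
    """Weighted interpolation: alpha * concept + (1 - alpha) * caption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    concept = np.asarray(concept, dtype=float)
    caption = np.asarray(caption, dtype=float)
    return alpha * concept + (1.0 - alpha) * caption


# Toy 2-D embeddings: pick the caption closest to the EEG-derived query,
# then blend it with the concept embedding.
eeg_query = [1.0, 0.0]
caption_candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
order = rerank_by_cosine(eeg_query, caption_candidates)
best_caption = caption_candidates[order[0]]
conditioning = interpolate_embeddings([0.0, 1.0], best_caption, alpha=0.25)
```

In a real pipeline the blended vector would condition the image generator; here it is just an array.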

[69] Streaming 4D Visual Geometry Transformer

Dong Zhuo,Wenzhao Zheng,Jiahe Guo,Yuqi Wu,Jie Zhou,Jiwen Lu

Main category: cs.CV

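TL;DR: The paper proposes a streaming 4D visual geometry transformer (StreamVGGT) for real-time perception and reconstruction of 4D spatial-temporal geometry from video. It uses a causal transformer architecture with an implicit memory mechanism to support efficient inference and high-quality spatial consistency.

Details Motivation: Real-time, interactive 4D geometry perception and reconstruction is an important but challenging computer vision task. Conventional methods struggle to balance real-time operation against performance, so an efficient and scalable solution is needed.

Contribution: 1. A streaming 4D visual geometry transformer that supports real-time processing; 2. Causal attention with a cache of historical keys and values for efficient long-term 4D reconstruction; 3. Knowledge distillation from the bidirectional VGGT for efficient training; 4. Support for migrating efficient attention operators such as FlashAttention.

Method: 1. A causal transformer architecture that processes the input sequence online; 2. Temporal causal attention with cached history serving as implicit memory; 3. Knowledge distilled from the bidirectional VGGT model; 4. Support for migrating efficient attention operators.

Result: On several 4D geometry perception benchmarks, the model increases online inference speed while remaining competitive in accuracy, paving the way for scalable and interactive 4D vision systems.

Insight: 1. The causal-transformer idea from language models transfers effectively to 4D vision tasks; 2. Implicit memory and knowledge distillation can substantially improve real-time systems.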

Abstract: Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized, efficient attention operators (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
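The idea of caching historical keys and values as implicit memory can be sketched with a single attention head in NumPy. This is a toy illustration, not StreamVGGT's implementation: the class name, the random projection weights, and the choice to let tokens within a frame attend to each other freely are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class StreamingAttention:
    """Single-head attention that processes one frame at a time (hypothetical sketch)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        # Cached keys and values of all frames seen so far (the implicit memory).
        self.k_cache = np.empty((0, dim))
        self.v_cache = np.empty((0, dim))

    def step(self, frame_tokens):
        """Attend from the new frame's tokens to the cache plus the new frame itself."""
        q = frame_tokens @ self.wq
        self.k_cache = np.concatenate([self.k_cache, frame_tokens @ self.wk])
        self.v_cache = np.concatenate([self.v_cache, frame_tokens @ self.wv])
        scores = q @ self.k_cache.T / np.sqrt(self.dim)
        return softmax(scores) @ self.v_cache


# Four frames of three 8-dimensional tokens each, processed online.
frame_rng = np.random.default_rng(1)
frames = [frame_rng.normal(size=(3, 8)) for _ in range(4)]
attention = StreamingAttention(dim=8)
outputs = [attention.step(frame) for frame in frames]
```

Each step projects only the new frame's tokens; earlier frames are never re-encoded, which is where the online speed-up comes from. The trade-off is that the cache grows with the length of the stream.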

Zhen Xu,Hongyu Zhou,Sida Peng,Haotong Lin,Haoyu Guo,Jiahao Shao,Peishan Yang,Qinglin Yang,Sheng Miao,Xingyi He,Yifan Wang,Yue Wang,Ruizhen Hu,Yiyi Liao,Xiaowei Zhou,Hujun Bao

Main category: cs.CV

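TL;DR: The paper reviews the development of vision-based depth estimation methods, examines the potential of building "depth foundation models" and their role in addressing generalization and stability problems, and summarizes the relevant large-scale datasets and training strategies.

Details Motivation: Conventional depth estimation relies on hardware sensors such as LiDAR, which are costly and limited, while existing vision-based methods face challenges in generalization and stability. Inspired by foundation models in other domains, researchers aim to develop "depth foundation models" with zero-shot generalization.

Contribution: Systematically summarizes depth estimation methods across the monocular, stereo, multi-view, and monocular video settings, and discusses the potential of depth foundation models together with the large-scale datasets and training strategies they require.

Method: Surveys the evolution of deep learning architectures and paradigms for depth estimation, and analyzes how large-scale data and training strategies improve model performance.

Result: The paper proposes no new method; it summarizes current research progress and offers directions for the future development of depth foundation models.

Insight: With large-scale data and suitable training strategies, depth foundation models could achieve zero-shot generalization and overcome the limitations of both sensor-based methods and existing vision-based methods.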

Abstract: Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often constrained by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either low-capacity model architectures or reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of “depth foundation models”: deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.

eess.IV [Back]

[71] HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

Arefin Ittesafun Abian,Ripon Kumar Debnath,Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Md Rafiqul Islam,Asif Karim,Reem E. Mohamed,Sami Azam

Main category: eess.IV

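TL;DR: HANS-Net is a new framework for liver and tumor segmentation in CT imaging. It combines hyperbolic convolutions, multi-scale texture learning, adaptive feature enhancement, and implicit neural representations to achieve high accuracy and strong generalization.

Details Motivation: Accurate liver and tumor segmentation is critical for diagnosis and treatment, but it is challenging because of complex anatomy, variable tumor appearance, and limited annotated data.

Contribution: Proposes HANS-Net, which fuses hyperbolic convolutions, a wavelet-inspired decomposition module, a synaptic plasticity mechanism, and an implicit neural representation branch, and adds uncertainty quantification and temporal attention.

Method: Combines hierarchical geometric representation via hyperbolic convolutions, wavelet-based multi-scale texture learning, synaptic-plasticity feature enhancement, and implicit neural boundary modeling, supplemented by Monte Carlo dropout and lightweight temporal attention.

Result: On the LiTS dataset, HANS-Net achieves 93.26% Dice, 88.09% IoU, 0.72 mm ASSD, and 11.91% VOE; on cross-dataset validation with 3D-IRCADb-01 it obtains 87.45% Dice, 80.30% IoU, 1.525 mm ASSD, and 19.71% VOE.

Insight: The interplay of several novel modules gives HANS-Net high segmentation accuracy and strong generalization, making it well suited to complex medical imaging tasks.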

Abstract: Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations on the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the 3D-IRCADb-01 dataset obtains an average Dice of 87.45%, IoU of 80.30%, ASSD of 1.525 mm, and VOE of 19.71%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.
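The overlap metrics quoted above have standard definitions on binary masks, sketched below with NumPy. The function name `overlap_metrics` is hypothetical, and ASSD is omitted because it requires surface extraction. Note that the reported IoU and VOE sum to 100% in both experiments (88.09% + 11.91%, 80.30% + 19.71% up to rounding), consistent with VOE = 1 - IoU.

```python
import numpy as np


def overlap_metrics(pred, target):
    """Dice, IoU and VOE for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    intersection = int(np.logical_and(pred, target).sum())
    union = int(np.logical_or(pred, target).sum())
    if union == 0:
        # Both masks empty: treat as a perfect match by convention.
        return {"dice": 1.0, "iou": 1.0, "voe": 0.0}
    iou = intersection / union
    dice = 2 * intersection / (int(pred.sum()) + int(target.sum()))
    return {"dice": dice, "iou": iou, "voe": 1.0 - iou}


# Toy 1-D masks: one overlapping voxel out of three labelled voxels in total.
toy_metrics = overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single pair of masks Dice and IoU are tied by Dice = 2·IoU / (1 + IoU); averaged over many cases the reported means need not satisfy this exactly.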

[72] Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays

Gaurav Singh

Main category: eess.IV

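TL;DR: The paper compares traditional machine learning and deep learning approaches with Vision Transformers (ViTs) for pneumonia detection in chest X-rays and finds that Cross-ViT performs best, underscoring the importance of architectural choice.

Details Motivation: Rapid and accurate diagnosis of pneumonia, particularly COVID-19-related pneumonia, is a global health challenge, and automated methods are needed to make detection more efficient.

Contribution: Systematically compares traditional methods with ViTs and shows that Cross-ViT outperforms traditional CNNs in both accuracy and recall.

Method: Evaluates PCA-based clustering, logistic regression, SVM, CNNs (Modified LeNet, DenseNet-121), and several ViTs (Deep-ViT, Compact Convolutional Transformer, Cross-ViT) on 5,856 pediatric chest X-ray images.

Result: Cross-ViT performs best with 88.25% accuracy and 99.42% recall, indicating that architectural choices matter more than model size.

Insight: Vision Transformers show considerable promise for medical image diagnosis; the balance between precision and recall remains a critical practical consideration.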

Abstract: Pneumonia, particularly when induced by diseases like COVID-19, remains a critical global health challenge requiring rapid and accurate diagnosis. This study presents a comprehensive comparison of traditional machine learning and state-of-the-art deep learning approaches for automated pneumonia detection using chest X-rays (CXRs). We evaluate multiple methodologies, ranging from conventional machine learning techniques (PCA-based clustering, Logistic Regression, and Support Vector Classification) to advanced deep learning architectures including Convolutional Neural Networks (Modified LeNet, DenseNet-121) and various Vision Transformer (ViT) implementations (Deep-ViT, Compact Convolutional Transformer, and Cross-ViT). Using a dataset of 5,856 pediatric CXR images, we demonstrate that Vision Transformers, particularly the Cross-ViT architecture, achieve superior performance with 88.25% accuracy and 99.42% recall, surpassing traditional CNN approaches. Our analysis reveals that architectural choices impact performance more significantly than model size, with Cross-ViT’s 75M parameters outperforming larger models. The study also addresses practical considerations including computational efficiency, training requirements, and the critical balance between precision and recall in medical diagnostics. Our findings suggest that Vision Transformers offer a promising direction for automated pneumonia detection, potentially enabling more rapid and accurate diagnosis during health crises.
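The gap between 99.42% recall and 88.25% accuracy reflects the precision-recall balance the abstract mentions. The sketch below computes the three metrics from a confusion matrix; the counts are invented for illustration and are not the paper's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, recall and precision from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        # Recall: share of actual pneumonia cases that were detected.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Precision: share of pneumonia predictions that were correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical counts: almost no missed cases, but a fair number of false alarms.
toy = classification_metrics(tp=99, fp=20, fn=1, tn=80)
print({name: round(value, 4) for name, value in toy.items()})
```

With these toy counts, recall is very high while accuracy and precision are noticeably lower, because the errors are concentrated in false positives. In screening settings that trade-off is often accepted, since a missed case is usually costlier than a false alarm.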

[73] U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV

Hongbo Ye,Fenghe Tang,Peiang Zhao,Zhen Huang,Dexin Zhao,Minghao Bian,S. Kevin Zhou

Main category: eess.IV

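TL;DR: U-RWKV is a lightweight medical image segmentation framework. A direction-adaptive RWKV architecture and new module designs give it efficient long-range modeling and strong segmentation performance in resource-constrained settings.

Details Motivation: Medical image segmentation in resource-limited settings needs efficient long-range modeling, but existing methods such as U-Net have limited global effective receptive fields and capture long-range dependencies poorly.

Contribution: Proposes U-RWKV, which combines the Direction-Adaptive RWKV Module (DARM) and the Stage-Adaptive Squeeze-and-Excitation Module (SASE) for efficient long-range context modeling and dynamic feature extraction.

Method: A lightweight RWKV-based architecture. DARM uses Dual-RWKV and QuadScan mechanisms to mitigate directional bias; SASE adapts dynamically to different feature extraction stages.

Result: Experiments show that U-RWKV achieves state-of-the-art medical image segmentation performance while remaining computationally efficient.

Insight: Direction-adaptive and stage-adaptive mechanisms can substantially improve the long-range modeling ability of lightweight models, offering a practical solution for resource-constrained environments.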

Abstract: Achieving equity in healthcare accessibility requires lightweight yet high-performance solutions for medical image segmentation, particularly in resource-limited settings. Existing methods like U-Net and its variants often suffer from limited global Effective Receptive Fields (ERFs), hindering their ability to capture long-range dependencies. To address this, we propose U-RWKV, a novel framework leveraging the Recurrent Weighted Key-Value (RWKV) architecture, which achieves efficient long-range modeling at O(N) computational cost. The framework introduces two key innovations: the Direction-Adaptive RWKV Module (DARM) and the Stage-Adaptive Squeeze-and-Excitation Module (SASE). DARM employs Dual-RWKV and QuadScan mechanisms to aggregate contextual cues across images, mitigating directional bias while preserving global context and maintaining high computational efficiency. SASE dynamically adapts its architecture to different feature extraction stages, balancing high-resolution detail preservation and semantic relationship capture. Experiments demonstrate that U-RWKV achieves state-of-the-art segmentation performance with high computational efficiency, offering a practical solution for democratizing advanced medical imaging technologies in resource-constrained environments. The code is available at https://github.com/hbyecoding/U-RWKV.

cs.NI [Back]

[74] LiLM-RDB-SFC: Lightweight Language Model with Relational Database-Guided DRL for Optimized SFC Provisioning

Parisa Fard Moshiri,Xinyu Zhu,Poonam Lohan,Burak Kantarci,Emil Janulewicz

Main category: cs.NI

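TL;DR: The paper proposes LiLM-RDB-SFC, which combines a lightweight language model (LiLM) with a relational database (RDB) to optimize Service Function Chain (SFC) provisioning. Two language models, BART and FLAN-T5, are used to interpret network data, and experiments show that FLAN-T5 performs better.

Details Motivation: Efficient management of SFCs and VNFs is a key challenge in modern SDN and NFV environments. Existing DRL methods depend on structured data and fixed rules, which limits their adaptability.

Contribution: Proposes LiLM-RDB-SFC, which combines a lightweight language model with a relational database to make SFC provisioning more adaptive to changing network conditions.

Method: Uses BART and FLAN-T5 to interpret network data, and uses RDB queries to guide a DRL model in optimizing SFC provisioning.

Result: FLAN-T5 outperforms BART in test loss (0.00161 vs 0.00734), accuracy (94.79% vs 80.2%), and processing time (2h 2min vs 2h 38min), and cuts processing time by 96% relative to SQLCoder while matching its accuracy.

Insight: Lightweight language models offer clear advantages for dynamic network decision-making, and FLAN-T5 performs well at a small model scale.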

Abstract: Effective management of Service Function Chains (SFCs) and optimal Virtual Network Function (VNF) placement are critical challenges in modern Software-Defined Networking (SDN) and Network Function Virtualization (NFV) environments. Although Deep Reinforcement Learning (DRL) is widely adopted for dynamic network decision-making, its inherent dependency on structured data and fixed action rules often limits adaptability and responsiveness, particularly under unpredictable network conditions. This paper introduces LiLM-RDB-SFC, a novel approach that combines a Lightweight Language Model (LiLM) with a Relational Database (RDB) to answer network state queries that guide a DRL model toward efficient SFC provisioning. Our proposed approach leverages two LiLMs, Bidirectional and Auto-Regressive Transformers (BART) and the Fine-tuned Language Net T5 (FLAN-T5), to interpret network data and support diverse query types related to SFC demands, data center resources, and VNF availability. Results demonstrate that FLAN-T5 outperforms BART with a lower test loss (0.00161 compared to 0.00734), higher accuracy (94.79% compared to 80.2%), and less processing time (2h 2min compared to 2h 38min). Moreover, when compared to the large language model SQLCoder, FLAN-T5 matches the accuracy of SQLCoder while cutting processing time by 96% (SQLCoder: 54 h 43 min; FLAN-T5: 2 h 2 min).
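The 96% figure can be checked from the processing times stated in the abstract. The short sketch below does the arithmetic; only the helper names are invented.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an hours-and-minutes duration to minutes."""
    return hours * 60 + minutes


def relative_reduction(new: float, old: float) -> float:
    """Fraction by which `new` is smaller than `old`."""
    return 1.0 - new / old


# Processing times quoted in the abstract.
flan_t5 = to_minutes(2, 2)      # 122 minutes
bart = to_minutes(2, 38)        # 158 minutes
sqlcoder = to_minutes(54, 43)   # 3283 minutes

vs_sqlcoder = relative_reduction(flan_t5, sqlcoder)  # about 0.963, i.e. roughly 96%
vs_bart = relative_reduction(flan_t5, bart)          # about 0.228

print(f"vs SQLCoder: {vs_sqlcoder:.1%}, vs BART: {vs_bart:.1%}")
```

The computed reduction against SQLCoder rounds to the 96% reported; the reduction against BART, which the abstract does not state as a percentage, comes out at a little under 23%.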

cs.GR [Back]

[75] Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model

Nuri Ryu,Jiyun Won,Jooeun Son,Minsu Gong,Joo-Haeng Lee,Sunghyun Cho

Main category: cs.GR

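TL;DR: Elevate3D is a novel framework that raises the quality of low-quality 3D assets by enhancing their texture and geometric detail. It refines texture with the HFS-SDEdit method and improves geometry using monocular geometry predictors.

Details Motivation: High-quality 3D assets are scarce and costly to acquire, and existing methods have largely overlooked geometry refinement. Elevate3D aims to address this shortage by enhancing both texture and geometry.

Contribution: Proposes the HFS-SDEdit texture enhancement method and pairs it with monocular geometry predictors, refining texture and geometry together to improve overall 3D model quality.

Method: The framework works view by view, alternating between texture and geometry refinement. Texture is enhanced with HFS-SDEdit; geometry is refined from the enhanced images using monocular geometry prediction.

Result: Elevate3D outperforms recent competing methods on 3D model refinement and achieves state-of-the-art quality.

Insight: Refining geometric detail is essential to improving 3D asset quality, and jointly refining texture and geometry is a promising direction.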

Abstract: High-quality 3D assets are essential for various applications in computer graphics and 3D vision but remain scarce due to significant acquisition costs. To address this shortage, we introduce Elevate3D, a novel framework that transforms readily accessible low-quality 3D assets into higher-quality ones. At the core of Elevate3D is HFS-SDEdit, a specialized texture enhancement method that significantly improves texture quality, preserving the asset's appearance and geometry while fixing its degradations. Furthermore, Elevate3D operates in a view-by-view manner, alternating between texture and geometry refinement. Unlike previous methods that have largely overlooked geometry refinement, our framework leverages geometric cues from images refined with HFS-SDEdit by employing state-of-the-art monocular geometry predictors. This approach ensures detailed and accurate geometry that aligns seamlessly with the enhanced texture. Elevate3D outperforms recent competitors by achieving state-of-the-art quality in 3D model refinement, effectively addressing the scarcity of high-quality open-source 3D assets.

cs.IR [Back]

[76] Overview of the TREC 2022 deep learning track

Nick Craswell,Bhaskar Mitra,Emine Yilmaz,Daniel Campos,Jimmy Lin,Ellen M. Voorhees,Ian Soboroff

Main category: cs.IR

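TL;DR: The TREC 2022 Deep Learning track focused on building a more complete test collection for the passage retrieval task. Deep neural ranking models with large-scale pretraining continued to outperform traditional retrieval methods.

Details Motivation: To validate and improve deep learning for retrieval by enlarging the collections and building a more complete test collection, and to provide a higher-quality dataset for future research.

Contribution: 1) Uses the refreshed, much larger MS MARCO passage and document collections; 2) Confirms the advantage of deep neural ranking models with large-scale pretraining; 3) Provides a more reliable test collection and evaluation results.

Method: Runs passage retrieval and document ranking tasks on the refreshed MS MARCO collections, with document-level labels inferred from passage-level labels.

Result: Deep neural ranking models outperformed traditional methods. Runs using single-stage dense retrieval were less competitive than last year, and some top-performing runs did not use dense retrieval.

Insight: Large-scale pretraining remains central to retrieval, but the advantage of dense retrieval may depend on task design or changes in the dataset.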

Abstract: This is the fourth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human annotated training labels available for both passage and document ranking tasks. In addition, this year we also leverage both the refreshed passage and document collections that were released last year, leading to a nearly 16 times increase in the size of the passage collection and a nearly fourfold increase in the document collection size. Unlike previous years, in 2022 we mainly focused on constructing a more complete test collection for the passage retrieval task, which has been the primary focus of the track. The document ranking task was kept as a secondary task, where document-level labels were inferred from the passage-level labels. Our analysis shows that, similar to previous years, deep neural ranking models that employ large scale pretraining continued to outperform traditional retrieval methods. Because we focused our judging resources on passage judging, we are more confident in the quality of this year’s queries and judgments, with respect to our ability to distinguish between runs and reuse the dataset in the future. We also see some surprises in overall outcomes. Some top-performing runs did not do dense retrieval. Runs that did single-stage dense retrieval were not as competitive this year as they were last year.

cs.HC [Back]

[77] Theory of Mind and Self-Disclosure to CUIs

Samuel Rhys Cox

Main category: cs.HC

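TL;DR: The paper discusses how making the "theory of mind" of conversational user interfaces (CUIs) more transparent could encourage users to self-disclose.

Details Motivation: Self-disclosure matters for well-being, but it is often difficult because people worry about how others will react. The paper therefore considers how CUIs could encourage self-disclosure.

Contribution: Proposes making a CUI's theory of mind more transparent, by showing its reasoning or expressing uncertainty, in order to encourage user self-disclosure.

Method: Discusses the role of social cues in CUIs and proposes making the CUI's theory of mind transparent (for example, by expressing uncertainty or representing its reasoning) to encourage self-disclosure.

Result: The approach may increase users' trust in CUIs and thus their willingness to self-disclose.

Insight: Making a CUI's intended theory of mind transparent may be an effective strategy for building user trust and encouraging self-disclosure.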

Abstract: Self-disclosure is important to help us feel better, yet is often difficult. This difficulty can arise from how we think people are going to react to our self-disclosure. In this workshop paper, we briefly discuss self-disclosure to conversational user interfaces (CUIs) in relation to various social cues. We then discuss how expressions of uncertainty or representation of a CUI’s reasoning could help encourage self-disclosure, by making a CUI’s intended “theory of mind” more transparent to users.

cs.LG [Back]

[78] Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Neel Rajani,Aryo Pradipta Gema,Seraphina Goldfarb-Tarrant,Ivan Titov

Main category: cs.LG

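TL;DR: The paper compares reinforcement learning (RL) and supervised fine-tuning (SFT) for training large language models (LLMs). RL yields minor gains on maths and slight degradation on knowledge-intensive tasks, while both effects are more pronounced with SFT, which may replace existing capabilities by modifying mid-layer weights. Freezing parts of the model to mitigate the degradation gives inconclusive results.

Details Motivation: To study how the training dynamics of RL and SFT differ in LLM post-training and to understand their effects on model capabilities and knowledge.

Contribution: Through controlled comparison and analysis of parameter changes, the paper gives preliminary evidence that the two methods work differently: RL amplifies existing capabilities, whereas SFT may replace them.

Method: Compares RL and SFT on the same maths problems with the same model and similar hyperparameters, analyzes parameter update patterns across checkpoints, and tries freezing parts of the model to reduce the side effects of SFT.

Result: RL yields minor gains on maths and slight degradation on knowledge-intensive benchmarks; SFT shows larger gains and larger degradation. Freezing parameters gives inconclusive results, with benefits on GPQA:Diamond and degradation on other benchmarks.

Insight: SFT is more likely to replace existing capabilities by modifying mid-layer MLP weights, while RL tends to amplify current capabilities. The effect of freezing parts of the model needs further study.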

Abstract: Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.
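Comparing how much each parameter group changes between checkpoints can be sketched as below. The abstract does not state which measure the authors use, so the relative Frobenius norm, the function names, and the toy parameter names are assumptions for illustration.

```python
import numpy as np


def relative_updates(before: dict, after: dict) -> dict:
    """Relative Frobenius-norm change per named parameter: ||W1 - W0|| / ||W0||."""
    changes = {}
    for name, w0 in before.items():
        w1 = after[name]
        changes[name] = float(np.linalg.norm(w1 - w0) / (np.linalg.norm(w0) + 1e-12))
    return changes


def rank_by_change(changes: dict) -> list:
    """Parameter names ordered from most to least changed."""
    return sorted(changes, key=changes.get, reverse=True)


# Toy checkpoints with hypothetical parameter names.
base = {"attn.q_proj": np.ones((2, 2)), "mlp.up_proj": np.ones((2, 2))}
tuned = {"attn.q_proj": 1.1 * np.ones((2, 2)), "mlp.up_proj": 1.5 * np.ones((2, 2))}
changes = relative_updates(base, tuned)
ranking = rank_by_change(changes)
```

Normalizing by the original weight norm makes layers of different sizes comparable; an absolute norm would tend to favor the largest matrices.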

[79] First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

Xingyu Zheng,Haotong Qin,Yuye Li,Jiakai Wang,Jinyang Guo,Michele Magno,Xianglong Liu

Main category: cs.LG

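TL;DR: FOEM is a new post-training quantization (PTQ) method that explicitly incorporates first-order gradient terms to improve quantization error compensation, substantially improving the quantized performance of large language models.

Details Motivation: Existing compensation-based weight calibration methods usually assume that the first-order gradient term is negligible in a well-trained full-precision model, but first-order deviations accumulated during progressive compensation invalidate this assumption. Quantization error therefore needs to be compensated more accurately.

Contribution: Proposes FOEM, which explicitly incorporates first-order gradient terms into quantization compensation and approximates gradients efficiently from the difference between latent and full-precision weights. It also uses precomputed Cholesky factors to recover the inverse of Hessian submatrices in real time.

Method: FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding costly backpropagation-based gradient computation, and uses precomputed Cholesky factors to recover the inverse of Hessian submatrices efficiently.

Result: In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6% and raises the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%.

Insight: First-order deviations accumulated during quantization are not negligible, and compensating for the first-order term explicitly can substantially improve quantized performance. FOEM's design achieves this with minimal additional computational overhead.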

Abstract: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
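Why the first-order term stops being negligible once latent weights drift from their full-precision values can be seen in a small numeric sketch. It assumes the layer-wise squared reconstruction loss used by GPTQ-style methods, L(w) = ||(w - w_fp) X||², for which the gradient at the latent weights equals the Hessian times the weight difference. Whether FOEM uses exactly this formulation is an assumption here, and all names are illustrative.

```python
import numpy as np


def layer_loss(w, w_fp, x):
    """Squared reconstruction error of one weight row: ||(w - w_fp) X||^2."""
    err = (w - w_fp) @ x
    return float(err @ err)


def taylor_terms(w_latent, w_fp, x, delta):
    """First- and second-order Taylor terms of layer_loss around w_latent."""
    hessian = 2.0 * x @ x.T                 # H = 2 X X^T
    grad = hessian @ (w_latent - w_fp)      # gradient obtained from the weight difference
    first_order = float(grad @ delta)
    second_order = 0.5 * float(delta @ hessian @ delta)
    return first_order, second_order


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 32))                     # calibration inputs (features x samples)
w_fp = rng.normal(size=6)                        # full-precision weight row
w_latent = w_fp + 0.05 * rng.normal(size=6)      # latent weights after earlier compensation
delta = 0.02 * rng.normal(size=6)                # perturbation from quantizing the next weight

first, second = taylor_terms(w_latent, w_fp, x, delta)
true_change = layer_loss(w_latent + delta, w_fp, x) - layer_loss(w_latent, w_fp, x)
```

Because this loss is quadratic, `first + second` reproduces `true_change` exactly, whereas `second` alone matches only when the latent weights equal the full-precision weights. Note that the gradient needs no backpropagation in this setting: it is a matrix-vector product with a Hessian that GPTQ-style methods already compute.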

[80] AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air

Shiyi Yang,Xiaoxue Yu,Rongpeng Li,Jianhang Zhu,Zhifeng Zhao,Honggang Zhang

Main category: cs.LG

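TL;DR: AirLLM proposes a diffusion-policy-based adaptive LoRA method for efficient remote fine-tuning of LLMs under limited communication bandwidth.

Details Motivation: Running large language models (LLMs) on edge devices is constrained by communication bandwidth and compute resources, and existing LoRA approaches fall short because of fixed rank configurations and inefficient transmission.

Contribution: 1. Proposes the AirLLM framework, which combines PPO with a diffusion model for dynamic LoRA rank configuration; 2. Reduces transmission overhead by jointly accounting for channel conditions and task complexity; 3. Validates the approach experimentally under varying signal-to-noise ratios.

Method: 1. Models the LoRA rank configuration as a structured action vector; 2. A PPO agent generates coarse-grained decisions, which DDIM refines into high-resolution rank vectors; 3. PPO and DDIM are optimized alternately, with DDIM trained under Classifier-Free Guidance (CFG) to stay aligned with PPO rewards.

Result: Experiments show that AirLLM improves fine-tuning performance while reducing transmission costs.

Insight: Dynamic LoRA rank configuration that combines reinforcement learning with diffusion models can adapt efficiently to communication and task demands, offering a new approach to LLM fine-tuning for edge devices.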

Abstract: Operating Large Language Models (LLMs) on edge devices is increasingly challenged by limited communication bandwidth and strained computational and memory costs. Thus, cloud-assisted remote fine-tuning becomes indispensable. Nevertheless, existing Low-Rank Adaptation (LoRA) approaches typically employ fixed or heuristic rank configurations, and the subsequent over-the-air transmission of all LoRA parameters could be rather inefficient. To address this limitation, we develop AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation. Specifically, AirLLM models the rank configuration as a structured action vector that spans all LoRA-inserted projections. To solve the underlying high-dimensional sequential decision-making problem, a Proximal Policy Optimization (PPO) agent generates coarse-grained decisions by jointly observing wireless states and linguistic complexity, which are then refined via Denoising Diffusion Implicit Models (DDIM) to produce high-resolution, task- and channel-adaptive rank vectors. The two modules are optimized alternately, with the DDIM trained under the Classifier-Free Guidance (CFG) paradigm to maintain alignment with PPO rewards. Experiments under varying signal-to-noise ratios demonstrate that AirLLM consistently enhances fine-tuning performance while significantly reducing transmission costs, highlighting the effectiveness of reinforcement-driven, diffusion-refined rank adaptation for scalable and efficient remote fine-tuning over the air.
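The link between a rank vector and transmission cost follows from the standard LoRA parameter count: a rank-r adapter on a d_out x d_in projection adds r * (d_in + d_out) parameters. The sketch below applies that formula; the projection shapes, the rank vectors, the 16-bit encoding, and the convention that rank 0 means "no adapter" are assumptions, not values from AirLLM.

```python
def lora_param_count(shapes, ranks):
    """Total LoRA parameters for projections of shape (d_out, d_in) with per-projection ranks."""
    if len(shapes) != len(ranks):
        raise ValueError("need exactly one rank per projection")
    if any(r < 0 for r in ranks):
        raise ValueError("ranks must be non-negative")
    return sum(r * (d_in + d_out) for (d_out, d_in), r in zip(shapes, ranks))


def payload_bytes(param_count, bits_per_param=16):
    """Bytes needed to transmit the adapters at a given precision."""
    return param_count * bits_per_param // 8


# Four hypothetical 4096 x 4096 projections.
projection_shapes = [(4096, 4096)] * 4
fixed_ranks = [8, 8, 8, 8]       # uniform configuration
adaptive_ranks = [2, 4, 0, 8]    # hypothetical adaptive configuration

fixed_params = lora_param_count(projection_shapes, fixed_ranks)
adaptive_params = lora_param_count(projection_shapes, adaptive_ranks)
saving = 1.0 - adaptive_params / fixed_params
```

Because the count is linear in each rank, the payload is proportional to the sum of ranks when all projections share a shape, so lowering ranks where the task tolerates it translates directly into fewer bytes over the air.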

[81] FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise

Mengwen Ye,Yingzi Huangfu,Shujian Gao,Wei Ren,Weifan Liu,Zekuan Yu

Main category: cs.LG

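TL;DR: FedGSCA is a novel federated learning framework that targets label noise in medical image classification. A global sample selector and a client adaptive adjustment mechanism improve the model's robustness and performance under noise.

Details Motivation: Medical federated learning suffers from label noise and data imbalance, and existing methods struggle with heterogeneous and diverse noise. FedGSCA is designed to address these problems.

Contribution: 1) Proposes a Global Sample Selector (GSS) that aggregates noise knowledge from all clients to handle noise heterogeneity; 2) Develops a Client Adaptive Adjustment (CAA) mechanism that combines pseudo-label generation with a robust label loss and adapts dynamically to the data distribution.

Method: 1) GSS aggregates global noise knowledge; 2) CAA adjusts local training using adaptive-threshold pseudo-labels and the Robust Credal Labeling Loss.

Result: On one real-world colon slides dataset and two synthetic medical datasets, FedGSCA outperforms state-of-the-art methods, excelling under extreme and heterogeneous noise, and improves model stability and generalization.

Insight: By combining global noise management with dynamic local adjustment, FedGSCA offers an effective answer to label noise in medical federated learning and shows promise for real-world use.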

Abstract: Federated Learning (FL) emerged as a solution for collaborative medical image classification while preserving data privacy. However, label noise, which arises from inter-institutional data variability, can cause training instability and degrade model performance. Existing FL methods struggle with noise heterogeneity and the imbalance in medical data. Motivated by these challenges, we propose FedGSCA, a novel framework for enhancing robustness in noisy medical FL. FedGSCA introduces a Global Sample Selector that aggregates noise knowledge from all clients, effectively addressing noise heterogeneity and improving global model stability. Furthermore, we develop a Client Adaptive Adjustment (CAA) mechanism that combines adaptive threshold pseudo-label generation and Robust Credal Labeling Loss. CAA dynamically adjusts to class distributions, ensuring the inclusion of minority samples and carefully managing noisy labels by considering multiple plausible labels. This dual approach mitigates the impact of noisy data and prevents overfitting during local training, which improves the generalizability of the model. We evaluate FedGSCA on one real-world colon slides dataset and two synthetic medical datasets under various noise conditions, including symmetric, asymmetric, extreme, and heterogeneous types. The results show that FedGSCA outperforms the state-of-the-art methods, excelling in extreme and heterogeneous noise scenarios. Moreover, FedGSCA demonstrates significant advantages in improving model stability and handling complex noise, making it well-suited for real-world medical federated learning scenarios.

[82] Flows and Diffusions on the Neural Manifold

Daniel Saragih,Deyu Cao,Tejas Balaji

Main category: cs.LG

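TL;DR: The paper extends diffusion and flow-based generative models to weight space learning. It unifies trajectory inference techniques under gradient flow matching, treats optimization paths as an inductive bias, and validates the approach experimentally.

Details Motivation: To build on the success of diffusion and flow-based generative models by extending them to weight space learning, using optimization dynamics as a structural prior to improve generated weights and downstream performance.

Contribution: Proposes the gradient flow matching framework, which unifies trajectory inference techniques; treats optimization paths as an inductive bias; explores task-specific context conditioning and the choice of source distribution; and demonstrates benefits for weight generation, initialization, and safety-related detection.

Method: Models the trajectory induced by gradient descent as a trajectory inference problem within the gradient flow matching framework, combined with autoencoders for latent weight representation, task conditioning, and source distributions such as Kaiming uniform.

Result: The method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and outperforms the closest comparable baseline at detecting harmful covariate shifts.

Insight: Optimization paths can serve as a strong inductive bias for generative models; task conditioning and structural priors are important for weight generation; and the method has potential in safety-critical systems.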

Abstract: Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques under the framework of gradient flow matching, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.

[83] A Simple Baseline for Stable and Plastic Neural Networks

É. Künzel,A. Jaziri,V. Ramesh

Main category: cs.LG

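TL;DR: The paper proposes RDBP, a simple method that combines ReLUDown and Decreasing Backpropagation to balance stability and plasticity in continual learning at low cost.

Details Motivation: Existing continual learning methods often struggle to balance stability and plasticity, which leads to poor performance or excessive computational cost.

Contribution: Proposes RDBP, which balances stability and plasticity efficiently through ReLUDown, an activation function modification, and Decreasing Backpropagation, a gradient-scheduling scheme.

Method: Combines ReLUDown (an activation modification that prevents neuron dormancy) with Decreasing Backpropagation (a gradient schedule that progressively shields early layers).

Result: On the Continual ImageNet benchmark, RDBP matches or exceeds state-of-the-art methods in plasticity and stability while reducing computational cost.

Insight: RDBP offers a simple, efficient baseline for continual learning and highlights the value of lightweight, biologically inspired strategies.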

Abstract: Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.

[84] Spatial Reasoners for Continuous Variables in Any Domain

Bart Pogodzinski,Christopher Wewer,Bernt Schiele,Jan Eric Lenssen

Main category: cs.LG

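TL;DR: Presents Spatial Reasoners, a framework built on generative denoising models for spatial reasoning over continuous variables in arbitrary domains, which lowers the barrier to research in this area.

Details Motivation: Generative denoising models excel at image generation, but applying them to reasoning over multiple continuous variables still lacks efficient tooling.

Contribution: Develops the Spatial Reasoners framework, which simplifies generative reasoning research and supports flexible variable mapping, model paradigms, and inference strategies.

Method: The framework integrates multiple denoising formulations, samplers, and inference strategies behind easy-to-use interfaces for spatial reasoning over arbitrary data domains.

Result: The framework has been released openly, giving researchers ready-to-use tooling.

Insight: Generative models show considerable potential for reasoning over continuous variables; their application scenarios and performance optimization need further exploration.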

Abstract: We present Spatial Reasoners, a software framework to perform spatial reasoning over continuous variables with generative denoising models. Denoising generative models have become the de facto standard for image generation, due to their effectiveness in sampling from complex, high-dimensional distributions. Recently, they have started being explored in the context of reasoning over multiple continuous variables. Providing infrastructure for generative reasoning with such models requires considerable effort, due to a wide range of different denoising formulations, samplers, and inference strategies. Our framework aims to facilitate research in this area, providing easy-to-use interfaces to control variable mapping from arbitrary data domains, generative model paradigms, and inference strategies. Spatial Reasoners is openly available at https://spatialreasoners.github.io/

[85] LogTinyLLM: Tiny Large Language Models Based Contextual Log Anomaly Detection

Isaiah Thompson Ocansey,Ritwik Bhattacharya,Tanmay Sen

Main category: cs.LG

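TL;DR: The paper proposes parameter-efficient fine-tuning based on low-rank adaptation (LoRA) and adapters to detect contextual anomalies in large-scale log data, with substantial gains over a full fine-tuning baseline.

Details Motivation: The complexity and volume of log data make anomaly detection hard for traditional rule-based or deep learning methods, so an efficient way to identify anomalous log sequences is needed.

Contribution: Applies LoRA and adapter methods to log anomaly detection and shows that parameter-efficient fine-tuning substantially outperforms the LogBERT-based full fine-tuning approach.

Method: Uses LoRA- and adapter-based fine-tuning with tiny large language models (tiny LLMs), compared on the Thunderbird dataset.

Result: LoRA-based fine-tuning improves accuracy by 18 to 19 percentage points, reaching 97.76% to 98.83% versus 79.37% for the LogBERT-based full fine-tuning baseline.

Insight: Parameter-efficient fine-tuning techniques such as LoRA offer clear advantages in resource-constrained settings and suit anomaly detection on large-scale log data.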

Abstract: Log anomaly detection using traditional rule-based or deep-learning-based methods is often challenging due to the large volume and highly complex nature of log sequences. An effective way of detecting anomalous log sequences is therefore crucial for system maintenance and development. This paper proposes parameter-efficient fine-tuning, specifically low-rank adaptation (LoRA) and adapter-based approaches, for finding contextual anomalies in log sequences in large log datasets. It compares different tiny large language models (LLMs) on the Thunderbird dataset. The results show that LoRA-based fine-tuning provides substantial performance improvements of 18 to 19 percentage points over the LogBERT-based full fine-tuning approach, achieving accuracy scores between 97.76% and 98.83% compared to 79.37%.
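
The abstract does not give implementation details, so the following is only a minimal sketch of the general LoRA idea as it is commonly described: the pretrained weight matrix W stays frozen and a low-rank update B·A, scaled by alpha / r, is learned instead. The matrix sizes, the `alpha` value, and all function names are illustrative assumptions, not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, d_out x d_in
    b: trainable factor, d_out x r
    a: trainable factor, r x d_in
    """
    r = len(a)  # rank of the update
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in the two low-rank factors."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (toy example)
    b = [[1.0], [2.0]]             # 2x1 factor
    a = [[3.0, 4.0]]               # 1x2 factor
    print(lora_effective_weight(w, a, b, alpha=2.0))
    # Hypothetical 768x768 layer with rank 8: low-rank factors vs. full matrix.
    print(lora_trainable_params(768, 768, 8), "vs", 768 * 768)
```

The parameter count is the point of the technique: for the hypothetical layer above, the two factors hold far fewer trainable values than the full matrix, which is what makes fine-tuning small models on large log corpora comparatively cheap.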

cs.MM [Back]

[86] MultiVox: Benchmarking Voice Assistants for Multimodal Interactions

Ramaneswaran Selvakumar,Ashish Seth,Nishit Anand,Utkarsh Tyagi,Sonal Kumar,Sreyan Ghosh,Dinesh Manocha

Main category: cs.MM

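TL;DR: The paper introduces MultiVox, the first benchmark for evaluating how well voice assistants integrate spoken and visual cues in multimodal interactions, and shows that current models fall short at generating context-aware responses.

Details Motivation: With progress in large language models (LLMs), voice assistants can process multimodal inputs such as speech and visual data. Existing benchmarks, however, do not comprehensively evaluate how well these models understand fine-grained speech characteristics and environmental acoustic context, or how well they combine these with visual signals to generate context-aware responses.

Contribution: Introduces the MultiVox benchmark, comprising 1000 human-annotated and recorded speech dialogues that cover diverse paralinguistic features and visual cues such as images and videos, for evaluating the multimodal understanding of voice assistants.

Method: MultiVox collects and annotates speech dialogues containing paralinguistic features (such as pitch, emotion, timbre, and volume) and environmental acoustic context, paired with visual cues (images and videos), to form a multimodal evaluation dataset. Nine state-of-the-art models are evaluated on it.

Result: Although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.

Insight: Improving multimodal voice assistants calls for more comprehensive benchmarks, particularly for the integration of spoken and visual cues, and for closer study of how paralinguistic features can inform contextual understanding.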

Abstract: The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.

cs.CY [Back]

[87] Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?

Bhakti Khera,Rezvan Alamian,Pascal A. Scherz,Stephan M. Goetz

Main category: cs.CY

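TL;DR: Evaluates several large language models (LLMs) on understanding and applying patent regulations. Even the best model fell short of the standard of the professional patent attorney examination, exposing weaknesses in logical consistency, multimodality, and prompt adaptivity.

Details Motivation: To explore the practical ability of LLMs in the legal field, especially in understanding patent regulations, to assess whether they meet professional standards, and to provide a quantitative analysis for related research.

Contribution: Quantitatively evaluates several LLMs on a patent attorney examination, exposes their limitations, and identifies directions for improvement.

Method: Tests several open-source and proprietary LLMs on parts of the European Qualifying Examination (EQE) for patent attorneys, with human patent experts assessing the textual justifications in the model outputs.

Result: OpenAI o1 performed best (accuracy 0.82, F1 score 0.81) but did not reach the 0.90 threshold of the professional examination; Llama 3.1 8B lagged (accuracy 0.50 to 0.55), within the range of guessing for the two-answer design. Human experts valued clarity of explanation and legal rationale over raw answer correctness.

Insight: LLM capability in the legal field remains limited, especially in logical consistency and multimodality; evaluating models in professional domains needs human expert judgment in addition to automatic metrics.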

Abstract: The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and the reasons for it are underexplored. We evaluated several open-source and proprietary LLMs, including GPT-series, Anthropic, Deepseek, and Llama-3 variants, on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Llama 3.1 8B deployed on Amazon Web Services (AWS) lagged at 0.50 accuracy, and a Python-deployed Llama 3.1 8B scored 0.55. The latter two are within the range of mere guessing for the two-answer forced-choice design. None of the evaluated models could have passed the examination fully, as accuracy never exceeded the average threshold of 0.90 required for professional-level standards. This also holds for models that are regularly promoted for their assumed beyond-PhD and bar-admitted-lawyer-level performance. GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence. Human patent experts evaluated the textual justifications and uncovered various critical shortcomings of each model. They valued clarity and legal rationale over the raw correctness of the answers, which revealed misalignment between automatic metrics and expert judgment. Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the remaining necessity of expert oversight. Future work should target logical consistency, robust multimodality, and adaptive prompting to approach human-level patent proficiency. In summary, despite the outstanding performance of recent large models, the general public might overestimate their performance. The field has a long way to go to develop a virtual patent attorney. This paper points out several specific limitations that need solutions.

[88] Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors

Ekaterina Kochmar,Kaushal Kumar Maurya,Kseniia Petukhova,KV Aditya Srivatsa,Anaïs Tack,Justin Vasselli

Main category: cs.CY

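TL;DR: The paper presents the main findings of the BEA 2025 shared task, which assesses the pedagogical abilities of AI tutors powered by large language models (LLMs), with a focus on remediating student mistakes. The task comprised five tracks and attracted over 50 international teams; the results show substantial room for improvement.

Details Motivation: As AI tutors are used more widely in education, assessing their pedagogical abilities, especially mistake remediation, becomes essential. The task aims to improve the quality of AI tutor interactions through multi-dimensional evaluation.

Contribution: Defines a multi-dimensional evaluation framework (mistake identification, precise location of the mistake, providing guidance, and feedback actionability) and makes the associated resources publicly available as a basis for future benchmarking.

Method: The task has five tracks (four pedagogical ability dimensions plus tutor identity detection) grounded in learning science principles, and the submitted models are evaluated against gold-standard human annotations.

Result: Best results vary by track: macro F1 in the pedagogical ability tracks ranges from 58.34 (providing guidance) to 71.81 (mistake identification), while the best F1 in the tutor identification track reaches 96.98.

Insight: AI tutors still show clear pedagogical shortcomings, with providing guidance the weakest of the reported tracks; further progress will likely need to combine educational theory with model improvements.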

Abstract: This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on evaluating the quality of tutor responses aimed at remediating students' mistakes within educational dialogues. The task consisted of five tracks designed to automatically evaluate the AI tutor's performance across the key dimensions of mistake identification, precise location of the mistake, providing guidance, and feedback actionability, grounded in learning science principles that define good and effective tutor responses, as well as a track focusing on detection of the tutor identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four pedagogical ability assessment tracks range between macro F1 scores of 58.34 (for providing guidance) and 71.81 (for mistake identification) on three-class problems, with the best F1 score in the tutor identification track reaching 96.98 on a 9-class task. In this paper, we overview the main findings of the shared task, discuss the approaches taken by the teams, and analyze their performance. All resources associated with this task are made publicly available to support future research in this critical domain.
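
Since the headline numbers above are macro F1 scores on three-class problems, a small sketch of how macro F1 is conventionally computed may help when reading them: F1 is computed per class and then averaged without weighting by class frequency. The label names (`yes`, `partial`, `no`) and the toy data are hypothetical, and the shared task's official scorer may differ in details.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Hypothetical gold annotations and system predictions for four responses.
    gold = ["yes", "yes", "no", "partial"]
    pred = ["yes", "no", "no", "partial"]
    print(round(100 * macro_f1(gold, pred, ["yes", "partial", "no"]), 2))
```

Because every class counts equally, a system that does well only on the majority class scores poorly under macro F1, which is one reason it is a common choice for imbalanced annotation schemes.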

cs.RO [Back]

[89] rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding

Howard H. Qian,Yiting Chen,Gaotian Wang,Podshara Chanrungmaneekul,Kaiyu Hang

Main category: cs.RO

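TL;DR: The paper proposes rt-RISeg, a real-time interactive perception framework that segments unseen objects at the instance level through robot interaction and a model-free approach, with a marked gain in segmentation accuracy.

Details Motivation: Existing unseen object instance segmentation (UOIS) methods trained on large-scale datasets tend to overfit static visual features and generalize poorly. The authors address this by exploiting the interactive nature of vision.

Contribution: 1. Proposes rt-RISeg, a model-free real-time interactive perception framework; 2. designs a body frame-invariant feature (BFIF); 3. reports average segmentation accuracy 27.5% higher than state-of-the-art UOIS methods.

Method: Robot interactions produce BFIF features, and the relative motion of objects is used to segment unseen objects without any learned segmentation model.

Result: rt-RISeg achieves segmentation accuracy 27.5% higher than existing methods, and its masks can serve as prompts to vision foundation models for further improvement.

Insight: Interaction is key to addressing the generalization problem, and model-free methods show promise in highly dynamic scenes.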

Abstract: Successful execution of dexterous robotic manipulation tasks in new environments, such as grasping, depends on the ability to proficiently segment unseen objects from the background and other objects. Previous works in unseen object instance segmentation (UOIS) train models on large-scale datasets, which often leads to overfitting on static visual features. This dependency results in poor generalization performance when confronted with out-of-distribution scenarios. To address this limitation, we rethink the task of UOIS based on the principle that vision is inherently interactive and occurs over time. We propose a novel real-time interactive perception framework, rt-RISeg, that continuously segments unseen objects by robot interactions and analysis of a designed body frame-invariant feature (BFIF). We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model. This fully self-contained segmentation pipeline generates and updates object segmentation masks throughout each robot interaction without the need to wait for an action to finish. We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy rate 27.5% greater than state-of-the-art UOIS methods. Furthermore, although rt-RISeg is a standalone framework, we show that the autonomously generated segmentation masks can be used as prompts to vision foundation models for significantly improved performance.

[90] Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction

He Zhu,Ryo Miyoshi,Yuki Okafuji

Main category: cs.RO

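TL;DR: The paper proposes a Transformer-based multi-task learning framework that improves the decision-making of social robots in multi-party interactions, achieving state-of-the-art performance with new loss functions and a new dataset.

Details Motivation: Prior research has focused mainly on single-user interaction, whereas real-world social robots must understand context in multi-party settings and decide when and to whom to respond.

Contribution: 1. Proposes a Transformer-based multi-task learning framework; 2. designs two new loss functions, one for active speaker modeling and one for selecting responses to robot-directed utterances; 3. constructs a multi-party HRI dataset that captures real-world complexities.

Method: Uses a Transformer-based multi-task learning framework with two new loss functions: (1) one that enforces constraints on active speakers; (2) one that guides response selection toward utterances directed at the robot.

Result: Experiments show that the model outperforms existing heuristic-based and single-task approaches on response decisions, reaching state-of-the-art performance.

Insight: The work supports more natural interaction for social robots in complex multi-party scenes and highlights the importance of multi-task learning and context modeling.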

Abstract: Prior human-robot interaction (HRI) research has primarily focused on single-user interactions, where robots do not need to consider the timing or recipient of their responses. However, in multi-party interactions, such as at malls and hospitals, social robots must understand the context and decide both when and to whom they should respond. In this paper, we propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots, particularly in multi-user environments. Considering the characteristics of HRI, we propose two novel loss functions: one that enforces constraints on active speakers to improve scene modeling, and another that guides response selection towards utterances specifically directed at the robot. Additionally, we construct a novel multi-party HRI dataset that captures real-world complexities, such as gaze misalignment. Experimental results demonstrate that our model achieves state-of-the-art performance in response decisions, outperforming existing heuristic-based and single-task approaches. Our findings contribute to the development of socially intelligent robots capable of engaging in natural and context-aware multi-party interactions.

[91] Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation

Yanbo Wang,Zipeng Fang,Lei Zhao,Weidong Chen

Main category: cs.RO

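TL;DR: LE-Nav is a navigation framework based on multi-modal large language model (MLLM) reasoning and a conditional variational autoencoder (CVAE) that adaptively tunes planner hyperparameters, enabling zero-shot scene understanding and expert-level tuning.

Details Motivation: Conventional navigation systems rely on fixed parameters and adapt poorly to dynamic, diverse, unstructured environments, which degrades performance and reduces social acceptance. Reinforcement learning approaches have tried to improve on this, but limited simulation diversity and poor generalization hinder real-world deployment.

Contribution: Proposes the LE-Nav framework, which uses an MLLM for scene understanding and a CVAE to map natural language instructions to navigation hyperparameters, supporting adaptive tuning. Experiments show that it outperforms existing methods.

Method: Combines zero-shot scene reasoning by the MLLM (through one-shot exemplars and chain-of-thought prompting) with a CVAE-based hyperparameter mapping to adapt dynamically.

Result: In real-world navigation trials and a user study, LE-Nav outperforms existing methods on success rate, efficiency, safety, and comfort, and receives higher subjective scores.

Insight: Combining a language model with a generative model lets a navigation system adapt better to dynamic environments while remaining interpretable and socially compatible.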

Abstract: Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model reasoning and conditional variational autoencoders to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Additionally, a conditional variational autoencoder captures the mapping between natural language instructions and navigation hyperparameters, enabling expert-level tuning. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance. Code is available at https://github.com/Cavendish518/LE-Nav.

[92] All Eyes, no IMU: Learning Flight Attitude from Vision Alone

Jesse J. Hagenaars,Stein Stroobants,Sander M. Bohte,Guido C. H. E. De Croon

Main category: cs.RO

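TL;DR: The paper proposes a vision-only approach to flight control that uses an event camera and a recurrent convolutional neural network to estimate a drone's attitude and rotation rate, successfully replacing the conventional inertial measurement unit.

Details Motivation: Many flying animals rely on vision rather than a dedicated sense of gravity for attitude control, whereas conventional drones depend on inertial sensors. The study explores whether flight control is feasible with vision alone.

Contribution: Presents the first vision-only approach to flight control for generic environments, estimating attitude with an event camera and a low-latency neural network, and demonstrates its potential to replace inertial sensors.

Method: Trains a small recurrent convolutional neural network through supervised learning to extract attitude and rotation rate from a single event stream. The study also examines the effect of memory and field of view.

Result: In real-world flight tests the approach replaced the conventional inertial measurement unit and achieved stable flight control. Networks with memory and a wide field of view performed best, while variants with a narrower field of view showed better relative generalization.

Insight: Vision-only flight control is promising for small autonomous flying robots, especially where inertial sensors cannot be used. Memory and field-of-view design strongly affect performance and generalization.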

Abstract: Vision is an essential part of attitude control for many flying animals, some of which have no dedicated sense of gravity. Flying robots, on the other hand, typically depend heavily on accelerometers and gyroscopes for attitude stabilization. In this work, we present the first vision-only approach to flight control for use in generic environments. We show that a quadrotor drone equipped with a downward-facing event camera can estimate its attitude and rotation rate from just the event stream, enabling flight control without inertial sensors. Our approach uses a small recurrent convolutional neural network trained through supervised learning. Real-world flight tests demonstrate that our combination of event camera and low-latency neural network is capable of replacing the inertial measurement unit in a traditional flight control loop. Furthermore, we investigate the network’s generalization across different environments, and the impact of memory and different fields of view. While networks with memory and access to horizon-like visual cues achieve best performance, variants with a narrower field of view achieve better relative generalization. Our work showcases vision-only flight control as a promising candidate for enabling autonomous, insect-scale flying robots.

cs.AI [Back]

[93] Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Konstantinos I. Roumeliotis,Ranjan Sapkota,Manoj Karkee,Nikolaos D. Tselikas

Main category: cs.AI

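TL;DR: The paper proposes a modular multi-agent AI framework that combines visual classification with trust-aware orchestration and uses RAG to strengthen reasoning, substantially improving classification accuracy in the zero-shot setting.

Details Motivation: Existing multi-agent AI architectures lack trust mechanisms in zero-shot scenarios, which makes it hard to ensure reliable decisions, particularly in visual classification tasks.

Contribution: 1) Proposes a modular agentic architecture that separates visual perception from meta-reasoning; 2) introduces a RAG-based trust-aware orchestration mechanism; 3) validates the approach on apple leaf disease diagnosis.

Method: Benchmarks three configurations (zero-shot, fine-tuned, and trust-calibrated orchestration), using CLIP-based image retrieval and iterative re-evaluation to improve trust calibration.

Result: Accuracy in the zero-shot setting improves by 77.94%, reaching 85.63% overall; GPT-4o shows better calibration, while Qwen-2.5-VL displays overconfidence.

Insight: Separating vision from reasoning improves scalability and interpretability; trust-aware orchestration and RAG can correct agent overconfidence.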

Abstract: Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components, including the complete software source code, are openly released to support reproducibility, transparency, and community benchmarking on GitHub: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
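
Of the calibration metrics named above, ECE is assumed here to mean the standard expected calibration error: predictions are grouped into confidence bins, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. The sketch below shows that textbook definition only. The bin count, function name, and toy data are assumptions, and the abstract does not say how the paper bins its predictions or how OCR and CCC are defined.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - average confidence| over equal-width bins.

    confidences: predicted probabilities in [0, 1]
    correct: booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # A hypothetical overconfident agent: 90% confident, right half the time.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

An orchestrator could, for instance, down-weight an agent whose ECE is high, though whether the paper uses the metric in exactly this way is not stated in the abstract.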

[94] From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

Tatiana Petrova,Aleksandr Puzikov,Boris Bliznukov,Radu State

Main category: cs.AI

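TL;DR: The paper examines the Web of Agents (WoA), which turns the static, document-centric Web into an environment of autonomous agents, and gives the first comprehensive account of its evolution. Using a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism), the authors compare agent architectures across generations and identify a shift in the locus of intelligence from external data (Semantic Web) or the platform (MAS) to the agent's core model (LLM), which underpins modern Agentic AI. They argue that future research should address persistent socio-technical challenges such as decentralized identity, economic models, and security.

Details Motivation: Research on the Web of Agents is scattered across communities and lacks a unified historical narrative. Modern LLM-powered frameworks are treated as separate from traditional Multi-Agent Systems (MAS) and Semantic Web technologies, which hinders a holistic understanding of the field's development.

Contribution: 1. Presents the first comprehensive evolutionary overview of the Web of Agents; 2. introduces a four-axis taxonomy that provides a unified framework for comparing agent architectures across generations; 3. identifies a paradigm shift in the locus of intelligence from external sources to the agent's core model; 4. sets out a future research agenda focused on socio-technical challenges.

Method: Through historical analysis and taxonomy construction, the authors systematically compare agent architectures across generations along four axes (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism).

Result: The study traces a clear line of descent, showing that modern protocols (such as A2A and MCP) are direct evolutionary responses to the limitations of earlier standards (such as FIPA and OWL), and identifies the shift in the locus of intelligence as foundational to modern Agentic AI.

Insight: The scalability and adaptivity of modern Agentic AI depend on the locus of intelligence moving into the agent's own model. Future research needs to address socio-technical challenges such as decentralized identity and economic models.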

Abstract: The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users' behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field's trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and MCP are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the 'locus of intelligence': from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent's core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.

[95] Automated Thematic Analyses Using LLMs: Xylazine Wound Management Social Media Chatter Use Case

JaMor Hairston,Ritvik Ranjan,Sahithi Lakamana,Anthony Spadaro,Selen Bozkurt,Jeanmarie Perrone,Abeed Sarker

Main category: cs.AI

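TL;DR: The paper examines the feasibility of using large language models (LLMs) for inductive thematic analysis, taking social media discussion of xylazine as a use case, and shows their potential to replicate expert thematic analysis.

Details Motivation: Traditional thematic analysis requires extensive involvement from domain experts, is time-consuming, and is hard to scale. The study explores whether LLMs can automate the process to improve scalability and efficiency.

Contribution: Proposes an LLM-based approach to thematic analysis that uses zero-, single-, and few-shot prompting to classify social media data by theme automatically, and validates it against expert coding.

Method: 1. Uses two temporally non-overlapping Reddit datasets on xylazine (n=286 for model optimization and n=686 for validation); 2. models the task as a series of binary classifications over twelve expert-derived themes; 3. evaluates five LLMs under zero-, single-, and few-shot prompting strategies.

Result: On the validation set, GPT-4o with two-shot prompting performed best (accuracy: 90.9%; F1-score: 0.71), and for high-prevalence themes the model-derived distributions closely mirrored the expert classifications.

Insight: The results suggest that few-shot LLM-based approaches can automate thematic analysis, particularly for high-prevalence themes, offering a scalable supplement for qualitative research.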

Abstract: Background: Large language models (LLMs) face challenges in inductive thematic analysis, a task requiring deep interpretive and domain-specific expertise. We evaluated the feasibility of using LLMs to replicate expert-driven thematic analysis of social media data. Methods: Using two temporally non-intersecting Reddit datasets on xylazine (n=286 and n=686, for model optimization and validation, respectively) with twelve expert-derived themes, we evaluated five LLMs against expert coding. We modeled the task as a series of binary classifications, rather than a single, multi-label classification, employing zero-, single-, and few-shot prompting strategies and measuring performance via accuracy, precision, recall, and F1-score. Results: On the validation set, GPT-4o with two-shot prompting performed best (accuracy: 90.9%; F1-score: 0.71). For high-prevalence themes, model-derived thematic distributions closely mirrored expert classifications (e.g., xylazine use: 13.6% vs. 17.8%; MOUD use: 16.5% vs. 17.8%). Conclusions: Our findings suggest that few-shot LLM-based approaches can automate thematic analyses, offering a scalable supplement for qualitative research. Keywords: thematic analysis, large language models, natural language processing, qualitative analysis, social media, prompt engineering, public health
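
The Methods above describe treating the task as a series of binary classifications, one per theme, rather than a single multi-label classification. The sketch below shows one way such an evaluation could be organized once each post has a set of expert themes and a set of model-assigned themes: it scores each theme as its own yes/no problem and compares theme prevalence between the two. The function name, the theme identifiers, and the toy posts are hypothetical, and the LLM prompting step itself is not shown.

```python
def per_theme_metrics(gold, pred, themes):
    """Score each theme as an independent binary classification.

    gold, pred: lists of sets, one set of theme names per post.
    Returns accuracy, F1, and expert vs. model prevalence per theme.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    results = {}
    for theme in themes:
        tp = fp = fn = tn = 0
        for g, p in zip(gold, pred):
            in_gold, in_pred = theme in g, theme in p
            if in_gold and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_gold:
                fn += 1
            else:
                tn += 1
        denom = 2 * tp + fp + fn
        results[theme] = {
            "accuracy": (tp + tn) / n,
            "f1": 2 * tp / denom if denom else 0.0,
            "gold_prevalence": (tp + fn) / n,
            "pred_prevalence": (tp + fp) / n,
        }
    return results


if __name__ == "__main__":
    # Hypothetical expert codes and model outputs for four posts.
    gold = [{"xylazine_use"}, {"moud_use"}, set(), {"xylazine_use", "moud_use"}]
    pred = [{"xylazine_use"}, set(), set(), {"xylazine_use"}]
    for theme, scores in per_theme_metrics(gold, pred, ["xylazine_use", "moud_use"]).items():
        print(theme, scores)
```

The toy output illustrates why accuracy and F1 can diverge as they do in the reported results: when a theme is rare, a model that seldom assigns it can still be accurate on most posts while its F1 for that theme stays low.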