cs.CL

[1] Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model

Cristian García-Romero,Miquel Esplà-Gomis,Felipe Sánchez-Martínez

Main category: cs.CL

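TL;DR: Proposes a new method that uses the internal representations of a multilingual machine translation model to distinguish human-translated from machine-translated sentences, improving detection accuracy.

Motivation: Machine translation systems depend on large parallel corpora, but these corpora may contain machine-translated content, which can degrade translation quality. Non-human translations therefore need to be filtered out effectively.

Contribution: Uses the internal representations of a surrogate multilingual MT model for translation detection, markedly improving detection accuracy for non-English language pairs.

Method: Analyzes the internal representations of a multilingual MT model to build a classifier that separates human from machine translations.

Result: On non-English language pairs, the method improves accuracy by at least 5 percentage points over existing techniques.

Insight: The internal representations of MT models carry information that helps distinguish human from machine translation and can be exploited for translation detection.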

Abstract: Modern machine translation (MT) systems depend on large parallel corpora, often collected from the Internet. However, recent evidence indicates that (i) a substantial portion of these texts are machine-generated translations, and (ii) an overreliance on such synthetic content in training data can significantly degrade translation quality. As a result, filtering out non-human translations is becoming an essential pre-processing step in building high-quality MT systems. In this work, we propose a novel approach that directly exploits the internal representations of a surrogate multilingual MT model to distinguish between human and machine-translated sentences. Experimental results show that our method outperforms current state-of-the-art techniques, particularly for non-English language pairs, achieving gains of at least 5 percentage points of accuracy.

[2] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo,Hyungjoo Chae,Minseok Kang,Hyeonjong Ju,Soohyun Oh,Jinyoung Yeo

Main category: cs.CL

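TL;DR: LEGO-Eval is a framework for evaluating the quality of 3D scene synthesis. It uses tool augmentation for fine-grained alignment evaluation and exposes the limitations of current generation methods.

Motivation: Existing LLM-based 3D scene generation methods rely on coarse-grained instructions, so generated scenes diverge from real-world environments, which harms the training of embodied agents.

Contribution: 1. Proposes the LEGO-Eval evaluation framework, which uses diverse tools to explicitly ground scene components; 2. Releases the LEGO-Bench benchmark of fine-grained instructions; 3. Shows experimentally that LEGO-Eval outperforms existing evaluation methods.

Method: LEGO-Eval uses tool augmentation to explicitly ground scene components and, together with the fine-grained instructions in LEGO-Bench, evaluates how accurately generated scenes follow them.

Result: LEGO-Eval's F1 score for assessing scene-instruction alignment is 0.41 higher than that of VLM-as-a-judge. On LEGO-Bench, existing generation methods reach a full-alignment success rate of at most 10%.

Insight: Fine-grained instructions and explicit grounding tools are key to evaluating and improving 3D scene generation.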

Abstract: Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

[3] Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT

Hee-Jin Lee,Zhen Guo,Luchao Jin,Morteza Moazami Goudarzi

Main category: cs.CL

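TL;DR: Proposes an Analyze-Revise-Finetune (ARF) pipeline that, through targeted error correction, lets smaller open-source language models outperform larger proprietary models such as GPT-3.5 on customer service summarization.

Motivation: Existing knowledge distillation methods usually rely on a large teacher model to generate training data but struggle to correct the errors in that data. ARF aims to address this, improving small-model performance while keeping cost and data privacy in check.

Contribution: 1. Proposes the ARF pipeline, which combines error analysis with targeted revision to produce high-quality training data. 2. On customer service summarization, an 8B-parameter Llama model outperforms GPT-3.5. 3. Provides an efficient, privacy-friendly knowledge distillation framework.

Method: 1. Analyze and categorize common errors in summaries generated by the teacher model (GPT-3.5). 2. Use a compact editor model (Llama 3.1 70B) to make targeted revisions and produce high-quality data. 3. Fine-tune the student model (Llama 3.1 8B) on the refined data.

Result: The student model outperforms GPT-3.5 on the summarization task while improving cost efficiency and data privacy.

Insight: Targeted correction of the teacher model's errors lets a small model outperform a larger one, offering a practical path for deploying open-source models.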

Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.

[4] Data-Efficient Adaptation and a Novel Evaluation Method for Aspect-based Sentiment Analysis

Yan Cathy Hua,Paul Denny,Jörg Wicker,Katerina Taškova

Main category: cs.CL

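TL;DR: Proposes a new evaluation method, FTS-OBP, which addresses the over-penalization of boundary variations in traditional ABSA evaluation, and studies data-efficient adaptation of small generative language models (SLMs), showing substantial gains in low-resource settings.

Motivation: ABSA research is concentrated in commercial domains, while high-demand, low-resource domains such as education and healthcare lack research and resources. In addition, traditional exact-match evaluation is too strict about boundary variations, which makes it hard to assess generative models accurately.

Contribution: 1) Proposes the FTS-OBP evaluation method, which accommodates boundary variations and provides fine-grained diagnostics; 2) Presents the first study of small generative language models (SLMs) on ABSA and proposes a data-efficient fine-tuning strategy; 3) Releases the first public ABSA resources for education reviews.

Method: 1) FTS-OBP combines flexible text similarity matching with optimal bipartite pairing; 2) SLM performance is improved with data-free methods (in-context learning and weight merging) and a multitask fine-tuning strategy; 3) Strong performance is reached on a single GPU with little data (200-1,000 examples).

Result: SLMs with 1.5-3.8B parameters surpass proprietary large models and approach benchmark results; FTS-OBP correlates strongly with traditional metrics while allowing more flexible evaluation.

Insight: With data-efficient adaptation, small generative models can perform well in low-resource settings; a more flexible evaluation method better matches practical needs and supports ABSA beyond commercial domains.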

Abstract: Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining approach that identifies and classifies opinions associated with specific entities (aspects) or their categories within a sentence. Despite its rapid growth and broad potential, ABSA research and resources remain concentrated in commercial domains, leaving analytical needs unmet in high-demand yet low-resource areas such as education and healthcare. Domain adaptation challenges and most existing methods’ reliance on resource-intensive in-training knowledge injection further hinder progress in these areas. Moreover, traditional evaluation methods based on exact matches are overly rigid for ABSA tasks, penalising any boundary variations which may misrepresent the performance of generative models. This work addresses these gaps through three contributions: 1) We propose a novel evaluation method, Flexible Text Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates realistic extraction boundary variations while maintaining strong correlation with traditional metrics and offering fine-grained diagnostics. 2) We present the first ABSA study of small decoder-only generative language models (SLMs; <7B parameters), examining resource lower bounds via a case study in education review ABSA. We systematically explore data-free (in-context learning and weight merging) and data-light fine-tuning methods, and propose a multitask fine-tuning strategy that significantly enhances SLM performance, enabling 1.5-3.8 B models to surpass proprietary large models and approach benchmark results with only 200-1,000 examples on a single GPU. 3) We release the first public set of education review ABSA resources to support future research in low-resource domains.
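
The pairing idea behind FTS-OBP can be illustrated with a small sketch. This is not the authors' implementation: the similarity function (`difflib` ratio), the 0.5 threshold, the brute-force pairing, and every function name below are assumptions chosen only to show how similarity matching plus one-to-one pairing tolerates boundary variations that exact matching would penalize.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_pairing(predicted, gold):
    """Return the one-to-one pairing that maximizes total similarity.

    Brute force over permutations: fine for the handful of aspects in one
    sentence, but a Hungarian-algorithm solver would be needed at scale.
    """
    # Pad the shorter list with None so both sides have equal length.
    n = max(len(predicted), len(gold))
    preds = list(predicted) + [None] * (n - len(predicted))
    golds = list(gold) + [None] * (n - len(gold))

    def score(p, g):
        return 0.0 if p is None or g is None else similarity(p, g)

    best, best_total = [], -1.0
    for perm in permutations(range(n)):
        total = sum(score(preds[i], golds[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            best = [(preds[i], golds[j]) for i, j in enumerate(perm)]
    return best


def soft_f1(predicted, gold, threshold=0.5):
    """F1 where a pair counts as a match if its similarity clears the threshold."""
    pairs = best_pairing(predicted, gold)
    hits = sum(
        1 for p, g in pairs
        if p is not None and g is not None and similarity(p, g) >= threshold
    )
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With predictions `["the lecture slides", "tutors"]` and gold aspects `["tutors", "lecture slides"]`, exact matching credits only one aspect, while this sketch pairs both and scores them as matches.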

[5] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang,Chenghao Yang,Zhoufutu Wen,Sihang Yuan,Qiuyue Wang,Chaoyi Huang,Guosheng Zhu,He Wang,Huawenyu Lu,Jianing Wen,Jianpeng Jiao,Lishu Luo,Longxiang Liu,Sijin Wu,Xiaolei Zhu,Xuanliang Zhang,Ge Zhang,Yi Lin,Guang Shi,Chaoyou Fu,Wenhao Huang

Main category: cs.CL

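TL;DR: Introduces MME-CC, a multimodal benchmark for systematically evaluating the vision-centric cognitive capacity of multimodal large language models (MLLMs), and uses experiments to expose the limitations and common error patterns of current models.

Motivation: Existing multimodal benchmarks either overemphasize textual reasoning or fail to capture vision-centric cognitive behaviors systematically, leaving the cognitive capacity of MLLMs insufficiently assessed.

Contribution: Proposes the MME-CC benchmark, which organizes 11 representative reasoning tasks into three categories (spatial, geometric, and knowledge-based reasoning) and provides fine-grained analysis of MLLMs' cognitive capacity.

Method: Runs experiments on 16 representative MLLMs using MME-CC and analyzes their performance on each cognitive task along with common error patterns.

Result: Closed-source models (e.g., Gemini-2.5-Pro) lead overall, but spatial and geometric reasoning remain broadly weak (≤30%). Common error patterns include orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions.

Insight: Chain-of-Thought typically follows a three-stage process (extract → reason → verify) but relies heavily on visual extraction. The authors call for treating the cognitive capacity of MLLMs as central to both evaluation and model design.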

Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs’ cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

[6] Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment

Srishti Yadav,Jasmina Gajcin,Erik Miehling,Elizabeth Daly

Main category: cs.CL

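TL;DR: Presents an LLM-based framework for stakeholder-grounded risk assessment. Using the Risk Atlas Nexus and the GloVE explanation method, it generates interpretable policies that reveal where stakeholders agree or disagree about risks, and demonstrates the approach on three real-world use cases.

Motivation: Understanding how different stakeholders perceive risk is essential for deploying AI systems responsibly.

Contribution: 1. Proposes a stakeholder-grounded risk assessment framework; 2. Introduces a method for generating interpretable policies; 3. Presents an interactive visualization of conflict reasoning.

Method: Uses LLMs as judges, together with the Risk Atlas Nexus and the GloVE method, to generate interpretable policies and analyze differences in risk perception.

Result: Stakeholder perspectives significantly influence risk perception and conflict patterns.

Insight: Stakeholder-aware explanations are key to making LLM-based evaluations transparent and aligned with human-centered AI governance goals.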

Abstract: Understanding how different stakeholders perceive risks in AI systems is essential for their responsible deployment. This paper presents a framework for stakeholder-grounded risk assessment that uses LLMs as judges to predict and explain risks. Using the Risk Atlas Nexus and the GloVE explanation method, our framework generates stakeholder-specific, interpretable policies that show how different stakeholders agree or disagree about the same risks. We demonstrate our method on three real-world AI use cases in the medical AI, autonomous vehicle, and fraud detection domains. We further propose an interactive visualization that reveals how and why conflicts emerge across stakeholder perspectives, enhancing transparency in conflict reasoning. Our results show that stakeholder perspectives significantly influence risk perception and conflict patterns. Our work emphasizes the importance of stakeholder-aware explanations for making LLM-based evaluations more transparent, interpretable, and aligned with human-centered AI governance goals.

[7] Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Kevin Wang,Subre Abdoul Moktar,Jia Li,Kangshuo Li,Feng Chen

Main category: cs.CL

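TL;DR: An empirical study of aleatoric and epistemic uncertainty in LLMs that compares 12 uncertainty estimation methods on ID and OOD QA tasks. Information-based methods perform best in ID settings, while density-based methods and the P(True) metric are more effective in OOD settings.

Motivation: Ensuring that LLM outputs are trustworthy is a key concern, and uncertainty estimation (UE) plays an important role. The paper evaluates how well different UE methods measure aleatoric and epistemic uncertainty in LLMs.

Contribution: Compares 12 UE methods on ID and OOD datasets, identifies where information-based, density-based, and P(True) methods each perform best, and confirms the robustness of semantic consistency methods.

Method: Applies 12 UE methods and 4 generation quality metrics (including LLMScore) to assess the uncertainty of LLM-generated answers in ID and OOD QA tasks.

Result: Information-based methods excel on ID data; density-based methods and P(True) are more effective in OOD settings; semantic consistency methods perform reliably across datasets and metrics.

Insight: Different UE methods suit different settings: information-based methods fit ID data, while density-based methods and P(True) better capture epistemic uncertainty in OOD data. Semantic consistency methods are a general-purpose but not always optimal choice.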

Abstract: Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics including LLMScore from LLM criticizers to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model’s understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model’s epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics. These methods generally perform well but may not be optimal for every situation.
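
The abstract describes information-based methods as ones that "leverage token and sequence probabilities". A minimal sketch of three such scores follows. The function names and the choice of scores are assumptions for illustration, and the paper's twelve methods are not reproduced here.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a generated sequence (higher = less certain)."""
    return -sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """Length-normalized uncertainty: exp of the mean token NLL."""
    return math.exp(sequence_nll(token_probs) / len(token_probs))


def mean_token_entropy(distributions):
    """Average Shannon entropy (in nats) of the per-step token distributions."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in distributions) / len(distributions)
```

The first two scores need only the probability of each emitted token, while the entropy score needs the full next-token distribution at every step, which is why these measures are cheap to compute from a single generation.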

[8] BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Shahriyar Zaman Ridoy,Azmine Toushik Wasi,Koushik Ahamed Tonmoy

Main category: cs.CL

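TL;DR: Introduces BengaliMoralBench, the first large-scale moral reasoning benchmark for the Bengali language and culture, addressing a gap in aligning multilingual LLMs with local ethical norms.

Motivation: As multilingual LLMs spread across South Asia, their moral alignment with Bengali language and culture remains underexplored. Existing ethics benchmarks are mostly English-centric and overlook local cultural context.

Contribution: Develops the first moral reasoning benchmark for Bengali language and culture. It covers five moral domains, annotates scenarios by native-speaker consensus, and evaluates them through three ethical lenses.

Method: Runs systematic zero-shot evaluation of several multilingual LLMs (e.g., Llama and Gemma) with a unified prompting protocol and standard metrics, exposing weaknesses in cultural grounding, commonsense reasoning, and moral fairness.

Result: Model accuracy ranges from 50% to 91%, showing wide performance variation along with weak cultural grounding and moral fairness problems.

Insight: The benchmark provides a tool for ethical alignment in multilingual, low-resource settings and supports the development of more culturally aligned AI systems.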

Abstract: As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.

[9] Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks

Jindong Hong,Tianjie Chen,Lingjie Luo,Chuanyang Zheng,Ting Xu,Haibao Yu,Jianing Qiu,Qianzhong Chen,Suning Huang,Yan Xu,Yong Gui,Yijun He,Jiankai Sun

Main category: cs.CL

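TL;DR: Evaluates the “thinking mode” of multimodal large language models (MLLMs) on clinical tasks and finds only marginal gains over the standard “non-thinking mode”, with performance on complex medical tasks remaining suboptimal.

Motivation: With the rapid development of MLLMs that support a “thinking mode”, the study examines how this mode actually affects model performance and reliability in clinical tasks.

Contribution: Systematically evaluates two leading MLLMs (Seed1.5-VL and Gemini-2.5-Flash) on medical tasks and reveals the limitations of the thinking mode.

Method: Compares thinking and non-thinking modes on four visual medical tasks using the VQA-RAD and ROCOv2 datasets.

Result: The thinking mode yields only marginal improvement on most tasks, and performance on complex tasks such as open-ended VQA and medical image interpretation remains suboptimal.

Insight: The medical domain needs more domain-specific data and more advanced methods for integrating medical knowledge to make MLLMs clinically useful.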

Abstract: A recent advancement in Multimodal Large Language Models (MLLMs) research is the emergence of “reasoning MLLMs” that offer explicit control over their internal thinking processes (normally referred to as the “thinking mode”) alongside the standard “non-thinking mode”. This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. With the rapid transition to and adoption of these “dual-state” MLLMs, this work rigorously evaluated how the enhanced reasoning processes of these MLLMs impact model performance and reliability in clinical tasks. This paper evaluates the active “thinking mode” capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assessed their performance on four visual medical tasks using VQA-RAD and ROCOv2 datasets. Our findings reveal that the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode for the majority of the tasks. Their performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.

[10] EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation

Yunbo Long,Yuhan Liu,Alexandra Brintrup

Main category: cs.CL

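TL;DR: EQ-Negotiator equips small language models with dynamic emotional personas so that they perform well in edge-deployed credit negotiation, even outperforming larger models.

Motivation: Large language models (LLMs) perform well in automated negotiation, but their computational cost and data privacy requirements make them unsuitable for edge scenarios such as mobile devices. Small language models (SLMs) are more practical but struggle with emotionally charged, complex personas. EQ-Negotiator aims to close the performance gap between SLMs and LLMs.

Contribution: Proposes the EQ-Negotiator framework, which combines game theory with a Hidden Markov Model (HMM) to learn and track the debtor's emotional state online, giving SLMs strategic intelligence and improving negotiation efficiency and debt recovery. Shows that emotional intelligence, rather than model scale, is the key to successful negotiation.

Method: Uses an HMM-based emotional state tracker combined with game theory to adjust the negotiation strategy dynamically. The framework requires no pre-training and suits edge devices.

Result: Across diverse credit negotiation scenarios, including adversarial debtor strategies, a 7B-parameter SLM with EQ-Negotiator outperforms baseline LLMs more than ten times its size.

Insight: Emotional intelligence and dynamic persona modeling are key to successful negotiation. EQ-Negotiator shows that effective negotiation is possible under privacy constraints, advancing lightweight AI.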

Abstract: The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. The paper also establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.
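
To make the idea of tracking emotional states online with an HMM concrete, here is a minimal forward-filtering sketch. It is not the EQ-Negotiator implementation: the two states, the two observation labels, and all probabilities are hypothetical, and the sketch only tracks states with fixed parameters, whereas the abstract says the framework also learns them online.

```python
STATES = ("calm", "angry")

# Hypothetical parameters, chosen only for illustration.
TRANSITION = {
    "calm": {"calm": 0.8, "angry": 0.2},
    "angry": {"calm": 0.3, "angry": 0.7},
}
EMISSION = {
    "calm": {"polite": 0.9, "hostile": 0.1},
    "angry": {"polite": 0.2, "hostile": 0.8},
}


def update_belief(belief, observation):
    """One forward-filtering step: predict with TRANSITION, correct with EMISSION."""
    predicted = {
        s: sum(belief[prev] * TRANSITION[prev][s] for prev in STATES)
        for s in STATES
    }
    unnormalized = {s: predicted[s] * EMISSION[s][observation] for s in STATES}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def track(observations, prior=None):
    """Return the belief over emotional states after each observed utterance."""
    belief = dict(prior or {s: 1.0 / len(STATES) for s in STATES})
    history = []
    for obs in observations:
        belief = update_belief(belief, obs)
        history.append(belief)
    return history
```

Calling `track(["polite", "hostile", "hostile"])` starts from a uniform prior, favors `calm` after the first utterance, and shifts toward `angry` as hostile utterances accumulate. A negotiation policy could condition its next move on that belief.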

[11] LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning

Shenghao Li

Main category: cs.CL

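TL;DR: Proposes LFC-DA, a symbolic-logic-based data augmentation method that uses propositional expressions and a rule library to generate diverse logical questions, significantly improving the logical reasoning of pretrained models.

Motivation: For complex logical tasks, existing data augmentation either relies heavily on human annotation or on direct generation with large models, which yields examples lacking logical diversity and interpretability.

Contribution: Proposes LFC-DA, a symbolic-logic-controlled data augmentation pipeline that uses propositional expressions and systematic search to generate logically rigorous and diverse natural-language questions.

Method: Maps logical text to propositional expressions, compiles a rule library, discovers valid formulas through bounded state-space search, and verbalizes them back into natural-language questions.

Result: Experiments on ReClor and LogiQA show that LFC-DA significantly improves the logical reasoning accuracy of pretrained models.

Insight: Generating diverse logical questions under symbolic-logic control effectively strengthens a model's logical reasoning.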

Abstract: For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
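
One way to picture a bounded state-space search over propositional formulas is the sketch below. It is an assumed reading of the pipeline, not the LFC-DA code: the tuple encoding of formulas, the four rewrite rules, the root-only rule application, and all function names are illustrative choices. The mapping from text to formulas and the verbalization step are out of scope here.

```python
from itertools import product


def var(name):
    return ("var", name)


def rewrite(f):
    """Yield formulas obtained by applying one rule at the root of f."""
    op = f[0]
    if op == "imp":
        a, b = f[1], f[2]
        yield ("imp", ("not", b), ("not", a))   # contraposition
        yield ("or", ("not", a), b)             # implication elimination
    if op == "not" and f[1][0] == "not":
        yield f[1][1]                           # double-negation elimination
    if op == "not" and f[1][0] == "and":
        yield ("or", ("not", f[1][1]), ("not", f[1][2]))   # De Morgan
    if op == "not" and f[1][0] == "or":
        yield ("and", ("not", f[1][1]), ("not", f[1][2]))  # De Morgan


def bounded_search(start, max_depth):
    """Breadth-first search over rewrites, at most max_depth rule applications."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for f in frontier:
            for g in rewrite(f):
                if g not in seen:
                    seen.add(g)
                    next_frontier.append(g)
        frontier = next_frontier
    return seen


def evaluate(f, env):
    """Evaluate formula f under a truth assignment env (name -> bool)."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")


def variables(f):
    """Collect the variable names occurring in f."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))


def equivalent(f, g):
    """Truth-table check that f and g agree under every assignment."""
    names = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(names, values))) == evaluate(g, dict(zip(names, values)))
        for values in product([False, True], repeat=len(names))
    )
```

Starting from `("imp", var("p"), var("q"))` with `max_depth=2`, the search reaches the contrapositive and the disjunctive form among others, and `equivalent` confirms each discovered formula against the original. The depth bound keeps the search finite even though rules such as contraposition can be applied indefinitely.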

[12] Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Saumitra Yadav,Manish Shrivastava

Main category: cs.CL

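TL;DR: Studies asymmetric Byte Pair Encoding (BPE) for machine translation and finds that, compared with symmetric BPE, asymmetric BPE (different numbers of merge operations for the source and target languages) significantly improves performance, especially in low-resource settings.

Motivation: Existing work usually defaults to symmetric BPE (the same number of merge operations for source and target), but the authors show that this uniform approach does not guarantee optimal performance across language pairs and data sizes, especially in low-resource settings.

Contribution: Proposes asymmetric BPE, demonstrates its advantage in low-resource MT, and validates statistically significant improvements across multiple language pairs.

Method: Varies the number of merge operations (NMO) for the source and target languages to design asymmetric BPE configurations, and runs MT experiments across data sizes and language pairs.

Result: Asymmetric BPE significantly outperforms symmetric BPE in low-resource settings (50K, 100K, and 500K sentence pairs), for example with average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi.

Insight: A high NMO for the source (4K to 32K) combined with a low NMO for the target (0.5K to 2K) works best in low-resource MT, challenging the default assumption of symmetric BPE.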

Abstract: Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups. We validated this trend across six additional language pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
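
The difference between symmetric and asymmetric BPE comes down to training two tokenizers with different merge budgets. The toy learner below sketches that setup. The corpora are invented strings, the merge counts (8 and 2) are far smaller than the 4K to 32K and 0.5K to 2K ranges in the abstract, and the function names are assumptions. A real experiment would use an established BPE toolkit.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge operations from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Asymmetric setup: a larger NMO on the source side than on the target side.
# Both corpora are toy data invented for this sketch.
source_corpus = ["lower", "lowest", "newer", "newest", "low", "new"] * 5
target_corpus = ["nimna", "nimnatam", "naya", "navintam"] * 5
source_merges = learn_bpe(source_corpus, num_merges=8)
target_merges = learn_bpe(target_corpus, num_merges=2)
```

With more merges, source words are split into fewer, longer subwords, while the small target budget keeps target segmentation close to the character level. The symmetric baseline would pass the same `num_merges` to both calls.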

[13] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

Célian Ringwald,Fabien Gandon,Catherine Faron,Franck Michel,Hanna Abi Akl

Main category: cs.CL

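TL;DR: Studies how small language models (SLMs) handle datatype and object properties in shape-based RDF graph extraction, finds that the long-tail distribution of rare properties is the main bottleneck, and evaluates several strategies to address it.

Motivation: SLMs perform well on relation extraction (RE), but only for common datatype properties. The authors explore SLMs for complete RDF graph extraction (both datatype and object properties), in particular how to handle the long tail of rare properties.

Contribution: Identifies the bottleneck SLMs face when extracting rare properties and, by comparing several strategies (stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation), arrives at a way to build the training set (ensuring each property occurs more often than a threshold) that improves model performance.

Method: Evaluates four strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. The best strategy builds a training set in which the number of occurrences of each property exceeds a given threshold.

Result: The best strategy performs equally well on rare and common properties. The authors publicly released their datasets, code, and experimental results to support reproducibility.

Insight: SLM performance on semantic relation extraction depends on how balanced the training data is; future work could explore more efficient data augmentation and model optimization strategies.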

Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
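
The threshold rule from the abstract (every property should occur at least a given number of times in the training set) can be sketched as follows. This is only an illustration: it tops up rare properties by resampling existing examples, which stands in for the template-based synthesis and dataset scaling evaluated in the paper, and the example schema (`text`, `properties`) and function names are assumptions.

```python
import random
from collections import Counter


def property_counts(examples):
    """Count how many examples mention each property."""
    return Counter(p for ex in examples for p in set(ex["properties"]))


def fill_to_threshold(examples, threshold, seed=0):
    """Resample examples until every property occurs at least threshold times."""
    rng = random.Random(seed)
    result = list(examples)
    counts = property_counts(result)
    for prop in sorted(counts):
        # Candidate examples that can raise the count of this property.
        pool = [ex for ex in examples if prop in ex["properties"]]
        while counts[prop] < threshold:
            extra = rng.choice(pool)
            result.append(extra)
            counts.update(set(extra["properties"]))
    return result
```

Because an added example may carry several properties, topping up one rare property can also raise the counts of others, so the final set is usually smaller than a naive per-property estimate would suggest.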

[14] Efficient Reasoning via Thought-Training and Thought-Free Inference

Canhui Wu,Qiong Cao,Chao Xue,Wei Xi,Xiaodong He

Main category: cs.CL

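TL;DR: Proposes 3TF (Thought-Training and Thought-Free inference), a framework that takes a Short-to-Long perspective on efficient reasoning in large language models: it internalizes structured reasoning during training and uses a mode without explicit reasoning at inference time, yielding large gains in reasoning performance.

Motivation: Reasoning with explicit Chain-of-Thought (CoT) prompting is effective but depends on verbose explicit reasoning output, which is inefficient. 3TF trains the model to internalize reasoning so that less explicit output is needed at inference time.

Contribution: Proposes the 3TF framework, which separates training from inference to improve both reasoning efficiency and quality. 3TF lets the model skip explicit intermediate steps at inference time while retaining high-quality reasoning.

Method: 3TF first trains a hybrid model that supports both reasoning and non-reasoning modes, then further trains it on CoT-annotated data to internalize structured reasoning. At inference time the non-reasoning mode produces concise, thought-free outputs.

Result: 3TF-trained models show large improvements on reasoning benchmarks without explicit reasoning, indicating that high-quality reasoning can be learned implicitly.

Insight: Internalizing reasoning reduces the need for explicit output at inference time while improving efficiency and quality; the Short-to-Long training perspective offers a new direction for designing efficient reasoning models.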

Abstract: Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high quality reasoning can be learned and executed implicitly without explicit step-by-step generation.

[15] Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG

Longpeng Qiu,Ting Li,Shuai Mao,Nan Yang,Xiaohui Yan

Main category: cs.CL

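TL;DR: Proposes QuestionRAG, a framework that uses knowledge augmentation and reinforcement learning to handle input errors in question-answering systems, improving the understanding and correction of faulty questions.

Motivation: Input errors in QA systems often lead to wrong answers, and existing large language models struggle both to interpret user intent and to avoid over-correction.

Contribution: Proposes the QuestionRAG framework, which combines external knowledge augmentation with reinforcement learning to significantly improve question correction and generalization.

Method: Enriches the input with external knowledge (e.g., search results) and uses reinforcement learning (RL) to align the model's objective with precise correction, avoiding over-correction.

Result: Experiments show that knowledge augmentation is critical for understanding faulty questions and that RL is more effective than traditional supervised fine-tuning.

Insight: Combining knowledge augmentation with reinforcement learning unlocks the full potential of LLMs for question correction.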

Abstract: Input errors in question-answering (QA) systems often lead to incorrect responses. Large language models (LLMs) struggle with this task, frequently failing to interpret user intent (misinterpretation) or unnecessarily altering the original question’s structure (over-correction). We propose QuestionRAG, a framework that tackles these problems. To address misinterpretation, it enriches the input with external knowledge (e.g., search results, related entities). To prevent over-correction, it uses reinforcement learning (RL) to align the model’s objective with precise correction, not just paraphrasing. Our results demonstrate that knowledge augmentation is critical for understanding faulty questions. Furthermore, RL-based alignment proves significantly more effective than traditional supervised fine-tuning (SFT), boosting the model’s ability to follow instructions and generalize. By integrating these two strategies, QuestionRAG unlocks the full potential of LLMs for the question correction task.

[16] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi,Alexandre Guiggi,Frédéric Béchet,Carlos Ramisch,Benoit Favre

Main category: cs.CL

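TL;DR: CareMedEval is a new dataset for evaluating the critical appraisal and reasoning abilities of large language models (LLMs) in the biomedical field. It contains 534 questions based on 37 scientific articles. Results show that current LLMs struggle with the task, especially on questions about study limitations and statistical analysis.

Motivation: Critical appraisal is an essential skill in the biomedical field, but current LLMs have limited reliability in specialized domains. CareMedEval fills this gap with a dedicated dataset for evaluating and improving LLM performance.

Contribution: Introduces CareMedEval, a dataset built from authentic medical exam questions that focuses on biomedical critical appraisal and reasoning grounded in scientific papers.

Method: The dataset is derived from authentic exams taken by French medical students and contains 534 questions covering 37 scientific articles. Experiments evaluate generalist and biomedical-specialized LLMs and analyze the effect of context conditions and intermediate reasoning.

Result: Current LLMs do not exceed an Exact Match Rate of 0.5. Generating intermediate reasoning improves results, but models still struggle with questions about study limitations and statistical analysis.

Insight: CareMedEval exposes the limits of current LLMs in specialized critical reasoning and provides a benchmark for future automated support tools.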

Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
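
The abstract reports an Exact Match Rate without defining it. A common reading for multiple-answer exam questions, assumed here, is that a question counts only if the predicted set of options equals the gold set exactly. The function below sketches that reading and may differ from the paper's scoring.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted answer set equals the gold set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this definition a prediction of `["B"]` for a gold answer of `["B", "E"]` earns no credit, which helps explain why the metric is strict on questions with several correct options.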

[17] MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Sofie Helene Bruun,Dan Saattrup Smart

Main category: cs.CL

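TL;DR: Introduces MultiZebraLogic, a multilingual logical reasoning benchmark for evaluating large language models across languages and difficulty levels. The authors generate zebra puzzles with varying themes, sizes, and red herring types, report model performance, and release the datasets and generation code.

Motivation: Existing LLM benchmarks do not fully assess logical reasoning, particularly across languages and difficulty levels. The authors therefore aim to build large, high-quality multilingual datasets to fill this gap.

Contribution: 1) Proposes the MultiZebraLogic benchmark, covering nine Germanic languages; 2) Designs several ways of increasing difficulty (e.g., puzzle size and red herrings); 3) Shows how different models differ on logical reasoning tasks; 4) Releases the datasets and adaptable generation code.

Method: The authors build the datasets by generating zebra puzzles that vary in language, theme, size (e.g., 2x3 and 4x5), clue types (14), and red herring types (8), and evaluate model reasoning by adjusting these factors.

Result: 1) Puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively; 2) Adding 5 red herrings lowers o3-mini's puzzle-level accuracy on 4x5 puzzles by 15 ± 7%; 3) Language (English vs. Danish) and theme have no significant effect on o3-mini's scores on 4x5 puzzles; 4) Difficulty does not correlate with the selected clue types.

Insight: 1) Red herrings significantly affect model performance, suggesting that models handle irrelevant information poorly; 2) The generator's adaptability to more languages and themes makes the benchmark broadly applicable; 3) Puzzle size is an effective way to control difficulty.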

Abstract: Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, sizes and including 14 different clue types and 8 red herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15 ± 7%. Scores of o3-mini on 4x5 puzzles are not significantly affected by use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability into more languages and themes.
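
A zebra puzzle and a red herring can be illustrated with a brute-force solver for a tiny instance. Everything below is invented for illustration: the puzzle has two houses and three attribute categories (one possible reading of a small puzzle size, not confirmed by the abstract), and the clues, attribute values, and function names are not taken from the released generator.

```python
from itertools import permutations, product

# Two houses (indices 0 and 1) and three attribute categories, all hypothetical.
ATTRIBUTES = {
    "nationality": ("Dane", "Swede"),
    "pet": ("cat", "dog"),
    "drink": ("tea", "coffee"),
}


def all_assignments():
    """Yield every way to place each attribute value in a distinct house."""
    names = list(ATTRIBUTES)
    for combo in product(*(permutations(ATTRIBUTES[n]) for n in names)):
        # Map each value to the index of the house that holds it.
        yield {value: i for values in combo for i, value in enumerate(values)}


def solve(clues):
    """Return all assignments consistent with every clue."""
    return [a for a in all_assignments() if all(clue(a) for clue in clues)]


clues = [
    lambda a: a["Dane"] == a["cat"],   # the Dane owns the cat
    lambda a: a["tea"] == 0,           # tea is drunk in the first house
    lambda a: a["dog"] == a["tea"],    # the dog owner drinks tea
]


def red_herring(a):
    """An uninformative clue: the cat and the dog live in different houses."""
    return a["cat"] != a["dog"]
```

Here `solve(clues)` returns exactly one assignment, and adding `red_herring` leaves the solution set unchanged. That is what makes it uninformative: it lengthens the puzzle text without constraining the answer.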

[18] AILA–First Experiments with Localist Language Models

Joachim Diederich

Main category: cs.CL

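TL;DR: The paper gives the first empirical demonstration of controllable locality in transformer language models: a tunable locality parameter provides continuous control between fully localist and fully distributed representations.

Details Motivation: Traditional language models rely exclusively on distributed representations and lack transparency and interpretability. The paper introduces a locality parameter to explore how interpretability can be improved while preserving performance.

Contribution: A novel framework that controls the form of representation dynamically through a tunable locality parameter, offering a practical option for domains that need both transparency and capability.

Method: Using a two-layer transformer on the WikiText corpus, the locality parameter $\lambda$ is varied systematically from fully localist ($\lambda$ = 1.0) to fully distributed ($\lambda$ = 0.0), and its effect on attention entropy and pointer fidelity is measured.

Result: The fully localist configuration has markedly lower attention entropy (5.36 bits vs. 7.18 bits) while maintaining higher pointer fidelity. Intermediate locality values (such as $\lambda$ = 0.6) give the best tradeoff between interpretability and performance, with test perplexity of 4.65 and accuracy of 84.7%.

Insight: With an explicit control parameter, localist language models give applications that must balance transparency and performance a flexible tool, particularly in regulated domains.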

Abstract: This paper presents the first empirical demonstration of controllable locality in transformer language models, a novel architectural framework that enables continuous control over the degree of representation localization through a tunable locality dial parameter. Unlike traditional language models that rely exclusively on distributed representations, our approach allows dynamic interpolation between highly interpretable localist encodings and efficient distributed representations without requiring model retraining. We conducted experiments on the WikiText corpus using a two-layer transformer architecture, systematically varying the locality parameter $\lambda$ across the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our results demonstrate that localist configurations achieve dramatically lower attention entropy, with $\lambda$ = 1.0 yielding 5.36 bits compared to 7.18 bits at $\lambda$ = 0.0, while maintaining substantially higher pointer fidelity scores reflecting stronger alignment with rule-specified targets. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance, with $\lambda$ = 0.6 achieving test perplexity of 4.65 and accuracy of 84.7%. These findings establish that localist language models provide a practical framework for applications in regulated domains requiring both transparency and capability, offering precise mathematical control over the interpretability-performance spectrum through explicit penalty thresholds and information-theoretic design principles.
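
For readers unfamiliar with the metric, attention entropy in bits can be sketched as the Shannon entropy of each attention row, averaged over queries. This is the standard definition and an assumption here; the abstract does not say how the paper aggregates over heads or layers, and the function name is hypothetical.

```python
import numpy as np

def attention_entropy_bits(attn, eps=1e-12):
    """Mean Shannon entropy, in bits, of attention distributions.

    attn: array of shape (..., n_keys) whose last axis sums to 1.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # eps guards against log2(0) for keys that receive no attention.
    row_entropy = -np.sum(attn * np.log2(attn + eps), axis=-1)
    return float(row_entropy.mean())

# Toy attention maps over 8 keys (illustrative data only).
uniform = np.full((4, 8), 1.0 / 8)  # attention spread evenly
peaked = np.eye(8)[:4]              # each query attends to exactly one key

print(attention_entropy_bits(uniform))  # close to log2(8) = 3 bits
print(attention_entropy_bits(peaked))   # close to 0 bits
```

Under this definition, an entropy of H bits corresponds to attention spread evenly over about 2^H positions, so 5.36 bits is roughly 41 positions and 7.18 bits roughly 145.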

One Octadion,Bondan Sapta Prakoso,Nanang Yudi Setiawan,Novanto Yudistira

Main category: cs.CL

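TL;DR: This study combines fine-tuned large language models (LLMs) with retrieval-augmented generation (RAG) to build a tool that supports understanding legal texts and drafting regulations.

Details Motivation: Legal information is voluminous and constantly changing, and conventional methods struggle to meet policymakers' needs in understanding and drafting regulations. Combining fine-tuning with RAG can support their work better.

Contribution: A method combining fine-tuned LLMs with RAG that strengthens the model in the legal domain and lets it retrieve and use external legal knowledge dynamically.

Method: 1. Curate a supervised dataset tailored to the legal domain for fine-tuning; 2. Integrate RAG so the model can access up-to-date external legal knowledge.

Result: The results indicate that the approach can significantly enhance the effectiveness of legal research and regulation development, giving policymakers a useful resource.

Insight: Combining RAG with LLMs is promising in the legal domain: it can adapt to changes in the law and improve the relevance and accuracy of generated text.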

Abstract: In this study, we explore the fine-tuning of Large Language Models (LLMs) to better support policymakers in their crucial work of understanding, analyzing, and crafting legal regulations. To equip the model with a deep understanding of legal texts, we curated a supervised dataset tailored to the specific needs of the legal domain. Additionally, we integrated the Retrieval-Augmented Generation (RAG) method, enabling the LLM to access and incorporate up-to-date legal knowledge from external sources. This combination of fine-tuning and RAG-based augmentation results in a tool that not only processes legal information but actively assists policymakers in interpreting regulations and drafting new ones that align with current needs. The results demonstrate that this approach can significantly enhance the effectiveness of legal research and regulation development, offering a valuable resource in the ever-evolving field of law.

[20] Step-Audio-EditX Technical Report

Chao Yan,Boyong Wu,Peng Yang,Pengfei Tan,Guoqiang Hu,Yuxin Zhang,Xiangyu,Zhang,Fei Tian,Xuerui Yang,Xiangyu Zhang,Daxin Jiang,Gang Yu

Main category: cs.CL

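TL;DR: Step-Audio-EditX is an open-source LLM-based audio model, the first to support iterative audio editing of emotion, speaking style, and paralinguistics, alongside zero-shot text-to-speech. It relies on large-margin synthetic data instead of embedding-based priors or auxiliary modules.

Details Motivation: Conventional audio editing methods depend on representation-level disentanglement and auxiliary modules, which limits expressivity and iterative control. Step-Audio-EditX aims to overcome these limits through large-margin learning.

Contribution: 1. The first open-source LLM-based audio editing model; 2. Use of large-margin synthetic data to achieve high expressivity and iterative control; 3. Better results than existing models on emotion editing and related tasks.

Method: The model is trained on large-margin synthetic data only, avoiding embedding-based priors and auxiliary modules, so that expressivity and control are learned directly from data.

Result: On emotion editing and other fine-grained control tasks, the model surpasses MiniMax-2.6-hd and Doubao-Seed-TTS-2.0.

Insight: Large-margin learning offers a new paradigm for audio editing that avoids complex representation disentanglement while achieving higher expressivity and flexibility.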

Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

[21] Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability

Apoorva Upadhyaya,Wolfgang Nejdl,Marco Fisichella

Main category: cs.CL

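TL;DR: The paper proposes IRIS, a zero-shot stance detection framework that combines implicit and explicit interpretability to address the generalizability and interpretability shortcomings of existing methods.

Details Motivation: Existing zero-shot stance detection methods over-rely on explicit reasoning, give coarse explanations, and do not explicitly model the reasoning process. IRIS aims to make stance detection more transparent by combining implicit and explicit rationales.

Contribution: The IRIS framework, which provides transparent stance detection through implicit rationales (sequences within the text) and explicit rationales (linguistic measures), without requiring ground-truth rationale annotations.

Method: IRIS treats stance detection as an information retrieval ranking task, uses implicit rationales to guide predictions, and uses explicit linguistic features to decode the emotional and cognitive dimensions of stance.

Result: On the VAST, EZ-STANCE, P-Stance, and RFD datasets, IRIS generalizes well even when trained on only 10% of the training data.

Insight: Combining implicit and explicit rationales can improve both the transparency and the generalizability of zero-shot stance detection.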

Abstract: Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model’s predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author’s attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.

cs.CV [Back]

[22] Generative Hints

Andy Dimnaku,Abdullah Yusuf Kavranoğlu,Yaser Abu-Mostafa

Main category: cs.CV

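TL;DR: The paper proposes generative hints, a training method that uses a generative model to produce unlabeled virtual examples and learns known invariances ("hints") in a semi-supervised manner, addressing the failure of standard data augmentation to capture invariances across the whole input space.

Details Motivation: Standard data augmentation learns invariances only from transformations of a limited training set and so does not capture them across the entire input space. The authors propose generating virtual examples with a generative model and using them to enforce the invariance directly.

Contribution: The generative hints method, which combines virtual examples from a generative model with semi-supervised learning to enforce known invariances directly. Experiments show it outperforms standard data augmentation across datasets, architectures, and loss functions.

Method: Two parts: 1) train a generative model to approximate the input distribution and generate unlabeled virtual examples; 2) combine the virtual examples with the labeled data and optimize the classification and hint objectives jointly in a semi-supervised manner.

Result: Generative hints outperform standard data augmentation across tasks, with an average top-1 accuracy gain of 0.63% (up to 1.78%) on fine-grained visual classification benchmarks and an average gain of 1.286% on the CheXpert X-ray dataset.

Insight: Generative models can supply rich global information for learning invariances, compensating for the limits of data augmentation; semi-supervised learning makes effective use of the virtual examples.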

Abstract: Data augmentation is widely used in vision to introduce variation and mitigate overfitting, through enabling models to learn invariant properties, such as spatial invariance. However, these properties are not fully captured by data augmentation alone, since it attempts to learn the property on transformations of the training data only. We propose generative hints, a training methodology that directly enforces known invariances in the entire input space. Our approach leverages a generative model trained on the training set to approximate the input distribution and generate unlabeled images, which we refer to as virtual examples. These virtual examples are used to enforce functional properties known as hints. In generative hints, although the training dataset is fully labeled, the model is trained in a semi-supervised manner on both the classification and hint objectives, using the unlabeled virtual examples to guide the model in learning the desired hint. Across datasets, architectures, and loss functions, generative hints consistently outperform standard data augmentation when learning the same property. On popular fine-grained visual classification benchmarks, we achieved up to 1.78% top-1 accuracy improvement (0.63% on average) over fine-tuned models with data augmentation and an average performance boost of 1.286% on the CheXpert X-ray dataset.
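
The two-objective training described above can be sketched as a classification loss on labeled data plus a hint penalty on unlabeled virtual examples. This is a minimal sketch under stated assumptions: the hint is horizontal-flip invariance, the penalty is a mean squared difference between predictions, the models are toy linear classifiers, and random arrays stand in for samples from a generative model. None of these choices come from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def hint_penalty(model, virtual, transform):
    """Mean squared change in predictions when virtual examples are transformed."""
    return float(np.mean((model(virtual) - model(transform(virtual))) ** 2))

def total_loss(model, x, y, virtual, transform, hint_weight=1.0):
    """Supervised loss on labeled data plus the hint objective on virtual examples."""
    return cross_entropy(model(x), y) + hint_weight * hint_penalty(model, virtual, transform)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 3))  # toy weights: 4x4 images, 3 classes

def flip(imgs):
    return imgs[:, :, ::-1]  # horizontal flip

def linear_model(imgs):
    return softmax(imgs.reshape(len(imgs), -1) @ W)

def symmetric_model(imgs):
    # Flip-invariant by construction: it sees the image plus its mirror.
    return softmax((imgs + flip(imgs)).reshape(len(imgs), -1) @ W)

x = rng.normal(size=(5, 4, 4))        # labeled images (toy)
y = np.array([0, 1, 2, 0, 1])
virtual = rng.normal(size=(8, 4, 4))  # stand-in for generated, unlabeled images

print(hint_penalty(linear_model, virtual, flip))     # positive: the hint is violated
print(hint_penalty(symmetric_model, virtual, flip))  # zero: the hint already holds
print(total_loss(linear_model, x, y, virtual, flip))
```

The point of the sketch is that the hint term needs no labels, so it can be evaluated on any number of generated inputs rather than only on transformed training images.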

[23] ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

Srikumar Sastry,Subash Khanal,Aayush Dhakal,Jiayu Lin,Dan Cher,Phoenix Jarosz,Nathan Jacobs

Main category: cs.CV

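TL;DR: ProM3E is a probabilistic masked multimodal embedding model for ecology that generates and reconstructs multimodal representations and supports modality inversion and modality-fusion analysis in the embedding space.

Details Motivation: Ecological research involves multimodal data (such as imagery and sound), and existing methods are limited in multimodal representation learning and in inferring missing modalities. ProM3E addresses this with probabilistic masked learning.

Contribution: 1. The probabilistic masked multimodal embedding model ProM3E; 2. Support for modality inversion and fusion analysis; 3. A cross-modal retrieval approach that mixes similarities; 4. Evidence of strong representation learning capability.

Method: Masked modality reconstruction in the embedding space, with a probabilistic model that infers missing modalities from a few context modalities, plus a cross-modal retrieval approach that combines inter-modal and intra-modal similarities.

Result: The model performs strongly on multimodal retrieval and linear probing tasks; the code, datasets, and model will be released.

Insight: Probabilistic masked learning makes multimodal fusion more flexible and offers a way to analyze which modalities are worth fusing for a given task.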

Abstract: We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.

[24] EvtSlowTV – A Large and Diverse Dataset for Event-Based Depth Estimation

Sadiq Layi Macaulay,Nimet Kaygusuz,Simon Hadfield

Main category: cs.CV

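TL;DR: The paper introduces EvtSlowTV, a large-scale event camera dataset that addresses the small size and limited generalizability of existing event datasets. Curated from publicly available YouTube footage, it contains more than 13B events across varied environmental conditions, supports a self-supervised learning framework, and improves generalization to complex scenes.

Details Motivation: Existing event camera datasets are small and constrained, which limits how well event-based depth estimation methods generalize to real-world scenarios.

Contribution: EvtSlowTV, an event camera dataset an order of magnitude larger than existing ones, with more than 13B events covering diverse environments and motions and supporting self-supervised learning.

Method: Event data is derived from publicly available YouTube footage to build a large-scale dataset, and a depth estimation model is trained on it with a self-supervised framework, removing the need for frame-based annotations.

Result: Experiments show that models trained with EvtSlowTV generalize better to complex scenes and motions.

Insight: Large-scale naturalistic datasets matter for the performance and generalization of event camera models, and self-supervised learning can preserve and exploit the asynchronous nature of event data.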

Abstract: Event cameras, with their high dynamic range (HDR) and low latency, offer a promising alternative for robust depth estimation in challenging environments. However, many event-based depth estimation approaches are constrained by small-scale annotated datasets, limiting their generalizability to real-world scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event camera dataset curated from publicly available YouTube footage, which contains more than 13B events across various environmental conditions and motions, including seasonal hiking, flying, scenic driving, and underwater exploration. EvtSlowTV is an order of magnitude larger than existing event datasets, providing an unconstrained, naturalistic setting for event-based depth learning. This work shows the suitability of EvtSlowTV for a self-supervised learning framework to capitalise on the HDR potential of raw event streams. We further demonstrate that training with EvtSlowTV enhances the model’s ability to generalise to complex scenes and motions. Our approach removes the need for frame-based annotations and preserves the asynchronous nature of event data.

[25] Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification

Mikhael Djajapermana,Moritz Reiber,Daniel Mueller-Gritschneder,Ulf Schlichtmann

Main category: cs.CV

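TL;DR: The paper proposes a new hybrid CNN-ViT search space for TinyML image classification and uses NAS to find efficient hybrid architectures that balance computational cost and model performance.

Details Motivation: Hybrid CNN and ViT architectures perform well on image classification, but their parameter counts and computational cost make them hard to deploy on TinyML devices. The paper aims to find efficient hybrid architectures suited to TinyML through a NAS search space.

Contribution: 1. A new hybrid CNN-ViT search space with blocks for learning local and global information; 2. Searchable pooling layers for efficient feature map reduction; 3. Validation on CIFAR10 showing superior performance under tight model size constraints.

Method: 1. Design a search space containing hybrid CNN and ViT blocks; 2. Introduce a Pooling block of searchable pooling layers; 3. Use NAS to find efficient hybrid architectures.

Result: Experiments show that the resulting architectures achieve better accuracy and inference speed on CIFAR10 than ResNet-based TinyML models.

Insight: Combining CNN and ViT blocks flexibly and optimizing the pooling design makes high-performing image classification feasible on resource-constrained devices.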

Abstract: Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT) have outperformed pure CNN or ViT architecture. However, since these architectures require large parameters and incur large computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as the novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.

[26] SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics

Ailar Mahdizadeh,Puria Azadi Moghadam,Xiangteng He,Shahriar Mirabbasi,Panos Nasiopoulos,Leonid Sigal

Main category: cs.CV

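TL;DR: SCALE-VLP is a soft-weighted contrastive vision-language pre-training framework for volumetric data (such as CT) that integrates spatial semantics and domain knowledge, improving cross-task and cross-domain generalization.

Details Motivation: Existing vision-language models mainly target 2D data and overlook the continuous, structured dependencies in volumetric data. They also tend to treat volumes as independent 2D slices, which breaks spatial coherence and underuses rich clinical semantics.

Contribution: The SCALE-VLP framework, which integrates volumetric spatial semantics and domain-knowledge semantics to produce structurally consistent, semantically grounded representations. It performs strongly on CT-report retrieval, abnormality classification, and report generation, and generalizes in zero-shot cross-domain evaluation.

Method: Soft-weighted contrastive learning that combines volumetric spatial semantics (such as anatomical structure) with domain knowledge (such as radiological ontologies) to guide alignment under limited supervision.

Result: Up to 4.3x higher top-1 CT-report retrieval, a 10-point gain in abnormality classification, and ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Zero-shot evaluation on an out-of-domain dataset also shows consistent gains.

Insight: Integrating spatial and domain-knowledge semantics improves performance on volumetric tasks and shows potential for cross-task and cross-domain generalization, suggesting a useful direction for medical image analysis.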

Abstract: Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured dependencies present in volumetric data such as CT. Existing approaches often treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing rich clinical semantics. We propose SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment. This yields structurally consistent and semantically grounded representations under limited supervision, demonstrating strong cross-task transferability (retrieval, report generation, and classification), and cross-domain generalizability with consistent gains without further fine-tuning. In particular, compared to the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an out-of-domain external dataset, we observe consistent gains, indicating the cross-task and cross-domain generalization ability of SCALE-VLP.
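
One way to picture soft-weighted contrastive learning is to replace the one-hot targets of a standard InfoNCE loss with a row-normalized matrix of pair weights. The sketch below shows only that general idea; how SCALE-VLP actually derives its weights from spatial and knowledge semantics is not stated in the abstract, so the `soft_targets` matrix, the temperature value, and the function names are assumptions.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_contrastive_loss(img_emb, txt_emb, soft_targets, temperature=0.07):
    """Image-to-text contrastive loss with soft (non-binary) targets.

    soft_targets[i, j] is a non-negative weight saying how well text j matches image i.
    With soft_targets equal to the identity this reduces to standard InfoNCE.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    targets = soft_targets / soft_targets.sum(axis=1, keepdims=True)
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))  # toy volume embeddings
txt_emb = rng.normal(size=(4, 8))  # toy report embeddings
hard_targets = np.eye(4)           # binary supervision: only matched pairs count
soft_targets = np.eye(4) + 0.2     # hypothetical weights: other pairs are partly similar

print(soft_contrastive_loss(img_emb, txt_emb, hard_targets))
print(soft_contrastive_loss(img_emb, txt_emb, soft_targets))
```

A symmetric version would add the same loss computed in the text-to-image direction and average the two.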

[27] Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

Dakota Hester,Vitor S. Martins,Lucas B. Ferreira,Thainara M. A. Lima

Main category: cs.CV

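TL;DR: The paper proposes a label-efficient approach based on self-supervised deep learning for 1-m resolution land cover classification, achieving accurate statewide classification with only 1,000 annotated patches.

Details Motivation: High-resolution land cover classification needs large amounts of annotated data, and the cost of annotation limits wider adoption. The paper aims to reduce dependence on annotated data through self-supervised learning.

Contribution: 1. Pre-training of a ResNet-101 encoder on a large set of unlabeled images with the BYOL self-supervised strategy; 2. Transfer of the pre-trained weights into several semantic segmentation models (such as U-Net and DeepLabV3+) and fine-tuning on small samples; 3. 8-class land cover classification for the state of Mississippi at 87.14% overall accuracy.

Method: 1. Pre-train ResNet-101 on 377,921 unlabeled color-infrared image patches; 2. Transfer the pre-trained encoder into multiple segmentation models; 3. Fine-tune the models on small training sets (250, 500, or 750 annotated patches).

Result: The approach reaches 87.14% overall accuracy and a 75.58% macro F1 score for 8-class land cover classification over Mississippi, showing that self-supervised learning is effective with few labels.

Insight: Self-supervised learning substantially reduces the annotated data needed for high-resolution land cover classification, but separating cropland, herbaceous, and barren classes remains difficult.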

Abstract: Deep learning semantic segmentation methods have shown promising performance for very high 1-m resolution land cover classification, but the challenge of collecting large volumes of representative training data creates a significant barrier to widespread adoption of such models for meter-scale land cover mapping over large areas. In this study, we present a novel label-efficient approach for statewide 1-m land cover classification using only 1,000 annotated reference image patches with self-supervised deep learning. We use the “Bootstrap Your Own Latent” pre-training strategy with a large amount of unlabeled color-infrared aerial images (377,921 256x256 1-m pixel patches) to pre-train a ResNet-101 convolutional encoder. The learned encoder weights were subsequently transferred into multiple deep semantic segmentation architectures (FCN, U-Net, Attention U-Net, DeepLabV3+, UPerNet, PAN), which were then fine-tuned using very small training dataset sizes with cross-validation (250, 500, 750 patches). Among the fine-tuned models, we obtained the 87.14% overall accuracy and 75.58% macro F1 score using an ensemble of the best performing U-Net models for comprehensive 1-m, 8-class land cover mapping, covering more than 123 billion pixels over the state of Mississippi, USA. Detailed qualitative and quantitative analysis revealed accurate mapping of open water and forested areas, while highlighting challenges in accurate delineation between cropland, herbaceous, and barren land cover types. These results show that self-supervised learning is an effective strategy for reducing the need for large volumes of manually annotated data, directly addressing a major limitation to high spatial resolution land cover mapping at scale.
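
The gap between the reported overall accuracy and macro F1 is typical when some classes are rare or hard to separate. A short sketch of both metrics computed from a confusion matrix shows why; the 3-class matrix is made-up illustrative data, not the paper's results.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of correctly classified samples (rows = true class, columns = predicted)."""
    cm = np.asarray(cm, dtype=np.float64)
    return float(np.trace(cm) / cm.sum())

def macro_f1(cm):
    """Unweighted mean of per-class F1 scores."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # TP + FP per class
    actual = cm.sum(axis=1)     # TP + FN per class
    denom = predicted + actual
    # F1 = 2TP / (2TP + FP + FN); classes with no samples and no predictions score 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())

# Toy confusion matrix: two common classes and one rare, poorly separated class.
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [2, 3, 5],
]

print(overall_accuracy(cm))
print(macro_f1(cm))
```

Overall accuracy is dominated by the common classes, while macro F1 weights every class equally, so a single weak class pulls it down noticeably.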

[28] A Foundation Model for Brain MRI with Dynamic Modality Integration

Minh Sao Khue Luu,Bair N. Tuchinov

Main category: cs.CV

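TL;DR: The paper proposes a foundation model for brain MRI that handles different combinations of imaging sequences. It uses learnable modality embeddings and conditional layer normalization with a masked autoencoding objective that accounts for missing modalities, plus a variance-covariance regularizer to stabilize feature learning and improve representation diversity.

Details Motivation: Conventional approaches train a separate model for each MRI modality, which is costly and cannot adapt when sequences are missing or unseen.

Contribution: 1. A single unified encoder for multimodal MRI; 2. Learnable modality embeddings and conditional layer normalization that adapt to different inputs; 3. Self-supervised training with masked autoencoding and modality imputation.

Method: 1. Learnable modality embeddings and conditional layer normalization; 2. A masked autoencoding objective combined with a variance-covariance regularizer; 3. Self-supervised training (reconstruction and modality imputation) on about 60,000 multi-center MRIs.

Result: Preliminary results indicate that the approach is feasible; evaluation on brain tumor and multiple sclerosis segmentation and on lesion classification is planned.

Insight: A single unified encoder removes the need for per-modality models and is designed to stay usable when modalities are missing.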

Abstract: We present a foundation model for brain MRI that can work with different combinations of imaging sequences. The model uses one encoder with learnable modality embeddings, conditional layer normalization, and a masked autoencoding objective that accounts for missing modalities. A variance-covariance regularizer is applied to stabilize feature learning and improve representation diversity. This design removes the need for separate models for each modality and allows the network to adapt when some sequences are missing or unseen. It is trained on about 60,000 multi-center MRIs using self-supervised reconstruction and modality imputation to learn flexible representations. A learnable modality embedding guides feature extraction so the encoder can adjust to different inputs. We describe our planned evaluation on brain tumor and multiple sclerosis segmentation, as well as lesion classification, under various modality settings. Preliminary results indicate that the method is feasible, and further experiments are planned to study its performance in more detail. All code and pretrained models are available at https://github.com/BrainFM/brainfm
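
The abstract does not give the form of the variance-covariance regularizer. A common formulation, used here purely as an assumption, follows the VICReg-style recipe: a hinge that keeps each feature's standard deviation above a target, and a penalty on off-diagonal covariance that discourages redundant features.

```python
import numpy as np

def variance_covariance_penalty(z, target_std=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of features z of shape (n, d).

    VICReg-style sketch; not necessarily the regularizer used in the paper.
    """
    z = np.asarray(z, dtype=np.float64)
    n, d = z.shape
    z = z - z.mean(axis=0)
    # Variance term: penalize features whose std falls below target_std (guards against collapse).
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    variance_term = np.mean(np.maximum(0.0, target_std - std))
    # Covariance term: penalize correlation between different feature dimensions.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance_term = np.sum(off_diag ** 2) / d
    return float(variance_term), float(covariance_term)

rng = np.random.default_rng(0)
collapsed = np.ones((16, 4))                        # every sample maps to the same vector
redundant = np.tile(rng.normal(size=(256, 1)), 2)   # two identical feature columns

print(variance_covariance_penalty(collapsed))  # large variance term, zero covariance term
print(variance_covariance_penalty(redundant))  # positive covariance term
```

Both terms would be added, with weights, to the reconstruction loss during self-supervised training.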

[29] SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Wenbo Lu

Main category: cs.CV

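TL;DR: SLIP is a structure-aware language-image pretraining method that adds a structural contrastive loss and models relationships between entities, improving cross-modal alignment.

Details Motivation: Existing vision-language pretraining methods treat image-text pairs as isolated training examples and ignore the rich relational structure present in many domains (such as e-commerce product co-purchase graphs).

Contribution: 1. The SLIP framework, which integrates a structural contrastive loss; 2. A large-scale Amazon product co-purchase multimodal graph dataset; 3. Experiments showing the value of relational supervision for cross-modal alignment.

Method: SLIP aligns modalities with a structural contrastive loss while modeling relationships between neighboring entities in a structured graph.

Result: SLIP consistently outperforms CLIP on cross-modal retrieval and classification in both zero-shot and few-shot settings.

Insight: Relational supervision improves cross-modal alignment; structural information is a valuable supervision signal.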

Abstract: Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, showing the value of relational supervision for cross-modal alignment.

[30] ISC-Perception: A Hybrid Computer Vision Dataset for Object Detection in Novel Steel Assembly

Miftahur Rahman,Samuel Adebayo,Dorian A. Acevedo-Mejia,David Hester,Daniel McPolin,Karen Rafferty,Debra F. Laefer

Main category: cs.CV

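TL;DR: The paper introduces ISC-Perception, a hybrid dataset for object detection in novel steel assembly that fills a data gap in construction-robotics perception. By combining CAD-rendered images, photorealistic game-engine scenes, and a small set of real photographs, it allows automatic labelling of the synthetic portion and sharply reduces human labelling time.

Details Motivation: Construction robots (such as those working with the ISC system) need dependable perception, but no dedicated dataset exists, and collecting images on active construction sites raises safety and privacy concerns.

Contribution: 1. The first hybrid dataset designed for ISC component detection; 2. A blend of synthetic and real data that enables automatic labelling; 3. A large reduction in human time (81.7% less than manual labelling).

Method: 1. Generate procedurally rendered CAD images; 2. Generate photorealistic scenes with a game engine; 3. Add a small, curated set of real photographs; 4. Label the synthetic portion automatically.

Result: Detectors trained on ISC-Perception reach a mAP at IoU 0.50 of 0.756, substantially surpassing models trained on synthetic-only or photorealistic-only data. On a 1,200-frame bench test, mAP@0.50/mAP@[0.50:0.95] is 0.943/0.823.

Insight: Hybrid (synthetic plus real) datasets perform well for object detection while greatly reducing labelling cost, which matters for rapid development in construction robotics and industrial applications.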

Abstract: The Intermeshed Steel Connection (ISC) system, when paired with robotic manipulators, can accelerate steel-frame assembly and improve worker safety by eliminating manual assembly. Dependable perception is one of the initial stages for ISC-aware robots. However, this is hampered by the absence of a dedicated image corpus, as collecting photographs on active construction sites is logistically difficult and raises safety and privacy concerns. In response, we introduce ISC-Perception, the first hybrid dataset expressly designed for ISC component detection. It blends procedurally rendered CAD images, game-engine photorealistic scenes, and a limited, curated set of real photographs, enabling fully automatic labelling of the synthetic portion. We explicitly account for all human effort to produce the dataset, including simulation engine and scene setup, asset preparation, post-processing scripts and quality checks; our total human time to generate a 10,000-image dataset was 30.5 h versus 166.7 h for manual labelling at 60 s per image (-81.7%). A manual pilot on a representative image with five instances of ISC members took 60 s (maximum 80 s), anchoring the manual baseline. Detectors trained on ISC-Perception achieved a mean Average Precision at IoU 0.50 of 0.756, substantially surpassing models trained on synthetic-only or photorealistic-only data. On a 1,200-frame bench test, we report mAP@0.50/mAP@[0.50:0.95] of 0.943/0.823. By bridging the data gap for construction-robotics perception, ISC-Perception facilitates rapid development of custom object detectors and is freely available for research and industrial use upon request.
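
The labelling-time comparison in the abstract can be reproduced with a few lines of arithmetic. The figures are the ones quoted above; the function name is only for illustration.

```python
def labelling_time_comparison(images=10_000, seconds_per_image=60, hybrid_hours=30.5):
    """Return (manual_hours, relative_change) for manual labelling vs. the hybrid pipeline."""
    manual_hours = images * seconds_per_image / 3600
    relative_change = (hybrid_hours - manual_hours) / manual_hours
    return manual_hours, relative_change

manual_hours, relative_change = labelling_time_comparison()
print(f"manual: {manual_hours:.1f} h, change: {relative_change * 100:.1f}%")
# prints "manual: 166.7 h, change: -81.7%"
```

Note that the 60 s per image baseline comes from a single pilot image, so the saving scales directly with that estimate.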

[31] DentalSplat: Dental Occlusion Novel View Synthesis from Sparse Intra-Oral Photographs

Yiyi Miao,Taoyu Wu,Tong Chen,Sihao Li,Ji Jiang,Youpeng Yang,Angelos Stefanidis,Limin Yu,Jionglong Su

Main category: cs.CV

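TL;DR: The paper proposes DentalSplat, a method for 3D reconstruction and novel view synthesis from sparse intra-oral photographs, aimed at telemedicine settings in orthodontic treatment. Prior-guided stereo reconstruction and a scale-adaptive pruning strategy improve reconstruction quality under sparse input.

Details Motivation: Orthodontic treatment and telemedicine call for reconstructing dental occlusion in 3D from sparse intra-oral photographs (the anterior view and bilateral buccal views), but conventional 3DGS pipelines depend on dense inputs and precise camera poses and cannot be applied directly.

Contribution: The DentalSplat framework, which initializes the point cloud with prior-guided stereo reconstruction and adds scale-adaptive pruning and an optical-flow geometric constraint to improve reconstruction and novel view synthesis under sparse input.

Method: 1. Initialize the point cloud with a prior-guided dense stereo reconstruction model; 2. Apply scale-adaptive pruning to improve 3DGS training efficiency and reconstruction quality; 3. For extremely sparse viewpoints, add optical flow as a geometric constraint together with gradient regularization.

Result: Validated on 950 clinical cases and a video-based test set of 195 cases, DentalSplat outperforms state-of-the-art techniques under sparse input.

Insight: Priors and geometric constraints can markedly improve 3D reconstruction from sparse inputs, providing a practical tool for remote orthodontic care.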

Abstract: In orthodontic treatment, particularly within telemedicine contexts, observing patients’ dental occlusion from multiple viewpoints facilitates timely clinical decision-making. Recent advances in 3D Gaussian Splatting (3DGS) have shown strong potential in 3D reconstruction and novel view synthesis. However, conventional 3DGS pipelines typically rely on densely captured multi-view inputs and precisely initialized camera poses, limiting their practicality. Orthodontic cases, in contrast, often comprise only three sparse images, specifically, the anterior view and bilateral buccal views, rendering the reconstruction task especially challenging. The extreme sparsity of input views severely degrades reconstruction quality, while the absence of camera pose information further complicates the process. To overcome these limitations, we propose DentalSplat, an effective framework for 3D reconstruction from sparse orthodontic imagery. Our method leverages a prior-guided dense stereo reconstruction model to initialize the point cloud, followed by a scale-adaptive pruning strategy to improve the training efficiency and reconstruction quality of 3DGS. In scenarios with extremely sparse viewpoints, we further incorporate optical flow as a geometric constraint, coupled with gradient regularization, to enhance rendering fidelity. We validate our approach on a large-scale dataset comprising 950 clinical cases and an additional video-based test set of 195 cases designed to simulate real-world remote orthodontic imaging conditions. Experimental results demonstrate that our method effectively handles sparse input scenarios and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art techniques.

[32] Image-Intrinsic Priors for Integrated Circuit Defect Detection and Novel Class Discovery via Self-Supervised Learning

Botong Zhao,Xubin Wang,Shujing Lyu,Yue Lu

Main category: cs.CV

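TL;DR: The paper proposes IC DefectNCD, a framework that uses image-intrinsic priors with self-supervised learning for integrated circuit defect detection and novel class discovery, avoiding reliance on a support set and on extensive annotation.

Details Motivation: Defects in integrated circuit manufacturing are complex and varied. Supervised methods need extensive annotation and struggle with new categories, while unsupervised methods are unstable. The paper aims to address these problems.

Contribution: The IC DefectNCD framework, which combines self-supervised learning with image-intrinsic priors for support-set-free defect detection and novel class discovery, along with an adaptive binarization strategy and a soft-mask-guided attention mechanism.

Method: 1. Self Normal Information Guided defect detection, which aggregates normal features through a learnable normal information extractor and localizes defects coarsely from reconstruction residuals; 2. An adaptive binarization strategy that focuses on core defective areas; 3. A soft-mask-guided attention mechanism that injects spatial defect priors and makes the teacher-student model more sensitive to defective regions.

Result: Validated on a real-world dataset spanning three key fabrication stages and 15 defect types, the approach shows robust performance on both defect detection and unseen defect classification.

Insight: Combining self-supervised learning with image-intrinsic priors improves defect detection and novel class recognition in complex manufacturing settings and suggests a useful direction for industrial defect analysis.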

Abstract: Integrated circuit manufacturing is highly complex, comprising hundreds of process steps. Defects can arise at any stage, causing yield loss and ultimately degrading product reliability. Supervised methods require extensive human annotation and struggle with emergent categories and rare, data-scarce defects. Clustering-based unsupervised methods often exhibit unstable performance due to missing priors. We propose IC DefectNCD, a support-set-free framework that leverages Image Intrinsic Priors in IC SEM images for defect detection and novel class discovery. We first develop Self Normal Information Guided IC Defect Detection, aggregating representative normal features via a learnable normal information extractor and using reconstruction residuals to coarsely localize defect regions. To handle saliency variations across defects, we introduce an adaptive binarization strategy that produces stable subimages focused on core defective areas. Finally, we design Self Defect Information Guided IC Defect Classification, which incorporates a soft-mask-guided attention mechanism to inject spatial defect priors into the teacher-student model. This enhances sensitivity to defective regions, suppresses background interference, and enables recognition and classification of unseen defects. We validate the approach on a real-world dataset spanning three key fabrication stages and covering 15 defect types. Experiments demonstrate robust performance on both defect detection and unseen defect classification.

[33] Accelerating Physical Property Reasoning for Augmented Visual Cognition

Hongbo Lan,Zhenlin An,Haoyu Li,Vaibhav Singh,Longfei Shangguan

Main category: cs.CV

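TL;DR: The paper presents a system that accelerates vision-guided physical property reasoning through algorithmic and system optimizations, cutting latency from 10–20 minutes to under 6 seconds while matching or improving accuracy.

Details Motivation: Current vision-guided physical property reasoning relies on long processing times, which limits its usefulness in real time.

Contribution: 1. Acceleration of the reasoning pipeline through algorithmic and system optimizations. 2. Integration of gaze tracking to localize the object of interest in cluttered environments. 3. A 62.9$\times$–287.2$\times$ speedup on the ABO dataset.

Method: 1. Rapid geometric 3D reconstruction. 2. Efficient semantic feature fusion. 3. Parallel view encoding.

Result: Latency drops from 10–20 minutes to under 6 seconds, with object-level physical property estimation on par with or slightly better than two SOTA baselines and better material segmentation and voxel-level inference.

Insight: Simple algorithmic and system-level optimizations, including parallel view encoding, can cut latency by orders of magnitude without sacrificing accuracy.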

Abstract: This paper introduces \sysname, a system that accelerates vision-guided physical property reasoning to enable augmented visual cognition. \sysname minimizes the run-time latency of this reasoning pipeline through a combination of both algorithmic and systematic optimizations, including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding. Through these simple yet effective optimizations, \sysname reduces the end-to-end latency of this reasoning pipeline from 10–20 minutes to less than 6 seconds. A head-to-head comparison on the ABO dataset shows that \sysname achieves this 62.9$\times$–287.2$\times$ speedup while not only reaching on-par (and sometimes slightly better) object-level physical property estimation accuracy (e.g., mass), but also demonstrating superior performance in material segmentation and voxel-level inference than two SOTA baselines. We further combine gaze-tracking with \sysname to localize the object of interest in cluttered, real-world environments, streamlining the physical property reasoning on smart glasses. The case study with Meta Aria Glasses conducted at an IKEA furniture store demonstrates that \sysname achieves consistently high performance compared to controlled captures, providing robust property estimations even with fewer views in real-world scenarios.

[34] Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response

Thomas Manzini,Priyankari Perali,Robin R. Murphy

Main category: cs.CV

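TL;DR: The paper presents the first AI/ML system for automated building damage assessment in small uncrewed aerial systems (sUAS) imagery to be deployed operationally, during federally declared hurricane disasters, speeding up disaster response.

Details Motivation: During disasters, sUAS teams collect large volumes of imagery each day (47GB to 369GB), far more than experts can process by hand, which delays the response. Automated computer vision and machine learning techniques are needed.

Contribution: The work establishes the first state of practice for building damage assessment from sUAS imagery, develops and deploys the models, and documents real-world use and lessons learned for the AI/ML research and user communities.

Method: Models are trained on the largest known dataset of post-disaster sUAS imagery (21,716 building damage labels), accompanied by operational training of 91 disaster practitioners. The best performing model was deployed during hurricane responses.

Result: During the responses to Hurricanes Debby and Helene, the model assessed a combined 415 buildings in approximately 18 minutes.

Insight: The work documents the practical value of AI/ML in disaster response and provides a basis for scalable automated damage assessment.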

Abstract: This paper presents the first AI/ML system for automating building damage assessment in small uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.

[35] Finetuning-Free Personalization of Text to Image Generation via Hypernetworks

Sagar Shrestha,Gopal Sharma,Luowei Zhou,Suren Kumar

Main category: cs.CV

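TL;DR: The paper proposes fine-tuning-free personalization for text-to-image generation: a hypernetwork predicts LoRA-adapted weights, avoiding the high compute cost of conventional methods, while an improved training objective and Hybrid-Model Classifier-Free Guidance (HM-CFG) improve generation quality and compositional generalization.

Details Motivation: Conventional personalization methods for text-to-image diffusion models, such as DreamBooth, fine-tune per subject, which is computationally expensive and slow at inference. Recent adapter- and encoder-based methods reduce this overhead but still need additional fine-tuning or large backbone models. The paper therefore explores a fine-tuning-free hypernetwork approach.

Contribution: 1. A hypernetwork-based, fine-tuning-free personalization method that predicts LoRA-adapted weights directly from subject images; 2. An end-to-end training objective, stabilized by output regularization; 3. Hybrid-Model Classifier-Free Guidance (HM-CFG), which improves compositional generalization at inference time.

Method: A hypernetwork predicts the LoRA weights, removing per-subject optimization; the training objective is stabilized with output regularization; at inference, HM-CFG combines the strengths of the base diffusion model and the personalized model.

Result: Experiments on CelebA-HQ, AFHQ-v2, and DreamBench show strong personalization performance while preserving subject fidelity and prompt alignment.

Insight: Hypernetworks are a scalable and effective direction for open-category personalization; combining the strengths of the base and personalized models yields high-quality generation without per-subject optimization at test time.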

Abstract: Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth, which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlights the promise of hypernetworks as a scalable and effective direction for open-category personalization.
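
The idea of a hypernetwork that emits LoRA factors can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not describe the hypernetwork architecture, so a single linear map stands in for it, and the names `hypernet_predict_lora`, `h_a`, `h_b`, and `apply_lora` are hypothetical. Only the final step, W + scale * (B A), is the standard LoRA update.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hypernet_predict_lora(embedding, h_a, h_b, d_out, d_in, rank):
    """Hypothetical linear hypernetwork: subject embedding -> LoRA factors A, B.

    h_a has shape (rank * d_in) x len(embedding) and h_b has shape
    (d_out * rank) x len(embedding); both stand in for trained weights.
    """
    col = [[e] for e in embedding]
    flat_a = [row[0] for row in matmul(h_a, col)]
    flat_b = [row[0] for row in matmul(h_b, col)]
    a = [flat_a[i * d_in:(i + 1) * d_in] for i in range(rank)]    # rank x d_in
    b = [flat_b[i * rank:(i + 1) * rank] for i in range(d_out)]   # d_out x rank
    return a, b


def apply_lora(w, a, b, scale=1.0):
    """Return W + scale * (B @ A), the LoRA-adapted weight matrix."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because the factors come from one forward pass of the hypernetwork, no per-subject gradient steps are needed at test time, which is the property the abstract emphasizes.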

[36] Subsampled Randomized Fourier GaLore for Adapting Foundation Models in Depth-Driven Liver Landmark Segmentation

Yun-Chen Lin,Jiayuan Huang,Hanyuan Zhang,Sergi Kavtaradze,Matthew J. Clarkson,Mobarak I. Hoque

Main category: cs.CV

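TL;DR: The paper proposes a liver landmark segmentation framework that combines RGB and depth information, introduces SRFT-GaLore to efficiently adapt large vision models to the surgical domain, and performs well in cross-dataset evaluation.

Details Motivation: In laparoscopic liver surgery, 2D video streams limit depth perception and make landmark localization difficult. Fusing RGB and depth features and efficiently adapting large models remain open challenges for real-time surgical settings.

Contribution: 1. A dual-encoder segmentation framework that combines RGB and depth information; 2. SRFT-GaLore, which replaces SVD to obtain an efficient low-rank gradient projection; 3. A new LLSD dataset for cross-dataset validation.

Method: The SAM2 and DA2 encoders extract RGB and depth features respectively, SRFT-GaLore efficiently fine-tunes the high-dimensional attention layers, and a cross-attention module fuses the features.

Result: On L3D, the Dice Similarity Coefficient improves by 4.85% and the Average Symmetric Surface Distance drops by 11.78 points relative to D2GPLand; on LLSD, the model outperforms SAM-based baselines.

Insight: SRFT-GaLore offers a new route to efficient adaptation of large models, and dual-encoder fusion of RGB and depth information markedly improves segmentation accuracy in surgical scenes.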

Abstract: Accurate detection and delineation of anatomical structures in medical imaging are critical for computer-assisted interventions, particularly in laparoscopic liver surgery where 2D video streams limit depth perception and complicate landmark localization. While recent works have leveraged monocular depth cues for enhanced landmark detection, challenges remain in fusing RGB and depth features and in efficiently adapting large-scale vision models to surgical domains. We propose a depth-guided liver landmark segmentation framework integrating semantic and geometric cues via vision foundation encoders. We employ the Segment Anything Model V2 (SAM2) encoder to extract RGB features and the Depth Anything V2 (DA2) encoder to extract depth-aware features. To efficiently adapt SAM2, we introduce SRFT-GaLore, a novel low-rank gradient projection method that replaces the computationally expensive SVD with a Subsampled Randomized Fourier Transform (SRFT). This enables efficient fine-tuning of high-dimensional attention layers without sacrificing representational power. A cross-attention fusion module further integrates RGB and depth cues. To assess cross-dataset generalization, we also construct a new Laparoscopic Liver Surgical Dataset (LLSD) as an external validation benchmark. On the public L3D dataset, our method achieves a 4.85% improvement in Dice Similarity Coefficient and an 11.78-point reduction in Average Symmetric Surface Distance compared to D2GPLand. To further assess generalization capability, we evaluate our model on the LLSD dataset. Our model maintains competitive performance and significantly outperforms SAM-based baselines, demonstrating strong cross-dataset robustness and adaptability to unseen surgical environments. These results demonstrate that our SRFT-GaLore-enhanced dual-encoder framework enables scalable and precise segmentation under real-time, depth-constrained surgical settings.
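
To make the SVD replacement concrete, here is a minimal sketch of a textbook subsampled randomized Fourier transform applied to a gradient matrix. It is an illustration under assumptions, not the authors' implementation: the abstract does not say which side of the gradient is projected or how complex values are handled, so this version projects the row dimension, keeps complex outputs, and uses a naive O(m^2 n) DFT where a real implementation would use an FFT. The names `srft_project` and `frob_norm` are hypothetical.

```python
import cmath
import math
import random


def srft_project(grad, rank, seed=0):
    """Project an m x n gradient to rank x n as sqrt(m / rank) * R F D grad.

    D flips the sign of each row at random, F is the unitary DFT taken over
    the row index, and R keeps `rank` rows chosen uniformly without replacement.
    """
    rng = random.Random(seed)
    m, n = len(grad), len(grad[0])
    signs = [rng.choice((-1.0, 1.0)) for _ in range(m)]     # diagonal of D
    kept = sorted(rng.sample(range(m), rank))               # rows selected by R
    scale = math.sqrt(m / rank) / math.sqrt(m)              # rescaling times unitary DFT factor
    projected = []
    for k in kept:
        row = []
        for j in range(n):
            acc = sum(signs[i] * grad[i][j] * cmath.exp(-2j * math.pi * k * i / m)
                      for i in range(m))
            row.append(scale * acc)
        projected.append(row)
    return projected


def frob_norm(mat):
    """Frobenius norm of a real or complex matrix."""
    return math.sqrt(sum(abs(x) ** 2 for row in mat for x in row))
```

Unlike an SVD-based projector, this transform needs no decomposition of the gradient: the projection is data-independent apart from the random signs and the row subset. When `rank` equals the number of rows the map is unitary, so the Frobenius norm is preserved exactly; with fewer rows it is preserved only in expectation.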

[37] SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake,Jiayuan Huang,Runlong He,Danyal Z. Khan,Evangelos B. Mazomenos,Sophia Bano,Hani J. Marcus,Danail Stoyanov,Matthew J. Clarkson,Mobarak I. Hoque

Main category: cs.CV

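TL;DR: The paper introduces PitVQA-Anticipation, the first visual question answering (VQA) dataset for forward-looking surgical reasoning, and proposes SurgAnt-ViVQA, a model that anticipates surgical events through a GRU-driven temporal cross-attention module.

Details Motivation: With limited visibility and a rapidly changing workflow, anticipating upcoming surgical events is vital for real-time assistance. Existing VQA systems reason on static frames and offer little support for forecasting future events.

Contribution: 1. PitVQA-Anticipation, the first dataset designed for forward-looking surgical reasoning; 2. SurgAnt-ViVQA, which combines GRU temporal encoding with gated cross-attention to improve anticipation of future surgical events.

Method: 1. A bidirectional GRU encodes frame-to-frame dynamics; 2. An adaptive gate injects visual context into the language model; 3. Parameter-efficient fine-tuning customizes the language model to the surgical domain.

Result: On the PitVQA-Anticipation and EndoVis datasets, SurgAnt-ViVQA surpasses image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains.

Insight: Temporal modeling and fine-grained gated cross-attention are key for future surgical VQA systems. The frame budget involves a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation.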

Abstract: Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA is tested on the PitVQA-Anticipation and EndoVis datasets, surpassing strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame-budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation. By pairing a temporally aware encoder with fine-grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future-aware surgical assistance.

[38] PETWB-REP: A Multi-Cancer Whole-Body FDG PET/CT and Radiology Report Dataset for Medical Imaging Research

Le Xue,Gang Feng,Wenbo Zhang,Yichi Zhang,Lanlan Li,Shuqi Wang,Liling Peng,Sisi Peng,Xin Gao

Main category: cs.CV

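TL;DR: PETWB-REP is a public multi-cancer dataset of whole-body FDG PET/CT scans and radiology reports from 490 patients, intended to support research in medical imaging, radiomics, and multi-modal learning.

Details Motivation: Few medical imaging datasets combine functional and anatomical imaging with detailed clinical reports, which limits medical imaging AI research.

Contribution: A dataset of multi-cancer whole-body PET/CT images with paired radiology reports, filling a data gap in this area.

Method: FDG PET/CT scans and the corresponding clinical reports were collected and curated for 490 patients with various cancers, including both structured metadata and unstructured report text.

Result: The resulting dataset covers several common cancers and serves as a resource for medical imaging and multi-modal learning research.

Insight: The dataset supports joint analysis of functional and anatomical imaging and can help advance medical imaging AI and multi-modal learning.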

Abstract: Publicly available, large-scale medical imaging datasets are crucial for developing and validating artificial intelligence models and conducting retrospective clinical research. However, datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types remain scarce. Here, we present PETWB-REP, a curated dataset comprising whole-body 18F-Fluorodeoxyglucose (FDG) Positron Emission Tomography/Computed Tomography (PET/CT) scans and corresponding radiology reports from 490 patients diagnosed with various malignancies. The dataset primarily includes common cancers such as lung cancer, liver cancer, breast cancer, prostate cancer, and ovarian cancer. This dataset includes paired PET and CT images, de-identified textual reports, and structured clinical metadata. It is designed to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning.

[39] QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models

Kuei-Chun Kao,Hsu Tzu-Yin,Yunqi Hong,Ruochen Wang,Cho-Jui Hsieh

Main category: cs.CV

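TL;DR: The paper proposes QG-CoC, a question-guided Chain-of-Captions prompting method that addresses fine-grained perception and information integration for multimodal large language models on multi-image tasks.

Details Motivation: Current multimodal large language models lack fine-grained perception and effective reasoning in multi-image settings, and existing prompting methods mostly target single-image scenarios, so they cannot handle complex multi-image tasks.

Contribution: QG-CoC, a zero-shot prompting method that handles an arbitrary number of images and performs well across multimodal models.

Method: QG-CoC generates a question-guided chain of captions, integrating perception and reasoning for multi-image tasks.

Result: Experiments show that QG-CoC is competitive on both single-image and multi-image benchmarks and is especially robust in challenging scenarios where existing prompting methods fail.

Insight: Question-guided chains of captions can compensate for the perception and reasoning shortfalls of multimodal models on multi-image tasks.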

Abstract: Multimodal Large Language Models (MLLMs) currently encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. While various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.

[40] MvBody: Multi-View-Based Hybrid Transformer Using Optical 3D Body Scan for Explainable Cesarean Section Prediction

Ruting Cheng,Boyuan Feng,Yijiang Zheng,Chuhui Qiu,Aizierjiang Aiersilan,Joaquin A. Calderon,Wentao Zhao,Qing Pan,James K. Hahn

Main category: cs.CV

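TL;DR: The paper proposes MvBody, a multi-view Transformer network that predicts cesarean section risk from 3D optical body scans combined with self-reported medical data, achieving high accuracy with explainable predictions.

Details Motivation: In settings with limited medical resources, cesarean section risk prediction is critical for early prenatal care decisions. Existing models rely on in-hospital parameters and are hard to apply in such settings, so the paper explores 3D body shape data as an alternative.

Contribution: 1. MvBody, which combines a multi-view Transformer with metric learning to improve generalization in data-scarce environments. 2. Non-invasive risk prediction from 3D optical scans and self-reported data. 3. Explanations of model decisions via Integrated Gradients.

Method: 1. A multi-view Transformer network processes 3D body shape data. 2. Self-reported medical data is incorporated. 3. A metric learning loss improves training efficiency and generalization. 4. Integrated Gradients explains the model's decisions.

Result: On the independent test set the model reaches 84.62% accuracy and 0.724 AUC-ROC, outperforming widely used machine learning models and recent 3D analysis methods.

Insight: Pre-pregnancy weight, maternal age, obstetric history, previous cesarean history, and body shape around the head and shoulders are key predictors of cesarean risk, supporting the model's clinical relevance.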

Abstract: Accurately assessing the risk of cesarean section (CS) delivery is critical, especially in settings with limited medical resources, where access to healthcare is often restricted. Early and reliable risk prediction allows better-informed prenatal care decisions and can improve maternal and neonatal outcomes. However, most existing predictive models are tailored for in-hospital use during labor and rely on parameters that are often unavailable in resource-limited or home-based settings. In this study, we conduct a pilot investigation to examine the feasibility of using 3D body shape for CS risk assessment for future applications with more affordable general devices. We propose a novel multi-view-based Transformer network, MvBody, which predicts CS risk using only self-reported medical data and 3D optical body scans obtained between the 31st and 38th weeks of gestation. To enhance training efficiency and model generalizability in data-scarce environments, we incorporate a metric learning loss into the network. Compared to widely used machine learning models and the latest advanced 3D analysis methods, our method demonstrates superior performance, achieving an accuracy of 84.62% and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.724 on the independent test set. To improve transparency and trust in the model’s predictions, we apply the Integrated Gradients algorithm to provide theoretically grounded explanations of the model’s decision-making process. Our results indicate that pre-pregnancy weight, maternal age, obstetric history, previous CS history, and body shape, particularly around the head and shoulders, are key contributors to CS risk prediction.

[41] Diffusion-Guided Mask-Consistent Paired Mixing for Endoscopic Image Segmentation

Pengyu Jie,Wanquan Liu,Rui He,Yihui Wen,Deyu Meng,Chenqiang Gao

Main category: cs.CV

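TL;DR: The paper proposes MCPMix, which combines diffusion synthesis with sample mixing: a synthetic image is generated under the same mask as each real image, and the pair is mixed while the mask stays fixed. An adaptive strategy (RLA) tunes the training process, improving endoscopic image segmentation.

Details Motivation: Existing augmentation methods suffer from label ambiguity (sample mixing) or domain shift (diffusion synthesis), and cannot deliver both diversity and semantic consistency. The paper combines the strengths of the two approaches to address this.

Contribution: 1. MCPMix, which generates a synthetic counterpart under the same mask and mixes the pair, avoiding label ambiguity; 2. RLA, which adaptively adjusts the mixing strength and loss weight during training.

Method: 1. A diffusion model generates a synthetic image that shares the mask of the real image; 2. The pair is mixed in appearance only, with the original hard mask kept as supervision; 3. RLA dynamically adjusts the weight of mixed samples during training.

Result: The approach achieves state-of-the-art segmentation performance on several datasets (Kvasir-SEG, PICCOLO, and others), demonstrating its effectiveness and generality.

Insight: Combining the diversity of diffusion synthesis with the robustness of sample mixing, while adaptively re-anchoring to real data, is key to high-performing medical image segmentation.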

Abstract: Augmentation for dense prediction typically relies on either sample mixing or generative synthesis. Mixing improves robustness but misaligned masks yield soft label ambiguity. Diffusion synthesis increases apparent diversity but, when trained as common samples, overlooks the structural benefit of mask conditioning and introduces synthetic-real domain shift. We propose a paired, diffusion-guided paradigm that fuses the strengths of both. For each real image, a synthetic counterpart is generated under the same mask and the pair is used as a controllable input for Mask-Consistent Paired Mixing (MCPMix), which mixes only image appearance while supervision always uses the original hard mask. This produces a continuous family of intermediate samples that smoothly bridges synthetic and real appearances under shared geometry, enlarging diversity without compromising pixel-level semantics. To keep learning aligned with real data, Real-Anchored Learnable Annealing (RLA) adaptively adjusts the mixing strength and the loss weight of mixed samples over training, gradually re-anchoring optimization to real data and mitigating distributional bias. Across Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC 2017, the approach achieves state-of-the-art segmentation performance and consistent gains over baselines. The results show that combining label-preserving mixing with diffusion-driven diversity, together with adaptive re-anchoring, yields robust and generalizable endoscopic segmentation.
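
The core of mask-consistent mixing can be sketched in a few lines. This is a minimal illustration under assumptions: the abstract only says that the method "mixes only image appearance" while supervision keeps the original hard mask, so the linear pixel blend and the names `mask_consistent_mix` and `lam` are hypothetical choices, and the learnable annealing of RLA is not modeled.

```python
def mask_consistent_mix(real, synthetic, mask, lam):
    """Blend the appearance of a real/synthetic pair that shares one mask.

    `real` and `synthetic` are 2D lists of pixel intensities with identical
    shape; `lam` is the weight given to the real image. The hard mask is
    returned untouched, so the label never becomes a soft interpolation.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    mixed = [[lam * r + (1.0 - lam) * s for r, s in zip(real_row, syn_row)]
             for real_row, syn_row in zip(real, synthetic)]
    return mixed, mask
```

Contrast this with conventional mixing of two unrelated images, where the two masks differ and the mixed label is ambiguous. Sweeping `lam` from 0 to 1 yields the continuous family of samples between synthetic and real appearance described in the abstract.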

[42] Generative deep learning for foundational video translation in ultrasound

Nikolina Tomic,Roshni Bhatnagar,Sarthak Jain,Connor Lau,Tien-Yu Liu,Laura Gambini,Rima Arnaout

Main category: cs.CV

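TL;DR: The paper proposes a generative deep learning method for video translation between ultrasound sub-modalities (such as greyscale and color flow Doppler), addressing data imbalance and missingness. It combines pixel-wise, adversarial, and perceptual losses and uses two networks to achieve high-fidelity synthesis that is close to real data.

Details Motivation: Imbalanced and missing sub-modality data in ultrasound limits the use of deep learning in medical imaging. Generative methods can balance datasets, but translation between ultrasound sub-modalities remains challenging.

Contribution: 1. A generative deep learning method for translating between sub-modalities in ultrasound video; 2. A two-network design (structure reconstruction and denoising) for high-fidelity synthesis; 3. Synthetic data that performs indistinguishably from real data in classification and segmentation and generalizes well.

Method: 1. Pixel-wise, adversarial, and perceptual losses; 2. Two networks, one for reconstructing anatomic structures and one for denoising; 3. Training on 54,975 videos and testing on 8,368 videos.

Result: 1. The average structural similarity index (SSIM) between synthetic and real videos is 0.91 ± 0.04; 2. Synthetic data is close to real data in classification (F1 score 0.89 versus 0.9 for real videos) and segmentation (Dice score 0.97); 3. Clinical experts distinguish synthetic from real videos with only 54 ± 6% accuracy.

Insight: 1. Although trained only on heart videos, the model works well on ultrasound from several clinical domains, showing foundational abilities; 2. Generative methods extend the utility of retrospectively collected imaging and enrich the toolbox for medical imaging dataset design.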

Abstract: Deep learning (DL) has the potential to revolutionize image acquisition and interpretation across medicine; however, attention to data imbalance and missingness is required. Ultrasound data presents a particular challenge because, in addition to different views and structures, it includes several sub-modalities, such as greyscale and color flow Doppler (CFD), that are often imbalanced in clinical studies. Image translation can help balance datasets but has so far been challenging for ultrasound sub-modalities. Here, we present a generative method for ultrasound CFD-greyscale video translation, trained on 54,975 videos and tested on 8,368. The method leveraged pixel-wise, adversarial, and perceptual losses and utilized two networks, one for reconstructing anatomic structures and one for denoising, to achieve realistic ultrasound imaging. Average pairwise SSIM between synthetic videos and ground truth was 0.91 ± 0.04. Synthetic videos performed indistinguishably from real ones in DL classification and segmentation tasks and when evaluated by blinded clinical experts: F1 score was 0.9 for real and 0.89 for synthetic videos; Dice score between real and synthetic segmentation was 0.97. Overall clinician accuracy in distinguishing real vs synthetic videos was 54 ± 6% (42-61%), indicating realistic synthetic videos. Although trained only on heart videos, the model worked well on ultrasound spanning several clinical domains (average SSIM 0.91 ± 0.05), demonstrating foundational abilities. Together, these data expand the utility of retrospectively collected imaging and augment the dataset design toolbox for medical imaging.

[43] Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu,Yim-Sang Yu

Main category: cs.CV

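TL;DR: The paper proposes a new medical image segmentation architecture that combines U-Mamba with the heat conduction equation, using state-space modules and heat conduction operators to improve long-range dependency modeling. Experiments on multimodal abdominal CT and MRI datasets show that it outperforms baselines.

Details Motivation: Existing deep-learning segmentation methods, such as U-Net variants, are inefficient at global context modeling and long-range dependency reasoning, especially under limited compute. A more efficient and scalable solution is needed.

Contribution: A hybrid architecture combining U-Mamba with the heat conduction equation, in which state-space modules and Heat Conduction Operators (HCOs) improve global semantic abstraction while keeping computation efficient.

Method: Mamba state-space modules provide efficient long-range reasoning, and HCOs in the bottleneck layers simulate frequency-domain thermal diffusion to enhance semantic representation.

Result: The model outperforms baselines on multimodal abdominal CT and MRI datasets, validating its effectiveness and generalizability.

Insight: Blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical image segmentation.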

Abstract: Medical image segmentation has been significantly advanced by deep learning architectures, notably U-Net variants. However, existing models struggle to achieve efficient global context modeling and long-range dependency reasoning simultaneously under practical computational budgets. In this work, we propose a novel hybrid architecture that combines U-Mamba with the heat conduction equation. Our model combines Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, simulating frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results on multimodal abdominal CT and MRI datasets demonstrate that the proposed model consistently outperforms strong baselines, validating its effectiveness and generalizability. These results suggest that blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation tasks.
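
Frequency-domain thermal diffusion can be illustrated with the classical spectral solution of the 1D heat equation on a periodic grid: each Fourier mode with angular frequency ω is multiplied by exp(-κ ω² t). The sketch below is only that textbook operation; how the paper's HCOs are parameterized or applied to feature maps is not stated in the abstract, and the names `dft`, `heat_diffuse`, and `kappa` are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); adequate for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def heat_diffuse(signal, kappa, t):
    """Solve u_t = kappa * u_xx on a periodic 1D grid by damping Fourier modes."""
    n = len(signal)
    damped = []
    for k, coeff in enumerate(dft(signal)):
        freq = k if k <= n // 2 else k - n         # signed frequency index
        omega = 2.0 * math.pi * freq / n           # angular frequency per sample
        damped.append(coeff * math.exp(-kappa * omega ** 2 * t))
    return [v.real for v in dft(damped, inverse=True)]
```

The zero-frequency mode is left untouched, so the mean of the signal is conserved while high-frequency detail decays fastest. Every output position depends on every input position, which is one way to read the abstract's claim that heat-based diffusion provides global context.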

[44] Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising

Shuangquan Lyu,Steven Mao,Yue Ma

Main category: cs.CV

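TL;DR: The paper proposes a unified method for long video inpainting and outpainting that achieves high-quality long video editing through overlapping high-order co-denoising.

Details Motivation: Generating high-quality long videos is challenging, particularly with respect to controllability in video inpainting and outpainting. Existing methods struggle to keep long videos consistent or produce stitching artifacts.

Contribution: 1. A unified method for long video inpainting and outpainting; 2. Efficient LoRA fine-tuning of a pre-trained video diffusion model; 3. An overlapping high-order co-denoising strategy that keeps long videos consistent.

Method: 1. LoRA fine-tuning of a pre-trained video diffusion model (such as Wan 2.1); 2. An overlapping high-order co-denoising strategy to generate consistent long videos; 3. Temporal overlap and blending to avoid stitching artifacts.

Result: On long video inpainting and outpainting tasks, the method outperforms the Wan 2.1 and VACE baselines on PSNR/SSIM and LPIPS.

Insight: Combining parameter-efficient fine-tuning with high-order co-denoising enables high-quality long video editing without a significant increase in compute.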

Abstract: Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both of these challenges simultaneously and achieve controllable video inpainting and outpainting for long video clips, we introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model like Alibaba’s Wan 2.1 for masked region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift. We validate our approach on challenging inpainting/outpainting tasks including editing or adding objects over hundreds of frames and demonstrate superior performance to baseline methods such as the Wan 2.1 model and VACE in terms of quality (PSNR/SSIM) and perceptual realism (LPIPS). Our method enables practical long-range video editing with minimal overhead, achieving a balance between parameter efficiency and superior performance.
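
The overlap-and-blend step can be sketched as a weighted average over temporally overlapping windows. This is a simplified illustration under assumptions: each frame is represented by a single number standing in for a latent, the triangular weights are one common cross-fade choice rather than the paper's, and in a co-denoising loop a blend of this kind would be applied at every solver step. The name `blend_overlapping_windows` is hypothetical.

```python
def blend_overlapping_windows(windows, starts, total_len):
    """Merge per-window frame values into one sequence of `total_len` frames.

    `windows[i]` holds the values produced for the frames beginning at
    `starts[i]`. Triangular weights make overlapping windows cross-fade
    instead of switching abruptly. Assumes every frame is covered by at
    least one window.
    """
    acc = [0.0] * total_len
    weight = [0.0] * total_len
    for win, start in zip(windows, starts):
        n = len(win)
        for i, value in enumerate(win):
            w = min(i + 1, n - i)        # largest at the window centre
            acc[start + i] += w * value
            weight[start + i] += w
    return [a / w for a, w in zip(acc, weight)]
```

For two 4-frame windows starting at frames 0 and 2, frames 2 and 3 become weighted averages of both windows, so the sequence moves gradually from the first window's content to the second's rather than showing a seam at the boundary.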

[45] Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Minghao Fu,Guo-Hua Wang,Tianyu Cui,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CV

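TL;DR: The paper proposes Diffusion-SDPO, which targets a pathology of Direct Preference Optimization (DPO) for diffusion models, namely that enlarging the preference margin need not improve quality, and protects the winner branch by adaptively scaling the gradient of the loser branch.

Details Motivation: Text-to-image diffusion models produce high-quality images, but aligning them with human preferences remains challenging. The standard Diffusion-DPO objective can degrade generation quality while the margin grows, even harming the output of the winner branch.

Contribution: Diffusion-SDPO, a safeguarded update rule that adaptively scales the loser-branch gradient to protect the winner branch, ensuring that the winner's error does not increase during optimization.

Method: A first-order analysis yields a closed-form scaling coefficient that adjusts the loser gradient according to its alignment with the winner gradient. The method is simple, model-agnostic, and adds little computational overhead.

Result: On standard text-to-image benchmarks, Diffusion-SDPO outperforms baselines on automated preference, aesthetic, and prompt alignment metrics.

Insight: Protecting the winner branch and dynamically scaling the loser gradient allows preference alignment to improve without degrading generation quality.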

Abstract: Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
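
The kind of first-order safeguard described above can be sketched as follows. This is not the paper's closed-form coefficient, which the abstract does not give. It is a derivation under an assumed update rule, θ ← θ - lr · (g_win - λ · g_lose), where g_win and g_lose are the gradients of the winner and loser reconstruction errors. To first order the winner error changes by -lr · (|g_win|² - λ · ⟨g_win, g_lose⟩), which is non-positive whenever λ · ⟨g_win, g_lose⟩ ≤ |g_win|². All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def safeguarded_scale(g_win, g_lose, eps=1e-12):
    """Largest lam in [0, 1] keeping the winner error non-increasing to first order.

    Assumed update (illustrative): theta <- theta - lr * (g_win - lam * g_lose).
    """
    align = dot(g_win, g_lose)
    if align <= eps:
        return 1.0      # pushing the loser away cannot raise the winner error
    return min(1.0, dot(g_win, g_win) / align)


def safeguarded_direction(g_win, g_lose):
    """Descent direction g_win - lam * g_lose with the safeguarded lam."""
    lam = safeguarded_scale(g_win, g_lose)
    return [gw - lam * gl for gw, gl in zip(g_win, g_lose)]
```

When the two gradients are strongly aligned, the unscaled loser term can overwhelm the winner term and raise the winner's error, which matches the pathology the abstract describes; shrinking λ in exactly those cases removes it.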

[46] SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago,Luca Carlini,Pelinsu Celebi Balyemez,Dennis Pierantozzi,Chiara Lena,Cesare Hassan,Danail Stoyanov,Elena De Momi,Sophia Bano,Mobarak I. Hoque

Main category: cs.CV

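TL;DR: SurgViVQA is a video question answering model for dynamic surgical scene understanding. By fusing video and text features it captures temporal dynamics and clearly outperforms existing image-based VQA models.

Details Motivation: Existing surgical VQA models mostly rely on static image features and miss temporal dynamics, yet understanding a surgical scene depends heavily on dynamic information such as continuous motion and tool–tissue interactions.

Contribution: 1. SurgViVQA, which extends visual reasoning from static images to dynamic surgical scenes; 2. REAL-Colon-VQA, a dataset with motion-related questions and diagnostic attributes; 3. Clear gains over existing benchmark models in keyword accuracy.

Method: 1. A Masked Video–Text Encoder fuses video and question features and captures temporal cues; 2. A fine-tuned large language model (LLM) decodes them into coherent answers; 3. Perturbed questions are used to test robustness.

Result: On the REAL-Colon-VQA and EndoVis18-VQA datasets, SurgViVQA outperforms PitVQA, improving keyword accuracy by 11% and 9% respectively. The perturbation study shows better robustness to variations in question phrasing.

Insight: Temporal dynamics are essential for surgical scene understanding, and fusing video and text features gives SurgViVQA an effective framework for dynamic surgical VideoQA.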

Abstract: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

[47] Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

Yi Yang,Yiming Xu,Timo Kaiser,Hao Cheng,Bodo Rosenhahn,Michael Ying Yang

Main category: cs.CV

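TL;DR: The report presents a training-free, two-stage method that combines FastTracker and LLaVA-Video for multi-object tracking and retrieval in the MOT25-StAG Challenge.

Details Motivation: The goal is to track multiple objects that match language queries in complex scenes, using a solution that requires no training.

Contribution: A two-stage, zero-shot approach that combines a SOTA tracking model with a multi-modal large language model and placed second in the challenge.

Method: The first stage tracks objects with FastTracker; the second stage matches tracks to language queries with LLaVA-Video.

Result: On the MOT25-StAG test set the method reaches m-HIoU and HOTA scores of 20.68 and 10.73 respectively.

Insight: Multi-modal large language models can be used effectively for video retrieval, and zero-shot approaches show promise in complex scenes.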

Abstract: In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and the multi-modal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, placing second in the challenge.

[48] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang,Zixiang Zhou,Teng Hu,Ziqiao Peng,Youliang Zhang,Yi Chen,Yuan Zhou,Qinglin Lu,Limin Wang

Main category: cs.CV

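TL;DR: UniAVGen is a unified framework for audio and video generation. Its asymmetric cross-modal interaction mechanism addresses the weak lip synchronization and semantic consistency of existing methods, and it performs well across several tasks.

Details Motivation: Existing open-source audio-video generation methods lack effective cross-modal modeling, which leads to poor lip synchronization and insufficient semantic consistency, so a more effective unified framework is needed.

Contribution: 1. UniAVGen, a unified framework for joint audio and video generation; 2. An Asymmetric Cross-Modal Interaction mechanism and a Face-Aware Modulation module; 3. Modality-Aware Classifier-Free Guidance; 4. Performance advantages with far less training data.

Method: 1. A dual-branch joint synthesis architecture with parallel DiTs builds a cross-modal latent space; 2. The Asymmetric Cross-Modal Interaction mechanism provides bidirectional, temporally aligned cross-attention; 3. The Face-Aware Modulation module dynamically prioritizes salient regions; 4. Modality-Aware Classifier-Free Guidance improves generation quality.

Result: With 1.3M training samples (versus 30.1M), UniAVGen shows overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

Insight: 1. Asymmetric cross-modal interaction can markedly improve synchronization and semantic consistency in multimodal tasks; 2. Dynamic modulation and guidance strategies are key to generation quality.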

Abstract: Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen’s robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.

[49] Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models

Gahyeon Kim,Sohee Kim,Seokju Lee

Main category: cs.CV

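TL;DR: The paper proposes AAPL, which decouples the visual variations introduced by image augmentation from class-relevant semantic representations, improving prompt learning on zero-shot and few-shot tasks.

Details Motivation: Existing prompt learning methods such as CoOp and CoCoOp focus on text-side modifications, overlook the potential of image augmentation, and generalize poorly to unseen categories.

Contribution: AAPL, which uses adversarial token embeddings to decouple superficial visual variations caused by augmentation from semantic representations, improving the generalization of prompt learning.

Method: Adversarial token embeddings separate augmentation-induced, class-irrelevant visual features from class-relevant semantic features, so the learned prompts concentrate on discriminative visual features.

Result: On 11 benchmark datasets, AAPL outperforms existing methods in few-shot, zero-shot, cross-dataset, and domain generalization settings.

Insight: Combining image augmentation with prompt learning can improve generalization, provided the noise introduced by augmentation is decoupled from class semantics.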

Abstract: Recent advances in large-scale vision and language models have led to significant progress in zero-shot learning tasks. Methods such as CoOp and CoCoOp have shown that replacing handcrafted prompts with learnable vectors, known as prompt learning, can result in improved performance. However, these models often struggle to generalize to entirely unseen categories. While traditional zero-shot learning techniques benefit from various data augmentation strategies, prompt learning has primarily focused on text-based modifications, leaving the potential of image-based augmentation largely unexplored. In this work, we explore how image-level augmentations, particularly those that introduce attribute-specific variations, can support and enhance prompt learning. Our analysis examines the interaction between these augmentations and soft prompt frameworks, revealing their potential to improve generalization. We also identify a limitation in existing methods, such as CoCoOp, which do not provide explicit guidance for learning prompts that focus on semantically meaningful visual features. To address this, we propose Adding Attributes to Prompt Learning, AAPL, a novel method that introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. This decoupling enables the learned prompts to concentrate on visually discriminative features that align with the target categories. We conduct comprehensive experiments on eleven benchmark datasets, and AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings. Our source code is publicly available at: https://github.com/Gahyeonkim09/AAPL

[50] Robust Alignment of the Human Embryo in 3D Ultrasound using PCA and an Ensemble of Heuristic, Atlas-based and Learning-based Classifiers Evaluated on the Rotterdam Periconceptional Cohort

Nikolai Herrmann,Marcella C. Zijta,Stefan Klein,Régine P. M. Steegers-Theunissen,Rene M. H. Wijnen,Bernadette S. de Bakker,Melek Rousian,Wietske A. P. Bastiaansen

Main category: cs.CV

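TL;DR: Proposes an automated method that extracts the embryo's principal axes with PCA and selects the standard orientation with a heuristic, an atlas-based strategy, or a learned classifier, achieving accurate standardized alignment of the embryo in 3D ultrasound images.

Details Motivation: Standardized alignment of the embryo in 3D ultrasound aids clinical growth monitoring and standard plane detection, but efficient and accurate automated methods are lacking.

Contribution: (1) Extracts the embryo's principal axes with PCA; (2) selects among candidate orientations with three strategies: a heuristic, atlas matching, and a Random Forest classifier; (3) validates the pipeline on 2166 images from 1043 pregnancies.

Method: Three steps: (1) apply PCA to the segmentation mask to extract the embryo's principal axes; (2) derive four candidate orientations; (3) select the correct one using a heuristic (Pearson's correlation), atlas matching (normalized cross-correlation), or a Random Forest classifier.

Result: PCA correctly extracted the principal axes in 99.0% of images. The three strategies selected the correct candidate in 97.4%, 95.8%, and 98.4% of images, respectively; a majority vote reached 98.5%.

Insight: Combining heuristic, atlas-based, and learning-based strategies makes alignment more robust and accurate, providing a scalable automated tool for clinical and research use.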

Abstract: Standardized alignment of the embryo in three-dimensional (3D) ultrasound images aids prenatal growth monitoring by facilitating standard plane detection, improving visualization of landmarks and accentuating differences between different scans. In this work, we propose an automated method for standardizing this alignment. Given a segmentation mask of the embryo, Principal Component Analysis (PCA) is applied to the mask to extract the embryo’s principal axes, from which four candidate orientations are derived. The candidate in standard orientation is selected using one of three strategies: a heuristic based on Pearson’s correlation assessing shape, image matching to an atlas through normalized cross-correlation, and a Random Forest classifier. We tested our method on 2166 longitudinally acquired 3D ultrasound images from 1043 pregnancies from the Rotterdam Periconceptional Cohort, ranging from 7+0 to 12+6 weeks of gestational age. In 99.0% of images, PCA correctly extracted the principal axes of the embryo. The correct candidate was selected by the Pearson Heuristic, Atlas-based and Random Forest strategies in 97.4%, 95.8%, and 98.4% of images, respectively. A Majority Vote of these selection methods resulted in an accuracy of 98.5%. The high accuracy of this pipeline enables consistent embryonic alignment in the first trimester, enabling scalable analysis in both clinical and research settings. The code is publicly available at: https://gitlab.com/radiology/prenatal-image-analysis/pca-3d-alignment.
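The PCA step can be illustrated with a minimal sketch. It assumes the mask is a binary 3D NumPy array and that the four candidates arise from the sign ambiguity of the principal axes (only four of the eight sign combinations give a right-handed frame); the abstract does not say how the candidates are derived, so this is an assumption. The function names `principal_axes` and `candidate_orientations` are hypothetical and not taken from the authors' repository.

```python
import numpy as np

def principal_axes(mask):
    """Return the centroid and the principal axes (rows, by decreasing variance)."""
    coords = np.argwhere(mask)                 # (N, 3) voxel coordinates inside the mask
    centroid = coords.mean(axis=0)
    cov = np.cov((coords - centroid).T)        # 3x3 covariance of the coordinates
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending variance
    return centroid, eigvecs[:, order].T

def candidate_orientations(axes):
    """Assumed construction: flip the signs of the first two axes and
    complete each frame with a cross product so it stays right-handed."""
    candidates = []
    for s1 in (1, -1):
        for s2 in (1, -1):
            a1, a2 = s1 * axes[0], s2 * axes[1]
            candidates.append(np.stack([a1, a2, np.cross(a1, a2)]))
    return candidates

# Toy example: a box that is longest along the first array axis.
mask = np.zeros((20, 10, 6), dtype=bool)
mask[2:18, 3:8, 2:5] = True
centroid, axes = principal_axes(mask)
candidates = candidate_orientations(axes)
```

Each candidate is a proper rotation, so a selection strategy only has to decide which of the four matches the standard orientation.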

[51] A Lightweight 3D-CNN for Event-Based Human Action Recognition with Privacy-Preserving Potential

Mehdi Sefidgar Dilmaghani,Francis Fowley,Peter Corcoran

Main category: cs.CV

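TL;DR: Proposes a lightweight 3D-CNN for human action recognition on event-based vision data. Event cameras help preserve privacy, and the model is both accurate (F1-score 0.9415) and efficient.

Details Motivation: Conventional frame-based cameras used for monitoring can expose identifiable personal information, whereas event cameras record only changes in pixel intensity and are therefore inherently privacy-preserving. This motivates efficient, privacy-aware human action recognition built on event cameras.

Contribution: A lightweight 3D-CNN that combines focal loss with class reweighting and targeted data augmentation, achieving high accuracy in a design suitable for edge deployment.

Method: A compact 3D-CNN architecture; focal loss with class reweighting addresses class imbalance, and data augmentation improves generalization. The model is trained and evaluated on a composite dataset derived from the Toyota Smart Home and ETRI datasets.

Result: The model reaches an F1-score of 0.9415 and an overall accuracy of 94.17%, outperforming baselines such as C3D and ResNet3D.

Insight: Pairing event cameras with lightweight deep models preserves privacy while enabling efficient action recognition, which is attractive for edge computing.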

Abstract: This paper presents a lightweight three-dimensional convolutional neural network (3DCNN) for human activity recognition (HAR) using event-based vision data. Privacy preservation is a key challenge in human monitoring systems, as conventional frame-based cameras capture identifiable personal information. In contrast, event cameras record only changes in pixel intensity, providing an inherently privacy-preserving sensing modality. The proposed network effectively models both spatial and temporal dynamics while maintaining a compact design suitable for edge deployment. To address class imbalance and enhance generalization, focal loss with class reweighting and targeted data augmentation strategies are employed. The model is trained and evaluated on a composite dataset derived from the Toyota Smart Home and ETRI datasets. Experimental results demonstrate an F1-score of 0.9415 and an overall accuracy of 94.17%, outperforming benchmark 3D-CNN architectures such as C3D, ResNet3D, and MC3_18 by up to 3%. These results highlight the potential of event-based deep learning for developing accurate, efficient, and privacy-aware human action recognition systems suitable for real-world edge applications.
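As an illustration of focal loss with class reweighting, here is a minimal pure-Python sketch of the standard focal loss for one sample. The inverse-frequency weighting scheme and the default `gamma=2.0` are assumptions; the abstract does not state which reweighting or hyperparameters were used.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one sample: -w_c * (1 - p_c)^gamma * log(p_c)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

def inverse_frequency_weights(counts):
    """Assumed reweighting: classes with fewer samples get larger weights."""
    total, n = sum(counts), len(counts)
    return [total / (n * c) for c in counts]

weights = inverse_frequency_weights([90, 10])
loss = focal_loss([2.0, -1.0], target=1, class_weights=weights)
```

With `gamma=0` and unit weights the expression reduces to ordinary cross-entropy; larger `gamma` down-weights samples the model already classifies confidently.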

[52] Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection

Dongkeun Kim,Minsu Cho,Suha Kwak

Main category: cs.CV

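TL;DR: Proposes a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. By using body part features and subtle interpersonal cues, it infers social groups more accurately.

Details Motivation: Existing social interaction detection methods overlook subtle cues such as facial expressions, gaze, and gestures, and rely on holistic representations of individuals. As a result they miss localized social signals and introduce ambiguity when inferring group configurations.

Contribution: 1. A part-aware bottom-up group reasoning framework; 2. modeling of social interactions through body part features and interpersonal relations; 3. new state-of-the-art results on the NVI dataset.

Method: 1. Detect individuals and enhance their features with part-aware cues; 2. associate individuals through similarity-based reasoning that considers both spatial relations and social cues, and infer the group configuration.

Result: Outperforms prior methods on the NVI dataset, setting a new state of the art.

Insight: Part-level features and subtle social cues are essential for social interaction detection, and bottom-up reasoning models group configurations more accurately.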

Abstract: Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving the new state of the art.

[53] Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Jongseo Lee,Wooil Lee,Gyeong-Moon Park,Seong Tae Kim,Jinwoo Choi

Main category: cs.CV

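TL;DR: Proposes DANCE, a video action recognition framework based on disentangled concepts. By separating motion dynamics, object, and scene concepts, it yields clearer explanations, and its performance and practical value are validated on several datasets.

Details Motivation: Existing explanation methods for video action recognition, such as saliency, entangle motion with spatial context and produce unclear explanations. Language-based approaches struggle to explain motions that are intuitively understood but hard to verbalize. The paper targets both problems.

Contribution: Designs the DANCE framework, which predicts actions through disentangled motion dynamics, object, and scene concepts and uses a concept bottleneck to force predictions through these concepts, improving explanation clarity and practical usefulness.

Method: DANCE defines motion dynamics concepts as human pose sequences, uses a large language model to extract object and scene concepts automatically, and builds on a concept bottleneck design to obtain disentangled explanations of its predictions.

Result: Experiments on four datasets show that DANCE significantly improves explanation clarity while remaining competitive in performance. A user study confirms its superior interpretability, and further experiments show its value for model debugging, editing, and failure analysis.

Insight: Disentangled concepts improve interpretability and also open up practical uses of the model, such as debugging and editing.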

Abstract: Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature – intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets – KTH, Penn Action, HAA500, and UCF-101 – demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.

eess.SP

[54] Benchmarking ResNet for Short-Term Hypoglycemia Classification with DiaData

Beyza Cinar,Maria Maleshkova

Main category: eess.SP

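TL;DR: Improves the data quality of DiaData and provides a ResNet benchmark for short-term hypoglycemia classification; data cleaning and interpolation improve model performance.

Details Motivation: Type 1 Diabetes (T1D) data analysis suffers from noise, missing values, and small data volumes. The study aims to improve data quality and to use the refined data to provide a reliable benchmark for hypoglycemia classification.

Contribution: 1) Outliers are identified with the interquartile range (IQR) and treated; 2) missing data are imputed with linear and Stineman interpolation; 3) the correlation between glucose and heart rate is analyzed; 4) a ResNet-based benchmark for hypoglycemia classification is provided.

Method: Outliers are handled with the IQR approach, gaps are filled with linear and Stineman interpolation, and a ResNet model is trained to classify hypoglycemia.

Result: Quality-refined data improve model performance by 2-3%, and training with more data improves it by 7%.

Insight: High-quality data are critical for model performance, and Stineman interpolation yields more realistic estimates than linear interpolation for larger gaps.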

Abstract: Individualized therapy is driven forward by medical data analysis, which provides insight into the patient’s context. In particular, for Type 1 Diabetes (T1D), which is an autoimmune disease, relationships between demographics, sensor data, and context can be analyzed. However, outliers, noisy data, and small data volumes cannot provide a reliable analysis. Hence, the research domain requires large volumes of high-quality data. Moreover, missing values can lead to information loss. To address this limitation, this study improves the data quality of DiaData, an integration of 15 separate datasets containing glucose values from 2510 subjects with T1D. Notably, we make the following contributions: 1) Outliers are identified with the interquartile range (IQR) approach and treated by replacing them with missing values. 2) Small gaps ($\le$ 25 min) are imputed with linear interpolation and larger gaps ($\ge$ 30 and $<$ 120 min) with Stineman interpolation. Based on a visual comparison, Stineman interpolation provides more realistic glucose estimates than linear interpolation for larger gaps. 3) After data cleaning, the correlation between glucose and heart rate is analyzed, yielding a moderate relation between 15 and 60 minutes before hypoglycemia ($\le$ 70 mg/dL). 4) Finally, a benchmark for hypoglycemia classification is provided with a state-of-the-art ResNet model. The model is trained with the Maindatabase and Subdatabase II of DiaData to classify hypoglycemia onset up to 2 hours in advance. Training with more data improves performance by 7% while using quality-refined data yields a 2-3% gain compared to raw data.
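The cleaning steps described above can be sketched in pure Python. The sketch assumes a regular 5-minute sampling interval, Tukey fences with `k=1.5` for the IQR rule, and a gap length measured as the number of missing samples times the interval; none of these details is stated in the abstract. Stineman interpolation for larger gaps is not reproduced here, and the function names are hypothetical.

```python
import statistics

def mask_outliers_iqr(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with None (missing)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if v is not None and lo <= v <= hi else None for v in values]

def interpolate_small_gaps(values, step_min=5, max_gap_min=25):
    """Linearly fill interior gaps no longer than max_gap_min minutes."""
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i                                  # index of the first value after the gap
        gap_min = (end - start) * step_min
        if start > 0 and end < n and gap_min <= max_gap_min:
            left, right = out[start - 1], out[end]
            span = end - (start - 1)
            for j in range(start, end):
                out[j] = left + (right - left) * (j - (start - 1)) / span
    return out

cleaned = mask_outliers_iqr([100, 102, 98, 101, 99, 400, 100, 97])
filled = interpolate_small_gaps([100, 110, None, None, 140, 150])
```

Gaps at the start or end of a series, and gaps longer than the threshold, are left as missing so that a different method can handle them.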

[55] NEF-NET+: Adapting Electrocardio panorama in the wild

Zehui Zhan,Yaojun Hu,Jiajing Zhan,Wanchen Lian,Wanqing Wu,Jintai Chen

Main category: eess.SP

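TL;DR: NEF-NET+ is an enhanced Electrocardio Panorama synthesis framework that addresses the limitations of conventional ECG systems. It supports in-the-wild signal synthesis of arbitrary length from any view and adapts to different devices and operator-induced deviations.

Details Motivation: Conventional ECG systems capture signals only from fixed anatomical viewpoints, while some cardiac conditions require non-standard views to reveal critical patterns. Nef-Net can reconstruct a continuous electrocardiac field, but in practice it faces challenges such as long-duration modeling, device-specific noise, and electrode placement deviations.

Contribution: Proposes NEF-NET+, which supports ECG synthesis of arbitrary length from any view, generalizes across devices, and compensates for electrode placement deviations; also builds a new evaluation benchmark, Panobench.

Method: A model architecture that performs direct view transformation, with a workflow comprising offline pretraining, device calibration, and patient-specific calibration.

Result: In real-world settings, NEF-NET+ improves PSNR by about 6 dB over Nef-Net.

Insight: Adding calibration steps that adapt to real-world conditions substantially improves the robustness and accuracy of Electrocardio Panorama synthesis, giving cardiac diagnosis a more flexible tool.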

Abstract: Conventional multi-lead electrocardiogram (ECG) systems capture cardiac signals from a fixed set of anatomical viewpoints defined by lead placement. However, certain cardiac conditions (e.g., Brugada syndrome) require additional, non-standard viewpoints to reveal diagnostically critical patterns that may be absent in standard leads. To systematically overcome this limitation, Nef-Net was recently introduced to reconstruct a continuous electrocardiac field, enabling virtual observation of ECG signals from arbitrary views (termed Electrocardio Panorama). Despite its promise, Nef-Net operates under idealized assumptions and faces in-the-wild challenges, such as long-duration ECG modeling, robustness to device-specific signal artifacts, and suboptimal lead placement calibration. This paper presents NEF-NET+, an enhanced framework for realistic panoramic ECG synthesis that supports arbitrary-length signal synthesis from any desired view, generalizes across ECG devices, and compensates for operator-induced deviations in electrode placement. These capabilities are enabled by a newly designed model architecture that performs direct view transformation, incorporating a workflow comprising offline pretraining and device calibration tuning steps, as well as an on-the-fly calibration step for patient-specific adaptation. To rigorously evaluate panoramic ECG synthesis, we construct a new Electrocardio Panorama benchmark, called Panobench, comprising 5367 recordings with 48 views per subject, capturing the full spatial variability of cardiac electrical activity. Experimental results show that NEF-NET+ delivers substantial improvements over Nef-Net, yielding an increase of around 6 dB in PSNR in real-world settings. The code and Panobench will be released in a subsequent publication.
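To put the reported 6 dB gain in context, here is a small sketch of PSNR for a 1D signal. Using the dynamic range of the reference as the peak value is an assumption; the abstract does not state how the peak is defined for ECG signals.

```python
import math

def psnr(reference, estimate, peak=None):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    if peak is None:
        peak = max(reference) - min(reference)   # assumed peak: range of the reference
    return 10 * math.log10(peak ** 2 / mse)

reference = [0.0, 1.0, 0.0, -1.0]
coarse = [r + 0.2 for r in reference]   # constant error of 0.2
fine = [r + 0.1 for r in reference]     # error halved to 0.1
gain_db = psnr(reference, fine) - psnr(reference, coarse)
```

Because PSNR is logarithmic, halving the error amplitude adds 20·log10(2) ≈ 6.02 dB, so a gain of about 6 dB corresponds to roughly a fourfold reduction in mean squared error.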

cs.RO

[56] Comprehensive Assessment of LiDAR Evaluation Metrics: A Comparative Study Using Simulated and Real Data

Syed Mostaquim Ali,Taufiq Rahman,Ghazal Farhani,Mohamed H. Zaki,Benoit Anctil,Dominique Charlebois

Main category: cs.RO

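TL;DR: Through a comparative study on simulated and real LiDAR data, the paper assesses the suitability of LiDAR evaluation metrics and finds that Density Aware Chamfer Distance (DCD) works best across cases.

Details Motivation: Developing safe Autonomous Driving Systems (ADS) requires rigorous testing, but comprehensive physical testing is impractical because of cost and safety concerns, so Virtual Testing Environments (VTE) are used as an alternative. The study looks for metrics suitable for comparing real and simulated LiDAR scans.

Contribution: 1. A comprehensive experimental approach that compares the sensitivity and accuracy of several LiDAR evaluation metrics. 2. The finding that DCD performs best in all tested scenarios. 3. A virtual testing environment generated from real LiDAR data, with perception and geometric similarity evaluated on it.

Method: 1. Test metric behavior under different noise, density, distortion, sensor orientation, and channel settings. 2. Generate a VTE from real LiDAR data and simulate LiDAR scans from it for comparison. 3. Compare simulated and real scans in terms of perception output and geometric similarity.

Result: Between simulated and real LiDAR scans, semantic segmentation reaches an mIoU of 21% and the average DCD is 0.63, indicating a slight difference in geometric properties but a significant difference in perception outputs. DCD is the metric most correlated with perception methods.

Insight: DCD is an effective metric for measuring the gap between simulated and real LiDAR data, which is useful for building virtual testing environments and validating the safety of autonomous driving systems.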

Abstract: For developing safe Autonomous Driving Systems (ADS), rigorous testing is required before they are deemed safe for road deployments. Since comprehensive conventional physical testing is impractical due to cost and safety concerns, Virtual Testing Environments (VTE) can be adopted as an alternative. Comparing VTE-generated sensor outputs against their real-world analogues can be a strong indication that the VTE accurately represents reality. Correspondingly, this work explores a comprehensive experimental approach to finding evaluation metrics suitable for comparing real-world and simulated LiDAR scans. The metrics were tested in terms of sensitivity and accuracy with different noise, density, distortion, sensor orientation, and channel settings. From comparing the metrics, we found that Density Aware Chamfer Distance (DCD) works best across all cases. In the second step of the research, a Virtual Testing Environment was generated using real LiDAR scan data. The data was collected in a controlled environment with only static objects using an instrumented vehicle equipped with LiDAR, IMU and cameras. Simulated LiDAR scans were generated from the VTEs using the same pose as real LiDAR scans. The simulated and real LiDAR scans were compared in terms of model perception and geometric similarity. Actual and simulated LiDAR scans have a similar semantic segmentation output with an mIoU of 21% with corrected intensity and an average density aware chamfer distance (DCD) of 0.63. This indicates a slight difference in the geometric properties of simulated and real LiDAR scans and a significant difference between model outputs. During the comparison, density-aware chamfer distance was found to be the most correlated among the metrics with perception methods.
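For readers unfamiliar with this family of metrics, the sketch below computes the plain symmetric Chamfer distance between two small point clouds. This is not the density-aware variant (DCD) evaluated in the paper, which additionally accounts for how points are distributed; the plain form with unsquared Euclidean distances is shown only as a baseline illustration.

```python
import math

def chamfer_distance(cloud_a, cloud_b):
    """Plain symmetric Chamfer distance: for each point, the distance to its
    nearest neighbour in the other cloud, averaged in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)

real_scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
simulated_scan = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]   # same points shifted 0.5 in z
distance = chamfer_distance(real_scan, simulated_scan)
```

The brute-force nearest-neighbour search is quadratic in the number of points; real LiDAR scans would normally use a spatial index such as a k-d tree.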

[57] OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera

Hao Shi,Ze Wang,Shangwei Guo,Mengfei Duan,Song Wang,Teng Chen,Kailun Yang,Lin Wang,Kaiwei Wang

Main category: cs.RO

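TL;DR: OneOcc is a panoramic semantic occupancy prediction framework for legged and humanoid robots. Using a single panoramic camera, it predicts 3D semantic occupancy efficiently while handling gait-induced jitter and 360° continuity.

Details Motivation: Existing semantic scene completion (SSC) systems mostly target wheeled platforms with forward-facing sensors and do not handle the gait-induced jitter and 360° continuity of legged robots. OneOcc is proposed to fill this gap.

Contribution: Introduces the DP-ER, BGV, Hierarchical AMoE-3D decoder, and GDC modules, releases two panoramic occupancy datasets (QuadOcc and Human360Occ), and achieves state-of-the-art results.

Method: Combines Dual-Projection fusion (DP-ER) and Bi-Grid Voxelization (BGV) with a lightweight decoder that performs dynamic multi-scale fusion, and applies Gait Displacement Compensation (GDC) without extra sensors.

Result: On QuadOcc it surpasses vision and LiDAR baselines; on H3O it gains +3.83 mIoU (within-city) and +8.08 mIoU (cross-city).

Insight: Combining panoramic vision with dynamic fusion addresses perception under gait motion for legged robots, and the lightweight design makes real deployment feasible.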

Abstract: Robust 3D semantic occupancy is crucial for legged/humanoid robots, yet most semantic scene completion (SSC) systems target wheeled platforms with forward-facing sensors. We present OneOcc, a vision-only panoramic SSC framework designed for gait-introduced body jitter and 360° continuity. OneOcc combines: (i) Dual-Projection fusion (DP-ER) to exploit the annular panorama and its equirectangular unfolding, preserving 360° continuity and grid alignment; (ii) Bi-Grid Voxelization (BGV) to reason in Cartesian and cylindrical-polar spaces, reducing discretization bias and sharpening free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D for dynamic multi-scale fusion and better long-range/occlusion reasoning; and (iv) plug-and-play Gait Displacement Compensation (GDC) learning feature-level motion correction without extra sensors. We also release two panoramic occupancy benchmarks: QuadOcc (real quadruped, first-person 360°) and Human360Occ (H3O) (CARLA human-ego 360° with RGB, Depth, semantic occupancy; standardized within-/cross-city splits). OneOcc sets new state-of-the-art (SOTA): on QuadOcc it beats strong vision baselines and popular LiDAR ones; on H3O it gains +3.83 mIoU (within-city) and +8.08 (cross-city). Modules are lightweight, enabling deployable full-surround perception for legged/humanoid robots. Datasets and code will be publicly available at https://github.com/MasterHow/OneOcc.

q-fin.TR

[58] LiveTradeBench: Seeking Real-World Alpha with Large Language Models

Haofei Yu,Fenghai Li,Jiaxuan You

Main category: q-fin.TR

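TL;DR: LiveTradeBench is a live trading environment for evaluating large language model (LLM) agents in realistic, evolving markets. It exposes a gap between static evaluation and real-world competence.

Details Motivation: LLMs perform well on static benchmarks, but these benchmarks do not test behavior under real dynamics and uncertainty. LiveTradeBench aims to fill this gap by testing LLM decision-making in live markets.

Contribution: Proposes LiveTradeBench, a trading environment built on live data and multiple markets that extends LLM evaluation to risk management and multi-asset allocation.

Method: LiveTradeBench follows three design principles: live data streaming, a portfolio-management abstraction over multiple assets, and multi-market evaluation. The trading performance of 21 LLMs is tested in 50-day live evaluations.

Result: (1) High LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles; (3) some LLMs effectively use live signals to adapt their decisions.

Insight: There is a clear gap between static evaluation and real-world dynamic decision-making; future benchmarks should test sequential decision making and consistency under live uncertainty.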

Abstract: Large language models (LLMs) achieve strong performance across benchmarks–from knowledge quizzes and math reasoning to web-agent tasks–but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments–U.S. stocks and Polymarket prediction markets–differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
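The portfolio-management abstraction, in which an agent outputs percentage allocations at each step, can be sketched as follows. The function name `step_portfolio`, the ticker `AAA`, and the treatment of cash as an asset with a constant price are all hypothetical; the abstract does not describe the environment's actual interface, fees, or execution model.

```python
def step_portfolio(value, allocations, old_prices, new_prices):
    """Rebalance to the given percentage allocations, then apply one price move.

    allocations: asset -> non-negative percentage (normalized to sum to 1 here).
    Transaction costs and slippage are ignored in this sketch.
    """
    if any(a < 0 for a in allocations.values()):
        raise ValueError("allocations must be non-negative")
    total = sum(allocations.values())
    if total <= 0:
        raise ValueError("at least one allocation must be positive")
    growth = 0.0
    for asset, share in allocations.items():
        weight = share / total
        growth += weight * new_prices[asset] / old_prices[asset]
    return value * growth

new_value = step_portfolio(
    1000.0,
    {"AAA": 50, "CASH": 50},            # agent output: 50% stock, 50% cash
    {"AAA": 100.0, "CASH": 1.0},
    {"AAA": 110.0, "CASH": 1.0},        # the stock rises 10%, cash is unchanged
)
```

Holding half the portfolio in cash halves the exposure to the 10% move, so the portfolio gains 5%.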

cs.GR

[59] Scheduling the Off-Diagonal Weingarten Loss of Neural SDFs for CAD Models

Haotian Yin,Przemyslaw Musialski

Main category: cs.GR

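TL;DR: Proposes weight scheduling strategies for the Off-Diagonal Weingarten (ODW) loss of neural signed distance functions (SDFs). Varying the weight across training stages improves geometric reconstruction of CAD models.

Details Motivation: FlatCAD's ODW loss is an efficient second-order prior, but its fixed weight is suboptimal: strong regularization is needed early in training to stabilize optimization, while weaker regularization is needed later to recover detail. A time-varying schedule is needed to balance the two.

Contribution: 1. Proposes several ODW weight schedules (constant, linear, quintic, step interpolation, and others) that adjust regularization strength over training;
2. Shows experimentally that time-varying schedules clearly outperform the fixed-weight baseline on CAD reconstruction.

Method: 1. Uses five schedules (constant, linear, quintic, step interpolation, and an increasing warm-up);
2. Adjusts the ODW loss weight during training;
3. Evaluates the schedules on the ABC CAD dataset.

Result: Time-varying schedules improve Chamfer Distance by up to 35% over the fixed-weight baseline, demonstrating their effectiveness for CAD reconstruction.

Insight: 1. Adjusting regularization strength over time is key to improving neural SDFs;
2. The choice of schedule has a significant effect on results and should be matched to the task.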

Abstract: Neural signed distance functions (SDFs) have become a powerful representation for geometric reconstruction from point clouds, yet they often require both gradient- and curvature-based regularization to suppress spurious warp and preserve structural fidelity. FlatCAD introduced the Off-Diagonal Weingarten (ODW) loss as an efficient second-order prior for CAD surfaces, approximating full-Hessian regularization at roughly half the computational cost. However, FlatCAD applies a fixed ODW weight throughout training, which is suboptimal: strong regularization stabilizes early optimization but suppresses detail recovery in later stages. We present scheduling strategies for the ODW loss that assign a high initial weight to stabilize optimization and progressively decay it to permit fine-scale refinement. We investigate constant, linear, quintic, and step interpolation schedules, as well as an increasing warm-up variant. Experiments on the ABC CAD dataset demonstrate that time-varying schedules consistently outperform fixed weights. Our method achieves up to a 35% improvement in Chamfer Distance over the FlatCAD baseline, establishing scheduling as a simple yet effective extension of curvature regularization for robust CAD reconstruction.
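The schedule families named in the abstract can be sketched as a single weight function. The exact formulas are not given in the abstract, so the following are assumptions: the quintic schedule uses the smootherstep polynomial 6t^5 - 15t^4 + 10t^3, the step schedule switches at the midpoint of training, and the start and end weights are placeholders. The warm-up variant is obtained here simply by choosing `w_end > w_start`.

```python
def odw_weight(step, total_steps, w_start=10.0, w_end=1.0, kind="linear"):
    """Interpolate the ODW loss weight from w_start to w_end over training."""
    t = min(max(step / total_steps, 0.0), 1.0)   # training progress in [0, 1]
    if kind == "constant":
        s = 0.0                                   # weight stays at w_start
    elif kind == "linear":
        s = t
    elif kind == "quintic":
        s = t ** 3 * (10 - 15 * t + 6 * t * t)    # assumed form: smootherstep
    elif kind == "step":
        s = 0.0 if t < 0.5 else 1.0               # assumed switch at the midpoint
    else:
        raise ValueError(f"unknown schedule: {kind}")
    return w_start + (w_end - w_start) * s

# Weight at 25% of training under each schedule.
early = {k: odw_weight(25, 100, kind=k) for k in ("constant", "linear", "quintic", "step")}
```

Under these assumptions the quintic schedule keeps the weight closer to its initial value early in training than the linear one, then decays faster in the middle.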

eess.IV

[60] Optimizing the nnU-Net model for brain tumor (Glioma) segmentation Using a BraTS Sub-Saharan Africa (SSA) dataset

Chukwuemeka Arua Kalu,Adaobi Chiazor Emegoakor,Fortune Okafor,Augustine Okoh Uchenna,Chijioke Kelvin Ukpai,Godsent Erere Onyeugbo

Main category: eess.IV

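TL;DR: Optimizes the nnU-Net model for brain tumor segmentation on the BraTS Sub-Saharan Africa dataset. The original data combined with nnU-Net's online augmentation outperform an offline-augmented dataset, underlining the importance of data quality and augmentation strategy.

Details Motivation: Medical image segmentation is central to modern healthcare, and automated segmentation lets physicians concentrate on diagnosis and treatment planning. The study aims to optimize nnU-Net to improve brain tumor segmentation accuracy for Sub-Saharan Africa.

Contribution: Shows that the original data with nnU-Net's online augmentation outperform offline-augmented data, achieves a Dice score of 0.84 for whole tumor segmentation, and offers guidance on data quality and the choice of augmentation method.

Method: Trains nnU-Net on the BraTS Sub-Saharan Africa dataset, which contains 60 multimodal MRI cases, and compares training on the original data with training on an offline-augmented dataset.

Result: With the original data and online augmentation, the model reaches a Dice score of 0.84 for whole tumor segmentation, better than with offline-augmented data.

Insight: Data quality and augmentation method are critical for generalization. Especially where data are scarce, the natural variability of the original data may be preferable to artificially augmented data.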

Abstract: Medical image segmentation is a critical achievement in modern medical science, developed over decades of research. It allows for the exact delineation of anatomical and pathological features in two- or three-dimensional images by utilizing notions like pixel intensity, texture, and anatomical context. With the advent of automated segmentation, physicians and radiologists may now concentrate on diagnosis and treatment planning while intelligent computers perform routine image processing tasks. This study used the BraTS Sub-Saharan Africa dataset, a selected subset of the BraTS dataset that included 60 multimodal MRI cases from patients with glioma. Surprisingly, the nnU-Net model trained on the initial 60 instances performed better than the network trained on an offline-augmented dataset of 360 cases. Hypothetically, the offline augmentations introduced artificial anatomical variances or intensity distributions, reducing generalization. In contrast, the original dataset, when paired with nnU-Net’s robust online augmentation procedures, maintained realistic variability and produced better results. The study achieved a Dice score of 0.84 for whole tumor segmentation. These findings highlight the significance of data quality and proper augmentation approaches in constructing accurate, generalizable medical image segmentation models, particularly for under-represented locations.
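The Dice score reported above measures the overlap between a predicted and a reference segmentation mask. A minimal sketch over flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0   # assumed convention: two empty masks count as a perfect match
    return 2 * intersection / size

pred = [1, 1, 0, 0]     # two voxels predicted as tumor
target = [1, 0, 0, 0]   # one voxel labeled as tumor
score = dice_score(pred, target)
```

A Dice score of 0.84 therefore means the overlap is 84% of the average size of the two masks.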

[61] Domain-Adaptive Transformer for Data-Efficient Glioma Segmentation in Sub-Saharan MRI

Ilerioluwakiiye Abolade,Aniekan Udo,Augustine Ojo,Abdulbasit Oyetunji,Hammed Ajigbotosho,Aondana Iorumbur,Confidence Raymond,Maruf Adewole

Main category: eess.IV

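TL;DR: Proposes SegFormer3D-plus, a radiomics-guided Transformer architecture that addresses domain shift in glioma segmentation for resource-limited Sub-Saharan Africa.

Details Motivation: Limited MRI infrastructure and heterogeneous acquisition protocols in Sub-Saharan Africa cause severe domain shift, which makes glioma segmentation very challenging.

Contribution: 1) The SegFormer3D-plus architecture; 2) domain-aware sampling that combines histogram matching, radiomic features, and PCA-reduced k-means; 3) a dual-pathway encoder with a frequency-aware feature extraction module.

Method: 1) Histogram matching for intensity harmonization; 2) radiomic feature extraction with PCA reduction and k-means for sampling; 3) a dual-pathway encoder with spatial-channel attention; 4) a composite Dice-Cross-Entropy loss for boundary refinement.

Result: After fine-tuning on BraTS-Africa data, the model shows improved tumor subregion delineation and boundary localization on heterogeneous clinical scans.

Insight: Radiomics-guided domain adaptation is valuable in resource-limited settings.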

Abstract: Glioma segmentation is critical for diagnosis and treatment planning, yet remains challenging in Sub-Saharan Africa due to limited MRI infrastructure and heterogeneous acquisition protocols that induce severe domain shift. We propose SegFormer3D-plus, a radiomics-guided transformer architecture designed for robust segmentation under domain variability. Our method combines: (1) histogram matching for intensity harmonization across scanners, (2) radiomic feature extraction with PCA-reduced k-means for domain-aware stratified sampling, (3) a dual-pathway encoder with frequency-aware feature extraction and spatial-channel attention, and (4) composite Dice-Cross-Entropy loss for boundary refinement. Pretrained on BraTS 2023 and fine-tuned on BraTS-Africa data, SegFormer3D-plus demonstrates improved tumor subregion delineation and boundary localization across heterogeneous African clinical scans, highlighting the value of radiomics-guided domain adaptation for resource-limited settings.
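Histogram matching, the first step listed above, remaps the intensities of one scan so that their cumulative distribution follows that of a reference scan. The sketch below is the textbook CDF-based algorithm in NumPy, not the authors' implementation; details such as masking out background voxels are omitted.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source intensities so their CDF matches the reference CDF."""
    src_values, src_index, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source intensity, pick the reference intensity at the same CDF level.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)
    return mapped[src_index].reshape(source.shape)

source = np.arange(8, dtype=float).reshape(2, 4)
reference = 2.0 * source + 10.0        # same image with a different intensity scale
matched = match_histogram(source, reference)
```

In this toy case the two images differ only by a linear intensity change, so matching recovers the reference intensities exactly.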

[62] Morpho-Genomic Deep Learning for Ovarian Cancer Subtype and Gene Mutation Prediction from Histopathology

Gabriela Fernandes

Main category: eess.IV

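TL;DR: Proposes a method that combines nuclear morphometry with deep learning to predict ovarian cancer subtypes and gene mutations from H&E histopathology images, with good classification and inference accuracy.

Details Motivation: Ovarian cancer has high mortality because of late diagnosis and extensive heterogeneity, and current diagnostic methods struggle to reveal the genomic variations that matter for precision oncology.

Contribution: A novel hybrid deep learning pipeline that integrates nuclear morphometry with image features to predict subtypes and gene mutations directly from histopathology images.

Method: A fusion model combining a ResNet-50 CNN and a Vision Transformer (ViT) captures local morphology and global tissue context.

Result: Subtype classification accuracy is 84.2%, and the AUCs for TP53, BRCA1, and ARID1A mutations are 0.82, 0.76, and 0.73, respectively. Feature analysis links nuclear morphology directly to gene mutations.

Insight: Quantifiable histological phenotypes encode measurable genomic signals, opening a path to cost-effective precision pathology.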

Abstract: Ovarian cancer remains one of the most lethal gynecological malignancies, largely due to late diagnosis and extensive heterogeneity across subtypes. Current diagnostic methods are limited in their ability to reveal underlying genomic variations essential for precision oncology. This study introduces a novel hybrid deep learning pipeline that integrates quantitative nuclear morphometry with deep convolutional image features to perform ovarian cancer subtype classification and gene mutation inference directly from Hematoxylin and Eosin (H&E) histopathological images. Using $\sim45,000$ image patches sourced from The Cancer Genome Atlas (TCGA) and public datasets, a fusion model combining a ResNet-50 Convolutional Neural Network (CNN) encoder and a Vision Transformer (ViT) was developed. This model successfully captured both local morphological texture and global tissue context. The pipeline achieved a robust overall subtype classification accuracy of $84.2%$ (Macro AUC of $0.87 \pm 0.03$). Crucially, the model demonstrated the capacity for gene mutation inference with moderate-to-high accuracy: $AUC_{TP53} = 0.82 \pm 0.02$, $AUC_{BRCA1} = 0.76 \pm 0.04$, and $AUC_{ARID1A} = 0.73 \pm 0.05$. Feature importance analysis established direct quantitative links, revealing that nuclear solidity and eccentricity were the dominant predictors for TP53 mutation. These findings validate that quantifiable histological phenotypes encode measurable genomic signals, paving the way for cost-effective, precision histopathology in ovarian cancer triage and diagnosis.

cs.LG

[63] Data-Efficient Realized Volatility Forecasting with Vision Transformers

Emi Soroka,Artem Arzyn

Main category: cs.LG

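TL;DR: Explores using a Vision Transformer (ViT) to predict an asset's realized volatility over the next 30 days from its implied volatility surface, showing the approach is promising for financial data.

Details Motivation: Deep learning methods perform well in financial forecasting, but Transformer models for options data remain largely unexplored. The paper takes a first step toward filling this gap.

Contribution: Applies ViT to implied volatility surface data and shows that it can capture nonlinear features and seasonal patterns.

Method: A ViT takes the implied volatility surface for a single day (augmented with date information) as input and outputs a prediction of realized volatility over the next 30 days.

Result: The ViT learns seasonal patterns and complex nonlinear relationships from the implied volatility surface.

Insight: The results suggest ViT is a promising direction for volatility forecasting and model development.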

Abstract: Recent work in financial machine learning has shown the virtue of complexity: the phenomenon by which deep learning methods capable of learning highly nonlinear relationships outperform simpler approaches in financial forecasting. While transformer architectures like Informer have shown promise for financial time series forecasting, the application of transformer models for options data remains largely unexplored. We conduct preliminary studies towards the development of a transformer model for options data by training the Vision Transformer (ViT) architecture, typically used in modern image recognition and classification systems, to predict the realized volatility of an asset over the next 30 days from its implied volatility surface (augmented with date information) for a single day. We show that the ViT can learn seasonal patterns and nonlinear features from the IV surface, suggesting a promising direction for model development.
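The prediction target, realized volatility over a window, is commonly computed from log returns. The sketch below uses one widespread definition (root of the mean squared daily log return, annualized with 252 trading days); the abstract does not state which definition or annualization the authors use, so both are assumptions.

```python
import math

def realized_volatility(prices, periods_per_year=252):
    """Annualized realized volatility from a series of consecutive closing prices."""
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    mean_square = sum(r * r for r in log_returns) / len(log_returns)
    return math.sqrt(mean_square * periods_per_year)

# 31 prices give 30 daily returns, each a log return of 1%.
prices = [100.0 * math.exp(0.01 * i) for i in range(31)]
vol = realized_volatility(prices)
```

In this toy series every daily log return is 0.01, so the annualized value is 0.01·sqrt(252) ≈ 0.159.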

[64] Test Time Adaptation Using Adaptive Quantile Recalibration

Paria Mehrbod,Pedro Vianna,Geraldin Nanfack,Guy Wolf,Eugene Belilovsky

Main category: cs.LG

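TL;DR: Proposes Adaptive Quantile Recalibration (AQR), a test-time adaptation method that adjusts pre-activation distributions through channel-wise quantile alignment without retraining the model, and that stabilizes the estimation of distribution tails.

Details Motivation: Conventional domain adaptation relies on prior knowledge of the target domain or requires retraining, and existing test-time adaptation methods struggle with complex activation distributions and are tied to specific normalization layers.

Contribution: Proposes AQR, which captures the full shape of activation distributions and works with several normalization layers (BatchNorm, GroupNorm, LayerNorm), together with a robust tail calibration strategy.

Method: AQR aligns quantiles channel by channel to adjust pre-activation distributions, using source-domain statistics computed at training time, and applies tail calibration to stabilize estimation.

Result: Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C show that AQR adapts robustly across diverse settings and outperforms existing test-time adaptation baselines.

Insight: AQR is promising for dynamic and unpredictable real-world data distributions, and suits deep models deployed in resource-constrained or changing environments.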

Abstract: Domain adaptation is a key strategy for enhancing the generalizability of deep learning models in real-world scenarios, where test distributions often diverge significantly from the training domain. However, conventional approaches typically rely on prior knowledge of the target domain or require model retraining, limiting their practicality in dynamic or resource-constrained environments. Recent test-time adaptation methods based on batch normalization statistic updates allow for unsupervised adaptation, but they often fail to capture complex activation distributions and are constrained to specific normalization layers. We propose Adaptive Quantile Recalibration (AQR), a test-time adaptation technique that modifies pre-activation distributions by aligning quantiles on a channel-wise basis. AQR captures the full shape of activation distributions and generalizes across architectures employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of estimating distribution tails under varying batch sizes, AQR incorporates a robust tail calibration strategy that improves stability and precision. Our method leverages source-domain statistics computed at training time, enabling unsupervised adaptation without retraining models. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR achieves robust adaptation across diverse settings, outperforming existing test-time adaptation baselines. These results highlight AQR’s potential for deployment in real-world scenarios with dynamic and unpredictable data distributions.
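The core idea of channel-wise quantile alignment can be sketched with NumPy. This is an illustrative quantile-mapping routine under simple assumptions (activations shaped `(batch, channels)`, a fixed grid of probabilities, piecewise-linear interpolation between quantiles); it is not the authors' implementation and omits the tail calibration strategy described in the abstract. The function name and argument layout are hypothetical.

```python
import numpy as np

def quantile_recalibrate(x, source_quantiles, probs):
    """Map each channel of a test batch onto stored source-domain quantiles.

    x: (batch, channels) pre-activations at test time.
    source_quantiles: (channels, len(probs)) quantiles saved at training time.
    """
    out = np.empty_like(x, dtype=float)
    for c in range(x.shape[1]):
        test_q = np.quantile(x[:, c], probs)      # quantiles of the shifted test batch
        # A value at the p-th test quantile is moved to the p-th source quantile.
        out[:, c] = np.interp(x[:, c], test_q, source_quantiles[c])
    return out

probs = np.linspace(0.0, 1.0, 11)
source = np.linspace(-1.0, 1.0, 101).reshape(-1, 1)       # stand-in for training activations
source_q = np.quantile(source[:, 0], probs)[None, :]
shifted = 2.0 * source + 3.0                               # test batch with scale and shift
recalibrated = quantile_recalibrate(shifted, source_q, probs)
```

Unlike updating only a mean and variance, matching a grid of quantiles can also correct skew and other shape changes, at the cost of needing enough samples per batch to estimate the quantiles.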

[65] A Probabilistic U-Net Approach to Downscaling Climate Simulations

Maryam Alipourhajiagha,Pierre-Louis Lemaire,Youssef Diouane,Julie Carreau

Main category: cs.LG

TL;DR: 论文提出了一种基于概率U-Net的方法,用于将气候模拟结果从粗分辨率降尺度到细分辨率,并通过变分潜在空间捕捉不确定性。评估了四种训练目标,发现WMSE-MS-SSIM在极端事件表现优异,而afCRPS在跨尺度空间变异性上表现更好。

Details Motivation: 气候模型的计算成本高昂,通常只能生成粗分辨率输出,而许多气候变化影响研究需要更精细的分辨率。统计降尺度方法可以填补这一空白,因此作者提出了一种概率U-Net方法来解决这一问题。

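Contribution: The main contribution is a probabilistic U-Net approach for downscaling climate simulations that combines a deterministic U-Net backbone with a variational latent space. A comparison of training objectives shows that WMSE-MS-SSIM is stronger on extreme events and afCRPS is stronger on spatial variability.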

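Method: The method builds on the probabilistic U-Net, combining a deterministic U-Net backbone with a variational latent space to capture uncertainty. Four training objectives are evaluated for downscaling precipitation and temperature data: afCRPS and WMSE-MS-SSIM under three settings.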

Result: WMSE-MS-SSIM performs well on extreme events under certain settings, whereas afCRPS better captures spatial variability across scales.

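Insight: Probabilistic modeling can effectively capture uncertainty in climate simulations, and different training objectives have different strengths depending on the task, which gives practitioners flexibility.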

Abstract: Climate models are limited by heavy computational costs, often producing outputs at coarse spatial resolutions, while many climate change impact studies require finer scales. Statistical downscaling bridges this gap, and we adapt the probabilistic U-Net for this task, combining a deterministic U-Net backbone with a variational latent space to capture aleatoric uncertainty. We evaluate four training objectives, afCRPS and WMSE-MS-SSIM with three settings, for downscaling precipitation and temperature from a $16\times$ coarser resolution. Our main finding is that WMSE-MS-SSIM performs well for extremes under certain settings, whereas afCRPS better captures spatial variability across scales.
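
For readers unfamiliar with CRPS-type objectives, the sketch below computes the standard ensemble estimate of the continuous ranked probability score and its "fair" variant, which rescales the member-spread term for finite ensemble size. This is only background under stated assumptions: the abstract does not define afCRPS, so the code should not be read as the paper's loss, and the function name `ensemble_crps` and the array layout (ensemble members along the first axis) are hypothetical.

```python
import numpy as np

def ensemble_crps(members, obs, fair=False):
    """Ensemble CRPS estimate, element-wise over any trailing grid dimensions.

    members: (M, ...) array of M ensemble members (e.g. sampled downscaled fields)
    obs:     (...)    array of observations (or a scalar)
    fair:    if True, use the finite-ensemble ("fair") normalization M*(M-1)
    """
    members = np.asarray(members, dtype=float)
    m = members.shape[0]
    # Mean absolute error of the members against the observation.
    skill = np.abs(members - obs).mean(axis=0)
    # Sum of absolute differences over all ordered member pairs.
    pair = np.abs(members[:, None] - members[None, :]).sum(axis=(0, 1))
    denom = 2 * m * (m - 1) if fair else 2 * m * m
    return skill - pair / denom
```

With two members at 0 and 2 and an observation of 1, the standard estimate is 0.5 and the fair estimate is 0, which shows how the fair normalization gives more credit for ensemble spread.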

cs.CR [Back]

[66] Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology

Thomas Souverain

Main category: cs.CR

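TL;DR: This paper examines the EU AI Act's watermarking requirements for large language models (LLMs), proposes a taxonomy and an evaluation approach, and finds that current techniques do not yet fully meet the standards. It recommends further research on watermarks embedded in the low-level architecture.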

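Details Motivation: The EU AI Act requires providers to mark and detect the outputs of their general-purpose models, but watermarking techniques are diverse and evolving rapidly, and there is no unified evaluation standard.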

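Contribution: 1. Proposes a taxonomy of watermarks based on the LLM lifecycle; 2. Maps the EU criteria onto evaluations of watermarking techniques; 3. Compares existing watermarking methods and exposes their limitations.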

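Method: Existing watermarking techniques are categorized, evaluated, and compared against the Act's four standards (reliability, interoperability, effectiveness, and robustness).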

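Result: Current watermarking techniques do not yet fully satisfy the EU standards; the authors recommend further research on watermarking methods embedded in the low-level architecture.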

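Insight: Watermark evaluation must be tied to concrete application scenarios and standards, and future research should focus on embedding watermarks in the low-level architecture to achieve higher reliability.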

Abstract: To foster trustworthy Artificial Intelligence (AI) within the European Union, the AI Act requires providers to mark and detect the outputs of their general-purpose models. Article 50 and Recital 133 call for marking methods that are “sufficiently reliable, interoperable, effective and robust”. Yet, the rapidly evolving and heterogeneous landscape of watermarks for Large Language Models (LLMs) makes it difficult to determine how these four standards can be translated into concrete and measurable evaluations. Our paper addresses this challenge, anchoring the normativity of European requirements in the multiplicity of watermarking techniques. Introducing clear and distinct concepts on LLM watermarking, our contribution is threefold. (1) Watermarking Categorisation: We propose an accessible taxonomy of watermarking methods according to the stage of the LLM lifecycle at which they are applied - before, during, or after training, and during next-token distribution or sampling. (2) Watermarking Evaluation: We interpret the EU AI Act’s requirements by mapping each criterion to state-of-the-art evaluations of the robustness and detectability of the watermark, and of the quality of the LLM. Since interoperability remains largely untheorised in LLM watermarking research, we propose three normative dimensions to frame its assessment. (3) Watermarking Comparison: We compare current watermarking methods for LLMs against the operationalised European criteria and show that no approach yet satisfies all four standards. Encouraged by emerging empirical tests, we recommend further research into watermarking directly embedded within the low-level architecture of LLMs.