Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 38]
- cs.LG [Total: 4]
- cs.AI [Total: 2]
- cs.HC [Total: 1]
- cs.RO [Total: 3]
- cs.CY [Total: 1]
- physics.ed-ph [Total: 1]
- cs.DB [Total: 1]
cs.CL
[1] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Jan Fillies,Michael Peter Hoffmann,Rebecca Reichel,Roman Salzwedel,Sven Bodemer,Adrian Paschke
Main category: cs.CL
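TL;DR: The paper introduces a large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates, revealing differences in toxic language across age groups.
Details
Motivation: Existing toxic-speech datasets lack demographic information, which limits our understanding of how different age groups communicate online. Contribution: The first large-scale German dataset annotated for toxicity and enriched with age estimates, for studying age-related patterns in toxic language.
Method: Comments were collected from Instagram, TikTok, and YouTube and filtered with predefined toxic keywords; annotation combined human annotators (3,024 comments) with language models (30,024 comments).
Result: Younger users tend to use expressive language, while older users more often engage in disinformation and devaluation.
Insight: The dataset opens new opportunities for studying linguistic variation across demographics and supports the development of fairer, age-aware content moderation systems.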
Abstract: A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
[2] TrInk: Ink Generation with Transformer Network
Zezhong Jin,Shubhang Desai,Xu Chen,Biyi Fang,Zhuoyi Huang,Zhe Li,Chong-Xin Gan,Xiao Tu,Man-Wai Mak,Yan Lu,Shujie Liu
Main category: cs.CL
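TL;DR: TrInk is a Transformer-based ink generation model. By introducing scaled positional embeddings and a Gaussian memory mask, it better captures global dependencies between the input text and the generated strokes and substantially reduces character and word error rates.
Details
Motivation: Traditional ink generation methods struggle to capture global dependencies, so generated strokes align poorly with the input text. TrInk aims to address this with a Transformer architecture. Contribution: 1. A Transformer-based ink generation model; 2. scaled positional embeddings and a Gaussian memory mask to improve the cross-attention module; 3. an evaluation pipeline combining subjective and objective measures.
Method: A Transformer architecture generates ink strokes; scaled positional embeddings and a Gaussian memory mask strengthen the cross-attention module's ability to capture global dependencies.
Result: On the IAM-OnDB dataset, CER and WER fall by 35.56% and 29.66%, respectively, compared with previous methods.
Insight: The Transformer architecture performs well on ink generation, and better modeling of global dependencies is key to improving generation quality.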
Abstract: In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
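The idea of biasing cross-attention toward a text position can be made concrete with a small sketch. The abstract does not give the form of the Gaussian memory mask, so everything below is an assumption: the mask is taken to be an additive log-space bias centered on a linear (monotonic) alignment between output steps and characters, and `gaussian_mask`, `sigma`, and the toy sizes are hypothetical names and values chosen for illustration.

```python
import math


def gaussian_mask(num_steps: int, num_chars: int, sigma: float) -> list[list[float]]:
    """Additive log-space mask of shape [num_steps][num_chars].

    Assumption: row i peaks at the character position that a linear
    alignment would predict for output step i, and decays with the
    squared distance from that position.
    """
    mask = []
    for i in range(num_steps):
        center = i * (num_chars - 1) / max(num_steps - 1, 1)
        mask.append([-((j - center) ** 2) / (2 * sigma ** 2) for j in range(num_chars)])
    return mask


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy case: uniform raw attention logits (all zero), so the mask alone
# decides where each output step attends.
steps, chars = 7, 3
mask = gaussian_mask(steps, chars, sigma=0.5)
attn = [softmax([0.0 + bias for bias in row]) for row in mask]
peaks = [row.index(max(row)) for row in attn]  # most-attended character per step
```

With uniform logits the most-attended character moves monotonically from the first to the last position, which is the alignment behavior such a mask is meant to encourage.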
[3] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations
Yoshiki Takenami,Yin Jou Huang,Yugo Murawaki,Chenhui Chu
Main category: cs.CL
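TL;DR: The study finds that large language models (LLMs) are subject to the anchoring effect in price negotiation simulations, much as humans are. Reasoning models are less susceptible to the effect, while personality traits show no significant correlation with susceptibility.
Details
Motivation: To examine whether LLMs, like humans, are affected by cognitive biases such as the anchoring effect, with the goal of improving their reliability and safety in real-world applications. Contribution: A systematic study of the anchoring effect in LLM price negotiations, with an analysis of how reasoning ability and personality traits influence it.
Method: Price negotiation experiments in which seller LLM agents are instructed to apply the anchoring effect; outcomes are evaluated with both objective and subjective metrics.
Result: LLMs are indeed influenced by the anchoring effect; reasoning models are less prone to it, and personality traits show no significant correlation.
Insight: Long chains of thought may mitigate cognitive biases, which points to a direction for applying LLMs safely.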
Abstract: Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.
[4] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?
Samrajnee Ghosh,Naman Agarwal,Hemanshu Garg,Chinmay Mittal,Mausam,Parag Singla
Main category: cs.CL
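TL;DR: The paper introduces Percept-V, a dataset for evaluating multimodal large language models (MLLMs) on basic visual perception tasks. Experiments show that although MLLMs excel at many complex tasks, their performance on basic perception tasks drops significantly as problem complexity increases.
Details
Motivation: MLLMs perform well on complex tasks, but their abilities on simple visual perception tasks have received little study. The paper addresses this gap by introducing Percept-V, a new dataset of basic shapes and structures, to assess the basic perception abilities of MLLMs. Contribution: 1. Percept-V, a dataset of 7,200 program-generated images equally divided into 30 categories, for testing the basic visual perception abilities of MLLMs. 2. A systematic evaluation of state-of-the-art MLLMs such as GPT-4o and Gemini, showing that performance declines as task complexity increases.
Method: 1. Build the Percept-V dataset from program-generated images that test a range of visual perception skills. 2. Run experiments on MLLMs such as GPT-4o and Gemini and on LRMs such as OpenAI o4-mini and DeepSeek R1, and analyze their performance.
Result: MLLMs perform poorly on basic perception tasks, and performance drops significantly as complexity increases. Models show similar accuracy trends across categories that test a given cognitive skill, and some skills are harder than others.
Insight: 1. Despite excelling at complex tasks, MLLMs have limitations on basic visual perception tasks. 2. Percept-V exposes these shortcomings and offers a reference point for future research.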
Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models’ performance with increasing problem complexity across all categories. An analysis of the performances also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories testing a particular cognitive skill, and that some skills are more difficult than others.
[5] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Ming Hu,Chenglong Ma,Wei Li,Wanghan Xu,Jiamin Wu,Jucheng Hu,Tianbin Li,Guohang Zhuang,Jiaqi Liu,Yingzhou Lu,Ying Chen,Chaoyang Zhang,Cheng Tan,Jie Ying,Guocheng Wu,Shujian Gao,Pengcheng Chen,Jiashi Lin,Haitao Wu,Lulu Chen,Fengxiang Wang,Yuanyuan Zhang,Xiangyu Zhao,Feilong Tang,Encheng Su,Junzhi Ning,Xinyao Liu,Ye Du,Changkai Ji,Cheng Tang,Huihui Xu,Ziyang Chen,Ziyan Huang,Jiyao Liu,Pengfei Jiang,Yizhou Wang,Chen Tang,Jianyu Wu,Yuchen Ren,Siyuan Yan,Zhonghua Wang,Zhongxing Xu,Shiyan Su,Shangquan Sun,Runkai Zhao,Zhisheng Zhang,Yu Liu,Fudi Wang,Yuanfeng Ji,Yanzhou Su,Hongming Shan,Chunmei Feng,Jiahao Xu,Jiangtao Yan,Wenhao Tang,Diping Song,Lihao Liu,Yanyan Huang,Lequan Yu,Bin Fu,Shujun Wang,Xiaomeng Li,Xiaowei Hu,Yun Gu,Ben Fei,Zhongying Deng,Benyou Wang,Yuewen Cao,Minjie Shen,Haodong Duan,Jie Xu,Yirong Chen,Fang Yan,Hongxia Hao,Jielan Li,Jiajun Du,Yanbo Wang,Imran Razzak,Chi Zhang,Lijun Wu,Conghui He,Zhaohui Lu,Jinhai Huang,Yihao Liu,Fenghua Ling,Yuqiang Li,Aoran Wang,Qihao Zheng,Nanqing Dong,Tianfan Fu,Dongzhan Zhou,Yan Lu,Wenlong Zhang,Jin Ye,Jianfei Cai,Wanli Ouyang,Yu Qiao,Zongyuan Ge,Shixiang Tang,Junjun He,Chunfeng Song,Lei Bai,Bowen Zhou
Main category: cs.CL
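TL;DR: This survey gives a comprehensive account of scientific large language models (Sci-LLMs), from data foundations to frontier agent applications. It emphasizes the co-evolution of models and data and proposes a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
Details
Motivation: The rapid development of Sci-LLMs is shaped by the complexity of scientific data, which calls for a data-centric view of how models and data co-evolve. Contribution: A unified taxonomy of scientific data and a hierarchical model of scientific knowledge; a systematic review of over 270 training datasets and over 190 evaluation benchmarks; a discussion of future applications of Sci-LLMs in closed-loop systems and autonomous agents.
Method: A data-centric survey that analyzes the multimodal, cross-scale, and domain-specific characteristics of scientific data and discusses emerging solutions for data development, such as semi-automated annotation pipelines and expert validation.
Result: Sci-LLMs must handle heterogeneous, multi-scale, uncertainty-laden data, and evaluation is shifting from static exams toward process- and discovery-oriented assessments.
Insight: The co-evolution of scientific data and models is central; future Sci-LLMs may develop into autonomous agents within closed-loop systems that accelerate scientific discovery.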
Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands – heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
[6] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
Arash Ahmadi,Sarah Sharif,Yaser Banad
Main category: cs.CL
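TL;DR: The paper proposes an automated HFACS classification framework based on reinforcement learning with GRPO. It fine-tunes a Llama-3.1 8B language model with a multi-component reward system and synthetic data generation, yielding notable gains in classification accuracy for aviation safety analysis.
Details
Motivation: Traditional HFACS classification is limited in scalability and consistency, so an automated solution is needed to make aviation safety analysis more efficient and reliable. Contribution: 1. An automated HFACS classification framework based on GRPO; 2. a multi-component reward system and synthetic data generation to address class imbalance; 3. evidence that a small, domain-optimized model is efficient and competitive for aviation safety analysis.
Method: Reinforcement learning with GRPO fine-tunes a Llama-3.1 8B language model, combined with a multi-component reward system and synthetic data generation, to improve classification performance.
Result: Exact match accuracy rises by 350% (from 0.0400 to 0.1800) and partial match accuracy reaches 0.8800, outperforming state-of-the-art LLMs including GPT-5-mini and Gemini-2.5-flash on key metrics.
Insight: Small, domain-optimized models are feasible to deploy on resource-constrained edge devices and offer a computationally efficient option for safety-critical analysis.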
Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
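The gap between the exact-match and partial-match figures is easier to read with the metrics written out. The sketch below is illustrative only: the abstract does not define partial match, so it is assumed here to mean that the predicted and gold label sets share at least one label, and the label names are placeholders rather than real HFACS categories.

```python
def exact_match(pred: set[str], gold: set[str]) -> bool:
    """True only when the predicted label set equals the gold set."""
    return pred == gold


def partial_match(pred: set[str], gold: set[str]) -> bool:
    """Assumed definition: at least one predicted label is in the gold set."""
    return len(pred & gold) > 0


def accuracy(pairs, match) -> float:
    """Fraction of (pred, gold) pairs for which `match` holds."""
    return sum(match(p, g) for p, g in pairs) / len(pairs)


def relative_increase(before: float, after: float) -> float:
    """Relative change; 0.04 -> 0.18 is a 3.5x, i.e. 350%, increase."""
    return (after - before) / before


# Hypothetical multi-label predictions; "A".."D" are placeholder labels.
pairs = [
    ({"A", "B"}, {"A", "B"}),  # exact and partial match
    ({"A"}, {"A", "C"}),       # partial match only
    ({"D"}, {"B"}),            # no match
    ({"B", "C"}, {"C"}),       # partial match only
]

exact_acc = accuracy(pairs, exact_match)      # 1 of 4 pairs
partial_acc = accuracy(pairs, partial_match)  # 3 of 4 pairs
gain = relative_increase(0.04, 0.18)          # approximately 3.5
```

Exact match is the stricter metric because a single missing or extra label makes the whole prediction count as wrong, which is consistent with the reported exact-match score sitting far below the partial-match score.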
[7] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Han Yang,Jian Lan,Yihong Liu,Hinrich Schütze,Thomas Seidl
Main category: cs.CL
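TL;DR: The paper proposes a pixel-based autoregressive language model that renders words as images, addressing the vulnerability of conventional models to multilingual orthographic attacks.
Details
Motivation: Autoregressive language models are vulnerable to orthographic attacks, in which input text is perturbed with characters from multilingual alphabets, mainly because of the out-of-vocabulary problem in subword tokenizers and their embeddings. Contribution: A pixel-based generative language model that replaces text-based embeddings with pixel-based representations, improving robustness to noisy input and extending support to multilingual text.
Method: Words are rendered as individual images, and the resulting pixel representations replace conventional text embeddings, avoiding the limitations of subword tokenizers and the out-of-vocabulary problem.
Result: Evaluations on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark show robustness to orthographic noise and effectiveness in multilingual settings.
Insight: Pixel-based representations may offer a new route to robust language models, especially in multilingual and noisy settings.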
Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while also extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
[8] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?
Yurie Koga,Shunsuke Kando,Yusuke Miyao
Main category: cs.CL
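TL;DR: The paper investigates whether self-supervised speech models (S3Ms) exhibit the Critical Period (CP) effects observed in human language acquisition, and finds no clear evidence that they do.
Details
Motivation: To explore whether S3Ms reproduce the CP effects observed in human language acquisition; the question remains underexplored for speech models. Contribution: Tests CP effects in S3Ms and finds that the models do not follow the patterns typical of human language acquisition.
Method: Train S3Ms with varying L2 training onsets and L1 training offsets, then evaluate their phone discrimination performance.
Result: S3Ms show no clear CP effects: models with delayed L2 training onset tend to perform better on L2, and delayed L1 training offset leads to L1 forgetting.
Insight: The findings point to different mechanisms in speech models and human language acquisition, offering a new perspective for future work on language models.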
Abstract: This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of the first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effect in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting.
[9] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
Weizhi Gao,Xiaorui Liu,Feiyi Wang,Dan Lu,Junqi Yin
Main category: cs.CL
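TL;DR: The paper proposes the Decoding Memory Pipeline (DMP), an efficient method for self-consistency-based hallucination detection in large language models (LLMs). Selective inference and annealed decoding substantially speed up generation.
Details
Motivation: Existing hallucination detection methods perform poorly on sentence-level generation or rely on domain knowledge, while self-consistency methods incur high computational cost from repeated generation. The paper aims to reduce that cost. Contribution: The first study of redundancy in self-consistency methods (such as shared prefix tokens), and DMP, which generates efficiently through selective inference and annealed decoding while preserving performance.
Method: DMP builds on the observation that non-exact-answer tokens contribute little to semantic content, and accelerates generation through selective inference and annealed decoding. It is orthogonal to the model, dataset, decoding strategy, and self-consistency baseline.
Result: DMP achieves up to a 3x speedup without sacrificing AUROC performance.
Insight: Redundant tokens contribute little to semantics, so selective decoding can substantially improve efficiency; the approach may also extend to alignment and reasoning tasks.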
Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
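The redundancy described in the abstract, shared prefix tokens across sampled generations, can be illustrated with a small sketch. This is not the authors' DMP implementation; it only measures how many leading tokens a set of hypothetical sampled responses have in common, which is the portion a decoding cache could in principle reuse. The function names and the whitespace tokenization are assumptions made for the example.

```python
def shared_prefix_len(sequences: list[list[str]]) -> int:
    """Length of the longest token prefix common to all sequences."""
    if not sequences:
        return 0
    n = 0
    for column in zip(*sequences):  # zip stops at the shortest sequence
        if len(set(column)) != 1:
            break
        n += 1
    return n


def redundancy_ratio(sequences: list[list[str]]) -> float:
    """Fraction of all generated tokens that merely repeat the shared prefix.

    One sequence has to produce the prefix; each of the other k - 1
    sequences repeats it, so (k - 1) * prefix_len tokens are redundant.
    """
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    prefix = shared_prefix_len(sequences)
    return (len(sequences) - 1) * prefix / total


# Hypothetical whitespace-tokenized samples for a single prompt.
samples = [
    "The capital of France is Paris".split(),
    "The capital of France is Lyon".split(),
    "The capital of France is Paris .".split(),
]
prefix = shared_prefix_len(samples)  # tokens shared by every sample
ratio = redundancy_ratio(samples)    # share of tokens that are repeats
```

In this toy case the three samples differ only in the answer token, so more than half of all generated tokens repeat the same prefix.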
[10] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Nils Dycke,Iryna Gurevych
Main category: cs.CL
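TL;DR: The paper presents an automated counterfactual evaluation framework that tests whether large language models used as automatic review generators (ARGs) can detect faulty logic in research papers. It finds that current approaches fail to identify such flaws and offers recommendations for improvement.
Details
Motivation: LLMs show growing potential in scholarly peer review, but biases and systematic errors may threaten scientific integrity, so understanding the specific capabilities and limitations of current ARGs is essential. Contribution: 1) A fully automated counterfactual evaluation framework that specifically tests ARGs' ability to detect faulty research logic; 2) the finding that flaws in research logic have no significant effect on ARG output; 3) three recommendations for future work, plus public release of the dataset and framework.
Method: A counterfactual evaluation framework isolates and tests detection of faulty research logic under controlled conditions. A range of ARG approaches is tested to see whether their output reviews change when logic flaws are present.
Result: Current ARGs perform poorly at detecting faulty research logic; such flaws have no significant effect on the reviews they generate.
Insight: The study exposes a weakness of current ARGs in logical reasoning and underlines the need for improvement, particularly to safeguard scientific rigor and reduce bias.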
Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
[11] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
Meidan Ding,Jipeng Zhang,Wenxuan Wang,Cheng-Yi Li,Wei-Chieh Fang,Hsin-Yu Wu,Haiqin Zhong,Wenting Chen,Linlin Shen
Main category: cs.CL
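TL;DR: Med-RewardBench is the first benchmark dedicated to reward models and judges for medical multimodal large language models (MLLMs), filling a gap in medical-domain evaluation.
Details
Motivation: Medical MLLMs hold great potential for disease diagnosis and clinical decision-making, but existing benchmarks are not designed around clinical requirements and neglect dimensions such as diagnostic accuracy and clinical relevance. Contribution: Med-RewardBench, the first benchmark for medical reward models and judges, with 1,026 expert-annotated multimodal cases spanning 13 organ systems and 8 clinical departments.
Method: A three-step process produces high-quality evaluation data across six clinically critical dimensions; 32 state-of-the-art MLLMs are evaluated, and fine-tuned baseline models are developed.
Result: Current MLLMs face substantial challenges in aligning with expert judgment; the baseline models improve substantially through fine-tuning.
Insight: Medical MLLMs need more domain-specific design, and Med-RewardBench provides an evaluation framework and data to support future research.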
Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
[12] Discovering Semantic Subdimensions through Disentangled Conceptual Representations
Yunhao Zhang,Shaonan Wang,Nan Lin,Xinyi Dong,Chong Li,Chengqing Zong
Main category: cs.CL
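TL;DR: The paper proposes DCSRM, a framework that decomposes word embeddings from large language models into multiple sub-embeddings to discover interpretable semantic subdimensions. It validates their correspondence with brain activity and reveals the structural principles of semantic dimensions and the role of polarity.
Details
Motivation: Existing approaches rely on predefined semantic dimensions and overlook finer conceptual distinctions, making it hard to uncover the core dimensions of semantic organization. The paper addresses this gap by studying finer-grained semantic subdimensions. Contribution: The DCSRM model, which decomposes word embeddings and discovers interpretable semantic subdimensions; validation of the neural correlates of these subdimensions; and an account of the principles that structure semantic dimensions.
Method: DCSRM decomposes word embeddings into multiple sub-embeddings, each encoding specific semantic information; voxel-wise encoding models map the resulting subdimensions to brain activation.
Result: Fine-grained, interpretable semantic subdimensions are identified and their neural correlates validated; polarity emerges as a key factor driving the decomposition of semantic dimensions.
Insight: Semantic dimensions can be decomposed into finer-grained subdimensions that are cognitively and neuroscientifically plausible, offering a new perspective for semantic research.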
Abstract: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.
[13] Beyond the Surface: Probing the Ideological Depth of Large Language Models
Shariar Kabir,Kevin Esterling,Yue Dong
Main category: cs.CL
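TL;DR: The paper examines the ideological depth of large language models (LLMs). It measures steerability through instruction prompting and activation steering and analyzes internal mechanisms with Sparse Autoencoders (SAEs), finding that some models have a more entrenched ideological structure and contain markedly more political features than comparable models.
Details
Motivation: LLMs show pronounced ideological leanings, but the stability and depth of these positions remain unclear. Surface-level responses can be manipulated through simple prompt engineering, so it is necessary to ask whether they reflect a coherent underlying ideology. Contribution: Introduces the concept of “ideological depth” and shows experimentally that it is a quantifiable property of LLMs; also shows that steerability is a valuable window into their latent political architecture.
Method: A dual approach: 1) measure steerability using instruction prompting and activation steering; 2) analyze the models' internal mechanisms with SAEs to identify abstract ideological features.
Result: Models with lower steerability have more distinct and abstract political features; one model contains 7.3x more political features than another of similar size. Ablating a core political feature in a “deep” model produces consistent, logical shifts in its reasoning, whereas the same intervention in a “shallow” model increases refusal outputs.
Insight: Ideological depth may be an intrinsic property of LLMs, and steerability together with internal feature analysis offers a new way to understand their political architecture.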
Abstract: Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of “ideological depth” in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the “steerability” of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically “deep” model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a “shallow” model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
[14] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Xiaolong Wei,Bo Lu,Xingyu Zhang,Zhejun Zhao,Dongdong Shen,Long Xia,Dawei Yin
Main category: cs.CL
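TL;DR: The paper explores two AI-driven reward strategies within an RLAIF framework to improve the creative writing of a 7B-parameter small language model (SLM), specifically for generating Chinese greetings. Both strategies significantly outperform baselines, and the principle-guided LLM-as-a-Judge approach performs best.
Details
Motivation: Large language models (LLMs) show strong creative writing ability, but their high computational cost limits widespread use. Enhancing SLMs is a viable alternative, yet Supervised Fine-Tuning (SFT) struggles with novelty and Reinforcement Learning from Human Feedback (RLHF) is costly. Contribution: 1. Two AI-driven reward strategies: a reward model (RM) trained on preference data curated by a multi-agent rejection sampling framework, and a principle-guided LLM-as-a-Judge. 2. Evidence that LLM-as-a-Judge has advantages in generation quality, training efficiency, and reduced dependence on human-annotated data. 3. Automated evaluation methods that align strongly with human judgments, with publicly available code and data.
Method: 1. Train an RM on high-quality preference data curated by a multi-agent rejection sampling framework. 2. Design a principle-guided LLM-as-a-Judge whose reward function is optimized via adversarial training with a reflection mechanism.
Result: Both methods significantly improve the SLM's creative output, but LLM-as-a-Judge yields better generation quality and efficiency. The automated evaluation methods align strongly with human judgments.
Insight: The principle-guided LLM-as-a-Judge offers a more efficient and scalable path to creative SLMs and reduces dependence on human-annotated data.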
Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
[15] HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho,Rafael M. O. Cruz,Francimaria R. S. Nascimento,George D. C. Cavalcanti
Main category: cs.CL
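TL;DR: The paper proposes HSFN, an automatic classifier selection method based on hierarchical clustering, for building heterogeneous ensembles that improve diversity, performance, and robustness in fake news detection.
Details
Motivation: Psychological biases such as confirmation bias make people prone to believing and spreading fake news, with significant consequences in domains such as public health and politics. Ensemble methods are effective, but their performance depends heavily on classifier diversity, and selecting genuinely diverse classifiers remains a challenge. Contribution: A novel automatic classifier selection method that uses hierarchical clustering and a diversity-first strategy to build heterogeneous ensembles, improving the accuracy and robustness of fake news detection.
Method: The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. HierarchySelect then explores the hierarchy levels and selects one pool of classifiers per level, each representing a distinct intra-pool diversity. The most diverse pool is chosen for ensemble construction.
Result: In experiments with 40 heterogeneous classifiers on six datasets from different application domains, the method achieves the highest accuracy on two of the six datasets.
Insight: The study underlines the importance of classifier diversity for ensemble performance and proposes a scalable hierarchical selection strategy for fake news detection.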
Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity while also taking performance into account. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. From these, the most diverse pool is identified and selected for ensemble construction. The selection process incorporates an evaluation metric reflecting each classifier’s performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project’s repository: https://github.com/SaraBCoutinho/HSFN.
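The first and last steps of the selection procedure, computing pairwise diversity and picking the most diverse pool, can be sketched as follows. This is a minimal illustration, not the HSFN implementation: the abstract does not name the diversity measure, so plain prediction disagreement is assumed, the hierarchical clustering step is replaced by a hand-written list of candidate pools, and the classifier names and predictions are hypothetical.

```python
from itertools import combinations


def disagreement(a: list[int], b: list[int]) -> float:
    """Fraction of samples on which two classifiers predict different labels."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pool_diversity(preds: dict[str, list[int]], pool: list[str]) -> float:
    """Mean pairwise disagreement among the classifiers in `pool`."""
    pairs = list(combinations(pool, 2))
    return sum(disagreement(preds[i], preds[j]) for i, j in pairs) / len(pairs)


# Hypothetical predictions of four classifiers on six validation samples.
preds = {
    "svm": [1, 0, 1, 1, 0, 0],
    "lr":  [1, 0, 1, 1, 0, 1],  # near-duplicate of svm
    "knn": [0, 0, 1, 0, 1, 0],
    "nb":  [1, 1, 0, 1, 0, 0],
}

# Candidate pools, standing in for "one pool per hierarchy level".
candidate_pools = [
    ["svm", "lr"],
    ["svm", "knn", "nb"],
    ["svm", "lr", "knn", "nb"],
]
best_pool = max(candidate_pools, key=lambda p: pool_diversity(preds, p))
```

Adding the near-duplicate `lr` to the pool lowers the mean pairwise disagreement, which is why the three-member pool wins here; the method in the abstract additionally weighs each classifier's performance, which this sketch leaves out.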
[16] L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models
Aishwarya Mirashi,Ananya Joshi,Raviraj Joshi
Main category: cs.CL
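TL;DR: The paper presents MahaSTS, a Sentence Textual Similarity (STS) dataset for Marathi, and MahaSBERT-STS-v2, a model for regression-based similarity scoring. The dataset contains 16,860 sentence pairs uniformly distributed across the 0-5 score range, which improves model stability. Experiments show that the dataset enables effective training for sentence similarity tasks in a low-resource setting.
Details
Motivation: Low-resource languages such as Marathi lack high-quality sentence similarity datasets and models, which holds back NLP work. The paper aims to fill this gap. Contribution: 1. Releases MahaSTS, a human-annotated Marathi STS dataset; 2. presents the fine-tuned MahaSBERT-STS-v2 model; 3. shows the benefit of uniformly distributed labels for model performance.
Method: 1. Build a dataset of 16,860 Marathi sentence pairs with continuous similarity scores; 2. distribute labels uniformly across score buckets to reduce label bias; 3. fine-tune MahaSBERT and compare it with other BERT-based models.
Result: MahaSBERT-STS-v2 outperforms the other models on the Marathi STS task, supporting the effectiveness of the dataset and fine-tuning strategy.
Insight: 1. Uniformly distributed labels help model stability; 2. for low-resource languages, human annotation and targeted fine-tuning are crucial.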
Abstract: We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
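The balanced-bucket design can be checked with a few lines of arithmetic. The sketch below assumes six equal-width buckets over the 0-5 range; the abstract only states that the buckets are score-based and span the full range, so the bucket boundaries, `bucket_of`, and the example scores are all assumptions.

```python
from collections import Counter

NUM_BUCKETS = 6
MAX_SCORE = 5.0


def bucket_of(score: float) -> int:
    """Map a similarity score in [0, 5] to one of six equal-width buckets.

    Equal-width buckets are an assumption made for this sketch.
    """
    if not 0.0 <= score <= MAX_SCORE:
        raise ValueError(f"score out of range: {score}")
    # The top edge (5.0) is folded into the last bucket.
    return min(int(score * NUM_BUCKETS / MAX_SCORE), NUM_BUCKETS - 1)


def bucket_counts(scores: list[float]) -> list[int]:
    """Number of scores falling in each bucket, in bucket order."""
    counts = Counter(bucket_of(s) for s in scores)
    return [counts.get(b, 0) for b in range(NUM_BUCKETS)]


# A uniform split of 16,860 pairs over six buckets gives 2,810 pairs each.
pairs_per_bucket = 16_860 // NUM_BUCKETS

# Hypothetical scores, two per bucket.
scores = [0.1, 0.7, 1.0, 1.5, 1.9, 2.4, 2.6, 3.2, 3.5, 4.0, 4.3, 5.0]
counts = bucket_counts(scores)
```

A check like `bucket_counts` is a quick way to confirm that a training split keeps the balance the dataset was built with.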
[17] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet,Jan Zimny,Tinne Tuytelaars,Iryna Gurevych
Main category: cs.CL
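TL;DR: The paper introduces the Misviz and Misviz-synth datasets for detecting misleading visualizations and evaluates a range of models, finding that the task remains challenging.
Details
Motivation: Misleading visualizations spread misinformation on social media; detecting the chart design rules they violate is important for limiting that spread. Contribution: 1) Releases Misviz (2,604 real-world visualizations) and Misviz-synth (81,814 synthetic visualizations); 2) a comprehensive evaluation of multiple model types.
Method: 1) Build real-world and synthetic datasets; 2) run detection with multimodal large language models (MLLMs), rule-based systems, and fine-tuned classifiers.
Result: The task is highly challenging, and existing models show limited performance.
Insight: The datasets are a valuable resource for research, but detecting misleading visualizations still requires better methods.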
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
[18] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang,Di Liang,Minlong Peng
Main category: cs.CL
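TL;DR: The paper proposes CPI-FT, a new framework that identifies and isolates core parameter regions to reduce task interference and catastrophic forgetting when fine-tuning for downstream tasks.
Details
Motivation: Conventional full-parameter fine-tuning causes interference between tasks (the seesaw phenomenon), where gains on some tasks come at the expense of others. Contribution: Proposes the Core Parameter Isolation Fine-Tuning (CPI-FT) framework, which mitigates task interference and forgetting through parameter isolation and fusion.
Method: 1) Fine-tunes on each task independently to identify core parameter regions; 2) Clusters tasks by region overlap; 3) Transplants core parameters directly and fuses non-core parameters via SLERP; 4) Runs lightweight mixed-task training with core regions frozen.
Result: On multi-task benchmarks, CPI-FT significantly outperforms conventional multi-task and multi-stage fine-tuning methods.
Insight: Core parameter isolation and task grouping effectively reduce conflicting parameter updates, improving the performance and stability of multi-task fine-tuning.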
Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
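The fusion step in this abstract (transplant core parameters, blend the rest with SLERP) can be sketched on flat parameter vectors as follows. This is a toy illustration under stated assumptions rather than the CPI-FT implementation: the boolean `core_mask`, the interpolation weight `t`, and the `fuse` helper are hypothetical, and real models would apply this per tensor.

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # Degenerate input: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]

def fuse(core_mask, task_params, other_params, t=0.5):
    """Keep task_params at core positions; SLERP the two models elsewhere."""
    idx = [i for i, is_core in enumerate(core_mask) if not is_core]
    blended = slerp([task_params[i] for i in idx],
                    [other_params[i] for i in idx], t)
    fused = list(task_params)
    for i, value in zip(idx, blended):
        fused[i] = value
    return fused

fused = fuse([True, False, False], [5.0, 1.0, 0.0], [9.0, 0.0, 1.0])
```

In the example, the first entry is treated as core and is copied unchanged, while the remaining two entries are interpolated along the arc between the two models instead of along the straight line.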
[19] Reasoning-Intensive Regression
Diane Tchuindjo,Omar Khattab
Main category: cs.CL
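TL;DR: The paper defines a task called reasoning-intensive regression (RiR) and significantly improves performance on it with the MENTAT method.
Details
Motivation: Conventional methods perform poorly on reasoning-intensive regression tasks that require deep text analysis, especially when task-specific data and compute are limited. Contribution: Introduces the concept of RiR tasks and designs MENTAT, which combines batch-reflective prompt optimization with neural ensemble learning.
Method: MENTAT strengthens the model’s reasoning on RiR tasks through batch-reflective prompt optimization and neural ensemble learning.
Result: Across the three RiR tasks studied, MENTAT improves on both prompting frozen LLMs and fine-tuning Transformer encoders by up to 65%.
Insight: Reasoning-intensive tasks call for more careful design; MENTAT’s success shows the potential of prompt optimization and ensembling, though room for improvement remains.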
Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
[20] PiCSAR: Probabilistic Confidence Selection And Ranking
Joshua Ong Jun Leang,Zheng Zhao,Aryo Pradipta Gema,Sohee Yang,Wai-Chung Kwan,Xuanli He,Wenda Li,Pasquale Minervini,Eleonora Giunchiglia,Shay B. Cohen
Main category: cs.CL
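TL;DR: PiCSAR is a simple, training-free method that scores candidate generations by their joint log-likelihood, substantially improving the accuracy of LLMs and LRMs on reasoning tasks and outperforming baselines.
Details
Motivation: Reasoning tasks need a scoring function that can identify correct reasoning chains without access to ground-truth answers; PiCSAR is proposed for this purpose. Contribution: Proposes PiCSAR, which scores candidates by the joint log-likelihood of the reasoning chain and final answer, decomposed into reasoning confidence and answer confidence.
Method: A training-free scoring method based on joint log-likelihood that evaluates candidate generations through reasoning confidence and answer confidence.
Result: Achieves strong gains on benchmarks such as MATH500 and AIME2025 (+10.18 and +9.81, respectively), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons.
Insight: Correct reasoning chains exhibit significantly higher reasoning and answer confidence, which supports the effectiveness of PiCSAR.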
Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
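The decomposition mentioned in this abstract follows from the chain rule: log p(r, y | x) = log p(r | x) + log p(y | r, x), where r is the reasoning and y the final answer. Below is a minimal best-of-n selection sketch built on that identity. It assumes per-token log-probabilities are already available for each candidate; the dictionary layout and function names are hypothetical, and any length normalization the authors may apply is not stated in the abstract and is omitted here.

```python
def joint_score(reasoning_logprobs, answer_logprobs):
    """Joint log-likelihood = reasoning confidence + answer confidence."""
    reasoning_conf = sum(reasoning_logprobs)  # log p(r | x)
    answer_conf = sum(answer_logprobs)        # log p(y | r, x)
    return reasoning_conf + answer_conf

def select_best(candidates):
    """Return the candidate with the highest joint log-likelihood."""
    return max(
        candidates,
        key=lambda c: joint_score(c["reasoning_logprobs"], c["answer_logprobs"]),
    )

# Toy candidates with made-up token log-probabilities.
candidates = [
    {"answer": "A", "reasoning_logprobs": [-0.5, -0.5], "answer_logprobs": [-0.1]},
    {"answer": "B", "reasoning_logprobs": [-0.2, -0.1], "answer_logprobs": [-0.3]},
]
best = select_best(candidates)
```

No ground-truth answer or trained reward model is involved: the score relies only on probabilities the generating model already produces.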
[21] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Inés Altemir Marinas,Anastasiia Kucherenko,Andrei Kucharavy
Main category: cs.CL
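TL;DR: This paper presents an ElasticSearch-based framework for indexing and analyzing large language model (LLM) training datasets and applies it to SwissAI’s FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance and providing practical tools for safer, more accountable AI systems.
Details
Motivation: Although training data quality is critical for LLMs, prior research on harmful content has been limited to small samples because of computational constraints. This paper addresses that limitation with a tool capable of analyzing large-scale datasets. Contribution: Proposes an efficient ElasticSearch framework for indexing and analyzing large-scale LLM training datasets and validates its fast query performance on the FineWeb-2 corpus.
Method: Builds an indexing and analysis pipeline with ElasticSearch and applies it to the 1.5TB multilingual FineWeb-2 corpus.
Result: Most searches complete in milliseconds and all complete in under 2 seconds, demonstrating the feasibility of real-time dataset analysis.
Insight: Effective indexing and analysis tools can improve quality control over training data, providing technical support for building safer and more transparent AI systems.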
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI’s FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
cs.CV [Back]
[22] 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving
Ali K. AlShami,Ryan Rabinowitz,Maged Shoman,Jianwu Fang,Lukas Picek,Shao-Yuan Lo,Steve Cruz,Khang Nhut Lam,Nachiket Kamod,Lei-Lei Li,Jugal Kalita,Terrance E. Boult
Main category: cs.CV
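TL;DR: 2COOOL is an ICCV 2025 workshop focused on novel scenarios in autonomous driving, especially the detection and handling of out-of-distribution (OOD) hazards, with the aim of driving innovation in algorithms and systems.
Details
Motivation: Despite progress in autonomous driving algorithms, fully safe self-driving has not yet been achieved, largely because novel scenarios and OOD hazards are hard to handle. Contribution: Provides a dedicated forum on OOD hazard handling that brings together techniques from anomaly detection, open-set recognition, open-vocabulary modeling, and related fields to advance research and practice.
Method: Combines multimodal sensor data with vision-language models, proposes new benchmarks and methodologies, and focuses on hazard detection and safe driving practices.
Result: The workshop brings together experts from academia and industry to advance OOD hazard handling.
Insight: Handling novel scenarios is key to autonomous driving safety and requires cross-disciplinary integration; vision-language models in particular offer a new direction for hazard understanding.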
Abstract: As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don’t we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.
[23] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
Xurui Peng,Hong Liu,Chenqian Yan,Rui Ma,Fangmin Chen,Xing Wang,Zhihua Wu,Songwei Liu,Mingbao Lin
Main category: cs.CV
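TL;DR: ERTACache is an efficient acceleration framework for diffusion models that rectifies the cumulative error introduced by caching (feature shift error and step amplification error), achieving substantial inference speedup while preserving quality.
Details
Motivation: Diffusion models are computationally expensive because of their iterative inference process. Existing caching strategies speed up inference, but inaccurate cached outputs and error propagation under fixed timestep schedules often cause noticeable quality degradation. Contribution: 1) Formally analyzes the cumulative error introduced by caching and decomposes it into feature shift error and step amplification error; 2) Proposes the ERTACache framework, which jointly rectifies both error types; 3) Enables efficient sampling through offline residual profiling and dynamic timestep adjustment.
Method: 1) Offline residual profiling identifies reusable steps; 2) Integration intervals are adjusted dynamically; 3) Cache-induced errors are approximated with a closed-form residual linearization model.
Result: On image and video generation benchmarks, ERTACache achieves up to 2x inference speedup while preserving or improving visual quality. On the Wan2.1 video diffusion model, it delivers 2x acceleration with minimal quality degradation.
Insight: By jointly correcting feature and timestep errors, ERTACache shows that an efficient caching strategy can deliver both speed and quality.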
Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
[24] Video-LLMs with Temporal Visual Screening
Zheyu Fan,Jiateng Liu,Yuji Zhang,Zihan Wang,Yi R. Fung,Manling Li,Heng Ji
Main category: cs.CV
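TL;DR: The paper proposes the Temporal Visual Screening (TVS) task, which improves video question answering and understanding by retaining key video segments, reconstructing queries, and preserving answer consistency. TVS yields significant gains in both training and inference.
Details
Motivation: Current Video-LLMs struggle to capture fine-grained temporal semantics because of sparse frame sampling and a lack of inter-frame reasoning supervision during training. Inspired by cognitive science, the authors propose the TVS task to address this. Contribution: 1) Proposes the TVS task, which uniformly pre-processes video question answering and instruction tuning data; 2) Designs the ReSimplifyIt baseline, which outperforms prior approaches; 3) Builds the first TVS benchmark.
Method: TVS retains key video segments, synchronously reconstructs queries, and preserves answer consistency; it is integrated into training and inference pipelines as a modular front-end adapter task.
Result: Experiments show relative gains of 7.33% in training and 34.6% in inference. ReSimplifyIt improves the F1 score on video trimming by 0.47.
Insight: By screening temporal information, TVS significantly improves video-language understanding and highlights the importance of modeling temporal semantics.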
Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, and inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping the invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F1 score on video trimming while achieving competitive query rewriting performance. Experiments show that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
[25] ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments
Zhe Han,Charlie Budd,Gongyu Zhang,Huanyu Tian,Christos Bergeles,Tom Vercauteren
Main category: cs.CV
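TL;DR: ROBUST-MIPS is a dataset combining skeletal pose and instance segmentation annotations for surgical tools, intended to advance research on surgical tool localisation through efficient pose annotation.
Details
Motivation: Surgical tool localisation is a foundation of computer-assisted interventional technologies, but existing deep learning methods depend on large amounts of annotated data. Skeletal pose annotation balances semantic richness against annotation effort and can accelerate the growth of annotated data. Contribution: Presents the ROBUST-MIPS dataset, which combines pose and instance segmentation annotations and supports joint study of the two annotation styles; provides benchmark models and a custom annotation tool to encourage adoption.
Method: Adds skeletal pose annotations to the existing ROBUST-MIS dataset and establishes a simple benchmark using popular pose estimation methods.
Result: Experiments show that pose annotations support high-quality surgical tool localisation.
Insight: Skeletal pose annotation is an efficient annotation approach that is well suited to surgical tool localisation.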
Abstract: Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.
[26] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Jie Jiang,Qi Yang,Bolin Ni,Shiming Xiang,Han Hu,Houwen Peng
Main category: cs.CV
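TL;DR: R-4B is a multimodal large language model that uses bi-mode annealing and reinforcement learning to decide adaptively whether to engage its thinking process, avoiding redundant reasoning on simple problems.
Details
Motivation: Existing MLLMs still perform complex step-by-step reasoning on simple problems, wasting compute. R-4B aims to reason efficiently by adaptively distinguishing problem complexity. Contribution: 1. Proposes bi-mode annealing and Bi-mode Policy Optimization (BPO) to train the model to distinguish problem complexity; 2. Performs strongly on 25 benchmarks, outperforming Qwen2.5-VL-7B on most tasks and approaching the larger Kimi-VL-A3B-Thinking-2506 on reasoning-intensive benchmarks at lower computational cost.
Method: 1. Trains the model with bi-mode annealing, covering both thinking and non-thinking modes; 2. Applies BPO reinforcement learning to improve the model’s decision about when to think; 3. Uses two-stage training on a multi-domain dataset.
Result: R-4B achieves state-of-the-art performance on multiple benchmarks, particularly on reasoning-intensive tasks, with better computational efficiency.
Insight: Adaptive thinking can significantly improve MLLM efficiency, and the bi-mode training strategy offers a flexible solution for both complex and simple tasks.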
Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model’s accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
[27] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
Harris Song,Tuan-Anh Vu,Sanjith Menon,Sriram Narasimhan,M. Khalid Jawed
Main category: cs.CV
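TL;DR: This paper presents HiddenObject, a Mamba-based multimodal fusion framework that fuses RGB, thermal, and depth data to improve detection of hidden or partially occluded objects.
Details
Motivation: Traditional RGB-based methods perform poorly under difficult conditions such as occlusion, camouflage, and lighting variation, motivating more robust, modality-agnostic multimodal fusion methods. Contribution: Proposes the HiddenObject framework, which fuses RGB, thermal, and depth data through a Mamba-based mechanism and significantly improves hidden object detection under challenging conditions.
Method: Uses a Mamba-based fusion mechanism that extracts modality-specific features and fuses them into a unified representation, capturing complementary signals across modalities.
Result: The method is validated on multiple benchmark datasets, matching or exceeding existing methods.
Insight: Mamba-based fusion architectures offer clear advantages for multimodal object detection, especially under complex visual conditions, and expose the limitations of current unimodal and naive fusion strategies.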
Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naive fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
[28] RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
Ao Shen,Xueming Fu,Junfeng Jiang,Qiang Zeng,Ye Tang,Zhengming Chen,Luming Nong,Feng Wang,S. Kevin Zhou
Main category: cs.CV
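TL;DR: RadGS-Reg is a new framework that addresses the accuracy and real-time requirements of CT/X-ray registration through joint 3D Radiative Gaussians reconstruction and 3D/3D registration, using a Counterfactual Attention Learning mechanism and an improved training strategy to significantly boost performance.
Details
Motivation: Traditional CT/X-ray registration methods suffer from spatial information loss and a domain gap, while multi-view reconstruction methods are limited by dense-view requirements and sensitivity to noise. RadGS-Reg aims to resolve these issues through joint learning. Contribution: 1. Proposes the RadGS-Reg framework, combining 3D Radiative Gaussians reconstruction with 3D/3D registration; 2. Introduces a Counterfactual Attention Learning mechanism that focuses on vertebral regions; 3. Proposes a patient-specific pre-training strategy that progressively transfers from simulated to real data.
Method: 1. Uses a learning-based RadGS reconstruction method with the CAL mechanism to handle noisy X-rays; 2. Applies a progressive pre-training strategy to learn vertebral shape priors; 3. Jointly optimizes the reconstruction and registration tasks.
Result: On in-house datasets, RadGS-Reg outperforms existing methods on both tasks, achieving state-of-the-art performance.
Insight: Joint learning and attention mechanisms can effectively address spatial information loss and noise in CT/X-ray registration, while the pre-training strategy improves adaptation to real data.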
Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional “render and compare” methods, relying on iterative projection and comparison, suffer from spatial information loss and a domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-ray vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
[29] SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4
Kevin Mayer,Alex Vesel,Xinyi Zhao,Martin Fischer
Main category: cs.CV
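TL;DR: SYNBUILD-3D is a large, multi-modal, semantically rich synthetic 3D building dataset containing over 6.2 million residential buildings at LoD 4, supporting research on automated 3D modeling.
Details
Motivation: The lack of large-scale annotated 3D building datasets in the public domain has limited progress in automated 3D modeling. The authors therefore propose SYNBUILD-3D to fill this gap and support research on multi-modal generative algorithms. Contribution: 1. Releases the SYNBUILD-3D dataset of over 6.2 million residential buildings at LoD 4; 2. Provides three modalities: 3D wireframe graphs, floor plan images, and LiDAR-like roof point clouds; 3. Supports generative algorithms that enforce semantic-geometric consistency.
Method: Uses synthetic data generation to build a dataset with three modalities: 1. a semantically annotated 3D wireframe graph at LoD 4; 2. the corresponding floor plan images; 3. a LiDAR-like roof point cloud. The semantic annotations are derived from the floor plan images.
Result: The dataset is publicly available and supports future work on multi-modal generative algorithms that automate 3D modeling from predefined floor plan layouts and roof geometries.
Insight: Synthetic data shows promise for 3D building modeling, and the multi-modal design opens new research directions for generative algorithms, particularly in combining semantic and geometric consistency.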
Abstract: 3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
[30] Radially Distorted Homographies, Revisited
Mårten Wadenbäck,Marcus Valtonen Örnhag,Johan Edstedt
Main category: cs.CV
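TL;DR: This paper proposes a novel, unified approach for jointly estimating a homography and radial distortion, covering three distinct configurations, and constructs fast, stable, and accurate minimal solvers.
Details
Motivation: In real images, geometric distortion caused by the camera lens (radial distortion in particular) interferes with homography estimation. Prior work treated the three configurations separately and lacked a unified solution. Contribution: Proposes a unified approach to radially distorted homographies in all three configurations and develops faster minimal solvers with comparable accuracy.
Method: Treats the homography and radial distortion in a unified mathematical formulation and designs a minimal solver for each configuration. Experiments verify speed and accuracy.
Result: On established benchmarks, including fisheye camera images, the proposed solvers are faster than existing methods while maintaining similar accuracy.
Insight: A unified mathematical framework can effectively solve the joint estimation of homography and radial distortion, providing a new tool for real-time computer vision tasks.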
Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
[31] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
Zhenghao He,Sanchit Sinha,Guangzhi Xiong,Aidong Zhang
Main category: cs.CV
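TL;DR: This paper proposes the Global Concept Activation Vector (GCAV) framework, which uses a cross-layer consistent concept representation to resolve the inconsistencies that arise when CAVs are computed independently at different layers.
Details
Motivation: When existing CAV methods are computed independently at different layers, the resulting concepts are often inconsistent, making reliable cross-layer comparison difficult. Contribution: 1. Proposes the GCAV framework, which unifies CAVs into a semantically consistent global representation; 2. Introduces TGCAV to evaluate the effectiveness of GCAV; 3. Shows experimentally that GCAV reduces concept variance and improves robustness.
Method: 1. Uses contrastive learning to align concepts across layers; 2. Uses an attention-based fusion mechanism to construct a global CAV; 3. Designs the TGCAV testing framework.
Result: Experiments show that GCAV significantly reduces the variance of TCAV scores and improves concept localization and robustness against adversarial perturbations.
Insight: A global concept representation reflects more consistently what the model has learned about a concept, strengthening interpretability.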
Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
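For readers unfamiliar with the quantity whose variance this abstract targets: in the standard TCAV formulation, the score for a concept at a given layer is the fraction of inputs whose class-logit gradient (taken with respect to that layer's activations) has a positive dot product with the concept activation vector. The sketch below computes that score and its spread across layers. The gradients, layer names, and per-layer scores are made-up toy values, and the contrastive alignment and attention fusion used by GCAV are not shown.

```python
import statistics

def tcav_score(gradients, cav):
    """Fraction of examples with a positive directional derivative along the CAV."""
    positive = 0
    for grad in gradients:
        directional = sum(g * c for g, c in zip(grad, cav))
        if directional > 0:
            positive += 1
    return positive / len(gradients)

def cross_layer_variance(scores_by_layer):
    """Population variance of per-layer TCAV scores (lower means more consistent)."""
    return statistics.pvariance(list(scores_by_layer.values()))

# Toy gradients for four inputs at one layer, and a toy CAV.
gradients = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
cav = [1.0, 0.0]
score = tcav_score(gradients, cav)

# Hypothetical per-layer scores for the same concept.
variance = cross_layer_variance({"layer1": 0.5, "layer2": 0.75, "layer3": 1.0})
```

A concept whose score swings between layers, as in the toy dictionary above, is the kind of cross-layer inconsistency the abstract describes.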
[32] Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang,Xiaoming Liu
Main category: cs.CV
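TL;DR: The paper proposes VICP, a new framework that uses Visual In-Context Prompting to re-identify objects from unseen categories without parameter adaptation, combining the strengths of large language models (LLMs) and vision foundation models (VFMs).
Details
Motivation: Current object re-identification methods are typically trained for specific domains, generalize poorly, and require large amounts of labeled data. Self-supervised learning reduces annotation needs but struggles to capture the identity-sensitive features that are critical for re-identification. Contribution: 1. Proposes the VICP framework, which uses in-context examples as prompts so the model generalizes directly to unseen categories. 2. Combines an LLM and a VFM, extracting identity-discriminative features through task-specific dynamic visual prompts. 3. Introduces the ShopID10K dataset, which supports multi-view and cross-domain testing.
Method: 1. Uses an LLM to infer semantic identity rules from a few positive/negative pairs. 2. Guides a VFM (e.g., DINO) to extract features via dynamic visual prompts. 3. Aligns the LLM’s semantic concepts with the VFM’s pre-trained prior.
Result: Experiments show that VICP outperforms baselines on unseen categories on ShopID10K and other re-identification benchmarks.
Insight: Combining the priors of language and vision models enables generalization to unseen categories from in-context examples alone, removing the need for dataset-specific retraining.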
Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFMs): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM’s pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
[33] PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho,Chen Guo,Po-Chen Wu,Ivan Shugurov,Chengcheng Tang,Abhay Mittal,Sizhe An,Manuel Kaufmann,Linguang Zhang
Main category: cs.CV
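TL;DR: PHD is a personalized 3D human mesh recovery method based on user-specific shape information. It decouples shape calibration from pose fitting and uses a Point Diffusion Transformer to refine 3D poses, significantly improving pose estimation accuracy.
Details
Motivation: Conventional human pose estimation methods are designed to be user-agnostic and do not jointly account for user-specific body shape and the plausibility of 3D poses, which limits pose accuracy. PHD aims to address this. Contribution: 1. Proposes a personalized 3D human pose estimation framework that decouples shape calibration from pose fitting. 2. Develops a Point Diffusion Transformer as a 3D pose prior that iteratively guides pose fitting via a Point Distillation Sampling loss. 3. Significantly improves both pelvis-aligned and absolute pose accuracy.
Method: 1. First calibrates the user’s body shape. 2. Conditioned on that shape, iteratively refines the 3D pose with a Point Diffusion Transformer. 3. Guides the fitting process with a Point Distillation Sampling loss.
Result: Experiments show that PHD outperforms conventional methods in pose accuracy, especially absolute pose accuracy, and is data-efficient, requiring only synthetic data for training.
Insight: By decoupling shape and pose learning and using point diffusion, PHD shows that 3D pose estimation can improve while relying less on 2D constraints.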
Abstract: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user’s body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy – an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: https://phd-pose.github.io/
[34] GLENDA: Gynecologic Laparoscopy Endometriosis Dataset
Andreas Leibetseder,Sabrina Kletz,Klaus Schoeffmann,Simon Keckstein,Jörg Keckstein
Main category: cs.CV
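TL;DR: The paper releases GLENDA, the first gynecologic laparoscopy image dataset focused on endometriosis. It includes region-based annotations and serves as a resource for computer vision and machine learning in surgical analysis.
Details
Motivation: Analysis of gynecologic laparoscopy recordings is currently done by hand, which is slow and inefficient, so automated tools are needed. Data scarcity in the medical field has held back the development of such tools.
Contribution: The first gynecologic laparoscopy image dataset dedicated to endometriosis, with annotations created together with medical experts.
Method: Laparoscopic surgery images were collected and annotated in collaboration with leading medical experts, with region-based annotations of endometriosis.
Result: GLENDA provides annotated data that can support the development of automated surgical analysis tools.
Insight: Domain-specific datasets are important for applying AI in medicine, and GLENDA opens new opportunities for research on the diagnosis and treatment of endometriosis.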
Abstract: Gynecologic laparoscopy, a type of minimally invasive surgery (MIS), is performed via a live feed of a patient’s abdomen that shows the insertion and handling of various instruments used for treatment. This kind of surgical intervention not only facilitates a great variety of treatments; the possibility of recording the video streams is also essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, manually analyzing surgical recordings, as is done in current practice, is usually tedious and time-consuming. To improve upon this situation, more sophisticated computer vision and machine learning approaches are being actively developed. Since most such approaches rely heavily on sample data, which is only sparsely available in the medical field, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA), an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. The dataset is the first of its kind and has been created in collaboration with leading medical experts in the field.
[35] Identifying Surgical Instruments in Laparoscopy Using Deep Learning Instance Segmentation
Sabrina Kletz,Klaus Schoeffmann,Jenny Benois-Pineau,Heinrich Husslein
Main category: cs.CV
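TL;DR: The paper studies deep learning instance segmentation for identifying surgical instruments in laparoscopic video. Instruments are localized and segmented with high accuracy, but recognizing the instrument type remains challenging.
Details
Motivation: Recording laparoscopic surgery is now routine, but automatic content indexing, and in particular the segmentation and recognition of surgical instruments, remains difficult.
Contribution: 1. Evaluates a region-based fully convolutional network for instance-aware instrument segmentation and multi-class instrument recognition. 2. Shows that instruments can be segmented accurately even with a moderately small number of training examples.
Method: A region-based fully convolutional network performs instance segmentation (instrument versus background) and multi-class recognition of the instrument type.
Result: Segmentation accuracy is high, but type recognition remains difficult because surgical instruments look very similar to one another.
Insight: Instrument segmentation works well with limited training data; distinguishing visually similar instruments still needs further work.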
Abstract: Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background) we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with a pretty high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
[36] SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Jakub Straka,Ivan Gruber
Main category: cs.CV
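TL;DR: The paper presents SatDINO, a representation learning model for remote sensing imagery based on the self-supervised method DINO. Experiments show that it outperforms masked autoencoder (MAE) based methods on multiple benchmarks, and the paper introduces enhancements such as GSD encoding and adaptive view sampling.
Details
Motivation: Remote sensing offers large amounts of unlabeled data, and the commonly used MAE-based self-supervised methods leave room for improvement, so the paper explores a different pretraining approach.
Contribution: 1. SatDINO, a model tailored to self-supervised learning on remote sensing imagery. 2. Evidence across multiple datasets and testing setups that it outperforms MAE-based methods. 3. New enhancements, including GSD encoding and adaptive view sampling.
Method: SatDINO builds on DINO, a contrastive self-supervised method, and adds ground sample distance (GSD) encoding and adaptive view sampling.
Result: SatDINO outperforms MAE-based methods on multiple benchmarks, and an ablation study evaluates its individual components.
Insight: 1. Contrastive self-supervised learning is promising for remote sensing imagery. 2. Encoding metadata such as GSD helps model performance.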
Abstract: Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO’s individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
[37] Standardized Multi-Layer Tissue Maps for Enhanced Artificial Intelligence Integration and Search in Large-Scale Whole Slide Image Archives
Gernot Fiala,Markus Plass,Robert Harb,Peter Regitnig,Kristijan Skok,Wael Al Zoughbi,Carmen Zerner,Paul Torke,Michaela Kargl,Heimo Müller,Tomas Brazdil,Matej Gallo,Jaroslav Kubín,Roman Stoklasa,Rudolf Nenutil,Norman Zerbe,Andreas Holzinger,Petr Holub
Main category: cs.CV
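TL;DR: The paper proposes a framework that generates standardized multi-layer tissue maps for whole slide images (WSIs) to improve AI integration and search in large-scale WSI archives.
Details
Motivation: There is currently no metadata standard describing WSI content, so selecting datasets for AI training or validation requires manual inspection, which is inefficient and does not scale to large collections.
Contribution: A general framework that generates a three-layer tissue map (source, tissue type, pathological alterations) providing fine-grained information about WSI content.
Method: Builds a 2D index map through layered classification (source, tissue type, pathological alterations) and uses common syntax and semantics to achieve interoperability between different catalogs.
Result: The approach is demonstrated in clinical pathology, with examples of its use in WSI catalogs, machine learning, and graph-based WSI representations.
Insight: Standardized tissue maps can make AI algorithm development more efficient and open up possibilities for cross-domain research and large-scale data management.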
Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes. We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations.
[38] Unsupervised Incremental Learning Using Confidence-Based Pseudo-Labels
Lucas Rakotoarivony
Main category: cs.CV
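TL;DR: The paper proposes ICPL, an unsupervised incremental learning method based on confidence-based pseudo-labels. It enables incremental learning from unlabeled datasets, achieves results competitive with supervised methods, and outperforms existing unsupervised methods.
Details
Motivation: In real-world scenarios new classes keep emerging, while conventional incremental learning methods assume fully labeled data, which is impractical. An unsupervised incremental learning method is therefore needed.
Contribution: An unsupervised incremental learning method using confidence-based pseudo-labels (ICPL), integrated into several CIL methods, which improves unsupervised incremental learning performance.
Method: Pseudo-labels selected by confidence replace human annotations inside existing incremental learning frameworks. The method is evaluated on CIFAR100 and ImageNet100.
Result: ICPL outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy and is competitive with supervised methods. It is also shown to be practical on fine-grained datasets and in resource-constrained environments.
Insight: Confidence-based pseudo-labels can stand in for human annotations, and unsupervised incremental learning is promising in practice, especially for novel class discovery and resource-constrained settings.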
Abstract: Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.
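The confidence-based selection described in the ICPL abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax-maximum confidence measure, and the 0.9 threshold are assumptions made here for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_labels(batch_logits, threshold=0.9):
    """Keep only samples whose top softmax probability reaches the threshold.

    Returns (sample_index, predicted_class, confidence) tuples.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    selected = []
    for i, logits in enumerate(batch_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            selected.append((i, probs.index(conf), conf))
    return selected

# Three unlabeled samples over three classes; the second is ambiguous.
logits = [[4.0, 0.5, 0.1], [1.0, 0.9, 1.1], [0.2, 5.0, 0.3]]
pseudo = select_pseudo_labels(logits, threshold=0.9)
print([(i, c) for i, c, _ in pseudo])
```

Only the confidently predicted samples would then be passed to the incremental learner as if they were labeled; the ambiguous sample is dropped rather than risk a wrong label.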
[39] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Francisco Caetano,Christiaan Viviers,Peter H. H. de With,Fons van der Sommen
Main category: cs.CV
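TL;DR: MedShift is a unified class-conditional generative model based on Flow Matching and Schrödinger Bridges that addresses domain adaptation between synthetic and real X-ray images and supports high-fidelity unpaired image translation across multiple domains.
Details
Motivation: Synthetic medical data scales well for training robust models, but a significant domain gap separates it from real clinical data. The paper addresses cross-domain translation between synthetic and real X-ray images to bridge differences in attenuation behavior, noise characteristics, and soft tissue representation.
Contribution: MedShift, a model based on Flow Matching and Schrödinger Bridges that supports high-fidelity image translation across multiple domains, and X-DigiSkull, a dataset for benchmarking domain translation models.
Method: MedShift learns a shared domain-agnostic latent space that allows translation between any pair of domains seen during training, without domain-specific training or paired data.
Result: Despite being smaller than diffusion-based approaches, MedShift performs strongly and can be tuned at inference time to prioritize either perceptual fidelity or structural consistency.
Insight: MedShift offers a scalable and generalizable approach to domain adaptation in medical imaging, particularly where translation among multiple domains is needed.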
Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
[40] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Junha Song,Yongsik Jo,So Yeon Min,Quanting Xie,Taehwan Kim,Yonatan Bisk,Jaegul Choo
Main category: cs.CV
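TL;DR: The paper examines lightweight image captioning. A specialist built on a 125M-parameter language model achieves performance comparable to large multimodal generalists, and the proposed Sharp-Eyed Refinement framework mitigates visual blindness.
Details
Motivation: Deploying multimodal large language models (MLLMs) on local devices is computationally expensive, so lightweight and efficient captioning models are needed.
Contribution: 1) Demonstrates the potential of small models for image captioning. 2) Proposes the Sharp-Eyed Refinement framework, whose DeepLens component extracts detailed visual representations to improve caption quality.
Method: A specialist based on a 125M-parameter language model, refined with Sharp-Eyed Refinement, which concentrates on informative regions identified during the initial glance.
Result: The model performs comparably to large generalist models on single-sentence and detailed captioning, and Sharp-Eyed Refinement improves caption accuracy.
Insight: With better visual representations and attention, small models can caption images efficiently on local devices, which calls the reliance on large models into question.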
Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.
[41] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Hao Lu,Jiahao Wang,Yaolun Zhang,Ruohui Wang,Xuanyu Zheng,Yepeng Tang,Dahua Lin,Lewei Lu
Main category: cs.CV
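TL;DR: The paper identifies Semantic Aggregation Hallucination (SAH) in long-video understanding and introduces ELV-Halluc, the first benchmark dedicated to long-video hallucination. It shows that SAH exists, grows with semantic complexity, and is more frequent on rapidly changing semantics. The authors also explore mitigation through a positional encoding strategy and DPO, with substantial gains.
Details
Motivation: Video multimodal large language models (Video-MLLMs) suffer from SAH in long-video understanding: frame-level semantics are correct, but errors appear when they are aggregated into event-level semantics. Earlier work focused on hallucination in short videos and overlooked SAH in long videos.
Contribution: 1. Introduces the concept of Semantic Aggregation Hallucination (SAH). 2. Introduces ELV-Halluc, the first benchmark dedicated to long-video hallucination. 3. Relates SAH to semantic complexity and rapidly changing semantics. 4. Explores mitigation through a positional encoding strategy and DPO.
Method: 1. Design the ELV-Halluc benchmark around SAH in long videos. 2. Run experiments to confirm that SAH exists and to identify the factors that influence it. 3. Use a positional encoding strategy and DPO to mitigate SAH. 4. Curate 8K adversarial data pairs to support DPO.
Result: Experiments confirm that SAH exists and increases with semantic complexity. The positional encoding strategy and DPO improve results on both ELV-Halluc and Video-MME, including a 27.7% reduction in SAH ratio.
Insight: SAH in long videos is distinct from the hallucination causes studied in short videos and calls for dedicated study. Positional encoding and the ability to distinguish semantics within and across events are key directions for improvement.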
Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model’s ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
[42] Maybe you don’t need a U-Net: convolutional feature upsampling for materials micrograph segmentation
Ronan Docherty,Antonis Vamvakeros,Samuel J. Cooper
Main category: cs.CV
Details
Abstract: Feature foundation models, usually vision transformers, offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.
[43] HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
Hao Ruan,Jinliang Lin,Yingxin Lai,Zhiming Luo,Shaozi Li
Main category: cs.CV
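TL;DR: HCCM (Hierarchical Cross-Granularity Contrastive and Matching Learning) is a framework for Natural Language-Guided Drone (NLGD) tasks. It uses region-global image-text contrastive learning and matching learning to handle vision-language understanding in dynamic environments, improving retrieval performance and zero-shot generalization.
Details
Motivation: NLGD tasks involve a wide field of view and complex compositional semantics. Mainstream Vision-Language Models (VLMs) emphasize global alignment and lack fine-grained semantics, while existing hierarchical methods depend on precise entity partitioning, which limits their use in dynamic environments.
Contribution: 1. The HCCM framework, which combines Region-Global Image-Text Contrastive Learning (RG-ITC) and Region-Global Image-Text Matching (RG-ITM) to align local and global semantics without strict scene partitioning. 2. A Momentum Contrast and Distillation (MCD) mechanism that improves robustness to incomplete or ambiguous text descriptions.
Method: 1. RG-ITC contrasts local visual regions with global text, and vice versa, to capture hierarchical semantics. 2. RG-ITM evaluates local semantic consistency within global cross-modal representations, which strengthens compositional reasoning. 3. MCD stabilizes alignment through momentum contrast and distillation.
Result: On GeoText-1652, HCCM reaches state-of-the-art Recall@1 of 28.8% for image retrieval and 14.7% for text retrieval. On the unseen ERA dataset it shows strong zero-shot generalization, with a mean recall (mR) of 39.93%.
Insight: Hierarchical contrastive learning that avoids precise scene partitioning improves vision-language understanding in dynamic environments, and momentum distillation improves robustness to incomplete text descriptions.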
Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
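For readers unfamiliar with the Recall@1 metric reported for HCCM above, the following sketch shows how Recall@K is commonly computed from a query-by-gallery similarity matrix. It is a generic illustration, not the benchmark's evaluation code, and it assumes a single correct gallery item per query, which may not match the GeoText-1652 protocol.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity[q][g] is the score between query q and gallery item g;
    ground_truth[q] is the index of the single correct gallery item
    (a simplifying assumption made for this sketch).
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank gallery indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        if ground_truth[q] in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy example: three text queries against three images.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct image 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct image 1 is ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image 2 is ranked first
]
gt = [0, 1, 2]
r1 = recall_at_k(sim, gt, k=1)
r2 = recall_at_k(sim, gt, k=2)
print(r1, r2)
```

In this toy case two of three queries retrieve the right image first, so Recall@1 is 2/3, and all three succeed within the top two.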
[44] Complete Gaussian Splats from a Single Image with Denoising Diffusion Models
Ziwei Liao,Mohamed Sayed,Steven L. Waslander,Sara Vicente,Daniyar Turmukhambetov,Michael Firman
Main category: cs.CV
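TL;DR: The paper proposes a generative method based on a latent diffusion model that reconstructs a complete 3D Gaussian splat scene, including occluded parts, from a single image. Combining a Variational AutoReconstructor with a diffusion model addresses the weaknesses of conventional methods in handling occlusion and multi-modality.
Details
Motivation: Gaussian splatting typically needs dense observations of the scene and cannot reconstruct occluded or unobserved regions. Regression-based methods predict a single mode for those regions, which leads to blurry or implausible results.
Contribution: 1. A generative method that produces complete 3D Gaussian splats from a single image. 2. A Variational AutoReconstructor that learns a latent space from 2D images only, in a self-supervised manner. 3. A diffusion model over that latent space that generates high-quality and diverse completions of occluded regions.
Method: 1. Learn a latent space from 2D images with the self-supervised Variational AutoReconstructor. 2. Train a diffusion model over the latent space to generate multi-modal 3D Gaussian splat representations. 3. Condition generation on a single input image to produce the complete scene.
Result: The method produces reconstructions that are faithful to the input and completes occluded surfaces, supporting high-quality 360-degree renderings.
Insight: A generative formulation handles the multi-modal uncertainty of occluded regions better than regression, and self-supervised learning sidesteps the lack of ground-truth training data.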
Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single “mode” for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
[45] EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
Yujin Park,Haejun Chung,Ikbeom Jang
Main category: cs.CV
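TL;DR: EZ-Sort combines zero-shot CLIP pre-ordering with human-in-the-loop merge sort to reduce the annotation cost of pairwise comparison in subjective labeling tasks, and its efficiency and reliability are validated on several datasets.
Details
Motivation: Pairwise comparison is more reliable for subjective or difficult annotation tasks, but exhaustive comparison is expensive (O(n^2)). Prior work reduced the cost to O(n log n) through active sampling, yet there is still room for improvement.
Contribution: 1. EZ-Sort, which combines zero-shot CLIP pre-ordering with human-in-the-loop merge sort. 2. Automated comparisons that replace easy, obvious human comparisons. 3. Validation on several datasets, with human annotation cost reduced by 90.5% compared to exhaustive comparison and by 19.8% compared to prior work (when n = 100).
Method: 1. Hierarchical pre-ordering with a zero-shot CLIP model. 2. Initialization of bucket-aware Elo scores. 3. Uncertainty-guided human-in-the-loop MergeSort.
Result: On the FGNET, DHCI, and EyePACS datasets, EZ-Sort reduces annotation cost while maintaining or improving inter-rater reliability.
Insight: Combining CLIP-based pre-ordering with uncertainty-aware sampling is an efficient and scalable approach to pairwise ranking.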
Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
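The idea in the EZ-Sort abstract of sending only uncertain pairs to a human during a merge sort can be sketched as follows. This is a simplified illustration and not the authors' algorithm: the `prior` scores stand in for the CLIP-based pre-ordering, the fixed `margin` stands in for the Elo-based uncertainty estimate, and all names are hypothetical.

```python
def hybrid_merge_sort(items, prior, ask_human, margin):
    """Merge sort (ascending) that queries a human only for uncertain pairs.

    prior:     dict of rough scores per item (stand-in for a zero-shot prior).
    ask_human: callable(a, b) -> True if a should come before or tie with b.
    margin:    prior gap at or above which the comparison is automated.
    Returns (sorted_items, number_of_human_queries).
    """
    counter = {"human": 0}

    def less_or_equal(a, b):
        gap = prior[a] - prior[b]
        if abs(gap) >= margin:
            return gap <= 0          # confident prior: automated comparison
        counter["human"] += 1        # uncertain pair: ask the annotator
        return ask_human(a, b)

    def sort(seq):
        if len(seq) <= 1:
            return list(seq)
        mid = len(seq) // 2
        left, right = sort(seq[:mid]), sort(seq[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if less_or_equal(left[i], right[j]):
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    return sort(items), counter["human"]

# Toy age-ordering task: the prior is noisy for the close pair "b" and "c".
true_age = {"a": 5, "b": 30, "c": 33, "d": 70}
prior = {"a": 8, "b": 35, "c": 31, "d": 65}
ordered, human_queries = hybrid_merge_sort(
    ["d", "b", "a", "c"], prior, lambda x, y: true_age[x] <= true_age[y], margin=10
)
print(ordered, human_queries)
```

In this toy run the merge sort makes five comparisons, but only the close pair ("b", "c") reaches the human; the other four are resolved from the prior. A larger margin trades more human queries for less reliance on the prior.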
[46] ECHO: Ego-Centric modeling of Human-Object interactions
Ilya A. Petrov,Vladimir Guzov,Riccardo Marin,Emre Aksan,Xu Chen,Daniel Cremers,Thabo Beeler,Gerard Pons-Moll
Main category: cs.CV
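TL;DR: The paper proposes ECHO, the first unified method to model three modalities (human pose, object motion, and contact) from head and wrist tracking alone. It uses a Diffusion Transformer architecture and a three-variate diffusion process, which allows flexible input configurations and sequences of any length.
Details
Motivation: With the growing adoption of wearable devices such as smart glasses and watches, modeling human-object interactions (HOI) from an egocentric perspective is an important but largely unexplored problem. The paper asks how much interaction information can be recovered from head and wrist tracking alone.
Contribution: 1. ECHO, a unified framework that for the first time recovers human pose, object motion, and contact from minimal observations (head and wrist tracking). 2. A Diffusion Transformer architecture with a three-variate diffusion process that supports flexible input configurations. 3. A conveyor-based inference scheme that handles sequences of any length.
Method: 1. Jointly model human motion, object trajectory, and contact sequence with a Diffusion Transformer and a three-variate diffusion process. 2. Operate in a head-centric canonical space for robustness to global orientation. 3. Use conveyor-based inference, in which the diffusion timestamp increases progressively with frame position, to process long sequences.
Result: Extensive evaluation shows that ECHO outperforms existing methods that do not offer the same flexibility, setting the state of the art in egocentric HOI reconstruction.
Insight: Head and wrist tracking alone can be enough to recover multi-modal interaction information, which is relevant for HOI applications on wearable devices.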
Abstract: Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrist tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.
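The ECHO abstract describes conveyor-based inference only in one sentence, so the following is a speculative sketch of what a schedule with "diffusion timestamp increasing with frame position" could look like. The linear schedule, the sliding window, and both function names are assumptions made here; no denoising model is included and the paper's actual procedure may differ.

```python
def conveyor_timesteps(window, max_t):
    """Per-frame diffusion timestep that grows linearly with position in the window.

    Position 0 is closest to clean; the last position carries full noise.
    The linear spacing is an assumption of this sketch.
    """
    return [round((p + 1) * max_t / window) for p in range(window)]

def conveyor_schedule(num_frames, window):
    """Yield (emitted_frame, frames_still_on_belt) as the window slides.

    Each step finalizes the frame at the front of the belt and, while frames
    remain, admits a new fully noisy frame at the back, so a sequence of any
    length is processed with a fixed-size window.
    """
    belt = list(range(min(window, num_frames)))
    next_frame = len(belt)
    while belt:
        emitted = belt.pop(0)        # front of the belt: least noisy, finalized
        if next_frame < num_frames:
            belt.append(next_frame)  # new frame enters at the highest noise level
            next_frame += 1
        yield emitted, list(belt)

steps = conveyor_timesteps(window=4, max_t=1000)
emitted_order = [frame for frame, _ in conveyor_schedule(num_frames=6, window=4)]
print(steps, emitted_order)
```

The point of such a scheme is that memory and compute per step depend on the window size rather than on the sequence length.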
[47] How Well Do Vision–Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
Juneyoung Ro,Namwoo Kim,Yoonjin Yoon
Main category: cs.CV
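TL;DR: The paper studies the spatial reasoning ability of current vision-language models (VLMs) on urban scenes. It compares the zero-shot performance of BLIP-2, InstructBLIP, and LLaVA-1.5 and examines the effect of fine-tuning on a synthetic VQA dataset.
Details
Motivation: Understanding urban scenes requires fine-grained spatial reasoning, but it is unclear how well VLMs pretrained on general scenes transfer to this domain. The paper aims to fill this gap.
Contribution: 1) Introduces urban spatial reasoning as a new challenge for VLMs. 2) Shows that fine-tuning on a synthetic dataset substantially improves performance. 3) Demonstrates that general-purpose models can be adapted to a specialized domain in this way.
Method: Three VLMs (BLIP-2, InstructBLIP, LLaVA-1.5) are compared in zero-shot settings and then fine-tuned on a synthetic street-view VQA dataset. The dataset is built from segmentation, depth, and object detection predictions, and each question is paired with LLM-generated Chain-of-Thought (CoT) answers for reasoning supervision.
Result: The VLMs perform reasonably well zero-shot, but fine-tuning on the synthetic CoT-supervised dataset substantially boosts performance, especially on challenging question types such as negation and counterfactuals.
Insight: Synthetic datasets are a practical way to improve VLM performance in a specific domain, and urban spatial reasoning is a worthwhile direction for future work.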
Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
[48] Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer
Daniël Boeke,Cedrik Blommestijn,Rebecca N. Wray,Kalina Chupetlovska,Shangqi Gao,Zeyu Gao,Regina G. H. Beets-Tan,Mireia Crispin-Ortuzar,James O. Jones,Wilson Silva,Ines P. Machado
Main category: cs.CV
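TL;DR: The paper presents a multimodal deep learning framework that combines preoperative CT and postoperative pathology WSIs to predict recurrence risk in clear cell renal cell carcinoma. With intermediate fusion the model outperforms single modalities and approaches the clinical score, although fusion methods and dataset size need further work.
Details
Motivation: The Leibovich score used to predict recurrence risk in clear cell renal cell carcinoma (ccRCC) offers limited patient-level resolution and excludes imaging information, so multimodal methods are needed for more personalized prediction.
Contribution: 1. A modular deep learning framework that combines CT and pathology WSIs. 2. Evidence that intermediate fusion is advantageous for multimodal prediction. 3. The finding that, with simple embedding concatenation, radiology adds value mainly by complementing the pathology model through fusion.
Method: 1. Extract CT and WSI features with pretrained encoders. 2. Predict risk with Cox-based survival modeling. 3. Compare unimodal, late fusion, and intermediate fusion setups.
Result: WSI-based models outperform CT-only models. Intermediate fusion (TITAN-CONCH with ResNet-18) performs best and approaches the adjusted Leibovich score, and the results suggest that discretization may overstate individualized performance.
Insight: 1. Intermediate fusion improves multimodal prediction. 2. Pathology carries stronger prognostic signal for recurrence risk. 3. More general-purpose CT encoders and larger datasets are needed to match the modeling capacity available for pathology.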
Abstract: Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
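The remark in the abstract above that discretization may overstate individualized performance can be made concrete with a concordance index (C-index) sketch. This is a generic Harrell-style C-index written for illustration; how the authors handled tied risk scores is not stated in the abstract, so the `tie_credit` parameter and the toy data are assumptions.

```python
def concordance_index(times, events, risks, tie_credit=0.5):
    """Harrell-style C-index: higher risk should mean an earlier event.

    times:      follow-up time per patient.
    events:     1 if recurrence was observed, 0 if censored.
    risks:      predicted risk score per patient.
    tie_credit: credit given to comparable pairs with equal risk;
                None drops such pairs from the denominator entirely.
    """
    concordant, total = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i had an observed event before j's time.
            if events[i] and times[i] < times[j]:
                if risks[i] > risks[j]:
                    concordant += 1
                    total += 1
                elif risks[i] == risks[j]:
                    if tie_credit is not None:
                        concordant += tie_credit
                        total += 1
                else:
                    total += 1
    return concordant / total

times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
continuous = [0.9, 0.7, 0.4, 0.1]   # hypothetical learned risk scores
discretized = [2, 2, 1, 1]          # hypothetical coarse risk groups

c_cont = concordance_index(times, events, continuous)
c_disc_half = concordance_index(times, events, discretized, tie_credit=0.5)
c_disc_drop = concordance_index(times, events, discretized, tie_credit=None)
print(c_cont, c_disc_half, c_disc_drop)
```

In this toy cohort the coarse score cannot order the two patients in the same risk group. If tied pairs are simply dropped it still scores a perfect 1.0, while giving ties half credit lowers it to 0.9, which is one way a discrete clinical score can look better at the individual level than it is.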
[49] Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Shashank Vempati,Nishit Anand,Gaurav Talebailkar,Arpan Garai,Chetan Arora
Main category: cs.CV
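TL;DR: The paper proposes moving from word-level OCR to line-level OCR as a natural next step, which bypasses word segmentation errors and gives language models more context to work with.
Details
Motivation: Conventional OCR was limited by error-prone character segmentation and a lack of context for language models. Modern techniques work at the word level but now depend on accurate word segmentation. The paper proposes line-level OCR to address this.
Contribution: 1. A line-level OCR approach that bypasses word detection errors and uses larger context. 2. A carefully curated line-level OCR dataset of 251 English page images. 3. Experiments showing a 5.4% accuracy improvement and a 4 times improvement in efficiency.
Method: A sequence-to-sequence formulation takes line-level images from the document as input and outputs character sequences, using the larger sentence context to make better use of language models.
Result: Line-level OCR improves end-to-end accuracy by 5.4% and is 4 times more efficient than word-based pipelines.
Insight: Line-level OCR improves both accuracy and efficiency and is positioned to benefit from further advances in large language models.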
Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and left them without the context needed to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then input one word at a time to a model, which directly outputs full words as sequences of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal allows bypassing errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word-level to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
[50] Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks
Amirhossein Nazeri,Wael Hafez
Main category: cs.CV
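TL;DR: The paper proposes an entropy-based, non-invasive method for monitoring the reliability of convolutional neural networks (CNNs) under adversarial inputs. By analysing entropy changes in CNN activations, it detects adversarial perturbations without modifying the model and reaches 90% detection accuracy.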
Details
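Motivation: Existing adversarial detection methods typically require retraining, modify the network architecture, or degrade performance on clean inputs, which limits practical use. The paper aims to provide an efficient detection method that needs no model modification by analysing entropy changes in CNN activations.
Contribution: 1. Finds that adversarial perturbations shift activation entropy in early convolutional layers by 7%; 2. Proposes an entropy-based non-invasive monitoring method that achieves 90% detection accuracy without modifying the model; 3. Shows that CNN activation patterns inherently encode information about input distribution shifts.
Method: Monitors activation entropy in parallel on VGG-16 and verifies experimentally that adversarial and clean samples differ markedly in their entropy distributions.
Result: Adversarial perturbations shift activation entropy in early convolutional layers by 7%; detection accuracy reaches 90%, with false positive and false negative rates below 20%.
Insight: CNN activation patterns inherently carry information about input distribution shifts, and entropy is a simple indicator that can monitor model reliability.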
Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations: imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
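The entropy monitoring idea can be illustrated with a minimal sketch. It assumes activations from one layer have already been flattened into a list of floats, estimates Shannon entropy from a fixed-width histogram, and flags inputs whose entropy deviates from a clean baseline. The bin count, the 5% relative threshold, and the function names are hypothetical; the abstract does not specify the authors' estimator or decision rule.

```python
import math
from collections import Counter

def activation_entropy(activations, num_bins=32):
    """Shannon entropy in bits of a fixed-width histogram over activation values."""
    lo, hi = min(activations), max(activations)
    if hi == lo:
        return 0.0  # all values identical, so only one bin is occupied
    width = (hi - lo) / num_bins
    # Map each value to a bin index; the maximum value falls into the last bin.
    counts = Counter(min(int((a - lo) / width), num_bins - 1) for a in activations)
    total = len(activations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_suspicious(entropy, clean_mean, rel_threshold=0.05):
    """Flag an input whose entropy differs from the clean baseline by more than rel_threshold (assumed value)."""
    return abs(entropy - clean_mean) / clean_mean > rel_threshold

# Example: four evenly spread values occupy four bins, giving 2 bits of entropy.
print(activation_entropy([0.0, 1.0, 2.0, 3.0], num_bins=4))
```

Because the check only reads activations, it can run alongside the model without touching its weights, which is the non-invasive property the abstract emphasises.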
[51] CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
João Valente,Atabak Dehban,Rodrigo Ventura
Main category: cs.CV
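TL;DR: The paper introduces CAD2DMD-SET, a tool that generates synthetic Digital Measurement Device (DMD) datasets for fine-tuning Large Vision-Language Models (LVLMs), and validates it experimentally.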
Details
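Motivation: Current LVLMs struggle to read values from DMDs in complex real-world conditions (occlusion, motion blur, and similar effects common with head-mounted cameras and AR applications), so a tool is needed that generates high-quality synthetic data to improve model robustness.
Contribution: 1. Proposes CAD2DMD-SET, which generates diverse synthetic DMD datasets from 3D CAD models using advanced rendering and high-fidelity image composition; 2. Presents DMDBench, a set of 1,000 annotated real-world images for model evaluation; 3. Shows that the tool substantially improves LVLM performance (for example, a 200% increase in InternVL's ANLS score).
Method: 1. Generates synthetic DMD images from 3D CAD models with advanced rendering; 2. Increases diversity through image composition; 3. Fine-tunes LVLMs on the generated synthetic data using LoRA.
Result: Fine-tuned LVLMs improve substantially on DMDBench; for example, InternVL's ANLS score increases by 200% without degrading on other tasks.
Insight: A high-quality synthetic data generation tool can substantially improve the robustness and performance of LVLMs in a specific setting such as DMD reading without hurting other tasks.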
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur; common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA’s of these models with CAD2DMD-SET’s generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
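For readers unfamiliar with the ANLS metric named in the abstract, the following is a minimal sketch of its commonly used definition from text-VQA benchmarks: each prediction is scored by one minus its normalised edit distance to the closest accepted answer, and scores with a normalised distance at or above a threshold count as zero. The threshold of 0.5 and the lowercasing are assumptions taken from that common definition, not settings confirmed by this paper.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """Mean over questions of the best thresholded similarity against any accepted answer."""
    scores = []
    for pred, answers in zip(predictions, references):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Similarities with a normalised distance at or above tau are treated as wrong.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

# One exact reading and one unrelated answer average to 0.5.
print(anls(["12.5 V", "abc"], [["12.5 V"], ["xyz"]]))
```

The soft matching matters for device readings: a prediction that is one digit off still earns partial credit, while an unrelated string earns none.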
[52] Learning from Silence and Noise for Visual Sound Source Localization
Xavier Juanola,Giovana Morais,Magdalena Fuentes,Gloria Haro
Main category: cs.CV
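TL;DR: The paper proposes a new training strategy that incorporates silence and noise, yielding the self-supervised model SSL-SaN, which improves visual sound source localization for both positive and negative audio. It also proposes a new metric and the extended dataset IS3+.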
Details
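Motivation: Current visual sound source localization methods perform poorly when audio-visual semantic correspondence is low (silence, noise, and offscreen sounds), and evaluation has been limited to positive cases.
Contribution: 1. Proposes a training strategy that incorporates silence and noise, producing SSL-SaN; 2. Proposes a new metric for the alignment and separability of features across positive and negative audio-visual pairs; 3. Extends and improves the IS3 synthetic dataset (IS3+).
Method: Trains a self-supervised model (SSL-SaN) with silence and noise to improve performance on both positive and negative audio, and introduces a new metric that evaluates feature alignment and separability.
Result: SSL-SaN achieves state-of-the-art performance in sound localization and cross-modal retrieval, and the new metric and dataset provide a more comprehensive evaluation benchmark.
Insight: Training with negative examples such as silence and noise can markedly improve robustness in difficult scenarios and suggests a new direction for audio-visual research.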
Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/.
[53] UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng,Jing Huang,Liming Zheng,Wenkang Han,Yufeng Zhong,Lei Chen,Longrong Yang,Yingjie Chu,Yuzhi He,Lin Ma
Main category: cs.CV
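TL;DR: UItron is an open-source foundation model for GUI agents with advanced GUI perception, grounding, and planning capabilities. Through improved data engineering and interactive infrastructure it addresses limitations of existing models, and it makes significant progress in Chinese mobile app scenarios.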
Details
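Motivation: Automating operations on mobile and PC devices with GUI agents is an important task on the path to artificial general intelligence. Despite progress in VLMs, building GUI agents remains challenging because operation trajectories are scarce, interactive infrastructure is lacking, and foundation models have limited initial capabilities.
Contribution: 1. Introduces UItron, an open-source foundation model with advanced GUI perception, grounding, and planning capabilities; 2. Systematically studies data engineering strategies and builds an interactive environment connecting mobile and PC devices; 3. Addresses the lack of Chinese app capabilities in existing models.
Method: UItron applies supervised fine-tuning on perception and planning tasks across varied GUI scenarios, then uses a curriculum reinforcement learning framework to support complex reasoning and exploration in online environments.
Result: UItron performs strongly on GUI perception, grounding, and planning benchmarks, with significant progress in Chinese mobile app scenarios in particular.
Insight: Data engineering and interactive infrastructure are key to advancing GUI agent development, and targeted improvements for specific languages or regions (such as Chinese apps) matter for real-world use.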
Abstract: GUI agents aim to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top-tier Chinese mobile APPs, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
[54] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos
Qiyue Sun,Qiming Huang,Yang Yang,Hongjun Wang,Jianbo Jiao
Main category: cs.CV
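TL;DR: The paper explores whether learning visual representations from atypical videos (sci-fi, animation, and the like) benefits open-world tasks, and introduces a new dataset. Experiments show that even straightforward learning approaches improve performance on OOD detection, novel category discovery, and zero-shot action recognition.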
Details
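Motivation: Existing studies focus on typical datasets, while learning from unusual data in the open world is under-explored. The paper investigates whether atypical videos help open-world visual representation learning.
Contribution: 1. Introduces a new atypical video dataset; 2. Shows that atypical data improves OOD detection, novel category discovery, and zero-shot action recognition; 3. Finds that semantic diversity, rather than data volume, drives the gains.
Method: Collects an atypical video dataset, feeds it into model training, and evaluates on three open-world tasks: OOD detection, NCD, and ZSAR.
Result: Atypical data consistently improves performance on open-world tasks, and more semantically diverse data performs better.
Insight: 1. The semantic diversity of atypical videos is important for open-world learning; 2. A smaller but more diverse dataset can outperform a larger, more typical one; 3. The results encourage further study of learning from atypical data.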
Abstract: Humans usually show exceptional generalisation and discovery ability in the open world, when being shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g. sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction.
[55] Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
Nattapong Kurpukdee,Adrian G. Bors
Main category: cs.CV
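TL;DR: The paper proposes a new scenario and method for unsupervised video continual learning (uVCL). Non-parametric deep embedded clustering handles a succession of tasks without labels or task boundaries and substantially improves performance.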
Details
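Motivation: Video carries rich spatio-temporal information but has been little studied in unsupervised continual learning. Existing methods mostly rely on labels and task boundaries, yet labeled data is costly and impractical to obtain. The paper therefore studies unsupervised video continual learning without labels or task boundaries.
Contribution: 1. Proposes the uVCL scenario and an experimental protocol for it; 2. Represents deep video features non-parametrically with Kernel Density Estimation (KDE); 3. Dynamically expands memory clusters to capture new knowledge; 4. Uses transfer learning from previous tasks to improve performance.
Method: Extracts deep embedded features with an unsupervised video transformer network and represents them probabilistically with KDE. A novelty detection criterion triggers dynamic expansion of memory clusters, and transfer learning carries knowledge from previous tasks into the current one.
Result: Experiments on UCF101, HMDB51, and Something-to-Something V2 show that the method substantially outperforms baselines when learning a succession of tasks.
Insight: Non-parametric representation and dynamic memory expansion are effective strategies for unsupervised video continual learning, and transfer learning further improves adaptability.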
Abstract: We propose a realistic scenario for the unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent a complex and rich spatio-temporal media information, widely used in many applications, but which have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while having labeled data is costly and not practical. To address this gap, we study the unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage the use of transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, without using any labels or class boundaries.
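The combination of KDE and novelty-triggered cluster expansion can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: each cluster simply stores raw embedding vectors, density is a Gaussian KDE with a fixed bandwidth, and a new cluster is opened when the best log-density falls below a hypothetical threshold. The names `kde_log_density` and `assign_or_expand`, the bandwidth, and the threshold are all invented for the example.

```python
import math

def kde_log_density(x, samples, bandwidth=1.0):
    """Log of a Gaussian kernel density estimate at x, given a cluster's stored embeddings."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2)
    exps = [
        -sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2 * bandwidth ** 2)
        for s in samples
    ]
    m = max(exps)  # log-sum-exp trick for numerical stability
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(samples))

def assign_or_expand(x, clusters, log_threshold=-10.0, bandwidth=1.0):
    """Assign x to its most likely cluster, or open a new cluster when x looks novel."""
    scores = [kde_log_density(x, c, bandwidth) for c in clusters]
    best = max(range(len(clusters)), key=scores.__getitem__) if clusters else None
    if best is None or scores[best] < log_threshold:
        clusters.append([x])  # novelty detected: expand the memory with a new cluster
        return len(clusters) - 1
    clusters[best].append(x)
    return best

clusters = [[[0.0, 0.0], [0.1, 0.0]]]
print(assign_or_expand([0.05, 0.0], clusters))   # close to the existing cluster
print(assign_or_expand([10.0, 10.0], clusters))  # far away, so a new cluster is opened
```

Because no cluster count is fixed in advance, memory grows only when incoming data is poorly explained by what has already been stored, which mirrors the label-free, boundary-free setting described in the abstract.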
[56] Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
Ugur Dinc,Jibak Sarkar,Philipp Schubert,Sabine Semrau,Thomas Weissmann,Andre Karius,Johann Brand,Bernd-Niklas Axer,Ahmed Gomaa,Pluvio Stephan,Ishita Sheth,Sogand Beirami,Annette Schwarz,Udo Gaipl,Benjamin Frey,Christoph Bert,Stefanie Corradini,Rainer Fietkau,Florian Putz
Main category: cs.CV
TL;DR: GPT-5 shows measurable performance gains in radiation oncology but still requires expert oversight.
Details
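Motivation: The study evaluates GPT-5's performance in radiation oncology, particularly its potential for clinical decision support.
Contribution: Through standardised benchmarks and expert ratings, it confirms GPT-5's improvements on radiation oncology tasks and exposes its limitations.
Method: Evaluates GPT-5 on two benchmarks: (1) a 300-item multiple-choice exam (TXIT), and (2) treatment plan generation for 60 authentic clinical vignettes, rated by experts.
Result: GPT-5 reaches 92.8% accuracy on TXIT; its vignette treatment recommendations score 3.24/4 for correctness and 3.59/4 for comprehensiveness, but errors persist in complex scenarios.
Insight: Despite GPT-5's strong performance, clinical use still requires rigorous expert review, especially for complex cases.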
Abstract: Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss’ kappa. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5’s treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss’ kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
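Since the abstract reports inter-rater agreement with Fleiss' kappa, a short sketch of the standard formula may help in reading the 0.083 figure. The input is a count matrix where each row is one rated item and each column a rating category. How the authors mapped their scores to categories is not stated, so the example data below are purely illustrative, with four raters per item as in the study.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of raters assigning item i to category j.

    Assumes every item is rated by the same number of raters (at least two)
    and that ratings are not all in one category (which would make p_e equal 1).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: fraction of agreeing rater pairs per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of ratings in each category.
    p_cats = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative data: four raters, two categories, perfect agreement on both items.
print(fleiss_kappa([[4, 0], [0, 4]]))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a kappa of 0.083 indicates that the four raters often disagreed even though mean scores were high.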
[57] VoCap: Video Object Captioning and Segmentation from Any Prompt
Jasper Uijlings,Xingyi Zhou,Xiuye Gu,Arsha Nagrani,Anurag Arnab,Alireza Fathi,David Ross,Cordelia Schmid
Main category: cs.CV
TL;DR: VoCap is a flexible video model that performs promptable video object segmentation, referring expression segmentation, and object captioning by leveraging multimodal prompts and pseudo annotations from a large Vision Language Model.
Details
Motivation: Video object understanding requires fine-grained localization and semantic details, but annotating such data is expensive. VoCap aims to address this by combining multiple tasks and leveraging existing datasets with pseudo annotations.
Contribution: 1. Proposes VoCap, a multimodal model for video object captioning and segmentation. 2. Introduces SAV-Caption, a dataset with pseudo and manual annotations for evaluation. 3. Achieves state-of-the-art results in referring expression segmentation and establishes a benchmark for video object captioning.
Method: 1. Uses pseudo captions generated by a Vision Language Model (VLM) on an existing segmentation dataset (SAV). 2. Trains VoCap on SAV-Caption and other datasets to handle multimodal prompts (text, box, mask). 3. Produces spatio-temporal masks and object-centric captions.
Result: VoCap achieves state-of-the-art performance in referring expression video object segmentation, competitive results in semi-supervised VOS, and sets a benchmark for video object captioning.
Insight: Leveraging pseudo annotations from VLMs can effectively address data scarcity in video understanding tasks, enabling multitask models like VoCap to perform well across various benchmarks.
Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such our model addresses simultaneously the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feed this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
[58] The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Yiming Lin,Yuchen Niu,Shang Wang,Kaizhu Huang,Qiufeng Wang,Xiao-Bo Jin
Main category: cs.CV
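TL;DR: The paper revisits verb classification in situation recognition, argues that it is inherently a multi-label task, and proposes GE-VerbMLP, a method based on single positive multi-label learning (SPMLL) that substantially improves performance.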
Details
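Motivation: Conventional methods treat verb classification as a single-label problem, but an image often admits several reasonable verb categories. Because of this semantic ambiguity, the single-label simplification limits performance.
Contribution: 1. Reveals the multi-label nature of verb classification; 2. Reformulates it as an SPMLL problem, a new perspective in SR research; 3. Designs a multi-label evaluation benchmark and develops the GE-VerbMLP model.
Method: Within the SPMLL framework, combines graph neural networks (GNNs) to capture label correlations with adversarial training to optimise decision boundaries.
Result: The method improves multi-label mean average precision (MAP) by more than 3% while remaining competitive on traditional single-label metrics (top-1 and top-5 accuracy).
Insight: Situation recognition must account for semantic overlap. Multi-label learning better captures the ambiguity of visual events, and SPMLL offers a workable route to multi-label learning under limited annotation.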
Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR that is carefully designed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
[59] DriveQA: Passing the Driving Knowledge Test
Maolin Wei,Wanzhou Liu,Eshed Ohn-Bar
Main category: cs.CV
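TL;DR: DriveQA is an open-source text and vision benchmark for evaluating large language models (LLMs) and multimodal LLMs (MLLMs) on driving knowledge tests. Experiments show that existing models handle basic traffic rules well but fall short on numerical reasoning, complex right-of-way scenarios, and traffic sign recognition. Fine-tuning on DriveQA substantially improves performance and carries over to downstream real-world driving tasks.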
Details
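Motivation: Current autonomous driving benchmarks concentrate on spatial and visual question answering and neglect comprehensive traffic rules and complex driving scenarios. DriveQA aims to fill this gap and push models toward passing real driving knowledge tests.
Contribution: 1. Presents DriveQA, an extensive open-source text and vision benchmark covering traffic rules and complex scenarios; 2. Exposes the limitations of existing LLMs and MLLMs on driving knowledge tests; 3. Shows the role of fine-tuning and pretraining in improving model performance.
Method: Builds the DriveQA dataset of text and visual data covering a wide range of traffic rules and scenarios. Experiments evaluate existing models, the effect of fine-tuning, and the impact on downstream tasks.
Result: 1. Existing models do well on basic rules but poorly on numerical reasoning and complex scenarios; 2. Fine-tuning notably improves regulatory sign recognition and intersection decision-making; 3. Pretraining on DriveQA improves generalisation to downstream tasks.
Insight: 1. Numerical reasoning and complex scenarios are the main current challenges; 2. High-quality datasets and fine-tuning can substantially improve models on specific tasks; 3. Combining text and visual knowledge is important for generalisation in driving tasks.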
Abstract: If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
cs.LG [Back]
[60] Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering
Thanasis Schoinas,Benjamin Guinard,Diba Esbati,Richard Chalk
Main category: cs.LG
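TL;DR: The paper proposes a pipeline combining string similarity, topic modelling, hierarchical clustering, and rules to cluster transaction counterparties in bank payment messaging systems, a setting where natural language models are unsuitable.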
Details
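Motivation: Counterparty details in bank payment messaging systems such as SWIFT are typically typed manually; they lack sentence structure and contain noise, so conventional natural language methods do not apply. Existing fuzzy matching tools have limited effectiveness, so a better solution is needed.
Contribution: 1. Proposes a hybrid pipeline combining string similarity, topic modelling, hierarchical clustering, and rules for counterparty clustering; 2. Devises new evaluation metrics that supplement precision and recall; 3. Demonstrates improved performance on real-life data.
Method: 1. String similarity: handles noise and variants from manual entry; 2. Topic modelling: extracts meaningful information; 3. Hierarchical clustering: suits an unknown number of clusters; 4. Rule-based pipeline: retains the interpretability of rule systems.
Result: On a real-life labelled dataset, performance is significantly better than a rule-based baseline, and the need for manual review is reduced.
Insight: 1. Conventional natural language methods do not suit text that is structured but lacks sentence structure; 2. Hybrid approaches can combine the strengths of different techniques (such as interpretability and accuracy); 3. Evaluation metrics should be designed around the needs of the application.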
Abstract: Short text clustering is a known use case in the text analytics community. When the structure and content falls in the natural language domain e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional’s toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets. A gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based (‘keyword’) approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
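To make the string-similarity and clustering stages concrete, here is a deliberately reduced sketch. It covers only two of the four components the paper combines: names are normalised, compared with `difflib.SequenceMatcher` from the Python standard library (a stand-in for whatever similarity measure the authors use), and merged by single-linkage agglomeration with a similarity cut-off, so the number of clusters does not need to be known in advance. The 0.8 threshold and the example names are hypothetical.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity in [0, 1] between two normalised counterparty strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def cluster_names(names, threshold=0.8):
    """Single-linkage agglomerative clustering: merge while any cross-cluster pair is similar enough."""
    clusters = [[n] for n in names]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(a, b) >= threshold for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

names = ["ACME Trading Ltd.", "Acme Trading LTD", "ACME TRADNG LTD", "Globex Corporation"]
for group in cluster_names(names):
    print(group)
```

Single linkage is chosen here only for brevity; it can chain loosely related names together, which is one reason a production pipeline would add the rule-based refinement and topic features that the abstract describes.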
[61] Model-Task Alignment Drives Distinct RL Outcomes
Haoze Wu,Cheng Wang,Wenshuo Zhao,Junxian He
Main category: cs.LG
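TL;DR: The paper shows that many counterintuitive phenomena reported for reinforcement learning (RL) on large language models (LLMs), such as reaching high performance from very few training examples, hold only when the pretrained model already has strong Model-Task Alignment with the task. In other regimes, standard RL methods remain the effective choice.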
Details
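Motivation: RL on LLMs exhibits many counterintuitive phenomena, but the conditions under which they hold are unclear. The paper investigates those conditions and their relation to model-task alignment.
Contribution: Identifies model-task alignment as a key factor shaping RL outcomes on LLMs and systematically tests the scope of the counterintuitive claims.
Method: Runs systematic experiments across model architectures and task domains, analysing how model-task alignment (measured by pass@k accuracy) relates to RL outcomes.
Result: Standard RL remains consistently robust across settings, whereas the counterintuitive results arise only when the pretrained model and the task are already strongly aligned.
Insight: Model-task alignment acts as an implicit condition that determines when these RL techniques apply, which offers guidance for future applications of RL to LLMs.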
Abstract: Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
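The abstract measures Model-Task Alignment by pass@k accuracy. A common way to estimate pass@k, sketched below, is the unbiased estimator that draws n samples per problem, counts the c correct ones, and computes the probability that a random subset of k samples contains at least one correct answer. Whether the authors use this exact estimator is an assumption; the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so every k-subset contains a correct one
    # One minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 4 samples and 1 correct, half of all 2-sample subsets include the correct one.
print(pass_at_k(4, 1, 2))
```

Under this reading, a pretrained model with high pass@k on a task already produces correct answers within a few attempts, which is the regime where the abstract says the counterintuitive RL results appear.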
[62] Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches
Israel Abebe Azime,Deborah D. Kanubala,Tejumade Afonja,Mario Fritz,Isabel Valera,Dietrich Klakow,Philipp Slusallek
Main category: cs.LG
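TL;DR: The paper evaluates the fairness and performance of large language models (LLMs) on loan approval, focusing on the effect of different table-to-text serialization approaches.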
Details
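Motivation: LLMs are increasingly used for high-stakes decisions such as loan approval, yet they struggle with tabular data and raise marked fairness concerns.
Contribution: 1. Evaluates LLM performance and fairness under zero-shot and in-context learning (ICL); 2. Finds that serialization format (such as GReat and LIFT) significantly affects performance and fairness; 3. Argues for more effective tabular data representations and fairness-aware models.
Method: Tests zero-shot and ICL settings on loan datasets from Ghana, Germany, and the United States, comparing serialization formats.
Result: ICL improves model performance by 4.9-59.6% relative to zero-shot baselines, but its effect on fairness varies by dataset. Some serialization formats raise F1 scores while widening fairness disparities.
Insight: The choice of serialization format is critical to both the fairness and the performance of LLMs, and methods for improving fairness need further study.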
Abstract: Large Language Models (LLMs) are increasingly employed in high-stakes decision-making tasks, such as loan approvals. While their applications expand across domains, LLMs struggle with processing tabular data, ensuring fairness, and delivering reliable predictions. In this work, we assess the performance and fairness of LLMs on serialized loan approval datasets from three geographically distinct regions: Ghana, Germany, and the United States. Our evaluation focuses on the model’s zero-shot and in-context learning (ICL) capabilities. Our results reveal that the choice of serialization format (serialization being the process of converting tabular data into text formats suitable for processing by LLMs) significantly affects both performance and fairness in LLMs, with certain formats such as GReat and LIFT yielding higher F1 scores but exacerbating fairness disparities. Notably, while ICL improved model performance by 4.9-59.6% relative to zero-shot baselines, its effect on fairness varied considerably across datasets. Our work underscores the importance of effective tabular data representation methods and fairness-aware models to improve the reliability of LLMs in financial decision-making.
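To illustrate what table-to-text serialization means in practice, the sketch below turns one table row into text in two generic styles and wraps it in a zero-shot or in-context prompt. These are not the exact GReat or LIFT templates evaluated in the paper; the column names, the prompt wording, and the function names are all hypothetical and serve only to show how the same row can reach the model in different textual forms.

```python
import json

def serialize_sentence(row):
    """Template style: 'column is value' clauses joined into one sentence."""
    return ", ".join(f"{k} is {v}" for k, v in row.items()) + "."

def serialize_json(row):
    """Structured style: the row as a compact JSON object with sorted keys."""
    return json.dumps(row, sort_keys=True)

def build_prompt(row, serializer, examples=()):
    """Zero-shot when examples is empty; in-context learning when labelled examples are given."""
    lines = ["Decide whether the loan should be approved. Answer 'approve' or 'deny'."]
    for ex_row, label in examples:
        lines.append(f"Applicant: {serializer(ex_row)}\nDecision: {label}")
    lines.append(f"Applicant: {serializer(row)}\nDecision:")
    return "\n\n".join(lines)

applicant = {"age": 35, "income": 42000}        # hypothetical columns
shots = [({"age": 52, "income": 18000}, "deny")]  # hypothetical labelled example
print(build_prompt(applicant, serialize_sentence, shots))
```

Swapping `serialize_sentence` for `serialize_json` changes nothing about the underlying data, only its textual form, which is the variable the paper reports as affecting both F1 and fairness.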
[63] Summarize-Exemplify-Reflect: Data-driven Insight Distillation Empowers LLMs for Few-shot Tabular Classification
Yifei Yuan,Jiatong Li,Weijia Zhang,Mohammad Aliannejadi,Evangelos Kanoulas,Renjun Hu
Main category: cs.LG
TL;DR: The paper proposes InsightTab, a framework that uses data-driven insight distillation to improve large language models (LLMs) on few-shot tabular classification.
Details
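Motivation: LLMs face challenges in few-shot tabular classification, mainly because of the variability of structured data. Inspired by human learning processes, the paper addresses this by distilling data into insights.
Contribution: Proposes InsightTab, which integrates rule summarization, strategic exemplification, and insight reflection through deep collaboration between LLMs and data modeling techniques, helping LLMs adapt to tabular tasks.
Method: Follows the principles of divide-and-conquer, easy-first, and reflective learning: rule summarization, exemplar selection, and reflection distil data insights that improve LLM classification.
Result: Experiments on nine datasets show that InsightTab consistently improves over state-of-the-art methods, and further analyses confirm its effectiveness in leveraging labeled data.
Insight: By imitating human learning processes, InsightTab gives LLMs an efficient way to use data-driven insights and improves classification performance in few-shot learning.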
Abstract: Recent studies show the promise of large language models (LLMs) for few-shot tabular classification but highlight challenges due to the variability in structured data. To address this, we propose distilling data into actionable insights to enable robust and effective classification by LLMs. Drawing inspiration from human learning processes, we introduce InsightTab, an insight distillation framework guided by principles of divide-and-conquer, easy-first, and reflective learning. Our approach integrates rule summarization, strategic exemplification, and insight reflection through deep collaboration between LLMs and data modeling techniques. The obtained insights enable LLMs to better align their general knowledge and capabilities with the particular requirements of specific tabular tasks. We extensively evaluate InsightTab on nine datasets. The results demonstrate consistent improvement over state-of-the-art methods. Ablation studies further validate the principle-guided distillation process, while analyses emphasize InsightTab’s effectiveness in leveraging labeled data and managing bias.
cs.AI [Back]
[64] Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
Vanessa Figueiredo
Main category: cs.AI
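TL;DR: The paper studies how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue, proposes a symbolic scaffolding mechanism paired with a short-term memory schema, and reports preliminary evidence that they improve model behavior.
Details
Motivation: To explore how inductive biases in architectural design can improve the cognitive behavior and reasoning of LLMs in instructional dialogue.
Contribution: Proposes a symbolic scaffolding mechanism and a short-term memory schema that strengthen structured reasoning in LLMs, and validates them experimentally.
Method: Combines symbolic scaffolding with a short-term memory schema, runs controlled ablations across five system variants, and analyzes model outputs with expert-designed rubrics.
Result: In preliminary results, the full system outperforms the baseline variants; removing memory or symbolic structure degrades key cognitive behaviors.
Insight: Architecture-level scaffolding can reliably shape the emergent instructional strategies of LLMs, supporting a processing-level account of how cognitive behaviors form.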
Abstract: We study how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding mechanism paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which architectural scaffolds can reliably shape emergent instructional strategies in LLMs.
[65] AHELM: A Holistic Evaluation of Audio-Language Models
Tony Lee,Haoqin Tu,Chi Heem Wong,Zijun Wang,Siwei Yang,Yifan Mai,Yuyin Zhou,Cihang Xie,Percy Liang
Main category: cs.AI
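TL;DR: The paper introduces AHELM, a benchmark for holistically evaluating audio-language models (ALMs) that addresses the lack of standardization and of multi-aspect coverage in current evaluations.
Details
Motivation: Existing ALM evaluations lack standardized benchmarks, and most measure only one or two capabilities, leaving out aspects such as fairness and safety.
Contribution: Proposes AHELM, which aggregates multiple datasets (including the new synthetic PARADE and CoRe-Bench) and standardizes the evaluation procedure to assess ALMs across 10 key aspects.
Method: AHELM aggregates multiple datasets (such as PARADE and CoRe-Bench) and standardizes prompts, inference parameters, and metrics to evaluate 14 ALMs across multiple aspects.
Result: Gemini 2.5 Pro ranks top in 5 of the 10 aspects but exhibits group unfairness (p=0.01) on ASR tasks; the baseline systems perform surprisingly well.
Insight: Standardized evaluation frameworks and multi-aspect assessment are essential to the development of ALMs; the performance of the baseline systems shows that simple systems remain competitive on certain tasks.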
Abstract: Evaluations of audio-language models (ALMs) – multimodal models that take interleaved audio and text as input and output text – are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets – including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering – to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
cs.HC [Back]
[66] Morae: Proactively Pausing UI Agents for User Choices
Yi-Hao Peng,Dingzeyu Li,Jeffrey P. Bigham,Amy Pavel
Main category: cs.HC
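TL;DR: Morae is a UI agent that proactively pauses at key decision points so users can make the choice themselves, increasing the agency of blind and low-vision users and outperforming baseline agents.
Details
Motivation: Current UI agents typically perform tasks end-to-end, without involving users in critical choices or sharing contextual information, which reduces user agency.
Contribution: Proposes Morae, a mixed-initiative UI agent that automatically identifies decision points during task execution and pauses so that users can choose.
Method: Uses large multimodal models to interpret user queries, UI code, and screenshots, and prompts users for clarification when a decision is needed.
Result: In a study on real-world web tasks, Morae helped users complete more tasks and select options that better matched their preferences than baseline agents, including OpenAI Operator.
Insight: The mixed-initiative approach balances the efficiency of automation with users' ability to express preferences, making UI agents more useful in practice.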
Abstract: User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.
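The sparkling-water example in the abstract above suggests what a decision point can look like: several options tie on the criterion the user stated but differ in other attributes. The Python sketch below is a rule-based stand-in for that one case only. Morae itself relies on large multimodal models, and the `Option` fields, function names, and message wording here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Option:
    name: str
    price: float
    flavor: str
    rating: float


def cheapest_candidates(options: Sequence[Option], tolerance: float = 0.0) -> List[Option]:
    """Return every option whose price is within `tolerance` of the minimum."""
    if not options:
        return []
    lowest = min(option.price for option in options)
    return [option for option in options if option.price - lowest <= tolerance]


def next_step(options: Sequence[Option]) -> str:
    """Pause when the stated criterion (cheapest) does not single out one option."""
    tied = cheapest_candidates(options)
    if not tied:
        return "PAUSE: no options found"
    if len(tied) > 1:
        described = "; ".join(f"{o.name} ({o.flavor}, rated {o.rating})" for o in tied)
        return f"PAUSE: {len(tied)} options share the lowest price: {described}"
    return f"PROCEED: selecting {tied[0].name}"


# Hypothetical product listing used only for illustration.
listing = [
    Option("A", 1.99, "lime", 4.5),
    Option("B", 1.99, "plain", 4.8),
    Option("C", 2.49, "berry", 4.9),
]
print(next_step(listing))
```

An end-to-end agent would silently pick one of the tied items; surfacing the tie together with the attributes that differ is what lets the user express a preference.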
cs.RO [Back]
[67] QuadKAN: KAN-Enhanced Quadruped Motion Control via End-to-End Reinforcement Learning
Allen Wang,Gavin Tao
Main category: cs.RO
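TL;DR: QuadKAN combines Kolmogorov-Arnold Networks (KANs) with reinforcement learning in a new approach to quadruped motion control, using a cross-modal policy to improve robustness and efficiency.
Details
Motivation: Existing approaches to vision-guided quadruped motion control do not adequately combine proprioception with vision, which makes control less robust.
Contribution: Proposes the QuadKAN framework, which uses KANs and a spline encoder to achieve efficient, low-jitter motion control and provides interpretable posture-action sensitivities.
Method: Combines KANs with a spline encoder and trains the cross-modal policy end-to-end with Multi-Modal Delay Randomization (MMDR) and Proximal Policy Optimization (PPO).
Result: Across diverse terrains and obstacle scenarios, QuadKAN outperforms existing methods, achieving higher returns, greater distances, and fewer collisions.
Insight: Spline-parameterized policies are well suited to the piecewise-smooth nature of gait and offer a new direction for robust vision-guided locomotion.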
Abstract: We address vision-guided quadruped motion control with reinforcement learning (RL) and highlight the necessity of combining proprioception with vision for robust control. We propose QuadKAN, a spline-parameterized cross-modal policy instantiated with Kolmogorov-Arnold Networks (KANs). The framework incorporates a spline encoder for proprioception and a spline fusion head for proprioception-vision inputs. This structured function class aligns the state-to-action mapping with the piecewise-smooth nature of gait, improving sample efficiency, reducing action jitter and energy consumption, and providing interpretable posture-action sensitivities. We adopt Multi-Modal Delay Randomization (MMDR) and perform end-to-end training with Proximal Policy Optimization (PPO). Evaluations across diverse terrains, including both even and uneven surfaces and scenarios with static or dynamic obstacles, demonstrate that QuadKAN achieves consistently higher returns, greater distances, and fewer collisions than state-of-the-art (SOTA) baselines. These results show that spline-parameterized policies offer a simple, effective, and interpretable alternative for robust vision-guided locomotion. A repository will be made available upon acceptance.
[68] Mini Autonomous Car Driving based on 3D Convolutional Neural Networks
Pablo Moraes,Monica Rodriguez,Kristofer S. Kappel,Hiago Sodre,Santiago Fernandez,Igor Nunes,Bruna Guterres,Ricardo Grando
Main category: cs.RO
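TL;DR: The paper proposes a control method for Mini Autonomous Cars (MACs) based on RGB-D information and 3D convolutional neural networks (3D CNNs), reports promising results compared with recurrent neural networks (RNNs) in simulation, and evaluates it using task completion success, lap time, and driving consistency.
Details
Motivation: The complexity and uncertainty of autonomous driving make model development challenging. Mini Autonomous Cars (MACs) offer a simplified, cost-effective testbed for quickly evaluating and comparing machine learning models, especially algorithms that require online training.
Contribution: Proposes a MAC autonomous driving method based on RGB-D data and 3D CNNs and compares it against RNNs in simulated environments.
Method: Uses RGB-D data and a 3D CNN, trains and tests the models on two simulated tracks with distinct environmental features, and evaluates them by task completion success, lap time, and driving consistency.
Result: The 3D CNN showed promising results compared with RNNs; the analysis of architectural modifications and track complexity shows how both affect generalization and control performance.
Insight: The study shows how architectural modifications and track complexity affect model generalization and vehicle control performance, offering a practical reference for algorithm design on mini autonomous cars.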
Abstract: Autonomous driving applications have become increasingly relevant in the automotive industry due to their potential to enhance vehicle safety, efficiency, and user experience, thereby meeting the growing demand for sophisticated driving assistance features. However, the development of reliable and trustworthy autonomous systems poses challenges such as high complexity, prolonged training periods, and intrinsic levels of uncertainty. Mini Autonomous Cars (MACs) are used as a practical testbed, enabling validation of autonomous control methodologies on small-scale setups. This simplified and cost-effective environment facilitates rapid evaluation and comparison of machine learning models, which is particularly useful for algorithms requiring online training. To address these challenges, this work presents a methodology based on RGB-D information and three-dimensional convolutional neural networks (3D CNNs) for MAC autonomous driving in simulated environments. We evaluate the proposed approach against recurrent neural networks (RNNs), with architectures trained and tested on two simulated tracks with distinct environmental features. Performance was assessed using task completion success, lap-time metrics, and driving consistency. Results highlight how architectural modifications and track complexity influence the models’ generalization capability and vehicle control performance. The proposed 3D CNN demonstrated promising results when compared with RNNs.
[69] The Rosario Dataset v2: Multimodal Dataset for Agricultural Robotics
Nicolas Soncini,Javier Cremona,Erica Vidal,Maximiliano García,Gastón Castro,Taihú Pire
Main category: cs.RO
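TL;DR: Rosario Dataset v2 is a multimodal agricultural robotics dataset that supports the development and benchmarking of localization, mapping, perception, and navigation algorithms, covering a range of sensor data and capturing the challenges of agricultural environments.
Details
Motivation: Robotics in agricultural environments faces challenges such as changing natural lighting and rough terrain, so a high-quality multimodal dataset is needed to drive algorithm development and testing.
Contribution: Provides a multimodal dataset with synchronized data from multiple sensors, supports the development and evaluation of SLAM algorithms for agricultural robotics, and demonstrates the limitations of existing methods.
Method: The data collection platform integrates multiple sensors (stereo infrared camera, GNSS, and others) with hardware synchronization and 6-DOF ground truth to capture the complexity of agricultural environments.
Result: The dataset exposes the limitations of existing SLAM methods in agricultural scenes and provides a reference point for improving them.
Insight: Multimodal datasets are essential to advancing agricultural robotics; existing SLAM methods still have room for improvement in complex agricultural scenes.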
Abstract: We present a multi-modal dataset collected in a soybean crop field, comprising over two hours of recorded data from sensors such as a stereo infrared camera, a color camera, an accelerometer, a gyroscope, a magnetometer, GNSS (Single Point Positioning, Real-Time Kinematic and Post-Processed Kinematic), and wheel odometry. This dataset captures key challenges inherent to robotics in agricultural environments, including variations in natural lighting, motion blur, rough terrain, and long, perceptually aliased sequences. By addressing these complexities, the dataset aims to support the development and benchmarking of advanced algorithms for localization, mapping, perception, and navigation in agricultural robotics. The platform and data collection system are designed to meet the key requirements for evaluating multi-modal SLAM systems, including hardware synchronization of sensors, 6-DOF ground truth and loops on long trajectories. We run multimodal state-of-the-art SLAM methods on the dataset, showcasing the existing limitations in their application in agricultural settings. The dataset and utilities to work with it are released at https://cifasis.github.io/rosariov2/.
cs.CY [Back]
[70] From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China
Weihuan Deng,Yaofu Huang,Luan Chen,Xun Li,Yao Yao
Main category: cs.CY
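TL;DR: The paper proposes a rural livability assessment framework based on drone imagery and multimodal large language models (MLLMs). Using an efficient image comparison mechanism and chain-of-thought prompting, it measures rural livability across China and characterizes its spatial heterogeneity and influencing factors.
Details
Motivation: As poverty alleviation and rural revitalization strategies deepen, rural livability has become an important indicator. Questionnaire-based methods are hard to scale, and urban-oriented visual perception methods perform poorly in rural scenes, so a rural-specific assessment method is needed.
Contribution: (1) Proposes a rural livability assessment framework based on drone imagery and MLLMs; (2) develops an efficient image comparison mechanism and chain-of-thought prompting that improve the rationality and reliability of the assessment; (3) characterizes the spatial heterogeneity of rural livability in China and its influencing factors.
Method: (1) Collects drone imagery of 1,766 villages using a top-down approach; (2) develops an efficient image comparison mechanism based on binary search interpolation; (3) builds chain-of-thought prompts from expert knowledge, assessing both living quality and ecological habitability.
Result: (1) Rural livability in China shows a dual-core-periphery spatial pattern centered on Sichuan and Zhejiang; (2) government fiscal expenditure is the core influencing factor, with each unit increase corresponding to a 3.9-4.9 unit increase in livability.
Insight: The method provides a scientific basis for rural construction policy-making, and the combination of drone imagery and MLLMs shows broad potential for assessment in rural settings.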
Abstract: With the deepening of poverty alleviation and rural revitalization strategies, improving the rural living environment and enhancing the quality of life have become key priorities. Rural livability is a key indicator for measuring the effectiveness of these efforts. Current measurement approaches face significant limitations, as questionnaire-based methods are difficult to scale, while urban-oriented visual perception methods are poorly suited for rural contexts. In this paper, a rural-specific livability assessment framework was proposed based on drone imagery and multimodal large language models (MLLMs). To comprehensively assess village livability, this study first used a top-down approach to collect large-scale drone imagery of 1,766 villages in 146 counties across China. In terms of the model framework, an efficient image comparison mechanism was developed, incorporating binary search interpolation to determine effective image pairs while reducing comparison iterations. Building on expert knowledge, a chain-of-thought prompting scheme suitable for nationwide rural livability measurement was constructed, considering both living quality and ecological habitability dimensions. This approach enhanced the rationality and reliability of the livability assessment. Finally, this study characterized the spatial heterogeneity of rural livability across China and thoroughly analyzed its influential factors. The results show that: (1) Rural livability in China demonstrates a dual-core-periphery spatial pattern, radiating outward from Sichuan and Zhejiang provinces with declining gradients; (2) Among various influential factors, government fiscal expenditure emerged as the core determinant, with each unit increase corresponding to a 3.9-4.9 unit enhancement in livability. The findings provide valuable insights for rural construction policy-making.
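One way to read the binary search idea in the abstract above is as binary insertion: each new image is placed into an already ranked list using roughly log2(n) pairwise comparisons, instead of being compared against every ranked image. The Python sketch below assumes that reading. `is_better` is a hypothetical stand-in for the MLLM's pairwise judgment, the scores are made up, and the paper's actual mechanism may differ.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def insert_by_comparison(
    ranking: List[T],
    item: T,
    is_better: Callable[[T, T], bool],
) -> int:
    """Insert `item` into `ranking` (ordered worst to best) in place.

    Returns the number of pairwise comparisons that were needed.
    """
    low, high = 0, len(ranking)
    comparisons = 0
    while low < high:
        middle = (low + high) // 2
        comparisons += 1
        if is_better(item, ranking[middle]):
            low = middle + 1   # item belongs above the middle element
        else:
            high = middle      # item belongs at or below the middle element
    ranking.insert(low, item)
    return comparisons


# Hypothetical hidden livability scores that drive the stand-in comparator.
hidden_scores: Dict[str, float] = {"v1": 0.2, "v2": 0.9, "v3": 0.5, "v4": 0.7, "v5": 0.1}


def stub_judge(a: str, b: str) -> bool:
    """Stand-in for asking the model which of two village images is more livable."""
    return hidden_scores[a] > hidden_scores[b]


village_ranking: List[str] = []
comparisons_used = sum(
    insert_by_comparison(village_ranking, village, stub_judge) for village in hidden_scores
)
print(village_ranking, comparisons_used)
```

With five images, exhaustive pairwise comparison needs 10 judgments; binary insertion needs fewer, and the gap widens as the number of villages grows, which matters when every comparison is a model call.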
physics.ed-ph [Back]
[71] From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics
Anna Geißler,Luca-Sophie Bien,Friedrich Schöppler,Tobias Hertel
Main category: physics.ed-ph
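TL;DR: The paper evaluates how suitable current large language models (LLMs) are for undergraduate thermodynamics education and finds that even the best-performing models fall short of the 95% competence threshold, with particularly weak performance on image reasoning tasks.
Details
Motivation: Given the potential of LLMs as tutoring aids in science education, the authors use thermodynamics, a subtle and demanding domain, to assess whether LLMs are ready for unsupervised use in instruction.
Contribution: Presents UTQA, a benchmark of 50 undergraduate thermodynamics questions for measuring how LLMs handle concepts such as reversibility and entropy.
Method: Uses the UTQA benchmark to analyze the accuracy of different LLMs on text-only questions and image reasoning tasks, and examines how prompt phrasing and syntactic complexity relate to performance.
Result: The best LLMs reached 82% accuracy, well below the 95% competence threshold; image reasoning tasks often fell to chance level, and prompt phrasing and syntactic complexity showed modest to little correlation with performance.
Insight: Current LLMs fall short mainly in finite-rate/irreversible scenarios and in binding visual features to thermodynamic meaning, which indicates that they are not yet suitable for unsupervised tutoring in this domain.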
Abstract: Large language models (LLMs) are increasingly considered as tutoring aids in science education. Yet their readiness for unsupervised use in undergraduate instruction remains uncertain, as reliable teaching requires more than fluent recall: it demands consistent, principle-grounded reasoning. Thermodynamics, with its compact laws and subtle distinctions between state and path functions, reversibility, and entropy, provides an ideal testbed for evaluating such capabilities. Here we present UTQA, a 50-item undergraduate thermodynamics question answering benchmark, covering ideal-gas processes, reversibility, and diagram interpretation. No leading 2025-era model exceeded our 95% competence threshold: the best LLMs achieved 82% accuracy, with text-only items performing better than image reasoning tasks, which often fell to chance levels. Prompt phrasing and syntactic complexity showed modest to little correlation with performance. The gap concentrates in finite-rate/irreversible scenarios and in binding visual features to thermodynamic meaning, indicating that current LLMs are not yet suitable for unsupervised tutoring in this domain.
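The distinction between state and path functions mentioned in the abstract above can be illustrated with a standard textbook case, which is not an item taken from UTQA: an ideal gas doubling its volume at constant temperature, either reversibly or by free expansion into vacuum. The Python sketch below uses the usual ideal-gas relations; the scenario, amounts, and function names are chosen here purely for illustration.

```python
import math

R = 8.314462618  # molar gas constant in J/(mol*K)


def gas_entropy_change(n_mol: float, v_initial: float, v_final: float) -> float:
    """Entropy change of an ideal gas between two states at equal temperature.

    Entropy is a state function, so this value does not depend on the path.
    """
    return n_mol * R * math.log(v_final / v_initial)


def reversible_isothermal_work(n_mol: float, temperature_k: float,
                               v_initial: float, v_final: float) -> float:
    """Work done by the gas along a reversible isothermal path (path dependent)."""
    return n_mol * R * temperature_k * math.log(v_final / v_initial)


def total_entropy_change(gas_ds: float, heat_absorbed: float, temperature_k: float) -> float:
    """Gas plus surroundings, with the surroundings supplying `heat_absorbed` at T."""
    return gas_ds - heat_absorbed / temperature_k


n, T, v1, v2 = 1.0, 300.0, 1.0, 2.0
ds_gas = gas_entropy_change(n, v1, v2)              # same for both paths, about 5.76 J/K
w_rev = reversible_isothermal_work(n, T, v1, v2)    # equals the heat absorbed at constant T
w_free = 0.0                                        # free expansion: no work, no heat

print(f"gas entropy change: {ds_gas:.3f} J/K on either path")
print(f"work: reversible {w_rev:.1f} J, free expansion {w_free:.1f} J")
print(f"total entropy change: reversible {total_entropy_change(ds_gas, w_rev, T):.3f} J/K, "
      f"free expansion {total_entropy_change(ds_gas, 0.0, T):.3f} J/K")
```

The gas ends in the same state either way, so its entropy change is identical, while the work and the total entropy change differ by path. Irreversible cases of this kind are where the abstract reports that the accuracy gap concentrates.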
cs.DB [Back]
[72] Database Normalization via Dual-LLM Self-Refinement
Eunjae Jo,Nakyung Lee,Gyuyeong Kim
Main category: cs.DB
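TL;DR: Miffie is a database normalization framework built on dual-LLM self-refinement: a generation module and a verification module cooperate to automate normalization with both high accuracy and cost efficiency.
Details
Motivation: Database normalization is usually done manually, which is time-consuming and error-prone. The authors aim to automate it efficiently and accurately using large language models.
Contribution: Proposes the Miffie framework, which automates normalization with a dual-LLM self-refinement architecture (generation module plus verification module) combined with task-specific zero-shot prompts.
Method: Uses a dual-model self-refinement architecture in which the generation module eliminates anomalies and the verification module provides feedback until the output satisfies the normalization requirement; task-specific zero-shot prompts are designed to improve accuracy and cost efficiency.
Result: Experiments show that Miffie can normalize complex database schemas while maintaining high accuracy.
Insight: A dual-LLM collaborative architecture can improve performance on automated tasks, and zero-shot prompt design can effectively guide model behavior, balancing accuracy and cost.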
Abstract: Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.
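The generate-then-verify loop described in the abstract above can be sketched in Python as follows. This is a minimal sketch of the control flow only. The two callables are hypothetical stand-ins for the generation and verification models, the round limit is an added assumption (the abstract only says the loop runs until the schema satisfies the requirement), and none of the names come from Miffie.

```python
from typing import Callable, Optional, Tuple

# (original schema, verifier feedback or None) -> candidate normalized schema
Generate = Callable[[str, Optional[str]], str]
# candidate schema -> (meets the normalization requirement?, feedback text)
Verify = Callable[[str], Tuple[bool, str]]


def self_refine(schema: str, generate: Generate, verify: Verify,
                max_rounds: int = 5) -> Tuple[str, int, bool]:
    """Alternate generation and verification until the verifier accepts.

    Returns (last candidate, rounds used, accepted flag). `max_rounds` is an
    assumed safeguard against a loop that never converges.
    """
    feedback: Optional[str] = None
    candidate = schema
    for round_number in range(1, max_rounds + 1):
        candidate = generate(schema, feedback)
        accepted, feedback = verify(candidate)
        if accepted:
            return candidate, round_number, True
    return candidate, max_rounds, False


# Toy stand-ins so the sketch runs without any model calls.
def toy_generate(schema: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return schema  # first attempt: returns the schema unchanged
    return "orders(order_id, customer_id); customers(customer_id, customer_name)"


def toy_verify(candidate: str) -> Tuple[bool, str]:
    if "customers(" in candidate:
        return True, ""
    return False, "customer_name depends on customer_id, not on the key order_id"


print(self_refine("orders(order_id, customer_id, customer_name)", toy_generate, toy_verify))
```

Keeping the verifier separate from the generator is what allows each role to use whichever model performs best at it, as the abstract describes.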